SemWebbing the London Gazette

Jeni Tennison (The Stationery Office)
John Sheridan (The Office of Public Sector Information)

14:00 Thursday, 8 May 2008
Open data Goldsmiths 3
Every day, across England and Wales, councils plan new roads and new towns, change where you can park and how much it costs you, grant planning permission for plants that might have toxic emissions, and make many other decisions that may affect the daily lives of many people. People affected by these decisions can appeal against them, and by law any such proposals must be advertised through notices in the London Gazette.

“Published by Authority” since 1665, the London Gazette is the UK Government’s Official Journal and Newspaper of Record. It was set up to provide King Charles II with authoritative news while he and his court were in Oxford avoiding the Great Plague, and it has been in operation ever since. A new issue is published every working day by The Stationery Office (TSO) under contract to the Office of Public Sector Information (OPSI), the part of government responsible for the Gazettes. Each edition contains in the region of 300-500 official notices. It is a uniquely authoritative record of what is happening in the country.

The Gazette is a core part of the nation’s “information infrastructure”: as essential to society’s operation as the road network and the national grid. But the information held in the Gazette is currently mostly locked up in human-readable prose, in a paper version and as PDFs and (for modern notices) HTML on the Gazette website at http://web.archive.org/web/20080405021345/http://www.london-gazette.gov.uk/.

In 2007 the government commissioned Steinberg & Mayo to conduct the independent Power of Information Review, to advise how best to respond to current trends on the web, such as social networking and data mashing. A key point was the recognition that a government website is not always the most effective place to provide information to the citizen. Better that the information is where the users are, which means enabling its reuse by others on the web. For example, food hygiene inspection reports have greater impact on restaurant review websites than when they’re hidden on the local Council website.

Into this environment steps The London Gazette, which provides an ideal way to publish semantically enabled official information for reuse, because of its reliability and status. SemWebbing the Gazette lays the foundation for a new official publishing strategy for the government: whenever legislation says that information must be published in the London Gazette, it will, in effect, be ensuring that the information is made publicly available in a consistent way and in a reusable form. This is an exciting new role for the grand old London Gazette.

To realise this vision, the challenge is to add semantics to the information in the London Gazette in a way that facilitates its reuse, allowing others to go on and construct useful web applications using Gazette data.

We chose to do that using RDFa because:

lowly screen-scraping applications benefit from a well-specified and extensible method for embedding semantics
more advanced Semantic Web applications benefit from the full power of RDF triples
browser extensions such as Operator and Piggy Bank are able to expose the data to users who are simply browsing the website
it provides a close link between semantic information and the provenance of that information
it should be relatively easy to integrate into an existing website (a sketch of what such embedded markup might look like follows this list)
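
To make this concrete, here is a minimal sketch of what an RDFa-enabled notice fragment might look like. The notice URI, the gaz: namespace and the property names are illustrative placeholders only, not the Gazette's actual URL scheme or ontology.

  <!-- Illustrative sketch only: the example.org URIs and the gaz:
       vocabulary are placeholders, not the Gazette's actual identifiers. -->
  <div xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
       xmlns:gaz="http://example.org/gazette-ontology#"
       about="http://example.org/notices/1234567"
       typeof="gaz:Notice">
    <h2 property="dc:title">Notice of application for planning permission</h2>
    <p>Published on
      <span property="dc:date" content="2008-05-08"
            datatype="xsd:date">8 May 2008</span>
      by <span property="dc:publisher">The Stationery Office</span>.
    </p>
  </div>

An RDFa processor would extract triples such as "this notice has this title" and "this notice was published on 2008-05-08" from the same markup that the browser renders, which is what makes the data usable both by simple screen-scrapers and by full Semantic Web applications.
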
Our experience is that adding RDFa to an existing website is not as straightforward as might be hoped. The particular difficulties that we face are:

The current Gazette website does not follow best practice for URL design (there is no single URL that identifies a particular notice, for example), so we have had to create a new URL scheme to provide identifiers that could be used in RDF triples. This scheme then has to be integrated into the existing website.
As with many modern websites, the web page for a given notice is a patchwork of HTML authored directly by humans, created by ASP code, and generated from XML using XSLT. We had to ensure that the result was valid XHTML (with embedded RDFa). This requires changes not only to the code but also to the support given to the users who hand-author content for the site.
The notices themselves are represented using two markup languages: a legacy presentation-oriented markup language for older notices and a semantic-oriented markup language used for more recent notices. However, the latter was mostly designed around a subset of notices that are fairly regular and well-structured, whereas the ones that we want to mark up are extremely irregular and loosely structured. This leads to the following problems, which necessitate changing the markup language we are using:
While the markup language allows things like addresses and people’s names to be marked up inline, the content of these structures is fairly strict and doesn’t reflect the structures used in natural language notices; for example, multiple streets may be listed before a town appears in a sentence.
Individual phrases such as a date and an address can be marked up inline, but there’s no support for linking them together (for example to indicate that there is a meeting on that date at that address); the sketch after this list shows the kind of linking we need.
Even with support for using the markup language in the authoring stages, those creating the notices simply do not bother marking up the important structures, and even where they are marked up, the content doesn’t come from a fixed vocabulary. We explored several methods of encouraging users to make use of inline markup, but eventually decided that post-hoc automated markup of the notices was the most reliable method.
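
To illustrate the linking problem mentioned above, the following sketch shows one way RDFa lets a date and an address be attached to a single meeting resource rather than left as unrelated inline phrases. Again, the gaz: vocabulary and the notice URI are hypothetical placeholders rather than our actual ontology.

  <!-- Hypothetical vocabulary and URIs, for illustration only. -->
  <p xmlns:gaz="http://example.org/gazette-ontology#"
     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
     about="http://example.org/notices/1234567"
     rel="gaz:hasMeeting">
    A public meeting will be held
    <span typeof="gaz:Meeting">on
      <span property="gaz:date" content="2008-06-02"
            datatype="xsd:date">2 June 2008</span>
      at <span property="gaz:venue">the Town Hall, Exeter</span></span>,
    at which objections may be raised.
  </p>

A processor reading this markup creates a meeting resource of its own, linked from the notice by gaz:hasMeeting and carrying both the date and the venue, which is exactly the kind of relationship that purely inline markup cannot express.
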
These issues, and the methods we used to overcome them, will be explored in more detail in our paper. We will also describe the ontology that we’re using and illustrate how we envision developers using the data exposed in the Gazette to create innovative web applications.