Exploring linked open data for development with a Young Lives dataset…
For the upcoming IKM Workshop on Linked Data, taking place in November in Oxford, I’ve been exploring the process and possibilities of putting a development-focussed social science dataset online as linked open data.
What is linked open data?
We can work through a definition backwards:
Data (at least for our purposes) is a series of facts recorded in a structured way. The dataset I’ve been working with in this project is the set of records from the Peru instance of the 2010 Health Survey of the Young Lives longitudinal study. Essentially, it’s a large table of around 650 young people’s responses to a number of health-related questions.
Open. The openness of data has three parts: accessibility (can you access the data easily, generally over the Internet); license (are you allowed by the terms and conditions or license to access and use the data); format (is the data in an ‘open’ format so you can read and work with it without needing proprietary software).
Linked. The ‘four rules of linked data’ put forward by Tim Berners-Lee set out an approach to publishing data which makes connections between different datasets, generally linking them up across the Internet. For example, instead of simply stating in our dataset that the information is about “Peru”, where Peru is just a word in a table, we can use the URL http://ontologi.es/place/PE (PE being an ISO standard code for Peru) to identify Peru – giving a shared ‘key’ that could be used to link our data about Peru to data from other people who have used the same key. The vision of linked data goes a step further, however, suggesting that those shared identifiers (e.g. http://ontologi.es/place/PE) should return information about the thing they identify, ideally as ‘data’, and as ‘data’ that makes further links to other sources of data.
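To make the idea of a shared ‘key’ concrete, here is a minimal sketch in Python. The two small tables are invented for illustration (only the Peru URL and the rough count of 650 respondents come from this post, and the example.org identifier is a placeholder), so treat it as a toy rather than anything the project actually runs.

```python
# The identifier for Peru used in this post.
PERU = "http://ontologi.es/place/PE"

# Two hypothetical datasets, published by different people, that both
# happen to use the same URL to identify Peru.
survey_counts = {PERU: {"respondents": 650}}  # approximate figure from the post
country_names = {
    PERU: {"name": "Peru"},
    "http://example.org/place/XX": {"name": "Somewhere else"},  # placeholder entry
}


def link_on_key(left, right):
    """Merge the records of two datasets wherever they share an identifier."""
    shared = left.keys() & right.keys()
    return {key: {**left[key], **right[key]} for key in shared}


linked = link_on_key(survey_counts, country_names)
print(linked)
```

Because both tables use the same identifier rather than the bare word “Peru”, no guessing about spelling or language is needed to join them.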
In fact, if you go to http://ontologi.es/place/PE you won’t find anything for human readers at all. Instead, you’ll get a downloadable file that looks something like this. You can see that it gives a name for the thing (the ‘resource’) being identified (between the tags), and it lists ‘parts’ of Peru – regions and sub-areas. It also, towards the bottom of the list, tells us that the identifier http://ontologi.es/place/PE is identical to the resource http://www.geonames.org/countries/#PE so we could go and look up some data from there too.
The way this linked data is represented uses RDF – ‘Resource Description Framework’. RDF is not a single file format (there are many different ways of writing an RDF document, just as there are many different file formats and programmes for creating a spreadsheet), but it aims to create ‘self-describing data’ and it allows everything in a dataset to be annotated and linked together in different ways. For example, in a spreadsheet, if I want to tell you what a column heading means (e.g. ‘SMKPRNR3’), I need to write additional documentation. By contrast, in an RDF document I can annotate the ‘SMKPRNR3’ property with a label, a comment and links to other sources of information about that variable.
Linked Open Data
Putting all these elements together we get an approach to publishing data on the Internet that makes data accessible, tries to make connections between different datasets, and allows other people to access and re-use the data that has been shared.
What have we done with the Young Lives Data?
Using one small sub-set of the Young Lives data (the 2010 Health Survey data from Peru) I’ve:
- Created descriptions of all the variables in RDF format;
- Converted the young people’s responses into RDF format;
- Loaded all this data into a server which allows queries to be run against it and for the data to be explored;
- Added extra annotations to some of the questions;
- Created some summary statistics from the data and recorded those in RDF format as well;
- Looked for comparable statistics in open data, and, finding none, converted a few comparison statistics into RDF also.
The Young Lives data is social science research data. This is quite different from much of the linked open data that is already out there. The linked open data cloud diagram shows many sources of linked open data available right now – and you will see that much of the available data is ‘data about things’ (places, events, music, publications etc.), rather than research data. It’s also very UK and US-centric, with very little (if any) development sector or global data available.
This has meant having to think about how to represent the Young Lives data and how best to get value out of creating links with other datasets.
We’ve not yet addressed questions of ‘license’ openness with this data demonstrator.
What was involved?
Choosing ontologies and schema
One of the biggest tasks so far has been choosing how to represent the data. Ideally we want:
- To use shared identifiers and terms in representing the data (shared ontologies);
- To structure the data using shared conventions to ease comparison with other data (shared schema);
- To make the data as self-describing as possible;
- To make the data easy to query;
- To use URLs for resources in the data which (a) could provide more information when looked up (dereferenced); and (b) are re-useable across different datasets and projects.
Some of the most widely used conventions include RDFS (RDF Schema), which provides properties for our data such as rdfs:label and rdfs:comment. So, for example, whenever we represent a question in the data, we give it an rdfs:label property of the question name. You can see this below, where I’ve created the identifier http://data.younglives.org.uk/variables/CMBRFRR3 to refer to a particular variable, and then I’ve given it a label using the label property from RDFS.
<http://data.younglives.org.uk/variables/CMBRFRR3> rdfs:label "Compared to your brothers you have less freedom to leave the house when you want"@en .
(http://data.younglives.org.uk/variables/CMBRFRR3 doesn’t actually return data at the moment, as our demonstration server isn’t running there, so we’re breaking the rule of linked data about providing ‘dereferenceable URIs’ that allow you to look up more data – but it’s not ‘illegal’ to use arbitrary URLs as identifiers even if the webserver they point at isn’t returning data.)
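Labelling every variable by hand would be tedious, so statements like the one above are better generated. The following is a small Python sketch of that idea, not the script used in this project: the function names are my own, and it writes predicates as full URLs (which is valid Turtle and N-Triples) so that no prefix declarations are needed. It only escapes backslashes and double quotes, which is enough for single-line labels.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
VARIABLES = "http://data.younglives.org.uk/variables/"


def escape_literal(text):
    """Escape backslashes and double quotes for use inside a quoted literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def describe_variable(code, label, lang="en"):
    """Return one statement that gives a survey variable an rdfs:label."""
    return f'<{VARIABLES}{code}> <{RDFS}label> "{escape_literal(label)}"@{lang} .'


statement = describe_variable(
    "CMBRFRR3",
    "Compared to your brothers you have less freedom to leave the house when you want",
)
print(statement)
```

Run over a whole codebook, a function like this would give every variable a human-readable label that travels with the data.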
Many of the responses in the Young Lives survey data involve young people picking an answer from a list. These lists of responses (code lists) can be represented using SKOS – ‘Simple Knowledge Organization System’.
SKOS is widely used to maintain knowledge bases. Right now we’re creating our own lists of concepts (apart from for gender, where we re-use a list from the Statistical Data and Metadata Exchange standard – SDMX). But if someone was maintaining a list of relevant concepts online we could simply point to and re-use their concepts for some questions. We could also use the ‘OWL’ (Web Ontology Language) term ‘sameAs’ to tell any computers or humans who understand OWL that our concept list is the same as someone else’s.
Given many questions in the Young Lives dataset were drawn from other studies – we can imagine an eco-system of re-usable survey concepts and constructs that would allow humans and machines to more easily find and interrogate comparable data.
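As a rough illustration of how re-usable concepts might work in practice, the Python sketch below maps raw survey codes onto concept identifiers and then follows ‘sameAs’ links to a shared identifier. All of the URLs and codes in it are hypothetical, and treating sameAs as a one-way lookup is a simplification (in OWL the relationship is symmetric).

```python
# Hypothetical code list: raw answer codes -> our own concept identifiers.
YES_NO = {
    "1": "http://data.younglives.org.uk/concepts/yes",  # assumed URL pattern
    "0": "http://data.younglives.org.uk/concepts/no",
}

# Hypothetical sameAs statements pointing at someone else's concept list.
SAME_AS = {
    "http://data.younglives.org.uk/concepts/yes": "http://example.org/survey-concepts/yes",
}


def canonical(uri, same_as):
    """Follow sameAs links until none is left, guarding against loops."""
    seen = set()
    while uri in same_as and uri not in seen:
        seen.add(uri)
        uri = same_as[uri]
    return uri


def concept_for(code, code_list, same_as):
    """Translate a raw survey code into the most widely shared identifier."""
    return canonical(code_list[code], same_as)


print(concept_for("1", YES_NO, SAME_AS))
print(concept_for("0", YES_NO, SAME_AS))
```

Two surveys that resolve their answers to the same shared concept can then be compared without anyone having to reconcile their private coding schemes by hand.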
DDI and SDMX
There are in fact many efforts already to make survey and statistical data more easily comparable – and to set standards for exchanging data, and, importantly, the meta-data that goes with it describing how it was sourced, collected, manipulated etc.
The two main standards are SDMX, primarily for large-scale statistical data and time-series, and DDI (Data Documentation Initiative). These are both XML based standards (rather than linked-data and RDF standards), but efforts are underway to create RDF representations of both, with the SDMX efforts far more advanced so far.
SDMX has a number of useful ‘concepts’ (such as a shared concept for gender), and defines a set of ‘dimensions’ on which data might be analysed (e.g. time period; area covered), so when we define the data structure for our aggregate statistics, we make links with these parts of SDMX.
For example, the fragment below is part of a data structure definition of summary statistics on smoking prevalence (the prefix statements show all the different vocabularies being used).
@prefix yls: .
@prefix sdmx: .
@prefix sdmx-dimension: .
@prefix sdmx-concept: .
@prefix geo: .
@prefix rdfs: .
@prefix qb: .
yls:refArea a rdf:Property, qb:DimensionProperty;
    rdfs:label "Area statistic refers to"@en;
    rdfs:subPropertyOf sdmx-dimension:refArea;
    rdfs:range geo:Country;
    qb:concept sdmx-concept:refArea .
Representing statistical data turns out not to be entirely straightforward (if you want to see how tricky it can be, just look at a government spreadsheet and take note of all the footnotes and annotations needed to properly describe what the data is saying). You need to find a structure which allows almost everything to be annotated, whilst keeping the data as simple and easy to query as possible (i.e. you want to avoid very deep ‘tree’ structures where the final value you read is nested inside many layers of explanation and annotation).
The RDF data cube vocabulary is currently under development, but seemed to be a fairly good starting point for modelling the Young Lives data, particularly the summary data.
For the individual survey responses (micro-data) each question answer is recorded as a ‘measure’ against an individual data cube ‘Observation’.
For the aggregate data we have generated, a more conventional data cube has been created, consisting of a number of ‘dimensions’ (location, age, gender) and then a measure (e.g. smoking prevalence), along with a data structure definition.
You can see in the fragment below that our observation comes from the ‘health2010-younglives-smoking’ dataset, which has the ‘dsd-smokingStats’ data structure definition.
yls:smoking-PE-2010-14-Female a qb:Observation;
    qb:dataSet yld:health2010-younglives-smoking ;
    yls:refAge "14" ;
    yls:refArea geo:PE ;
    yls:refPeriod ns0:2010 ;
    yls:smokingPrevalence "0.11" ;
    sdmx-dimension:sex sdmx-code:sex-F.

yls:health-2010-statistics-smoking a qb:DataSet;
    qb:structure yls:dsd-smokingStats .
If we could find comparable data with a similar data structure definition (or, ideally, an identical data structure definition), then making comparisons between these two datasets becomes a lot easier.
By publishing our data structure definition we also make it available for others to re-use.
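To show how the micro-data and the aggregate cube relate, here is a short Python sketch that turns individual responses into one prevalence figure per combination of dimensions. The rows are invented purely for illustration (they are not Young Lives results), and the field names simply echo the dimension names used in the fragments above.

```python
from collections import defaultdict

# Invented micro-data rows: (area, age, sex, smokes). Not real survey results.
responses = [
    ("PE", 14, "F", True),
    ("PE", 14, "F", False),
    ("PE", 14, "F", False),
    ("PE", 14, "F", False),
    ("PE", 14, "M", True),
    ("PE", 14, "M", False),
]


def smoking_prevalence(rows):
    """Return {(area, age, sex): share of respondents who smoke}."""
    counts = defaultdict(lambda: [0, 0])  # dimensions -> [smokers, total]
    for area, age, sex, smokes in rows:
        cell = counts[(area, age, sex)]
        cell[0] += int(smokes)
        cell[1] += 1
    return {dims: smokers / total for dims, (smokers, total) in counts.items()}


cube = smoking_prevalence(responses)
for (area, age, sex), value in sorted(cube.items()):
    print(f"refArea={area} refAge={age} sex={sex} smokingPrevalence={value:.2f}")
```

Each printed line corresponds to one observation in the cube: a value for the measure, located by its dimensions.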
The right representation?
I’ve been on a steep learning curve whilst creating this representation of the data – so there are probably many flaws and things that could be improved. I’ll be continuing to develop the data model over the coming weeks in the run up to November’s workshop.
Converting the data
The process of finding a way to represent the data was an iterative one – and one that involved a lot of searching, researching and, essentially, looking at what other people had done and at other data resources we might want to link to, in order to select the most appropriate ontologies and structures of data to use.
For the actual conversion of the question descriptions and data I turned to the RAP RDF libraries for PHP, which make it relatively straightforward to write scripts which will convert our data.
In the future more tools for conversion into RDF may be available. For example, Google Refine is developing as a platform which might make authoring RDF easier – but for now the approach was pretty manual.
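The conversion itself was done with RAP in PHP, but the basic shape of such a script is easy to sketch in any language. The Python below is an illustration only, not the project's code: the CHILDID column, the sample rows and the ‘responses/’ URL pattern are all assumptions, and every answer is written as a plain string literal rather than being linked to a code list.

```python
import csv
import io

BASE = "http://data.younglives.org.uk/"  # the URL paths built below are assumed

# Invented sample in the shape of a survey export: one row per young person.
SAMPLE = "CHILDID,CMBRFRR3\nPE0001,1\nPE0002,2\n"


def rows_to_ntriples(csv_text, id_column="CHILDID"):
    """Convert each non-empty cell into one N-Triples statement."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}responses/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the identifier itself and missing answers
            predicate = f"<{BASE}variables/{column}>"
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} {predicate} "{literal}" .')
    return lines


triples = rows_to_ntriples(SAMPLE)
print("\n".join(triples))
```

A fuller version would replace the string literals with links to concepts from the relevant code list, which is where most of the manual effort went.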
Displaying the data
Once the data was in RDF format, I needed to make it available to query.
This can be done by just providing the RDF files on a web server for people to download and explore in their own software. However, to make the data query-able across the Internet we needed a SPARQL server.
I’ve made use of Virtuoso, as there was good documentation on how to set it up as a data server to interoperate with OntoWiki as a front-end. OntoWiki is a wiki-like interface for browsing RDF data and editing it – adding new properties to RDF resources, or easily importing new data from a web-based interface.
The demonstration server is a temporary virtual server, so may not be available all the time, but at practicalparticipation.dyndns.org:8890/sparql you should find a SPARQL endpoint which can be used to query the data (which is in the http://data.younglives.org.uk/ graph)*, and at http://practicalparticipation.dyndns.org/ontowiki is an interface for browsing the Young Lives dataset.
*I’ll explain what this all means in a later post.
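For anyone who wants to try the endpoint from a script, the Python sketch below builds a request that asks for ten variables and their labels. It assumes the endpoint is reachable over plain http and follows the standard SPARQL protocol (a GET request with a ‘query’ parameter); the query itself is just an example. It only constructs the request, since the demonstration server may not be running.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://practicalparticipation.dyndns.org:8890/sparql"  # http:// is assumed
GRAPH = "http://data.younglives.org.uk/"

# Example query: list ten things in the graph that have an rdfs:label.
QUERY = f"""
SELECT ?variable ?label
FROM <{GRAPH}>
WHERE {{ ?variable <http://www.w3.org/2000/01/rdf-schema#label> ?label }}
LIMIT 10
""".strip()


def build_request(endpoint, query):
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
# To actually send it (only works while the server is up):
#   import urllib.request; urllib.request.urlopen(request).read()
```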
So, the data is converted.
It’s represented in a way that makes use of links between different ontologies for describing data.
But, right now, there’s not much related data available in standard formats to link to.
So the next step is to explore the possible linkages to other datasets more – and to build some visualisations on top of the data that demonstrate what is possible when we can draw in linked resources more effectively.
As Richard Cyganiak puts it in a recent review of open government data (slide 32), triplification alone isn’t very useful – it’s what we do with it…
Questions, suggestions and ideas welcome…