Exploring linked open data for development with a Young Lives dataset…


For the upcoming IKM Workshop on Linked Data, taking place in November in Oxford, I’ve been exploring the process and possibilities of putting a development-focussed social science dataset online as linked open data.

What is linked open data?

We can work through a definition backwards:

Data (at least for our purposes) is a series of facts recorded in a structured way. The dataset I’ve been working with in this project consists of records from the Peru instance of the 2010 Health Survey of the Young Lives longitudinal study. Essentially, it’s a large table of around 650 young people’s responses to a number of health-related questions.

Open. The openness of data has three parts: accessibility (can you access the data easily, generally over the Internet); license (are you allowed by the terms and conditions or license to access and use the data); format (is the data in an ‘open’ format so you can read and work with it without needing proprietary software).

Linked. The ‘four rules of linked data’ put forward by Tim Berners-Lee set out an approach to publishing data which makes connections between different datasets, generally linking them up across the Internet. For example, instead of simply stating in our dataset that the information is about “Peru”, where Peru is just a word in a table, we can use the URL http://ontologi.es/place/PE (PE being an ISO standard code for Peru) to identify Peru – giving a shared ‘key’ that could be used to link our data about Peru to data from other people who have used the same key. The vision of linked data goes a step further, however, suggesting that those shared identifiers (e.g. http://ontologi.es/place/PE) should return information about the thing they identify, ideally as ‘data’, and as ‘data’ that makes further links to other sources of data.

In fact, if you go to http://ontologi.es/place/PE you won’t find anything for human readers at all. Instead, you’ll get a downloadable file that looks something like this. You can see that it gives a name for the thing (the ‘resource’) being identified (between the tags), and it lists ‘parts’ of Peru – regions and sub-areas. It also, towards the bottom of the list, tells us that the identifier http://ontologi.es/place/PE is identical to the resource http://www.geonames.org/countries/#PE so we could go and look up some data from there too.
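To make the ‘shared key’ idea a little more concrete, here is a minimal Python sketch. Only the identifier http://ontologi.es/place/PE comes from the text above; the two small tables, the second place URL and every value in them are invented for illustration (the respondent count is just the approximate figure mentioned earlier).

```python
# Hypothetical illustration: two independently published tables that both
# use the same URL as their identifier for Peru.
PERU = "http://ontologi.es/place/PE"

# Invented example rows; these are not real Young Lives figures.
our_data = [
    {"place": PERU, "survey": "health-2010", "respondents": 650},
]
other_data = [
    {"place": PERU, "indicator": "example-indicator", "value": 1.0},
    # Placeholder identifier for some other place, made up for this sketch.
    {"place": "http://ontologi.es/place/XX", "indicator": "example-indicator", "value": 2.0},
]


def link_on_shared_key(left, right, key="place"):
    """Pair up rows from two tables wherever the identifier matches exactly."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [(l, r) for l in left for r in index.get(l[key], [])]


linked = link_on_shared_key(our_data, other_data)
for ours, theirs in linked:
    print(ours["survey"], "<->", theirs["indicator"], "via", ours["place"])
```

Because both tables use exactly the same URL, no guessing about spellings such as “Peru” versus “Perú” is needed to join them.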

The way this linked data is represented uses RDF – the ‘Resource Description Framework’. RDF is not a single file format (there are many different ways of writing an RDF document, just as there are many different file formats and programmes for creating a spreadsheet), but it aims to create ‘self-describing data’ and it allows everything in a dataset to be annotated and linked together in different ways. For example, in a spreadsheet, if I want to tell you what a column heading means (e.g. ‘SMKPRNR3’), I need to write additional documentation. By contrast, in an RDF document I can annotate the ‘SMKPRNR3’ property with a label, comment and links to other sources of information about that variable.
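As a rough sketch of what ‘self-describing’ means, the Python below holds a few annotations as subject, predicate, object triples. The rdfs:label, rdfs:comment and rdfs:seeAlso property names are real RDFS terms; the variable URL pattern, the label and comment text, and the linked codebook address are all placeholders I have assumed for illustration.

```python
# A minimal sketch of 'self-describing' data as subject-predicate-object triples.
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Assumed URL pattern for the variable; not confirmed for SMKPRNR3.
VARIABLE = "http://data.younglives.org.uk/variables/SMKPRNR3"

triples = [
    (VARIABLE, RDFS + "label", "Placeholder label for SMKPRNR3"),           # hypothetical text
    (VARIABLE, RDFS + "comment", "Placeholder note on how it was asked"),   # hypothetical text
    (VARIABLE, RDFS + "seeAlso", "http://example.org/codebook#SMKPRNR3"),   # hypothetical link
]


def describe(resource, graph):
    """Collect every annotation attached to one resource."""
    return {predicate: obj for subject, predicate, obj in graph if subject == resource}


annotations = describe(VARIABLE, triples)
print(annotations[RDFS + "label"])
```

The documentation travels with the data: anyone who receives the triples can look up what the column means without a separate codebook.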

Linked Open Data
Putting all these elements together we get an approach to publishing data on the Internet that makes data accessible, tries to make connections between different datasets, and allows other people to access and re-use the data that has been shared.

What have we done with the Young Lives Data?

Using one small sub-set of the Young Lives data (the 2010 Health Survey data from Peru) I’ve:

  • Created descriptions of all the variables in RDF format;
  • Converted the young people’s responses into RDF format;
  • Loaded all this data into a server which allows queries to be run against it and for the data to be explored;
  • Added extra annotations to some of the questions;
  • Created some summary statistics from the data and recorded those in RDF format as well;
  • Looked for comparable statistics in open data, and, finding none, converted a few comparison statistics into RDF also;

The Young Lives data is social science research data. This is quite different from much of the linked open data that is already out there. The linked open data cloud diagram shows many sources of linked open data available right now – and you will see that much of the available data is ‘data about things’ (places, events, music, publications etc.), rather than research data. It’s also very UK and US-centric, with very little (if any) development sector or global data available.

This has meant having to think about how to represent the Young Lives data and how best to get value out of creating links with other datasets.

We’ve not yet addressed questions of ‘license’ openness with this data demonstrator.

What was involved?

Choosing ontologies and schema
One of the biggest tasks so far has been choosing how to represent the data. Ideally we want:

  • To use shared identifiers and terms in representing the data (shared ontologies)
  • To structure the data using shared conventions to ease comparison with other data (shared schema);
  • To make the data as self-describing as possible; and make the data easy to query
  • To use URLs for resources in the data which (a) could provide more information when looked up (dereferenced); and (b) are re-useable across different datasets and projects.

RDFS
Some of the most widely used conventions include RDFS (RDF Schema), which provides properties for our data such as rdfs:label and rdfs:comment. So, for example, whenever we represent a question in the data, we give it an rdfs:label property of the question name. You can see this below, where I’ve created the identifier http://data.younglives.org.uk/variables/CMBRFRR3 to refer to a particular variable, and then I’ve given it a label using the label property from RDFS.

<http://data.younglives.org.uk/variables/CMBRFRR3> rdfs:label "Compared to your brothers you have less freedom to leave the house when you want"@en .

(http://data.younglives.org.uk/variables/CMBRFRR3 doesn’t actually return data right now, as our demonstration server isn’t running there, so we’re breaking the rule of linked data about providing ‘dereferenceable URIs’ that allow you to look up more data – but it’s not ‘illegal’ to use arbitrary URLs as identifiers even if the webserver they point at isn’t returning data.)

SKOS
Many of the responses in the Young Lives survey data involve young people picking an answer from a list. These lists of responses (code lists) can be represented using SKOS – the ‘Simple Knowledge Organization System’.

SKOS is widely used to maintain knowledge bases. Right now we’re creating our own lists of concepts (apart from gender, where we re-use a list from the Statistical Data and Metadata Exchange standard – SDMX). But if someone was maintaining a list of relevant concepts online we could simply point to and re-use their concepts for some questions. We could also use the ‘OWL’ (Web Ontology Language) term ‘sameAs’ to tell any computers or humans who understand OWL that our concept list is the same as someone else’s.

Given many questions in the Young Lives dataset were drawn from other studies – we can imagine an eco-system of re-usable survey concepts and constructs that would allow humans and machines to more easily find and interrogate comparable data.
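The code-list idea above can be sketched in a few lines of Python. The SKOS and OWL property names (skos:prefLabel, skos:notation, owl:sameAs) are real; the concept URLs, the codes and the ‘Agree’ and ‘Disagree’ labels are hypothetical and are not taken from the Young Lives questionnaire.

```python
# A minimal sketch of a code list held as triples. The concept URLs, codes and
# labels are hypothetical; only the SKOS and OWL property names are real.
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
SKOS_NOTATION = "http://www.w3.org/2004/02/skos/core#notation"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

OURS = "http://example.org/younglives/codelist/agreement/"  # hypothetical
THEIRS = "http://example.net/survey-concepts/"              # hypothetical

code_list = [
    (OURS + "agree", SKOS_NOTATION, "1"),
    (OURS + "agree", SKOS_PREF_LABEL, "Agree"),
    (OURS + "disagree", SKOS_NOTATION, "2"),
    (OURS + "disagree", SKOS_PREF_LABEL, "Disagree"),
    # States that our concept and someone else's concept are the same thing.
    (OURS + "agree", OWL_SAME_AS, THEIRS + "agree"),
]


def concept_for_code(code, graph):
    """Find the concept whose notation matches a raw survey code."""
    for subject, predicate, obj in graph:
        if predicate == SKOS_NOTATION and obj == code:
            return subject
    return None


def label_for(concept, graph):
    """Return the preferred label of a concept, if it has one."""
    for subject, predicate, obj in graph:
        if subject == concept and predicate == SKOS_PREF_LABEL:
            return obj
    return None


def same_as(concept, graph):
    """List the external concepts declared identical to this one."""
    return [o for s, p, o in graph if s == concept and p == OWL_SAME_AS]


concept = concept_for_code("1", code_list)
print(label_for(concept, code_list), same_as(concept, code_list))
```

A raw answer of “1” in the survey table can then be resolved to a labelled concept, and from there to whoever else has published an equivalent concept.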

DDI and SDMX
There are in fact many efforts already to make survey and statistical data more easily comparable – and to set standards for exchanging data, and, importantly, the meta-data that goes with it describing how it was sourced, collected, manipulated etc.

The two main standards are SDMX, primarily for large-scale statistical data and time-series, and DDI (Data Documentation Initiative). These are both XML based standards (rather than linked-data and RDF standards), but efforts are underway to create RDF representations of both, with the SDMX efforts far more advanced so far.

SDMX has a number of useful ‘concepts’ (such as a shared concept for gender), and defines a set of ‘dimensions’ on which data might be analysed (e.g. time period; area covered), so when we define the data structure for our aggregate statistics, we make links with these parts of SDMX.

For example, the fragment below is part of a data structure definition of summary statistics on smoking prevalence (the prefix statements show all the different vocabularies being used).

@prefix yls: .
@prefix sdmx: .
@prefix sdmx-dimension:  .
@prefix sdmx-concept:  .
@prefix geo: .
@prefix rdfs:  .
@prefix qb: .

yls:refArea  a rdf:Property, qb:DimensionProperty;
    rdfs:label "Area statistic refers to"@en;
    rdfs:subPropertyOf sdmx-dimension:refArea;
    rdfs:range geo:Country;
    qb:concept sdmx-concept:refArea .

Data Cubes
Representing statistical data turns out not to be entirely straightforward (if you want to see how tricky it can be, just look at a government spreadsheet and take note of all the footnotes and annotations needed to properly describe what the data is saying). You need to find a structure which allows almost everything to be annotated, whilst keeping the data as simple and easy to query as possible (i.e. you want to avoid very deep ‘tree’ structures where the final value you read is nested inside many layers of explanation and annotation).

The RDF data cube vocabulary is currently under development, but seemed to be a fairly good starting point for modelling the Young Lives data, particularly the summary data.

For the individual survey responses (micro-data) each question answer is recorded as a ‘measure’ against an individual data cube ‘Observation’.

For the aggregate data we have generated, we have created a more conventional data cube consisting of a number of ‘dimensions’ (location, age, gender) and a measure (e.g. smoking prevalence), along with a data structure definition.

You can see in the fragment below that our observation comes from the ‘health2010-younglives-smoking’ dataset, which has the ‘dsd-smokingStats’ data structure definition.

yls:smoking-PE-2010-14-Female a qb:Observation;
    qb:dataSet yld:health2010-younglives-smoking ;
    yls:refAge "14" ;
    yls:refArea geo:PE ;
    yls:refPeriod ns0:2010 ;
    yls:smokingPrevalence "0.11" ;
    sdmx-dimension:sex sdmx-code:sex-F .

yls:health-2010-statistics-smoking a qb:DataSet;
    qb:structure yls:dsd-smokingStats .

If we could find comparable data with a similar data structure definition (or, ideally, an identical data structure definition), then making comparisons between these two datasets becomes a lot easier.

By publishing our data structure definition we also make it available for others to re-use.
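To illustrate why a shared structure makes comparison easier, here is a small Python sketch that treats each observation as a set of dimensions plus one measure. The first observation reuses the values from the fragment above (area PE, period 2010, age 14, female, prevalence 0.11); the second dataset and its figures are invented purely for illustration, and the field names are my own rather than part of the data cube vocabulary.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    """One cell of a data cube: four dimensions followed by one measure."""
    ref_area: str
    ref_period: str
    ref_age: str
    sex: str
    smoking_prevalence: float


# Values taken from the fragment above.
ours = [Observation("PE", "2010", "14", "F", 0.11)]

# Invented comparison data sharing the same structure; not real statistics.
theirs = [
    Observation("PE", "2010", "14", "F", 0.10),
    Observation("PE", "2010", "14", "M", 0.20),
]


def dimensions(obs):
    """The dimension values that locate an observation within the cube."""
    return obs[:4]


def compare(a, b):
    """Line up measures from two datasets wherever all dimensions match."""
    lookup = {dimensions(o): o.smoking_prevalence for o in b}
    return {
        dimensions(o): (o.smoking_prevalence, lookup[dimensions(o)])
        for o in a
        if dimensions(o) in lookup
    }


comparison = compare(ours, theirs)
print(comparison)
```

When two publishers agree on the dimensions, lining up their figures is a simple lookup; when they do not, someone has to work out by hand which cells correspond.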

The right representation?
I’ve been on a steep learning curve whilst creating this representation of the data – so there are probably many flaws and things that could be improved. I’ll be continuing to develop the data model over the coming weeks in the run up to November’s workshop.

Converting the data
The process of finding a way to represent the data was an iterative one – and one that involved a lot of searching, researching and, essentially, looking at what other people had done and at other data resources we might want to link to, in order to select the most appropriate ontologies and structures of data to use.

For the actual conversion of the question descriptions and data I turned to the RAP RDF libraries for PHP, which make it relatively straightforward to write scripts which will convert our data.

In the future more tools for conversion into RDF may be available. For example, Google Refine is developing as a platform which might make authoring RDF easier – but for now the approach was pretty manual.
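For readers who want a feel for what such a conversion script does, here is a minimal sketch. It is written in Python using only the standard library, not the RAP PHP libraries I actually used, and it emits one plain N-Triples statement per answer rather than the data cube structure described above. The CHILDID column, the sample rows and the responses/ and variables/ URL paths are assumptions made for illustration.

```python
import csv
import io

# Base address from the post; the 'responses/' path below is an assumption.
BASE = "http://data.younglives.org.uk/"

# Invented sample rows; CHILDID is a hypothetical identifier column.
SAMPLE_CSV = """CHILDID,CMBRFRR3,SMKPRNR3
PE0001,1,0
PE0002,2,1
"""


def escape_literal(value):
    """Escape the characters that N-Triples does not allow raw in a literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def row_to_ntriples(row, id_column="CHILDID"):
    """Turn one table row into one triple per non-empty answer."""
    subject = "<%sresponses/%s>" % (BASE, row[id_column])
    lines = []
    for column, value in row.items():
        if column == id_column or value == "":
            continue
        predicate = "<%svariables/%s>" % (BASE, column)
        lines.append('%s %s "%s" .' % (subject, predicate, escape_literal(value)))
    return lines


ntriples = []
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    ntriples.extend(row_to_ntriples(row))

print("\n".join(ntriples))
```

Each column heading becomes a property URL and each row becomes a resource, which is the basic move behind most table-to-RDF conversions.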

Displaying the data
Once the data was in RDF format, I needed to make it available to query.

This can be done by just providing the RDF files on a web server for people to download and explore in their own software. However, to make the data query-able across the Internet we needed a SPARQL server.

I’ve made use of Virtuoso, as there was good documentation on how to set it up as a data server to interoperate with OntoWiki as a front-end. OntoWiki is a wiki-like interface for browsing RDF data and editing it – adding new properties to RDF resources, or easily importing new data from a web-based interface.

The demonstration server is a temporary virtual server, so may not be available all the time, but at practicalparticipation.dyndns.org:8890/sparql you should find a SPARQL endpoint which can be used to query the data (which is in the http://data.younglives.org.uk/ graph)*, and at http://practicalparticipation.dyndns.org/ontowiki is an interface for browsing the young lives dataset.

*I’ll explain what this all means in a later post.
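In the meantime, here is a rough sketch of how a query address for a SPARQL endpoint can be put together. It only builds the URL and does not contact the server. The query and default-graph-uri parameter names come from the SPARQL protocol; the http:// scheme on the endpoint address and the example query itself are my assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint and graph as given above; the http:// scheme is assumed.
ENDPOINT = "http://practicalparticipation.dyndns.org:8890/sparql"
GRAPH = "http://data.younglives.org.uk/"

# Example query (an assumption): list ten resources that have a label.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?variable ?label
WHERE { ?variable rdfs:label ?label }
LIMIT 10
"""


def build_query_url(endpoint, query, graph):
    """Encode a SPARQL query as a GET request URL for the endpoint."""
    params = {"query": query.strip(), "default-graph-uri": graph}
    return endpoint + "?" + urlencode(params)


url = build_query_url(ENDPOINT, QUERY, GRAPH)

# Decode the URL again to check that nothing was lost in the encoding.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["default-graph-uri"][0])
```

Pasting the printed address into a browser would, if the demonstration server happens to be running, send the query to it.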

What now?

So, the data is converted.

It’s represented in a way that makes use of links between different ontologies for describing data.

But, right now, there’s not much related data available in standard formats to link to.

So the next step is to explore the possible linkages to other datasets more – and to build some visualisations on top of the data that demonstrate what is possible when we can draw in linked resources more effectively.

As Richard Cyganiak puts it in a recent review of open government data (slide 32), triplification alone isn’t very useful – it’s what we do with it…

Questions, suggestions and ideas welcome…

Sharing knowledge with blogs

Looking at pingbacks from The Giraffe, I came across this blog post from Joitske Hulsebosch about The giraffe blog used by IKM and colleagues. I didn’t know where to store this so I have decided to put a link here. It was posted for a group of Dutch civil servants who are concerned with the implications of Web 2.0 for government; the link is to a Google translation of the web discussion on this subject.

I love Prezi (and the visualisation of knowledge)

On 10 June 2010, I went to a meeting of the Intellectual Capital Circle of the InHolland University of Applied Sciences in Hoofddorp, The Netherlands, on the subject of visualisation of knowledge. As visualisation of knowledge is one of the themes of IKM Emergent, I went along to see if I could gain any new insights to share with my IKM colleagues.

Mind maps and concept maps
The meeting had two main components. The first comprised presentations of how mind maps and concept maps are being used in tertiary education by InHolland and other Dutch universities and organisations. The second part comprised a brainstorm – in a world cafe format – on the power of visualisation for knowledge management.

The first presentation described how software called Inspiration was being used by InHolland. The second presentation described the use by Leiden University Medical Centre of concept maps, making links between clinical and biomedical knowledge explicit. In the light of IKM Emergent’s attempts to bridge various knowledge divides, I thought this sounded very interesting. Concept maps were useful in this regard because they can be used to make explicit knowledge more visible, based on mapping by multidisciplinary teams. They were also used to link clinical concepts with biomedical ones. Interestingly, it appeared that older people and experts had quite a lot of difficulty with this exercise because they find it difficult to look laterally at things and to access their implicit knowledge.

Brainstorm
In the brainstorm session that followed, the findings that most interested me were that visualisation can play a role in the codification and personalisation of knowledge; that it supports out-of-the-box thinking; that it can play a role in negotiations of meaning; and that, because it replicates our own non-linear thought processes more clearly, it supports non-linear forms of work.

New ways of presentation
My fellow brainstormers told me about new visualisation software. The one that particularly caught my attention was something called Prezi and I thought I’d give it a try because I needed to make a presentation to my colleagues at Context in July.

Prezi is a new sort of presentation and can be used like PowerPoint which, although innovative in its time, is now rather standard and linear. For example, I like the way that PowerPoint reminds me of what I want to say, helps me structure my thoughts and takes the attention away from me but it does look a bit old and stuffy these days (unless you do something really special with it of course.)

Prezi
You can see the presentation I made to my colleagues, All you ever wanted to know about Sarah’s work and didn’t dare to ask. The presentation worked really well using the trial version on my laptop, but you need to navigate through it with either the arrows at the bottom of the presentation or with your own cursor keys, and not start panning through (that only makes you feel seasick, believe me!) Everyone I have shown the presentation to has become terribly excited about Prezi too and I’m going to use it again and again. It’s so much fun and so creative!

IKM-relevant? Annual programme meeting, days 2 and 3

After exploring and discussing (on day 1) the various pieces of research work that have been undertaken in IKM-Emergent until now, the second day of the workshop started with a world café and continued with a ‘birds of a feather session’ (a marketplace / less-open space method) where we explored some ideas for the end of the programme and a potential IKM-Emergent 2 programme. The last day of the programme put us in action planning mode around crucial activities.

The key points that came up in the strands of work I was involved in on the second and third days of the meeting:

  • More attention will be paid to communication: internal communication to increase the awareness of all IKM-E members about each other’s work but also more external communication to engage with a wider group of development actors. The role of social media has been raised as a crucial point in case to leverage the great work of IKM-E to the wider world. Along with the Giraffe, IKM-E blog and the wiki, there is potential to use del.icio.us, to use Twitter (anyways I’ve been tweeting around the hash tag #ikm-e) and perhaps Slideshare or other tools.
  • The various strands of the programme will come together in some respect to reinforce each other: ripples of participation, local content/knowledge, emergence, traducture, multiple knowledges coming together in an approach that recognises complexity and power issues.
  • In the last 18 months of this programme, the IKM group will come up with various practical outputs which can be mobilised and used more easily by development agents to reflect on and adapt values, behaviours and practices: books, guidelines, checklists (of critical questions, questions and more questions), workshops, video explanations of our work, an overall narrative for the IKM programme that offers a comprehensive understanding of the issues we are questioning etc.
  • One interesting output that we will be working on is to apply our work to the current change process in which a couple of development organisations are involved, to see how helpful it is and help these organisations reflect critically on their approach to knowledge-focused development work. This activity will culminate with a learning workshop in 2011 where we may prepare additional action research activities as part of a future programme.

A possible IKM-2 programme (Ewen’s vision)
  • The IKM-2 programme in preparation will not just advocate for multiple knowledges and the like but will actually consistently practice what it preaches and organise joint action-research work / collective inquiries (including participants from NGOs, knowledge institutes, community members, donor agency representatives, governmental agents, artists and the media) on a number of topics. This will help us demonstrate the power, potential and strategic value, but also the challenges, of bringing together multiple knowledges. It will also help us develop a multiple accountability system that stimulates us to change perspectives and practices through joint action and ownership.
  • This future programme will continue operating as a network of passionate and capable individuals creating opportunities for IKM to build upon existing work or new ideas, but it will also establish more firm relations with a wide variety of networks (and journals such as the KM4Dev journal) and institutions to accompany and strengthen our questioning work.

This has been an extremely juicy two final days with a delicious fruit salad of insights and ideas, approaches and concepts. The academic head and practice-oriented arms of IKM-Emergent are still working in a somewhat disjointed fashion, but that is as natural as waking up and not yet having adjusted body coordination; nonetheless the body is wobbling on and indeed moving forward. We have been dreaming profoundly and we are now putting our dreams to action. Let’s hope we can soon take a good walk and later on run to co-create relevant next (not best) practices of sustainable and people-centred development.

By the way, some pictures of this IKM-Emergent meeting are available on: http://www.flickr.com/groups/ikm_emergent/pool/

IKM-convergent? Annual programme meeting, Wageningen, day 1

A while back I blogged about the IKM-Emergent programme and its tendency to dispersion.

The programme has evolved since then and a number of things are coalescing on this first day of the all-peeps IKM-Emergent workshop (which brings together the three working groups, but also a number of new guests that are working on issues related to IKM-E and/or that will be working for the programme from now on).

IKM participants getting their heads around common issues

A lot of very interesting ideas and insights came out from the wide variety of participants but what struck me as key converging points are the following:

  • Dynamics of change: A lot of us were wondering how to bring about change? Should we have a very upfront / head-on approach to change or should we rather follow more subversive ways of tilting the development system?
  • Related to this, we seem to agree on the concept of intention as the driving force behind a lot of development work. In a change process, our words (i.e. lip service or love declarations to change) matter much less than our real intention to stimulate change.
  • A lot of IKM-Emergent work seems to be concerned with raising awareness about development dynamics and biases at large and about specific lenses or approaches in particular: multiple knowledges, traducture (more on this later but I would describe this as the socio-cultural translation of concepts and approaches, not just the loss of meaning that is usually part of the linguistic transaction of translation), emergence etc.
  • As in the launch event of the Change Alliance (read this blog post about it), a key theme was the difference between agency-driven and civic-driven movements. We need to support civic-driven movements – going beyond the faddism of just supporting them as part of the latest craze. Instead, what do we do, implicitly or explicitly, to support these movements?
  • The importance of critical analysis and questioning which can be the only focus area we provide as ‘agency’: we need to move from setting up water pumps and delivering food onto helping all development actors equip themselves with critical reflexivity as part of the survival toolkit that stimulates self-empowerment and (less biased) development. It is this reflexivity that helps us challenge ourselves, our discourse, our practices, our being.
  • Accountability as a central practice that goes way beyond upward accountability towards donors. We need to be aware that we are (or should be) accountable to one another in all our development transactions and it is that accountability that generates the trust necessary to engage in development relationships and to open up a space for joint critical inquiry.

There was actually a lot more content in the discussion but these items stick out as pointers that came back time and again in the presentations and conversations.

This was day one of the workshop and the rest of the workshop sounds very promising! On the menu on day 2: looking back at the legacy of IKM-Emergent, limitations of the programme and the possible foundations of an IKM-Emergent 2. Keep watching this space!

Introducing the work of rural educators in Brazil

Mike Powell had previously come across Dan Baron Cohen’s work and attended a talk he was giving on his Cultural Literacy projects and non-logocentric pedagogies in Brazil. Dan was subsequently invited to attend WG1 of the IKME programme. The idea was to invite Dan to put forward a project within IKME and to encourage the production of local knowledge for sustainable development, based on Freirian principles.

WG1 spent between a year and 18 months working out its own projects and contributing to the overall IKME programme, which was then submitted to the Dutch government. WG1 participants discussed how to develop different case studies and how to work together, assuming that the collaboration between group members would add depth to the work that the whole group did. For Dan, intimate contact and profound enquiry would contribute to the sustainability of the work. Dan was also interested in weaving the translation methodologies work of WG1 into the practice of the World Social Forum (WSF) and the 2010 World Congress being organised by the International Drama/Theatre and Education Association (IDEA). Although the group had only just discussed the concepts, a first outline proposal submitted to the WSF was rejected. However, members of WG1 continued to talk about mutual visits and how they might work collectively. Discussions revolved around how to define quality and outcomes and the collective enterprise. An equal amount of money was apportioned between group members to carry out their case studies.

Communications 2010 (Part 2)

I’m currently reviewing the Communications Strategy of IKM – originally written in 2007 and published as a Background Paper in 2008 – at the same time as producing a Communications work plan for 2010.

The Communications Strategy in 2007 placed a lot of emphasis on the ‘stickiness’ of ideas and the need for IKM to develop an elevator pitch. The concept of stickiness, which relates to how ideas stick, was developed by two brothers, Chip and Dan Heath, in their book, Made to stick: why some ideas survive and others die. In their conception, sticky ideas are those that are simple, unexpected, concrete, credible and emotional. And somehow linked to this in my mind is the elevator pitch: a short overview of an idea for a product, service, or project. The name reflects the fact that an elevator pitch should be possible to deliver in the time span of an elevator ride, meaning in a maximum of 30 seconds and in 130 words or fewer (Source: Wikipedia).

Although the original Communication Strategy was much enamoured of ‘stickiness’ and the elevator pitch, I now have serious doubts as to whether these are the answer for IKM or for anyone trying to communicate complex messages. In addition, both of these approaches come from the tradition of ‘knowledge as truth’ and what we are increasingly understanding is that knowledge as truth is not important at all. Instead, it is the sharing and negotiation of meaning that are important. To quote from Harry Jones’ 2009 joint ODI/IKM working paper:

While the translation and ‘transfer’ of knowledge have become widespread terms and are the focus of a number of initiatives, some argue that the term is inappropriate. Many point to the complex and contested nature of applied social research which makes claims to stable, ‘objective’ and acontextual knowledge, embedded in some paradigms of evidence-based policy and knowledge-transfer, less appropriate (eg. Brown 2007, Walter et al 2008). Instead, it is important to recognise the contextual nature of knowledge and the complexities of its ‘use’. This means looking at knowledge interaction and the messy nature of engagements between actors with diverse types of knowledge. There is a growing literature advocating ‘interaction’ and collaboration as key activities to link knowledge and policy (Jones 2009, p. 25)

So what does all this mean for IKM’s communications? I think it means a new emphasis on personal interactions with those outside the programme as a way of developing shared meaning, rather than thinking that an elevator pitch or presenting IKM’s messages simply will do the trick. And more emphasis on developing the community of practice around IKM because it is only in the interaction between these individuals – and their interaction with those outside the programme – that current approaches to information and knowledge can be changed.

It’s quite interesting that this post is also reflected in the current discussion on monitoring and evaluation (M&E) of KM which is taking place on The Giraffe: Monitoring knowledge (management): an impossible task.