One of my favourite subjects at school was Graphic Communication, which in those days was technical drawing with some other stuff. For as long as I can remember I’ve been interested in how data is presented; in fact, the very first comment I got on this blog was from Tony Hirst after mentioning his Visual Gadgets course un-unit (sic) blog experiment – now there’s a fixed point in time.

Since the closure of JISC RSC Scotland North & East I’ve taken the opportunity to rediscover this interest looking at network analysis tools like NodeXL and visualisation techniques and libraries like d3.js and the Google Visualization API. As part of this period of Edupunking myself I’ve submitted some (un)assessments including the Guardian Tag Explorer, which I got some great feedback on and an invitation to visit and chat to some of the Guardian engineers/journalists.

This work hasn’t gone unnoticed by the JISC CETIS/OER Programme team and I’m very pleased to announce that I’ve been contracted to visualise the OER Programme (gulp) and I’ve got 40 days to do it (gulp).

This post is day 1 of the project and what I wanted to do was introduce what it is I’ll be doing over the next 39 days, jot down some notes of things I’m thinking of or have already found (including some web standards) and give you an opportunity to comment on any/all of this.

OER Programmes Visualisation Project Outline

The UK Open Educational Resources initiative is jointly led by JISC and the Higher Education Academy and is now in its third year. You can find out more about the programme on the JISC or HEA pages. I would also highly recommend John Robertson’s (CETIS) collection of posts on OER, which are a great lead-in to this project.

So what is ‘this project’ about? Here’s a link to an online version of the full OER Programmes Visualisation Project proposal. In brief, there are two work packages: one for examples and workflows for visualising OER project data stored in the JISC CETIS PROD database, and a second mainly to produce visualisations around OER content and collections (these include geographic mappings, subject groupings and volumes).

Initial thoughts on PROD

Having dabbled with PROD before I’m familiar with the data and architecture. Something I regularly struggle with is using SPARQL to query the linked datasets and I usually end up asking Wilbert Kraan (CETIS) for help. Something I’m thinking might be useful, even if it’s just for me, is to create a synchronised version of the data stored in Google Spreadsheets (à la Guardian Datastore). My thoughts behind this are that spreadsheets are more commonly understood, Google Spreadsheets include a number of visualisation options which can be embedded elsewhere, and, as in my TAGSExplorer, the data can be read and visualised using other visualisation libraries. CETIS also already have examples for getting PROD data into Google Spreadsheets, which means I can spend more time on mashing up the data.

Initial thoughts on OER content and collections

Top level data like geographic mappings look straightforward, but delving deeper is going to take some more work. As the project proposal highlights, whilst projects were free to host resources anywhere, only requiring a reference to be submitted to Jorum, what has been submitted (if anything) varies greatly.

OER creation is also only part of the story; after all, open educational resources are about providing material for reuse/modification. Where is this data stored? This and other points are highlighted in Lorna Campbell's (CETIS) UKOER 3 Technical Reflections post, and these issues also feature in Amber Thomas’ (JISC) Digital Infrastructure to Support Open Content for Education post which supports the JISC OER Rapid Innovation Call for Proposals. [Whilst on activity data, there’s a new JISC resource which draws on the 9 Activity Data Programme projects (this includes a section on visualisation). I also can’t help but link to this video which visualises contributions (commits) to Python]


What I’ve learned over recent weeks is that VML (Vector Markup Language) works in Internet Explorer 5 and later. It was never adopted as a W3C standard; the W3C instead opted for SVG (Scalable Vector Graphics), which Microsoft didn’t incorporate until IE9. Some libraries/tools like the Google Chart API embed scalable graphics using iFrames and then detect the viewer's browser to deliver the graphic in VML or SVG. One of my preferred libraries, d3.js, only renders in SVG so there is a question about whether it should be used.

I’m only just getting up to speed with HTML5 canvas. Being part of the HTML5 spec it’s only viewable in newer browsers, so like VML/SVG there is a question of compatibility. As canvas converts coded shapes into bitmaps it can render large numbers of objects more efficiently, whereas SVG draws the shape and remembers it, allowing further interaction – that’s my understanding anyway.

What’s next

Top priority is to find out how much UKOER data there is and in what format. Once I’ve planted the seeds of what I need I can get on with the PROD package and the top level OER information. I’ll also need some answers around standards. My feeling is SVG should be okay as it’s a W3C recognised format, but I’m probably biased as it means I can remash some of my d3.js work.

I am as always on the scrounge for new ideas/comments so please share what’s on your mind ;)


I’m not intending on blogging daily about the OER Visualisation Project but I’ve got a new piece in the jigsaw worth sharing. Below should be an embedded map of JISC/HEA funded OER projects (well, half of them anyway: if you visit the Yahoo Pipe which was used to generate this there are 51 projects listed but I was only able to automatically get locations for 28). This was generated by using SPARQL to query the CETIS PROD database for projects matching the strand title “open education …”. After the map is the recipe used to make it, which includes some 2-to-1 online tutoring (aren’t PLNs great ;) and a bunch of other tools which I might dip into during this project.

[As part of the process I also produced this PROD Theme GeoMap as a proof of concept. As there’s a user defined theme entry on this it would be easy to drop into the CETIS PROD summary pages like this one, filtering the querystring parameter into the map embed code highlighted below:

<script src="http://l.yimg.com/a/i/us/pps/mapbadge_1.5.js">{"pipe_id":"810495bf81c7d207400937f2ebfe0ebb","_btype":"map","pipe_params":{"theme":"UKOER"}}</script>

How it was made

It all started last night after reading Owen Stephens’ post on Experimenting with British Museum data in which he uses SPARQL (a query language) to extract data for use elsewhere. As I mentioned in my opening post for this project I’m not a huge fan of SPARQL, part of the problem being I can’t get my head around the markup. As I was venting my frustration ‘via the medium of Twitter’ I started receiving encouraging words of support from Wilbert Kraan (CETIS) @wilm and Owen @ostephens which spilled over into this Google Doc.

[If you want to see what the query in the doc does, Wilbert also pointed me to the Sgvizler library which renders SPARQL SELECT queries using the Google Visualization API. A version is hosted on Wilbert’s site and here is the query rendered as a table. Another resource I’ve found helpful that was recommended by Owen is Ed Chamberlain’s SPARQL Tutorial].

With my SPARQL Rosetta in hand I headed back to the CETIS Wiki knowing there was a SPARQL example to produce geographical position of projects from the FSD programme that use Archimate.

From this I produced:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX prod: <http://prod.cetis.ac.uk/vocab/>
PREFIX doap: <http://usefulinc.com/ns/doap#>
PREFIX mu:   <http://www.jiscmu.ac.uk/schema/muweb/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT DISTINCT xsd:float(?lat) xsd:float(?long) ?projectName ?desc ?project
WHERE {
 ?project a doap:Project .
 ?project doap:name ?projectName .
 ?project prod:strand ?strand .
 FILTER regex(?strand, "^open education", "i") .
 OPTIONAL { ?project doap:shortdesc ?desc } .
 ?project doap:vendor ?uni .
 ?uni owl:sameAs ?fbID .
 ?muID owl:sameAs ?fbID .
 ?muID geo:lat ?lat .
 ?muID geo:long ?long .
}

Which I now translate as: SELECT DISTINCT the latitude, longitude, project name, description and project id WHERE the strand begins with ‘open education’ (ignoring case), OPTIONALly getting a short description (if available), and from the project host looking up a matching lat/long from the JISC Monitoring Unit data.

There are a couple of things we can do with this query. Putting it through the Rensselaer Polytechnic Institute's SPARQL Proxy we can generate formats easily digestible by Yahoo Pipes, as demonstrated by the example above (if you log in to Pipes you can see there is very little going on with the source). The proxy service also has a Google Visualization API output which could lend itself to further mashineerings, wrapping in widget/gadget etc.

Another option is to publish the results as .csv and then use the resulting URL in the formula =ImportData(“your url”), which I’ve done in this spreadsheet; again you have options for presenting the data in different ways.
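The same CSV output can of course be consumed outside a spreadsheet too. A minimal Python sketch of parsing a proxy-style CSV response (the sample string and column names here are invented placeholders standing in for the real proxy output, which you'd fetch with urllib):

```python
import csv
import io

# In practice this string would come from urllib.request.urlopen(csv_url).read().decode();
# the values and headers below are made-up placeholders.
sample_csv = """lat,long,projectName
52.9,-1.2,Example OER Project
53.8,-1.5,Another Project
"""

# DictReader gives us one dict per row keyed by the header line
rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    print(row["projectName"], row["lat"], row["long"])
```

From here the rows can be re-plotted, re-counted or pushed into whatever visualisation library you fancy.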

So there's lots of potential around these techniques to extract some data, but a couple of notes of caution: Yahoo Pipes can be very temperamental (in fact while I hover over the publish button for this post it's showing no results), and whilst linked data makes it possible to use JISC MU data to get geo-locations for projects listed in PROD, as the spreadsheet linked above shows it didn't always find a longitude/latitude for the project – it was a 50% hit rate (more investigation required).

Big thanks to Wilbert and Owen for their help with this!


I’m really not intending on posting each day … honest (although some people would like that ;), but here’s another quick hack for getting a feed's post frequency using Google Spreadsheets.

Looking down Phil Barker’s UKOER pilot phase sources list (and getting over the shock of limited full repository feeds <sigh>) I noticed that Leeds Metropolitan University have a full feed(?) available via Yahoo Pipes <yummy>. The feed contains 341 items and I was interested in the frequency of item submission. My initial thought was to pull the feed into Google Spreadsheet using the ImportFeed function  (e.g. =importFeed("http://pipes.yahoo.com/pipes/pipe.run?_id=286bbb1d8d30f65b54173b3b752fa4d9&_render=rss"; "items"; true;100))  but it was limited to 20 results (looking at the raw feed I saw it had 100 items – I’m sure there must be a switch in Pipes to get more).

Instead I switched to getting the Pipe as JSON. JSON input powers a lot of my other Google Spreadsheet toys like TAGS and the Google+ network generator so I’ve a growing bank of code snippets I can throw together.

Here’s my ‘Frankenstein’ code to read a JSON source and write to a sheet. Two little tricks in here are to recast the pubDate as a date so that it’s interpreted as such in the spreadsheet and writing the JSON objects to a sheet using normalised column headers to identify object elements (I borrowed this technique from the writing data Apps Script tutorial).
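The gist of that approach can be sketched in plain Python: derive normalised column headers (lower-case, non-alphanumerics stripped) so sheet headers line up with JSON object keys. The payload shape below is an invented stand-in for a Pipes JSON response, not the real schema:

```python
import json
import re

def normalise(header):
    """Lower-case and strip non-alphanumerics, so 'Pub Date' -> 'pubdate'."""
    return re.sub(r"[^0-9a-z]", "", header.lower())

# Hypothetical sample of a Pipes-style JSON payload
payload = json.loads(
    '{"items": [{"title": "Item one", "pubDate": "Mon, 05 Dec 2011 10:00:00 GMT"}]}'
)

headers = ["Title", "Pub Date"]  # the sheet's human-readable column headers
rows = []
for item in payload["items"]:
    # Normalise the item's keys the same way so header and key match up
    lookup = {normalise(k): v for k, v in item.items()}
    rows.append([lookup.get(normalise(h), "") for h in headers])

print(rows)
```

In the Apps Script version the `rows` array would then be written back to the sheet in one call.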

Once the data is in the spreadsheet there are some standard spreadsheety things you can do, like using the Frequency function to count occurrences of dates and generate charts like this one:

or something less conventional like using the data in a visual timeline gadget to produce:


View interactive version of Leeds Metropolitan University Unicycle feed (this link lets you file > make a copy of the spreadsheet for your own use)
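The frequency count behind those charts is easy to reproduce outside the spreadsheet too; a sketch with Python's Counter, bucketing made-up pubDate strings by day:

```python
from collections import Counter
from datetime import datetime

# Hypothetical pubDate strings as they might come out of the feed
pub_dates = [
    "2011-11-01 09:15:00",
    "2011-11-01 16:40:00",
    "2011-11-03 11:05:00",
]

# Bucket by day, mirroring the spreadsheet Frequency approach
per_day = Counter(
    datetime.strptime(d, "%Y-%m-%d %H:%M:%S").date() for d in pub_dates
)
for day, n in sorted(per_day.items()):
    print(day, n)
```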

And as this data was grabbed using Google Apps Script if I had a list of JSON sources for UKOER repositories I could automate the whole process …


I told you I wasn’t intending on posting every day ;) Yesterday ended up being quite frustrating with a number of dead-ends. It all started following on from Processing a resource feed to find frequency using Google Spreadsheets where I took Leeds Metropolitan University’s Repository feed from Yahoo Pipes to get activity data into a spreadsheet (BTW Nick Sheppard has documented how the Pipe was made here).

For the next step I wanted to do something similar with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). You can read more about OAI-PMH here but basically it's a method for repositories to share information about the resources they’ve got. For example, you can visit the Social Policy and Social Work (SWAP) Subject Centre: SWAPBox repository and see what they’ve got. They’ve got a nice interface for browsing and searching for resources which is great for humans but a struggle for machines. Instead the machines have their own interface using the SWAPBox OAI service (even though this service is designed for machines it’s been beautified with Chris Gutteridge’s OAI to XHTML XSLT). Here you can get access to individual record details or, the thing I was more interested in, a list of record ids and date stamps.

I had a look around/asked for a desktop OAI harvester that I could export data from. The closest I got was the suggestion by James Toon to use Google Refine, but OAI results usually have pagination and I couldn’t find a way to automatically get the next page of results. I also tried putting PKP’s Open Harvester System on a USB drive by running it on an instance of the MoWeS portable webserver (now there’s an idea, UKOER on a stick – could be a great conference bag goodie?). Unfortunately I discovered there wasn’t a way to easily export harvested records (although I may be able to dump from MySQL).

Instead I reverted back to Google Apps Script (something I was avoiding as parsing XML in Apps Script is never fun). Here is a script I wrote to read OAI service ListIdentifiers and write to a spreadsheet. Using this with the SWAPBox repository we can get a distribution of submissions (the spreadsheet for this is here)
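Not the Apps Script itself, but the core of any ListIdentifiers harvest looks something like this Python sketch: parse the OAI-PMH XML, collect identifier/datestamp pairs, and grab the resumptionToken that points at the next page (the XML here is a trimmed hand-made sample, not a real Jorum response):

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Trimmed hand-made sample of a ListIdentifiers response
sample = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example:1</identifier>
      <datestamp>2011-01-05T10:00:00Z</datestamp>
    </header>
    <resumptionToken>page2token</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

def parse_identifiers(xml_text):
    """Return (identifier, datestamp) pairs plus the resumptionToken, if any."""
    root = ET.fromstring(xml_text)
    records = [(h.findtext(OAI + "identifier"), h.findtext(OAI + "datestamp"))
               for h in root.iter(OAI + "header")]
    token = root.findtext(".//" + OAI + "resumptionToken")  # None when done
    return records, token

records, token = parse_identifiers(sample)
print(records, token)
```

A harvesting loop would keep re-requesting `…verb=ListIdentifiers&resumptionToken=<token>` until no token comes back.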

[As Leeds Met Unicycle repository also has an OAI service I thought I’d collect data from there to compare the RSS feed but I was getting timeouts]

PS Whilst looking for OAI export solutions I came across Ghent University Library's Download/APIs page which includes: daily/weekly exports of the complete datasets of the Ghent University Academic Bibliography in various formats; every report record accessible as HTML, RDF, Dublin Core, METS, MODS, MPEG-21/DIDL; an author export; department exports; classification type exports; OAI-PMH; and other stuff. All this lovely data just makes you want to go and do something with it! It’s a shame that so many other wonderful repositories of open resources keep their data hidden away. Here are some tips from the JISC Digital Infrastructure Team on Making OER visible and findable.

PPS Here are more real world examples of Open Bibliographic Data  H/T @ostephens


One of my ambitions from Day 1 of the OER Visualisation project was to start linking PROD data into Google Spreadsheets. Whilst this was primarily designed to help me with the project, after speaking to some of the JISC/JISC CETIS people it sounds like it would help them and others.

Here’s a spreadsheet which is the beginnings of a PROD datastore (warning: work in progress). Data is currently being populated from a Talis datastore using a number of different SPARQL queries which output CSV data via a SPARQL proxy (this will be switched to the Kasabi store when a suitable proxy has been sorted). You might find the SPARQL queries used to fetch this data useful for making your own, so I’ve compiled them in this commentable document.

What can we do with the spreadsheet? 

Creating pivot table reports

I used to find pivot table creation quite daunting but pivot tables are a great way to filter and analyse large sets of data. The PROD spreadsheet contains two example pivot reports, one for technology and the other for standards (if you want to enable pivot table options you’ll need to File > Make a copy of the spreadsheet, then on the pivot sheets select Data > Pivot Table Report).

The example ‘Technology Pivot’ summarises the data from phases 1 & 2 of the OER Programme. You can see a large number of technologies were recorded (over 100), the top three being YouTube, Flash and Slideshare. This data can be shown graphically using Google Spreadsheets chart tools and embedded as an interactive or static graphic.

The charts in Google Spreadsheets aren’t that exciting but there is a framework for extending these, which I’ll come back to later. Something I was hoping to do was link the data from Google Spreadsheets to IBM’s more powerful visualisation service Many Eyes. For example, below are examples of the technology pivot data as a bubble diagram and a comparison treemap, and it would have been nice to automatically generate these. Tony had posted on this a couple of years ago using Many Eyes Wikified and Google Spreadsheets, but alas that part of the service was pulled last year.

UKOER Technologies Phase 1&2 (Bubble chart) Many Eyes
UKOER Technologies Phase 1&2 (Treemap Comparison) Many Eyes

Visualising project relationships

One of the great things about the PROD data is there is a lot of relationship data already there. For example if you look at the PROD page for the ENABLE project you can see there are details of the projects that ENABLE builds on or was built on by, related projects and even comments that relate to the individual relationships.

This relationship data can all be extracted from PROD and in this case imported to the Spreadsheet. On the Relates_to sheet I’ve imported details of all the JISC funded project ‘relates_to’ relationships. What can we do with this data? Well at a basic level in column B we have a source id and column F has a target id which makes it suitable for using in a force layout diagram. Fortunately I’ve been playing around with online force layout diagrams for a while and most recently created a Google Spreadsheet Gadget to display this info (this is how you can extend the basic chart selection).

Whilst this gadget still needs to be verified by Google before anyone can see the results, we can use the spreadsheet as a datasource for the gadget’s big brother EDGESExplorer. Publishing the spreadsheet to the web, using the built-in tools to do this, we can reformat the data in EDGESExplorer to see how all JISC funded projects stored in PROD are related (click on the image below for the interactive version; you can explore individual nodes by clicking on them).

JISC Project Relationship

I think this graph provides a useful interface for seeing and exploring the relationship between JISC funded work. To become really useful some additional interfacing is required but there’s code I can reuse from my Guardian Tag Explorer and Twitter Conversation explorer tools, which gives you more project info and a link back to the appropriate PROD page (btw interesting post on ouseful.info on Information Literacy, Graphs, Hierarchies and the Structure of Information Networks).


So to recap: a Google Spreadsheet is being populated from PROD (live data). Users can create reports and charts within the Spreadsheet environment (live data) or export data to other services like Many Eyes (dead – as in the live link is broken - data). Finally we can publish live data from the Spreadsheet for use in other tools like EDGESExplorer.

My question for you is what data would you like in the Spreadsheet? Summary of projects by institution? Breakdown of projects by partners? Projects by JISC Programme Managers? Let me know what you would like to see ;)


I should start with the result so that you can see if it’s worth doing this:

The video shows the deposits from institutions and Subject Centres to Jorum tagged ‘ukoer’ from January 2009 to November 2011. In total over 8,000 deposits condensed into 5 minutes (there are more records, but these were the ones that could be reconciled against an institution name).

Here’s the recipe I used to do it, which should be easy to modify for your own and other repositories. As the explanation takes longer than to actually do it I’m going to assume you understand some basic tools and techniques, but you can always leave a comment if something isn’t clear.

Let's start by looking at what it is we are trying to achieve. The animation is generated using code from the open source Gource project. Gource uses an input log file to visualise software commits using the format shown below. So for the Jorum visualisation we need to generate a file with timestamp, creator (in this case the submitter's host institution) and title (prefixed by subject classification).

Gource log format

The user andrew adding the file src/main.cpp on Thu, 03 Jun 2010 05:39:55 GMT (1275543595) would appear in the log as 1275543595|andrew|A|src/main.cpp.
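Assuming Gource's pipe-delimited custom log format (timestamp|user|type|file, with 'A' marking an addition), generating the log lines is trivial; the sample record below is invented:

```python
# Each tuple: (unix timestamp, submitting institution, subject/title path)
# -- an invented sample record standing in for the Jorum data
records = [
    (1275543595, "Example University", "Engineering/Sample resource"),
]

def gource_line(ts, user, path, change_type="A"):
    """Build one line of Gource's custom log: timestamp|user|type|file."""
    return "%d|%s|%s|%s" % (ts, user, change_type, path)

for r in records:
    print(gource_line(*r))
```

Writing one such line per Jorum deposit gives the input file the video above was rendered from.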


Getting the data – building a source

To build the log file we need details of the records from Jorum. Fortunately Jorum implements the OAI Protocol for Metadata Harvesting, which is designed to allow the easy sharing and access of repository data. I say easy but in reality it's easy if you have another repository on a server somewhere that can consume OAI data; it's not easy to find a desktop based solution. After a lot of trial and error I’ve arrived at a solution using a combination of MS Excel and Google Refine (BTW “Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another” – it’s also open source).

I had hoped to do all of this in Google Refine but was struggling with the initial data import, recognising the schema and including records with multiple subject classifications, so we briefly start with MS Excel.

In Excel (I’m using 2010, other versions may work) we want to start a new workbook. In the Data ribbon select ‘From Web’. In the dialog that opens in the Address bar enter http://dspace.jorum.ac.uk/oai/request?verb=ListIdentifiers&metadataPrefix=oai_dc .

Excel data ribbon

Once it’s finished loading (which can take a while) click Import. You’ll now get some dialog boxes warning you about the XML import but you can ignore those. You should now have a sheet of List Identifiers, that is a list of all the record identifiers (ns1:identifier4), the subject set they are attached to (ns1:setSpec6) and a status (ns1:status3) – you’ll find that there are more columns, mainly blank, which we don’t need.

Excel XML data imported

Next we want to add some readable subject classification to the data by changing setSpec ids into text equivalents. This data is also available via Jorum’s OAI service and the raw data can be seen by looking at http://dspace.jorum.ac.uk/oai/request?verb=ListSets.

To get this data into Excel we want to follow a similar process to above in the same spreadsheet getting Data – From Web using http://dspace.jorum.ac.uk/oai/request?verb=ListSets as the address. This gives us a sheet similar to below with setSpec ids and associated category name.

Getting subject classification

Next we want to match this data to the sheet of List Identifiers. To do this we first want to sort the data we just captured on the setSpec column. Now in the original sheet add a new column and enter the following formula in the cell immediately beneath the column name (row 2):


This formula looks up the setSpec6 value, matches it against the data we just got and returns a setName. You can now save this spreadsheet.
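The formula is a plain VLOOKUP; reconstructed from the description it would look something like this (a sketch, not the original: the sheet name and range are placeholders for wherever your ListSets import landed):

```
=VLOOKUP(B2, ListSets!$A$2:$B$100, 2, FALSE)
```

i.e. take the setSpec value in column B, find it in the first column of the ListSets data, and return the setName from the second column.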

Getting more data using Google Refine

So far we’ve got a list of record ids from Jorum and the subject category for each record. We still need to get when the record was created, who by, and the resource title. To do this we are going to use Google Refine. If you haven’t already, here’s how to install Google Refine. Open Google Refine and create a new project from the Excel file we just created. The default settings should work, just make sure you select the sheet with 19,000 plus rows.

After the project has been created we want to get more information for each record identifier. From the ns1:identifier4 column drop-down menu select Edit column > Add column by fetching URLs:

Google Refine - add data from column

In the dialog box that opens use the following settings:

  • New column name – record
  • Throttle delay – 500
  • On error – set to blank
  • Language – GREL
  • Expression - "http://dspace.jorum.ac.uk/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier="+value

Google Refine - Add column by fetching URL

When you hit OK, Google Refine will use the row value to fetch even more data from Jorum and enter it into a cell. This is done using another entry point to the OAI services, using each identifier to get all the record data (here’s an example response). As this has to process over 19,000 requests it can take some time. If you would prefer not to wait here’s an export of my Refine project with the data already collected.

Google Refine - Returned data

So now we have all the information we need but it’s all in one cell, so we need to do a bit more refining.

Extracting a date

You’ll notice that each record has a couple of dates stored in dc:date. Let's look at extracting the first date we find. Google Refine has a couple of ways to parse a cell and get data out. Initially I tried using Jython but didn’t get very far, but thanks to some help from the Google Refine community I found I could use Refine’s GREL language. Here’s how.

From the new ‘record’ column dropdown select Edit column > Add column > Add column based on this column. In the dialog that opens set the new column name as firstDate and enter the following GREL expression:

forEach(value.parseHtml().select("dc|date"),v,if(length(v.htmlText())>11,v.htmlText().slice(0,10)+" "+v.htmlText().slice(11,19),"bad format")).sort()[0]

What is happening here is, within the cell, forEach <dc:date>: if the value is longer than 11 characters slice the text for the first 10 characters (yyyy-mm-dd), add a space, then slice characters 11 to 19 (hh:mm:ss). As the dc:dates are temporarily stored in an array we sort this and get the first ([0]) value, which should be the smallest.
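The same slicing is easy to sanity-check in Python (the identical 0–10 and 11–19 character slices applied to a yyyy-mm-ddTHH:MM:SSZ string):

```python
def clean_date(value):
    """Mirror the GREL: 'yyyy-mm-ddTHH:MM:SSZ' -> 'yyyy-mm-dd HH:MM:SS'."""
    if len(value) > 11:
        # Date part, a space, then the time part (dropping the 'T' and 'Z')
        return value[0:10] + " " + value[11:19]
    return "bad format"

print(clean_date("2010-06-03T05:39:55Z"))  # -> 2010-06-03 05:39:55
```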

Next we want to turn the date, which is being stored as a string, into a UNIX timestamp (the number of seconds since midnight UTC on January 1, 1970). We need a timestamp as this is the date/time format used by Gource.

To get this we want to add a column based on firstDate. In the Add column based on column firstDate enter the name timestamp and switch the language to Jython (I found this the best for this procedure) and the expression:

import time

return int(time.mktime(time.strptime(value, '%Y-%m-%d %H:%M:%S')))

This takes the cell value and turns it into a Jython time object by matching the date/time pattern used in the firstDate column. As Jython times are stored as UNIX timestamps we can just return the value to the new cell.

Some basic timestamp validation

I obviously didn’t just start up Refine, drop in the expression from above and get to this point. There was a lot of trial and error, testing assumptions like all the dates being in yyyy-mm-ddTHH:MM:SSZ format, and checking the processed data. For example, if we want to check we’ve got valid timestamps for all the rows, from the timestamp column dropdown menu we can select Facet > Customized facet > Facet by blank. To filter the blank rows we have to click on include in the Facet/Filter menu on the left hand side (we can also conveniently see that 3616 rows are blank).

Initial visual inspection of the results shows that the status column contains a lot of records marked deleted. From the status column dropdown we can create an additional Facet > Text Facet. In the Facet/Filter section we can see that there are 3616 occurrences of the text ‘delete’, so we can conclude that the blank timestamps are because of deleted records, which we can live with.

Important tip: As we have filtered the data, any additional column operations will only be applied to the filtered rows, so before moving on remove these facets by clicking the little ‘x’ next to them.

Next let's sort the timestamps to check they are in a reasonable range. Do this by clicking the dropdown on timestamp and using the sort option, sorting the cells as numbers (check both ascending and descending order). You’ll notice some of the dates are in 2004; I’m happy with these as Jorum has been going for some time now.

Google Refine - Numeric Histogram

[By turning on the numeric facet for the timestamp column we also get a little histogram which is handy for filtering rows.]

Before moving on, make sure timestamp is sorted smallest first.

So we now have a timestamp; next let's extract the resource title.

Extracting a resource title

This is relatively straightforward as each record has a single <dc:title>. So from the record column drop down select Edit column > Add column > Add column based on this column. In the dialog box use GREL, name the new column ‘title’ and use an expression along these lines:

value.parseHtml().select("dc|title")[0].htmlText()

[Each record only has one <dc:title> so it’s safe to just return the first title we find]

Reconciling who ‘damn’ made these resources

The headache comes from resource creators filling in information about their submission, including information about who made it. This means there are inconsistencies in how the data is entered, some records using a separate creator entry for the institution name, others including it with the person's name, or omitting this data altogether. For the visualisation I wanted to resolve each resource against an institutional name rather than an individual or department. Here’s how the data was reconciled.

Let's start by extracting all the record creators to let us see what we are dealing with. We can do this by again using Edit column > Add column > Add column based on this column from the ‘record’ column. This time let's call the new column ‘creators’ and use a GREL expression along these lines:

forEach(value.parseHtml().select("dc|creator"),v,v.htmlText()).join(",")

This will forEach over the <dc:creator> values and store them as a comma separated string.

For the early records you’ll notice that it’s a named person and there is little we can do to reconcile the record against an institution. For the later records you’ll see named people and an institutional affiliation. So lets see if we can extract these institutions into their own column.

From the creators column dropdown add a column based on this one calling it inst_id and using the following GREL expression


What this expression does is: if the value contains the word ‘university’ the string is split into an array using the symbols –,.;() or the word ‘for’, and the array value containing ‘university’ is stored; else if the value contains the word ‘centre’ the whole value is stored (the OER Programme has projects from universities and HEA Subject Centres).
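The logic of that expression can be approximated in Python (a sketch only: the delimiter set is taken from the description above, and the sample creator string is invented):

```python
import re

def guess_inst_id(creator):
    """Roughly mirror the GREL: pull out the segment naming a university,
    or keep the whole value if it looks like a Subject Centre."""
    low = creator.lower()
    if "university" in low:
        # Split on the punctuation (and the word 'for') used in the GREL version
        parts = re.split(r"[\u2013,.;()]|\bfor\b", creator)
        for part in parts:
            if "university" in part.lower():
                return part.strip()
    elif "centre" in low:
        return creator.strip()
    return None  # nothing recognisable to reconcile against

print(guess_inst_id("Joe Bloggs, University of Nottingham"))
```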

Some additional refining via faceted filters and edit cells

To let us refine this data further, from the new inst_id column click the dropdown menu and select Facet > Customized facets > Facet by blank. Click on true so that we are just working with the blank inst_ids.

Scrolling through the records we can see some records have a creator that begins with ‘UKOER,Open Educational Repository in Support of Computer Science’. On the creators column, from the dropdown select ‘Text filter’ and use ‘Open Educational Repository in Support of Computer Science’. With this facet in place we can see there are 669 records. As we are confident these files were submitted as part of the Information and Computer Sciences Subject Centre’s work, we can autofill the inst_id column with this data by clicking the dropdown on the inst_id column and selecting Edit cells > Transform. In the expression box enter "Information and Computer Sciences Subject Centre" and click OK.

Google Refine - Cell transformation

Remove the ‘creators’ filter by clicking the small ‘x’ in the top left of the box.

Let’s add a new text filter to the record column (you should know how to do this by now) with the word ‘university’. This should filter 878 rows or so. To make it easier to see what is matching, press Ctrl+F to bring up your browser’s find-on-page and look for ‘university’.

Moving through the data you’ll see things like:

  • 384 rows can have inst_ids filled by using the cell transformation filter(cells["record"].value.parseHtml().select("dc|publisher"),v,contains(v.htmlText().toLowercase(),"university"))[0].htmlText()
  • 89 rows include the term “University of Plymouth” in the dc:description, we can filter and fill these using the subject centre method.
  • 81 rows can have university names pulled from dc:subject using filter(cells["record"].value.parseHtml().select("dc|subject"),v,contains(v.htmlText().toLowercase(),"university"))[0].htmlText()

At this point if we just use the blank inst_id facet we’ve got 10,262 true (i.e. blank inst_ids) and 9,199 false, so a 47% hit rate … not great! But if we add a ‘ukoer’ text filter to the record column this improves to 8,433 inst_ids in 9,955 matching rows, which is an 84% hit rate. Whilst this isn’t perfect it’s probably the best we can do with this data. Next, to turn those institutional id guesses into reusable data.
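For the record, the hit rates quoted are just matched-over-total:

```python
blank, matched = 10262, 9199        # blank vs filled inst_id across all rows
print(int(100 * matched / (blank + matched)))  # 47

ukoer_matched, ukoer_rows = 8433, 9955  # after adding the 'ukoer' text filter
print(int(100 * ukoer_matched / ukoer_rows))   # 84
```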

The real magic – reconciling institutional names against CETIS PROD

So far we’ve tried to extract an institutional origin from various parts of the Jorum data, and there is a lot of variation in how those ids are represented. For example, the inst_id column might have ‘the university of nottingham’, ‘university of nottingham’ or even ‘nottingham university’. To make further analysis of the data easier we want to match these variations against a common identifier – in this example the ‘University of Nottingham’.
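To see why exact string matching isn’t enough, here’s a toy normaliser in Python (my own illustration – reconciliation services do something far more robust):

```python
def normalise(name):
    """Crude canonicalisation: lowercase, drop 'the', and reorder
    'X university' to 'university of X' - enough to show the problem."""
    words = name.lower().replace("the ", "").strip().split()
    if words[-1] == "university":            # e.g. 'nottingham university'
        words = ["university", "of"] + words[:-1]
    return " ".join(words)

variants = ["The University of Nottingham", "university of nottingham",
            "Nottingham University"]
print({normalise(v) for v in variants})  # {'university of nottingham'}
```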

Google Refine has some very powerful reconciliation tools to help us do this. More information on Google Refine Reconciliation is available here.

In the inst_id column select Reconcile > Start reconciling.

Google Refine - Reconciliation

Google Refine has existing Freebase databases, which we could use to match institutional names against database ids, but as we are dealing with JISC/HEA projects it makes more sense to try and reconcile the data against the CETIS PROD database (this opens up further analysis down the line).

Kasabi - Reconciliation url

Fortunately PROD data is mirrored to Kasabi, which includes a Reconciliation API for use with Google Refine. To use this data you need to register with Kasabi and then subscribe to the PROD data by visiting this page and clicking ‘Subscribe’. Once subscribed, revisit the previous link, click on the link to the ‘experimental API explorer’ and copy the url in the blue box, including your apikey, e.g. http://api.kasabi.com/dataset/jisc-cetis-project-directory/apis/reconciliation/search?apikey=aaaaaaaaaaaaakkkkkkkkkkeeeeeyyy

Back in the Google Refine Reconciliation dialog box click on ‘Add Standard Service …’ and enter the url you just created. Once added, click on the new Reconciliation API, select ‘Reconcile against no particular type’, then ‘Start Reconciling’.

Google Refine - Using Kasabi data

Google Refine - edit cell

Once complete you should hopefully see from the inst_id judgment facet that the majority of names (all but 131) have been matched to PROD data. Filtering on ‘none’ you can do mass edits on unmatched inst_ids by clicking ‘edit’ and ‘Apply to All Identical Cells’. Once you’ve done this you can re-run Reconcile > Start reconciling to get additional matches.

Exporting to Gource using a custom template

Almost there people ;). At the very beginning I mentioned that the visualisation tool Gource has its own input log format, shown below as a refresher (pipe-delimited: a unix timestamp, a username, an entry type of A, M or D for added/modified/deleted, and a file path):

1275543595|andrew|A|src/main.cpp
Another useful feature of Google Refine is Export Templating, which allows us to control how our data can be written out to a separate file.

In Google Refine make sure you have a text filter on the record column filtering for ‘ukoer’ and the inst_id: judgment facet is on ‘matched’ (this means when we export, only this data is included). Now select Export > Templating …. Remove any text in Prefix, Row Separator and Suffix and in Row Template use:

{{cells["timestamp"].value}}|{{cells["inst_id"].recon.match.name}}|A|{{replace(cells["subject"].value," / ","/")}}/{{if(length(cells["title"].value)>20,cells["title"].value.slice(0,20)+"...",cells["title"].value)}}

This will write the timestamp cell value, then the reconciled name for the inst_id, then the subject value (stripping the whitespace around slashes) and finally the resource title truncated to 20 characters.
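The same row template logic, sketched in Python (helper name is mine):

```python
def gource_row(timestamp, inst, subject, title):
    """Build one Gource custom-log line: timestamp|user|A|path."""
    path = subject.replace(" / ", "/")               # tidy the subject tree
    short = title[:20] + "..." if len(title) > 20 else title
    return f"{timestamp}|{inst}|A|{path}/{short}"

print(gource_row(1275543595, "Staffordshire University",
                 "HE / Creative Arts and Design",
                 "A very long resource title indeed"))
# 1275543595|Staffordshire University|A|HE/Creative Arts and Design/A very long resource...
```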

Finally, Gource

Google Refine will spit out a .txt file with the formatted data. Before we use it with Gource there is one thing we need to do. Initially I was getting log file format errors in Gource and then discovered it was a .txt file encoding problem. So open your newly created .txt file (which is in UTF-8 format) and File > Save As, changing the encoding to ANSI.

Save As ANSI encoding
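If you’d rather script the conversion than use the manual Save As step, the re-encode is a few lines of Python (assuming ‘ANSI’ means Windows cp1252, which it usually does on an English locale):

```python
import tempfile, os

def reencode(src, dst):
    """Rewrite a UTF-8 log file as cp1252 ('ANSI'), dropping any
    characters that have no cp1252 equivalent."""
    with open(src, encoding="utf-8") as f:
        text = f.read()
    with open(dst, "w", encoding="cp1252", errors="ignore") as f:
        f.write(text)

# quick round-trip check with a sample log line
tmp = tempfile.mkdtemp()
src, dst = os.path.join(tmp, "utf8.txt"), os.path.join(tmp, "ansi.txt")
with open(src, "w", encoding="utf-8") as f:
    f.write("1275543595|Université de X|A|HE/Art\n")
reencode(src, dst)
result = open(dst, encoding="cp1252").read()
print(result)
```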

To test your visualisation download gource and extract the files. In the same directory as your extracted files place a copy of your refined log file. To view what you’ve got open your command line, navigate to your extracted gource location and execute:

gource nameoflogfile.txt

The gource site has more instructions on recording videos.


Well, almost … here’s:

The bigger picture

This work was undertaken as part of my OER Visualisation work (day 11) and while it’s useful to have the Jorum OER snowflake visualisation in the bag, having a refined data source opens up more opportunities to explore and present OER activity in other ways. For example, I immediately have a decent sized dataset of OER records with subject classification. I’ve also matched records against PROD data, which means I can further reconcile against project names, locations etc.

Yummy data!


By my calculation it’s day 16 of the OER Visualisation Project. Since day 11 and the Jorum UKOER ‘snowflake’ visualisation I’ve been going back to the refined data and trying to validate the set and understand the information it contains better.

One of the things I did was upload the data from Google Refine to Google Spreadsheet (exported from Refine as .xls and uploaded it to Docs). Here is a copy of the spreadsheet. Using the UNIQUE and COUNTIF formulas it’s very easy to build a summary of the top Jorum UKOER contributors and subject categorisation.
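The UNIQUE/COUNTIF combination is doing nothing more than a frequency count – in Python terms (toy data; the real lists are in the spreadsheet):

```python
from collections import Counter

# toy stand-in for the spreadsheet's depositor column
depositors = (["Staffordshire University"] * 3
              + ["University of Cambridge"] * 2
              + ["Newcastle University"])
for inst, n in Counter(depositors).most_common(2):
    print(inst, n)
# Staffordshire University 3
# University of Cambridge 2
```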

In the original OER funding call paragraph 19 states: “depositing resources funded through this call into JorumOpen will be mandatory”, so in theory all 51 Phase 1 and 2 OER projects should have records in Jorum. We can use this assumption to validate the refined dataset.

Using day 8’s CETIS PROD to Google Spreadsheet work it’s easy for me to create a list of Phase 1 and 2 lead institutions (41 in total, as some institutions from phase one were re-funded). Using this list I was able to query the spreadsheet data and produce the table embedded below, which counts Jorum ‘ukoer’ records for each of the institutions:

You can see a number of institutions have zero record counts. These are mainly the HEA Subject Centre projects, which were not detected using the original extraction and reconciliation method, but, as also noted, a number of these records are reconciled against other university names. Using this data the original extracted dataset was further refined and an additional 705 ukoer records were reconciled against institution names. A revised issue and summary of ukoer records is available here.

Data Driven Journalism

Most people are probably unfamiliar with the term ‘data driven journalism’ but will have seen some of the products of the process, like The Guardian’s Interactive guide to every IED attack (Wikipedia has a useful definition and overview of data-driven journalism).

It’s been useful for me to consider the OER visualisation project like a data journalism assignment, using Paul Bradshaw’s The inverted pyramid of data journalism as a basic process to approach the Jorum data. For example, remembering the ‘context’ in which the Jorum data was collected (a mandatory task, which in cases wasn’t always fully automated) is a reminder that even after multiple refinements of the data it’s still not 100% complete and in parts may be unrepresentative.

Looking at a table of top Jorum UKOER contributors, for example, Staffordshire University accounts for almost 50% of the deposits, almost all going in the HE – Creative Arts and Design subject area, while University College Falmouth have one Jorum entry for their entire ukoer work.

Top UKOER Depositors | Records
Staffordshire University | 4113
University of Cambridge | 855
Newcastle University | 847
Subject Centre for Information and Computer Sciences | 669
Leeds Metropolitan University | 383

Top subjects | Records
HE - Creative Arts and Design | 4068
HE - Engineering | 1227
HE - Veterinary Sciences, Agriculture and related subjects | 874
HE - Mathematical and Computer Sciences | 767
HE - Physical Sciences | 454

Using data journalism processes should also be helpful when considering how the data is communicated, using techniques like interactive slideshows and providing brief narration to the data.

With this in mind it was useful to revisit Geoff McGhee’s Journalism in the Age of Data (embedded below)

A lot more to do in 2012


*viewed doesn't necessarily mean viewed by a human. There are a multitude of bots and automated code that might visit the Jorum site, which could increase a record's view count. For example, the process of me scraping each record page to get this data generated almost 10,000 'views'.

Most of the numbers above come from two spreadsheets: CETIS PROD Spreadsheet; and jorumUKOERReconciled - Issue 2. I’ve mentioned both of these spreadsheets before (day 8 | day 16), but you might like to File > Make a copy of these to play with the data yourself and see some of the formulas used. An additional note on the resource view counts: these were collected by fetching each resource page on Jorum using Google Refine and scraping the data (more details on using Refine to extract data from day 11).

[The additional processes were to extract a Jorum url by using the expression filter(value.split("|"),v,contains(v,"jorum.ac.uk"))[0] on the identifier column, fetching a new column based on the new url and then extracting a count using toNumber(replace(filter(cells["jorumpage"].value.parseHtml().select("p.ds-paragraph"),v,contains(v.htmlText(),"viewed"))[0].htmlText(),/\D/, ''))]
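For anyone who finds the GREL hard to parse, here’s the same pair of steps sketched in Python (the ds-paragraph class name is copied from the expression above, so treat it as a snapshot of Jorum’s markup at the time):

```python
import re

def jorum_url(identifiers):
    """First identifier that points at jorum.ac.uk (the GREL filter(...)[0])."""
    return next(u for u in identifiers if "jorum.ac.uk" in u)

def view_count(page_html):
    """Find the 'viewed' paragraph and keep only its digits."""
    for para in re.findall(r'<p class="ds-paragraph">(.*?)</p>', page_html, re.S):
        if "viewed" in para:
            return int(re.sub(r"\D", "", para))

ids = "oai:jorum|http://resources.jorum.ac.uk/xmlui/handle/123".split("|")
print(jorum_url(ids))  # http://resources.jorum.ac.uk/xmlui/handle/123
print(view_count('<p class="ds-paragraph">This item has been viewed 42 times.</p>'))  # 42
```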

So I now have a decent amount of data (some of which might be dropped), next to communicate …


Dear Diary, it is now day 20 of the OER Visualisation Project … One of the suggested outputs of this project was “collections mapped by geographical location of the host institution” and over the last couple of days I’ve experimented with different map outputs using different techniques. It’s not the first time I’ve looked at maps; as early as day 2 I used SPARQL to generate a project locations map. At the time I got some feedback questioning the usefulness of this type of data, but I was still interested in pursuing the idea as a way to provide an interface for users to navigate some of the OER Phase 1 & 2 information. This obsession shaped the way I approached refining the data, trying to present project/institution/location relationships, which in retrospect was a red herring. Fortunately the refined data I produced has helped generate a map which might be interesting (I thought there would be more from London), but I thought it would also be useful to document some of what I’m sure will end up on the cutting room floor.

Filling in the holes

One of the things the day 2 experiment showed was it was difficult to use existing data sources (PROD and location data from the JISC Monitoring Unit) to resolve all host institution names. The main issue was HEA Subject Centres and partnered professional organisations. I’m sure there are other linked data sources I could have tapped into (maybe inst. > postcode > geo), but opted for the quick and dirty route by:

  1. Creating a sheet in the PROD Linked Spreadsheet of all projects and partners currently filtered for Phase 1 and 2 projects. I did try to also pull location data using this query but it was missing data, so instead I created a separate location lookup sheet using the queries here. As this produced 130 institutions without geo-data (Column M) I cheated and created a list of unmatched OER institutions (Column Q) [File > Make a copy of the spreadsheet to see the formulas used, which include SQL-type QUERYs].
  2. Resolving geo data for the 57 unresolved Phase 1 & 2 projects was a 3 stage process:
    1. Use the Google Maps hack recently rediscovered by Tony Hirst to get co-ordinates from a search. You can see the remnants of this here in cell U56 (Google Spreadsheets only allow 50 importDatas per spreadsheet so it is necessary to Copy > Paste Special > As values only).
    2. For unmatched locations ‘Google’ Subject Centres to find their host institution and insert the name in the appropriate row in Column W – existing project locations are then used to get coordinates.  
    3. For other institutions ‘google’ them in Google Maps (if that didn’t return anything conclusive then a web search for a postcode was used). To get the co-ordinate pasted in Column W I centred their location on Google Maps then used the modified bookmarklet javascript:void(prompt('',gApplication.getMap().getCenter().toString().replace(',','').replace(')','').replace('(',''))); to get the data.
  3. The co-ordinates in Column S and T are generated using a conditional lookup of existing project leads (Column B) OR IF NOT partners (Column F) OR IF NOT entered/searched co-ordinates.
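Step 3’s conditional lookup is just a fallback chain, something like (dict-based sketch, my own structure):

```python
def coords(inst, leads, partners, manual):
    """Return the first co-ordinate pair found: project lead, else
    partner, else the manually searched value, else None."""
    return leads.get(inst) or partners.get(inst) or manual.get(inst)

leads = {"University of Oxford": (51.75, -1.25)}
manual = {"Subject Centre X": (55.95, -3.19)}  # hand-entered from Google Maps
print(coords("Subject Centre X", leads, {}, manual))  # (55.95, -3.19)
```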

Satisfied that I had enough lookup data I created a sheet of OER Phase 1 & 2 project leads/partners (filtering Project_and_Partners and pasting the values in a new sheet). Locations are then resolved by looking up data from the InstLocationLookup sheet.

Map 1 – Using NodeXL to plot projects and partners

Exporting the OER Edge List as a csv allows it to be imported to the Social Network Analysis add-on for Excel (NodeXL). Using the geo-coordinates as layout coordinates gives:

OER Phase 1 & 2 Edge

The international partners mess with the scale. Here’s the data displayed in my online NodeXL viewer. I’m not sure much can be taken from this.

Map 2 – Generating a KML file using a Google Spreadsheet template for Google Maps

KML is an XML based format for geodata, originally designed for Google Earth but now used in Google Maps and other tools. Without a templating tool like Yahoo Pipes, which was used in day 2, generating KML can be very laborious. Fortunately the clever folks at Google have come up with a Google Spreadsheet template – Spreadsheet Mapper 2.0. The great thing about this template is you can download the generated KML file or host it in the cloud as part of the Google Spreadsheet.
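Under the hood the template is just filling in KML placemarks – a minimal hand-rolled version for one marker looks like this (coordinates are illustrative):

```python
def kml_placemark(name, lat, lng):
    # KML wants lon,lat order inside <coordinates>
    return (f"<Placemark><name>{name}</name>"
            f"<Point><coordinates>{lng},{lat}</coordinates></Point></Placemark>")

doc = ('<?xml version="1.0" encoding="UTF-8"?>'
       '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
       + kml_placemark("University of Nottingham", 52.94, -1.2)
       + "</Document></kml>")
print(doc)
```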

The instructions for using the spreadsheet are very clear so I won’t go into details; you might however want to make a copy of the KML Spreadsheet for OER Phase 1 & 2 to see how data is being pulled from the PROD Spreadsheet. The results can be viewed in Google Maps (shown below), or viewed in Google Earth.

OER Phase 1 & 2 in Google Maps

Map 3 – Customising the rendering of KML data using Google Maps API

Map 3 Example

Whilst digging around the Google Maps API for inspiration I came across this KML with features example (in the KML and GeoRSS Layers section). Out of interest I thought I’d use the KML link from Map 2 as the source, which gives this OER Phase 1 & 2 map. [If you are wondering about the map styling, I recently came across the Google Maps API Styled Map Wizard which lets you customise the appearance of Google Maps, creating a snippet of code you can use in Google Maps API Styling.]

Map 4 – Rendering co-ordinate data from a Google Spreadsheet

Another example I came across was using a Google Spreadsheet source, which gives this other version of OER Phase 1 & 2 Lead Institutions rendered from this sheet.

I haven’t even begun on Google Map Gadgets, so it looks like there are 101 ways to display geo data from a Google Spreadsheet. Although all of this data bashing was rewarding I didn’t feel I was getting any closer to something useful. At this point in a moment of clarity I realised I was chasing the wrong idea, that I’d made that schoolboy error of not reading the question properly.

Map 5 – Jorum UKOER records rendered as a heatmap in Google Fusion Tables

Having already extracted ‘ukoer’ records from Jorum and reconciled them against institution names in day 11, it didn’t take much to geo-encode the 9,000 records, resolving them to an institutional location (I basically imported a location lookup from the PROD Spreadsheet, did a VLOOKUP, then copy/pasted the values. The result is in this sheet).

For a quick plot of the data I thought I’d upload to Google Fusion Tables and render as a heatmap, but all I got was a tiny green dot over Stoke-on-Trent for the 4,000 records from Staffordshire University. Far from satisfying.

Map 6 - Jorum UKOER records rendered in Google Maps API with Marker Clusters

The final roll of the dice … for now anyway. MarkerClusterer is an open source library for the Google Maps API which groups large numbers of closely located markers for speed and usability gains. I’d never used this library before but the Speed Test Example looked easy to modify. This has resulted in the example linked at the very beginning of this post mapping Jorum ukoer records.
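The grouping idea behind MarkerClusterer can be sketched as a coarse grid bucket – markers whose co-ordinates fall in the same cell get drawn as one numbered cluster (a toy version of what the library does client-side):

```python
from collections import defaultdict

def cluster(points, cell=1.0):
    """Bucket (lat, lng) points into a coarse grid; each bucket would
    be drawn as one numbered cluster marker."""
    grid = defaultdict(list)
    for lat, lng in points:
        grid[(round(lat / cell), round(lng / cell))].append((lat, lng))
    return grid

points = [(52.94, -1.20), (52.95, -1.19),   # two Nottingham-ish markers
          (55.95, -3.19)]                   # one Edinburgh-ish marker
sizes = sorted(len(v) for v in cluster(points).values())
print(sizes)  # [1, 2] - the two nearby markers merge into one cluster
```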

This is still a prototype version with lots of tweaking/optimisation required, and the data file, which is a csv to json dump, has a lot of extra information that’s not required (hence the slow load speed), but it is probably the beginnings of the best solution for visualising this aspect of the OER programme.

So there you go. Two sets of data, 6 ways to turn it into a map and some hopefully useful methods for mashing data in between.

I don’t have a ending to this post, so this is it.


This might be slightly off-topic for the OER Visualisation project, but I followed an idea, did – what I think are – some interesting things with an archive of tweets and thought I would share. This line of thought was triggered by a tweet from Terry McAndrew in which he asked:

@mhawksey Have you a visualisation planned of JORUM 'customers' of OER (and the rest of it for that matter).

Tracking who views or uses OER material can be tricky but not impossible. The main issue comes when someone else like me comes along and wants to see who else has viewed, used or remixed the resource. For example, with the Jorum ukoer resources the only usage data I could get was each resource’s page views, and even getting this required scraping over 8,000 pages. This is mentioned in day 18, but I realise I haven’t gone into any detail about how Google Refine was used to get this data – if someone wants me to I will reveal all.

A recent development in this area is the US Learning Registry project which is looking to maximise the use of activity data to support resource discovery and reuse. On Lorna Campbell’s CETIS blog there is a very useful introduction to the Learning Registry announcing JISC’s involvement in a UK node. The post includes the following use case which helps illustrate what the project is about:

“Let’s assume you found several animations on orbital mechanics. Can you tell which of these are right for your students (without having to preview each)? Is there any information about who else has used them and how effective they were? How can you provide your feedback about the resources you used, both to other teachers and to the organizations that published or curated them? Is there any way to aggregate this feedback to improve discoverability?

The Learning Registry is defining and building an infrastructure to help answer these questions. It provides a means for anyone to ‘publish’ information about learning resources. Beyond metadata and descriptions, this information includes usage data, feedback, rankings, likes, etc.; we call this ‘paradata’”

Lorna’s post was made in November 2011 and since then Mimas have started The JLeRN Experiment (and if you are a developer you might want to attend the CETIS contributors event on the 23rd Jan).

Plan A – Inside-out: Repository resource sharing data

All of this is a bit late for me, but I thought I’d see what ‘paradata’ I could build around ukoer and see if there were any interesting stories to be told as part of the visualisation project. This is an area I’ve visited before with a series of ‘And the most socially engaging is …’ posts, which started with And the most engaging JISC Project is… For this I used Google Spreadsheet to get social share counts (Facebook, Twitter and more) for a list of urls. One of the issues with this technique is that Google Spreadsheets timeout after 5 minutes, so there is a limit to the number of links you can get through – this is however not a problem for Google Refine.

Taking 380 of the most viewed Jorum ukoer tagged resources (approximately 5% of the data) I used Google Refine to:

  1. ‘Add column by fetching URL’, passing the resource link url into Yahel Carmon’s Shared Count API – using the expression "http://api.sharedcount.com/?url="+escape(value,"url")
  2. Add column based on the results column parsing each of the counts – e.g. using expressions similar to parseJson(value)['Twitter']
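Outside Refine those two steps look like this in Python (the SharedCount endpoint as it was at the time – treat the url as historical):

```python
import json
from urllib.parse import quote

def sharedcount_url(resource_url):
    """Step 1: build the fetch url passed to 'Add column by fetching URL'."""
    return "http://api.sharedcount.com/?url=" + quote(resource_url, safe="")

def twitter_count(api_response):
    """Step 2: parse one count out of the returned JSON column."""
    return json.loads(api_response)["Twitter"]

print(sharedcount_url("http://resources.jorum.ac.uk/xmlui/handle/123"))
print(twitter_count('{"Twitter": 7, "Facebook": {"like_count": 0}}'))  # 7
```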

At this point I stopped parsing columns because it was clear that there was very little, if any, social sharing of Jorum ukoer resources (here is a spreadsheet of the data collected). In fact the most tweeted resource, which Twitter records as having 7 tweets, got most of these following my tweet about it being the most viewed resource.

So what might be going on here? Is this just an issue for national repositories? Are people consuming ukoer resources from other repositories? Are Jorum resources not ranking well in search engines? Are resources not being marketed enough?

I don’t have answers to any of those questions, and maybe this is just an isolated case, but the ‘marketing’ aspect interests me. When I publish a blog post I’ll push it into a number of communication streams including RSS, a couple of tweets and the occasional Google+. For posts I think are really worthwhile I’ll set up a Twitter search column on Tweetdeck with related keywords and proactively push posts to people I think might be interested (I picked up this idea from Tony Hirst’s Invisible Frictionless Tech Support). Are we doing something similar with our repositories?

Plan B – Outside-in: Community resource sharing data

There’s probably a lot more to be said about repositories or agencies promoting resources. To try and find some more answers, instead of looking from the repository perspective of what is being shared, I thought it’d be useful to look at what people are sharing. There are a number of ways you could do this, like selecting and monitoring a group of staff. I don’t have time for that, so decided to use data from an existing community who are likely to be using or sharing resources, aka the #ukoer Twitter hashtag community. [There are obvious issues with this approach, but I think it’s a useful starter for ten.]

Having grabbed a copy of the #ukoer Twapper Keeper archive using Google Refine before it disappeared, I’ve got over 8,000 tweets from 8th March 2010 to 3rd January 2012. My plan was to extract all the links mentioned in these tweets and identify any patterns or particularly popular links.

Extracting links and expand shortened urls

As most tweets now have links replaced with t.co shortened urls, and given the general use of url shortening, the first step was to extract and expand all the links mentioned in tweets. Link expansion was achieved using the longurl.org API, which has the added bonus of returning meta description and keywords for target pages. Here’s a summary of the Google Refine actions I did (taken from the undo/redo history):

  1. Create new column links based on column item - title by filling 8197 rows with grel:filter(split(value, " "),v,startsWith(v,"http")).join("||")
  2. Split multi-valued cells in column links
  3. Create column long_url at index 4 by fetching URLs based on column links using expression grel:"http://api.longurl.org/v2/expand?url="+escape(value,"url")+"&format=json&user-agent=Google%20Refine&title=1&meta-keywords=1"
  4. Create new column url based on column long_url by filling 5463 rows with grel:parseJson(value)['long-url']
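Steps 1 and 4 translate fairly directly to Python (same split/filter/join logic as the GREL):

```python
import json

def tweet_links(tweet):
    """Keep the whitespace-separated tokens that start with http,
    joined with '||' ready for 'Split multi-valued cells'."""
    return "||".join(t for t in tweet.split(" ") if t.startswith("http"))

def long_url(longurl_json):
    """Pull the expanded url out of a longurl.org API response."""
    return json.loads(longurl_json)["long-url"]

print(tweet_links("New #ukoer post http://t.co/abc and http://t.co/def"))
# http://t.co/abc||http://t.co/def
print(long_url('{"long-url": "http://example.ac.uk/oer/1"}'))
# http://example.ac.uk/oer/1
```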

Searchable table of #ukoer links

With long urls extracted, the data was exported and uploaded to Google Spreadsheet so that a list of unique urls and their frequencies could be calculated. Here is the Spreadsheet of data. From the refined data there are 2,482 different links which appear 6,220 times in the #ukoer archive. Here is the searchable table of extracted links with frequencies (sorry for the small font – can’t seem to find a way to control column width using the Google Visualisation API … anyone?).

Not surprisingly a number of domain root urls appear at the top of the list. More work needs to be done to match resource sharing to different OER sources, but you can start doing simple filtering to find out what’s there. Something from the initial analysis I find interesting is that the most common link in the archive is to Oxford’s Open Advent Calendar, which was a daily posting highlighting some of their OER resources. This could be interpreted as underlining the need for OER resources to be more effectively marketed. I’ll let you decide.

PS out of interest I put the list of 2,500-odd links back into Google Refine and extracted social share counts. I haven’t had a chance to look at the data closely, but if you want to play, a copy of it and a meta-data enhanced version of the #ukoer archive is here. Please share any findings ;)