Jorum contains 9,502 unique records tagged ‘ukoer’, of which 8,859 can be reconciled against institution names (counting duplicates for records with more than one subject category, there are 9,069)
*viewed doesn’t necessarily mean viewed by a human. There are a multitude of bots and automated scripts that might visit the Jorum site, which could inflate a record’s view count. For example, the process of scraping each record page to get this data generated almost 10,000 ‘views’.
Most of the numbers above come from two spreadsheets: CETIS PROD Spreadsheet; and jorumUKOERReconciled – Issue 2. I’ve mentioned both of these spreadsheets before (day 8 | day 16), but you might like to File > Make a copy of these to play with the data yourself and see some of the formulas used. An additional note on the resource view counts: these were collected by fetching each resource page on Jorum using Google Refine and scraping the data (more details on using Refine to extract data in day 11).
[The additional processes were to extract a Jorum url by using the expression filter(value.split("|"),v,contains(v,"jorum.ac.uk")) on the identifier column, fetching a new column based on the new url and then extracting a count using toNumber(replace(filter(cells["jorumpage"].value.parseHtml().select("p.ds-paragraph"),v,contains(v.htmlText(),"viewed"))[0].htmlText(),/\D/, ''))]
So I now have a decent amount of data (some of which might be dropped), next to communicate …
One of the things I did was upload the data from Google Refine to Google Spreadsheet (exported Refine as .xls and uploaded it to Docs). Here is a copy of the spreadsheet. Using the UNIQUE and COUNTIF formulas it’s very easy to build a summary of the top Jorum UKOER contributors and subject categorisation.
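As a rough sketch of those formulas (the sheet name and column references are assumptions – point them at wherever the reconciled institution names sit in your copy):

=UNIQUE(Data!C2:C) – returns the list of distinct contributor names
=COUNTIF(Data!C:C, A2) – counts the ukoer records for the contributor named in A2, and can be filled down the summary column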
In the original OER funding call paragraph 19 states: “depositing resources funded through this call into JorumOpen will be mandatory”, so in theory all 51 Phase 1 and 2 OER projects should have records in Jorum. We can use this assumption to validate the refined dataset.
You can see a number of institutions have zero record counts. These are mainly for the HEA Subject Centre projects which were not detected using the original extraction and reconciliation method, but as also noted, a number of these records are reconciled against other university names. Using this data the original extracted dataset was further refined and an additional 705 ukoer records were reconciled against institution names. A revised issue and summary of ukoer records is available here.
It’s been useful for me to consider the OER visualisation project like a data journalism assignment, using Paul Bradshaw’s The inverted pyramid of data journalism as a basic process for approaching the Jorum data. For example, remembering the ‘context’ in which the Jorum data was collected (a mandatory task, which in some cases wasn’t fully automated) is a reminder that even after multiple refinements of the data it’s still not 100% complete and in parts may be unrepresentative.
I should start with the result so that you can see if it’s worth doing this:
The video shows the deposits from institutions and Subject Centres to Jorum tagged ‘ukoer’ from January 2009 to November 2011. In total over 8,000 deposits condensed into 5 minutes (there are more records, but these were the ones that could be reconciled against an institution name).
Here’s the recipe I used to do it, which should be easy to modify for your own and other repositories. As the explanation takes longer than actually doing it, I’m going to assume you understand some basic tools and techniques, but you can always leave a comment if something isn’t clear.
Let’s start by looking at what it is we are trying to achieve. The animation is generated using code from the open source Gource project. Gource uses an input log file to visualise software commits using the format shown below. So for the Jorum visualisation we need to generate a file with a timestamp, creator (in this case the submitter’s host institution) and title (prefixed by subject classification).
Gource log format
The user andrew adding the file src/main.cpp on Thu, 03 Jun 2010 05:39:55 GMT (1275543595):
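1275543595|andrew|A|src/main.cpp

Each line is pipe-delimited: a UNIX timestamp, the user, the type of change (A for added, M for modified, D for deleted) and the file path. For our Jorum log the ‘user’ will be the institution and the ‘file’ the subject classification plus the resource title.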
Getting the data – building a source
To build the log file we need details of the records from Jorum. Fortunately Jorum implements the OAI Protocol for Metadata Harvesting, which is designed to allow the easy sharing and access of repository data. I say easy, but in reality it’s easy if you have another repository on a server somewhere that can consume OAI data; it’s not easy to find a desktop based solution. After a lot of trial and error I’ve arrived at a solution using a combination of MS Excel and Google Refine (BTW “Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another” – it’s also open source).
I had hoped to do all of this in Google Refine but was struggling with the initial data import, recognising the schema and including records with multiple subject classifications, so we briefly start with MS Excel.
Once the web query (Data > From Web) has finished loading (which can take a while) click Import. You’ll now get some dialog boxes warning you about the xml import but you can ignore those. You should now have a sheet of List Identifiers, that is a list of all the record identifiers (ns1:identifier4), the subject set they are attached to (ns1:setSpec6) and a status column (status3) – you’ll find that there are more columns, mainly blank, which we don’t need.
Next we want to add some readable subject classification to the data by changing setSpec ids into text equivalents. This data is also available via Jorum’s OAI service and the raw data can be seen by looking at http://dspace.jorum.ac.uk/oai/request?verb=ListSets.
To get this data into Excel we want to follow a similar process to the above in the same spreadsheet, getting Data – From Web using http://dspace.jorum.ac.uk/oai/request?verb=ListSets as the address. This gives us a sheet similar to the one below with setSpec ids and associated category names.
Next we want to match this data to the sheet of List Identifiers. To do this we first want to sort the data we just captured on the setSpec column. Now in the original sheet add a new column and enter the following formula in the cell immediately beneath the column name (row 2):
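The formula isn’t reproduced here, but an exact-match lookup along these lines will do it (the sheet name ‘ListSets’ and the column references are assumptions – adjust them to wherever the setSpec/setName pairs actually landed):

=VLOOKUP(B2, ListSets!A:B, 2, FALSE)

(With an exact-match lookup like this the sort isn’t strictly necessary, but it does no harm.)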
This formula looks up the setSpec6 value, matches it against the data we just got and returns a setName. You can now save this spreadsheet.
Getting more data using Google Refine
So far we’ve got a list of record ids from Jorum and the subject category for each record. We still need to get when the record was created, who by, and the resource title. To do this we are going to use Google Refine. If you haven’t already, here’s how to install Google Refine. Open Google Refine and create a new project from the Excel file we just created. The default settings should work, just make sure you select the sheet with 19,000 plus rows.
After the project has been created we next want to get more information for each record identifier. From the ns1:identifier4 column drop-down menu select Edit column > Add column by fetching URLs:
In the dialog box that opens use the following settings:
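The exact settings aren’t reproduced here, but essentially: name the new column ‘record’, set a polite throttle delay, and use a URL expression that turns each identifier into an OAI GetRecord request – something along the lines of the following (the URL pattern is an assumption based on Jorum’s OAI base address, so check it returns Dublin Core records before fetching 19,000 of them):

"http://dspace.jorum.ac.uk/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=" + value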
So now we have all the information we need but it’s all in one cell, so we need to do a bit more refining.
Extracting a date
You’ll notice that each record has a couple of dates stored in dc:date. Let’s look at extracting the first date we find. Google Refine has a couple of ways to parse a cell and get data out. Initially I tried using Jython but didn’t get very far; thanks to some help from the Google Refine community I found I could use Refine’s GREL language. Here’s how.
From the new ‘record’ column dropdown select Edit column > Add column > Add column based on this column. In the dialog that opens set the new column name as firstDate and enter the following GREL expression:
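The expression itself isn’t reproduced here; a reconstruction based on the description that follows (treat it as a sketch and test it on a few rows first) would be:

sort(forEach(value.parseHtml().select("dc|date"), v, if(length(v.htmlText()) > 11, slice(v.htmlText(), 0, 10) + " " + slice(v.htmlText(), 11, 19), v.htmlText())))[0]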
What is happening here is that, within the cell, forEach <dc:date> if the value is longer than 11 characters we slice the text to get the first 10 characters (yyyy-mm-dd), add a space, then slice characters 11 to 19 (hh:mm:ss). As the dc:dates are temporarily stored in an array we sort it and take the first value, which should be the earliest.
Next we want to turn the date, which is being stored as a string, into a UNIX timestamp (the number of seconds or milliseconds since midnight on January 1, 1970). We need a timestamp as this is the date/time format used by Gource.
To get this we want to add a column based on firstDate. In the Add column based on column firstDate dialog enter the name timestamp, switch the language to Jython (I found this the best for this procedure) and use the expression:
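Again the expression isn’t shown; a Jython snippet matching the description below (assuming firstDate values are in ‘yyyy-mm-dd hh:mm:ss’ form) would be:

import time
return time.mktime(time.strptime(value, "%Y-%m-%d %H:%M:%S"))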
This takes the cell value and turns it into a Jython time object by matching the date/time pattern used in the firstDate column. As Jython times are stored as UNIX timestamps we can just return the value to the new cell.
Some basic timestamp validation
I obviously didn’t just start up Refine, drop the expressions from above in and get to this point. There was a lot of trial and error, testing assumptions like all the dates being in yyyy-mm-ddTHH:MM:SSZ format, and checking the processed data. For example, if we want to check we’ve got valid timestamps for all the rows, from the timestamp column dropdown menu we can select Facet > Customized facet > Facet by blank. To filter the blank rows we have to click on include in the Facet/Filter menu on the left hand side (we can also conveniently see that 3616 rows are blank).
Initial visual inspection of the results shows that the status column contains a lot of records marked deleted. From the status column dropdown we can create an additional Facet > Text facet. In the Facet/Filter section we can see that there are 3616 occurrences of the text ‘deleted’, so we can conclude that the blank timestamps are because of deleted records, which we can live with.
Important tip: as we have filtered the data, any additional column operations will only be applied to the filtered rows, so before moving on remove these facets by clicking on the little ‘x’ next to them.
Next let’s sort the timestamps to check they are in a reasonable range. Do this by clicking the dropdown on timestamp and using the sort option, sorting the cells as numbers (check both ascending and descending order). You’ll notice some of the dates are in 2004; I’m happy with these as Jorum has been going for some time now.
[By turning on the numeric facet for the timestamp column we also get a little histogram which is handy for filtering rows].
Before moving on make sure timestamp is sorted smallest first
So we now have a timestamp; next let’s extract the resource title.
Extracting a resource title
This is relatively straightforward as each record has a <dc:title>. So from the record column dropdown select Edit column > Add column > Add column based on this column. In the dialog box use GREL, name the new column ‘title’ and use the following expression:
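The expression isn’t reproduced, but using the same parseHtml/select pattern as before something like this does the job:

value.parseHtml().select("dc|title")[0].htmlText()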
[Each record only has one <dc:title> so it’s safe to just return the first title we find]
Reconciling who ‘damn’ made these resources
The headache comes from resource creators filling in information about their submission, including information about who made it. This means there are inconsistencies in how the data is entered: some records use a separate creator entry for the institution name, others include it with the person’s name, or omit this data altogether. For the visualisation I wanted to resolve the resource against an institutional name rather than an individual or department. Here’s how the data was reconciled.
Let’s start by extracting all the <dc:creator> values to let us see what we are dealing with. We can do this by again using Edit column > Add column > Add column based on this column from the ‘record’ column. This time let’s call the new column ‘creators’ and use the following GREL expression:
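A reconstruction of the expression, based on the description that follows, would be:

join(forEach(value.parseHtml().select("dc|creator"), v, v.htmlText()), ", ")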
This will forEach <dc:creator>, get the value and store the results as a comma separated string.
For the early records you’ll notice that it’s a named person and there is little we can do to reconcile the record against an institution. For the later records you’ll see named people and an institutional affiliation. So let’s see if we can extract these institutions into their own column.
From the creators column dropdown add a column based on this one, calling it inst_id and using the following GREL expression:
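The original expression isn’t shown; a rough reconstruction of what’s described below (the exact split pattern is an assumption) is:

if(value.toLowercase().contains("university"), filter(value.toLowercase().split(/[–,.;()]|\bfor\b/), v, v.contains("university"))[0], if(value.toLowercase().contains("centre"), value, null))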
What this expression does is: if the value contains the word ‘university’, the string is split into an array using the symbols –,.;() or the word ‘for’ and the array element containing ‘university’ is stored; else if the value contains the word ‘centre’ the whole value is stored (the OER Programme has projects from universities and HEA Subject Centres).
Some additional refining via faceted filters and edit cells
To refine this data further, from the new inst_id column click the dropdown menu and select Facet > Customized facets > Facet by blank. Click on true so that we are just working with the blank inst_ids.
Scrolling through the records we can see some records have a creator that begins with ‘UKOER,Open Educational Repository in Support of Computer Science’. On the creators column, from the dropdown select ‘Text filter’ and use ‘Open Educational Repository in Support of Computer Science’. With this facet in place we can see there are 669 records. As we are confident these files were submitted as part of the Information and Computer Sciences Subject Centre’s work we can autofill the inst_id column with this data by clicking the dropdown on the inst_id column and selecting Edit cells > Transform. In the expression box enter “Information and Computer Sciences Subject Centre” and click OK.
Remove the ‘creators’ filter by clicking the small ‘x’ in the top left of the box.
Let’s add a new text filter to the records column (you should know how to do this by now) with the word ‘university’. This should filter 878 rows or so. To make it easier to see what is matching, press Ctrl+F to bring up your browser’s Find on page and look for ‘university’.
Moving through the data you’ll see things like:
384 rows can have inst_id’s filled by using the cell transformation filter(cells["record"].value.parseHtml().select("dc|publisher"),v,contains(v.htmlText().toLowercase(),"university"))[0].htmlText()
89 rows include the term “University of Plymouth” in the dc:description; we can filter and fill these using the same method as for the subject centre above.
81 rows can have university names pulled from dc:subject using filter(cells["record"].value.parseHtml().select("dc|subject"),v,contains(v.htmlText().toLowercase(),"university"))[0].htmlText()
At this point if we just use the blank inst_id facet we’ve got 10,262 true (i.e. blank inst_id’s) and 9,199 false, so a 47% hit rate … not great! But if we add a ‘ukoer’ text filter to the records column this improves to 8,433 inst_id’s in 9,955 matching rows, which is an 84% hit rate. Whilst this isn’t perfect it’s probably the best we can do with this data. Next, to turn those institutional id guesses into reusable data.
The real magic – reconciling institutional names against CETIS PROD
So far we’ve tried to extract an institutional origin from various parts of the Jorum data and there is a lot of variation in how those ids are represented. For example, the inst_id column might have ‘the university of nottingham’, ‘university of nottingham’ or even ‘nottingham university’. To make further analysis of the data easier we want to match these variations against a common identifier, in this example the ‘University of Nottingham’.
In the inst_id column select Reconcile > Start reconciling.
Google Refine has existing Freebase databases, which we could use to match institutional names against database ids, but as we are dealing with JISC/HEA projects it makes more sense to try and reconcile the data against the CETIS PROD database (this opens up further analysis down the line).
Fortunately PROD data is mirrored to Kasabi, which includes a Reconciliation API for use with Google Refine. To use this data you need to register with Kasabi and then subscribe to the PROD data by visiting this page and clicking ‘Subscribe’. Once subscribed, if you revisit the previous link and then click on the link to the ‘experimental API explorer’, you can copy the url in the blue box including your apikey e.g. http://api.kasabi.com/dataset/jisc-cetis-project-directory/apis/reconciliation/search?apikey=aaaaaaaaaaaaakkkkkkkkkkeeeeeyyy
Back in the Google Refine Reconciliation dialog box click on ‘Add Standard Service …’ and enter the url you just created. Once added, click on the new Reconciliation API, select ‘Reconcile against no particular type’, then Start Reconciling.
Once complete you should hopefully see from the inst_id judgment facet that the majority of names (all but 131) have been matched to PROD data. Filtering on ‘none’ you can do mass edits on unmatched inst_ids by clicking ‘edit’ and choosing ‘Apply to All Identical Cells’. Once you’ve done this you can re-run Reconcile > Start reconciling to get additional matches.
Exporting to Gource using a custom template
Almost there people ;). At the very beginning I mentioned that the visualisation tool Gource has its own input log format, shown below as a refresher:
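timestamp|user|type(A/M/D)|file – e.g. 1275543595|andrew|A|src/main.cpp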
Another useful feature of Google Refine is Export Templating, which allows us to control how our data can be written out to a separate file.
In Google Refine make sure you have a text facet on the record column filtering for ‘ukoer’ and that the inst_id judgment facet is on ‘matched’ (this means when we export it will only include this data). Now select Export > Templating …. Remove any text in Prefix, Row Separator and Suffix and in Row Template use:
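The template isn’t reproduced here, so the following is only a sketch of what it might have looked like – ‘setName’ is an assumed name for the readable subject column, and the other cell names need to match your own columns (if the exported rows end up run together, put a newline back in the Row Separator box):

{{cells["timestamp"].value}}|{{cells["inst_id"].recon.match.name}}|A|{{replace(cells["setName"].value, " / ", "/")}}/{{slice(cells["title"].value, 0, 20)}}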
This will write the timestamp cell value, then the reconciled name for the inst_id, then the subject value (stripping whitespace between slashes) and the resource title stripped down to 20 characters.
Google Refine will spit out a .txt file with the formatted data. Before we use it with Gource there is one thing we need to do. Initially I was getting log file format errors in Gource and then discovered it was a .txt file encoding problem. So open your newly created .txt file (which is in UTF-8 format) and File > Save As changing the encoding to ANSI.
To test your visualisation, download Gource and extract the files. In the same directory as your extracted files place a copy of your refined log file. To view what you’ve got, open your command line, navigate to your extracted Gource location and execute:
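A minimal invocation would be something like the following (the log file name is whatever you saved the export as):

gource --log-format custom jorum.txt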
This work was undertaken as part of my OER Visualisation work (day 11) and while it’s useful to have the Jorum OER snowflake visualisation in the bag, having a refined data source opens up more opportunities to explore and present OER activity in other ways. For example, I immediately have a decent sized dataset of OER records with subject classification. I’ve also matched records against PROD data which means I can further reconcile against project names, locations etc.
One of my ambitions from Day 1 of the OER Visualisation project was to start linking PROD data in to Google Spreadsheets. Whilst this was primarily designed to help me with the project after speaking to some of the JISC/JISC CETIS people it sounds like it would help them and others.
I used to find pivot table creation quite daunting but they are a great way to filter and analyse large sets of data. The PROD spreadsheet contains two example pivot reports, one for technology and the other for standards (if you want to enable pivot table options you’ll need to File > Make a copy of the spreadsheet then on the pivot sheets select Data > Pivot Table Report).
The example ‘Technology Pivot’ summarises the data from phase 1 and 2 of the OER Programme. You can see a number of technologies were recorded (over 100), the top three being YouTube, Flash and Slideshare. This data can be shown graphically using Google Spreadsheets chart tools and embedded as an interactive or static graphic.
One of the great things about the PROD data is there is a lot of relationship data already there. For example if you look at the PROD page for the ENABLE project you can see there are details of the projects that ENABLE builds on or was built on by, related projects and even comments that relate to the individual relationships.
This relationship data can all be extracted from PROD and in this case imported to the Spreadsheet. On the Relates_to sheet I’ve imported details of all the JISC funded project ‘relates_to’ relationships. What can we do with this data? Well at a basic level in column B we have a source id and column F has a target id which makes it suitable for using in a force layout diagram. Fortunately I’ve been playing around with online force layout diagrams for a while and most recently created a Google Spreadsheet Gadget to display this info (this is how you can extend the basic chart selection).
Whilst this gadget still needs to be verified by Google for anyone to see the results, we can use the spreadsheet as a datasource for the gadget’s big brother EDGESExplorer. Publishing the spreadsheet to the web, using the built in tools to do this, we can reformat the data in EDGESExplorer to see how all JISC funded projects stored in PROD are related (click on the image below for the interactive version, you can explore individual nodes by clicking on them).
So to recap: a Google Spreadsheet is being populated from PROD (live data). Users can create reports and charts within the Spreadsheet environment (live data) or export data to other services like Many Eyes (dead – as in the live link is broken – data). Finally we can publish live data from the Spreadsheet for use in other tools like EDGESExplorer.
My question for you is what data would you like in the Spreadsheet? Summary of projects by institution? Breakdown of projects by partners? Projects by JISC Programme Managers? Let me know what you would like to see ;)
[As Leeds Met Unicycle repository also has an OAI service I thought I’d collect data from there to compare the RSS feed but I was getting timeouts]
PS Whilst looking for OAI export solutions I came across Ghent University Library’s Download/APIs page which includes: daily/weekly exports of the complete datasets of the Ghent University Academic Bibliography in various formats; every report record accessible as HTML, RDF, Dublin Core, METS, MODS, MPEG-21/DIDL; an Author export; department exports; classification type exports; OAI-PMH; and other stuff. All this lovely data, it just makes you want to go and do something with it! It’s a shame to have all these wonderful repositories of open resources hidden away. Here are some tips from the JISC Digital Infrastructure Team on Making OER visible and findable.
Instead I switched to getting the Pipe as JSON. JSON input powers a lot of my other Google Spreadsheet toys like TAGS and Google+ network generator so I’ve a growing bank of code snippets I can throw together.
I’m not intending on blogging about the OER Visualisation Project but I’ve got a new piece in the jigsaw worth sharing. Below should be an embedded map of JISC/HEA funded OER projects (well, half of them anyway – if you visit the Yahoo Pipe which was used to generate this there are 51 projects listed, but I was only able to automatically get locations for 28). This was generated by using SPARQL to query the CETIS PROD database for projects matching the strand title “open education …”. After the map is the recipe used to make it, which includes some 2-to-1 online tutoring (aren’t PLNs great ;) and a bunch of other tools which I might dip into during this project.
[As part of the process I also produced this PROD Theme GeoMap as a proof of concept. As there’s a user defined theme entry on this it would be easy to drop into the CETIS PROD summary pages, like this one, by passing the querystring parameter into the map embed code.]
It all started last night after reading Owen Stephens’ post on Experimenting with British Museum data, in which he uses SPARQL (a query language) to extract data for use elsewhere. As I mentioned in my opening post for this project I’m not a huge fan of SPARQL, part of the problem being I can’t get my head around the markup. As I was venting my frustration ‘via the medium of Twitter’ I started receiving encouraging words of support from Wilbert Kraan (CETIS) @wilm and Owen @ostephens, which spilled over into this Google Doc.
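The query we ended up with isn’t reproduced here, but its general shape was roughly as follows – note the prefix and predicate names (prod:, mu: etc.) are placeholders rather than the real PROD/JISC MU vocabulary, so treat this purely as an illustration of the structure:

SELECT DISTINCT ?projectid ?name ?lat ?long ?description ?strand WHERE {
  ?projectid prod:name ?name ;
             prod:strand ?strand .
  FILTER regex(?strand, "^open education", "i")
  OPTIONAL { ?projectid prod:shortDescription ?description }
  OPTIONAL { ?projectid prod:institution ?inst .
             ?inst mu:lat ?lat ;
                   mu:long ?long }
}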
Which I now translate as: SELECT DISTINCT project id, name, latitude, longitude, description and strand WHERE strand begins with ‘open education’ “ignoring case”, OPTIONALly get a short description (if available), and from the project host name OPTIONALly look up a matching lat/long from the JISC Monitoring Unit data.
There are a couple of things we can do with this query. Putting it through the Rensselaer Polytechnic Institute’s SPARQL Proxy we can generate formats easily digestible by Yahoo Pipes, as demonstrated by the example above (if you log in to Pipes you can see there is very little going on with the source). The proxy service also has a Google Visualization API output which could lend itself to further mashineerings, wrapping in widget/gadget etc.
Another option is to publish the results as .csv and then use the resulting url in the formula =ImportData(“your url”), which I’ve done in this spreadsheet and again you have options for presenting the data in different ways.
So there’s lots of potential around these techniques for extracting data, but a couple of notes of caution: Yahoo Pipes can be very temperamental (in fact while I hover over the publish button for this post it’s showing no results), and whilst linked data makes things like using JISC MU data to get geo-locations for projects listed in PROD possible, as the spreadsheet linked above shows it didn’t always find a longitude/latitude for the project – it was about a 50% hit rate (more investigation required).
Big thanks to Wilbert and Owen for their help with this!
Since the closure of JISC RSC Scotland North & East I’ve taken the opportunity to rediscover this interest looking at network analysis tools like NodeXL and visualisation techniques and libraries like d3.js and the Google Visualization API. As part of this period of Edupunking myself I’ve submitted some (un)assessments including the Guardian Tag Explorer, which I got some great feedback on and an invitation to visit and chat to some of the Guardian engineers/journalists.
This work hasn’t gone unnoticed at JISC CETIS/OER Programme team and I’m very pleased to announce that I’ve been contracted to visualise the OER Programme (gulp) and I’ve got 40 days to do it (gulp).
This post is day 1 of the project and what I wanted to do was introduce what it is I’ll be doing over the next 39 days, jot down some notes of things I’m thinking of or have already found (including some web standards) and give you an opportunity to comment on any/all of this.
OER Programmes Visualisation Project Outline
The UK Open Educational Resources initiative is jointly led by JISC and the Higher Education Academy and is now in its third year. You can find out more about the programme on the JISC or HEA pages. I would also highly recommend John Robertson’s (CETIS) collection of posts on OER which are a great lead-in to this project.
Having dabbled with PROD before I’m familiar with the data and architecture. Something I regularly struggle with is using SPARQL to query the linked datasets and I usually end up asking Wilbert Kraan (CETIS) for help. Something I think might be useful, even if it’s just for me, is to create a synchronised version of the data stored in Google Spreadsheets (à la Guardian Datastore). My thinking behind this is that spreadsheets are more commonly understood, Google Spreadsheets include a number of visualisation options which can be embedded elsewhere, and, as in my TAGSExplorer, the data can be read and visualised using other visualisation libraries. CETIS also already have examples for getting PROD into Google Spreadsheets which means I can spend more time on mashing up the data.
Initial thoughts on OER content and collections
Top level data like geographic mappings look straightforward, but delving deeper is going to take some more work. As the project proposal highlights, whilst projects could host resources anywhere, only being required to submit a reference to Jorum, what has actually been submitted (if anything) varies greatly.
What I’ve learned over recent weeks is VML (Vector Markup Language) works for Internet Explorer 5 and later. It was never passed as a W3C standard, instead they opted for SVG (Scalable Vector Graphics) – Microsoft however didn’t incorporate it until IE9. Some libraries/tools like Google Chart API embed scalable graphics using iFrames and then detect the viewer’s browser to deliver the graphic in VML or SVG. One of my preferred libraries, d3.js, only renders in SVG so there is a question about whether it should be used.
I’m only just getting up to speed with HTML5 canvas. Being part of the HTML5 spec it’s only viewable on newer browsers, so like VML/SVG there is a question of compatibility. As canvas converts coded shapes into bitmaps it can render large numbers of objects more efficiently, whereas SVG draws the shape and remembers it, allowing further interaction – that’s my understanding anyway.
Top priority is to find out how much UKOER data there is and in what format. Once I’ve planted the seeds of what I need I can get on with the PROD package and the top level OER information. I’ll also need some answers around standards. My feeling is SVG should be okay as it’s a W3C recognised format, but I’m probably biased as it means I can remash some of my d3.js work.
I am as always on the scrounge for new ideas/comments so please share with what’s on your mind ;)