A quick postscript to day 24 of the OER Visualisation project, where I looked at how individual Jorum UKOER resources were being, or as was the case, not being shared on social networking sites like Twitter, Facebook et al. I did a similar analysis on HumBox records and, using my method, it was a similar story: almost undetectable social sharing of individual resources.

To try and see if this was because of bad data on my part I posed the question to the #ukoer Twitter community and the OER-DISCUSS mailing list. On the OER-DISCUSS list Peter Robinson highlighted that one mention of one of University of Oxford’s resources on StumbleUpon resulted in a spike of 20,000 views. On Twitter Catherine Grant (@filmstudiesff) responded:

Putting Catherine’s first link into sharedcount.com gives the following:

Facebook
Likes: 13
Shares: 24
Comments: 11
Total: 48

Twitter
Tweets: 81

Google +1
+1s: 0

Diggs: 0

Shares: 2

Google Buzz
Buzzes: 0

Bookmarks: 8

Stumbles: 1

So one page of curated resources, with almost 50 Facebook reactions and over 80 tweets, can have as many social shares as an entire repository.

This issue is a well known one within the OER community and, with almost eerie timing, the day after the ‘day 24’ post Pat Lockley posted The OERscars – and the winner is, in which he looks at some of the activity stream around ‘Dynamic Collections’ created as part of the OER Phase 2 Triton Project. From Pat’s post:

Dynamic Collections function as a WordPress plug in, bring in RSS Feeds from OER sites and blogs, and then search these feeds for particular words before moving these items into collections. These collections can be created as simply as a WordPress post, and so gives almost everyone the scope to start building OER collections straight away. Once a collection has some content, it can be displayed to visitors to the site (normally as a “wider reading” style link at the end of a post on a particular topic) and we made sure to track how these resources are used.  As well as showing as a WordPress page, the collections can also be seen as an RSS Feed (add ?rss_feed_collection=true to the end of a page), An Activity Stream – which will be handy for the Learning Registry (?activity_stream=true), or embedded into another page (?dc_embed=true) via some javascript.
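
As a rough illustration of the three query switches Pat describes, here is a minimal Python sketch that appends them to a collection page URL. The page address and the short view labels ("rss", "activity", "embed") are my own placeholders; only the parameter names come from the quoted post.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit, parse_qsl

# Query switches quoted in Pat's post; the keys in this dict are my own labels.
VIEWS = {
    "rss": ("rss_feed_collection", "true"),
    "activity": ("activity_stream", "true"),
    "embed": ("dc_embed", "true"),
}

def collection_view_url(page_url, view):
    """Return page_url with the query switch for the requested view added."""
    parts = urlsplit(page_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(VIEWS[view])
    return urlunsplit(parts._replace(query=urlencode(query)))

# Hypothetical collection page, used only for illustration.
page = "http://example.org/collections/politics/"
for view in VIEWS:
    print(collection_view_url(page, view))
```

Parsing and rebuilding the query string, rather than just gluing "?..." onto the end, means the sketch also copes with pages that already carry a query such as ?p=12.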

I’m not entirely sure what my point is, but I thought it worth sharing the information and links.


Following on from Day 20’s Maps, Maps, Maps, Maps, the last couple of days I’ve been playing around with timelines. This was instigated by a CETIS PROD ‘sprint day’ on Friday where Sheila MacNeill, Wilbert Kraan, David Kernohan and I put our thinking caps on, cracked our knuckles and looked at what we could do with the CETIS PROD data.

Creating a timeline of JISC projects from PROD is not new: Wilbert has already posted a recipe for using a Google Gadgetized version of MIT’s SIMILE timeline widget to create a timeline of JISC e-Learning projects. I wanted to do something different, trying to extract project events and also have more timeline functionality than offered by the SIMILE gadget. My decision to go down this particular route was also inspired by seeing Derek Bruff’s Timeline CV, which renders Google Spreadsheet data in a full-featured version of the SIMILE timeline widget (an idea that Derek had got from Brian Croxall).

PROD Project Directory page

Having already dabbled with the PROD data I knew CETIS staff have annotated JISC projects with almost 3,000 individual comments. These can be categorised as general comments, related projects comments, and comments associated with technology and standards choices (you can see this rendered in a typical project page and highlighted in the graphic).

Timeline 1 - All project comments

Discovery number one was that not all the comments have timestamp information in the linked data (it turns out only general comments have these). Ignoring the untimestamped comments for now and seeing what happens if we create a SPARQL query, import the data into a Google Spreadsheet, format it and then wrap it in an HTML page, we get this JISC CETIS PROD Comment Timeline:

JISC CETIS PROD Comment Timeline

The good – search and filtering strands; information popups render well; user can resize window; easy export options (activated when mouseover timeline via orange scissors)
The bad – too many comments to render in the timeline; the key under the timeline can’t render all the strands

Timeline 2: Technology timeline with comments

One of the suggestions at ‘show and tell’ was to focus on a particular area like technology to see if there were any trends in uptake and use. As mentioned earlier there are currently no timestamps associated with technology comments and it was suggested that project start and end dates could be used as an indication. So using the same recipe (SPARQL query, formatting data in a Google Spreadsheet, wrapping in an HTML page) we get the CETIS PROD Tech Timeline.

CETIS PROD Tech Timeline

Again the default view presents information overload, which is slightly alleviated by filtering. I still don’t get any sense of waves of technologies coming and going, partly because project start/end dates are only a rough indication of when a technology was in use and maybe because it is very rare for a technology to die.

Timeline 3 – Programme level

Trying to take a more focused view it was suggested I look at a programme level timeline of general comments (being general comments means they are individually timestamped). Using the recipe one more time (SPARQL query, formatting data in a Google Spreadsheet, wrapping in an HTML page) we get the CETIS PROD OER Phase 1 & 2 timeline.

CETIS PROD OER Phase 1 & 2 timeline

Still there is a problem navigating the data because of clustering of comments (shown by the string of blue dots in the bottom timebar). So what’s going on here? Looking at the data it’s possible to see that despite the fact that general comments could have been made at any point in the two years of the programme 912 comments were made on only 73 different days (which partly makes sense – ‘going to sit down and do some admin, I’ll review and comment on project progress’).    
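
The clustering is easy to check for yourself: 912 comments over 73 days works out at roughly 12 comments per active day. Below is a minimal Python sketch of that count, assuming the comment timestamps have been exported as ISO 8601 strings. The sample timestamps are made up and the function name is my own.

```python
from collections import Counter
from datetime import datetime

def comments_per_day(timestamps):
    """Count comments per calendar day from ISO 8601 timestamp strings."""
    days = (datetime.fromisoformat(ts).date() for ts in timestamps)
    return Counter(days)

# Made-up timestamps standing in for the general comments in PROD.
sample = [
    "2010-03-01T09:15:00", "2010-03-01T09:40:00", "2010-03-01T10:05:00",
    "2010-06-14T14:00:00", "2010-06-14T14:20:00", "2011-01-20T11:30:00",
]
per_day = comments_per_day(sample)
print(len(sample), "comments on", len(per_day), "different days")
print(per_day.most_common(1))
```

A timeline that plotted one marker per busy day (sized by the count) rather than one per comment might be one way around the string of overlapping dots.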

So timelines are maybe not best for this type of data. At least there’s a recipe/template used here which might help people uncover useful information. There is an evolution of this combining timelines and maps that I’m working on so stay tuned.

In May 2009 JISC CETIS announced the winners of the OER Technical Mini-Projects. These projects were designed:

to explore specific technical issues that have been identified by the community during CETIS events such as #cetisrow and #cetiswmd and which have arisen from the JISC / HEA OER Programmes

Source: JISC CETIS OER Technical Mini Projects Call
Author: Phil Barker, JISC CETIS

One of the successfully funded projects was CaPRéT - Cut and PAste reuse and Tracking from Brandon Muramatsu, MIT OEIT and Justin Ball and Joel Duffin, Tatemae. I’ve already touched upon OER tracking in day 24 and day 30, briefly looking at social shares of OER repository records. Whilst projects like the Learning Registry have the potential to help, it is still early days and tracking still seems to be an afterthought, which has been picked up in various technical briefings. CaPReT tries to address part of this problem, as stated in the introduction to their final report:

Teachers and students cut and paste text from OER sites all the time—usually that's where the story ends. The OER site doesn't know what text was cut, nor how it might be used. Enter CaPRéT: Cut and Paste Reuse Tracking. OER sites that are CaPRéT-enabled can now better understand how their content is being used.

When a user cuts and pastes text from a CaPRéT-enabled site:

  • The user gets the text as originally cut, and if their application supports the pasted text will also automatically include attribution and licensing information.
  • The OER site can also track what text was cut, allowing them to better understand how users are using their site.

The code and other resources can be found on their site. You can also read Phil Barker’s (JISC CETIS) experience testing CaPReT and feedback and comments about the project on the OER-DISCUSS list.

One of the great things about CaPReT is the activity data is available for anyone to download (or as summaries Who's using CaPRéT right now? | CaPRéT use in the last hour, day and week | CaPRéT use by day).

One of the challenges set to me by Phil Barker was to see what I could do with the CaPReT data. Here’s what I’ve come up with. First a map of CaPReT (great circles) usage plotting source website and where in the world some text was copied from (click on image for full scale):

capret - source target map

And an interactive timeline which renders where people copied text, with pop-ups giving a summary of what they copied:

capret timemap

Both these examples rely on the same refined data source rendered in different ways and in this post I’ll tell you how it was done. As always it would be useful to get your feedback as to whether these visualisations are useful, things you’d improve or other ways you might use the recipes.

How was it made – getting geo data

  1. Copied the CaPReT tabular data into Excel (.csv download didn’t work well for me, columns got mixed up on unescaped commas), and saved as .xls
  2. .xls imported in Google Refine. The main operation was to convert text source and copier IP/domains to geo data using www.ipinfodb.com. An importable set of routines can be downloaded from this gist. [Couple of things to say about this: CaPReT have a realtime map, but I couldn’t see any locations come through; if they are converting IP to geo it would be useful if this was recorded in the tabular results. IP/domain location lookups can also be a bit misleading, for example, my site is hosted in Canada, I’m typing in France and soon I’ll be back in Scotland.]
  3. the results were then exported to Excel and duplicates removed based on ‘text copied on’ dates and then uploaded to a Google Spreadsheet   
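
The clean-up in steps 1 and 3 can be sketched in a few lines of Python. This is not what I actually did (that was Excel and Google Refine), just an illustration under assumptions: the sample rows are made up, and apart from ‘text copied on’ the column names are hypothetical.

```python
import csv
import io

def drop_duplicates(rows, key):
    """Keep the first row seen for each value of the key column."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

# Made-up export; quoting the copied text keeps its commas in one column.
export = io.StringIO(
    'source,text copied on,text\n'
    'example.org,2012-01-10 10:01:00,"Teachers, students and others"\n'
    'example.org,2012-01-10 10:01:00,"Teachers, students and others"\n'
    'example.com,2012-01-11 16:30:00,"Open, reusable content"\n'
)
rows = list(csv.DictReader(export))
unique = drop_duplicates(rows, "text copied on")
print(len(rows), "rows in,", len(unique), "rows out")
```

The sample also shows why the .csv download misbehaved for me: a comma inside the copied text only stays in its own column if the field is quoted.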

Making CaPReT ‘great circles’ map

This was rendered in RStudio using Nathan Yau’s (Flowing Data) How to map connections with great circles. The R code I used is here. The main differences are reading the data from a Google Spreadsheet and handling the data slightly differently. (Who would have thought as.data.frame(table(dataset$domain)) would turn the source spreadsheet into a table of counts per domain? #stilldiscoveringR)
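
For anyone not yet discovering R, the as.data.frame(table(dataset$domain)) call builds a frequency table: one row per distinct domain with a count beside it. A rough Python analogue is sketched below. The rows are made up; only the ‘domain’ column name comes from the R snippet.

```python
from collections import Counter

def domain_counts(rows, column="domain"):
    """Return (value, frequency) pairs, most frequent first."""
    return Counter(row[column] for row in rows).most_common()

# Made-up rows; only the 'domain' column name comes from the R snippet.
dataset = [
    {"domain": "example.org"}, {"domain": "example.com"},
    {"domain": "example.org"}, {"domain": "example.org"},
]
for domain, freq in domain_counts(dataset):
    print(domain, freq)
```

One difference to be aware of: this sketch orders by frequency, whereas R’s table() orders by the values themselves.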

I should also say that there was some post production work done on the map. For some reason some of the ‘great circles’ weren’t so great and wrapped around the map a couple of times. Fortunately these anomalies can easily be removed using Inkscape (and while I was there I added a drop shadow).
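
I never got to the bottom of why some lines wrapped. One common cause with this kind of map, and it is only a guess in this case, is a path crossing the 180° meridian, so that the drawing jumps from one edge of the map to the other. A minimal Python sketch of splitting a path at such jumps (the function name and sample points are my own):

```python
def split_at_antimeridian(path):
    """Split a list of (lon, lat) points wherever longitude jumps by more than 180 degrees."""
    segments = [[path[0]]]
    for prev, point in zip(path, path[1:]):
        if abs(point[0] - prev[0]) > 180:
            # Start a new segment instead of drawing a line across the map.
            segments.append([])
        segments[-1].append(point)
    return segments

# Made-up path heading east across the date line.
path = [(170, 10), (178, 12), (-175, 14), (-160, 15)]
for segment in split_at_antimeridian(path):
    print(segment)
```

Drawing each segment separately would avoid the stray horizontal lines without any hand editing, if that is indeed what was going on.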


capret before post production

Making CaPReT ‘timemap’

Whilst having a look around SIMILE based timelines for day 30 I came across the timemap.js project which:

is a Javascript library to help use online maps, including Google, OpenLayers, and Bing, with a SIMILE timeline. The library allows you to load one or more datasets in JSON, KML, or GeoRSS onto both a map and a timeline simultaneously. By default, only items in the visible range of the timeline are displayed on the map.

Using the Basic Example, Google v3 (because it allows custom styling of Google Maps) and Google Spreadsheet Example I was able to format the refined data already uploaded to a Google Spreadsheet and then plug it in as a data source for the visualisation (I did have a problem with reading data from a sheet other than the first one, which I’ve logged as an issue including a possible fix).

A couple of extras I wanted to add to this example are showing the source and allowing the user to filter on it. There’s also an issue of scalability. Right now the map is rendering 113 entries; if CaPReT were to take off the spreadsheet would quickly fill up and the visualisation would probably grind to a halt.

[I might be revisiting timemap.js as they have another example for a temporal heatmap, which might be good for showing Jorum UKOER deposits by institution.]

So there you go, two recipes for converting IP data into something else. I can already see myself using both methods in other aspects of the OER Visualisation project and in other projects. And all of this was possible because CaPReT had some open data.

I should also say JISC CETIS have a wiki on Tracking OERs: Technical Approaches to Usage Monitoring for UKOER and Tony also recently posted Licensing and Tracking Online Content – News and OERs.


OER Phase 1 & 2 project descriptions wordcloud

I’m in the final stretch of the OER Visualisation project. Recently reviewing the project spec I’m fairly happy that I’ll be able to provide everything asked for. One of the last things I wanted to explore was automated/semi-automated programme reporting from the PROD database. From early discussions with CETIS, programme level reporting, particularly of technology and standards used by projects, emerged as a common task. CETIS already have a number of stored SPARQL queries to help them with this, but I wondered if more could be done to optimise the process. My investigations weren't entirely successful, and at times I was thwarted by misbehaving tools, but I thought it worth sharing my discoveries to save others time and frustration.

My starting point was the statistical programming and software environment R (in my case the more GUI friendly RStudio). R is very powerful in terms of reading data, processing it and producing data analysis/visualisations. Already CETIS’s David Sherlock has used R to produce a Google Visualisation of Standards used in JISC programmes and projects over time and CETIS’s Adam Cooper has used R for Text Mining Weak Signals, so there are some in-house skills which could build on this idea.

Two other main factors for looking at R as a solution are:

  • the modular design of the software environment makes it easy to add functionality through existing packages (as I pointed out to David there is a SPARQL package for R which means he could theoretically consume linked data directly from PROD); and
  • R has a number of ways to produce custom reports, most notably the Sweave function allows the integration of R output in LaTeX documents allowing the generation of dynamic reports  

So potentially a useful combination of features. Let’s start looking at some of the details to get this to work.

Getting data in

Attempt 1 – Kasabi custom API query

Kasabi is a place where publishers can put their data for people like me to come along and try and do interesting stuff with it. One of the great things you can do with Kasabi is make custom APIs onto linked data (like this one from Wilbert Kraan) and add one of the Kasabi authored XSLT stylesheets to get the data returned in a different format, for example .csv which is easily digestible by R.

Problem: Either I’m doing something wrong, or there is an issue with the data on Kasabi or with Kasabi itself, because queries I know work (like this OER Projects with location) keep returning 400 errors when converted to an API.

Attempt 2 – Query the data directly in R using the SPARQL package

As I highlighted to David there is a SPARQL package for R which in theory lets you construct a query in R, collect the data and put it in a data frame.

Problem: Testing the package with this query returns: 

Error in data.frame(projectID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,  : 
  arguments imply differing number of rows: 460, 436, 291, 427, 426

My assumption is the package doesn’t like empty values.
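
To show why empty values could produce that kind of error, here is a minimal Python sketch. It assumes the results arrive in the standard SPARQL JSON results shape, where a variable with no value is simply left out of a row; it says nothing about how the R package works internally, and the sample data and function name are made up. Collecting values column by column would then give columns of different lengths, whereas building row by row with a filler keeps them equal.

```python
def bindings_to_rows(results):
    """Turn SPARQL JSON results into rows, using None for unbound variables."""
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows

# Made-up result set in the SPARQL 1.1 Query Results JSON shape.
sample = {
    "head": {"vars": ["projectID", "name", "lat"]},
    "results": {"bindings": [
        {"projectID": {"type": "literal", "value": "1"},
         "name": {"type": "literal", "value": "Project A"},
         "lat": {"type": "literal", "value": "55.9"}},
        {"projectID": {"type": "literal", "value": "2"},
         "name": {"type": "literal", "value": "Project B"}},
    ]},
}
rows = bindings_to_rows(sample)
columns = {var: [row[var] for row in rows] for var in sample["head"]["vars"]}
print({var: len(values) for var, values in columns.items()})
```

Every column ends up with one entry per result row, which is what a data frame needs.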

Attempt 3 – Going via a sparqlproxy

For a lot of the other PROD SPARQL work I’ve done I’ve got csv files by using the Rensselaer SPARQL proxy service. R is quite happy reading a csv via the proxy service (code included later), it just means you are relying on an external service, which isn’t really necessary as Kasabi should be able to do the job and it would be better if the stored procedures were in one place (I should also say I looked at just using an XML package to read data from Talis/Kasabi but didn’t get very far).
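
The proxy route boils down to building a URL that carries the query and the endpoint, then reading the csv that comes back. A minimal Python sketch of that idea follows, with loud caveats: the proxy address, endpoint and query are placeholders, and the parameter names ('query', 'service-uri', 'output') are assumptions that should be checked against the proxy's own documentation. No request is actually sent; a short string stands in for the response.

```python
import csv
import io
from urllib.parse import urlencode

def proxy_url(proxy, endpoint, query, output="csv"):
    """Build a GET URL asking a SPARQL proxy to run query against endpoint."""
    # 'query', 'service-uri' and 'output' are assumed parameter names.
    params = {"query": query, "service-uri": endpoint, "output": output}
    return proxy + "?" + urlencode(params)

def read_csv_text(text):
    """Parse CSV text (as the proxy would return) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

# Hypothetical proxy, endpoint and query, for illustration only.
url = proxy_url(
    "http://example.org/sparqlproxy",
    "http://example.org/sparql",
    "SELECT ?project ?name WHERE { ?project ?p ?name } LIMIT 2",
)
print(url)

# Stand-in for the body of the HTTP response.
rows = read_csv_text("project,name\nhttp://example.org/p/1,Project A\n")
print(rows[0]["name"])
```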

Processing the data

This is where my memory gets a little hazy and I wish I had taken more notes of the useful sites I went to. I’m pretty sure I got started with Tony Hirst’s How Might Data Journalists Show Their Working? Sweave. I know also that I looked at Vanderbilt’s Converting Documents Produced by Sweave, Nicola Sartori’s An Sweave Tutorial, Charlie Geyer's An Sweave Demo Literate Programming in R Reproducible Research, Greg Snow’s Automating reports with Sweave and Jim Robison-Cox’s Sweave Intro (this one included instructions on installing the MikTeX LaTeX engine for Windows, without which none of this would have worked).

The general idea with R/Sweave/LaTeX is you mark up a document, inserting R script which can be executed to include data tables and visualisations. Here’s a very basic example of an output (I think I’ve been using R for a week now so don’t laugh) which pulls in two data sets (Project Descriptions | Project Builds On), includes some canned text and generates a wordcloud from project descriptions and a force directed graph of project relationships.

The code used to do this is embedded below (also available from here):

The main features are the \something commands, e.g. \section, which are LaTeX markup for the final document, and the <<>>= and @ R script code wrappers. I’ve even less experience of LaTeX than R so I’m sure there are many things I don’t know yet/got wrong, but hopefully you can see the potential power of the solution. Things I don’t like are being locked into particular LaTeX styles (although you can create your own) and the end product being a .pdf (as Sweave/R now go hand in hand a lot of the documentation and coding examples end up in .pdf, which can get very frustrating when you are trying to copy and paste code snippets, and which also makes me wonder how accessible/screen reader friendly Sweave/LaTeX pdfs are).

Looking for something that gives more flexibility in output I turned to R2HTML which includes “a driver for Sweave allows to parse HTML flat files containing R code and to automatically write the corresponding outputs (tables and graphs)”. Using a similar style of markup (example of script here) we can generate a similar report in html. The R2HTML package generates all the graph images and html so in this example it was a case of uploading the files to a webserver. Because it’s html the result can easily be styled with a CSS sheet or opened in a word processor for some layout tweaking (here’s the result of a 60 second tweak).

Is it worth it?

After a day toiling with SPARQL queries and LaTeX markup I’d say no, but it’s all very dependent on how often you need to produce the reports, the type of analysis you need to do and your anticipated audience. For example, if you are just going to present wordclouds and histograms R is probably overkill and it might be better to just use some standard web data visualisation libraries like mootools or Google Visualisation API to create live dashboards. Certainly the possibilities of R/Sweave are worth knowing about.


UKOER Hashtag Community

Last week I started to play with the #ukoer hashtag archive (which has generated lots of useful coding snippets to process the data that I still need to blog … doh!). In the meantime I thought I’d share an early output. Embedded below is a zoom.it of the #ukoer hashtag community. The sketch (HT @psychemedia) is from a partial list* of Twitter accounts (n. 865) who have used the #ukoer hashtag in the last couple of years and who they currently follow. The image represents over 24,000 friendships, the average person having almost 30 connections to other people in the community.


3D Heart SSD
Originally uploaded by Generation X-Ray
Publishing an early draft of this image generated a couple of ‘it looks like’ comments (HT @LornaMCampbell @glittrgirl). To me it looks like a heart, hence the title of this post. The other thing that usually raises questions is how the colour groupings are defined (HT @ambrouk). The answer in this case is that they are generated by a modularity algorithm which tries to automatically detect community structure.
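
To give a feel for what a modularity algorithm is chasing, here is a minimal Python sketch of the modularity score itself for an undirected, unweighted graph: the share of edges that fall inside groups, minus the share you would expect by chance given how connected each group is. It only scores a grouping you hand it; it does not do the searching for the best grouping, and it is not the code behind the image. The tiny graph is made up.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)   # edges with both ends in the community
    degree = defaultdict(int)     # sum of node degrees per community
    for a, b in edges:
        degree[community[a]] += 1
        degree[community[b]] += 1
        if community[a] == community[b]:
            internal[community[a]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single edge (made-up accounts).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
good = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
lumped = {node: 0 for node in good}
print(round(modularity(edges, good), 3), round(modularity(edges, lumped), 3))
```

Splitting the two triangles into separate groups scores higher than lumping everyone together, which is the kind of structure the colouring picks out.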

As an experiment I’ve filtered the Twitter profile information used for each of these groupings and generated a wordcloud using R. (The R script used is a slight modification of one I’ve submitted to the Twitter Backchannel Analysis repository Tony started, something else I need to blog about. The modification is to SELECT a column WHERE modclass=something.)

Right, all this post has done is remind me of my post backlog and I’ve got more #ukoer visualisation to do, so I’d better get on with it.

*it’s a partial list because as far as I know there isn’t a complete archive of #ukoer tweets. The data I’m working from is from an export from TwapperKeeper for March 2010-Jan 2012  topped up with some data from Topsy for April 2009-March 2010


It’s the last day of the OER Visualisation Project and this is my penultimate ‘official’ post. Having spent 40 days unlocking some of the data around the OER Programme there are more things I’d like to do with the data, some loose ends in terms of how-to’s I still want to document and some ideas I want to revisit. In the meantime here are some of the outputs from my last task, looking at the #ukoer hashtag community. This follows on from day 37 when I looked at ‘the heart of #ukoer’, this time looking at some of the data pumping through the veins of UKOER. It’s worth noting that the information I’m going to present is a snapshot of OER activity, only looking at a partial archive of information tweeted using the #ukoer hashtag from April 2009 to the beginning of January 2012, but hopefully it gives you a sense of what is going on.

The heart revisited

I revisited the heart after I read Tony Hirst’s What is the Potential Audience Size for a Hashtag Community?. In the original heart nodes were sized using ‘betweenness centrality’, which is a social network metric to identify nodes which are community bridges, nodes which provide a pathway to other parts of the community. Betweenness centrality calculated on a friendship network takes no account of how much a person may have contributed. So for example someone like John Robertson (@KavuBob) was originally ranked as having the 20th highest betweenness centrality in the #ukoer hashtag community, while JISC Digital Media (@jiscdigital) is ranked 3rd. But if you look at how many tweets John has contributed (n.438) compared to JISC Digital Media (n.2) isn’t John’s potential ‘bridging’ ability higher?

Weighted Betweenness Centrality

There may be some research in this area, and I have to admit I haven’t had the chance to look, but for now I decided to weight betweenness centrality based on the volume of the archive the user has contributed. So John goes from ranked 20th to 3rd and JISC Digital Media goes from 3rd to 55th. Here’s a graph of the winners and losers (click on the image to enlarge).
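
The weighting idea can be sketched in a few lines of Python. This is an illustration under assumptions rather than the exact calculation used for the graph: it simply multiplies each account's betweenness score by its share of the tweets. The betweenness scores below are invented, the tweet counts for @KavuBob and @jiscdigital are the ones quoted above, and the third account is hypothetical.

```python
def weighted_ranking(betweenness, tweet_counts):
    """Rank accounts by betweenness multiplied by their share of the archive."""
    total = sum(tweet_counts.values())
    weighted = {
        name: score * tweet_counts.get(name, 0) / total
        for name, score in betweenness.items()
    }
    return sorted(weighted, key=weighted.get, reverse=True)

# Betweenness scores are invented; the tweet counts for two accounts echo
# the figures quoted above, the third account is hypothetical.
betweenness = {"jiscdigital": 0.09, "KavuBob": 0.03, "someone_else": 0.05}
tweets = {"jiscdigital": 2, "KavuBob": 438, "someone_else": 60}
print(sorted(betweenness, key=betweenness.get, reverse=True))
print(weighted_ranking(betweenness, tweets))
```

Even with made-up scores the effect is the same as in the real data: a well-placed but quiet account drops down the list and a busy contributor climbs.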

Here is the revised heart on zoom.it (and if zoom.it doesn’t work for you the heart as a .jpg).

The 'heart' of #ukoer (click to enlarge)

[In the bottom left you’ll notice I’ve included a list of top community contributors (based on weighted betweenness), a small reward for those people (I was all out of #ukoer t-shirts).]

These slides also show the difference in weighted betweenness centrality (embedded below). You should ignore the change in colour palette, the node text size is depicting betweenness centrality weight [Google presentation has come on a lot recently – worth a look at if you are sick of the clutter of slideshare]:


The ‘pulse’ of #ukoer

In previous work I’ve explored visualising Twitter conversations using my TAGSExplorer. Because of the way I reconstructed the #ukoer twitter archive (a story for another day) it’s compatible with this tool so you can see and explore the #ukoer archive of the 8300 tweets I’ve saved here. One of the problems I’m finding with this tool is it takes a while to get the data from the Google Spreadsheet for big archives.

TAGSExplorer - ballofstuff

This problem was also encountered in Sam’s Visualising Twitter Networks: John Terry Captaincy Controversy. As TAGSExplorer internally generates a graph of the conversation, rather than scratching my head on some R script it was easy to expose this data so that it can be imported into Gephi. So now if you add &output=true to a TAGSExplorer url you get a comma separated edge list to use with your SNA package of choice (the window may be blocked as a pop-up, so you need to enable it). Here is the link for the #ukoer archive with edges for replies, mentions and retweets (which generates ‘a ball of awesome stuff’ (see inset above) but will eat your browser performance).
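
If you would rather poke at the edge list before loading it into an SNA package, a few lines of Python will do. The sketch below assumes a simple two-column source,target layout, which may not match the actual export exactly, and the sample edges are made up. It collapses repeated edges into a weight.

```python
import csv
import io
from collections import Counter

def edge_weights(edge_list_text):
    """Count how often each (source, target) pair appears in a CSV edge list."""
    reader = csv.reader(io.StringIO(edge_list_text))
    return Counter((row[0], row[1]) for row in reader if len(row) >= 2)

# Made-up edge list; the two-column source,target layout is an assumption.
sample = "alice,bob\nalice,bob\ncarol,alice\n"
for (source, target), weight in edge_weights(sample).most_common():
    print(source, target, weight)
```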

ukoer conversation (click to enlarge)

Processing the data in Gephi you get a similar ball of awesome stuff (ukoer conversation on zoom.it | ukoer conversation .jpg). What does it all mean, I hear you ask? These flat images don’t tell you a huge amount. Being able to explore what was said is very powerful (hence coming up with TAGSExplorer). You can however see a lot of mentions (coloured blue, with line width indicating volume) in the centre between a small number of people. It’s also interesting to contrast OLNet top right and 3d_space mid left. OLNet has a number of green lines radiating out indicating @replies, which suggests they are in conversations with individuals using the #ukoer tag. This compares to 3d_space which has red lines indicating retweets, suggesting they are more engaged in broadcast.

Is there still a pulse?

UKOER Community Stats

When looking at the ‘ball of awesome stuff’ it’s important to remember that this is a depiction of over 8,000 tweets from April 2009 to January 2012. How do we know if this tag is alive and kicking, or just burned out like a dwarf star?

The good news is there is still a pulse within #ukoer, or more accurately lots of individual pulses. The screenshot to the right is an extract from this Google Spreadsheet of #UKOER. As well as including 8,300 tweets from #ukoer it also lists the twitter accounts that have used this tag. On this sheet are sparklines indicating the number of tweets in the archive they’ve made and when. At the top of the list you can see some strong pulses from UKOER, xpert_project and KavuBob. You can also see others just beginning or ending their ukoer journey.
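
The numbers behind a sparkline are just tweets per period for each account. A minimal Python sketch of that tally is below. It is not how the spreadsheet does it, the tweets are made up, and it assumes dates in ISO ‘YYYY-MM-DD’ form so the first seven characters give the month.

```python
from collections import Counter, defaultdict

def monthly_counts(tweets):
    """Map each account to a Counter of tweets per 'YYYY-MM' month."""
    counts = defaultdict(Counter)
    for account, created_at in tweets:
        counts[account][created_at[:7]] += 1
    return counts

def series(counter, months):
    """Return counts in month order, with 0 for months with no tweets."""
    return [counter.get(month, 0) for month in months]

# Made-up (account, ISO date) pairs standing in for the archive.
tweets = [("UKOER", "2011-10-03"), ("UKOER", "2011-12-01"),
          ("UKOER", "2011-12-15"), ("KavuBob", "2011-11-20")]
months = ["2011-10", "2011-11", "2011-12"]
for account, counter in monthly_counts(tweets).items():
    print(account, series(counter, months))
```

Filling in zeros for quiet months matters: without them a sparkline would hide the gaps that show someone beginning or ending their journey.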

The good news is the #ukoer hashtag community is going strong, with December 2011 having the most tweets in one month, and the number of unique Twitter accounts using the tag has probably by now tipped over the 1,000 mark.

#ukoer community growth

There is more for you to explore in this spreadsheet but alas I have a final post to write so you’ll have to be your own guide. Leave a comment if you find anything interesting or have any questions.

[If you would like to explore both the ‘heart’ and ‘pulse’ graphs more closely I’ve uploaded them to my installation of Raphaël Velt's Gexf-JS Viewer (it can take 60 seconds to render the data). This also means the .gexf files are available for download:]


This is the final official post (a little later than expected) for the OER Visualisation Project. In this post I’ll summarise the work and try and answer the questions originally raised in the project specification.

Over the course of the project there have been 17 blog posts, including this one, listed at the end and accessible via the ooh-er category. The project was tasked with two work packages: PROD Data Analysis (WP1); and Content Outputs (WP2). The nature of these packages means there is overlap in deliverables but I will attempt to deal with them separately.

PROD Data Analysis (WP1)

  • Examples of enhanced data visualisations from OER Phase 1 and 2.

This part of the project mainly focused on process rather than product. It’s rewarding to see that CETIS staff are already using some of the documented processes to generate their own visualisations (David’s post | Sheila’s post). Visualisations that were produced include: OER Phase 1 and 2 maps [day 20], timelines [day 30], wordclouds [day 36] and project relationships [day 8].

OER Map | OER Timeline | OER Wordcloud | OER Project relates force layout


  • Recommendations on use and applicability of visualisation libraries with PROD data to enhance the existing OER dataset.

The main recommendation is that visualisation libraries should be integrated into the PROD website or workflows. In particular it is perceived that it would be most useful to incorporate summary visualisations in general queries for programme/strand information. The most useful visualisations would be automatically generated versions of the charts already used by CETIS staff in their summaries of programme information (histograms, bubble charts and word clouds). Most of these are possible using the Google Visualisation API. This may emerge as the preferred library given Wilbert’s existing work with the Sgvizler library. The disadvantage of the Google Visualisation API is that the appearance of the charts is very standard, so investment in a more glossy solution such as raphaelJS (used in WP2) might be considered. [Although I never got around to it I’m also still interested in investigating an entirely new navigation/exploration solution based on my Guardian Tag Explorer]


  • Recommendations and example workflows including sample data base queries used to create the enhanced visualisations.

In the blog post on day 8 an example spreadsheet was created (which has been further developed) and shared for reuse. As part of this process an online document of SPARQL queries used in the spreadsheet was also created and maintained. Throughout the project this resource has been revisited and reused as part of other outputs. It has also been used to quickly filter PROD data.

This technique of importing SPARQL data has already been highlighted by CETIS staff but to move the idea forward it might be worth considering releasing a set of prepared spreadsheets to the public. The main advantage of this is there is a steep learning curve when using linked data and SPARQL queries. Having a set of spreadsheets ready for people to use should make PROD data more accessible, allowing people to explore the information in a familiar environment.

An issue to consider if promoting the PROD Datastore Spreadsheet is whether to make a live link to the PROD data or create it as fixed data. For example, the mapping solutions from day 20 broke after PROD data was updated. Removing whitespace from some institution names meant that location data which was looked up using a combination of fixed and live data failed to find the amended institution names. This is not so much an issue with PROD data but with the way the spreadsheet was developed. Consequently if CETIS were to go down the route of providing ready made Google Spreadsheets it should make clear to users that data might change at any point. Alternatively, instead of providing spreadsheets with live data CETIS might consider releasing versions of spreadsheets with fixed results (this could be provided for programmes or strands which have been completed). Production of the static sheets could be achieved manually by uploading a collection of csv reports (sheets for project information, technologies, standards etc) or be automated using Google Apps Script.
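
The breakage is easy to reproduce, and to guard against, in a few lines. The Python sketch below is an illustration rather than the spreadsheet formula I used: the institution name and coordinates are made up, and the function names are my own. Normalising whitespace on both sides of the lookup means a tidy-up of the live data no longer breaks the match.

```python
def normalise(name):
    """Collapse runs of whitespace and trim the ends of an institution name."""
    return " ".join(name.split())

def build_lookup(locations):
    """Index (lat, lon) pairs by normalised institution name."""
    return {normalise(name): latlon for name, latlon in locations.items()}

# Made-up fixed location table with a stray trailing space in one key.
fixed = {"University of Example ": (55.95, -3.19)}
lookup = build_lookup(fixed)

# The live data has since been cleaned, so an exact match on the raw key fails.
live_name = "University of Example"
print(fixed.get(live_name))
print(lookup.get(normalise(live_name)))
```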

In day 36 an additional workflow using a combination of R, Sweave and R2HTML was also explored. Using R opens the prospect of enhanced analysis of data being directly pulled via SPARQL queries to generate standard reports. The main question is whether there would be enough return on investment to set this up, or whether the focus should be on more lightweight solutions useable by a wider user base.

  • Issues around potential workflows for mirroring data from our PROD database and linking it to other datasets in our Kasabi triple store.

It was hoped that stored queries could be used with the Kasabi triple store but a number of issues prevented this.  The main reason was the issue of exposing Kasabi API keys when using client side visualisations and in shared Google Spreadsheets. There was also an issue with custom APIs within Kasabi not working when they included OPTIONAL statements in queries. This issue is being investigated by Kasabi.

During the course of the project it was also discovered that there is missing metadata on comments stored as linked data. Currently general comments are the only ones to include author and timestamp information. For deeper analysis of projects it might be worth also including this data in related project comments and comments associated with technology and standards choices [day 30b].

  • Identification of other datasets that would enhance PROD queries, and some exploration of how to transform and upload them.

No other data sets were identified.

  • General recommendations on wider issues of data, and observed data maintenance issues within PROD.

Exploring PROD data has been a useful opportunity to explore its validity and coverage. As mentioned earlier geo data for institutions, particularly for the OER Programme, is patchy (a list of missing institutions was given to Wilbert). Another observation was that stray whitespace in data values prevented some lookups from working. It might be worth seeing if these could be trimmed on data entry.
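To illustrate the whitespace problem, here is a minimal Python sketch of normalising institution names before a lookup. This is not the spreadsheet formula used in the project; the institution name, the coordinates and the helper names (normalise_name, build_lookup, find_location) are all hypothetical.

```python
import re

def normalise_name(name):
    """Trim leading/trailing whitespace and collapse internal runs to one space."""
    return re.sub(r"\s+", " ", name).strip()

def build_lookup(rows):
    """Index (name, lat, lng) rows by normalised, case-folded name."""
    return {normalise_name(name).casefold(): (lat, lng) for name, lat, lng in rows}

def find_location(lookup, name):
    """Return (lat, lng) for a name, or None if the institution is unknown."""
    return lookup.get(normalise_name(name).casefold())

# Hypothetical fixed location data captured with a trailing space.
fixed_rows = [("University of Example ", 55.95, -3.19)]
lookup = build_lookup(fixed_rows)

# The live name has since been tidied, but the lookup still succeeds
# because both sides are normalised in the same way.
print(find_location(lookup, "University of Example"))
```

Normalising on both sides of the lookup means it no longer matters whether the fixed copy or the live copy of the data was cleaned first.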

Another consideration, if CETIS want PROD data to be attractive to the data visualisation community, is to natively provide output in csv or JSON format. This is already partially implemented in the PROD API and is already available on some Kasabi stored queries using a csv XSLT stylesheet.
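As a rough idea of how little is needed to get from query results to csv, the sketch below flattens a document in the standard SPARQL JSON results format into csv text. It is a Python alternative to the XSLT route, not what Kasabi or PROD actually run, and the variable names and values in the sample are made up.

```python
import csv
import io
import json

def sparql_json_to_csv(results_json):
    """Flatten a SPARQL JSON results document into CSV text."""
    data = json.loads(results_json)
    variables = data["head"]["vars"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(variables)
    for binding in data["results"]["bindings"]:
        # Unbound variables (e.g. from OPTIONAL clauses) are simply absent,
        # so they are written as empty cells.
        writer.writerow([binding.get(v, {}).get("value", "") for v in variables])
    return out.getvalue()

# Hypothetical result set with one unbound value.
sample = json.dumps({
    "head": {"vars": ["project", "technology"]},
    "results": {"bindings": [
        {"project": {"type": "literal", "value": "Example Project"},
         "technology": {"type": "literal", "value": "RSS"}},
        {"project": {"type": "literal", "value": "Another Project"}},
    ]},
})
print(sparql_json_to_csv(sample))
```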

Content Outputs (WP2)

Before I mention some of the outputs it’s worth highlighting some of the processes to get there. This generally followed the stages of Paul Bradshaw’s The inverted pyramid of data journalism, highlighted in day 16: compile, clean, context and combine.

Two datasets were compiled and cleaned as part of this project: UKOER records on Jorum; and an archive of #ukoer tweets from April 2009 to January 2012 (the archive from April 2009 – March 2010 is only partial; more complete data recovered from TwapperKeeper exists for the remaining period).

UKOER records on Jorum

As the Jorum API was offline for the duration of this project an alternative method for compiling UKOER records had to be found. This resulted in the detailed documentation of a recipe for extracting OAI service records using Google Refine [day 11]. Using Google Refine proved very advantageous as not only were UKOER records extracted, but it was possible to clean and combine the dataset with other information. In particular three areas of enhancement were achieved:

  • reconciling creator metadata to institutional names - as this data is entered by the person submitting the record there can be numerous variations in format which can make this process difficult. Fortunately with Google Refine it was possible to extract enough data to match organisations stored in PROD data (made possible via the Kasabi Reconciliation API).
  • extracting Jorum record view counts by scraping record pages – a request was made for this data but at the time it wasn’t available. Google Refine was used to look up each Jorum record page and parse the view count that is publicly displayed.
  • returning social share counts for Jorum records – using the sharedcount.com API, social shares for a range of services (Facebook, Twitter, Google+ etc) were returned for individual Jorum records.
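The extraction itself was done in Google Refine, but for anyone who prefers a script, the sketch below shows roughly what parsing an OAI-PMH ListRecords response looks like in Python. The namespaces are the standard OAI-PMH and Dublin Core ones; the sample record, its identifier and the parse_list_records helper are invented for illustration, and no request is made to Jorum.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def parse_list_records(xml_text):
    """Return (records, resumption_token) from an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [el.text or "" for el in rec.iterfind(".//dc:creator", NS)],
            "subjects": [el.text or "" for el in rec.iterfind(".//dc:subject", NS)],
        })
    # A non-empty resumptionToken means there is another page to request.
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, token.strip() or None

# Hypothetical single-record response.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example resource</dc:title>
          <dc:creator>Example University</dc:creator>
          <dc:subject>ukoer</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
print(records, token)
```

The free-text creators list is the field that then needs reconciling against institution names.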

The processed Jorum UKOER dataset is available in this Google Spreadsheet (the top 380 Jorum UKOER social share counts are available separately)

#ukoer Twitter Archive

The process of extracting an archive of #ukoer tweets for March 2010 to January 2012 was more straightforward as this data was stored on TwapperKeeper. Tweets prior to this date were more problematic as no publicly compiled archive could be found. The solution was to extract partial data from the Topsy Otter API (the process for doing this is still to be documented).

The #ukoer Twitter archive is available in this Google Spreadsheet

In the following sections the visualisations produced are summarised.

  • Collections mapped by geographical location of the host institution

UKOER submissions

Title: Jorum UKOER Geo-Collections
About: Having reconciled the majority of Jorum UKOER records to an institution name, the geographic location of these institutions was obtained from PROD data. For institutions without this data a manual lookup was done. Results are published on a custom Google Map using a modified MarkerCluster Speed Test Example [day 20].
Type: Interactive web resource
Link: http://hawksey.info/maps/oer-records.html

  • Collections mapped by subject focus/Visualisations of the volume of collections


Title: Snowflake
About: Using data exported from Google Refine a custom log file was rendered using the open source visualisation package Gource. The visualisation shows institutional deposits to Jorum over time. Individual deposits (white dots) are clustered around Jorum subject classifications [day 11]
Type: Video
Link: http://www.youtube.com/watch?v=ekqnXztr0mU
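For anyone wanting to try something similar, Gource's custom log format is one pipe-delimited line per event: a Unix timestamp, a user name, an action (A for added) and a file path. The sketch below builds such lines from deposit records. Treating the institution as the 'user' and the subject classification as the 'directory' is my assumption about a sensible mapping, and the records shown are invented.

```python
from datetime import datetime, timezone

def to_gource_log(deposits):
    """Convert (date, institution, subject, record_id) tuples to Gource custom log lines."""
    lines = []
    for date_str, institution, subject, record_id in deposits:
        ts = int(datetime.strptime(date_str, "%Y-%m-%d")
                 .replace(tzinfo=timezone.utc).timestamp())
        # 'A' marks a file being added; the subject acts as the directory.
        lines.append((ts, f"{ts}|{institution}|A|{subject}/{record_id}"))
    # Gource expects entries in chronological order.
    return [line for _, line in sorted(lines)]

# Hypothetical deposits, deliberately out of order.
deposits = [
    ("2010-01-01", "Example University", "HE - Creative Arts and Design", "rec1"),
    ("2009-04-01", "Another Institution", "HE - Engineering", "rec2"),
]
log = to_gource_log(deposits)
print("\n".join(log))
```

Names containing a pipe character would need escaping or replacing before being written out.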

Subject Wheel

Title: Jorum Subject Wheel
About: Generated in NodeXL this image illustrates subject deposits by institutions. Line width indicates the number of deposits from the institution to the subject area and node size indicates the number of different subject areas the institution has deposited to. For example, Staffordshire University has made a lot of deposits to HE – Creative Arts and Design and very few to other subjects, while Leeds Metropolitan University has made deposits to lots of subjects.
Type: Image/Interactive web resource
Link: https://mashe.hawksey.info/wp-content/uploads/2012/02/SubjectCircle.jpg
Link: http://hawksey.info/nodegl/#0AqGkLMU9sHmLdFQ2RkhMc3hvbFRUMHJCdGU3Ujh3aGc (Interactive web resource)
Link: https://docs.google.com/spreadsheet/ccc?key=0AqGkLMU9sHmLdFQ2RkhMc3hvbFRUMHJCdGU3Ujh3aGc#gid=0 (Source Data)
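The two measures behind the wheel are simple counts, which the sketch below computes from a list of (institution, subject) pairs: one row per deposit. NodeXL did this work in the original; the Python function and the sample institutions here are hypothetical.

```python
from collections import Counter, defaultdict

def subject_wheel_metrics(deposits):
    """Return (edge_weights, subject_counts) from (institution, subject) pairs."""
    # Line width: number of deposits per institution-subject pair.
    edge_weights = Counter(deposits)
    # Node size: number of distinct subjects each institution deposited to.
    subjects = defaultdict(set)
    for institution, subject in deposits:
        subjects[institution].add(subject)
    subject_counts = {inst: len(s) for inst, s in subjects.items()}
    return edge_weights, subject_counts

# Hypothetical deposits: one specialist and one generalist institution.
deposits = [
    ("Institution A", "Creative Arts and Design"),
    ("Institution A", "Creative Arts and Design"),
    ("Institution A", "Creative Arts and Design"),
    ("Institution B", "Engineering"),
    ("Institution B", "Education"),
]
edge_weights, subject_counts = subject_wheel_metrics(deposits)
print(edge_weights, subject_counts)
```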

deposit dots

Title: Jorum Records Institutional Deposits Bubble Diagram
About: Creating a pivot report from the Jorum UKOER records in Google Spreadsheet, the data was then rendered using a modification of a raphaelJS library dots example. Bubble size indicates the number of records deposited by the institution per month.
Type: Interactive web resource
Link: http://hawksey.info/labs/raphdot.html
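The pivot report was built in Google Spreadsheet, but the same table can be sketched in a few lines of Python: rows are institutions, columns are months, and each cell is the number of records that would set the bubble size. The function name and sample records are illustrative only.

```python
from collections import Counter

def monthly_pivot(records):
    """Build a pivot table from (institution, 'YYYY-MM-DD') pairs.

    Returns a list of rows: a header row, then one row per institution
    with the number of records deposited in each month.
    """
    counts = Counter((inst, date[:7]) for inst, date in records)
    months = sorted({month for _, month in counts})
    institutions = sorted({inst for inst, _ in counts})
    table = [["Institution"] + months]
    for inst in institutions:
        # Counter returns 0 for months with no deposits.
        table.append([inst] + [counts[(inst, month)] for month in months])
    return table

# Hypothetical records.
records = [
    ("Inst A", "2010-01-15"),
    ("Inst A", "2010-01-20"),
    ("Inst B", "2010-02-01"),
]
table = monthly_pivot(records)
for row in table:
    print(row)
```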

  • Other visualisations

Capret great circle

Title: CaPRéT ‘great circle’ Tracking Map
About: Raw CaPRéT OER tracking data was processed in Google Refine converting IP log data for target website and copier location into longitude and latitude. The results were then processed in R. The map uses ‘great circle’ lines to illustrate the location of the source data and the location of the person taking a copy [day 32].
Type: Image
Link: http://mcdn.hawksey.info/wp-content/uploads/2012/01/capret.jpg
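The map itself was drawn in R. As a rough illustration of what a 'great circle' line involves, the Python sketch below interpolates points along the shortest path on a sphere between two latitude/longitude pairs. It is my own illustration of the standard spherical interpolation formula, not the code behind the image, and the coordinates used are arbitrary.

```python
import math

def great_circle_points(lat1, lon1, lat2, lon2, n=50):
    """Return n+1 (lat, lon) points in degrees along the great circle between two places."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    # Angular distance between the endpoints (haversine formula).
    h = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2) ** 2)
    d = 2 * math.asin(min(1.0, math.sqrt(h)))
    if d == 0:
        return [(lat1, lon1)] * (n + 1)
    points = []
    for i in range(n + 1):
        f = i / n  # fraction of the way from start to end
        a = math.sin((1 - f) * d) / math.sin(d)
        b = math.sin(f * d) / math.sin(d)
        x = a * math.cos(p1) * math.cos(l1) + b * math.cos(p2) * math.cos(l2)
        y = a * math.cos(p1) * math.sin(l1) + b * math.cos(p2) * math.sin(l2)
        z = a * math.sin(p1) + b * math.sin(p2)
        points.append((math.degrees(math.atan2(z, math.hypot(x, y))),
                       math.degrees(math.atan2(y, x))))
    return points

# Arbitrary example: two points on the equator, 90 degrees apart.
pts = great_circle_points(0, 0, 0, 90, n=2)
print(pts)
```

The path is not well defined for exactly opposite points on the globe, so a real script would need to handle that case separately.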

capret timemap

Title: CaPRéT timemap
About: Using the same data from the CaPRéT ‘great circle’ Tracking Map the data was rendered from a Google Spreadsheet using a modification of a timemap.js project example. Moving the time slider highlights the locations of people copying text tracked using CaPRéT. Clicking on a map pin opens an information popup with details of what was copied [day 32]
Type: Interactive web resource
Link: http://hawksey.info/maps/CaPReT.html
Link: https://docs.google.com/spreadsheet/ccc?key=0AqGkLMU9sHmLdDd5UXFGdEJGUjEyN3M5clU1X2R5V0E#gid=0 (Source Data)

heart of ukoer
Title: The heart of #ukoer
About: Produced using a combination of Gephi and R ‘the heart of #ukoer’ depicts the friend relationships between 865 Twitter accounts that used the #ukoer hashtag between April 2009 and January 2012. The image represents over 24,000 friendships and node size indicates the person’s weighted ‘betweenness centrality’ (how much of a community bridge that person is). Colours indicate internal community groups (programmatically detected). The wordclouds round the visualisation are a summary of that sub-group’s Twitter profile descriptions [day 37, revisited day 40].
Type: Image/Interactive web resource
Link: http://hawksey.info/labs/ukoer-community3-weighted-BC.jpg
Link: http://zoom.it/6ucv5
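To give a feel for what ‘betweenness centrality’ measures, here is a small Python sketch of the plain version of the calculation (Brandes' algorithm) for an undirected, unweighted graph. Gephi produced the weighted figures used in the image, so this is a simplified illustration rather than the calculation behind it, and the three-person network is made up.

```python
from collections import deque

def betweenness(adj):
    """Unnormalised betweenness centrality for an undirected, unweighted graph.

    adj maps each node to the set of its neighbours; every neighbour
    must also appear as a key.
    """
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair is counted once from each endpoint, so halve.
    return {v: score / 2 for v, score in bc.items()}

# Hypothetical network: b is the only bridge between a and c.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
scores = betweenness(adj)
print(scores)
```

In the toy network only b sits on a shortest path between two other people, which is the ‘community bridge’ idea in miniature.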

pulse of #ukoer

Title: The pulse of #ukoer
About: Produced using Gephi this image is a summary of the conversations between people using the #ukoer hashtag. Connecting lines are colour coded with green showing @replies, blue showing @mentions and red showing retweets [day 40].
Type: Image/Interactive web resource
Link: http://hawksey.info/labs/ukoer-conversation.jpg
Link: http://zoom.it/xpRG
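The three line colours correspond to three kinds of edge that can be pulled out of tweet text. The Python sketch below shows one simple way to classify them; the rules (a tweet starting with @ is a reply, one starting with RT @ is a retweet, anything else containing @name is a mention) are my own rough heuristics rather than the exact rules used for the image, and the account names are invented.

```python
import re

MENTION = re.compile(r"@(\w+)")

def classify_tweet(author, text):
    """Return a list of (source, target, kind) edges for one tweet."""
    stripped = text.lstrip()
    targets = MENTION.findall(stripped)
    if not targets:
        return []
    if re.match(r"RT\s+@\w+", stripped):
        # Old-style retweet: only the first account is treated as the target.
        return [(author, targets[0], "retweet")]
    if stripped.startswith("@"):
        # The first account is being replied to; any others are mentions.
        first, rest = targets[0], targets[1:]
        return [(author, first, "reply")] + [(author, t, "mention") for t in rest]
    return [(author, t, "mention") for t in targets]

# Hypothetical tweets.
print(classify_tweet("alice", "@bob nice #ukoer post"))
print(classify_tweet("alice", "RT @carol: new resource #ukoer"))
print(classify_tweet("alice", "thanks @dave and @erin #ukoer"))
```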

ball of stuff

Title: Interactive ball of stuff
About: Using the same data from the ‘pulse of #ukoer’ an interactive version of the #ukoer Twitter archive is rendered in the experimental TAGSExplorer. Clicking on a node allows the user to see all the tweets that person has made in the archive and replay part of the conversation [day 40].
Type: Interactive web resource
Link: http://hawksey.info/tagsexplorer/?key=0AqGkLMU9sHmLdHRhaEswb0xaLUJyQnFNSTVZZmVsMFE&sheet=od6

still a pulse

Title: Is there still a pulse
About: As part of the process of preserving #ukoer tweets a number of associated graphs used to detect the health of the #ukoer hashtag were produced. These are available in the Google Spreadsheet [day 40].
Type: Spreadsheet
Link: https://docs.google.com/spreadsheet/ccc?key=0AqGkLMU9sHmLdHRhaEswb0xaLUJyQnFNSTVZZmVsMFE#gid=3

Recommendations/Observations/Closing thoughts

The project has documented a number of recipes for data processing and visualisation, but in many ways has only exposed the tip of the iceberg. It is likely that with current financial constraints repository managers will increasingly be required to illustrate value for money and impact. Data analysis and visualisation can help with both aspects: monitoring repository use, but equally being used in an intelligence mode to identify possible gaps and proactively leverage repository resources. It was interesting to discover a lack of social sharing of Jorum records (day 24 | day 30) and perhaps more can be done in the area of frictionless sharing (see Tony Hirst’s draft WAR bid).

This project has mainly focused on Jorum and the #ukoer Twitter hashtag, due to time constraints as well as the amount of time required to compile a useful dataset. It would be useful if these datasets were more readily available, but I imagine this is less of a problem for internal repository analysis as data is easier to access.

Towards the end of the project focus shifted towards institutional repositories, some work being done to assist University of Oxford (and serendipitously Leeds Metropolitan University). If this work is to be taken forward, and this may already be a part of the OER Rapid Innovation projects, more work needs to be done with institutional repository managers to surface the tools and recipes they need to help them continue to push their work forward.

Whilst not a purposeful aim of this project it’s very fitting that all of the tools and visualisation libraries used in this project are open source or freely available. This rich selection of high quality tools and libraries also means that all of the project outputs are replicable without any software costs.

An area that is however lacking is documented uses of these tools for OER analysis. Recipes for extracting OAI data for desktop analysis were, until this project, non-existent, and as mentioned more work potentially needs to be done in this area to streamline the processes for compiling, cleaning and communicating repository data.

To help the sharing of OER data I would encourage institutions to adopt an open data philosophy including information on how this data can be accessed (the Ghent University Downloads/API page is an example of good practice). There is also a lack of activity data being recorded around OER usage. This is a well established issue and hopefully projects like Learning Registry/JLeRN can address this. It’s however worth remembering that these projects are very unlikely to be magic bullets and projects like CaPRéT still have an important role.

Was this project a success? I would say partially. Looking back at the selection of visualisations produced I feel there could have been more. So much time was spent creating recipes for data extraction and analysis that it left little time for visualisation. I hope what has been achieved is of benefit to the sector and it’s reassuring that outputs from this project are already being used elsewhere.

Project Posts


On Tuesday 19th June I’ll be presenting at the Institutional Web Management Workshop (IWMW) in Edinburgh … twice! Tony Hirst and I are continuing our tour, which started at the JISC CETIS Conference 2012, before hitting the stage at GEUG12. For IWMW12 we are doing a plenary and workshop around data visualisation (the plenary being a taster for our masterclass workshop). I’ll be using this post as a holder for all the session resources.

Update: I've also added Tony Hirst's (OU) slides. Tony went on first to introduce some broad data visualisation themes before I went into a specific case study.

The draft slides for my part of the plenary are embedded below and available from Slideshare and Google Presentation (the slides are designed for use with pptPlex, but hopefully they still make sense). For the session I’m going to use the OER Visualisation Project to illustrate the processes required to get a useful dataset and how the same data can be visualised in a number of ways depending on audience and purpose. Update: I should have said the session should be streamed live; details will appear on the IWMW site.

Update: As a small aside I've come up with a modified version of Craig Russell's UK Universities Social Media table, as mentioned in the Further Evidence of Use of Social Networks in the UK Higher Education Sector guest post on UKWebFocus (something more 'glanceable'). Using the Twitter account list as a starting point I've looked at how University accounts follow each other and come up with this (click on the image for an interactive version).

If you have any questions feel free to leave a comment or get in touch.