On Tuesday 19th June I’ll be presenting at the Institutional Web Manager Workshop (IWMW) in Edinburgh … twice! Tony Hirst and I are continuing our tour, which started at the JISC CETIS Conference 2012, before hitting the stage at GEUG12. For IWMW12 we are doing a plenary and workshop around data visualisation (the plenary being a taster for our masterclass workshop). I’ll be using this post as a holder for all the session resources.

Update: I've also added Tony Hirst's (OU) slides. Tony went on first to introduce some broad data visualisation themes before I went into a specific case study.

The draft slides for my part of the plenary are embedded below and available from Slideshare and Google Presentation (the slides are designed for use with pptPlex, but hopefully they still make sense). For the session I’m going to use the OER Visualisation Project to illustrate the processes required to get a useful dataset and how the same data can be visualised in a number of ways depending on audience and purpose. Update: I should have said the session should be streamed live; details will appear on the IWMW site.

Update: As a small aside I've come up with a modified version of Craig Russell's UK Universities Social Media table as mentioned in Further Evidence of Use of Social Networks in the UK Higher Education Sector guest post on UKWebFocus (something more 'glanceable'). Using the Twitter account list as a starting point I've looked at how University accounts follow each other and come up with this (click on the image for an interactive version).

If you have any questions feel free to leave a comment or get in touch.


This is the final official post (a little later than expected) for the OER Visualisation Project. In this post I’ll summarise the work and try to answer the questions originally raised in the project specification.

Over the course of the project there have been 17 blog posts, including this one, listed at the end and accessible via the ooh-er category. The project was tasked with two work packages: PROD Data Analysis (WP1); and Content Outputs (WP2). The nature of these packages means there is overlap in deliverables but I will attempt to deal with them separately.

PROD Data Analysis (WP1)

  • Examples of enhanced data visualisations from OER Phase 1 and 2.

This part of the project mainly focused on process rather than product. It’s rewarding to see that CETIS staff are already using some of the documented processes to generate their own visualisations (David’s post | Sheila’s post). Visualisations that were produced include: OER Phase 1 and 2 maps [day 20], timelines [day 30], wordclouds [day 36] and project relationships [day 8].

OER Map | OER Timeline | OER Wordcloud | OER Project relates force layout


  • Recommendations on use and applicability of visualisation libraries with PROD data to enhance the existing OER dataset.

The main recommendation is that visualisation libraries should be integrated into the PROD website or workflows. In particular it is perceived that it would be most useful to incorporate summary visualisations in general queries for programme/strand information. The most useful visualisations would be automatically generated versions of the charts already used by CETIS staff in their summaries of programme information (histograms, bubble charts and word clouds). Most of these are possible using the Google Visualisation API. This may emerge as the preferred library given Wilbert’s existing work with the Sgvizler library. The disadvantage of the Google Visualisation API is that the appearance of the charts is very standard, and investment in a more glossy solution such as raphaelJS (used in WP2) might be considered. [Although I never got around to it I’m also still interested in investigating an entirely new navigation/exploration solution based on my Guardian Tag Explorer]


  • Recommendations and example workflows including sample data base queries used to create the enhanced visualisations.

In the blog post on day 8 an example spreadsheet was created (which has been further developed) and shared for reuse. As part of this process an online document of SPARQL queries used in the spreadsheet was also created and maintained. Throughout the project this resource has been revisited and reused as part of other outputs. It has also been used to quickly filter PROD data.

This technique of importing SPARQL data has already been highlighted by CETIS staff but to move the idea forward it might be worth considering releasing a set of prepared spreadsheets to the public. The main advantage of this is there is a steep learning curve when using linked data and SPARQL queries. Having a set of spreadsheets ready for people to use should make PROD data more accessible, allowing people to explore the information in a familiar environment.

An issue to consider if promoting the PROD Datastore Spreadsheet is whether to make a live link to the PROD data or to provide it as fixed data. For example, the mapping solutions from day 20 broke after PROD data was updated. Removing whitespace from some institution names meant that location data, which was looked up using a combination of fixed and live data, failed to find the amended institution names. This is not so much an issue with PROD data but with the way the spreadsheet was developed. Consequently, if CETIS were to go down the route of providing ready-made Google Spreadsheets it should make clear to users that data might change at any point. Alternatively, instead of providing spreadsheets with live data, CETIS might consider releasing versions of spreadsheets with fixed results (this could be provided for programmes or strands which have been completed). Production of the static sheets could be achieved manually by uploading a collection of csv reports (sheets for project information, technologies, standards etc) or be automated using Google Apps Script.
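The lookup failure described above can be guarded against by normalising names on both sides of the join before matching. Here is a minimal sketch in Python, assuming a hypothetical table of locations keyed by institution name. The institution name, the coordinates and the helper functions are illustrative and are not taken from the PROD spreadsheet.

```python
import re

def normalise(name):
    """Collapse runs of whitespace, trim both ends and lower-case the name."""
    return re.sub(r"\s+", " ", name).strip().lower()

def build_lookup(locations):
    """Index a {name: (lat, lng)} table by its normalised name."""
    return {normalise(name): coords for name, coords in locations.items()}

def find_location(lookup, name):
    """Return the coordinates for a name, or None when there is no match."""
    return lookup.get(normalise(name))

# Hypothetical fixed data, captured while the live name still had a trailing space
fixed = {"University of Example ": (55.95, -3.19)}
lookup = build_lookup(fixed)

# The tidied-up live name still matches after normalisation
print(find_location(lookup, "University of Example"))
```

Normalising at lookup time means the sheet keeps working whether or not the source data is later trimmed.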

In day 36 an additional workflow using a combination of R, Sweave and R2HTML was also explored. Using R opens the prospect of enhanced analysis of data being directly pulled via SPARQL queries to generate standard reports. The main question is whether there would be enough return on investment to set this up, or whether the focus should be on more lightweight solutions usable by a wider user base.

  • Issues around potential workflows for mirroring data from our PROD database and linking it to other datasets in our Kasabi triple store.

It was hoped that stored queries could be used with the Kasabi triple store but a number of issues prevented this.  The main reason was the issue of exposing Kasabi API keys when using client side visualisations and in shared Google Spreadsheets. There was also an issue with custom APIs within Kasabi not working when they included OPTIONAL statements in queries. This issue is being investigated by Kasabi.

During the course of the project it was also discovered that there is missing metadata on comments stored as linked data. Currently general comments are the only ones to include author and timestamp information. For deeper analysis of projects it might be worth also including this data in related project comments and comments associated with technology and standards choices [day 30b].

  • Identification of other datasets that would enhance PROD queries, and some exploration of how transform and upload them.

No other data sets were identified.

  • General recommendations on wider issues of data, and observed data maintenance issues within PROD.

Exploring PROD data has been a useful opportunity to explore its validity and coverage. As mentioned earlier, geo data for institutions, particularly for the OER Programme, is patchy (a list of missing institutions was given to Wilbert). Another observation was that whitespace in data values prevented some lookups from working. It might be worth seeing if this could be trimmed on data entry.

Another consideration if CETIS want PROD data to be attractive to the data visualisation community is to natively provide output in csv or JSON format. This is already partially implemented in the PROD API and already available on some Kasabi stored queries using a csv XSLT stylesheet.

Content Outputs (WP2)

Before I mention some of the outputs it’s worth highlighting some of the processes to get there. This generally followed the stages of Paul Bradshaw’s The inverted pyramid of data journalism, highlighted in day 16: compile, clean, context and combine.

Two datasets were compiled and cleaned as part of this project: UKOER records on Jorum; and an archive of #ukoer tweets from April 2009* to January 2012 (the archive from April 2009 – March 2010 is only partial; more complete data recovered from TwapperKeeper exists for the remaining period).

UKOER records on Jorum

As the Jorum API was offline for the duration of this project an alternative method for compiling UKOER records had to be found. This resulted in the detailed documentation of a recipe for extracting OAI service records using Google Refine [day 11]. Using Google Refine proved very advantageous: not only were ukoer records extracted, but it was possible to clean and combine the dataset with other information. In particular three areas of enhancement were achieved:

  • reconciling creator metadata to institutional names - as this data is entered by the person submitting the record there can be numerous variations in format which can make this process difficult. Fortunately with Google Refine it was possible to extract enough data to match organisations stored in PROD data (made possible via the Kasabi Reconciliation API).
  • extract Jorum record view counts by scraping record pages – a request was made for this data but at the time it wasn’t available. Google Refine was used to look up each Jorum record page and parse the view count that is publicly displayed.
  • return social share counts for Jorum records – using the sharedcount.com API social shares for a range of services (Facebook, Twitter, Google+ etc) were returned for individual Jorum records
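The post used Google Refine for the OAI extraction, but the shape of the data is easier to see in code. Here is a minimal Python sketch that parses an OAI-PMH `ListRecords` response carrying Dublin Core metadata into flat rows. The sample XML, identifier and institution name are made up for illustration. Only the three namespace URIs are the standard OAI-PMH and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH 2.0, oai_dc and Dublin Core namespaces
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Illustrative response only; a real harvest would fetch this from the repository
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:123</identifier>
        <datestamp>2011-05-04T10:00:00Z</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example resource</dc:title>
          <dc:creator>Example University</dc:creator>
          <dc:subject>ukoer</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def parse_records(xml_text):
    """Turn each OAI-PMH record into a dict of header and Dublin Core fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iterfind(".//oai:record", NS):
        rows.append({
            "identifier": record.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "datestamp": record.findtext("oai:header/oai:datestamp", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            # creator and subject may repeat, so keep them as lists
            "creators": [e.text for e in record.iterfind(".//dc:creator", NS)],
            "subjects": [e.text for e in record.iterfind(".//dc:subject", NS)],
        })
    return rows

rows = parse_records(SAMPLE)
print(rows[0]["title"], rows[0]["creators"])
```

The free-text `creators` values are the ones that would then need reconciling against institution names.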

The processed Jorum UKOER dataset is available in this Google Spreadsheet (the top 380 Jorum UKOER social share counts are available separately)

#ukoer Twitter Archive

The process of extracting an archive of #ukoer tweets for March 2010 to January 2012 was more straightforward as this data was stored on TwapperKeeper. Tweets prior to this date were more problematic as no publicly compiled archive could be found. The solution was to extract partial data from the Topsy Otter API (the process for doing this is still to be documented).

The #ukoer Twitter archive is available in this Google Spreadsheet

In the following sections the visualisations produced are summarised.

  • Collections mapped by geographical location of the host institution

UKOER submissions

Title: Jorum UKOER Geo-Collections
About: Having reconciled the majority of Jorum UKOER records to an institution name, the geographic location of these institutions was obtained from PROD data. For institutions without this data a manual lookup was done. Results are published on a custom Google Map using a modified MarkerCluster Speed Test Example [day 20]
Type: Interactive web resource
Link: http://hawksey.info/maps/oer-records.html

  • Collections mapped by subject focus/Visualisations of the volume of collections


Title: Snowflake
About: Using data exported from Google Refine a custom log file was rendered using the open source visualisation package Gource. The visualisation shows institutional deposits to Jorum over time. Individual deposits (white dots) are clustered around Jorum subject classifications [day 11]
Type: Video
Link: http://www.youtube.com/watch?v=ekqnXztr0mU

Subject Wheel

Title: Jorum Subject Wheel
About: Generated in NodeXL this image illustrates subject deposits by institutions. Line width indicates the number of deposits from the institution to the subject area and node size indicates the number of different subject areas the institution has deposited to. For example, Staffordshire University has made a lot of deposits to HE – Creative Arts and Design and very few other subjects, while Leeds Metropolitan University has made deposits to lots of subjects.
Type: Image/Interactive web resource
Link: https://mashe.hawksey.info/wp-content/uploads/2012/02/SubjectCircle.jpg
Link: http://hawksey.info/nodegl/#0AqGkLMU9sHmLdFQ2RkhMc3hvbFRUMHJCdGU3Ujh3aGc (Interactive web resource)
Link: https://docs.google.com/spreadsheet/ccc?key=0AqGkLMU9sHmLdFQ2RkhMc3hvbFRUMHJCdGU3Ujh3aGc#gid=0 (Source Data)

deposit dots

Title: Jorum Records Institutional Deposits Bubble Diagram
About: Creating a pivot report from the Jorum UKOER records in Google Spreadsheet, the data was then rendered using a modification of a raphaelJS library dots example. Bubble size indicates the number of records deposited by the institution per month.
Type: Interactive web resource
Link: http://hawksey.info/labs/raphdot.html

  • Other visualisations

Capret great circle

Title: CaPRéT ‘great circle’ Tracking Map
About: Raw CaPRéT OER tracking data was processed in Google Refine converting IP log data for target website and copier location into longitude and latitude. The results were then processed in R. The map uses ‘great circle’ lines to illustrate the location of the source data and the location of the person taking a copy [day 32].
Type: Image
Link: http://mcdn.hawksey.info/wp-content/uploads/2012/01/capret.jpg
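The map itself was produced in R. The geometry behind a ‘great circle’ line can be sketched in Python as follows. This is an illustration of the standard spherical interpolation formula, not the code used in the project, and the function names are made up for this example.

```python
import math

def angular_distance(lat1, lon1, lat2, lon2):
    """Central angle in radians between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * math.asin(min(1.0, math.sqrt(a)))

def great_circle_points(lat1, lon1, lat2, lon2, n=50):
    """Return n (lat, lon) points along the great circle between two places."""
    d = angular_distance(lat1, lon1, lat2, lon2)
    # Identical or antipodal points have no unique path, so return the endpoints
    if n < 2 or math.isclose(math.sin(d), 0.0, abs_tol=1e-12):
        return [(lat1, lon1), (lat2, lon2)]
    phi1, lam1 = math.radians(lat1), math.radians(lon1)
    phi2, lam2 = math.radians(lat2), math.radians(lon2)
    points = []
    for i in range(n):
        f = i / (n - 1)  # fraction of the way from source to copier
        a = math.sin((1 - f) * d) / math.sin(d)
        b = math.sin(f * d) / math.sin(d)
        x = a * math.cos(phi1) * math.cos(lam1) + b * math.cos(phi2) * math.cos(lam2)
        y = a * math.cos(phi1) * math.sin(lam1) + b * math.cos(phi2) * math.sin(lam2)
        z = a * math.sin(phi1) + b * math.sin(phi2)
        points.append((math.degrees(math.atan2(z, math.hypot(x, y))),
                       math.degrees(math.atan2(y, x))))
    return points

# Three points along the equator from longitude 0 to longitude 90
pts = great_circle_points(0, 0, 0, 90, n=3)
print(pts)
```

Plotting the returned points as a polyline gives the curved line between the source website and the person taking a copy.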

capret timemap

Title: CaPRéT timemap
About: Using the same data from the CaPRéT ‘great circle’ Tracking Map the data was rendered from a Google Spreadsheet using a modification of a timemap.js project example. Moving the time slider highlights the locations of people copying text tracked using CaPRéT. Clicking on a map pin opens an information popup with details of what was copied [day 32]
Type: Interactive web resource
Link: http://hawksey.info/maps/CaPReT.html
Link: https://docs.google.com/spreadsheet/ccc?key=0AqGkLMU9sHmLdDd5UXFGdEJGUjEyN3M5clU1X2R5V0E#gid=0 (Source Data)

heart of ukoer
Title: The heart of #ukoer
About: Produced using a combination of Gephi and R ‘the heart of #ukoer’ depicts the friend relationships between 865 Twitter accounts who used the #ukoer hashtag from April 2009 to January 2012. The image represents over 24,000 friendships and node size indicates the person’s weighted ‘betweenness centrality’ (how much of a community bridge that person is). Colours indicate internal community groups (programmatically detected). The wordclouds round the visualisation are a summary of that sub-group’s Twitter profile descriptions [day 37, revisited day 40].
Type: Image/Interactive web resource
Link: http://hawksey.info/labs/ukoer-community3-weighted-BC.jpg
Link: http://zoom.it/6ucv5

pulse of #ukoer

Title: The pulse of #ukoer
About: Produced using Gephi this image is a summary of the conversations between people using the #ukoer hashtag. Connecting lines are colour coded: green for @replies, blue for @mentions and red for retweets [day 40].
Type: Image/Interactive web resource
Link: http://hawksey.info/labs/ukoer-conversation.jpg
Link: http://zoom.it/xpRG

ball of stuff

Title: Interactive ball of stuff
About: Using the same data from the ‘pulse of #ukoer’ an interactive version of the #ukoer twitter archive is rendered in the experimental TAGSExplorer. Clicking on a node allows the user to see all the tweets that person has made in the archive and replay part of the conversation [day 40].
Type: Interactive web resource
Link: http://hawksey.info/tagsexplorer/?key=0AqGkLMU9sHmLdHRhaEswb0xaLUJyQnFNSTVZZmVsMFE&sheet=od6

still a pulse

Title: Is there still a pulse
About: As part of the process of preserving #ukoer tweets a number of associated graphs used to detect the health of the #ukoer hashtag were produced. These are available in the Google Spreadsheet [day 40].
Type: Spreadsheet
Link: https://docs.google.com/spreadsheet/ccc?key=0AqGkLMU9sHmLdHRhaEswb0xaLUJyQnFNSTVZZmVsMFE#gid=3

Recommendations/Observations/Closing thoughts

The project has documented a number of recipes for data processing and visualisation, but in many ways has only exposed the tip of the iceberg. It is likely that with current financial constraints repository managers will increasingly be required to illustrate value for money and impact. Data analysis and visualisation can help with both aspects, helping monitor repository use, but equally be used in an intelligence mode to identify possible gaps and proactively leverage repository resources. It was interesting to discover a lack of social sharing of Jorum records (day 24 | day 30) and perhaps more can be done in the area of frictionless sharing (see Tony Hirst’s draft WAR bid).

This project has mainly focused on Jorum and the #ukoer Twitter hashtag, due to time constraints as well as the amount of time required to compile a useful dataset. It would be useful if these datasets were more readily available, but I imagine this is less of a problem for internal repository analysis as data is easier to access.

Towards the end of the project focus shifted towards institutional repositories, some work being done to assist University of Oxford (and serendipitously Leeds Metropolitan University). If this work is to be taken forward, and this may already be a part of the OER Rapid Innovation projects, more work needs to be done with institutional repository managers to surface the tools and recipes they need to help them continue to push their work forward.

Whilst not a purposeful aim of this project it’s very fitting that all of the tools and visualisation libraries used in this project are open source or freely available. This rich selection of high quality tools and libraries also means that all of the project outputs are replicable without any software costs.

An area that is however lacking is documented uses of these tools for OER analysis. Recipes for extracting OAI data for desktop analysis were, until this project, non-existent, and as mentioned more work potentially needs to be done in this area to streamline the processes for compiling, cleaning and communicating repository data.

To help the sharing of OER data I would encourage institutions to adopt an open data philosophy including information on how this data can be accessed (the Ghent University Downloads/API page is an example of good practice). There is also a lack of activity data being recorded around OER usage. This is a well established issue and hopefully projects like Learning Registry/JLeRN can address this. It’s however worth remembering that these projects are very unlikely to be magic bullets and projects like CaPRéT still have an important role.

Was this project a success? I would say partially. Looking back at the selection of visualisations produced I feel there could have been more. So much time was spent creating recipes for data extraction and analysis that it left little time for visualisation. I hope what has been achieved is of benefit to the sector and it’s reassuring that outputs from this project are already being used elsewhere.

Project Posts


It’s the last day of the OER Visualisation Project and this is my penultimate ‘official’ post. Having spent 40 days unlocking some of the data around the OER Programme there are more things I’d like to do with the data, some loose ends in terms of how-to’s I still want to document and some ideas I want to revisit. In the meantime here are some of the outputs from my last task, looking at the #ukoer hashtag community. This follows on from day 37 when I looked at ‘the heart of #ukoer’, this time looking at some of the data pumping through the veins of UKOER. It’s worth noting that the information I’m going to present is a snapshot of OER activity, only looking at a partial archive of information tweeted using the #ukoer hashtag from April 2009 to the beginning of January 2012, but hopefully it gives you a sense of what is going on.

The heart revisited

I revisited the heart after I read Tony Hirst’s What is the Potential Audience Size for a Hashtag Community?. In the original heart, nodes were sized using ‘betweenness centrality’, which is a social network metric to identify nodes which are community bridges, nodes which provide a pathway to other parts of the community. Betweenness centrality calculated on a friendship network takes no account of how much that person may have contributed. So for example someone like John Robertson (@KavuBob) was originally ranked as having the 20th highest betweenness centrality in the #ukoer hashtag community, while JISC Digital Media (@jiscdigital) is ranked 3rd. But if you look at how many tweets John has contributed (n.438) compared to JISC Digital Media (n.2), isn’t John’s potential ‘bridging’ ability higher?

There may be some research in this area, and I have to admit I haven’t had the chance to look, but for now I decided to weight betweenness centrality based on the volume of the archive the user has contributed. So John goes from ranked 20th to 3rd and JISC Digital Media goes from 3rd to 55th. Here’s a graph of the winners and losers (click on the image to enlarge).
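The post does not give the exact weighting formula, so the following Python sketch makes an assumption: each account’s betweenness centrality is multiplied by its share of the tweets in the archive. The betweenness values and the `other` account are invented for illustration. Only the tweet counts of 438 and 2 come from the text.

```python
def weighted_betweenness(betweenness, tweet_counts):
    """Scale betweenness by each account's share of the archive (assumed formula)."""
    total = sum(tweet_counts.values())
    if total == 0:
        return {user: 0.0 for user in betweenness}
    return {user: bc * tweet_counts.get(user, 0) / total
            for user, bc in betweenness.items()}

def rank(scores):
    """Map each account to its rank, with 1 being the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {user: position + 1 for position, user in enumerate(ordered)}

# Hypothetical betweenness values; tweet counts for the two named accounts are from the post
betweenness = {"KavuBob": 0.010, "jiscdigital": 0.050, "other": 0.030}
tweets = {"KavuBob": 438, "jiscdigital": 2, "other": 60}

before = rank(betweenness)
after = rank(weighted_betweenness(betweenness, tweets))
print(before)
print(after)
```

With these made-up numbers the low-volume account drops from first to last and the prolific one moves to the top, which is the kind of reshuffle described above.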

Here is the revised heart on zoom.it (and if zoom.it doesn’t work for you, the heart as a .jpg).

The 'heart' of #ukoer (click to enlarge)

[In the bottom left you’ll notice I’ve included a list of top community contributors (based on weighted betweenness) – a small reward for those people (I was all out of #ukoer t-shirts).]

These slides also show the difference in weighted betweenness centrality (embedded below). You should ignore the change in colour palette; the node text size depicts betweenness centrality weight [Google presentation has come on a lot recently – worth a look if you are sick of the clutter of slideshare]:


The ‘pulse’ of #ukoer

In previous work I’ve explored visualising Twitter conversations using my TAGSExplorer. Because of the way I reconstructed the #ukoer twitter archive (a story for another day) it’s compatible with this tool so you can see and explore the #ukoer archive of the 8300 tweets I’ve saved here. One of the problems I’m finding with this tool is it takes a while to get the data from the Google Spreadsheet for big archives.

This problem was also encountered in Sam’s Visualising Twitter Networks: John Terry Captaincy Controversy. As TAGSExplorer internally generates a graph of the conversation, rather than scratching my head on some R script it was easy to expose this data so that it can be imported into Gephi. So now if you add &output=true to a TAGSExplorer url you get a comma separated edge list to use with your SNA package of choice (the window may be blocked as a pop-up, so you may need to allow it). Here is the link for the #ukoer archive with edges for replies, mentions and retweets (which generates ‘a ball of awesome stuff’ (see insert above) but will eat your browser performance).
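To show what such an edge list looks like, here is a rough Python sketch that derives reply, mention and retweet edges from tweet text and writes them as CSV. It is not TAGSExplorer’s actual logic. The field names `from_user` and `text`, the classification heuristics and the sample tweets are all assumptions made for illustration.

```python
import csv
import io
import re

MENTION = re.compile(r"@(\w+)")

def tweet_edges(tweets):
    """Return (source, target, type) edges using simple text heuristics."""
    edges = []
    for tweet in tweets:
        source, text = tweet["from_user"], tweet["text"]
        mentioned = MENTION.findall(text)
        if not mentioned:
            continue  # no one addressed, so no edge
        if re.match(r"\s*RT\s+@", text):
            # old-style retweet: link only to the account being retweeted
            edges.append((source, mentioned[0], "retweet"))
        elif text.lstrip().startswith("@"):
            # a tweet opening with @name is treated as a reply to that account
            edges.append((source, mentioned[0], "reply"))
            edges.extend((source, name, "mention") for name in mentioned[1:])
        else:
            edges.extend((source, name, "mention") for name in mentioned)
    return edges

def edges_to_csv(edges):
    """Serialise edges as a comma separated list with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["source", "target", "type"])
    writer.writerows(edges)
    return buffer.getvalue()

# Made-up tweets for illustration
sample = [
    {"from_user": "a", "text": "@b thanks for the link"},
    {"from_user": "c", "text": "RT @a: hello #ukoer"},
    {"from_user": "d", "text": "nice work by @a and @b"},
    {"from_user": "e", "text": "nothing to see here"},
]
edges = tweet_edges(sample)
print(edges_to_csv(edges))
```

A file in this source, target, type layout can be loaded into a network analysis package as an edge table, with the type column available for colouring lines.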

Processing the data in Gephi you get a similar ball of awesome stuff (ukoer conversation on zoom.it | ukoer conversation .jpg). What does it all mean, I hear you ask. These flat images don’t tell you a huge amount. Being able to explore what was said is very powerful (hence coming up with TAGSExplorer). You can however see a lot of mentions (coloured blue, with line width indicating volume) in the centre between a small number of people. It’s also interesting to contrast OLNet top right and 3d_space mid left. OLNet has a number of green lines radiating out, indicating @replies and suggesting they are in conversations with individuals using the #ukoer tag. This compares to 3d_space, which has red lines indicating retweets, suggesting they are more engaged in broadcast.

Is there still a pulse?

When looking at the ‘ball of awesome stuff’ it’s important to remember that this is a depiction of over 8,000 tweets from April 2009 to January 2012. How do we know if this tag is alive and kicking and not just burned out like a dwarf star?

The good news is there is still a pulse within #ukoer, or more accurately lots of individual pulses. The screenshot to the right is an extract from this Google Spreadsheet of #UKOER. As well as including 8,300 tweets from #ukoer it also lists the twitter accounts that have used this tag. On this sheet are sparklines indicating the number of tweets in the archive they’ve made and when. At the top of the list you can see some strong pulses from UKOER, xpert_project and KavuBob. You can also see others just beginning or ending their ukoer journey.

The good news is the #ukoer hashtag community is going strong, with December 2011 having the most tweets in one month, and the number of unique Twitter accounts using the tag has probably by now tipped over the 1,000 mark.
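Figures like these come from two simple aggregations over the archive: tweets per month and the running total of accounts seen so far. Here is a minimal Python sketch, assuming the archive has been reduced to (account, timestamp) pairs. The sample data is invented.

```python
from collections import Counter
from datetime import datetime

def monthly_stats(tweets):
    """Return (month, tweets that month, cumulative unique accounts) rows."""
    per_month = Counter()
    first_seen = {}
    for user, created in sorted(tweets, key=lambda t: t[1]):
        month = created.strftime("%Y-%m")
        per_month[month] += 1
        first_seen.setdefault(user, month)  # month of the account's first tweet
    new_accounts = Counter(first_seen.values())
    rows, running = [], 0
    for month in sorted(per_month):  # months with no tweets are skipped
        running += new_accounts[month]
        rows.append((month, per_month[month], running))
    return rows

# Made-up archive extract for illustration
sample = [
    ("a", datetime(2011, 11, 3)),
    ("b", datetime(2011, 11, 20)),
    ("a", datetime(2011, 12, 1)),
    ("c", datetime(2011, 12, 5)),
    ("c", datetime(2011, 12, 9)),
]
stats = monthly_stats(sample)
for row in stats:
    print(row)
```

The second column is what a per-account sparkline summarises, and the third gives the community growth curve.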

#ukoer community growth

There is more for you to explore in this spreadsheet but alas I have a final post to write so you’ll have to be your own guide. Leave a comment if you find anything interesting or have any questions.

[If you would like to explore both the ‘heart’ and ‘pulse’ graphs more closely I’ve uploaded them to my installation of Raphaël Velt's Gexf-JS Viewer (it can take 60 seconds to render the data). This also means the .gexf files are available for download:]


Last week I started to play with the #ukoer hashtag archive (which has generated lots of useful coding snippets to process the data that I still need to blog … doh!). In the meantime I thought I’d share an early output. Embedded below is a zoom.it of the #ukoer hashtag community. The sketch (HT @psychemedia) is from a partial list* of Twitter users (n. 865) who have used the #ukoer hashtag in the last couple of years and who they currently follow. The image represents over 24,000 friendships, the average person having almost 30 connections to other people in the community.


3D Heart SSD
Originally uploaded by Generation X-Ray
Publishing an early draft of this image generated a couple of ‘it looks like’ comments (HT @LornaMCampbell @glittrgirl). To me it looks like a heart, hence the title of this post. The other thing that usually raises questions is how the colour groupings are defined (HT @ambrouk). The answer in this case is that they are generated by a modularity algorithm which tries to automatically detect community structure.
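The post does not say which modularity algorithm was used, so the Python sketch below only shows the score such algorithms try to maximise: the modularity Q of a given grouping on an undirected, unweighted graph without self-loops. The example graph of two triangles joined by one edge is made up.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q for an undirected, unweighted graph.

    edges is a list of (u, v) pairs with no self-loops;
    community maps each node to a group label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)  # edges with both ends in the same group
    degree = defaultdict(int)    # total degree of each group
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Q = sum over groups of (share of internal edges) - (expected share)
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single bridging edge
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_group = {node: "A" for node in range(1, 7)}

print(modularity(edges, two_groups))
print(modularity(edges, one_group))
```

Splitting the graph into its two triangles scores higher than lumping everything together, which is why a detection algorithm would colour them as separate groups.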

As an experiment I’ve filtered the Twitter profile information used for each of these groupings and generated a wordcloud using R (The R script used is a slight modification of one I’ve submitted to the Twitter Backchannel Analysis repository Tony started – something else I need to blog about. The modification is to SELECT a column WHERE modclass=something).
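The original used R, and that script is not reproduced here. As a rough Python equivalent of the idea, the sketch below filters profile descriptions by group and counts the words a wordcloud would be sized by. The `modclass` and `description` field names, the stopword list and the sample profiles are assumptions for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer
STOPWORDS = {"the", "and", "of", "in", "to", "for", "at", "my", "on", "is"}

def group_word_counts(profiles, modclass, top=5):
    """Count words in the descriptions of accounts in one modularity class."""
    counts = Counter()
    for profile in profiles:
        if profile["modclass"] != modclass:
            continue  # the 'WHERE modclass=something' step
        words = re.findall(r"[a-z']+", profile["description"].lower())
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top)

# Made-up profiles for illustration
profiles = [
    {"modclass": 1, "description": "Learning technologist interested in open education"},
    {"modclass": 1, "description": "Open education researcher"},
    {"modclass": 2, "description": "Librarian and open access advocate"},
]
print(group_word_counts(profiles, 1, top=2))
```

The resulting word and frequency pairs are the input a wordcloud renderer needs for each community.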

Right, all this post has done is remind me of my post backlog and I’ve got more #ukoer visualisation to do so I’d better get on with it.

*it’s a partial list because as far as I know there isn’t a complete archive of #ukoer tweets. The data I’m working from is from an export from TwapperKeeper for March 2010-Jan 2012  topped up with some data from Topsy for April 2009-March 2010


I’m in the final stretch of the OER Visualisation project. Recently reviewing the project spec I’m fairly happy that I’ll be able to provide everything asked for. One of the last things I wanted to explore was automated/semi-automated programme reporting from the PROD database. From early discussions with CETIS, programme level reporting, particularly of technology and standards used by projects, emerged as a common task. CETIS already have a number of stored SPARQL queries to help them with this, but I wondered if more could be done to optimise the process. My investigations weren't entirely successful, and at times I was thwarted by misbehaving tools, but I thought it worth sharing my discoveries to save others time and frustration.

My starting point was the statistical programming and software environment R (in my case the more GUI friendly RStudio). R is very powerful in terms of reading data, processing it and producing data analysis/visualisations. Already CETIS’s David Sherlock has used R to produce a Google Visualisation of Standards used in JISC programmes and projects over time and CETIS’s Adam Cooper has used R for Text Mining Weak Signals, so there are some in-house skills which could build on this idea.

Two other main factors for looking at R as a solution are:

  • the modular design of the software environment makes it easy to add functionality through existing packages (as I pointed out to David there is a SPARQL package for R which means he could theoretically consume linked data directly from PROD); and
  • R has a number of ways to produce custom reports, most notably the Sweave function allows the integration of R output in LaTeX documents allowing the generation of dynamic reports  

So potentially a useful combination of features. Let’s start looking at some of the details to get this to work.

Getting data in

Attempt 1 – Kasabi custom API query

Kasabi is a place where publishers can put their data for people like me to come along and try to do interesting stuff with it. One of the great things you can do with Kasabi is make custom APIs onto linked data (like this one from Wilbert Kraan) and add one of the Kasabi authored XSLT stylesheets to get the data returned in a different format, for example .csv which is easily digestible by R.

Problem: Either I’m doing something wrong or there is an issue with the data on Kasabi or an issue with Kasabi itself, because I keep getting 400 errors on queries I know work, like this OER Projects with location, once they are converted to an API.

Attempt 2 – Query the data directly in R using the SPARQL package

As I highlighted to David there is a SPARQL package for R which in theory lets you construct a query in R, collect the data and put it in a data frame.

Problem: Testing the package with this query returns: 

Error in data.frame(projectID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,  : 
  arguments imply differing number of rows: 460, 436, 291, 427, 426

My assumption is the package doesn’t like empty values.
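The error is consistent with that assumption: in the standard SPARQL JSON results format an unbound variable, such as one from an OPTIONAL clause, is simply left out of a row, so building a table column by column gives columns of different lengths. The Python sketch below is not a fix for the R package. It only illustrates padding each row against the declared variable list, and the variable names and values in the sample are invented.

```python
import json

# Illustrative result in the SPARQL 1.1 JSON results format; the second row has no 'lat'
SAMPLE = json.loads("""
{
  "head": {"vars": ["projectID", "name", "lat"]},
  "results": {"bindings": [
    {"projectID": {"type": "literal", "value": "1"},
     "name": {"type": "literal", "value": "Project A"},
     "lat": {"type": "literal", "value": "55.9"}},
    {"projectID": {"type": "literal", "value": "2"},
     "name": {"type": "literal", "value": "Project B"}}
  ]}
}
""")

def bindings_to_rows(results, missing=""):
    """Build equal-length rows, filling unbound variables with a placeholder."""
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append([binding[v]["value"] if v in binding else missing
                     for v in variables])
    return variables, rows

def column_lengths(results):
    """Count the bound values per variable, as a column-wise build would see them."""
    variables = results["head"]["vars"]
    return {v: sum(1 for b in results["results"]["bindings"] if v in b)
            for v in variables}

header, rows = bindings_to_rows(SAMPLE)
print(column_lengths(SAMPLE))  # uneven counts, the situation in the error message
print(header)
print(rows)
```

Working row by row and filling the gaps keeps every column the same length, which is what a data frame requires.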

Attempt 3 – Going via a sparqlproxy

For a lot of the other PROD SPARQL work I’ve done I’ve got csv files by using the Rensselaer SPARQL proxy service. R is quite happy reading a csv via the proxy service (code included later); it just means you are relying on an external service, which isn’t really necessary as Kasabi should be able to do the job and it would be better if the stored procedures were in one place (I should also say I looked at just using an XML package to read data from Talis/Kasabi but didn’t get very far).

Processing the data

This is where my memory gets a little hazy and I wish I had taken more notes of the useful sites I went to. I’m pretty sure I got started with Tony Hirst’s How Might Data Journalists Show Their Working? Sweave. I know also that I looked at Vanderbilt’s Converting Documents Produced by Sweave, Nicola Sartori’s An Sweave Tutorial, Charlie Geyer's An Sweave Demo Literate Programming in R Reproducible Research, Greg Snow’s Automating reports with Sweave and Jim Robison-Cox’s Sweave Intro (this one included instructions on installing the MikTeX latex engine for windows, without which none of this would have worked).

The general idea with R/Sweave/LaTeX is you mark up a document, inserting R script which can be executed to include data tables and visualisations. Here’s a very basic example of an output (I think I’ve been using R for a week now so don’t laugh) which pulls in two data sets (Project Descriptions | Project Builds On), includes some canned text and generates a wordcloud from project descriptions and a force directed graph of project relationships.

The code used to do this is embedded below (also available from here):

The main features are the \something e.g. \section which is LaTeX markup for the final document and the <<>>= and @ R script code wrappers. I’ve even less experience of LaTeX than R so I’m sure there are many things I don’t know yet/got wrong, but hopefully you can see the potential power of the solution. Things I don’t like are being locked into particular LaTeX styles (although you can create your own) and the end product being a .pdf (as Sweave/R now go hand in hand a lot of the documentation and coding examples end up in .pdf which can get very frustrating when you are trying to copy and paste code snippets, which also makes me wonder how accessible/screen reader friendly sweave/latex pdfs are).

Looking for something that gives more flexibility in output I turned to R2HTML, which includes “a driver for Sweave allows to parse HTML flat files containing R code and to automatically write the corresponding outputs (tables and graphs)”. Using a similar style of markup (example of script here) we can generate a similar report in html. The R2HTML package generates all the graph images and html, so in this example it was a case of uploading the files to a webserver. Because it’s html the result can easily be styled with a CSS sheet or opened in a word processor for some layout tweaking (here’s the result of a 60 second tweak).

Is it worth it?

After a day toiling with SPARQL queries and LaTeX markup I’d say no, but it’s all very dependent on how often you need to produce the reports, the type of analysis you need to do and your anticipated audience. For example, if you are just going to present wordclouds and histograms R is probably overkill and it might be better to just use some standard web data visualisation libraries like mootools or Google Visualisation API to create live dashboards. Certainly the possibilities of R/Sweave are worth knowing about.

In May 2009 JISC CETIS announced the winners of the OER Technical Mini-Projects. These projects were designed:

to explore specific technical issues that have been identified by the community during CETIS events such as #cetisrow and #cetiswmd and which have arisen from the JISC / HEA OER Programmes

JISC CETIS OER Technical Mini Projects Call
Author: Phil Barker, JISC CETIS

One of the successfully funded projects was CaPRéT - Cut and PAste reuse and Tracking from Brandon Muramatsu, MIT OEIT and Justin Ball and Joel Duffin, Tatemae. I’ve already touched upon OER tracking in day 24 and day 30, briefly looking at social shares of OER repository records. Whilst projects like the Learning Registry have the potential to help, it is still early days and tracking still seems to be an afterthought, which has been picked up in various technical briefings. CaPReT tries to address part of this problem, as stated in the introduction to their final report:

Teachers and students cut and paste text from OER sites all the time—usually that's where the story ends. The OER site doesn't know what text was cut, nor how it might be used. Enter CaPRéT: Cut and Paste Reuse Tracking. OER sites that are CaPRéT-enabled can now better understand how their content is being used.

When a user cuts and pastes text from a CaPRéT-enabled site:

  • The user gets the text as originally cut, and if their application supports the pasted text will also automatically include attribution and licensing information.
  • The OER site can also track what text was cut, allowing them to better understand how users are using their site.

The code and other resources can be found on their site. You can also read Phil Barker’s (JISC CETIS) experience testing CaPReT and feedback and comments about the project on the OER-DISCUSS list.

One of the great things about CaPReT is the activity data is available for anyone to download (or as summaries Who's using CaPRéT right now? | CaPRéT use in the last hour, day and week | CaPRéT use by day).

One of the challenges set to me by Phil Barker was to see what I could do with the CaPReT data. Here’s what I’ve come up with. First a map of CaPReT (great circles) usage plotting source website and where in the world some text was copied from (click on image for full scale):

capret - source target map

And an interactive timeline which renders where people copied text, with pop-ups summarising what they copied:

capret timemap

Both these examples rely on the same refined data source rendered in different ways and in this post I’ll tell you how it was done. As always it would be useful to get your feedback as to whether these visualisations are useful, things you’d improve or other ways you might use the recipes.

How was it made – getting geo data

  1. Copied the CaPReT tabular data into Excel (.csv download didn’t work well for me, columns got mixed up on unescaped commas), and saved as .xls
  2. .xls imported into Google Refine. The main operation was to convert text source and copier IP/domains to geo data using www.ipinfodb.com. An importable set of routines can be downloaded from this gist. [Couple of things to say about this: CaPReT have a realtime map, but I couldn’t see any locations come through. If they are converting IP to geo it would be useful if this was recorded in the tabular results. IP/domain location lookups can also be a bit misleading: for example, my site is hosted in Canada, I’m typing in France and soon I’ll be back in Scotland.]
  3. The results were then exported to Excel, duplicates were removed based on ‘text copied on’ dates, and the result was uploaded to a Google Spreadsheet
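
The de-duplication in step 3 was done by hand in Excel, but the same idea can be sketched in a few lines of Python. This is only an illustration: apart from ‘text copied on’, the column names and the sample rows are assumptions, not the real CaPReT export.

```python
def drop_duplicates_by(rows, key):
    """Keep only the first row seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        marker = row[key]
        if marker not in seen:
            seen.add(marker)
            unique.append(row)
    return unique


# Hypothetical rows standing in for the refined CaPReT export.
sample_rows = [
    {"text copied on": "2012-01-10 09:15:02", "source": "example.org"},
    {"text copied on": "2012-01-10 09:15:02", "source": "example.org"},
    {"text copied on": "2012-01-11 14:40:31", "source": "example.com"},
]

deduped = drop_duplicates_by(sample_rows, "text copied on")
print(len(deduped), "rows left after removing duplicates")
```

Keeping the first occurrence preserves the original row order, which matters if the rows are later plotted on a timeline.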

Making CaPReT ‘great circles’ map

This was rendered in RStudio using Nathan Yau’s (Flowing Data) How to map connections with great circles. The R code I used is here. The main differences are reading the data from a Google Spreadsheet and handling the data slightly differently (who would have thought as.data.frame(table(dataset$domain)) would turn the source spreadsheet into a table of how often each domain appears? #stilldiscoveringR).

I should also say that there was some post production work done on the map. For some reason some of the ‘great circles’ weren’t so great and wrapped around the map a couple of times. Fortunately these anomalies can easily be removed using Inkscape (and while I was there I added a drop shadow).


capret before post production

Making CaPReT ‘timemap’

Whilst having a look around SIMILE based timelines for day 30 I came across the timemap.js project which:

is a Javascript library to help use online maps, including Google, OpenLayers, and Bing, with a SIMILE timeline. The library allows you to load one or more datasets in JSON, KML, or GeoRSS onto both a map and a timeline simultaneously. By default, only items in the visible range of the timeline are displayed on the map.

Using the Basic Example, Google v3 (because it allows custom styling of Google Maps) and Google Spreadsheet Example I was able to format the refined data already uploaded to a Google Spreadsheet and then plug it in as a data source for the visualisation (I did have a problem with reading data from a sheet other than the first one, which I’ve logged as an issue including a possible fix).

A couple of extras I wanted to add to this example were to show the source and allow the user to filter on it. There’s also an issue of scalability. Right now the map is rendering 113 entries; if CaPReT were to take off the spreadsheet would suddenly fill up and the visualisation would probably grind to a halt.

[I might be revisiting timemap.js as they have another example for a temporal heatmap, which might be good for showing Jorum UKOER deposits by institution.]

So there you go, two recipes for converting IP data into something else. I can already see myself using both methods in other aspects of the OER Visualisation and other projects. And all of this was possible because CaPReT had some open data.

I should also say JISC CETIS have a wiki on Tracking OERs: Technical Approaches to Usage Monitoring for UKOER and Tony also recently posted Licensing and Tracking Online Content – News and OERs.


Following on from Day 20’s Maps, Maps, Maps, Maps, the last couple of days I’ve been playing around with timelines. This was instigated by a CETIS PROD ‘sprint day’ on Friday where Sheila MacNeill, Wilbert Kraan, David Kernohan and I put our thinking caps on, cracked our knuckles and looked at what we could do with the CETIS PROD data.

Creating a timeline of JISC projects from PROD is not new: Wilbert has already posted a recipe for using a Google Gadgetized version of MIT’s SIMILE timeline widget to create a timeline of JISC e-Learning projects. I wanted to do something different, trying to extract project events and also have more timeline functionality than offered by the SIMILE gadget. My decision to go down this particular route was also inspired by seeing Derek Bruff’s Timeline CV, which renders Google Spreadsheet data in a full feature version of the SIMILE timeline widget (an idea that Derek had got from Brian Croxall).

PROD Project Directory page

Having already dabbled with the PROD data I knew CETIS staff have annotated JISC projects with almost 3,000 individual comments. These can be categorised as general and related projects comments and comments associated with technology and standards choices (you can see this rendered in a typical project page and highlighted in the graphic).

Timeline 1 - All project comments

Discovery number one was that not all the comments have timestamp information in the linked data (it turns out only general comments have these). Ignoring the untimestamped comments for now, if we create a SPARQL query, import the data into a Google Spreadsheet, format it and then wrap it in an HTML page we get this JISC CETIS PROD Comment Timeline:

JISC CETIS PROD Comment Timeline

The good – search and filtering strands; information popups render well; user can resize window; easy export options (activated when mouseover timeline via orange scissors)
The bad – too many comments to render in the timeline; the key under the timeline can’t render all the strands

Timeline 2: Technology timeline with comments

One of the suggestions at ‘show and tell’ was to focus on a particular area like technology to see if there were any trends in uptake and use. As mentioned earlier there are currently no timestamps associated with technology comments and it was suggested that project start and end dates could be used as an indication. So using the same recipe of SPARQL query, formatting data in a Google Spreadsheet to a HTML page we get the CETIS PROD Tech Timeline.  

CETIS PROD Tech Timeline

Again the default view presents information overload, which is slightly alleviated by filtering. I still don’t get any sense of waves of technologies coming and going, partly because project start/end dates are only a rough indication of when a technology was used, and maybe because it is very rare for a technology to die.

Timeline 3 – Programme level

Trying to take a more focused view it was suggested I look at a programme level timeline of general comments (being general comments means they are individually timestamped). Using the recipe one more time of SPARQL query, formatting data in a Google Spreadsheet to a HTML page we get the CETIS PROD OER Phase 1 & 2 timeline.

CETIS PROD OER Phase 1 & 2 timeline

Still there is a problem navigating the data because of clustering of comments (shown by the string of blue dots in the bottom timebar). So what’s going on here? Looking at the data it’s possible to see that despite the fact that general comments could have been made at any point in the two years of the programme 912 comments were made on only 73 different days (which partly makes sense – ‘going to sit down and do some admin, I’ll review and comment on project progress’).    
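
The clustering check above (how many comments fall on how many distinct days) is easy to reproduce once the timestamps are in a list. A minimal sketch, assuming ISO-style timestamps; the real PROD timestamp format may differ and the sample values are made up:

```python
from datetime import datetime


def count_comment_days(timestamps):
    """Return (number of comments, number of distinct calendar days)."""
    days = {
        datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S").date()
        for ts in timestamps
    }
    return len(timestamps), len(days)


# Made-up timestamps; two of the three fall on the same day.
sample = ["2010-03-01T10:00:00", "2010-03-01T10:05:00", "2010-04-12T16:30:00"]
total, distinct_days = count_comment_days(sample)
print(total, "comments on", distinct_days, "different days")
```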

So timelines are maybe not best for this type of data. At least there’s a recipe/template used here which might help people uncover useful information. There is an evolution of this combining timelines and maps that I’m working on so stay tuned.


A quick postscript to day 24 of the OER Visualisation project where I looked at how individual Jorum UKOER resources were being, or as was the case, not being shared on social networking sites like Twitter, Facebook et al. I did a similar analysis on HumBox records and using my method it was a similar story, almost undetectable social sharing of individual resources.

To try and see if this was because of bad data on my behalf I posed the question to the #ukoer Twitter community and the OER-DISCUSS mailing list. On the OER-DISCUSS list Peter Robinson highlighted that one mention of one of University of Oxford’s resources on StumbleUpon resulted in a 20,000 views spike. On Twitter Catherine Grant (@filmstudiesff) also responded with some links.

Putting Catherine’s first link into sharedcount.com gives the following:

  • Facebook: Likes 13, Shares 24, Comments 11, Total 48
  • Tweets: 81
  • Google +1s: 0
  • Diggs: 0
  • Shares: 2
  • Google Buzz: 0
  • Bookmarks: 8
  • Stumbles: 1

So one page of curated resources, with almost 50 Facebook reactions and over 80 tweets, can have as many social shares as an entire repository.

This issue is a well known one within the OER community and, with almost eerie timing, the day after the ‘day 24’ post Pat Lockley posted The OERscars – and the winner is, in which he looks at some of the activity stream around ‘Dynamic Collections’ created as part of the OER Phase 2 Triton Project. From Pat’s post:

Dynamic Collections function as a WordPress plug in, bring in RSS Feeds from OER sites and blogs, and then search these feeds for particular words before moving these items into collections. These collections can be created as simply as a WordPress post, and so gives almost everyone the scope to start building OER collections straight away. Once a collection has some content, it can be displayed to visitors to the site (normally as a “wider reading” style link at the end of a post on a particular topic) and we made sure to track how these resources are used.  As well as showing as a WordPress page, the collections can also be seen as an RSS Feed (add ?rss_feed_collection=true to the end of a page), An Activity Stream – which will be handy for the Learning Registry (?activity_stream=true), or embedded into another page (?dc_embed=true) via some javascript.

I’m not entirely sure what my point is, but I thought it worth sharing the information and links.


This might be slightly off-topic for the OER Visualisation project, but I followed an idea, did what I think are some interesting things with an archive of tweets and thought I would share. This line of thought was triggered by a tweet from Terry McAndrew in which he asked:

@mhawksey Have you a visualisation planned of JORUM 'customers' of OER (and the rest of it for that matter).

Tracking who views or uses OER material can be tricky but not impossible. The main issue comes when someone else, like me, comes along and wants to see who else has viewed, used or remixed the resource. For example, with the Jorum ukoer resources the only usage data I could get was each resource page view, and even getting this required scraping over 8,000 pages. This is mentioned in day 18, but I realise I haven’t gone into any detail about how Google Refine was used to get this data. If someone wants me to I will reveal all.

A recent development in this area is the US Learning Registry project, which is looking to maximise the use of activity data to support resource discovery and reuse. On Lorna Campbell’s CETIS blog there is a very useful introduction to the Learning Registry announcing JISC’s involvement in a UK node. The post includes the following use case which helps illustrate what the project is about:

“Let’s assume you found several animations on orbital mechanics. Can you tell which of these are right for your students (without having to preview each)? Is there any information about who else has used them and how effective they were? How can you provide your feedback about the resources you used, both to other teachers and to the organizations that published or curated them? Is there any way to aggregate this feedback to improve discoverability?

The Learning Registry is defining and building an infrastructure to help answer these questions. It provides a means for anyone to ‘publish’ information about learning resources. Beyond metadata and descriptions, this information includes usage data, feedback, rankings, likes, etc.; we call this ‘paradata’”

Lorna’s post was made in November 2011 and since then Mimas have started The JLeRN Experiment (and if you are a developer you might want to attend the CETIS contributors event on the 23rd Jan).

Plan A – Inside-out: Repository resource sharing data

All of this is a bit late for me but I thought I’d see what ‘paradata’ I could build around ukoer and see if there were any interesting stories to be told as part of the visualisation project. This is an area I’ve visited before with a series of ‘And the most socially engaging is …’ posts which started with And the most engaging JISC Project is… For this I used Google Spreadsheet to get social share counts (Facebook, Twitter and more) for a list of urls. One of the issues with this technique is that Google Spreadsheets time out after 5 minutes, so there is a limit to the number of links you can get through. This is, however, not a problem for Google Refine.

Taking 380 of the most viewed Jorum ukoer tagged resources (approximately 5% of the data) I used Google Refine to

  1. ‘Add column by fetching URL’ passing the resource link url into Yahel Carmon’s Shared Count API -  using the expression "http://api.sharedcount.com/?url="+escape(value,"url") 
  2. Add column based on the results column parsing each of the counts – e.g. using expressions similar to parseJson(value)['Twitter']
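
For anyone who prefers a script to Google Refine, the two steps above can be sketched in Python. The endpoint and the 'Twitter' key are taken from the expressions in the list; the sample response is made up for illustration and no request is actually sent, since the service may no longer respond at this address.

```python
import json
from urllib.parse import quote

# Endpoint as quoted in step 1 above.
API_ROOT = "http://api.sharedcount.com/?url="


def shared_count_url(resource_url):
    """Build the request url, like escape(value, "url") in Refine."""
    return API_ROOT + quote(resource_url, safe="")


def twitter_count(response_text):
    """Pull one count out of the JSON, like parseJson(value)['Twitter']."""
    return json.loads(response_text)["Twitter"]


# Made-up response body, only here to show the parsing step.
sample_response = '{"Twitter": 7}'

print(shared_count_url("http://example.org/a b"))
print(twitter_count(sample_response))
```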

At this point I stopped parsing columns because it was clear that there was very little if any social sharing of Jorum ukoer resources (here is a spreadsheet of the data collected). In fact the most tweeted resource, which Twitter records as having 7 tweets, gets most of these following my tweet about it being the most viewed resource.

So what might be going on here? Is this just an issue for national repositories? Are people consuming ukoer resources from other repositories? Are Jorum resources not ranking well in search engines? Are resources not being marketed enough?

I don’t have answers to any of those questions, and maybe this is just an isolated case, but the ‘marketing’ aspect interests me. When I publish a blog post I’ll push it into a number of communication streams including RSS, a couple of tweets, the occasional Google+. For posts I think are really worthwhile I’ll set up a Twitter search column on Tweetdeck with related keywords and proactively push posts to people I think might be interested (I picked up this idea from Tony Hirst’s Invisible Frictionless Tech Support). Are we doing something similar with our repositories?

Plan B – Outside-in: Community resource sharing data

There’s probably a lot more to be said about repositories or agencies promoting resources. To try and find some more answers, instead of looking from the repository perspective of what is being shared, I thought it’d be useful to look at what people are sharing. There are a number of ways you could do this like selecting and monitoring a group of staff. I don’t have time for that, so decided to use data from an existing community  who are likely to be using or sharing resources aka the #ukoer Twitter hashtag community. [There are obvious issues with this approach, but I think it’s a useful starter for ten]

Having grabbed a copy of the #ukoer Twapper Keeper archive using Google Refine before it disappeared I’ve got over 8,000 tweets from 8th March 2010 to 3rd January 2012.  My plan was to extract all the links mentioned in these tweets, identify any patterns or particularly popular tweets.

Extracting links and expand shortened urls

As most tweets now get links replaced with t.co shortened urls and the general use of url shortening the first step was to extract and expand all the links mentioned in tweets. Link expansion was achieved using the longurl.org API, which has the added bonus of returning meta description and keywords for target pages. Here’s a summary of the Google Refine actions I did (taken from the undo/redo history):

  1. Create new column links based on column item - title by filling 8197 rows with grel:filter(split(value, " "),v,startsWith(v,"http")).join("||")
  2. Split multi-valued cells in column links
  3. Create column long_url at index 4 by fetching URLs based on column links using expression grel:"http://api.longurl.org/v2/expand?url="+escape(value,"url")+"&format=json&user-agent=Google%20Refine&title=1&meta-keywords=1"
  4. Create new column url based on column long_url by filling 5463 rows with grel:parseJson(value)['long-url']
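
The GREL in steps 1 and 3 translates fairly directly into Python. The sketch below mirrors the split/filter expression and rebuilds the longurl.org request string from step 3; the sample tweet is invented and nothing is fetched, as the API may no longer be available.

```python
from urllib.parse import quote


def extract_links(tweet_text):
    """Mirror of filter(split(value, " "), v, startsWith(v, "http"))."""
    return [word for word in tweet_text.split(" ") if word.startswith("http")]


def longurl_request(short_url):
    """Build the expansion request url used in step 3."""
    return (
        "http://api.longurl.org/v2/expand?url="
        + quote(short_url, safe="")
        + "&format=json&user-agent=Google%20Refine&title=1&meta-keywords=1"
    )


# Invented tweet text containing two shortened links.
tweet = "New #ukoer post http://t.co/abc123 via http://bit.ly/xyz"
links = extract_links(tweet)
print(links)
print(longurl_request(links[0]))
```

Returning a list rather than a "||"-joined string does the job of step 2 (splitting multi-valued cells) in one go.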

Searchable table of #ukoer links

With long urls extracted the data was exported and uploaded to Google Spreadsheet so that a list of unique urls and their frequencies could be calculated. Here is the Spreadsheet of data. From the refined data there are 2,482 different links which appear 6,220 times in the #ukoer archive. Here is the searchable table of extracted links with frequencies (sorry for the small font – can’t seem to find a way to control column width using Google Visualisation API … anyone?).
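
The unique-url counting was done in a Google Spreadsheet; a rough Python equivalent, using made-up urls in place of the real archive, might look like this:

```python
from collections import Counter


def link_frequencies(urls):
    """Return (url, count) pairs, most frequent first."""
    return Counter(urls).most_common()


# Made-up expanded urls standing in for the #ukoer archive.
sample_urls = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/a",
]

frequencies = link_frequencies(sample_urls)
print(len(frequencies), "different links appearing", len(sample_urls), "times")
```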

Not surprisingly a number of domain root urls appear at the top of the list. More work needs to be done to match resource sharing to different OER sources, but you can start doing simple filtering to find out what’s there. Something from the initial analysis I find interesting is that the most common link in the archive is to Oxford’s Open Advent Calendar, which was a daily posting highlighting some of their OER resources. This could be interpreted as underlining the need for OER resources to be more effectively marketed. I’ll let you decide.

PS out of interest I put the list of 2,500 odd links back into Google Refine and extracted social share counts. I haven’t had a chance to look at the data closely but if you want to play a copy of it and a meta-data enhanced version of the #ukoer archive is here. Please share any findings ;)


Dear Diary, it is now day 20 of the OER Visualisation Project … One of the suggested outputs of this project was “collections mapped by geographical location of the host institution” and over the last couple of days I’ve experimented with different map outputs using different techniques. It’s not the first time I’ve looked at maps, and as early as day 2 I used SPARQL to generate a project locations map. At the time I got some feedback questioning the usefulness of this type of data, but I was still interested in pursuing the idea as a way to provide an interface for users to navigate some of the OER Phase 1 & 2 information. This obsession shaped the way I approached refining the data, trying to present project/institution/location relationships, which in retrospect was a red herring. Fortunately the refined data I produced has helped generate a map which might be interesting (I thought there would be more from London), but I thought it would also be useful to document some of what I’m sure will end up on the cutting room floor.

Filling in the holes

One of the things the day 2 experiment showed was it was difficult to use existing data sources (PROD and location data from the JISC Monitoring Unit) to resolve all host institution names. The main issue was HEA Subject Centres and partnered professional organisations. I’m sure there are other linked data sources I could have tapped into (maybe inst. > postcode > geo), but opted for the quick and dirty route by:

  1. Creating a sheet in the PROD Linked Spreadsheet of all projects and partners currently filtered for Phase 1 and 2 projects. I did try to also pull location data using this query but it was missing data so instead created a separate location lookup sheet using the queries here. As this produced 130 institutions without geo-data (Column M) I cheated and created a list of unmatched OER institutions (Column Q) [File > Make a copy of the spreadsheet to see the formula used which includes SQL type QUERY].
  2. Resolving geo data for the 57 unresolved Phase 1 & 2 projects was a 3 stage process:
    1. Use the Google Maps hack recently rediscovered by Tony Hirst to get co-ordinates from a search. You can see the remnants of this here in cell U56 (Google Spreadsheets only allow 50 importDatas per spreadsheet so it is necessary to Copy > Paste Special > As values only).
    2. For unmatched locations ‘Google’ Subject Centres to find their host institution and insert the name in the appropriate row in Column W – existing project locations are then used to get coordinates.  
    3. For other institutions ‘google’ them in Google Maps (if that didn’t return anything conclusive then a web search for a postcode was used). To get the co-ordinate pasted in Column W I centred their location on Google Maps then used the modified bookmarklet javascript:void(prompt('',gApplication.getMap().getCenter().toString().replace(',','').replace(')','').replace('(',''))); to get the data.
  3. The co-ordinates in Column S and T are generated using a conditional lookup of existing project leads (Column B) OR IF NOT partners (Column F) OR IF NOT entered/searched co-ordinates.
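
Two small pieces of the process above can be sketched in Python: the string clean-up done by the bookmarklet and the ‘leads OR partners OR entered’ fallback lookup. This is a sketch only. It assumes the map centre arrives as a string like "(lat, lng)", and the function names, dictionaries and coordinates are all invented for illustration.

```python
def clean_centre(raw):
    """Turn '(55.95, -3.19)' into '55.95 -3.19', as the chained replaces do."""
    return raw.replace(",", "", 1).replace(")", "", 1).replace("(", "", 1)


def resolve_coordinates(name, lead_locations, partner_locations, manual_locations):
    """Try project leads first, then partners, then manually entered values."""
    for lookup in (lead_locations, partner_locations, manual_locations):
        if name in lookup:
            return lookup[name]
    return None  # still unresolved


# Invented lookups standing in for the spreadsheet columns.
leads = {"Example University": "55.95 -3.19"}
partners = {"Example College": "51.50 -0.12"}
manual = {"Example Subject Centre": clean_centre("(53.48, -2.24)")}

print(resolve_coordinates("Example Subject Centre", leads, partners, manual))
```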

Satisfied that I had enough lookup data I created a sheet of OER Phase 1 & 2 project leads/partners (filtering Project_and_Partners and pasting the values in a new sheet). Locations are then resolved by looking up data from the InstLocationLookup sheet.

Map 1 – Using NodeXL to plot projects and partners

Exporting the OER Edge List as a csv allows it to be imported to the Social Network Analysis add-on for Excel (NodeXL). Using the geo-coordinates as layout coordinates gives:

OERPhase1&2 Edge 

The international partners mess with the scale. Here’s the data displayed in my online NodeXL viewer. I’m not sure much can be taken from this.

Map 2 – Generating a KML file using a Google Spreadsheet template for Google Maps

KML is an XML based format for geodata originally designed for Google Earth, but now used in Google Maps and other tools. Without a templating tool like Yahoo Pipes, which was used in day 2, generating KML can be very laborious. Fortunately the clever folks at Google have come up with a Google Spreadsheet template – Spreadsheet Mapper 2.0. The great thing about this template is you can download the generated KML file or host it in the cloud as part of the Google Spreadsheet.
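
To give a feel for what the template is producing, here is a minimal Python sketch that writes one KML Placemark per institution. It is not how Spreadsheet Mapper works internally, and the institution name and coordinates are made up. Note that KML lists longitude before latitude.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def build_kml(places):
    """Build a KML document from (name, latitude, longitude) tuples."""
    ET.register_namespace("", KML_NS)  # write KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for name, lat, lon in places:
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML coordinate order is longitude,latitude
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Made-up institution and coordinates.
kml_text = build_kml([("Example University", 55.9445, -3.1892)])
print(kml_text)
```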

The instructions for using the spreadsheet are very clear so I won’t go into details, you might however want to make a copy of the KML Spreadsheet for OER Phase 1 & 2 to see how data is being pulled from the PROD Spreadsheet. The results can be viewed in Google Maps (shown below), or viewed in Google Earth.

OER Phase 1 & 2 in Google Maps

Map 3 – Customising the rendering KML data using Google Maps API

Map 3 Example

Whilst digging around the Google Maps API for inspiration I came across this KML with features example (in the KML and GeoRSS Layers section). Out of interest I thought I’d use the KML link from Map 2 as the source, which gives this OER Phase 1 & 2 map. [If you are wondering about the map styling, I recently came across the Google Maps API Styled Map Wizard, which lets you customise the appearance of Google Maps, creating a snippet of code you can use in Google Maps API Styling.]

Map 4 – Rendering co-ordinate data from a Google Spreadsheet

Another example I came across was using a Google Spreadsheet source, which gives this other version of OER Phase 1 & 2 Lead Institutions rendered from this sheet.

I haven’t even begun on Google Map Gadgets, so it looks like there are 101 ways to display geo data from a Google Spreadsheet. Although all of this data bashing was rewarding I didn’t feel I was getting any closer to something useful. At this point in a moment of clarity I realised I was chasing the wrong idea, that I’d made that schoolboy error of not reading the question properly.

Map 5 – Jorum UKOER records rendered as a heatmap in Google Fusion Tables

Having already extracted ‘ukoer’ records from Jorum and reconciled them against institution names in day 11, it didn’t take much to geo-encode the 9,000 records to resolve them to an institutional location (I basically imported a location lookup from the PROD Spreadsheet, did a VLOOKUP, then copy/pasted the values). The result is in this sheet.

For a quick plot of the data I thought I’d upload to Google Fusion Tables and render as a heatmap but all I got was a tiny green dot over Stoke-on-Trent for the 4000 records from Staffordshire University. Far from satisfying.

Map 6 - Jorum UKOER records rendered in Google Maps API with Marker Clusters

The final roll of the dice … for now anyway. MarkerClusterer is an open source library for Google Maps API which groups large numbers of closely located markers for speed and usability gains. I’d never used this library before but the Speed Test Example looked easy to modify. This has resulted in the example linked at the very beginning of this post mapping Jorum ukoer records.

This is still a prototype version and lots of tweaking/optimisation is required. The data file, which is a csv to json dump, has a lot of extra information that’s not required (hence the slow load speed), but this is probably the beginnings of the best solution for visualising this aspect of the OER programme.
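
One obvious optimisation would be to drop the unused columns when converting the csv to json. A rough sketch of that idea follows; the column names and sample row are invented, since the real data file will have its own fields.

```python
import csv
import io
import json


def csv_to_trimmed_json(csv_text, keep_fields):
    """Convert csv text to a JSON array, keeping only the named columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [{field: row[field] for field in keep_fields} for row in reader]
    return json.dumps(rows)


# Invented sample with one column the map does not need.
sample_csv = (
    "title,institution,lat,lng,description\n"
    "Intro to OER,Example University,55.94,-3.19,Long text not needed by the map\n"
)

trimmed = csv_to_trimmed_json(sample_csv, ["title", "lat", "lng"])
print(trimmed)
```

Only the fields the markers actually use are written out, so the file the browser has to download and parse gets smaller.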

So there you go. Two sets of data, 6 ways to turn it into a map and hopefully some useful methods for mashing data in between.

I don’t have an ending to this post, so this is it.