Google Refine

I started writing this last week so the intro doesn’t make sense. Slides from the presentation I did are here (all pictures, so they probably also make very little sense).

To paraphrase Stephen Heppell (I’ve misquoted this before):

Content is king, but community is sovereign

The UKOER Programme is having its final meeting next week, and while the final set of projects comes to a close, a strong community has formed and I’m sure will continue. Something I was interested in doing is looking at how the community has evolved over time. I’ve previously looked at data around the #ukoer hashtag, but not everyone uses Twitter, so I thought I’d look for another data source. As email is still a strong component of most people’s everyday lives I started poking around the OER-DISCUSS JISCMail (Listserv) list:

A public list for discussion about the release, use, remix and discovery of Open Educational Resources (OER). Managed jointly by OU SCORE Project, and the JISC / HE Academy OER Programme.

As far as I could see there are limited options for getting data out of JISCMail (some limited RSS/Atom feeds), so cue the music for a good old-fashioned scrape and refine. Whilst I’ll walk you through this for OER-DISCUSS, the same recipe can be used for other public lists.

Source survey

Instead of going straight into the recipe I wanted to record some of the initial thought processes in tackling the problem. This usually begins with looking at what you’ve got to work with. Starting with the list homepage I can see some RSS/Atom feeds that don’t take me far, so instead I turn my attention to the list of links for each month’s archives. Clicking through to one of these and poking around the HTML source (I mainly use Chrome, so a right-click in the page gives the option to Inspect Element) I can see that the page uses a table template structure to render the results – good. Next I checked if the page would render even when I was logged out of JISCMail, which it does – good. So far so good.

page source

Next, a step back. This looks scrapable, so has anyone done this before? A look on Scraperwiki turns up nothing on Listserv or JISCMail, so next a general Google search. Searching for terms like ‘listserv data scrape’ is problematic because there are lots of listserv lists about data scraping in general. So we push on. We’ve got a page with links to each month’s archive and we know each archive uses a table to lay out results. Next it’s time to start thinking about how we get data out of the tables. Back in the Chrome Element Inspector we can see that the source contains a lot of additional markup for each table cell, and in places cells contain tables within them. At this point I’m thinking OpenRefine (née Google Refine).

Scraping list of archive links

A feature of OpenRefine I use a lot is fetching data from a URL. To do this we need a list of URLs to hit. Back on the list homepage I start looking at how to get the links for each month’s archive. Hovering over the links I can see they use a standard sequence with a 2-digit year {yy} and month {mm}:


I could easily generate these in a spreadsheet but I’m lazy, so I just point a Chrome extension I use called Scraper at the part of the page I want and import it into a Google Spreadsheet.


[Another way of doing this is creating a Google Spreadsheet and in this case entering the formula =ImportXml("","//tr[2]/td/ul/li/a/@href")]
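To show the pattern more concretely, here’s a small Python sketch that builds the list of monthly archive URLs. Note the BASE address below is a made-up placeholder (the real JISCMail URL isn’t reproduced here) – only the {yy}{mm} sequencing is the point:

```python
# Hypothetical archive address -- substitute the real JISCMail url.
# Only the 2-digit year {yy} and zero-padded month {mm} pattern matters.
BASE = "https://listserv.example/archives/oer-discuss/ind{yy}{mm:02d}.html"

def archive_urls(start_year, end_year):
    """Build one archive url per month across the given year range."""
    return [BASE.format(yy=str(y)[-2:], mm=m)
            for y in range(start_year, end_year + 1)
            for m in range(1, 13)]

urls = archive_urls(2010, 2012)
print(len(urls))  # 3 years x 12 months = 36
```

This list can then be pasted straight into OpenRefine as the starting column for the fetch step below.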

Fetching and Refining the Data

Import project

Finally we can fire up OpenRefine. You could create a project by using the Google Data option, which is used to import data from your Google Docs; instead, as it’s not a huge amount of data, I use the Clipboard option. At this point the preview will probably separate the data using ‘/’ and use the first row as a column heading, so you’ll want to switch the separator to comma and de-select ‘Parse next 1 line(s) as column headers’.

  1. Next we want to fetch each month’s archive page by using the Column 1 dropdown to Edit column > Add column by fetching URLs… using the GREL expression ""+value with the column name month_raw
  2. This pulls in each month’s archive page in raw html. Now we want to parse out each row of data in a new column by selecting the dropdown from month_raw and selecting Edit column > Add column based on this column  using the GREL expression forEach(value.parseHtml().select("table.tableframe")[1].select("tr"),v,v).join(";;;") with the column name rows_raw – this selects the second table with class ‘tableframe’ and joins each row with a ‘;;;’
  3. Next from the rows_raw column use Edit cells > Split multi-valued cells using ;;; as the separator
  4. Again from the rows_raw column dropdown select Edit column > Add column based on this column using the GREL expression forEach(value.parseHtml().select("td"),v,v).join(";;;") with the column name rows_parsed – this joins each <td> with a ;;; which will let us split the values into new columns in the next step
  5. Now from the rows_parsed column select Edit column > Split into several columns using the separator ;;;
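OpenRefine’s parseHtml() does the heavy lifting in steps 2 and 4. As a rough sketch of what that cell extraction is doing, here’s an equivalent with Python’s standard-library HTML parser (the sample row is invented, not real JISCMail markup):

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collect the text of each top-level <td> in a row -- mirrors the
    GREL select("td") ... join(";;;") step (a sketch, not OpenRefine)."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # track nested <td> (cells can contain tables)
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1
            if self.depth == 1:
                self.cells.append("")
    def handle_endtag(self, tag):
        if tag == "td":
            self.depth -= 1
    def handle_data(self, data):
        if self.depth and self.cells:
            self.cells[-1] += data

row = "<tr><td><a href='/x'>Subject</a></td><td>Alice</td><td>Mon, 1 Oct</td></tr>"
parser = CellExtractor()
parser.feed(row)
joined = ";;;".join(c.strip() for c in parser.cells)
print(joined)  # Subject;;;Alice;;;Mon, 1 Oct
```

The ‘;;;’ separator is arbitrary; it just needs to be a string that won’t appear in the cell text, so the split in step 5 is unambiguous.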

Column split

You should now have something similar to the above with columns and rows split out, but still messy with HTML in the cells. We can clean these up using Edit cells > Transform with variations of value.parseHtml().htmlText()

Here are the steps I used (the complete operation history, which you can apply via Undo/Redo, is here – using this you can apply all the actions starting from the list of monthly urls):

  1. Text transform on cells in column rows_parsed 4 using expression grel:value.parseHtml().htmlText().replace(" lines","").toNumber()
  2. Rename column rows_parsed 4 to lines
  3. Text transform on cells in column rows_parsed 3 using expression grel:value.parseHtml().htmlText().toDate("EEE, dd MMM y H:m:s")
  4. Rename column rows_parsed 3 to date
  5. Text transform on cells in column rows_parsed 2 using expression grel:value.parseHtml().htmlText().replace(" <[log in to unmask]>","")
  6. Rename column rows_parsed 2 to from
  7. Create column snippet at index 4 based on column rows_parsed 1 using expression grel:value.split("showDesc('")[1].split("','")[0].unescape("html").parseHtml().htmlText()
  8. Create column link at index 4 based on column rows_parsed 1 using expression grel:""+value.parseHtml().select("a")[0].htmlAttr("href")
  9. Text transform on cells in column rows_parsed 1 using expression grel:value.parseHtml().htmlText()
  10. Rename column rows_parsed 1 to subject
  11. Create column subject_normal at index 4 based on column subject using expression grel:trim(value.replace(/^Re:|^Fwd:/i,""))
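The two fiddliest transforms above are the date parse and the subject normalisation. Here’s a hedged Python equivalent of each, with invented sample values, just to make the logic explicit:

```python
import re
from datetime import datetime

def normalise_subject(subject):
    """Strip a leading Re:/Fwd: prefix and trim -- like the GREL
    trim(value.replace(/^Re:|^Fwd:/i,""))."""
    return re.sub(r"^(Re:|Fwd:)", "", subject, flags=re.I).strip()

def parse_listserv_date(text):
    """Parse the 'EEE, dd MMM y H:m:s' style dates from the archive pages."""
    return datetime.strptime(text, "%a, %d %b %Y %H:%M:%S")

print(normalise_subject("Re: OER metadata question"))  # OER metadata question
print(parse_listserv_date("Tue, 24 Apr 2012 09:05:03").year)  # 2012
```

Normalising subjects this way is what lets replies and forwards be grouped back onto the original thread.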

You’ll probably notice some of the rows don’t contain the data we need. An easy way to remove these is to use a timeline facet on the date column, selecting non-time, blank and error, and then from the All column dropdown menu select Edit rows > Remove all matching rows.

Tony Hirst has a great post on faceting tricks. Something not covered is clustering data using facets. We can use this as a way to join rows where authors have multiple logins, e.g. Tony Hirst and Tony.Hirst.

To do this add a text facet to the author/from column and click Cluster:

Facet view

Finding the right settings is a bit of trial and error combined with a bit of knowledge about your dataset. In this example I saw that there were rows for Pat Lockley and Patrick Lockley, so I tried some settings until I got a hit (in this case using nearest neighbour and PPM – which by all accounts is the last resort). You might also need to run clustering a couple of times to catch most of the variations.

Clustering columns
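OpenRefine’s clustering methods are built in, but the idea behind the simplest one (key collision with a fingerprint keyer) can be sketched in a few lines of Python. Note this is the fingerprint method, not the nearest neighbour/PPM option used above, and the example names are just illustrations:

```python
import re
from collections import defaultdict

def fingerprint(name):
    """OpenRefine-style fingerprint key: lowercase, strip punctuation,
    sort the unique tokens. Values sharing a key form one cluster."""
    tokens = re.split(r"\W+", name.lower())
    return " ".join(sorted(set(t for t in tokens if t)))

names = ["Tony Hirst", "Tony.Hirst", "hirst tony"]
clusters = defaultdict(list)
for n in names:
    clusters[fingerprint(n)].append(n)
print(dict(clusters))  # all three variants land in one cluster
```

Nearest-neighbour methods go further by tolerating spelling differences (Pat vs Patrick) rather than just punctuation and word order, which is why they were needed above.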

What can we do with this data?

In Dashboarding activity on public JISCMail lists using Google Sheets (Spreadsheets) I was able to get an indication of the overall activity of the list. Now that author names are clustered I can get a more accurate picture of the top contributors using a text facet on the author column:

Pat Lockley 269, Amber Thomas 122, Phil Barker 82 

I was tempted to mine the individual posts further using the techniques and code posted by Adam Cooper (CETIS), but it didn’t look like I could easily swap the data source. A couple of questions posed by Phil Barker (CETIS) were:

The easiest way I found to get this was to use the Export > Custom tabular exporter (using this settings file), import into Google Sheets (Spreadsheets) and use a couple of formulas to get this summary page (opening the summary page will let you see the formulas I’ve used):

And there is much more you could do, but I’ll leave it there. If nothing else I hope you get an idea of some of the thought processes involved in extracting data. As always if something is unclear get in touch.


In a recent post I showed how, given some urls, it was easy to generate a templated report as a Word document. This was partly done to show how blog posts from a JISC-funded programme could be used to generate reports. One of the issues with this solution is that not all the projects used WordPress as their blogging platform, making it harder to get at some of the data. One thought I had was that, rather than dictating a particular platform, JISC could aggregate all programme posts in a central database. This isn’t entirely new: at CETIS we already aggregate links to posts for selected programmes via our PROD database; the difference here would be that as well as a link, the post content would also be ingested, making it easy to query and analyse the data from a single source. As I recently wrote in Notes on technology behind cMOOCs: Show me your aggregation architecture and I’ll show you mine, this technique is common in cMOOCs, and in particular I highlighted how Digital Storytelling (DS106) does this already using the FeedWordPress plugin.

As part of a double hit I thought I’d see if there was a case for JISC implementing this type of infrastructure and, because I’ve got an interest in applying learning analytics to open online courses, have a look at the data stored by ds106 and see what analysis I could do. In the next series of posts I share my notes and sketches around extracting FeedWordPress data, supplementing it with 3rd-party sources and analysing dialogue/activity around posts. In this first post I’m going to outline how all the data was imported into a local WordPress database, queried to pull out a data slice, refined and mined, before visualisation/analysis. What we end up with is a view of how ds106 blog posts are interlinked by hyperlinks within the post body. In the next in the series I record some notes on comment extraction, and finally how to get more activity data from social network sites.

The data source

If JISC were hosting the data then in theory it would be a lot easier to get to. As my test data comes from ds106, some extra preparatory work was required. Fortunately #ds106er Alan Levine (@cogdog) was able to provide me with a MySQL dump of data from their WordPress install. The dump contained over 700,000 lines of SQL commands and was over 500MB. Problem one was how to get this back into a local MySQL database to make slicing, dicing and exporting easier.

For a local Apache/MySQL/PHP server I used XAMPP Lite. Importing the data wasn’t straightforward as, despite tweaking the config, I kept getting ‘allocated memory’ errors. The solution was ‘BigDump: Staggered MySQL Dump Importer’, which was able to automatically split and run the import. Even with this I got a couple of ‘MySQL has gone away’ errors, so I used Vim (a text editor that can handle very large files) to drop 5 blog post inserts (given ds106 had over 20,000 posts I can live with the loss).

The selected data export

Wordpress Database Diagram

Those familiar with the backend of WordPress will know it has a relational structure with a number of tables holding different parts of the data. Because I knew I wanted to explore ‘ds106 dialogue’, and that to do this the data would have to be combined with other sources, I needed to export a slice of the data. Whilst I was doing this it also made sense to make some of the data more meaningful. For example, the wp_posts table, which has most of the data I needed, uses an author id number, whereas it would be better to get this as a display name by joining it with the wp_users data. Fortunately, because MySQL is … well, a ‘query language’, this is relatively easy to do using the command below. To talk you through it, I’m selecting a couple of columns from the wp_posts table and joining them with some other data on the condition that it’s a published post. As well as the author display name you’ll see that I’m also returning a wfw:commentRSS value. This is the comment RSS feed caught by the FeedWordPress plugin (I think it comes from here). I got wind of this metadata after reading Jim Groom’s Displaying Distributed Comments on the Hardboiled Blog. I’ll show how this is used later.

SELECT wp_posts.ID, 
       wp_posts.post_date, 
       wp_posts.post_title, 
       wp_posts.post_content, 
       wp_posts.guid, 
       wp_users.display_name AS author, 
       wp_postmeta.meta_value AS comment_feed 
FROM   wp_posts 
       LEFT JOIN wp_postmeta 
              ON wp_posts.ID = wp_postmeta.post_id 
                 AND wp_postmeta.meta_key = 'wfw:commentRSS' 
       LEFT JOIN wp_users 
              ON wp_posts.post_author = wp_users.ID 
WHERE  wp_posts.post_type = 'post' 
       AND wp_posts.post_status = 'publish'

Running this query in phpMyAdmin gives us the option to export as a csv, giving almost 20,000 blog posts (one per row) to play with, which is plenty.

The refinement

apply-operations

Using Google Refine we can import the csv file, choosing comma separated and headers in first row. Next we want to process the post content to extract outbound links. Below are the rough steps I used, which you can download and apply to your own project (the steps in the file vary slightly as I noticed that the guid links had ?p={number} instead of the post permalink. I ended up getting these when fetching the comment feed and extracting the link. More detail about this in the next post. Note to self: issue with non-permalinks).

  1. Create column post_as_date at index 3 based on column post_date using expression grel:value.toDate("Y-m-d H:m:s")
  2. Create column outbound_links at index 5 based on column post_content using expression grel:forEach(value.parseHtml().select("a[href~=]"),e,e.htmlAttr("href")).join("|||")
  3. Split multi-valued cells in column outbound_links
  4. Create column source_links at index 9 based on column guid using expression grel:row.record.cells["guid"].value[0]

The key step is 2, which mines all the posts for <a href> tags and extracts them into a new column, joining them together with ‘|||’.
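As with the JISCMail recipe, the href harvesting the GREL expression performs can be sketched with Python’s standard-library HTML parser (the sample post below is invented):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes of <a> tags -- mirrors the GREL
    forEach(value.parseHtml().select("a[...]"),e,e.htmlAttr("href")) step."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

post = ('<p>See <a href="http://ds106.us/p1">this</a> and '
        '<a href="http://example.com/p2">that</a>.</p>')
parser = LinkExtractor()
parser.feed(post)
print("|||".join(parser.links))
```

Joining with ‘|||’ and splitting again may look roundabout, but it is what turns one row per post into one row per post-to-link edge, which is exactly the shape Gephi and NodeXL want.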


Using Refine we can split the outbound_links column to put one link per row (column dropdown, Edit cells > Split multi-valued cells). The last step is to make sure the post source link is included in each row using fill down the right and secure way (via row.record in step 4). Filtering out blank outbound_link rows and exporting using the Custom tabular exporter (the settings I used are here) I get some data that looks like this:

The data meets Gephi and NodeXL

With this data we can open it in Excel and then import into NodeXL using Import > From Open Workbook (more about that later), or alternatively, using Gephi, change the column headings in the .csv version to source and target and import via the ‘Data Laboratory’ tab (for graph manipulation in Gephi Tony Hirst has a great tutorial).

Sticking with Gephi for now you can generate the images below. On the left is the entire graph which depicts the ~59,000 links contained in ~20,000 ds106 blog posts. The image on the right is zoomed in to the inset where we can start seeing each webpage referenced as a dot connected by a line which is a link in a post. Whilst this is pretty, it’s also pretty useless.

All links from ds106 links Close-up of post links

So where to go with this? In the paper A Theoretical Proposal of Learning Communities Management Through the Analysis of the Organizational Blogosphere the authors propose using the measure of ‘betweenness centrality’ (BC) as a way of identifying influential learners within a blogging collective:

The betweeness centrality has been considered in literature (Marsden, 2002) as a way to find the most valuable nodes within a social network. The strategic function of these characters is fundamental for the system of learners, because they have the widest view of what happens in the net. This is due to the fact that they intercept the majority of the knowledge and information flows within the learning community. The people with the highest betweeness centrality can share their vision on the main knowledge claims of the community in an open debate.

Extrapolating this slightly, let’s consider individual blog posts rather than individual learners, using BC as a way to highlight posts that may bridge concepts or support the community of learners in some way. Rather than examine all links in the blog posts, let’s first consider internal linking within the ds106 community.

To do this let’s turn back to NodeXL, which I find a lot easier to get data into and manipulate. Opening our export file from Google Refine and then using Import > From Open Workbook we can import all the data, source and target being edge properties, the rest Vertex 1 properties. This gives you over 75,000 connections between over 59,000 links – back at square one. A nice feature of NodeXL is being able to tell it to ‘skip’ some data (edge, vertex or group). The data is kept in the spreadsheet so you can include it later, but it’s not used for any calculations. I only want to analyse ds106 posts that link to other ds106 posts. Here are brief notes on the way I did it (other ways are possible and might even be better).

Autofill options

On the ‘Edges’ sheet in the ‘Other columns’ section I created 4 columns named Community Link, Self Link, Combined and Internal Link, then added the following formula in each column:

  • =IF(ISERROR(VLOOKUP([@[Vertex 2]], Vertices!A:AF,30 ,FALSE)),0,IF(LEN(VLOOKUP([@[Vertex 2]], Vertices!A:AF,30, FALSE))>12,1 ,0)) – because we only have dates, titles, names for posts made by the ds106 community, if this detail is blank on the vertices sheet then the edge is outwith the community (if it exists enter 1, otherwise 0) Tip: after doing this I copied all the Community Link values and Paste Special, Values to save recalculation on edit/open.
  • =IF([@[Vertex 1]]=[@[Vertex 2]],0,1) – tests if the edge is a self-link (0 if it is, 1 otherwise)
  • =[@[Community Link]]*[@[Self Link]] – multiplying these together as a simple operator
  • =IF(ISERROR(OR(FIND("/",[@[Vertex 1]],12), FIND("/",[@[Vertex 2]],12))), 0, IF(LEFT([@[Vertex 1]], FIND("/",[@[Vertex 1]], 12))=LEFT([@[Vertex 2]], FIND("/",[@[Vertex 2]],12)), 1,0)) – used to test if edge is two posts by the same author
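The last formula is the hardest to read. In Python terms, the same same-author test (compare the url prefixes up to the first ‘/’ after position 12, i.e. past the ‘http://’ scheme and into the domain) looks like this – the urls are invented examples:

```python
def same_author(u1, u2):
    """Rough equivalent of the Excel formula: two links count as the same
    author when their urls share the prefix up to the first '/' that
    appears after position 12 (i.e. the blog's domain)."""
    try:
        i, j = u1.index("/", 12), u2.index("/", 12)
    except ValueError:
        return False  # no path separator found, matching the ISERROR branch
    return u1[:i] == u2[:j]

print(same_author("http://bava.tv/post-a", "http://bava.tv/post-b"))    # True
print(same_author("http://bava.tv/post-a", "http://cogdog.me/post-c"))  # False
```

Comparing domains this way is a heuristic: it assumes one blog per author, which holds for most ds106 participants but not all.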


Using Autofill we can choose to skip edges based on the ‘Combined’ value (and while we are at it, set the edge style based on the ‘Self Link’ value). Once this is done you can use NodeXL to calculate Groups and Metrics, knowing that it’ll skip the edges it doesn’t need.

Once we’ve done all of this and applied some chart options this is the graph we get.

ds106 interlinked blog posts

Taking a closer look at one of the groupings (G1) shown below we can see how ds106 blog posts (the dots) link to each other, a dotted line indicating it’s the same post author referring to their own work.

Group G1

The first impression

We now have a dataset of ds106 blog posts and the webpages that they individually link to. There are some interesting patterns and more work to be done to interpret these. Whilst this technique has surfaced interlinks within the community there is no context in which they are made. Ranking the posts by betweenness centrality we get the following top 10:


It’s interesting that most of these come from tutors (although given the amount of co-creation the role of tutor/student is very blurred), which in some ways isn’t that surprising given they will be summarising and highlighting key work.

The Limitations

Hopefully by this point you are screaming at the screen because of the limitations of this analysis. The focus has been on blog post content, which I hope I’ve shown can be used to identify interesting moments. The big limitation here is that it’s looking at a narrow slice of activity – how students are linking to blog posts without any real context. Another staple of blogs is comments. Comments provide an easy way for a reader to feed back to the author, continuing the dialogue by sharing thoughts, ideas and reflections. It’s also all well and good me doing this, but the really interesting thing would be to put this graphical overview in the hands of tutors and students to see if it helps them. I imagine it wouldn’t be hard to wrap the MySQL query initially used in some PHP and a visualisation library like d3.js and provide users with some situational awareness of what is going on in their community. If you would like to explore the data, the NodeXL file is here for download.

As part of this post was also about the feasibility of using FeedWordPress to aggregate JISC project blogs here are some of my notes:

  • Exporting post content was possible with MySQL access (interfaces would be easy to design)
  • Issue encountered with how post links are stored (blogspot/blogger feeds use an internal guid=, whereas other guids are links to the original posts). This was solved by fetching the comment feed and parsing the page link from the response
  • Need to be aware that FeedWordPress can't get content for partial rss feeds but plugins are available
  • Registering a feed is a manual process
  • Issue with mixed-mode feeds (if you are automatically pulling in all content and the feed source has mixed use, e.g. the blog is a mixture of personal and work, then the data isn't as clean)
  • Getting the data into Google Refine made it easy to mine post content e.g. in one expression I'd calculated that ds106 has published over 4 million words

There are other technicalities of FeedWordPress I'd want to explore like how post edits are handled. Overall it was a lot easier having a single data source.

Your Thoughts

At this point it would be very useful to have your thoughts on this technique. And remember I’m a ‘maker of things’ so your suggestions may get incorporated into a functional tool ;)

In my next post in this series I’ll share my notes on building and analysing comments data.


Yesterday I got a query about search term clustering, not a topic I’ve got a huge amount of experience with, so a quick look at the Wikipedia section on free open-source data mining software and applications turned up ‘Carrot2: Text and search results clustering framework’. Carrot2 (C2) has a desktop application (Carrot2 Workbench) and its clustering engine is used in a number of other tools including Apache Solr. You can also have a play with an online version of Carrot2. Out of the box you can use the desktop application to query and cluster a number of existing sources including Bing and Wikipedia. If you want to play with other data sources you can point C2 at other XML feeds, or even XML documents as long as they are in the Carrot2 format. The structure of this is relatively straightforward: all you need is a title, url and snippet (the url appears to be used for one of the clustering algorithms and part of the application interface to let you navigate to documents, so you could probably fill this with junk if you don’t have a valid link).

To have a quick play with this I thought I’d see what would happen if I passed a twitter archive for #or2012 into C2 and here’s the result.

Getting the data into C2

There are a number of ways I could have got the data out in C2 XML format, like exporting a range to csv and converting to xml with an XSLT stylesheet, or using the new Apps Script Content Service to generate a custom xml file. Instead, for speed, I decided to use Google Refine to import the spreadsheet straight from Google Docs:

Google Refine Create Project from Google Doc

... and then use the Templating Export tool to generate the C2 xml.

Google Refine Templating Export

For ease, here are the values I used for the prefix, row template and suffix if you want to copy and paste.


Prefix

<?xml version="1.0" encoding="UTF-8"?>

Row template

    <document id="{{row.index}}">

Row separator

[single carriage return]



If you want to play along here is a copy of the #or2012 archive in C2 xml
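Refine’s Templating exporter did the work above, but for completeness here’s a hedged Python sketch that writes the same minimal Carrot2 structure (a searchresult root of document elements, each with title/url/snippet) from a list of rows; the sample tweet is invented:

```python
from xml.sax.saxutils import escape

def to_carrot2(rows):
    """Build a minimal Carrot2-format XML document from
    (title, url, snippet) tuples."""
    parts = ['<?xml version="1.0" encoding="UTF-8"?>', "<searchresult>"]
    for i, (title, url, snippet) in enumerate(rows):
        parts.append(
            '<document id="%d"><title>%s</title><url>%s</url>'
            '<snippet>%s</snippet></document>'
            % (i, escape(title), escape(url), escape(snippet)))
    parts.append("</searchresult>")
    return "\n".join(parts)

xml = to_carrot2([("RT @foo: slides & notes",
                   "http://example.com/1",
                   "slides for #or2012")])
print(xml)
```

Escaping the values matters here: tweet text is full of ampersands and angle brackets that would otherwise break the XML when Workbench loads it.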

Processing the data in Carrot2

Open Carrot2 Workbench and in the search panel set the source to XML, pick an algorithm (STC gave a decent result), and the XML resource (which can be the url to the dropbox file included above or a local copy) then scroll this panel down a bit to hit the Process button.

Carrot2 Workbench Search Panel
Carrot2 Workbench Search Panel

The results

Switching to ‘Visualization’ mode using the button at the top right of Workbench I get this view (click to enlarge):

Carrot2 Workbench Visualization Mode
Carrot2 Workbench Visualization Mode

The interface is designed to be used as ‘exploratory analytics’. Clicking on elements like the ‘Clusters’ folder list updates the view in the Aduna Cluster Map and Circles Visualisation as well as listing ‘Documents’ (tweets) related to the cluster. Clicking on a tweet from the Documents list or one of the dots in the Circles Visualization opens it in a tab within Workbench (hopefully that keeps Twitter happy for liberating their data ;s).

View a tweet in Carrot2


This has only been a quick first play so I’m sure I’m missing loads in terms of tuning the processing. The STC algorithm appears to be very coarse-grained, detecting 16 or so clusters. It’s useful to have a cluster of retweets, which could be exported and further analysed. Switching to the Lingo algorithm generates 102 clusters, a number of these being RT + screen name. In some ways it would be useful to define some stopwords like ‘or2012’ and ‘RT’ (I’m sure an option must be in there). Part of the reason for publishing the raw data for this is in the hope that someone who actually knows what they are doing can show me what is possible. So over to you – I’ve shown you how to get the data in, help me get something meaningful out ;)


Working on some OER Visualisation Project work today I found I needed to get some additional information from an external web service for a Google Spreadsheet I was working on. I could of course have just used Google Apps Script, a native feature within Google Spreadsheets, to do a UrlFetch and get the information, but my source data was a bit messy and timeouts on Apps Script prevent you from getting lots of data in one hit. Instead it made more sense to download the data into Google Refine, clean it up, call external web services to get additional information, then somehow get it back into the original spreadsheet.

As I recently posted in Integrating Google Spreadsheet/Apps Script with R: Enabling social network analysis in TAGS, it wasn’t much of a cognitive leap to try a similar trick of publishing the spreadsheet as a service to allow a basic API write. In the R example, however, I used the POST method to pass data, and whilst I’m pretty sure with the right Jython libraries you could do something similar in Refine, I wanted to keep it simple, so here’s my method for using GET instead.

  1. In a Google Spreadsheet open Tools > Script editor… and paste the following code (there is some tweaking required for where you want the data to go and the secret passphrase – in this example I want to pass back a value named geo)
  2. Once tweaked, Run > Setup (this gets a copy of the spreadsheet id needed to run as a service)
  3. Open Share > Publish as service..  and check ‘Allow anyone to invoke’ with ‘anonymous access’, not forgetting to ‘enable service’. You’ll need a copy of the service URL for later on. Click ‘Save’
    Publish as service
  4. Back in Google Refine select a column with some data you want to pass and choose Edit column > Add column by fetching URLs…
  5. The expression you use will be something like the one below, where we have the service url from step 3, the ‘secret’ to match the one set in step 1, and whatever data you want to include 
    Add column dialog

Note: You’ll see from the script that my import included an idx number which matched a row number, but as long as I didn’t add any additional rows or change the order, I could have used row.index in the expression instead.

You also might want to play around with the throttling. I had problems with my first go because this was either set too low or my values weren’t url encoded.
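To illustrate the url-encoding point: the fetch expression just builds a GET url from the service url, the secret and the cell values. In Python terms it looks like this (the service url and the parameter names secret/idx/geo are placeholders for whatever your script expects):

```python
from urllib.parse import urlencode

# Hypothetical service url -- substitute the one Google gives you in step 3.
SERVICE_URL = "https://script.google.com/macros/s/EXAMPLE/exec"

def build_fetch_url(secret, idx, geo):
    """Build the GET url OpenRefine would fetch; urlencode handles the
    escaping that caused my first-attempt errors."""
    return SERVICE_URL + "?" + urlencode({"secret": secret,
                                          "idx": idx,
                                          "geo": geo})

url = build_fetch_url("passphrase", 2, "55.95, -3.19")
print(url)
```

Anything with spaces or commas in it (like the coordinate pair here) has to be encoded, otherwise the service sees a truncated query string.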

I’d be interested if anyone finds this useful, or if anyone can show how to do it in Jython. Given the amount of Google Spreadsheet work I do, I’m sure I’ll use this trick again ;)


This might be slightly off-topic for the OER Visualisation project, but I followed an idea, did – what I think are – some interesting things with an archive of tweets, and thought I would share. This line of thought was triggered by a tweet from Terry McAndrew in which he asked:

@mhawksey Have you a visualisation planned of JORUM 'customers' of OER (and the rest of it for that matter).

Tracking who views or uses OER material can be tricky, but not impossible; the main issue comes when someone else like me comes along and wants to see who else has viewed, used or remixed the resource. For example, with the Jorum ukoer resources the only usage data I could get was each resource’s page views, and even getting this required scraping over 8,000 pages. This is mentioned in day 18, but I realise I haven’t gone into any detail about how Google Refine was used to get this data – if someone wants me to, I will reveal all.

A recent development in this area is the US Learning Registry project, which is looking to maximise the use of activity data to support resource discovery and reuse. On Lorna Campbell’s CETIS blog there is a very useful introduction to the Learning Registry announcing JISC’s involvement in a UK node. The post includes the following use case which helps illustrate what the project is about:

“Let’s assume you found several animations on orbital mechanics. Can you tell which of these are right for your students (without having to preview each)? Is there any information about who else has used them and how effective they were? How can you provide your feedback about the resources you used, both to other teachers and to the organizations that published or curated them? Is there any way to aggregate this feedback to improve discoverability?

The Learning Registry is defining and building an infrastructure to help answer these questions. It provides a means for anyone to ‘publish’ information about learning resources. Beyond metadata and descriptions, this information includes usage data, feedback, rankings, likes, etc.; we call this ‘paradata’”

Lorna’s post was made in November 2011 and since then Mimas have started The JLeRN Experiment (and if you are a developer you might want to attend the CETIS contributors event on the 23rd Jan).

Plan A – Inside-out: Repository resource sharing data

All of this is a bit late for me, but I thought I’d see what ‘paradata’ I could build around ukoer and see if there were any interesting stories to be told as part of the visualisation project. This is an area I’ve visited before with a series of ‘And the most socially engaging is …’ posts which started with And the most engaging JISC Project is… For this I used Google Spreadsheet to get social share counts (Facebook, Twitter and more) for a list of urls. One of the issues with this technique is that Google Spreadsheets time out after 5 minutes, so there is a limit to the number of links you can get through – this is, however, not a problem for Google Refine.

Taking 380 of the most viewed Jorum ukoer tagged resources (approximately 5% of the data) I used Google Refine to

  1. ‘Add column by fetching URL’ passing the resource link url into Yahel Carmon’s Shared Count API -  using the expression ""+escape(value,"url") 
  2. Add column based on the results column parsing each of the counts – e.g. using expressions similar to parseJson(value)['Twitter']
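Step 2’s parseJson() calls can be sketched in Python. The response below is a made-up sample, and the field names are assumptions based on the expression above (the GREL used parseJson(value)['Twitter']):

```python
import json

def parse_counts(response_text, fields=("Twitter", "Facebook")):
    """Pull named share counts out of a Shared Count style JSON response,
    defaulting to 0 when a field is missing."""
    data = json.loads(response_text)
    return {f: data.get(f, 0) for f in fields}

sample = '{"Twitter": 7, "Facebook": 0, "Diggs": 0}'
print(parse_counts(sample))  # {'Twitter': 7, 'Facebook': 0}
```

Doing one fetch per url and one parse per network keeps each Refine step simple, and the raw JSON column stays around if you later want another network’s count.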

At this point I stopped parsing columns because it was clear that there was very little if any social sharing of Jorum ukoer resources (here is a spreadsheet of the data collected).  In fact the most tweeted resource, which Twitter records as having 7 tweets, gets most of these following my tweet about it being the most viewed resource.

So what might be going on here? Is this just an issue for national repositories? Are people consuming ukoer resources from other repositories? Are Jorum resources not ranking well in search engines? Are resources not being marketed enough?

I don’t have answers to any of those questions, and maybe this is just an isolated case, but the ‘marketing’ aspect interests me. When I publish a blog post I’ll push it into a number of communication streams including RSS, a couple of tweets, the occasional Google+ post. For posts I think are really worthwhile I’ll set up a Twitter search column on Tweetdeck with related keywords and proactively push posts to people I think might be interested (I picked up this idea from Tony Hirst’s Invisible Frictionless Tech Support). Are we doing something similar with our repositories?

Plan B – Outside-in: Community resource sharing data

There’s probably a lot more to be said about repositories or agencies promoting resources. To try and find some more answers, instead of looking from the repository perspective of what is being shared, I thought it’d be useful to look at what people are sharing. There are a number of ways you could do this, like selecting and monitoring a group of staff. I don’t have time for that, so decided to use data from an existing community who are likely to be using or sharing resources, aka the #ukoer Twitter hashtag community. [There are obvious issues with this approach, but I think it’s a useful starter for ten]

Having grabbed a copy of the #ukoer Twapper Keeper archive using Google Refine before it disappeared I’ve got over 8,000 tweets from 8th March 2010 to 3rd January 2012.  My plan was to extract all the links mentioned in these tweets and identify any patterns or particularly popular links.

Extracting links and expand shortened urls

As most tweets now get their links replaced with shortened urls, the first step was to extract and expand all the links mentioned in tweets. Link expansion was achieved using a url-expansion API, which has the added bonus of returning the meta description and keywords for target pages. Here’s a summary of the Google Refine actions I did (taken from the undo/redo history):

  1. Create new column links based on column item - title by filling 8197 rows with grel:filter(split(value, " "),v,startsWith(v,"http")).join("||")
  2. Split multi-valued cells in column links
  3. Create column long_url at index 4 by fetching URLs based on column links using expression grel:""+escape(value,"url")+"&format=json&user-agent=Google%20Refine&title=1&meta-keywords=1"
  4. Create new column url based on column long_url by filling 5463 rows with grel:parseJson(value)['long-url']
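Steps 1 and 2 above – pulling link-like tokens out of each tweet and then splitting the multi-valued cell – can be sketched in Python (the sample tweet is illustrative):

```python
def extract_links(tweet_text):
    """Mirror grel:filter(split(value, " "), v, startsWith(v, "http")).join("||"):
    keep the whitespace-separated tokens that look like links."""
    return "||".join(t for t in tweet_text.split(" ") if t.startswith("http"))

def split_multivalued(joined):
    """Mirror Refine's 'Split multi-valued cells' on the || separator."""
    return joined.split("||") if joined else []

tweet = "New #ukoer resource http://bit.ly/abc123 via http://example.org"
links = split_multivalued(extract_links(tweet))
print(links)  # ['http://bit.ly/abc123', 'http://example.org']
```

Steps 3 and 4 then fetch each short link through the expansion API and read the long-url field back out of the JSON, exactly as in the share-count recipe earlier.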

With long urls extracted the data was exported and uploaded to Google Spreadsheet so that a list of unique urls and their frequencies could be calculated. Here is the Spreadsheet of data.  From the refined data there are 2,482 different links which appear 6,220 times in the #ukoer archive. Here is the searchable table of extracted links with frequencies (sorry for the small font – can’t seem to find a way to control column width using the Google Visualisation API … anyone?).
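The unique-url/frequency tally done in the spreadsheet amounts to a simple count over the expanded url column; a minimal sketch with toy data (the urls here are illustrative, not the real counts):

```python
from collections import Counter

# urls: the expanded long_url column exported from Refine (toy data here)
urls = [
    "http://openspires.oucs.ox.ac.uk/",
    "http://www.jorum.ac.uk/",
    "http://openspires.oucs.ox.ac.uk/",
]

freq = Counter(urls)
print(len(freq))            # 2 distinct links
print(freq.most_common(1))  # [('http://openspires.oucs.ox.ac.uk/', 2)]
```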

Not surprisingly a number of domain root urls appear at the top of the list. More work needs to be done to match resource sharing to different OER sources, but you can start doing simple filtering to find out what’s there. Something from the initial analysis I find interesting is that the most common link in the archive is to Oxford’s Open Advent Calendar, which was a daily posting highlighting some of their OER resources. This could be interpreted as underlining the need for OER resources to be more effectively marketed. I’ll let you decide.

PS out of interest I put the list of 2,500-odd links back into Google Refine and extracted social share counts. I haven’t had a chance to look at the data closely, but if you want to play, a copy of it and a metadata-enhanced version of the #ukoer archive is here. Please share any findings ;)


Last month I posted Free the tweets! Export TwapperKeeper archives using Google Spreadsheet, which was a response to the announcement that TwapperKeeper would be removing public access to archives on the 6th January. This solution was limited to archives smaller than 15,000 tweets (although minor tweaking could probably get more). Since then Tony Hirst has come up with a couple of other solutions:

One of the limits these solutions have is that they only collect the data stored on TwapperKeeper, missing lots of other juicy data like in_reply_to, location and retweet_count (here’s what a tweet used to look like; now there is more data). Whilst this data is probably of little interest to most people, for people like me it opens the opportunity to do other interesting stuff. So here’s a way you can make a copy of a Twapper Keeper archive and rebuild the data using Google Refine.

  1. You’re going to need a copy of Google Refine – install and run it
  2. Visit the Twapper Keeper page of the archive you want. On the page, copy the RSS Permalink into the URL box in Google Refine, adding &l=50000 to the end, and click Next.
  3. In the preview window that appears switch ‘Parse data as’ to XML files. Scroll the top pane down to hover over the ‘item’ tag and click it
    Refined parse xml 
  4. You should now have a preview with the data split in columns. Enter a Project name and click ‘Create Project’
    Refined Columns
  5. From the ‘item – title’ column dropdown select Edit column > Add column based on this column…
  6. In the dialog that opens enter a New column name ‘id_str’ and the Expression smartSplit(value," ")[-1] (this splits the cell and returns the last group of text)
  7. From the new id_str column dropdown select Edit column > Add column by fetching URLs… and enter a name ‘meta’, change throttle delay to 500 and enter the expression ""+value+"&include_entities=true" (that’s with quotes), then click OK.  [What we’ve done is extract a tweet id and then asked refine to fetch details about it from the Twitter API]
  8. Wait a bit (it took about 4 hours to get data from 8,000 tweets)
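Steps 6 and 7 can be sketched in Python. The sample title and the API base url are assumptions for illustration (the title format just needs to end with the tweet id, which is what the smartSplit expression relies on):

```python
def tweet_id_from_title(title):
    """Mirror the Refine expression smartSplit(value," ")[-1]:
    the TwapperKeeper item title ends with the tweet's numeric id."""
    return title.split()[-1]

def status_lookup_url(id_str, api_base="https://api.example.com/statuses/show.json?id="):
    """Build the per-tweet fetch url as in step 7; api_base is a placeholder
    for the Twitter status-lookup endpoint the recipe relied on."""
    return api_base + id_str + "&include_entities=true"

title = "mhawksey: Free the tweets! #ukoer 157236111416696832"
print(tweet_id_from_title(title))  # 157236111416696832
```

Refine then does the slow part for you: fetching each of those urls, throttled, into a new ‘meta’ column.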

Once you have the raw data you can use Refine to make new columns using the expression parseJson(value) and then navigating the object namespace. You might find it useful to paste a cell value into a JSON formatter to see the structure. So to make a column which extracts the tweet author’s full name you’d use parseJson(value) followed by the path to the name field.
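Navigating the object namespace looks like this in Python – the trimmed payload below follows the field names of the old per-status Twitter JSON (used illustratively), and the Refine equivalent of the lookup would be bracket access on parseJson(value), as with the share counts earlier:

```python
import json

# A trimmed example of the JSON returned per status; field names are
# illustrative of the old per-tweet payload, not a complete record.
raw = ('{"id_str": "157236111416696832", "retweet_count": 3,'
       ' "user": {"name": "Martin Hawksey", "screen_name": "mhawksey"}}')

status = json.loads(raw)        # the Refine equivalent is parseJson(value)
print(status["user"]["name"])   # Martin Hawksey
print(status["retweet_count"])  # 3
```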

So don’t delay, ‘free the tweets’!