Tag Archives: #ukoer


I started writing this last week so the intro doesn’t make sense. Slides from the presentation I did are here (all pictures so they probably also make very little sense).

To paraphrase Stephen Heppell (I’ve misquoted this before):

Content is king, but community is sovereign

The UKOER Programme is having its final meeting next week and while the final set of projects come to a close a strong community has formed and I’m sure will continue. Something I was interested in doing is looking at how the community has evolved over time. I’ve previously looked at data around the #ukoer hashtag, but not everyone uses Twitter so I thought I’d look for another data source. As email is still a strong component in most people’s everyday lives I started poking around the OER-DISCUSS JISCMail (Listserv) list:

A public list for discussion about the release, use, remix and discovery of Open Educational Resources (OER). Managed jointly by OU SCORE Project, and the JISC / HE Academy OER Programme.

As far as I could see there are limited options for getting data out of JISCMail (some limited RSS/Atom feeds) so cue the music for a good old-fashioned scrape and refine. Whilst I’ll walk you through this for OER-DISCUSS the same recipe can be used for other public lists.

Source survey

Instead of going straight into the recipe I wanted to record some of the initial thought processes in tackling the problem. This usually begins with looking at what you’ve got to work with. Starting with the list homepage I can see some RSS/Atom feeds, but they don’t take me far, so instead I turn my attention to the list of links for each month’s archives. Clicking through to one of these and poking around the HTML source (I mainly use Chrome so a right click in the page gives the option to Inspect Element) I can see that the page uses a table template structure to render the results – good. Next I checked if the page would render even when I was logged out of JISCMail, which it does – good. So far so good.

page source

Next a step back. This looks scrapable so has anyone done this before? A look on Scraperwiki turns up nothing on Listserv or JISCMail, so next a general Google search. Searching for terms like ‘listserv data scrape’ is problematic because there are lots of listserv lists about data scraping in general. So we push on. We’ve got a page with links to each month’s archives and we know each archive uses a table to lay out results. Next it’s time to start thinking about how we get data out of the tables. Back in the Chrome Element Inspector we can see that the source contains a lot of additional markup for each table cell and in places cells contain tables within them. At this point I’m thinking OpenRefine (née Google Refine).

Scraping list of archive links

A feature of OpenRefine I use a lot is fetching data from a url. To do this we need a list of urls to hit. Back on the list homepage I start looking at how to get the links for each month’s archive. Hovering over the links I can see they use a standard sequence with a 2-digit year {yy} and month {mm}.


I could easily generate these in a spreadsheet but I’m lazy so I just point a Chrome extension I use called Scraper at the part of the page I want and import to a Google Spreadsheet.


[another way of doing this is creating a Google Spreadsheet and in this case entering the formula =ImportXml("https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=OER-DISCUSS","//tr[2]/td/ul/li/a/@href")]
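If you’d rather script this than use a spreadsheet, the same XPath idea can be sketched in Python. Treat this as an illustration only: the HTML fragment and the hrefs in it are made up, and the standard library’s ElementTree only understands a subset of XPath and needs well-formed markup, so a real page would probably need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the list homepage (not the real JISCMail markup).
SAMPLE = """
<table>
  <tr><td>Header row</td></tr>
  <tr><td><ul>
    <li><a href="/archive/2013-01">January 2013</a></li>
    <li><a href="/archive/2012-12">December 2012</a></li>
  </ul></td></tr>
</table>
"""


def archive_links(markup):
    """Return the href of every link matched by tr[2]/td/ul/li/a."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(".//tr[2]/td/ul/li/a")]


links = archive_links(SAMPLE)
print(links)
```

The path is the same one used in the ImportXml formula, minus the trailing /@href, which ElementTree doesn’t support; the attribute is read with get("href") instead.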

Fetching and Refining the Data

Import project

Finally we can fire up OpenRefine. You could create a project by using the Google Data option, which is used to import data from your Google Docs, but as it’s not a huge amount of data I use the Clipboard option instead. At this point the preview will probably separate the data using ‘/’ and use the first row as a column heading, so you’ll want to switch to comma and de-select ‘Parse next’.

  1. Next we want to fetch each month’s archive page by using the Column 1 dropdown to Edit column > Add column by fetching url using the GREL expression "https://www.jiscmail.ac.uk"+value using the column name month_raw
  2. This pulls in each month’s archive page in raw html. Now we want to parse out each row of data in a new column by selecting the dropdown from month_raw and selecting Edit column > Add column based on this column  using the GREL expression forEach(value.parseHtml().select("table.tableframe")[1].select("tr"),v,v).join(";;;") with the column name rows_raw – this selects the second table with class ‘tableframe’ and joins each row with a ‘;;;’
  3. Next from the rows_raw column use Edit cells > Split multi-valued cells using ;;; as the separator
  4. Again from the rows_raw column dropdown select Edit column > Add column based on this column using the GREL expression forEach(value.parseHtml().select("td"),v,v).join(";;;") with the column name rows_parsed – this joins each <td> with a ;;; which will let us split the values into new columns in the next step
  5. Now from the rows_parsed column select Edit column > Split into several columns using the separator ;;;
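For comparison, here is roughly what steps 2 to 5 are doing, sketched in Python with the standard library. The sample table is invented and far simpler than the real archive page (no nested tables, no extra markup), so this shows the row/cell splitting idea rather than a drop-in scraper. Lists stand in for the ‘;;;’ join-and-split trick.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented sample in the spirit of one month's archive table.
SAMPLE = """
<table class="tableframe">
  <tr><td><a href="/msg/1">Re: Open licences</a></td><td>A. Person</td><td>12 lines</td></tr>
  <tr><td><a href="/msg/2">New OER toolkit</a></td><td>B. Person</td><td>40 lines</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
rows = parser.rows
print(rows)
```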

Column split

You should now have something similar to above with columns and rows split out, but still messy with html in the cells. We can clean these up using Edit cells > Transform using variations of value.parseHtml().htmlText()

Here are the steps I used (the complete operation history you can use in Undo/Redo is here – using this apply all the actions starting with the list of monthly urls)

  1. Text transform on cells in column rows_parsed 4 using expression grel:value.parseHtml().htmlText().replace(" lines","").toNumber()
  2. Rename column rows_parsed 4 to lines
  3. Text transform on cells in column rows_parsed 3 using expression grel:value.parseHtml().htmlText().toDate("EEE, dd MMM y H:m:s")
  4. Rename column rows_parsed 3 to date
  5. Text transform on cells in column rows_parsed 2 using expression grel:value.parseHtml().htmlText().replace(" <[log in to unmask]>","")
  6. Rename column rows_parsed 2 to from
  7. Create column snippet at index 4 based on column rows_parsed 1 using expression grel:value.split("showDesc('")[1].split("','")[0].unescape("html").parseHtml().htmlText()
  8. Create column link at index 4 based on column rows_parsed 1 using expression grel:"http://jiscmail.ac.uk"+value.parseHtml().select("a")[0].htmlAttr("href")
  9. Text transform on cells in column rows_parsed 1 using expression grel:value.parseHtml().htmlText()
  10. Rename column rows_parsed 1 to subject
  11. Create column subject_normal at index 4 based on column subject using expression grel:trim(value.replace(/^Re:|^Fwd:/i,""))
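The text transforms above translate fairly directly into other languages. Below is a rough Python equivalent for a single row once the html has been stripped; the sample values are made up and the function name clean_row is just for illustration. Like the GREL version, the subject normalisation only strips one leading Re: or Fwd:.

```python
import re
from datetime import datetime


def clean_row(subject, sender, date_text, lines_text):
    """Mirror the GREL transforms on one already-split row."""
    return {
        "subject": subject,
        # Same pattern as the GREL replace: drop one leading Re:/Fwd:, then trim.
        "subject_normal": re.sub(r"^Re:|^Fwd:", "", subject, flags=re.IGNORECASE).strip(),
        # Remove the masked email address the archive appends to names.
        "from": sender.replace(" <[log in to unmask]>", ""),
        # Python spelling of the "EEE, dd MMM y H:m:s" date pattern.
        "date": datetime.strptime(date_text, "%a, %d %b %Y %H:%M:%S"),
        "lines": int(lines_text.replace(" lines", "")),
    }


row = clean_row(
    "Re: Open licences",
    "A. Person <[log in to unmask]>",
    "Mon, 14 Jan 2013 10:05:32",
    "12 lines",
)
print(row)
```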

You’ll probably notice some of the rows don’t contain the data we need. An easy way to remove these is to use a timeline facet on the date column selecting non-time, blank and error and then from the All column dropdown menu select Edit rows > Remove all matching rows.

Tony Hirst has a great post on faceting tricks. Something not covered is clustering data using facets. We use this as a way to join rows where authors have multiple logins, e.g. Tony Hirst and Tony.Hirst.

To do this add a text facet to the author/from column and click Cluster:

Facet view

Finding the right settings is a bit of trial and error combined with a bit of knowledge about your dataset. In this example I saw that there were rows for Pat Lockley and Patrick Lockley so tried some settings until I got a hit (in this case using nearest neighbour and PPM – which by all accounts is the last resort). You might also need to run clustering a couple of times to catch most of the variations.
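To give a feel for what nearest neighbour clustering is doing, here is a small Python sketch. It is not OpenRefine’s PPM method: I’ve swapped in the similarity ratio from the standard library’s difflib as a stand-in, and the 0.8 threshold is an arbitrary choice that happens to suit these names.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """True when two names are close enough to be treated as one author."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster_names(names, threshold=0.8):
    """Greedy clustering: add each name to the first cluster holding a similar name."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if any(similar(name, member, threshold) for member in cluster):
                cluster.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([name])
    return clusters


names = ["Pat Lockley", "Patrick Lockley", "Tony Hirst", "Tony.Hirst", "Amber Thomas"]
clusters = cluster_names(names)
print(clusters)
```

As with the OpenRefine dialog, the threshold is where the trial and error comes in: too low and different people get merged, too high and variants are missed.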

Clustering columns

What can we do with this data?

In Dashboarding activity on public JISCMail lists using Google Sheets (Spreadsheets) I was able to get an indication of the overall activity of the list. Now that author names are clustered I can get a more accurate picture of the top contributors using a text facet on the author column:

Pat Lockley 269, Amber Thomas 122, Phil Barker 82 

I was tempted to mine the individual posts further using the techniques and code posted by Adam Cooper (CETIS), but it didn’t look like I could easily swap the data source. A couple of questions posed by Phil Barker (CETIS) were:

The easiest way I found to get this was to use the Export > Custom tabular exporter (using this settings file), import into Google Sheets (Spreadsheets) and use a couple of formulas to get this summary page (opening the summary page will let you see the formulas I’ve used):

And there is much more you could do, but I’ll leave it there. If nothing else I hope you get an idea of some of the thought processes involved in extracting data. As always if something is unclear get in touch.

Jorum has a Dashboard Beta (for exposing usage and other stats about OER in Jorum) up for the community to have a play with: we would like to get your feedback!

For more information see the blog post here: http://www.jorum.ac.uk/blog/post/38/collecting-statistics-just-got-a-whole-lot-sweeter

Pertinent info: the Dashboard has live Jorum stats behind it, but the stats have some irregularities, so the stats themselves come with a health warning. We’re moving from quite an old version of DSpace to the most recent version over the summer, at which point we will have more reliable stats.

We also have a special project going over the summer to enhance our statistics and other paradata provision, so we’d love to get as much community feedback as possible to feed into that work. We’ll be doing a specific blog post about that as soon as we have contractors finalised!

Feedback by any of the mechanisms suggested in the blog post, or via discussion here on the list, all welcome.

The above message came from Sarah Currier on the [email protected] list. This was my response:

It always warms my heart to see a little more data being made openly available :)

I imagine (and I might be wrong) that the main users of this data might be repository managers wanting to analyse how their institutional resources are doing. So to be able to filter uploads/downloads/views for their resources and compare with overall figures would be useful.

Another (perhaps equally important) use case would be individuals wanting to know how their resources are doing, so a personal dashboard of resources uploaded, downloads, views would also be useful. This is an area Lincoln's Bebop project were interested in so it might be an idea to work with them to find out what data would be useful to them and in what format (although saying that think I only found one #ukoer record for Lincoln {hmm I wonder if anyone else would find it useful if you pushed data to Google Spreadsheets a la Guardian datastore (here's some I captured as part of the OER Visualisation Project}) ).

I'm interested to hear what the list think about these two points

You might also want to consider how the data is licensed on the developer page. Back to my favourite example, Gent use the Open Data Commons licence  http://opendatacommons.org/licenses/odbl/summary/

So what do you think of the beta dashboard? Do you think the two use cases I outline are valid or is there a more pertinent one? (If you want to leave a comment here I’ll make sure they are passed on to the Jorum team, or you can use other means).

[I’d also like to add a personal note that I’ve been impressed with the recent developments from Jorum/Mimas. There was a rocky period when I was at the JISC RSC when Jorum didn’t look aligned to what was going on in the wider world, but since then they’ve managed to turn it around and developments like this demonstrate a commitment to a better service]

Update: Bruce Mcpherson has been working some Excel/Google Spreadsheet magic and has links to examples in this comment thread


Lou McGill from the JISC/HEA OER Programme Synthesis and Evaluation team recently contacted me as part of the OER Review asking if there was a way to analyse and visualise the Twitter followers of @SCOREProject and @ukoer. Having recently extracted data for the @jisccetis network of accounts I knew it was easy to get the information, but making it meaningful was another question.

There are a growing number of sites like twiangulate.com and visual.ly that make it easy to generate numbers and graphics. One of the limitations I find with these tools is they produce flat images and all opportunities for ‘visual analytics’ are lost.

Click to see twiangulate comparison of SCOREProject and UKOER
Twiangulate data
Click to see visual.ly comparison of SCOREProject and UKOER

So here’s my take on the problem. A template constructed with free and open source tools that lets you visually explore the @SCOREProject and @ukoer Twitter following.

Comparison of @SCOREProject and @ukoer

In this post I’ll give my narrative on the SCOREProject/UKOER Twitter followership and give you the basic recipe for creating your own comparisons (I should say that the solution isn’t production quality, but I need to move onto other things so someone else can tidy up).

Let’s start with the output. Here’s a page comparing the Twitter Following of SCOREProject and UKOER. At the top each bubble represents someone who follows SCOREProject or UKOER (hovering over a bubble we can see who they are and clicking filters the summary table at the bottom).

Bubble size matters

There are three options to change how the bubbles are sized:

  • Betweenness Centrality (a measure of the community bridging capacity); (see Sheila’s post on this)
  • In-Degree (how many other people who follow SCOREProject or ukoer also follow the person represented by the bubble); and
  • Followers count (how many people follow the person represented by the node)
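In-Degree is the easiest of the three to reason about, so here is a tiny Python sketch of the idea. The edge list is invented: each pair reads (follower, followed) and only includes people inside the combined following. Betweenness centrality is a lot more involved and in my recipe it is left to R.

```python
from collections import Counter

# Hypothetical follow edges (follower, followed) among people in the combined following.
edges = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
    ("dave", "carol"),
]


def in_degree(edges):
    """Count, for each account, how many others in the network follow it."""
    return Counter(followed for _follower, followed in edges)


degrees = in_degree(edges)
print(degrees.most_common())
```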

Clicking on the ‘Grouped’ button lets you see how bubbles/people follow either SCOREProject, UKOER or both. By switching between betweenness, degree and followers we can visually spot a couple of things:

  • Betweenness Centrality: SCOREProject has 3 well connected intercommunity bubbles @GdnHigherEd, @gconole and @A_L_T. UKOER has the SCOREProject following them which unsurprisingly makes them a great bridge to the SCOREProject community (if you are wondering where UKOER is, as they don’t follow SCOREProject they don’t appear).
  • In-Degree: Switching to In-Degree we can visually see that the overall volume of the UKOER group grows more despite the SCOREProject bubble in this group decreasing substantially. This suggests to me that the UKOER following is more interconnected.
  • Followers count: Here we see SCOREProject is the biggest winner thanks to being followed by @douglasi who has over 300,000 followers. So whilst SCOREProject is followed by fewer people than UKOER it has a potentially greater reach if @douglasi ever retweeted a message.

Colourful combination

Sticking with the grouped bubble view we can see different colour groupings within the clusters for SCOREProject, UKOER and both. The most noticeable is the light green used to identify Group 4, which has 115 people following SCOREProject compared to 59 following UKOER. The groupings are created using the community structure detection algorithm proposed by Joerg Reichardt and Stefan Bornholdt. To give a sense of who these sub-groups might represent, individual wordclouds have been generated based on the individual Twitter profile descriptions. Clicking on a word within these clouds filters the table. So for example you can explore who has used the term manager in their Twitter profile (I have to say the update isn’t instant but it’ll get there).


Behind the scenes

The bubble chart is coded in d3.js and based on Animated Bubble Chart by Jim Vallandingham. The modifications I made were to allow bubble resizing (lines 37-44). This also required handling the bubble charge slightly differently (line 118). I got the idea of using the bubble chart for comparison from a Twitter Abused post Rape Culture and Twitter Abuse. It also made sense to reuse Jim’s template which uses the Twitter Bootstrap. The wordclouds are also rendered using d3.js by using the d3.wordcloud extension by Jason Davies. Finally the table at the bottom is rendered using the Google Visualisation API/Google Chart Tools.

All the components play nicely together although the performance isn’t great. If I have more time I might play with the load sequencing, but it could be I’m just asking too much of things like the Google Table chart rendering 600 rows. 

How to make your own

I should say that this recipe probably won’t work for accounts with over 5,000 followers. It also involves using R (in my case RStudio). R is used to do the network analysis/community detection side. You can download a copy of the script here. There’s probably an easier recipe that skips this part, which would be worth revisiting.

  1. We start with taking a copy of Export Twitter Friends and Followers v2.1.2 [Network Mod] (as featured in Notes on extracting the JISC CETIS twitter follower network).
  2. Authenticate the spreadsheet with Twitter (instructions in the spreadsheet) and then get the followers of the accounts you are interested in using the Twitter > Get followers menu option
  3. Once you’ve got the followers run Twitter > Combine follower sheets Method II
  4. Move to the Vertices sheet and sort the data on the friends_count column
  5. In batches of around 250 rows select values from the id_str column and run TAGS Advanced > Get friend IDs – this will start populating the friends_ids column with data. For users with over 5,000 friends reselect their id_str and rerun the menu option until the ‘next_cursor’ equals 0 
    next cursor position
  6. Next open the Script editor and open the TAGS4 file and then Run > setup.
  7. Next select Publish > Publish as a service… and allow anyone to invoke the service anonymously. Copy the service URL and paste it into the R script downloaded earlier (also add the spreadsheet key to the R script and within your spreadsheet File > Publish to the web)
    publish as service window
  8. Run the R script! ...  and fingers crossed everything works.
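The ‘rerun until next_cursor equals 0’ part of step 5 is a standard cursor paging loop. Here is a minimal Python sketch of that loop on its own. The fetch_page function and the fake pages are stand-ins I’ve made up so the example runs without touching Twitter, and starting from a cursor of -1 is an assumption about how the first page is requested.

```python
def fetch_all_ids(fetch_page):
    """Keep requesting pages until the service reports next_cursor == 0."""
    ids = []
    cursor = -1  # assumed starting value meaning "first page"
    while cursor != 0:
        page = fetch_page(cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
    return ids


# Stand-in for the real request: three fake pages keyed by cursor.
FAKE_PAGES = {
    -1: {"ids": [1, 2, 3], "next_cursor": 10},
    10: {"ids": [4, 5], "next_cursor": 20},
    20: {"ids": [6], "next_cursor": 0},
}

all_ids = fetch_all_ids(FAKE_PAGES.__getitem__)
print(all_ids)
```

In the spreadsheet you are effectively running this loop by hand, one menu click per page.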

The files used in the SCOREProject/UKOER can be downloaded from here. Changes you’ll need to make are adding the output csv files to the data folder, changing references in js/gtable.js and js/wordcloud.js and the labels used in coffee/coffee.vis

So there you go. I’ve spent way too much of my own time on this and haven’t really explained what is going on. Hopefully the various commenting in the source code removes some of the magic (I might revisit the R code as in some ways I think it deserves a post on its own). If you have any questions or feedback leave them in the comments ;)


a full-fledged repository with complete history and full revision tracking capabilities, not dependent on network access or a central server

That quote is taken from the Wikipedia entry for Git (software), the full quote is:

In software development, Git (/ɡɪt/) is a distributed revision control and source code management (SCM) system with an emphasis on speed.[4] Git was initially designed and developed by Linus Torvalds for Linux kernel development. Every Git working directory is a full-fledged repository with complete history and full revision tracking capabilities, not dependent on network access or a central server. … Git supports rapid branching and merging, and includes specific tools for visualizing and navigating a non-linear development history. A core assumption in Git is that a change will be merged more often than it is written, as it is passed around various reviewers.

The idea of using Git as a platform in open educational development (not just as a software development tool) is something that has pinged my radar a couple of times this year so I thought I’d quickly* share some interesting links and material in this area. The core concept when reading this is the idea that Git repositories are:

  • designed as a collaborative space; and
  • encourage remixing and branching of material

*I’m not entirely happy with how this post is written but don’t want to spend too much time on it – consider it as some very rough notes.

Open bid writing

As it happens the order in which I came across these links also fits in with an evolution of the idea from software to educational support tool. The first example is still more at the software end, in this case the use of the GitHub service by Joss Winn at the University of Lincoln as a place for open bid writing, but it helps highlight the potential benefits of Git.

Project proposal versioning

In ‘Open Bid writing’ Joss reflects on the use of GitHub to develop his proposal for the now funded JISC OER Rapid Innovation Bebop project. The main advantage highlighted in the post is that, as this was proposed as a software development project, the final code and proposal will all sit in one place. Now you might ask how this is different from just uploading your project plan to your project site. The difference here is that just as Git allows you to navigate different versions of the code, you can also see how the proposal evolved, see different versions of the proposal and how it was constructed and even how ideas evolved. Joss also points out that using GitHub during the writing process gave the opportunity for others to learn from or even contribute to the proposal.

The final aspect, not included in the post but mentioned by Joss in a tweet before submitting the proposal, is Git’s functionality for someone else to fork the project, that is take a snapshot of the proposal and develop it in a completely different direction. So at a later date you might see an opportunity to do something similar to Bebop and instead of starting from scratch use Lincoln’s proposal as the basis of your own work.

[In Joss’ post he also mentions that one of the student projects at DevXS was to create a GitHub hosted version of the collaborative writing tool Etherpad which stores documents in GitHub. You can read more about RevisionHub here and the code developed at DevXS is here].

Not code, but poetry

‘Code is poetry’ is the WordPress motto but as Phil Beauvoir (JISC CETIS) highlights in his post Forking Hell? Git, GitHub, and the Rise of Social Coding, people are already using Git repositories for purposes beyond coding. These include writers, musicians and artists all putting their material in Git for others to contribute to or fork to make something different. My favourite example from Phil’s post is:

Durham-based band, the Bristol 7’s, last year released their album, “The Narwhalingus EP” on GitHub under a Creative Commons licence “to see what the world could do with it”. The release, if we can call it that, comprises the final mixes and the individual tracks as MP3 files. The band invites everyone to:

“Fork the repo, sing some harmony, steal my guitar solo, or add a Trance beat. Whatever you want to do, just tell us about it, so we can hear what’s become of our baby!”

[Sticking very loosely with art I see via Ed Summers’ cc0 and git for data post that:]

the Cooper-Hewitt National Design Museum at the Smithsonian Institution made a pretty important announcement almost a month ago that they have released their collection metadata on GitHub using the CC0 Creative Commons license

Forking Your Syllabus

So far the examples I’ve highlighted have all used the GitHub service. Earlier in the week I had a chance to chat to Joss Winn at the JISC OER Rapid Innovation start-up meeting and started talking about Git. One of the things Joss mentioned was whilst Git presented a number of opportunities for academics to contribute, share and reuse material the terms and concepts of Git are foreign to the average academic. A post I had read but not fully processed is Brian Croxall’s Forking Your Syllabus. In this post Brian highlights that for new teachers it can be daunting to design a programme of learning and that “when you’re beginning to plan something new, you can always benefit from seeing what others before you have done”

Brian goes on to join the dots between syllabus creation and Git, the final picture coming together with Audrey Watters’ ClassConnect: "GitHub" for Class Lessons. My hunch is ClassConnect has a Git backend and while the icon set and functionality is ‘fork’ the language is ‘used’.

ClassConnect

As Audrey points out ClassConnect is a new product and I don’t think all of the required features are there yet, like selecting and searching by Creative Commons license, but the idea of using the Git model in educational development is one to watch.

But that’s what I think. What do you think? Are the soft issues of getting people to work in a more open way always going to overshadow any technical development to make it easier to do this? Or will tools like ClassConnect suck people into different working practices? Will staff ‘git’ it?

Update: There's been some more discussion on this idea on the OER-DISCUSS JISCMail list


Today in Capturing The Value Of Social Media Using Google Analytics Google announced some new features that will be appearing in Google Analytics. The post is mainly focused around the ‘social value’ of defining and monitoring goals for getting people who come to your site from social networks to do something on your site (click a button, view a certain page).

The bit that is really interesting (for me anyway) is the announcement on ‘activity streams’. These will include information on:

how people are engaging socially with your content off your site across the social web. For content that was shared publicly, you can see the URLs they shared, how and where they shared (via a “reshare” on Google+ for example), and what they said. Currently, activities are reported for Google+ and across a growing list of our Social Data Hub partners including recently signed brands Badoo, Disqus, Echo, Hatena and Meetup.

Example Activity Stream

There is obvious overlap here with some of my recent work extracting ‘activity data’ from social networks for sites and repositories, but before I pack my bags there are a number of things to consider.

Twitter and Facebook probably won’t come to the party
Google’s access to activity data is limited to those who want to join the Analytics Social Data Hub. While there are already some reasonably big names signed up given the Twitter/Facebook/Google+ social network war it’s unlikely that you are going to see individual tweet analytics as I achieved here in the near future.

Access to the data
It’ll be interesting to see if Google will make ‘activity stream’ data available for download or access via their API. There’s very little information on the Social Data Hub website about what 3rd party services are signing up to and if there is any compensation for making their data available. A number of the existing signups already have their own public APIs so they may be happy for this data to be made available. Only time will tell.

Not everyone uses Google Analytics
I’m also trying to take comfort in the fact that not everyone uses Google Analytics, so there is hopefully still value in surfacing and centralising activity data for non-Analytics users.

So interesting times, but does anyone actually care about this type of data yet?


Recently I’ve been interested in tracking activity around resources. This comes off the back of the OER Visualisation project where I started looking at social share data around educational resources, the beginnings of a PostRank style RSS social engagement tracker, and more recently Using Google Spreadsheets to combine Twitter and Google Analytics data to find your top content distributors (it’s been an eye-opener to see how much individual activity data there is … if you know where to look).

Working on the vague use case of ‘academic finds an interesting resource and bookmarks it for later’ my assumption is there might be more social bookmarking than shares via services like Twitter. To see what data is accessible I turned my attention to Diigo. For those that don’t know, Diigo started as an online bookmarking service but has kept adding sharing, notetaking and highlighting type features and continues to try and steal the Delicious crowd.

Diigo does have an official API but it is based around individual users rather than sites. Site level data is available and here is an example for my hawksey.info domain. The page returns the last 20 bookmarks made by users for my site. Clicking on a bookmark lets you see how many people have also publicly bookmarked the page, the date and how they tagged it (there’s probably more here to do on crowdsourced metadata … for another day).

Diigo bookmark details

Back to the top level data. Obviously you could visit this page each day to see who has been bookmarking your material or maybe even find a service that emails the webpage to you each day. I’m more interested in how this data might be centralised in one place so that you can combine it with other information.  It probably won’t be a surprise that I chose Google Spreadsheets to have a crack at this.

Below is embedded this Diigo Site Tracker Google Spreadsheet <- click on the link and File > Make a copy for your own version and enter your site url in cell B3

In the spreadsheet you can see the Diigo profile url for the person who has bookmarked a link, what was bookmarked and by scraping the details page how many times the link has already been saved.

How it was made

If you have been following my other work you might think this is powered by Google Apps Script, but you’d be wrong. The spreadsheet is entirely powered by the built-in importXML function. As you’ll see from the documentation the function can handle a range of markup languages including HTML. So we can point the function at a webpage but how do we get back the parts we want? This is usually the bit I trip over. To query the part of the page you want back you need to use XPath.

XPath lets you drill down into the part of the page you want. The key I’ve found to unlocking XPath is a browser extension (I’m currently using this one) which lets me see the XPath for the part of the page I’m looking at. I then use this information in the importXML function (it’s worth noting that Google Spreadsheets limits you to 50 imports per spreadsheet, so to scale this solution to get data from other services I’d probably have to switch to Apps Script or something else).

So that was Diigo, your homework is to do something similar with Delicious and I’ll give you my answer tomorrow ;) [I might even be able to show you how you can link this to Google Analytics data].   


Yes I'm going to be your public servant once more as a JISC CETIS Learning Technology Advisor starting on the 13th March. I have the mighty shoes of John Robertson to fill so no pressure ;-s.

When CETIS advertised the post David Kernohan tweeted:

Whilst still at the JISC RSC Scotland North & East figuring out where I wanted to go next I concluded that working for CETIS would be my dream job. As CETIS don't have a high churn of staff I thought the chances of working there were low so when the opportunity came along I put in my application quicker than you could say 'a bunch of munchy crunchy carrots' < that quote is just for John.

My role will be part CETIS core work, part OER Programme Support Officer. Having just completed the OER Visualisation Project I'm sure there will be opportunities to continue parts of this as part of the programme support, but if nothing else I'm already tapped into the heart and pulse of #ukoer.

So remember if you work in the UK Higher and Post-16 Education sectors and need some advice or support in educational technology and standards (particularly OER), then I'm your man ... well from next Tuesday.


It’s the last day of the OER Visualisation Project and this is my penultimate ‘official’ post. Having spent 40 days unlocking some of the data around the OER Programme there are more things I’d like to do with the data, some loose ends in terms of how-to’s I still want to document and some ideas I want to revisit. In the meantime here are some of the outputs from my last task, looking at the #ukoer hashtag community. This follows on from day 37 when I looked at ‘the heart of #ukoer’, this time looking at some of the data pumping through the veins of UKOER. It’s worth noting that the information I’m going to present is a snapshot of OER activity, only looking at a partial archive of information tweeted using the #ukoer hashtag from April 2009 to the beginning of January 2012, but hopefully it gives you a sense of what is going on.

The heart revisited

I revisited the heart after I read Tony Hirst’s What is the Potential Audience Size for a Hashtag Community?. In the original heart, nodes were sized using ‘betweenness centrality’, which is a social network metric to identify nodes which are community bridges, nodes which provide a pathway to other parts of the community. When calculating betweenness centrality on a friendship network it takes no account of how much that person may have contributed. So for example someone like John Robertson (@KavuBob) was originally ranked as having the 20th highest betweenness centrality in the #ukoer hashtag community, while JISC Digital Media (@jiscdigital) is ranked 3rd. But if you look at how many tweets John has contributed (n.438) compared to JISC Digital Media (n.2), isn’t John’s potential ‘bridging’ ability higher?

Weighted Betweenness Centrality

There may be some research in this area, and I have to admit I haven’t had the chance to look, but for now I decided to weight betweenness centrality based on the volume of the archive the user has contributed. So John goes from ranked 20th to 3rd and JISC Digital Media goes from 3rd to 55th. Here’s a graph of the winners and losers (click on the image to enlarge).
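To make the weighting idea concrete, here is a minimal Python sketch. It assumes the weight is simply each account's betweenness score multiplied by its share of the archive, which is only one plausible reading of "based on the volume of the archive" and not necessarily the calculation used for the heart. The betweenness scores and the third account are invented placeholders; only the tweet counts for @KavuBob (438) and @jiscdigital (2) and the archive size of roughly 8,300 come from this post.

```python
def weighted_betweenness(betweenness, tweet_counts, archive_size):
    """Scale each betweenness score by the account's share of the archive."""
    return {
        user: score * tweet_counts.get(user, 0) / archive_size
        for user, score in betweenness.items()
    }


def rank(scores):
    """Return accounts ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Placeholder betweenness scores (hypothetical, for illustration only).
betweenness = {"jiscdigital": 0.09, "KavuBob": 0.03, "example_user": 0.05}
# Tweet counts: the first two are quoted in the post, the third is invented.
tweet_counts = {"jiscdigital": 2, "KavuBob": 438, "example_user": 40}
ARCHIVE_SIZE = 8300  # approximate number of tweets in the archive

weighted = weighted_betweenness(betweenness, tweet_counts, ARCHIVE_SIZE)
before = rank(betweenness)
after = rank(weighted)
print("unweighted:", before)
print("weighted:  ", after)
```

With these toy numbers the low-volume account drops from first to last once contribution is taken into account, which is the same kind of reshuffle described above.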

Here is the revised heart on zoom.it (and if zoom.it doesn’t work for you, the heart as a .jpg).

The 'heart' of #ukoer (click to enlarge)

[In the bottom left you’ll notice I’ve included a list of top community contributors (based on weighted betweenness), a small reward for those people (I was all out of #ukoer t-shirts).]

These slides also show the difference in weighted betweenness centrality (embedded below). You should ignore the change in colour palette; the node text size depicts the betweenness centrality weight [Google presentation has come on a lot recently, worth a look if you are sick of the clutter of slideshare]:


The ‘pulse’ of #ukoer

In previous work I’ve explored visualising Twitter conversations using my TAGSExplorer. Because of the way I reconstructed the #ukoer Twitter archive (a story for another day) it’s compatible with this tool, so you can see and explore the #ukoer archive of the 8,300 tweets I’ve saved here. One of the problems I’m finding with this tool is that it takes a while to get the data from the Google Spreadsheet for big archives.

TAGSExplorer - ballofstuff

This problem was also encountered in Sam’s Visualising Twitter Networks: John Terry Captaincy Controversy. As TAGSExplorer internally generates a graph of the conversation, rather than scratching my head on some R script it was easy to expose this data so that it can be imported into Gephi. So now if you add &output=true to a TAGSExplorer URL you get a comma-separated edge list to use with your SNA package of choice (the window may be blocked as a pop-up, so you may need to enable pop-ups). Here is the link for the #ukoer archive with edges for replies, mentions and retweets (which generates ‘a ball of awesome stuff’ (see inset above) but will eat your browser performance).
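If you would rather tidy the edge list before loading it into an SNA package, a rough Python sketch follows. The column layout is an assumption: it treats each row as source, target, edge type with no header, which may not match what TAGSExplorer actually emits, and the sample rows are invented. The Source/Target/Weight header is what I understand Gephi's spreadsheet import to look for, so check that against your own version.

```python
import csv
import io
from collections import Counter

# Hypothetical sample of an exported edge list; the real layout may differ.
SAMPLE = """alice,bob,reply
alice,bob,reply
carol,alice,retweet
bob,carol,mention
"""


def aggregate_edges(text):
    """Count how often each (source, target, kind) edge occurs."""
    counts = Counter()
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        source, target, kind = (cell.strip() for cell in row[:3])
        counts[(source, target, kind)] += 1
    return counts


def to_weighted_csv(counts):
    """Write one row per distinct edge, with its count as the weight."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Label", "Weight"])
    for (source, target, kind), weight in sorted(counts.items()):
        writer.writerow([source, target, kind, weight])
    return out.getvalue()


edges = aggregate_edges(SAMPLE)
weighted_csv = to_weighted_csv(edges)
print(weighted_csv)
```

Collapsing repeated edges into a single weighted row keeps the file small, and the weight can then drive line width in the graph.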

ukoer conversation (click to enlarge)

Processing the data in Gephi you get a similar ball of awesome stuff (ukoer conversation on zoom.it | ukoer conversation .jpg). What does it all mean, I hear you ask? These flat images don’t tell you a huge amount. Being able to explore what was said is very powerful (hence coming up with TAGSExplorer). You can however see a lot of mentions (coloured blue, with line width indicating volume) in the centre between a small number of people. It’s also interesting to contrast OLNet top right and 3d_space mid left. OLNet has a number of green lines radiating out indicating @replies, suggesting they are in conversations with individuals using the #ukoer tag. This compares to 3d_space, which has red lines indicating retweets, suggesting they are more engaged in broadcast.

Is there still a pulse?

UKOER Community Stats

When looking at the ‘ball of awesome stuff’ it’s important to remember that this is a depiction of over 8,000 tweets from April 2009 to January 2012. How do we know if this tag is alive and kicking or not just burned out like a dwarf star?

The good news is there is still a pulse within #ukoer, or more accurately lots of individual pulses. The screenshot to the right is an extract from this Google Spreadsheet of #UKOER. As well as including 8,300 tweets from #ukoer it also lists the twitter accounts that have used this tag. On this sheet are sparklines indicating the number of tweets in the archive they’ve made and when. At the top of the list you can see some strong pulses from UKOER, xpert_project and KavuBob. You can also see others just beginning or ending their ukoer journey.

More good news: the #ukoer hashtag community is going strong, with December 2011 having the most tweets in one month, and the number of unique Twitter accounts using the tag has probably by now tipped over the 1,000 mark.
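The two measures mentioned here, tweets per month and the running total of unique accounts, can be derived from any archive that records who tweeted and when. A minimal Python sketch, using invented rows rather than the real spreadsheet and assuming dates are stored as ISO strings:

```python
from collections import Counter

# Hypothetical (account, ISO date) pairs standing in for rows of the archive.
tweets = [
    ("UKOER", "2009-04-02"),
    ("KavuBob", "2009-04-15"),
    ("UKOER", "2009-05-01"),
    ("xpert_project", "2011-12-03"),
    ("KavuBob", "2011-12-09"),
    ("UKOER", "2011-12-20"),
]


def monthly_stats(rows):
    """Return tweets per month and cumulative unique accounts per month."""
    per_month = Counter(date[:7] for _, date in rows)
    seen = set()
    growth = {}
    for account, date in sorted(rows, key=lambda row: row[1]):
        seen.add(account.lower())      # treat account names case-insensitively
        growth[date[:7]] = len(seen)   # last value in a month is that month's total
    return per_month, growth


per_month, growth = monthly_stats(tweets)
busiest = max(per_month, key=per_month.get)
print("busiest month:", busiest, per_month[busiest])
print("unique accounts so far:", growth)
```

The per-month counts are what a sparkline would plot for the whole tag; running the same count per account gives the individual pulses.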

#ukoer community growth

There is more for you to explore in this spreadsheet but alas I have a final post to write so you’ll have to be your own guide. Leave a comment if you find anything interesting or have any questions.

[If you would like to explore both the ‘heart’ and ‘pulse’ graphs more closely I’ve uploaded them to my installation of Raphaël Velt's Gexf-JS Viewer (it can take 60 seconds to render the data). This also means the .gexf files are available for download:]


UKOER Hashtag Community

Last week I started to play with the #ukoer hashtag archive (which has generated lots of useful coding snippets to process the data that I still need to blog … doh!). In the meantime I thought I’d share an early output. Embedded below is a zoom.it of the #ukoer hashtag community. The sketch (HT @psychemedia) is from a partial list* of Twitter users (n. 865) who have used the #ukoer hashtag in the last couple of years and who they currently follow. The image represents over 24,000 friendships, the average person having almost 30 connections to other people in the community.


3D Heart SSD
Originally uploaded by Generation X-Ray
Publishing an early draft of this image generated a couple of ‘it looks like’ comments (HT @LornaMCampbell @glittrgirl). To me it looks like a heart, hence the title of this post. The other thing that usually raises questions is how the colour groupings are defined (HT @ambrouk). The answer in this case is that they are generated by a modularity algorithm which tries to automatically detect community structure.

As an experiment I’ve filtered the Twitter profile information used for each of these groupings and generated a wordcloud using R. (The R script used is a slight modification of one I’ve submitted to the Twitter Backchannel Analysis repository Tony started, something else I need to blog about. The modification is to SELECT a column WHERE modclass=something.)
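The R script itself isn't reproduced here, so as a stand-in this is a small Python sketch of the same filter-then-count idea: keep only profiles in one modularity class, then tally the words in their descriptions, which is the input a wordcloud needs. The modclass field name comes from the post; the description field, the sample profiles and the stop-word list are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical profile rows; only the "modclass" column name is from the post.
profiles = [
    {"modclass": 0, "description": "Learning technologist interested in open education"},
    {"modclass": 0, "description": "Open education and OER advocate"},
    {"modclass": 1, "description": "Librarian, repositories and metadata"},
]
STOPWORDS = {"and", "in", "the", "of", "a"}


def word_counts(rows, modclass):
    """Count words in the descriptions of one modularity class."""
    counts = Counter()
    for row in rows:
        if row["modclass"] != modclass:
            continue  # the equivalent of WHERE modclass = <value>
        words = re.findall(r"[a-z']+", row["description"].lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts


group0 = word_counts(profiles, 0)
print(group0.most_common(5))
```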

Right all this post has done is remind me of my post backlog and I’ve got more #ukoer visualisation to do so better get on with it.

*It’s a partial list because as far as I know there isn’t a complete archive of #ukoer tweets. The data I’m working from is an export from TwapperKeeper for March 2010-Jan 2012 topped up with some data from Topsy for April 2009-March 2010.

In May 2009 JISC CETIS announced the winners of the OER Technical Mini-Projects. These projects were designed:

to explore specific technical issues that have been identified by the community during CETIS events such as #cetisrow and #cetiswmd and which have arisen from the JISC / HEA OER Programmes

JISC CETIS OER Technical Mini Projects Call
Author: Phil Barker, JISC CETIS

One of the successfully funded projects was CaPRéT - Cut and PAste reuse and Tracking from Brandon Muramatsu, MIT OEIT and Justin Ball and Joel Duffin, Tatemae. I’ve already touched upon OER tracking in day 24 and day 30, briefly looking at social shares of OER repository records. Whilst projects like the Learning Registry have the potential to help, it is still early days and tracking still seems to be an afterthought, which has been picked up in various technical briefings. CaPReT tries to address part of this problem, as stated in the introduction to their final report:

Teachers and students cut and paste text from OER sites all the time—usually that's where the story ends. The OER site doesn't know what text was cut, nor how it might be used. Enter CaPRéT: Cut and Paste Reuse Tracking. OER sites that are CaPRéT-enabled can now better understand how their content is being used.

When a user cuts and pastes text from a CaPRéT-enabled site:

  • The user gets the text as originally cut, and if their application supports the pasted text will also automatically include attribution and licensing information.
  • The OER site can also track what text was cut, allowing them to better understand how users are using their site.

The code and other resources can be found on their site. You can also read Phil Barker’s (JISC CETIS) experience testing CaPReT and feedback and comments about the project on the OER-DISCUSS list.

One of the great things about CaPReT is the activity data is available for anyone to download (or as summaries Who's using CaPRéT right now? | CaPRéT use in the last hour, day and week | CaPRéT use by day).

One of the challenges set to me by Phil Barker was to see what I could do with the CaPReT data. Here’s what I’ve come up with. First a map of CaPReT (great circles) usage plotting source website and where in the world some text was copied from (click on image for full scale):

capret - source target map

And an interactive timeline which renders where people copied text, with pop-ups giving a summary of what they copied:

capret timemap

Both these examples rely on the same refined data source rendered in different ways and in this post I’ll tell you how it was done. As always it would be useful to get your feedback as to whether these visualisations are useful, things you’d improve or other ways you might use the recipes.

How was it made – getting geo data

  1. Copied the CaPReT tabular data into Excel (the .csv download didn’t work well for me, columns got mixed up on unescaped commas) and saved it as .xls
  2. Imported the .xls into Google Refine. The main operation was to convert text source and copier IP/domains to geo data using www.ipinfodb.com. An importable set of routines can be downloaded from this gist. [A couple of things to say about this: CaPReT have a realtime map, but I couldn’t see any locations come through. If they are converting IP to geo it would be useful if this was recorded in the tabular results. IP/domain location lookups can also be a bit misleading, for example, my site is hosted in Canada, I’m typing in France and soon I’ll be back in Scotland.]
  3. The results were then exported to Excel, duplicates were removed based on ‘text copied on’ dates, and the data was uploaded to a Google Spreadsheet
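The de-duplication in the last step was done in Excel, but the same rule is easy to express in code. A minimal Python sketch, assuming each row is a dictionary keyed by column name; ‘text copied on’ is the column named above, while the other column names and all of the values are made up:

```python
def dedupe(rows, key="text copied on"):
    """Keep the first row seen for each distinct value of the key column."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] in seen:
            continue  # a later row with the same timestamp is treated as a duplicate
        seen.add(row[key])
        unique.append(row)
    return unique


# Hypothetical rows; the IP addresses are from the documentation-only ranges.
rows = [
    {"text copied on": "2012-01-20 10:15:02", "source": "example.org", "copier": "192.0.2.10"},
    {"text copied on": "2012-01-20 10:15:02", "source": "example.org", "copier": "192.0.2.10"},
    {"text copied on": "2012-01-21 09:00:41", "source": "example.org", "copier": "198.51.100.7"},
]

unique_rows = dedupe(rows)
print(len(rows), "rows in,", len(unique_rows), "rows out")
```

Keying on the timestamp alone is coarse: two genuine copies logged in the same second would collapse into one, so a stricter key (timestamp plus copier) may be safer on busier data.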

Making CaPReT ‘great circles’ map

This was rendered in RStudio using Nathan Yau’s (Flowing Data) How to map connections with great circles. The R code I used is here. The main differences are reading the data from a Google Spreadsheet and handling the data slightly differently. (Who would have thought as.data.frame(table(dataset$domain)) would turn the source spreadsheet into a table of counts per domain? #stilldiscoveringR)

I should also say that there was some post production work done on the map. For some reason some of the ‘great circles’ weren’t so great and wrapped around the map a couple of times. Fortunately these anomalies can easily be removed using Inkscape (and while I was there I added a drop shadow).
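The cause of the wrap-around isn't established above. One common culprit with flat maps is a path that crosses the 180° meridian, where the longitude flips from about +180 to about -180 and a naive line is drawn right across the map. The Python sketch below is offered on that assumption only: it computes points along a great circle with the standard spherical interpolation formula and starts a new segment wherever the longitude jumps by more than 180 degrees. The coordinates are rough example locations, not CaPReT data, and antipodal endpoints are not handled.

```python
import math


def great_circle_points(lat1, lon1, lat2, lon2, n=50):
    """Return n + 1 (lat, lon) points in degrees along the great circle."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    # Angular distance between the endpoints (haversine formula).
    h = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2) ** 2)
    d = 2 * math.asin(min(1.0, math.sqrt(h)))
    if d == 0:
        return [(lat1, lon1)]  # identical endpoints, nothing to interpolate
    points = []
    for i in range(n + 1):
        f = i / n
        a = math.sin((1 - f) * d) / math.sin(d)
        b = math.sin(f * d) / math.sin(d)
        x = a * math.cos(p1) * math.cos(l1) + b * math.cos(p2) * math.cos(l2)
        y = a * math.cos(p1) * math.sin(l1) + b * math.cos(p2) * math.sin(l2)
        z = a * math.sin(p1) + b * math.sin(p2)
        points.append((math.degrees(math.atan2(z, math.hypot(x, y))),
                       math.degrees(math.atan2(y, x))))
    return points


def split_at_antimeridian(points):
    """Start a new segment whenever longitude jumps by more than 180 degrees."""
    segments = [[points[0]]]
    for prev, cur in zip(points, points[1:]):
        if abs(cur[1] - prev[1]) > 180:
            segments.append([])
        segments[-1].append(cur)
    return segments


# Example endpoints: roughly Tokyo to San Francisco, and London to Paris.
pacific = great_circle_points(35.7, 139.7, 37.8, -122.4)
europe = great_circle_points(51.5, -0.1, 48.9, 2.35)
print(len(split_at_antimeridian(pacific)), "segments across the Pacific")
print(len(split_at_antimeridian(europe)), "segment within Europe")
```

Drawing each segment as its own line avoids the streak across the map without any manual clean-up afterwards.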


capret before post production

Making CaPReT ‘timemap’

Whilst having a look around SIMILE based timelines for day 30 I came across the timemap.js project which:

is a Javascript library to help use online maps, including Google, OpenLayers, and Bing, with a SIMILE timeline. The library allows you to load one or more datasets in JSON, KML, or GeoRSS onto both a map and a timeline simultaneously. By default, only items in the visible range of the timeline are displayed on the map.

Using the Basic Example, Google v3 (because it allows custom styling of Google Maps) and Google Spreadsheet Example I was able to format the refined data already uploaded to a Google Spreadsheet and then plug it in as a data source for the visualisation (I did have a problem with reading data from a sheet other than the first one, which I’ve logged as an issue including a possible fix).
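For anyone preparing their own data, the sketch below shows the kind of reshaping involved, written in Python rather than the JavaScript the library itself uses. The item shape (start, point with lat and lon, title, and an options object) is modelled on my recollection of the timemap.js basic example, so treat the field names as assumptions and check them against the library's documentation. The input column names and sample rows are hypothetical.

```python
import json

# Hypothetical refined rows; the column names are invented for this sketch.
rows = [
    {"copied_on": "2012-01-20T10:15:02Z", "lat": 55.95, "lon": -3.19,
     "source": "example.org", "text": "Open educational resources are teaching materials"},
    {"copied_on": "2012-01-21T09:00:41Z", "lat": None, "lon": None,
     "source": "example.org", "text": "A row whose IP lookup returned no location"},
]


def to_timemap_items(rows, max_chars=40):
    """Turn refined rows into timeline/map items, skipping rows without a location."""
    items = []
    for row in rows:
        if row.get("lat") is None or row.get("lon") is None:
            continue  # nothing to plot on the map
        text = row["text"]
        summary = text if len(text) <= max_chars else text[:max_chars].rstrip() + "..."
        items.append({
            "start": row["copied_on"],
            "point": {"lat": row["lat"], "lon": row["lon"]},
            "title": row["source"],
            "options": {"description": summary},
        })
    return items


items = to_timemap_items(rows)
print(json.dumps(items, indent=2))
```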

A couple of extras I wanted to add to this example are showing the source and allowing the user to filter on it. There’s also an issue of scalability. Right now the map is rendering 113 entries; if CaPReT were to take off, the spreadsheet would suddenly fill up and the visualisation would probably grind to a halt.

[I might be revisiting timemap.js as they have another example for a temporal heatmap, which might be good for showing Jorum UKOER deposits by institution.]

So there you go, two recipes for converting IP data into something else. I can already see myself using both methods in other aspects of the OER Visualisation and other projects. And all of this was possible because CaPReT had some open data.

I should also say JISC CETIS have a wiki on Tracking OERs: Technical Approaches to Usage Monitoring for UKOER and Tony also recently posted Licensing and Tracking Online Content – News and OERs.