Here are some posts which have caught my attention this month:

Automatically generated from my Diigo Starred Items.


For week 3 of cfhe12 analysis I thought I’d turn back to the Twitter data. I’m currently trying to prepare a Shuttleworth Fellowship application which has got me thinking more about the general premise of cMOOCs that “knowledge is distributed across a network of connections, and therefore that learning consists of the ability to construct and traverse those networks”  (from week 1 of cck11).

The aspect which features in my Shuttleworth application is providing mechanisms that aggregate data from distributed sub-networks which can then be processed to produce actionable insights for tutors or participants. The process I plan to adopt is to look at the data using heavyweight tools, like NodeXL, or just applying a bit of curiosity (this person has stopped tweeting, why? etc.), and then converting some of these patterns into very lightweight applications or views to remove the complexity and highlight key parts of the data.

Some examples for you:

Summary of CFHE12 participant activity


Tweets from CFHE12 are being collected in this Google Spreadsheet. As part of this template there are a number of summary views, one of these being a breakdown of individual participant activity, in which sparklines are used to display someone's Twitter activity. Looking at gsiemens you can see there is steady activity, with 45 tweets tagged #cfhe12. Towards the bottom of the graph is ViplavBaxi, who after initial high activity is no longer contributing to the hashtag. So what has happened to ViplavBaxi? There are a number of possible answers, but let me highlight a couple which also illustrate the limitations of the technique:

  • they have lost interest in the course or time commitments prevent them from contributing (high drop-out rates aren’t unexpected in MOOCs)
  • no longer using the #cfhe12 hashtag – the archive is only of #cfhe12, so if they have joined a sub-community communicating without the hashtag it’s not recorded
  • found a different communication channel – this technique is only looking at Twitter activity; the person may have moved to another channel like the discussion forum

Another interesting activity summary is for dieGoerelebt. They are one of the top 5 contributors in terms of number of tweets, but recently their activity has trailed off. You can also see the ‘@s’ column, which is the number of times they’ve been mentioned in tweets, is one of the lowest. Is the decline in activity a result of the lack of engagement?

The next question that springs to my mind is what did these people say? Within the spreadsheet it’s easy to filter what they said. To let you see too I’ve got this simple web interface primed with filtered tweets (I modified an existing tool I’ve developed to do this – unfortunately I’ve never documented it, but as I use it more and more I must get around to it):

Summary of CFHE12 participant activity with RT percentage

From visual inspection dieGoerelebt had a high proportion of retweets. This is confirmed when I added a percentage of tweets that are retweets.

Something I noted in the filtered view for a person's tweets was that a lot of the context is lost (I can see they are @replying to someone, but I don’t know what they said).

To help with this I started looking at modifying the twitter questions filter I built to enable a view of the conversation.

This is a start, but as I noted when I published the question filter, clicking through messages like the one shown below reveals there is more of the conversation that is missing.

 Part of the conversation

Bigger picture


So again I start exploring some ideas that branch off into many more avenues to follow. One thought is that the micro analysis of tweets might not be beneficial or practical, and given the issues with extracting a full conversation from Twitter a macro view might be better. Providing a summary of overall activity and the mode in which Twitter is being used by people may be of the most use to tutors and participants in identifying people they might want to connect with. As always your thoughts are greatly appreciated.

In this post I’ve taken an ego-centric approach to contributions. In the next couple of days I’ll share an ego-centric approach to community connections.


On Monday I’ll be helping Tony (Hirst) deliver a workshop at Internet Librarian International … W2 – Appropriating IT: innovative uses of emerging technologies. Tony has already posted a draft outline of the day. For the event I wanted to prepare some notes on using Google Spreadsheets/Sheets (I'm very confused about what they are called now. There was a very quiet announcement here of a change).

I’ve come up with some practical exercises for participants to try. There will be a lot of contextualisation going on as part of the workshop; the main idea is to seed some ideas of what’s possible. Here’s the Google Docs version of the guide. The focus is on what you can do with the importHTML, importFeed, importXML, importRange and importData spreadsheet formulas, with a few others, including QUERY, thrown into the mix. Using QUERY isn’t straightforward but it opens up lots of possibilities in how you reshape data (extra examples of these are included in the Google Docs version).

Note: All text and images, with the exception of text in grey boxes, is licensed under a Creative Commons Attribution 3.0 Unported License. CC-BY Hirst & Hawksey (if this isn’t clear check the source or get in touch)


Syntax: ImportHtml(URL, query, index)

URL is the URL of the HTML page.

Query is either “list” or “table” and indicates what type of structure to pull in from the webpage. If it’s “list,” the function looks for the contents of <UL>, <OL>, or <DL> tags; if it’s “table,” it just looks for <TABLE> tags.

Index is the 1-based index of the table or the list on the source web page. The indices are maintained separately so there might be both a list #1 and a table #1.

Example: =ImportHtml(""; "table"; 4). This function returns demographic information for the population of India.

Note: The limit on the number of ImportHtml functions per spreadsheet is 50.
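As a concrete illustration (the URL here is a placeholder of my own – substitute the page you’re interested in), pulling the first table from a web page would look something like:

=ImportHtml("", "table", 1)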

Exercise 1: Importing a html table and graphing the result

  1. Create a new Google Spreadsheet
  2. In cell A1 enter the formula to import a table from Wikipedia: =importHTML("","table",3)
  3. Select Insert > Chart and then select the data ranges for country name and total medals
    Selecting chart data
  4. While still in the Chart Editor select the Charts tab, then Maps > Geo charts - regions
  5. Still in the Chart Editor select Customise and change the No value colour to blank.
  6. Finally click Insert

You should now have a chart that looks like this:

World map with missing data

Notice that the chart has No values (black) for most of the countries. To fix this we need to remove the country codes in brackets. One way to do this is to trim the text from the left until the first bracket “(”. This can be done using a combination of LEFT, FIND and ARRAYFORMULA (ARRAYFORMULA allows you to apply formulas to multiple cells).

  1. In cell H2 enter the formula =ARRAYFORMULA(LEFT(B2:B86,FIND("(",B2:B86)-2)) – this should create a column of country names without brackets
  2. Click on your Chart and select Advanced edit.
  3. Make sure you are on the Start tab in the Chart Editor and edit the data range from Sheet1!B2:B86 to Sheet1!H2:H86, then click Update

Your chart should now look like this (my solution):

World map without missing data


Syntax: =ImportFeed(URL; [feedQuery | itemQuery]; [headers]; [numItems]). The arguments to the function are as follows:

URL is the url of the RSS or ATOM feed.

feedQuery/itemQuery is one of the following query strings: "feed", "feed title", "feed author", "feed description", "feed url", "items", "items author", "items title", "items summary", "items url", or "items created". The feed queries return properties of the feed as a whole: the feed's title, the feed's author, etc. Note: To get the data included in the feed, you need to do an "items" request.

  1. the "feed" query returns a single row with all of the feed information.
  2. the "feed <type>" query returns a single cell with the requested feed information.
  3. the "items" query returns a full table, with all of the item information about each item in the feed.
  4. the "items <type>" query returns a single column with the requested information about each item.
  5. if a query is given that begins with "feed", the numItems parameter isn't necessary and is replaced by the option headers param.
  6. if a query is given that begins with "items", the numItems parameter is expected as the 3rd param, and headers as the 4th.
  7. headers - "true" if column headers are desired. This will add an extra row to the top of the output labeling each column of the output.

Example: =ImportFeed("")

Note: The limit on the number of ImportFeed functions per spreadsheet is 50.
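For instance, to pull the 10 most recent item titles from a feed with a header row, a formula along these lines could be used (the feed url here is just a placeholder):

=ImportFeed("", "items title", TRUE, 10)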

Exercise 2: Importing an RSS feed and getting social share counts

  1. Open
  2. In cell B5 enter the RSS feed url for a blog (or you can use ) and hit enter

You should end up with something like:

RSS social counts

An aside: Spreadsheet Addiction by Patrick Burns ( ) highlights the dangers of using spreadsheets for analytics. Particular issues include the ambiguity of a cell being a value or a formula. For example, if I sort cells on the value in the Twitter count column all the data is lost, because cells are sorted as values but actually contain formulas which get broken.

How it works

In cell A11 is the formula =IF(ISBLANK(B5),,IMPORTFEED(B5,"items",FALSE)). If the feed url is not blank this fetches the RSS feed defined in B5. Results are returned in cells A11:E30. You may have noticed that column E is hidden; this is because it contains the feed item description.

The social share counts are returned by a custom function written in Google Apps Script ( ). Google Apps Script is similar to Excel Macros, but written using a JavaScript syntax. If you open Tools > Script editor in your spreadsheet you can see some of the custom script powering the spreadsheet. This includes the getSharedCount function used in cells F11:F30, which passes the post url to the API and returns social share counts. The code used is:

function getSharedCount(sourceLink){
  //var url = ""
  var url = extractLink(sourceLink);
  var cache = CacheService.getPublicCache(); // using Cache service to prevent too many urlfetch
  var cached = cache.get("C"+url);
  if (cached != null) { // if value in cache return it
    //var test = cached.split(",")
    return cached.split(",");
  }
  try {
    var options = { // plain GET request expecting JSON back
      "method" : "get",
      "contentType" : "application/json"
    };
    // note: the API endpoint url is omitted in the source post
    var response = UrlFetchApp.fetch(""+encodeURI(url), options);
    var data = Utilities.jsonParse(response.getContentText());
    var output = [];
    for (var i in data){
      if (i == "Facebook"){
        // Facebook returns an object of counts (this branch body reconstructed)
        output.push(data[i]["total_count"]);
      } else {
        output.push(data[i]);
      }
    }
    cache.put("C"+url, output, 86400); // cache set for 1 day
    return output;
  } catch(e){
    // on error return nothing so the cell stays blank
    return "";
  }
}
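In the sheet the custom function is then entered like any built-in formula, for example =getSharedCount(D11), where the referenced cell contains the post url (the exact column depends on how the imported feed items are laid out).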

For more examples of Google Apps Script see


Syntax: =ImportXML(URL, query)

URL is the URL of the XML or HTML file.

Query is the XPath query to run on the data given at the URL. Each result from the XPath query is placed in its own row of the spreadsheet. For more information about XPath, please visit

Example: =importXml(""; "//a/@href"). This returns all of the href attributes (the link URLs) in all the <a> tags on the homepage.

Note: The limit on the number of ImportXML functions per spreadsheet is 50.
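As another illustration (again with a placeholder URL), pulling the text of every second-level heading from a page would look something like:

=ImportXML("", "//h2")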

Exercise 3: Turn a page of RSS Feeds into an OPML file

  1. Create a new spreadsheet
  2. In cell A1 enter the text ‘Title’ and in cell B1 ‘Url’
  3. Now in cell A2 enter the formula =ImportXML("","//b/a")
  4. In cell B2 enter =ImportXML("","//a[.='XML']/@href")
  5. File > Publish to the web and click ‘Start publishing’, copy the link in the bottom box then ‘Close’
  6. Visit and paste your spreadsheet link in the box and copy the generated link into your browser address bar
How it works

Using XPath we can identify parts of an XML (including HTML) page we want to extract. The screenshot below shows how parts of the page are identified. [I always struggle with XPath so use browser extensions to help (Scraper and XPath Helper).] The results are pulled into the spreadsheet as live data, so if the source page is updated the data in the spreadsheet will also be updated.

XPath parts


Syntax: =ImportRange(spreadsheet-key, range)

Spreadsheet-key is a STRING which is the key value from the spreadsheet URL.

Range is a STRING representing the range of cells you want to import, optionally including the sheet name (defaults to first sheet). You can also use a range name if you prefer. Given that the two arguments are STRINGs, you need to enclose them in quotes or refer to cells which have string values in them.

Example: =importrange("abcd123abcd123", "sheet1!A1:C10")

"abcd123abcd123" is the value in the "key=" attribute on the URL of the target spreadsheet and "sheet1!A1:C10" is the range which is desired to be imported.

Note: In order to use ImportRange, you need to have been added as a viewer or collaborator to the spreadsheet from which ImportRange is pulling the data. Otherwise, you'll get this error: "#REF! error: The requested spreadsheet key, sheet title, or cell range was not found."

[Here’s a copy of the completed spreadsheet used in exercises 4, 5 and 6]

Exercise 4: Importing data from Guardian Datastore

  1. Visit and click Get the Data, then DATA: download the full spreadsheet
  2. Keep this window open and create a new spreadsheet.
  3. In cell A1 enter =ImportRange("0AonYZs4MzlZbdFdrRGthS3dLVzlBdWVrV2lIbzZKY0E","Times Higher Education rankings 2012!A1:D104") – notice how the key and range are entered

Let's now reshape the data so we can generate some graphs. First let's calculate the change in rank between 2011 and 2012:

  1. In cell E2 enter the formula =B2-A2
  2. Fill this formula for the rest of the rows (there are a couple of ways of doing this, including copying cell E2, highlighting the other empty cells in column E and pasting, or, while E2 is active, grabbing and dragging the bottom right corner of the cell)
  3. Now we want to get a list of the countries included in column D. To do this in cell G2 enter the formula =UNIQUE(D2:D102)
  4. Now we want to sum the rank difference per country by entering the following formula in cell H2:=SUMIF(D$2:D$102,G2,E$2:E$102)
  5. Copy this value down for the remaining rows
  6. Select the data range G2:H16 and Insert > Chart > Bar chart

University world ranking delta

Graph of change in top 100 world university ranking.

Is the process for producing this chart valid? Is it displaying a meaningful representation of the data? Is this chart a lie?

Important: Notice that the calculation for France has an error:

Broken value

This is because on row 93 the source data doesn’t have a number value. Because we’ve used ImportRange to get the data we can’t edit it, as our changes get overwritten by the importRange formula in cell A1. In our scenario we can remove the calculated value in E93 or use a formula to handle the error. Another way around this is to flatten the imported data by copying all of it and pasting as values (other solutions exist which we cover later).
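As a sketch of the formula route, wrapping the calculation so that a non-numeric source value returns a blank rather than an error would look something like this in E2 (filled down as before):

=IFERROR(B2-A2,"")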


Syntax: =ImportData(URL)

URL is the URL of the CSV or TSV file. This imports a comma- or tab-separated file.

Note: The limit on the number of ImportData functions per spreadsheet is 50.
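As a quick illustration (placeholder URL again), importing a published CSV file is simply:

=ImportData("")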

Exercise 5: Importing CSV data from Google Maps

  1. In the spreadsheet you created for exercise 4 Insert > New Sheet
  2. In cell A1 of the new sheet enter the formula =FILTER(Sheet1!A:E,Sheet1!D:D="United Kingdom") to filter the data on sheet1 where column D is the United Kingdom
  3. Now in cell F1 enter the formula =ImportData(""&C2) and press return
  4. Copy this cell down for the remaining rows

You should now have a table that looks a little like this:

ImportData from Google Maps

Import... and QUERY

Syntax: =QUERY(data, query, headers)

Data - An array of data. For example: A1:B6, data!B2:H6, ImportRange(spreadsheet_key, [sheet!]range),FILTER(sourceArray, arrayCondition_1, …)

Query - A query string for applying data operations. The query operates on column IDs directly from the input range and uses a subset of the SQL language, for example "select E", "select A, B", "sum(B), C group by C", "select D where D < 'Nick'". In certain instances, for example when using FILTER as a data source, column identifiers are Col1, Col2 etc. For more information on creating queries see the Google Visualization API Query Language.

Headers (optional) - A number specifying the number of header rows in the input range. If omitted, or set to -1, the number of header rows is guessed from the input range. This parameter allows an input range with multiple header rows to be transformed into the single header row that QUERY supports.
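Putting the three arguments together, a minimal example (with a hypothetical range and columns) that sums a value column by category would be:

=QUERY(A1:B100, "select A, sum(B) group by A", 1)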

Exercise 6: Import and QUERY

  1. In the spreadsheet you created for exercise 4 Insert > New Sheet
  2. From your original sheet (Sheet1) copy the importRange formula in cell A1
  3. In your new sheet (Sheet3) paste the formula you just copied inside a QUERY formula shown below

    =QUERY(ImportRange("0AonYZs4MzlZbdFdrRGthS3dLVzlBdWVrV2lIbzZKY0E","Times Higher Education rankings 2012!A1:D104"), "SELECT Col1, Col2, Col3, Col4, Col2-Col1 LABEL Col2-Col1 'Difference'")

You should now have a table that looks like this:

QUERY for difference

How it works

The QUERY function imports the data and, using the Google query language, selects columns 1 to 4 and also adds a fifth by taking the difference between Col2 and Col1; this new column is labelled Difference. Notice that on row 93 the French entry no longer has an error, but is blank.

We could continue exercise 4 and get a summary chart using UNIQUE and SUMIF. An alternative would be to use the QUERY formula again to do this for us by:

  1. In Sheet 3 enter the following formula in cell G1 =QUERY(A:E,"SELECT D, SUM(E) WHERE D <> '' GROUP BY D ORDER BY D")

This time we are setting the data source as all the data in columns A to E in Sheet3. Then we are creating a query that selects column D and a sum of column E where D is not blank, grouped by the value in column D (the country names).

A reminder that here’s a copy of the completed spreadsheet used in exercises 4, 5 and 6


Hopefully you’ve seen that Google Sheets (Spreadsheets) can be a useful tool for importing data from other sources and reshaping it to fit your needs. The last set of exercises is more advanced, so don’t worry if you don’t fully understand what is happening; they are there as an indication of what is possible and will hopefully inspire you to find out more.

Three points worth remembering when using import formulas:

  • The data is a live link - if the data source changes or disappears your imported data will also change or disappear (this can be both an advantage and a disadvantage).
  • The ambiguity of a cell - because a spreadsheet cell can be a value or a formula, sorting values generated by formulas can lead to unforeseen problems.
  • There are other ways of getting data into Google Sheets - this guide has only looked at ‘Import...’ formulas for pulling data in. There are other ways of populating sheets using Google Forms, Google Apps Script (a minimal sketch of this route follows below) and 3rd party services like IFTTT (for an example of this last one see ‘IFTTT: IF I do THAT on {insert social network/rss feed/other} THEN add row to Google Spreadsheet’ - )
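As a minimal sketch of the Apps Script route mentioned above – it assumes a sheet named ‘Log’ exists in the active spreadsheet and simply appends a timestamped row:

function logRow() {
  // get the sheet we want to write to (assumed to be called 'Log')
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Log');
  // append a new row with the current date/time and a value
  sheet.appendRow([new Date(), 'value fetched or calculated elsewhere']);
}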

A reminder that the Google Docs version of the guide is here and contains more examples. If you get stuck leave a comment.


Repositories are living archives. In terms of the support a repository must provide for stored files, it must take into account two important functions of the files it holds:

  1. Access: The files are held so that users can access them. This means that they must be stored in formats that can be used by today's intended audience
  2. Preservation: The files are held so that users in 5, 10, 50, or more years can still access them. This means that they must be stored in formats that can be used by future audiences, or in formats that can easily be migrated

These two considerations are not always complementary. A file format that is good for access today may not be a format that is easy to migrate, but a format that is easy to migrate may not be easy to read.

The text above is taken from the JISC infoNet Digital Repositories infoKit. An added complication when considering the deposit of OER is, if you are not using a ‘No Derivatives’ licence, how can you support remix/editing? Here’s a scenario taken from WikiEducator:

A teacher wants to make a collage. She imports several PNG photos into Photoshop and creates the collage. She saves the file as a PSD and exports a copy as a PNG to post on the web. While others can edit the PNG, it would be a lot easier to edit the PSD file. However, in order to use PSD files, the person has to have a copy of Photoshop.

Already it’s starting to get more tricky. PSD is a proprietary file format developed and owned by Adobe and used in Photoshop. You can actually open and edit PSD files in open source tools like GIMP (I’m not sure how legally GIMP can do this – waiting for a response from OSS Watch. Update: I've had a response. Upshot: 'it can be awkward on all levels'. I'll point to a related blog post when it's published. Post by Scott Wilson at OSS Watch on using proprietary file formats in open source projects). Similarly you can use open source alternatives to Microsoft Office, like LibreOffice, to open and edit DOC/XLS/PPT etc., but in this case Microsoft's proprietary file formats are covered by their Open Specification Promise, which, if you read the Wikipedia page, itself has a number of issues and limitations.

The next issue is, as highlighted by Chris Rusbridge in his Open letter to Microsoft on specs for obsolete file formats, that the OSP doesn’t cover older file formats. So if you were an early adopter publishing OER in editable formats there is a danger that the format you used won’t be suitable down the line.

I’m mindful of the Digital Repositories infoKit’s last point of guidance:

Be practical: Being overly-strict about file formats may mean collecting no files leading to an empty repository! A sensible approach must be used that weighs up the cost and benefits of different file formats and the effort required to convert between them.

Should OER file formats be tomorrow’s problem?


In CFHE12 Week 2 Analysis: Data! Show me your data and I’ll show you mine I highlighted some of the issues with collecting RSS feeds for participant blogs. The main issues are:

  • participant knowing their blog’s RSS feed
  • providing a feed filtered for a specific tag
  • data entry (missing http://, whitespace etc)
  • automatic registration of a feed with another system

This last one is very dependent on the system you are using for aggregating participant contributions. gRSShopper (developed by Stephen Downes) is an integrated solution whilst, as far as I’m aware, the FeedWordPress plugin used in ds106 and others requires some manual data entry, although bulk import is possible.

Before outlining my vision of a cMOOC registration system there is a basic decision about what you want to aggregate feeds on. Given the issues with getting Tag Feeds for a variety of Blogging Platforms I’m leaning towards asking participants to use a course identifier in each post title rather than as a tag/category/label. This makes feed detection easier and, whilst I'm not familiar with the backend of gRSShopper, I think it would be a trivial bit of extra code; I’m already aware of extra plugins for FeedWordPress to filter posts. I will however provide outlines for both:

Registration flow with course posts by tag/category/label

Part of this is a modification of the existing registration process used in ds106.

  1. Optional: Ask user to generate a post in their blog with course tag (you could provide some set text advertising course)
  2. Enter details:
    1. name, social media accounts etc
    2. blog homepage
    3. blogging platform
  3. From blog url/platform display guessed (auto-detected) feed (if you’re using the optional step this can be validated with auto-detection). If not a recognised blogging platform, or a tag/category/label feed is not available, instruct the participant to include the course tag in all post titles.
  4. Submit details

Registration flow with course posts by title

  1. Optional: Ask user to generate a post in their blog with course tag (you could provide some set text advertising course)
  2. Enter details:
    1. name, social media accounts etc
    2. blog homepage
  3. From blog url display guessed (auto-detected) feed (if you’re using the optional step this can be validated with auto-detection; a sketch of this auto-detection step follows after this list).
  4. Submit details
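A minimal sketch of the feed auto-detection step in both flows, assuming an Apps Script setting and that the blog homepage exposes a standard RSS/Atom autodiscovery <link> tag (both assumptions on my part):

function guessFeedUrl(blogUrl) {
  // fetch the blog homepage
  var html = UrlFetchApp.fetch(blogUrl).getContentText();
  // look for the RSS/Atom autodiscovery link element
  var link = html.match(/<link[^>]+type=["']application\/(?:rss|atom)\+xml["'][^>]*>/i);
  if (!link) return null; // nothing found - fall back to asking the participant
  // pull the href attribute out of the matched tag
  var href = link[0].match(/href=["']([^"']+)["']/i);
  return href ? href[1] : null;
}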

Another aspect not mentioned here is letting the user edit their feed.

That’s my suggestion anyway. Your thoughts very welcome! BTW Yishay Mor at the OU has started thinking about the wider functionality of a cMOOC aggregation system.


Focusing on some of the data behind Current/Future State of Higher Education (CFHE12) has been very stimulating and has got me thinking about the generation and flow of data on many levels. Having recently produced some data visualisations of ds106 for OpenEd12 it was reassuring that one of the first questions was “is the #ds106 data openly licensed?”. Reassuring because it is good to see that people are thinking beyond open content and open learning experiences and thinking about open data as well. So what data is available around CFHE12? In this post I look at data feeds available from CFHE12, see what we can get and suggest some alternative ways of getting the data and pulling it in to other services for analysis. Finally I highlight the issues with collecting participant feeds filtered by tag/category/label.

edfuture homepage sidebar

Working down some of the menu options on the CFHE12 home page, let's see what we’ve got and how easy it is to digest.

Newsletter Archives

This page contains a link to each ‘Daily Newsletter’ sent out by gRSShopper (Stephen Downes’ RSS/content collection, remix and distribution system). I’m not familiar with how/if the Daily data can be exported by an official API, but with tools like Google Spreadsheets, Yahoo Pipes and others it's possible to extract a link to each edition of the Daily using some basic screen scraping techniques like XPath. So in this sheet in cell A2 I can get a list of archive pages using =ImportXML("","//a/@href"). Using ImportXML on the resulting list of pages it’s possible to get a comma separated list of all the posts in each Daily (column B).

The formula in column B includes a QUERY statement and is perhaps worthy of a blog post in its own right. Here it is: =IF(ISBLANK(A2),"",JOIN(",",QUERY(ImportXML(A2,"//a[.='Link']/@href"),"Select * WHERE not(Col1 contains(''))"))). In brief the pseudocode is: if the cell is blank return nothing, otherwise comma-join the array of results from importXML for links which use the text ‘Link’ where it doesn’t contain a link to . Note: there is a limitation of 50 importXML formulas per spreadsheet so I’ll either have to flatten the data or switch to Yahoo Pipes.

The resulting data is of limited use but it’s useful to see how many posts have been published via gRSShopper and by who:

CFHE12 posts per day CFHE12 posts per domain

It’s at this point that we start to get a sign that the data isn’t entirely clean. For example, in cell A153 I’ve queried the Daily Newsletter posts from this blog and I get the four results shown below:

You can see there is a double post and actually, at the time of writing, I've only made two posts tagged with cfhe12. Moving on, but coming back to this point later, let's now look at the feed options.


The course has feed options for: Announcements RSS; Blog Posts RSS; and OPML List of Feeds. I didn’t have much luck with any of these. The RSS feeds have data but aren’t valid RSS, and the OPML (a file format which can be used for bundling lots of blog feeds together) only had 16 items and was also not valid (I should really have a peek at the source for gRSShopper and make suggestions for fixes, but time and lack of Perl knowledge have prevented that so far). I did attempt some custom Apps Script to capture the Blog Posts RSS to this sheet, but I’m not convinced it’s working properly, in part because the source feed is not valid. There are other feeds not listed on the home page I might dig into, like the Diigo CFHE12 Group, which I’m collecting data from in a Google Spreadsheet using this IFTTT recipe.

Generating an OPML and Blog Post RSS from ‘View List of Blogs’ data


All is not lost. gRSShopper also generates a page of blogs it’s aggregating. With a bit more XPath magic (=ImportXML("","//a[.='XML']/@href")) I can scrape the XML links for each registered blog into this sheet. Using the Spreadsheet -> OPML Generator I get this OPML file of CFHE12 blogs (because the spreadsheet and OPML generator sit in the cloud this link will automatically update as blogs are added or removed from the Participant Feeds page). For more details on this recipe see Generating an OPML RSS bundle from a page of links using Google Spreadsheets.

Blog posts RSS

Earlier I highlighted a possible issue with posts being included in the Daily Newsletter. This is because it can be very tricky to get an RSS feed for a particular tag/category/label from someone's blog. You only need to look at Jim Groom’s post on Tag Feeds for a variety of Blogging Platforms to get an idea of the variation between platforms. It’s worth underlining the issue here: each blogging platform has a different way of getting a filtered RSS feed for a specific tag/category/label, and in certain cases it’s not possible to get a filtered feed at all. When a student registers a feed for an online course it can be difficult for them to identify their own blog's RSS feed, let alone a filtered feed.
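To give a flavour of the variation, these are the sorts of URL patterns involved (recalled from memory rather than taken from Jim's post, so treat them as indicative; the domains are placeholders):

  • WordPress:
  • Blogger: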

CFHE12 Pipework

As an experiment reusing the Participant Feeds page as a data source I’ve come up with this Yahoo Pipe which fetches all the feeds and tries to filter the results. It’s slightly crude in the way it’s filtering posts by looking for a course tag: as a category/tag (if exposed in the source feed), or in the post title, or in the post content. Using this pipe it’s currently returning 48 items (although it says 36 when outputting in a different file format) compared to the 77 from the Daily Newsletter Archives. The nice thing about Pipes is I can get the data in different formats (e.g. RSS, JSON, CSV) so more mashing up is possible.

Before you go off thinking Yahoo Pipes is the answer for all your open course RSS aggregation there are some big questions over reliability and how this solution would scale. It’s also interesting to note all the error messages because of bad source feeds:

This Pipe ran successfully but encountered some problems:

warning Error fetching Response: Not Found (404)

warning Error fetching Response: OK (200). Error: Invalid XML document. Root cause: org.xml.sax.SAXParseException; lineNumber: 1483; columnNumber: 5; The element type "meta" must be terminated by the matching end-tag " ".

warning Error fetching Response: Not Found (404)

warning Error fetching Response: Bad Request (400)

warning Error fetching Response: Not Found (404)

warning Error fetching Response: OK (200). Error: Invalid XML document. Root cause: org.xml.sax.SAXParseException; lineNumber: 5; columnNumber: 37; The entity "rsaquo" was referenced, but not declared.

warning Error fetching Results may be truncated because the run exceeded the allowed max response timeout of 30000ms.

Rather than trying to work with messy data one strategy would be to start with a better data source. I have a couple of thoughts on this I’ve shared in Sketch of a cMOOC registration system.


So what have I shown? Data is messy. Data use often leads to data validation. Making any data available means someone else might be able to do something useful with it. In the context of cMOOCs, getting a filtered feed of content isn’t easy.

Something I haven’t touched upon is how the data is licensed. There are lots of issues with embedding license information in data files. For example, should the OPML file I generated technically be licensed CC-BY-NC-SA CFHE12, because this is the license of the source data? I’m going to skip over this point but welcome your comments (you might also want to check the Open Data Licensing Animation from the OERIPR Support Project).

[I’ve also quietly ignored getting data from the course discussion forums and Twitter archive (the latter is here)]

PS Looking forward to next week's CFHE12 topic Big data and Analytics ;)

The JISC OER Rapid Innovation programme is coming to a close and as the 15 projects do their final tidying up it’s time to start thinking about the programme as a whole, emerging trends and lessons learned, and to help projects disseminate their outputs. One of the discussions we’ve started with Amber Thomas, the programme manager, is how best to go about this. Part of our support role at CETIS has been to record some of the technical and standards decisions taken by the projects. Phil Barker and I ended up having technical conversations with 13 of the 15 projects, which are recorded in the CETIS PROD Database in the Rapid Innovation strand. One idea was to see if there were any technology or standards themes we could use to illustrate what has been done in these areas. Here are a couple of ways to look at this data.

To start with, PROD has some experimental tools to visualise the data. By selecting the Rapid Innovation strand and selecting ‘Stuff’ we get this tag cloud. We can see JSON, HTML5 and RSS are all very prominent. Unfortunately some of the context is lost as we don’t know, without digging deeper, which projects used JSON etc.

PROD Wordcloud

To get more context I thought it would be useful to put the data in a network graph (click to enlarge).

NodeXL Graph

NodeXL Graph - top selected

We can now see which projects (blue dots) used which tech/standards (brown dots) and again JSON, HTML5 and RSS are prominent. Selecting these (image right) we can see they cover most of the projects (no. 10), so these might be some technology themes we could talk about. But what about the remaining projects?

As a happy accident I put the list of technologies/standards into Voyant Tools (similar to Wordle but way more powerful – I need to write a post about it) and got the graph below:

Voyant Tools Wordcloud

Because the wordcloud is generated from words rather than phrases the frequency is different: api (16), rss (6), youtube (6), html5 (5), json (5). So maybe there is also a story about APIs and YouTube.

data model, originally uploaded by psd
I thought it would be useful to give a summary of some of the tools I use/have developed at CETIS to monitor the pulse of the web around our own and JISC work. All of these are available for reuse and documented to varying degrees. All of the tools also use Google Spreadsheets/Apps Script, which is free for anyone to use with a Google account, and all the recipes use free tools (the exception being owning a copy of Excel, but most institutions have this as standard).


Hashtag archiving, analysis and interfacing

Used with: CETIS COMMS, OERRI Programme, #UKOER, …
What it does: It’s a Google Spreadsheet template which can be setup to automatically archive Twitter searches. The template includes some summaries to show top contributors and frequency or tweets. There are a number of add-on interfaces that can be used to navigate the data in different ways, including TAGSExplorer and TAGSArc.

More info:

Monitoring Twitter searches and combining with Google Analytics

Used with: CETIS COMMS
What it does: Archives all tweets linking to the domain and combines this with our Google Analytics data to monitor influential distributors of our work.

More info:

RSS Feed Activity Data Monitoring

Used with: CETIS COMMS, OERRI Programme
What it does: Gives a dashboard view of the total social shares from a range of services (Facebook, Twitter, Google+) for a single RSS feed or a combination of RSS feeds. At CETIS we also monitor the social popularity of blogs referencing us by using an RSS feed from Google’s Blog Search e.g.

More info:

Post Activity

Blog Activity Data Feed Template Overview
Used with: CETIS COMMS
What it does: Gives more detailed activity data around socially shared urls combining individual tweets from Topsy, Delicious, and post comments.

More info:

Post Directory

Used with: OERRI Programme
What it does: Dashboards all the project blogs from the OERRI Programme and monitors when they release blog posts with predefined tags/categories. The dashboard also combines the social monitoring techniques mentioned above so that projects and the programme support team can monitor social shares for individual blog posts.

More info:

Automatic final report generation

OERRI Dashboard
Used with: OERRI Programme
What it does: As an extension to the Post Directory this tool combines project blog posts from a predefined set of tags/categories into a final report as an editable MS Word/HTML document. Currently only the original post content, including images, is compiled in individual reports, but it would be easy to also incorporate some of the social tracking and/or post comments data.

More info:


As well as standalone tools I’ve documented a number of recipes for analysing monitoring data.

Twitter conversation graph

Used with: #moocmooc, #cfhe12
What it does: Using data from the Twitter Archiving Google Spreadsheet template (TAGS) this recipe shows you how you can use a free Excel add-on, NodeXL, to graph threaded conversations. I’m still developing this technique but my belief is there are opportunities to give a better overview of conversations within hashtag communities, identifying key moments.

More info:

Community blogosphere graph

Used with: #ds106
What it does: Outlines how data from blog posts (in this case a corpus collected by the FeedWordPress plugin used in DS106) can be refined and graphed to show blog post interlinking within a community. An idea explored in this recipe is using measures from social network analysis to highlight key posts.

More info:

Activity data visualisation (gource)

Used with: #ukoer
What it does: Documents how data can be extracted (in this case records from Jorum) and cleaned using Google Refine (soon to be renamed OpenRefine). This data is then exported as a custom log file which can be played in an open source visualisation tool called Gource. The purpose of this technique is to give the viewer a sense of the volume and size of data submitted or created by users within a community.

More info:

So now go forth and reuse!


In a recent post I showed how, given some urls, it was easy to generate a templated report as a Word document. This was partly done to show how blog posts from a JISC funded programme could be used to generate reports. One of the issues with this solution is not all the projects used WordPress as their blogging platform, making it harder to get at some of the data. One thought I had was rather than dictating a particular platform JISC could aggregate all programme posts in a central database. This isn’t entirely new as at CETIS we already aggregate links to posts for selected programmes via our PROD database; the difference here would be that as well as a link the post content would also be ingested, making it easy to query and analyse the data from a single source. As I recently wrote about in Notes on technology behind cMOOCs: Show me your aggregation architecture and I’ll show you mine, this technique is common in cMOOCs, and in particular I highlighted how Digital Storytelling (DS106) does this already using the FeedWordPress plugin.

As part of a double hit I thought I’d see if there was a case for JISC implementing this type of infrastructure and, because I’ve got an interest in applying learning analytics to open online courses, have a look at the data stored by ds106 and see what analysis I could do. In the next series of posts I share my notes and sketches around extracting FeedWordPress data, supplementing it with 3rd party sources and analysing dialogue/activity around posts. In this first post I’m going to outline how all the data was imported into a local WordPress database, queried to pull out a data slice, refined and mined, before visualisation/analysis. What we end up with is a view of how ds106 blog posts are interlinked by hyperlinks within the post body. For the next in the series I record some notes on comment extraction and finally how to get more activity data from social network sites.

The data source

If JISC were hosting the data then in theory it would be a lot easier to get to. As my test data comes from ds106 some extra preparatory work was required. Fortunately #ds106er Alan Levine (@cogdog) was able to provide me with a MySQL dump of data from their WordPress install. The dump contained over 700,000 lines of SQL commands and was over 500Mb. Problem one was then how to get this back into a local MySQL database to make slicing, dicing and exporting easier.

For a local Apache/MySQL/PHP server I used XAMPP Lite. Importing the data wasn’t straightforward as, despite tweaking the config, I kept having ‘allocated memory’ errors. The solution was ‘BigDump: Staggered MySQL Dump Importer’, which was able to automatically split and run the import. Even with this I got a couple of ‘MySQL has gone away’ errors, so I used Vim (a text editor that can handle very large files) to drop 5 blog post inserts (given ds106 has over 20,000 posts I can live with the loss).

The selected data export

Wordpress Database Diagram

Those familiar with the backend of WordPress will know it has a relational structure with a number of tables holding different parts of the data. Because I knew I wanted to explore ‘ds106 dialogue’ and that to do this the data would have to be compiled with other sources, I needed to export a slice of the data. Whilst I was doing this it also made sense to make some of the data more meaningful. For example, the wp_posts table, which has most of the data I needed, uses an author id number whereas it would be better to get this as a display name by joining it with the wp_users data. Fortunately because MySQL is … well, a ‘query language’, this is relatively easy to do by using the command below. To talk you through it, I’m selecting a couple of columns from the wp_posts table and joining it with some other data on the condition that it’s a published post. As well as author display name you’ll see that I’m also returning a wfw:commentRSS. This is the comment RSS feed caught by the FeedWordPress plugin (I think it comes from here). I got wind of this metadata after reading Jim Groom’s Displaying Distributed Comments on the Hardboiled Blog. I’ll show how this is used later.

-- the SELECT list below is reconstructed from the description above
-- (only the query from the comment_feed column down survived in this copy)
SELECT,
       wp_posts.post_title,
       wp_posts.post_date,
       wp_posts.post_content,
       wp_posts.guid,
       wp_users.display_name AS author,
       wp_postmeta.meta_value AS comment_feed
FROM   wp_posts
       LEFT JOIN wp_postmeta
              ON = wp_postmeta.post_id
                 AND wp_postmeta.meta_key = 'wfw:commentRSS'
       LEFT JOIN wp_users
              ON wp_posts.post_author =
WHERE  wp_posts.post_type = 'post'
       AND wp_posts.post_status = 'publish'

Running this query in phpMyAdmin gives us an option to export as a csv, giving almost 20,000 blog posts (one per row) to play with, which is plenty.

The refinement

apply-operations

Using Google Refine we can import the csv file, choosing comma separated and headers in first row. Next we want to process the post content to extract outbound links. Below are the rough steps I used, which you can download and apply to your own project (the steps in the file vary slightly as I noticed that the guid links had ?p={number} instead of the post permalink. I ended up getting these when fetching the comment feed and extracting the link. More detail about this in the next post. Note to self: issue with non-permalinks).

  1. Create column post_as_date at index 3 based on column post_date using expression grel:value.toDate("Y-m-d H:m:s")
  2. Create column outbound_links at index 5 based on column post_content using expression grel:forEach(value.parseHtml().select("a[href~=]"),e,e.htmlAttr("href")).join("|||")
  3. Split multi-valued cells in column outbound_links
  4. Create column source_links at index 9 based on column guid using expression grel:row.record.cells["guid"].value[0]

The key step is the one which mines all the posts for <a href> tags and extracts them into a new column, joining them together with ‘|||’.


Using Refine we can split the outbound_links column to put one link per row (column dropdown, Edit cells > Split multi-valued cells). The last step is to make sure the post source link is included in each row using fill down (the right and secure way). Filtering out blank outbound_link rows and exporting using the Custom tabular exporter (settings I used here), I get some data that looks like this:

The data meets Gephi and NodeXL

With this data we can open it in Excel and then import into NodeXL using Import > From Open Workbook (more about that later), or alternatively, using Gephi, change the column headings in the .csv version to source and target and import via the ‘data laboratory’ tab (for graph manipulation in Gephi Tony Hirst has a great tutorial).

Sticking with Gephi for now, you can generate the images below. On the left is the entire graph, which depicts the ~59,000 links contained in ~20,000 ds106 blog posts. The image on the right is zoomed in to the inset, where we can start seeing each webpage referenced as a dot connected by a line which is a link in a post. Whilst this is pretty, it’s also pretty useless.

All links from ds106 links Close-up of post links

So where to go with this? In the paper A Theoretical Proposal of Learning Communities Management Through the Analysis of the Organizational Blogosphere the authors propose using the measure of ‘betweenness centrality’ (BC) as a way of identifying influential learners within a blogging collective:

The betweeness centrality has been considered in literature (Marsden, 2002) as a way to find the most valuable nodes within a social network. The strategic function of these characters is fundamental for the system of learners, because they have the widest view of what happens in the net. This is due to the fact that they intercept the majority of the knowledge and information flows within the learning community. The people with the highest betweeness centrality can share their vision on the main knowledge claims of the community in an open debate.

Extrapolating this slightly, let's consider individual blog posts rather than individual learners, using BC as a way to highlight posts that may bridge concepts or support the community of learners in some way. Rather than examine all links in the blog posts, let's first consider internal linking within the ds106 community.

To do this let's turn back to NodeXL, which I find a lot easier to get data into and manipulate. Opening our export file from Google Refine and then using Import > From Open Workbook we can import all the data, with source and target as edge properties and the rest as Vertex 1 properties. This gives you over 75,000 connections between over 59,000 links – back at square one. A nice feature of NodeXL is being able to tell it to ‘skip’ some data (edge, vertex or group). The data is kept in the spreadsheet so you can include it later, but it’s not used for any calculations. I only want to analyse ds106 posts that link to other ds106 posts. Here are brief notes on the way I did it (other ways are possible and might even be better).

Autofill options

On the ‘Edges’ sheet in the ‘Other columns’ section I created 4 columns named Community Link, Self Link, Combined and Internal Link, then added the following formula in each column:

  • =IF(ISERROR(VLOOKUP([@[Vertex 2]], Vertices!A:AF,30 ,FALSE)),0,IF(LEN(VLOOKUP([@[Vertex 2]], Vertices!A:AF,30, FALSE))>12,1 ,0)) – because we only have dates, titles, names for posts made by the ds106 community, if this detail is blank on the vertices sheet then the edge is outwith the community (if it exists enter 1, otherwise 0) Tip: after doing this I copied all the Community Link values and Paste Special, Values to save recalculation on edit/open.
  • =IF([@[Vertex 1]]=[@[Vertex 2]],0,1) – test if edge is self-link
  • =[@[Community Link]]*[@[Self Link]] – multiplying these together as a simple operator
  • =IF(ISERROR(OR(FIND("/",[@[Vertex 1]],12), FIND("/",[@[Vertex 2]],12))), 0, IF(LEFT([@[Vertex 1]], FIND("/",[@[Vertex 1]], 12))=LEFT([@[Vertex 2]], FIND("/",[@[Vertex 2]],12)), 1,0)) – used to test if edge is two posts by the same author


Using the Autofill we can choose to skip edges based on the ‘Combined’ value (and while we are at it set the edge style based on the ‘Self Link’ value). Once this is done you can use NodeXL to calculate Groups and Metrics, knowing that it’ll skip the edges it doesn’t need.

Once we’ve done all of this and applied some chart options this is the graph we get.

ds106 interlinked blog posts

Taking a closer look at one of the groupings (G1) shown below we can see how ds106 blog posts (the dots) link to each other, a dotted line indicating it’s the same post author referring to their own work.

Group G1

The first impression

We now have a dataset of ds106 blog posts and the webpages that they individually link to. There are some interesting patterns and more work to be done to interpret these. Whilst this technique has surfaced interlinks within the community there is no context in which they are made. Ranking the posts by betweenness centrality we get the following top 10:


It’s interesting that most of these come from tutors (although given the amount of co-creation the role of tutor/student is very blurred), which in some ways isn’t that surprising given they will be summarising and highlighting key work.

The Limitations

Hopefully by this point you are screaming at the screen because of the limitations of this analysis. The focus has been on blog post content, which I hope I’ve shown can be used to identify interesting moments. The big limitation here is that it’s looking at a narrow slice of activity - how students are linking to blog posts - without any real context. Another staple of blogs is comments. Comments provide an easy way for a reader to feed back to the author, continuing the dialogue and sharing thoughts, ideas and reflections. It’s also all well and good me doing this, but the really interesting thing would be to put this graphical overview in the hands of tutors and students to see if it helps them. I imagine it wouldn’t be hard to wrap the MySQL query initially used in some PHP and a visualisation library like d3.js and provide users with some situational awareness of what is going on in their community. If you would like to explore the data the NodeXL file is here for download.

As part of this post was also about the feasibility of using FeedWordPress to aggregate JISC project blogs here are some of my notes:

  • Exporting post content was possible with MySQL access (interfaces would be easy to design)
  • Issue encountered with how post links are stored (blogspot/blogger feeds use an internal guid, whereas other guids are links to the original posts). This was solved by fetching the comment feed and parsing the page link from the response
  • Need to be aware that FeedWordPress can't get content for partial rss feeds but plugins are available
  • Registering a feed is a manual process
  • Issue with mixed-mode feeds (if you are automatically pulling in all content if the feed source has a mixed use e.g. blog is mixture of personal and work, then data isn't as clean)
  • Getting the data into Google Refine made it easy to mine post content e.g. in one expression I'd calculated that ds106 has published over 4 million words

There are other technicalities of FeedWordPress I'd want to explore like how post edits are handled. Overall it was a lot easier having a single data source.

Your Thoughts

At this point it would be very useful to have your thoughts on this technique. And remember I’m a ‘maker of things’ so your suggestions may get incorporated into a functional tool ;)

In my next post in this series I’ll share my notes on building and analysing comments data.


As I mentioned in Filtering a Twitter hashtag community for questions and responses, I’ve been asked to do some analysis of the Current/Future State of Higher Education (CFHE12) course. Week 1 has mainly been about creating a toolchain that makes it easier to hit a button and get some insight. The focus has mainly been on tweets with the #cfhe12 hashtag. I’m still scratching my head as to what this all means, but there are already discussions to extend the scope, trying to establish more context by also looking at blog and discussion forum posts. The danger I have as a ‘maker of things’ is that as questions emerge I want to make things to help find the answers.

To ease into this let's start with an overview. Here are some key stats for 7-13th October 2012 (BST) (and already I'm resisting the temptation to create an overview template):

  • 762 Tweets
  • 305 Links
  • 172 RTs
  • 244 Unique twitter accounts
  • 14% (n.104) of tweets were in @reply to another person using #cfhe12

This sheet contains more details including a summary of who tweeted the most and got the most @mentions and the ‘Dashboard’ sheet which let me know that this was the most retweeted tweet:

Below are two graphs summarising the Twitter activity for week 1 of #cfhe12 (LHS) and another course earlier in the year #moocmooc (you can click on both of these for interactive versions).

summary of #cfhe12 tweets for week 1
#cfhe12 week 1 tweets

Summary of tweets from #moocmooc
#moocmooc tweets

It’s notable that the volume and proportion of tweets and @replies is higher in #moocmooc. Part of this could be down to the fact that #moocmooc was a condensed course that was one week long. Other factors may include the chosen infrastructure and how this was promoted, size of course and who was participating.

Extracting a conversation graph, which is shown below, there isn’t a great deal of @replying for week 1. In the graph each dot represents a single tweet and dots are joined if one is an @reply to the other. I probably need to find a way for you to interact with this graph, but for now I’ve prepared these pages with conversations for groups G1-G4:

cfhe12 week 1 conversation graph
[The above graph data can be downloaded from the NodeXL Graph Gallery]

Exploring G3 and G4 some of the limitations of this technique become apparent. For example clicking on the date in the first tweet in G4 reveals the full text from Twitter, which includes text from G3 i.e. they are the same conversation and should be grouped together.

So more work to do, more things to think about, and more tools needed to make sense of this more easily. In the meantime any of your observations are greatly welcome.