OpenRefine

1 Comment

I started writing this last week so the intro doesn’t make sense. Slides from the presentation I did are here (all pictures so probably also makes very little).

To paraphrase Stephen Heppell (I’ve misquoted this before):

Content is king, but community is sovereign

The UKOER Programme is having it’s final meeting next week and while the final set of projects come to a close a strong community has formed and I’m sure will continue. Something I was interested in doing is looking at how the community has evolved over time. I’ve previously looked at data around the #ukoer hashtag, but not everyone uses Twitter so I thought I look for another data source.  As email is still a strong component in most peoples everyday lives I started poking around OER-DISCUSS JISCMail (Listserv) list:

A public list for discussion about the release, use, remix and discovery of Open Educational Resources (OER). Managed jointly by OU SCORE Project, and the JISC / HE Academy OER Programme.

As far as I could see there are limited options for getting data out of JISCMail (some limited RSS/Atom feeds) so cue the music for a good out fashioned scrape and refine. Whilst I’ll walk you through this for OER-DISCUSS the same recipe can be used for other public lists.

Source survey

Instead of going straight into the recipe I wanted to record some of the initial thought processes in tackling the problem. This usually begins with looking at what you’ve got to work with. Starting with the list homepage I can see some RSS/Atom feeds, that don’t take me far, instead I turn my attention to the list of links for each months archives. Clicking through to one of these and poking around the HTML source (I mainly use Chrome so a right click in the page gives the option to Inspect Element) I can see that the page uses a table template structure to render the results – good. Next I checked if the page would render even when I was logged out of JISCMail, which I can – good. So far so good.

page source

Next a step back. This looks scrapable so has anyone done this before. A look on Scraperwiki turns up nothing on Listserv or JISCMail, so next a general Google search. Looking for terms like ‘listserv data scrape’ are problematic because there are lots of listserv lists about data scraping in general. So we push on. We’ve got a page with links to each months archives and we know each archive uses a table to layout results. Next it’s time to start thinking about how we get data out of the tables. Back in the Chrome Element Inspector we can see that the source contains a lot of additional markup for each table cell and in places cells contain tables within them. At this point I’m think OpenRefine (nee Google Refine).

Scraping list of archive links

A feature of OpenRefine I use a lot is fetching data from a url. To do this we need a list of urls to hit. Back on the list homepage I start looking at how to get the links for each month’s archive. Hover over the links I can see they use a standard sequence with a 2-digit year {yy} and month {mm}

/cgi-bin/webadmin?A1=ind{yy}{mm}&L=OER-DISCUSS

I could easily generate these in a spreadsheet but I’m lazy so just point a Chrome extension I use called Scraper to find the part of the page I want and import to a Google Spreadsheet.  

Scraper

[another way of doing this is creating a Google Spreadsheet and in this case entering the formula =ImportXml("https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=OER-DISCUSS","//tr[2]/td/ul/li/a/@href")

Fetching and Refining the Data

Import projectFinally we can fire up OpenRefine. You could create a project by using the Google Data option, which is used to import data from your Google Docs, instead as it’s not a huge amount of data I use the Clipboard option. At this point the preview will probably separate the data using ‘/’ and use the first row as a column heading so you’ll want to switch to comma or and de-select ‘Parse next’.

  1. Next we want to fetch each month’s archive page by using the Column 1 dropdown to Edit column > Add column by fetching url using the GREL expression "https://www.jiscmail.ac.uk"+value using the column name month_raw
  2. This pulls in each month’s archive page in raw html. Now we want to parse out each row of data in a new column by selecting the dropdown from month_raw and selecting Edit column > Add column based on this column  using the GREL expression forEach(value.parseHtml().select("table.tableframe")[1].select("tr"),v,v).join(";;;") with the column name rows_raw – this selects the second table with class ‘tableframe’ and joins each row with a ‘;;;’
  3. Next from the rows_raw column use Edit cells > Split multi-valued cells using ;;; as the separator
  4. Again from the rows_raw column dropdown select Edit column > Add column based on this column using the GREL expression forEach(value.parseHtml().select("td"),v,v).join(";;;") with the column name rows_parsed – this joins each <td> with a ;;; which will let us spilt the values into new columns in the next step
  5. Now from the rows_parsed column select Edit column > Split into several columns using the separator ;;;

Column split

You should now have something similar to above with columns and rows split out, but still messy with html in the cells. We can cleat these up using Edit cells > Transform using variations of value.parseHtml().htmlText()

Here are the steps I used (the complete operation history you can use in Undo/Redo is here – using this apply all the actions starting with the list of monthly urls)

  1. Text transform on cells in column rows_parsed 4 using expression grel:value.parseHtml().htmlText().replace(" lines","").toNumber()
  2. Rename column rows_parsed 4 to lines
  3. Text transform on cells in column rows_parsed 3 using expression grel:value.parseHtml().htmlText().toDate("EEE, dd MMM y H:m:s")
  4. Rename column rows_parsed 3 to date
  5. Text transform on cells in column rows_parsed 2 using expression grel:value.parseHtml().htmlText().replace(" <[log in to unmask]>","")
  6. Rename column rows_parsed 2 to from
  7. Create column snippet at index 4 based on column rows_parsed 1 using expression grel:value.split("showDesc('")[1].split("','")[0].unescape("html").parseHtml().htmlText()
  8. Create column link at index 4 based on column rows_parsed 1 using expression grel:"http://jiscmail.ac.uk"+value.parseHtml().select("a")[0].htmlAttr("href")
  9. Text transform on cells in column rows_parsed 1 using expression grel:value.parseHtml().htmlText()
  10. Rename column rows_parsed 1 to subject
  11. Create column subject_normal at index 4 based on column subject using expression grel:trim(value.replace(/^Re:|^Fwd:/i,""))

You’ll probably notice some of the rows don’t contain the data we need. An easy way to remove these is use a timeline facet on the date column selecting non-time, blank and error and then from the All column dropdown menu select Edit rows > Remove all matching rows.

Tony Hirst has a great post on faceting tricks. Something not covered is clustering data using facets. We use this as a way to join rows where authors have multiple logins eg Tony Hirst and Tony.Hirst

To do this add a text facet to the author/from column and click Cluster:

Facet view

Finding the right settings is a bit of trial and error combined with a bit of knowledge about your dataset. In this example I saw that there were rows for Pat Lockley and Patrick Lockley so tried some settings until I got a hit (in this case using nearest neighbour and PPM – which by all accounts is the last resort). You might also need to run clustering a couple of times to catch most of the variations

Clustering columns

What can we do with this data?

In Dashboarding activity on public JISCMail lists using Google Sheets (Spreadsheets) I was able to get an indication of the overall activity of the list. Now that author names are clustered I can get a more accurate picture of the top contributors using a text facet on the author column:

Pat Lockley 269, Amber Thomas 122, Phil Barker 82 

I was tempted to mine the individual posts further using the techniques and code posted by Adam Cooper (CETIS), but it didn’t look like I could easily swap the data source. A couple of questions posed by Phil Barker (CETIS) were:

The easiest way I found to get this was to use the Export > Custom tabular exporter (using this settings file), import into Google Sheets (Spreadsheets) and use a couple of formula to get this summary page (opening the summary page will let you see he formula I’ve used):

And there is much more you could do, but I’ll leave it there. If nothing else I hope you get an idea of some of the thought processes involved in extracting data. As always if something is unclear get in touch.