Archive for the 'Google Refine' Category

Integrating Google Spreadsheet/Apps Script with Google Refine to update existing spreadsheets

Working on some OER Visualisation Project work today I found I needed to get some additional information from an external web service for a Google Spreadsheet I was working on. I could have of course just used Google Apps Script, a native feature within Google Spreadsheets, to do a UrlFetch and get the information but my source data was a bit messy and timeouts on Apps Script prevent you from getting lots of data in one hit. Instead it made more sense to download the data into Google Refine, clean it up, call external web services to get additional information, then somehow get it back into the original spreadsheet.

As I recently posted Integrating Google Spreadsheet/Apps Script with R: Enabling social network analysis in TAGS it wasn’t much of a cognitive leap to try a similar trick of publishing the spreadsheet as a service to allow a basic API write. In the R example, however, I used the POST method to pass data and whilst I’m pretty sure with the right Jython libraries you could do something similar in Refine, I wanted to keep it simple so here’s my method for using GET instead.

  1. In a Google Spreadsheet open Tools >  Script editor.. and paste the following code (there is some tweaking required to where you want the data to go and the secret passphrase – in this example I want to pass back a value named geo 
  2. Once tweaked Run > Setup (this get a copy of the spreadsheet id need to run as a service)
  3. Open Share > Publish as service..  and check ‘Allow anyone to invoke’ with ‘anonymous access’, not forgetting to ‘enable service’. You’ll need a copy of the service URL for later on. Click ‘Save’
    Publish as service
  4. Back in Google Refine select a column with some data you want to pass and choose Edit column > Add column by fetching URLs…
  5. The expression you use will be something like the one below where we have the service url from step 3, the ‘secret’ to match the one set in step 1 and what ever data you want to include 
    "https://docs.google.com/macros/exec?service=AKfycbw1VL_kNYEwhqELgccMgEvM9gJ3t1zhBr6bhLs84g&secret=lemon&idx="+cells['idx'].value+"&geo="+escape(value,"url")
    Add column dialog

Note: You’ll see from the script my import included an idx number which matched a row number, but as long as I didn’t add any additional rows or change the order I could have used row.index in the expression instead.

You also might want to play around with the throttling. I had problems with my first go because this was either set too low or my values weren’t url encoded.

Instead if anyone finds this useful or if they show how to do in Jython. The amount of Google Spreadsheet work I do I’m sure I’ll use this trick again ;)  

OER Visualisation Project: How is OER being shared – Quick look at Jorum and #ukoer Twitter archive [day 24]

This might be slightly off-topic for the OER Visualisation project, but I followed an idea, did – what I think are – some interesting things with an archive of tweets and thought I would share. This line of thought was triggered by a tweet from Terry McAndrew in which he asked:

@mhawksey Have you a visualisation planned of JORUM ‘customers’ of OER (and the rest of it for that matter).

Tracking who views or uses OER material can be tricky but not impossible the main issue comes when someone else like me comes along and wants to see who else has viewed, used, remixed the resource. For example, with the Jorum ukoer resources the only usage data I could get was each resource page view and even getting this required scraping over 8,000 pages. This is mentioned in day 18, but I realise I haven’t gone into any detail about how Google Refine was used to get this data – if someone wants me to I will reveal all.

A recent development in this area is the US Learning Registry project which is looking to maximise the use of activity data to support resource discovery and reuse. On Lorna Campbell’s CETIS blog there is a very useful introduction to the Learning Registry announcing JISCs involvement in a UK node. The post includes the following use case which helps illustrate what the project is about:

“Let’s assume you found several animations on orbital mechanics. Can you tell which of these are right for your students (without having to preview each)? Is there any information about who else has used them and how effective they were? How can you provide your feedback about the resources you used, both to other teachers and to the organizations that published or curated them? Is there any way to aggregate this feedback to improve discoverability?

The Learning Registry is defining and building an infrastructure to help answer these questions. It provides a means for anyone to ‘publish’ information about learning resources. Beyond metadata and descriptions, this information includes usage data, feedback, rankings, likes, etc.; we call this ‘paradata’”

Lorna’s post was made in November 2011 and since then the Mimas have started The JLeRN Experiment (and if you are a developer you might want to attend the CETIS contributors event on the 23rd Jan).

Plan A – Inside-out: Repository resource sharing data

All of this is a bit late for me but I thought I’d see what ‘paradata’ I could build around ukoer and see if there were any interesting stories to be told as part of the visualisation project. This is an area I’ve visited before with a series of ‘And the most socially engaging is …’ posts which started with And the most engaging JISC Project is… For this I used Google Spreadsheet to get social share counts (Facebook, Twitter and more) for a list of urls. One of the issues with this technique is Google Spreadsheets timeout after 5 minutes so there is a limit to the number of links you can get through -this is however not a problem for Google Refine.

Taking 380 of the most viewed Jorum ukoer tagged resources (approximately 5% of the data) I used Google Refine to

  1. ‘Add column by fetching URL’ passing the resource link url into Yahel Carmon’s Shared Count API -  using the expression "http://api.sharedcount.com/?url="+escape(value,"url") 
  2. Add column based on the results column parsing each of the counts – e.g. using expressions similar to parseJson(value)['Twitter']

At this point I stopped parsing columns because it was clear that there was very little if any social sharing of Jorum ukoer resources (here is a spreadsheet of the data collected).  In fact the most tweeted resource which Twitter records as having 7 tweets gets most of these following my tweet as it being the most viewed resource.

So what might be going on here. Is just a issue for national repositories? Are people consuming ukoer resources from other repositories? Are Jorum resources not ranking well in search engines? Are resources not being marketed enough?

I don’t have answers to any of those questions, and maybe this is just an isolated case, but the ‘marketing’ aspect interests me. When I publish a blog post I’ll push it into a number of communication streams including RSS, a couple of tweets, the occasional Google+. For posts I think are really worthwhile I’ll setup a twitter search column on Tweetdeck with related keywords and proactively push posts to people I think might be interested (I picked up this idea from Tony Hirst’s Invisible Frictionless Tech Support. Are we doing something similar with our repositories?  

Plan B – Outside-in: Community resource sharing data

There’s probably a lot more to be said about repositories or agencies promoting resources. To try and find some more answers, instead of looking from the repository perspective of what is being shared, I thought it’d be useful to look at what people are sharing. There are a number of ways you could do this like selecting and monitoring a group of staff. I don’t have time for that, so decided to use data from an existing community  who are likely to be using or sharing resources aka the #ukoer Twitter hashtag community. [There are obvious issues with this approach, but I think it’s a useful starter for ten]

Having grabbed a copy of the #ukoer Twapper Keeper archive using Google Refine before it disappeared I’ve got over 8,000 tweets from 8th March 2010 to 3rd January 2012.  My plan was to extract all the links mentioned in these tweets, identify any patterns or particularly popular tweets.

Extracting links and expand shortened urls

As most tweets now get links replaced with t.co shortened urls and the general use of url shortening the first step was to extract and expand all the links mentioned in tweets. Link expansion was achieved using the longurl.org API, which has the added bonus of returning meta description and keywords for target pages. Here’s a summary of the Google Refine actions I did (taken from the undo/redo history):

  1. Create new column links based on column item – title by filling 8197 rows with grel:filter(split(value, " "),v,startsWith(v,"http")).join("||")
  2. Split multi-valued cells in column links
  3. Create column long_url at index 4 by fetching URLs based on column links using expression grel:"http://api.longurl.org/v2/expand?url="+escape(value,"url")+"&format=json&user-agent=Google%20Refine&title=1&meta-keywords=1"
  4. Create new column url based on column long_url by filling 5463 rows with grel:parseJson(value)['long-url']

Searchable table of #ukoer linksWith long urls extracted the data was exported and uploaded to Google Spreadsheet so that a list of unique urls and their frequencies could be calculated. Here is the Spreadsheet of data.  From the refined data there are 2,482 different links which appear 6,220 times in the #ukoer archive. Here is the searchable table of extracted links with frequencies (sorry for the small font – can’t seem to find a way to control column width using Google Visualisation API … anyone?).

Not surprisingly a number of domain root urls appear at the top of the list. More work needs to be done to match resource sharing to different OER sources, but you can start doing simple filtering to find out what’s there. Something for the initial analysis I find interesting is that the top common link in the archive is to Oxfords Open Advent Calendar, which was a daily posting highlighting some of their OER resources. This could be interpreted as underlying the need for OER resources to be more effectively marketed. I’ll let you decide.

PS out of interest I put the list of 2,500 odd links back into Google Refine and extracted social share counts. I haven’t had a chance to look at the data closely but if you want to play a copy of it and a meta-data enhanced version of the #ukoer archive is here. Please share any findings ;)

Free (and rebuild) the tweets! Export TwapperKeeper archives using Google Refine

Last month I posted Free the tweets! Export TwapperKeeper archives using Google Spreadsheet, which was a response to the announcement that TwapperKeeper would be removing public access to archives on the 6th January. This solution was limited to archives smaller than 15,000 tweets (although minor tweaking could probably get more). Since then Tony Hirst has come up with a couple of other solutions:

One of the limits these solutions have is they only collect the data stored on TwapperKeeper missing lots of other juicy data like in_reply_to, location, retweet_count (here’s what a tweet used to look like, now there is more data). Whilst this data is probably of little interest to most people for people like me it opens the opportunity to do other interesting stuff. So here’s a way you can make a copy of a Twapper Keeper archive and rebuild the data using Google Refine.

  1. You’re going to need a copy of Google Refine and install/run it
  2. Visit the Twapper Keeper page of the archive you want. On the page copy the RSS Permalink into the URL box in Google Refine adding &l=50000 to the end e.g. the ukoer archive is http://twapperkeeper.com/rss.php?type=hashtag&name=ukoer&l=50000 and click Next.
  3. In the preview window that appears switch ‘Parse data as’ to XML files. Scroll the top pane down to hover over the ‘item’ tag and click
    Refined parse xml 
  4. You should now have a preview with the data split in columns. Enter a Project name and click ‘Create Project’
    Refined Columns
  5. From the ‘item – title’ column dropdown select Edit column > Add column based on this column…
  6. In the dialog that opens enter a New column name ‘id_str’ and the Expression smartSplit(value," ")[-1] (this splits the cell and returns the last group of text)
  7. From the new id_str column dropdown select Edit column > Add column by fetching URLs… and enter a name ‘meta’, change throttle delay to 500 and enter the expression "https://api.twitter.com/1/statuses/show.json?id="+value+"&include_entities=true" (that’s with quotes), then click OK.  [What we’ve done is extract a tweet id and then asked refine to fetch details about it from the Twitter API]
  8. Wait a bit (it took about 4 hours to get data from 8,000 tweets)

Once you have the raw data you can use Refine to make new columns using the expression parseJson(value) then navigate the object namespace. You might find it useful to paste a cell value into http://jsonviewer.stack.hu/ to see the structure. So to make a column which extracts the tweets full name you’d use parseJson(value).user.name

So don’t delay ‘free the tweets’

About

This blog is authored by Martin Hawksey+ JISC CETIS Learning Technology Advisor (OER Programme Support)
jisc cetis logo

The MASHezine (tabloid)

It's back! A tabloid edition of the latest posts in PDF format (complete with QR Codes). Click here to view the MASHezine

Preview powered by:
Bluga.net Webthumb

The MASHebook

You can also download this post as:

Subscribe to monthly email digest of posts

Loading...Loading...


Subscribe to per post email updates

Enter your email address:

Delivered by FeedBurner

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 3.0 Unported License. CC-BY mhawksey