Google Spreadsheet

It’s been a while since I’ve done a review post, but as this year has been a blur, if for no one else then for myself, I wanted to look at my posts from 2013. Rather than looking at everything I wanted to extract some themes, and for this first review I look back at some of the work I’ve done around Google Apps Script and Google Spreadsheets/Sheets.

Google recently (1st October 2013) announced improved segmentation in Google Analytics using age, gender and interests.

It has been interesting to read some of Tony Hirst’s posts on the use of Google Analytics within education. The thread goes back to 2008 with Library Analytics, but most recently Tony has been writing about this in an open course context, in posts such as MOOC Platforms and the A/B Testing of Course Materials and MOOC Busting: Personal Googalytics…, which looks at the idea of collecting and feeding back performance data to users from across platforms.

When Daphne Koller was on the early Coursera push, one of the aspects that caught my eye was using student generated data (aka their answers to questions) to inform course design, in particular to identify misconceptions or incorrectly set questions. To see what I mean here’s a cued clip from a presentation Koller gave at the Centre for Distance Education back in 2012.

YouTube: The Online Revolution: Education at Scale

Merging the two lines of thought I wondered if there was a way you could use Google Analytics to create a similar feedback mechanism. My starting point was Google Analytics Event tracking. From the Event Tracking documentation:

Event Tracking is a method available in the ga.js tracking code that you can use to record user interaction with website elements, such as a Flash-driven menu system. This is accomplished by attaching the method call to the particular UI element you want to track. When used this way, all user activity on such elements is calculated and displayed as Events in the Analytics reporting interface. Additionally, pageview calculations are unaffected by user activity tracked using the Event Tracking method. Finally, Event Tracking employs an object-oriented model that you can use to collect and classify different types of interaction with your web page objects.

Examples include:

  • Any Flash-driven element, like a Flash website, or a Flash Movie player
  • Embedded AJAX page elements
  • Page gadgets
  • File downloads
  • Load times for data

Essentially anything you can trigger with a bit of JavaScript is up for grabs. Looking at setting up event tracking, each event can include:

  • category (required) - The name you supply for the group of objects you want to track.
  • action (required) - A string that is uniquely paired with each category, and commonly used to define the type of user interaction for the web object.
  • label (optional) - An optional string to provide additional dimensions to the event data.
  • value (optional) - An integer that you can use to provide numerical data about the user event.
  • non-interaction (optional) - A boolean that when set to true, indicates that the event hit will not be used in bounce-rate calculation.

So we can capture events and have some control over how they are described. What might we want to catch? Let’s start by looking at multiple choice questions (MCQs). Looking at the anatomy of an event, this is one way we might want to encode it:

  • category: ‘MCQ’
  • action: ‘right’ or ‘wrong’
  • label: a question identifier. This needs to be unique and might be something like coursecode_module_section_question (having a consistently structured label will help filter the data later)
  • value: this is optional but as it needs to be an integer this restricts you a bit. You may want to use time taken to respond, confidence based mark etc.

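To make this concrete, here is a minimal sketch (not the exact code from my example page) of how an MCQ response might be pushed using the classic ga.js asynchronous syntax; the question identifier and timing value are hypothetical examples:

// minimal sketch: record an MCQ response as a Google Analytics event
function recordAnswer(isCorrect, secondsTaken) {
  _gaq.push(['_trackEvent',
    'MCQ',                         // category
    isCorrect ? 'right' : 'wrong', // action
    'EMD101_Mod1_1.1_Q1',          // label: consistently structured question id
    secondsTaken                   // value: optional integer, e.g. response time
  ]);
}
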
I should say before you get carried away with tracking that:

The first 10 event hits sent to Google Analytics are tracked immediately, thereafter tracking is rate limited to one event hit per second.

To see how this works I’ve created this example page with a simple MCQ. This is a ‘live’ example with some crude code to push events to my Google Analytics account. You’ll notice on the page a response graph generated from the GA data. I’ll explain how that was made later.

GA Real-time events

The first thing to note is that we can now see responses in real-time via the Google Analytics admin interface. The interface is not really geared for MCQs and there is a complication of who has access to the Analytics dashboard, but given that there is a Real Time Reporting API in beta a custom slice’n’dice should be possible in the future (I’ve got beta access so this might be one I revisit if/when Events get added to the API).

Similarly the Content > Events report gives us access to historic data, but again it has accessibility issues in terms of who has access to the Google Analytics account. On the plus side, tweaking the display from the default ‘data’ view to ‘performance’ gives a basic bar chart which is more intuitive for this type of data.

Default data view for Content > Events
Performance data view for Content > Events

Segmentation and cohort analysis

There are some other built-in Google Analytics features that may also support analysis of the data including filtering:

GA Filtering

or switching from a ‘line chart’ to a ‘motion chart’ (there are limits on what can be used for x-y values so some experimentation is required) and adding event reports to custom dashboards which may pull in other GA data.

GA Motion Chart

This is where it potentially gets even more interesting, as the new Google Analytics Advanced Segmentation* allows you to do cohort analysis. The built-in segments are perhaps not relevant for this scenario but the custom options have lots of potential. Google provide six segment templates for ‘Demographics’, ‘Technology’, ‘Behavior’, ‘Date of First Visit’, ‘Traffic Sources’ and ‘E-commerce’, but it is easy to add your own custom conditions and sequences for segmentation.

GA custom conditions and sequences for segmentation

*I’m not sure if Google are still rolling this out, but I noticed the new UI and segmentation options were only available in my Google Apps GA account, my standard @gmail account not having this option.

Examples of conditions/sequences you might want to explore include combining Tony’s suggestion of using Analytics A/B testing with event tracking, e.g. identifying any correlation between content and performance, or whether someone visiting page x performed better in the test. It is also worth noting that:

Previously, advanced segments were based on visits. With the new segment, a new option is provided to create user segment. In a user segment, all visits of the users who fit the segment criterias will be selected (such as specific demographics or behaviors). It will be a useful technique when you need to perform user level analysis.

This is particularly useful as “Google Analytics customers are prohibited from sending personal information to Google.” [ref]. So while named individual level analysis isn’t possible you can get down to a user level.

Distributing data

On a practical level, whilst these options potentially open some interesting avenues for exploration, Google Analytics account administration is still not easy. Whilst this area has recently been improved, the granularity of permissions is very coarse, an all or nothing approach. There is a growing list of tools/add-ins that integrate with Google Analytics which let you create custom workflows for data distribution. This is an area Google appear to be working on, recently announcing the Google Analytics superProxy, a web application that runs on Google App Engine to allow the distribution of GA data. This uses the Google Analytics Reporting APIs to define data queries and generate data files. Along similar lines (and announced before Google) I’ve published a similar solution that works in Google Drive (Using Google Spreadsheets as a Google Analytics Data Bridge). Below is an example query using the MCQ example at the beginning of the post. It's currently using a very specific filter to extract the data for all the event labels beginning EMD101_Mod1_1.1_Q1_, but if using standardised labelling you could include results for the entire module or course. I'm also not using any segment filters. As well as using standard segments you can also use custom segments.
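For a flavour of what such a query looks like against the Core Reporting API, here is a sketch (the profile id and date range are hypothetical, and the filter operators need URL-encoding when used for real):

https://www.googleapis.com/analytics/v3/data/ga?ids=ga:12345678&start-date=2013-09-01&end-date=2013-09-30&metrics=ga:totalEvents&dimensions=ga:eventLabel,ga:eventAction&filters=ga:eventLabel=~^EMD101_Mod1_1.1_Q1_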

Google Analytics Query Exporter

As I outlined in my original post there are a number of ways that these slices of Google Analytics data can be shared or consumed into other tools. In the example above the data is written (and refreshed every hour) to the sheet below, Google Sheets providing a convenient environment for sharing and querying data with the relative familiarity of a spreadsheet interface.

At this point I’d imagine some of you are wondering why go through all of this bother when your VLE is able to do similar, if not better, levels of reporting. My eye is primarily on the open education context, where an institutional VLE is usually not an option. It also potentially provides a more holistic data source where you can experiment with content and resources across your little oasis (like ocTEL).

So what do you think? Will you be event tracking your MCQs?

From the postbag Marjolein Hoekstra (CleverClogs) writes:

Short description
Can you make a Google Script for me that compares two strings character by character? If differences are found, the script should point these out. If no differences are found at all, the script should put out the text "[ id. ]".

Detailed description
I have two columns containing lists of horizontally identical, but sometimes almost identical text strings. This is on purpose. Each row has another couple of words that need to be compared.

I'd like to compare them on a character by character basis, and then point out in the second column at which positions it differs from the first, for example like this:

A2: ABCDE

B2: ABKDE

If you compare these two, you'll see that cell B2[3] has 'K' where A2[3] reads 'C'.

My envisioned formula would then populate cell C2 with: "[ – – K – – ]"

As far as I can tell, I'd need a Google Script that parses both strings character by character and outputs "–" when they are identical, or the value of the character from B2 where they differ. It should be relatively simple, with a FOR loop. Thing is, I've never written a Google Script, and it's a bit daunting for me to start on my own.

Note that LEN (A) is always identical to LEN (B)

Background info
In case you're interested in the actual use case: I want to use this formula to compare strings of Chinese characters, where the first column contains the traditional writing of these characters (typically requiring more strokes) and the second column containing the simplified writing of those same characters. Sometimes the characters are different, sometimes they are not. You can see this clearly in the screenshot below.

The Google Spreadsheet is used as input for a flashcard deck I'm building, using the iPhone app Flashcard Deluxe (top-notch system, highly flexible) [also available for Android].

Screenshot:

Google Spreadsheet Example Flashcard Deluxe

There's no need to use Chinese characters to test the formula, I'm just providing this so that you know in what context the formula will be used.

The Solution

My initial thought was to use existing formulas to SPLIT the cell text into individual character values and then do a comparison, but unfortunately the SPLIT formula requires a character to split on. So instead I turned to Google Apps Script and wrote the following custom formula:

function stringComparison(s1, s2) {
  // let's test that both values are the same object type, if not throw an error
  if (Object.prototype.toString.call(s1) !== Object.prototype.toString.call(s2)){
    throw("Both values need to be an array of cells or individual cells");
  }
  // if we are looking at two arrays of cells make sure the sizes match and are only one column wide
  if( Object.prototype.toString.call(s1) === '[object Array]' ) {
    if (s1.length != s2.length || s1[0].length > 1 || s2[0].length > 1){
      throw("Arrays of cells need to be same size and 1 column wide");
    }
    // since we are working with an array initialise the return
    var out = [];
    for (var r in s1){ // loop over the rows and find differences using the diff sub function
      out.push([diff(s1[r][0], s2[r][0])]);
    }
    return out; // return the response array
  } else { // we are working with two cells so return diff directly
    return diff(s1, s2);
  }
}

function diff (s1, s2){
  var out = "[ ";
  var notid = false;
  // loop to match each character
  for (var n = 0; n < s1.length; n++){
    if (s1.charAt(n) == s2.charAt(n)){
      out += "–";
    } else {
      out += s2.charAt(n);
      notid = true;
    }
    out += " ";
  }
  out += "]"; // the loop has already added a trailing space after the last character
  return (notid) ? out : "[ id. ]"; // if not identical return the markup, else [ id. ]
}
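
Attached to a sheet, the custom formula can then be called either on individual cells or on equal-sized, single-column ranges, e.g. =stringComparison(A2,B2) or =stringComparison(A2:A13,B2:B13).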

One of the things to be aware of is that Google Apps Script formulas are associated with a spreadsheet. You can't globally use a custom formula unless the script is attached. Fortunately when copying a spreadsheet you also get a copy of the script, so providing templates is a way around this.

With this limitation in mind I thought I’d have another go at cracking this with built-in formulas … and guess what, it is possible. The key to unlocking this came when playing with the REGEXREPLACE formula: I accidentally turned ‘ABCDE’ into ‘,A,B,C,D,E,’ by using =REGEXREPLACE(A20,"(.*?)",","). My RegEx is terrible so I’ll let someone else explain how this works in the comments, but getting to this point meant I could use a combination of SPLIT and REGEXREPLACE to do a character by character comparison on two cells of text. The final version of the formula goes (comparing cell A14 to B14):

=IF(EXACT(A14,B14),"[ id. ]","[ "&JOIN(" ",ARRAYFORMULA(REGEXREPLACE(SPLIT(REGEXREPLACE(B14,"(.*?)",","),","),SPLIT(REGEXREPLACE(A14,"(.*?)",","),","),"–")))&" ]")

My rough workings are embedded below. You can also make a copy of the entire project including the Apps Script solution here.

Update: Bruce Mcpherson has posted an alternative formula to do this which goes like:

"[ " & CONCATenate(ARRAYFORMULA(if(mid(A31, row(indirect("x1:x"&len(A31))) ,1)=mid(B31,row(indirect("x1:x"&len(A31))),1)," – "," "&mid(B31,row(indirect("x1:x"&len(A31)))&" ",1) ))) &" ]"

As you will see from the comments thread on that post, Marjolein was having problems using my version with a Chinese character set. Adding this to the example spreadsheet I'm unable to replicate the error, but I have encountered the problem here. If anyone can spot the difference I'd welcome your thoughts.

Update 2: Bruce pointed out that "the likely issue is that the columns with the problem are times - the characters mean AM. The same thing would probably happen with numbers. Have you tried wrapping the cell references in concatenate() to convert to a string?"

I said: ah I see what you mean, 时 is being interpreted as 上午12:00:00. Not sure how I'd wrap the concatenate with my regexreplace. Your solution looks better all round so rather than losing sleep I'd go with that.

This was the question that came in over the wire this morning:

My first thought was to use the Google Visualisation API Query Language, which would let you do a SQL-type query and LIMIT the response to 1 record and OFFSET to a particular row. The Google Code Playground has a nice example of this to play with to see what you can do; even better, Tony Hirst’s Guardian Data Explorer helps you generate the query for an html view (although Tony hasn’t implemented LIMIT and OFFSET). So below is an example spreadsheet:

Example Spreadsheet

… and using the query

https://spreadsheets.google.com/tq?tqx=out:html&tq=select * LIMIT 1 OFFSET 3&key=rYQm6lTXPH8dHA6XGhJVFsA&gid=0

We get … (the important bit is ‘OFFSET 3’ where offset 0 = row 2, offset 1 = row 3, etc.)

Google Visualization html row

So a couple of issues. First the spreadsheet needs to be File > Published to the web and the result is also read-only:

Thinking about the other views offered by Google on Spreadsheets, it occurred to me the mobile view might be a solution. The mobile view, if you don’t use a native app, is List view (here is more about List View):

List View 

… which allows you to edit a row of data

Edit row in List View

The final trick was to change the rows per page to 1 and then use the page query to select the row, e.g. to open the spreadsheet to edit row 5 we set &page=4 (the header isn't counted as a row). You can also select the sheet by changing the &gid= number. Tip: Opening your spreadsheet and switching to 'List View' will help you build the URL.

https://docs.google.com/spreadsheet/lv?key={your_key}&type=view&gid=0&f=true&sortcolid=-1&sortasc=true&page=4&rowsperpage=1

Edit single row in List View

So there you go: how to open a Google Spreadsheet at a specific row for editing, for Marjolein to use. The perfect solution … well almost?

Update: Saqib Ali has kindly passed on this trick from one of the Google team (I don't think it would work in the original scenario, but it's still very useful to know).

Insert a comment at a specific cell and, in that comment, "plus someone", that is, type "+" and then the person's email or name. NOTE: you will get a list of people in your domain as soon as you type the "+", filtered as you start typing their name/email. The person you "plussed" will get an email with both the contents of the cell and your comment, and the link in that email will take them DIRECTLY TO THE TARGET CELL with the comment activated.


The concepts used in this post are very similar to the Google Analytics superProxy (announced after my solution ;), the main difference being that mine works in Google Drive, meaning data access authentication is built in.

Access control to Google Analytics data is very coarse. Via the web interface you can have full access or none at all. Given the growing interest in data driven decision making at all levels of the institution I thought it would be beneficial to demonstrate how Google Spreadsheets and Google Apps Script could be used to selectively distribute Google Analytics data. This solution was developed for my session at the Institutional Web Managers Workshop 2013 (IWMW13) which I talk about in more detail at the end of this post.

Google Analytics Query Exporter

So here is the general flow. Google Apps Script, which is part of Google Drive, is used by Google Analytics admins to create custom queries which pull back slices of data. The admins can then control how these slices are distributed, either within the security of Google Drive or published to the web. Also, because Google Apps Script features ‘script triggers’, which can be time based, admins can preconfigure mini scripts to decide when and how the data is distributed. Examples include:

  • Add data slice to a Google Spreadsheet (Spreadsheets feature options for publishing to the web)
  • Turn data slice into a .csv file and attach to an email for the recipient
  • Turn data slice into a .csv file and copy/share via Google Drive
  • Turn data slice into a .csv file and publish on Google Drive
  • All the above with different file formats including json

Here is the template I’ve developed which you can copy and use followed by some setup instructions (also there’s a setup video):

*** Google Spreadsheet GA Query Exporter ***

  1. File > Make a copy of this spreadsheet (you need to be logged in)
  2. Open Tools > Script editor and then:
    File > Upgrade authorisation experience;
    Once it's upgraded still in the Script editor click Resources > Use Google's APIs; and
    Click the 'Google API Console' link at the bottom of the dialog window;
  3. In the Google APIs Console switch the Analytics API 'on' and read/accept the terms (you may have to accept two separate terms windows)
  4. Close the Google APIs Console window and OK the Google API Services window in the Script editor
  5. Before closing the Script editor, Run > authoriseGA to authenticate the script

Once authenticated there are a number of ways to run the script. You can enter a query and click 'Get data' for one-off fetches, or open Tools > Script editor and modify the example cron jobs to automate collection and distribution of data (the cron jobs allow functionality that includes emailing/sharing links to csv files).

To create multiple queries in the same spreadsheet duplicate this sheet and modify the query

Tip: You can use formulas to build the start and end dates, e.g. yesterday is =(TODAY())-1

To help admins construct these data slices the tool is designed to work in collaboration with the official Google Analytics Query Explorer 2. With this users can experiment with queries like this one that returns the top blog posts based on visits. Within the Query Explorer I can export the Query URI and import it into a sheet in the template. The general workflow is demonstrated in the video below which shows how to set up an example query/export:

Using the example query from above I’ve created this template sheet. Similar to the example in the video, in cells E9 and E10 I use formulas to dynamically create a date range. Assuming I want to run the query on the 1st of each month for last month’s stats, we start with the end-date generated using =TODAY()-1. To get the start-date we use the value of this field to get the 1st of last month using =DATE(YEAR(E10),MONTH(E10),1)

I could open the spreadsheet every month and manually click the ‘Get data’ button, but instead I can set up a script trigger to run as a Month timer on the 1st between 1am and 2am.

Setup script trigger/cron job

The code I run is a little scriptlet like:

function exampleCronJobWriteCSVtoParticularFolder(){
  // copy new data as csv and then email a link (a copy is also kept in Google Drive)
  var sheetO = getData("input"); // name of sheet with query to execute e.g. 'input'
  var csv = saveAsCSV(sheetO.getName(), sheetO.getName()+" "+Utilities.formatDate(new Date(), "GMT", "yyyyMMdd"));
  var folder = DriveApp.getFolderById("0B6GkLMU9sHmLbThITlNvb2dzREE"); // folder id from url (navigate to folder and extract for your own folders)
  folder.addFile(csv);
  // the recipient needs to be a string email address
  MailApp.sendEmail("[email protected]", "New data", "A new csv for '"+sheetO.getName()+"' has been created and is in the folder "+folder.getUrl());
  // MailApp.sendEmail(recipient, subject, body, options) - More info https://developers.google.com/apps-script/reference/mail/mail-app
}

… which gets the data and then publishes it as a csv to this folder, emailing a link to the listed recipients. As the folder has been shared we can get access via Google Drive hosting with the link https://googledrive.com/host/0B6GkLMU9sHmLbThITlNvb2dzREE/top-posts%2020130701.csv

public folder

The template includes several example scriptlets for you to play with and modify. For those who are interested, the slides for my IWMW13 session are on Slideshare. The session was designed to introduce Google Apps Script followed by the query exporter template. I only made this yesterday so it might be a bit alpha. Feel free to leave issues and suggestions in the comments. I’m sure there are a number of clarifications required.

A couple of days ago Mat Morrison posted:

I've been messing around with Excel; trying to create a "punchcard chart" that will let me visualise data by hour and day at the same time. This is where I've got to so far, using nothing more than COUNTIFS and SUMIFS.

It works, more or less, although there are some bits to tidy up; notably that y-axis, which should really read Sunday through Saturday or the like.

Here's a link to the Excel workbook if you want to try this out: https://docs.google.com/file/d/0B26bhH2SxecqcDN4c3JvYTlfYTA/edit?usp=sharing

Punchcard example in Excel by Mat Morrison

Knowing that Google Sheets/Spreadsheets permits the QUERY formula, which allows SQL-like data manipulation, I thought I’d give it a try. The result is mixed. As I thought, reshaping the data was relatively straightforward using a query along the lines of:

Query data in google sheets

 =QUERY(raw!A:A,"SELECT COUNT(A) WHERE HOUR(A) = "&C2&" AND DAYOFWEEK(A) = "&D2&" LABEL COUNT(A) ''")

where raw!A:A is some datetime data (in this case imported from TAGS). Here's the source spreadsheet if you want a closer look at the formula.

So far so good. Next to graph the data. This is where the trouble starts. First here’s the result (click for the interactive version):

punchcard - bubble plot in google sheets

For reference: I've had problems embedding interactive Google charts in blog posts. On self-hosted WordPress sites the data source url gets mangled by the editor. My current fix is to use the Raw HTML plugin, which preserves the url. It's also worth considering that interactive charts may not appear in RSS feed readers. You can try to get around this with the noscript tag (see this gist for example) [Also Cloudflare users need to note this]

Looking at the bubble chart option it says:

The first column in the table should be text, and represents the label of that bubble. The numbers in the second column are plotted on the x axis. The numbers in the third column are plotted on the y axis. The optional fourth column should be text, and determines the bubble colour. The optional fifth column is numeric, and determines the size of the bubble.

Okay … so in the first image you’ll see I’ve had to start spoofing columns. In B:B I’ve had to create a column of unique labels. In E:E I needed to create a dataset for results where the count is zero. This is because the bubble chart will render bubbles even if the size is zero. To get around this I’ve removed the fill on zero values. Another aspect I’m not entirely convinced by is the bubble scaling, but I’ll let you decide.

The really worrying aspect of all this is, given the recent announcement of the Deprecation of Gadgets in Google Spreadsheets, which allowed you to embed your own charts in Google Sheets (here’s an example gadget I wrote to include a d3.js force layout graph), come July/August if I want to visualise data in a spreadsheet all I’m going to be left with is Google’s crappy charts … sigh :(


One of the nice things about open courses like ocTEL is that whilst having your own blogging space wasn’t a mandatory requirement, we were aware that a number of participants had set one up anyway. Conscious that activities like blogging can be very lonely, and also aware that this is often a critical moment in motivating people to engage in reflective writing, we’ve been encouraging tutors to show these blogs some extra love, which seems to have gone down well.

This raises the question: given blogging is optional and we haven’t asked people to self-declare if they are new to blogging, how do we identify this community? With this question in mind I quickly (the initial pass took me about 5 minutes, thanks in part to earlier work in this area) put together a spreadsheet which took the participant blog list and got feed counts.

post table

How it was made

If you look at Sheet 1 you’ll see there is a static list of Feed urls (Column B). This was obtained by using the Chrome Screen Scrape extension. The reason it is static is I wanted a column where tutors could make comments and a dynamic import would screw this up. Looking at Sheet 2 cell A1 you can see how to do a dynamic list using the importXML function (see this post for more examples of spreadsheet import options).

To get the post counts and dates some Google Apps Script was required to write a custom function (custom function, you ask? Where have you been? Read my blog). After that it was a bit of conditional formatting.
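
I haven't reproduced the script here, but as a minimal sketch (assuming each cell holds an RSS 2.0 or Atom feed url) a post-counting custom function might look something like:

// minimal sketch (not the original code) of a custom function returning
// the number of items in an RSS/Atom feed - used in a cell as =feedCount(B2)
function feedCount(url) {
  var xml = UrlFetchApp.fetch(url).getContentText();
  var root = XmlService.parse(xml).getRootElement();
  if (root.getName() == "feed") { // Atom feeds use <entry> elements
    var atom = XmlService.getNamespace("http://www.w3.org/2005/Atom");
    return root.getChildren("entry", atom).length;
  }
  return root.getChild("channel").getChildren("item").length; // RSS 2.0
}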

What’s missing

It would be useful to have a dynamic list of feeds which could retain comment cells (ideas on how to do this are very welcome; my thought was to use a cron job to read the sheet, fetch more feeds, then output the result).

Another metric it would be useful to add is a comment count, so 1 post with 3 comments would be identified as a lower priority than 1 post with 0 comments.

So do you think this spreadsheet is potentially a useful tool in the open course toolbox? How would you improve it?

Sankey Diagram

“Sankey diagrams are a specific type of flow diagram, in which the width of the arrows is shown proportionally to the flow quantity.” Wikipedia.

I first came across Sankey diagrams via (one of the OKFN School of Data’s latest contributors ;) Tony Hirst in d3.js Powered Sankey Diagram. Subsequently Bruce McPherson showed how to create Sankey diagrams from Excel, also using d3.js.

Having collected some survey data for Analytics and Institutional Capabilities at #cetis13 (here’s a copy of the form) we were looking for a way to take us beyond the Google Form reporting and gain extra insight. In particular I was interested in trying to see if there were any relationships between the multiple choice questions. Using a Sankey diagram seemed like a possible solution, and my colleague David Sherlock quickly came up with a modification of Mike Bostock’s Sankey example to accept a csv input (I’ll link if/when it gets written up).

Seeing this I thought it might be useful to make a wrapper to generate Sankey diagrams for data stored in a Google Sheet. The solution was relatively straightforward, using the Google Visualisation API to get the data in the right shape for d3.js. An example of the prototype is here.

There’s no interface yet for you to select a spreadsheet, sheets, columns etc. but you can take your Google Spreadsheet, ‘publish to the web’, and then add the following data to the url.

One of the issues with this solution is you might not want to make all your data available. To get around this I’ve written a Google Apps Script that lets you use a custom formula to preformat the data. To see this in action, this Spreadsheet contains an example. The formula is in cell A1 and uses the format =setSankey(datarange, cols, separator), with an example call after the parameter list below:

  • datarange – sheet/cell reference for the source data, e.g. 'Form Responses - Edit'!A2:D (note: it must start with column A)
  • cols – comma separated list of columns used to generate the chart, e.g. "B,C,D"
  • separator {optional} – used to split multi-value cells, defaults to ", ", e.g. " | "
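
Putting those together, an example call on the source data above (the exact ranges and columns will depend on your sheet) would be:

=setSankey('Form Responses - Edit'!A2:D, "B,C,D", " | ")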

To use this in your own spreadsheets open Tools > Script editor and copy in the code from this gist. Here’s an example url using pre-processed data. The main difference is the addition of &output=1 to the querystring.

Obviously creating your own querystrings to render the data isn’t ideal and it would be relatively straightforward to create a UI wrapper similar to the one used in EDGESExplorer, but it’s something I’m reluctant to do unless there is enough demand. The other consideration is the question: does the Sankey diagram provide useful insight for this type of data, or is it just more ‘damn lies’?

It would have of course been nice to write a Google Gadget to include this in a Spreadsheet … but Google are discontinuing those :(.

As part of LAK13 I’ve already written a series of blog posts highlighting a couple of ways to extract data from the Canvas VLE. Prompted by a question from On and my colleague Sheila MacNeill, I wanted to show you a way of getting feed data into a spreadsheet without using any code. The solution is to use Yahoo Pipes but, as this post will highlight, this isn’t entirely straightforward and you need to be aware of several tricks to get the job done. As LAK13 isn’t using Google Groups, for this post I’ll be using the Learning Analytics Google Group as a data source.

Sniffing for data

First we need to find a data source. Looking for an auto-detected RSS/Atom feed by visiting the group homepage reveals nothing. [I always forget browsers seem to be moving away from telling you when they detect a feed. To get around this I use the Chrome RSS Subscription Extension, which indicates with the orange RSS icon when a page has a feed.]

Browser with no feed detected Browser with feed detected

Looking for an official Google Groups API as an alternative method turns up this open issue from August 2007 for a Groups API, aka there's no API :( Digging deeper we find Groups did have data feeds in their old interface, so with a bit of url magic I can land on the old Groups interface for Learning Analytics, which gives us the orange light:

Google Groups Old Interface with Feeds

Requesting the View all available feeds page we get some additional feed options:

Google Groups View all available feeds

At this point I could grab the Atom links and with a bit of tweaking process them with my existing Google Apps Script code, but let’s look at a ‘no code’ solution.

Feeding Google Sheets with Yahoo Pipes

At this point it’s worth reminding you that you could use the importFeed formula in a Google Spreadsheet, which would import the data from a Google Group. The issue however is it’s limited to the last 20 items, so we need a better way of feeding the sheet.
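
For reference, such a call might look like the line below (the feed url follows the old-interface pattern uncovered above); whatever you put in the last argument you won't get more than the last 20 items:

=importFeed("https://groups.google.com/group/learninganalytics/feed/rss_v2_0_msgs.xml", "items title", TRUE, 20)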

A great tool for manipulating rss/atom (and other data) feeds is Yahoo Pipes. Pipes gives you a drag and drop programming environment where you can use blocks to perform operations and wire the outputs together. I learned most of my pipework from the Pipemaster, Tony Hirst, and if you are looking for a starting point this is a good one.

Yahoo Pipes - Edit interface

Here I’ve created a pipe that takes a Google Group shortname, does some minor manipulation, which I’ll explain later, and outputs a result. When we run the pipe we get some export options:

Yahoo Pipe - Run pipe

The one I’m looking for is .csv because it'll easily import into Google Sheets, but it’s not there … Just as we had to know the old Google Group interface has RSS feeds, with Yahoo Pipes we have to know the csv trick. Here’s the url for ‘Get as JSON’:

http://pipes.yahoo.com/pipes/pipe.run?_id=14ffe600e0c0a9315007b922e41be8ad&_render=json&group=learninganalytics

and if we swap &_render=json for &_render=csv, by magic we have a csv version of the output. (Whilst we are here, also notice the group name used when the pipe is run is also in the url. This means if we know the group shortname we don’t need to enter a name and ‘Run pipe’; we can build the url to get the csv.)

Now in a Google Spreadsheet if you enter the formula =importData("http://pipes.yahoo.com/pipes/pipe.run?_id=14ffe600e0c0a9315007b922e41be8ad&_render=csv&group=learninganalytics") we get the group's last 100 messages in a spreadsheet.

Extra tricks

There were a couple of extra tricks I skipped that are worth highlighting. RSS/Atom feeds permit multilevel data, so an element like ‘author’ can have sub-elements like ‘name’ and ‘email’. CSVs on the other hand are 2D: rows and columns.

Illustration of nested structure of atom feed

When Yahoo Pipes generates a csv file it ignores sub-elements, so in this case it’ll generate an author column but won’t include name or email. To get around this we need to pull the data we want up into the first level (doing something similar for content).

Rename block

The next little trick is to get the feed dates in a format Google Spreadsheets recognises as a date rather than a string. In the feed, dates are in ISO 8601 format, e.g. 2013-03-05T13:13:06Z. By removing the ‘T’ and ‘Z’ Google Spreadsheets will automatically parse them as a date. To do this we use a Regex block to look for T or Z (T|Z), replacing with a single space (which is why the ‘with’ box looks empty).

Regex block

I’ve wrapped the data in a modified version of the dashboard used for the Canvas data feed.

*** Google Groups Activity Dashboard ***

Google Groups Activity Dashboard

Limitations

A couple of big limitations to be aware of:

How long will Google Group data feeds last

Given we’ve had to dig out the data feeds from the old Google Group interface my suspicion is once this is shut off for good the feeds will also disappear. Who knows, Google may actually have a Group API by then ;s

Limited to last 100 messages

The eagle-eyed amongst you will have spotted I was able to increase the number of messages returned to 100 by adding num=100 to the feed query. This is the limit though and you can’t use paging to get older results. There are a couple of ways you could store the feed results, like using the FeedWordPress plugin. I’m experimenting with using IFTTT (IF I do THAT on {insert social network/rss feed/other} THEN add a row to a Google Spreadsheet), but as the data isn’t stored nicely (particularly dates, which are saved like ‘February 15, 2013 at 01:48PM’ *sigh*) it makes it harder to use.

I think that’s me in terms of ways for extracting discussion boards data … for now. One other technique related to Google Spreadsheets is screen scraping using importXML. To see this in action you can read Dashboarding activity on public JISCMail lists using Google Sheets (Spreadsheets).

In my last post on Canvas Network Discussion Activity Data I mentioned I was a little disappointed not to be able to use social network analysis (SNA) modelling techniques on the Canvas Network discussion boards. My main barrier was accessing the data via the Canvas API using my preferred toolset. Fortunately Brian Whitmer at instructure.com (the company behind Canvas) posted a comment highlighting that as a student on the course it was easy for me to get access to this data using a token generated on my Canvas profile page. With this new information, in this post I’ll cover three main areas:

  • a very quick introduction into techniques/opportunities for analysing threaded networks using SNA;
  • how I retrieved data from the Canvas platform for the #LAK13 discussions; and finally
  • some analysis using the NodeXL add-on for Microsoft Excel (Windows).

On Friday 1st March at 4pm GMT I’ll also be having a live Hangout on Air with Marc Smith, one of the original creators and continued project coordinator of NodeXL. The live feed will be embedded here and you can ask questions via Twitter or Google+ using the combined hashtags #lak13 and #nodexl, e.g. ‘What’s the best place to find out more about NodeXL? #lak13 #nodexl’. For the hangout session we’ll look at how easy it is to use NodeXL to analyse a Twitter hashtag community in a couple of clicks. [The rest of this post introduces a more advanced use of NodeXL, so if I lose you fear not as the session will be a lot easier going]

Opportunities for analysing threaded networks using SNA

Hello graph

WiredUK friend/follower graph

One of the dangers of SNA is people see images like the one to the right and miss the point of using this modelling technique (the image is from one of my first goes at using NodeXL so I’d be the first to admit there’s room for improvement). If you do, there are a couple of things to bear in mind. Often these images are exhaust, generated as a snapshot of a visual and/or exploratory analytic. A certain level of literacy is required to understand the underlying structure of the graph. Taking this last point, a threshold concept for me was understanding that these graphs are more often than not constructed from an edge list, which is essentially two columns of data representing a start point and an end point between nodes. For example, all the names in columns Vertex 1 and Vertex 2 are nodes and each row represents an edge between them, so in this case Beth –> Adam generates the highlighted edge.

Basic edge list and force layout

There is a lot more to understand about these types of graphs, but this basic concept means I know if I have any relationship data it’s easy to graph and explore.
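
For example, a minimal edge list might look like the fragment below (only the Beth –> Adam edge comes from the example image; the other rows are made up):

Vertex 1    Vertex 2
Beth        Adam
Adam        Beth
Carol       Adam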

Hello SNAPP

At this point it’s worth mentioning the browser plugin SNAPP.

The Social Networks Adapting Pedagogical Practice (SNAPP) tool performs real-time social network analysis and visualization of discussion forum activity within popular commercial and open source Learning Management Systems (LMS). SNAPP essentially serves as a diagnostic instrument, allowing teaching staff to evaluate student behavioural patterns against learning activity design objectives and intervene as required in a timely manner.

Valuable interaction data is stored within a discussion forum but from the default threaded display of messages it is difficult to determine the level and direction of activity between participants. SNAPP infers relationship ties from the post-reply data and renders a social network diagram below the forum thread. The social network visualization can be filtered based upon user activity and social network data can be exported for further analysis in NetDraw. SNAPP integrates seamlessly with a variety of Learning Management Systems (Blackboard, Moodle and Desire2Learn) and must be triggered while a forum thread is displayed in a Web browser.

The social network diagrams can be used to identify:

  1. isolated students
  2. facilitator-centric network patterns where a tutor or academic is central to the network with little interaction occurring between student participants
  3. group malfunction
  4. users that bridge smaller clustered networks and serve as information brokers

The paper referencing SNA research supporting these areas was presented at LAK11 (if you don’t have access it’s also available in the presentation’s slidedeck). The paper Visualizing Threaded Conversation Networks: Mining Message Boards and Email Lists for Actionable Insights (Hansen, Shneiderman & Smith, 2010) also highlights simple ways to identify question people, answer people and discussion starters, which are all potentially very useful within courses for identifying network clusters individuals might want to join/follow.

Retrieving data from Canvas

Hopefully with that quick intro you can see there might be some value in using SNA for threaded discussion analysis. Reading the SNAPP overview hopefully you spotted that it currently doesn’t support extracting data from Canvas discussion boards. This is an opportunity to understand some of the analysis SNAPP is doing behind the scenes.

Hello Google Apps Script

If you have been following my posts you’ll see that I favour using Google Apps Script as a lightweight tool for extracting data. Thanks to Brian (Instructure) I’ve got a way to access the Discussion Topics API. Looking at the API documents I decided the best way to proceed was to get all of the LAK13 discussion topics (top level information) and use this to get the full topic data. If you speak JSON we are essentially turning this:

Data as rendered in Canvas

into this:

Data in JSON

finally getting this (web version here):

Data in spreadsheet

The code to do this is available here. I’m going to spare you the details of the code, but here are the instructions if you’d like to export data from other Canvas hosted discussion boards (a minimal sketch of the underlying API call follows the steps). If you’re not interested in that you can just jump to the next section.

Generating an edge list (extracting data) from Canvas to Google Sheets

  1. Create a new Google Spreadsheet and then in Tool > Script editor copy in the code from here
  2. If you are not pulling data from LAK13 you need to edit values in lines 2-4. If you visit your course homepage hopefully you can decode the url pattern based on the example for LAK13 https://learn.canvas.net/courses/33 (I should also point out you need to be enrolled on the course to receive data. Also read Canvas API Policy)
  3. Next you need an access token which is generated from your Canvas Profile Settings page. Scroll down to the bottom and click New Access Token, filling in a purpose and leaving expires blank. Make a copy of the token as it’s needed for the next step (I added a copy to a .txt file just in case the next step didn’t work)
  4. Back in the Script Editor in Google Spreadsheets click File > Project Properties. In the ‘Project properties’ tab click  ‘+ Add row’ and replace (name) with access_token and (value) with the token you got from Canvas before clicking Save
    Project Properties
  5. Make sure everything is saved in the Script Editor and then Run > getCanvasDiscussionEdges, wait for the script to finish and on Sheet1 you should have a bunch of data to play with.
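
For orientation, here is a minimal sketch (not the full script linked above) of the kind of request the script makes. It assumes the access_token project property from step 4 and uses the LAK13 course id from step 2:

function getCanvasDiscussionTopics() {
  // read the token stored under File > Project Properties (step 4)
  var token = ScriptProperties.getProperty("access_token");
  // Canvas Discussion Topics API for the LAK13 course (step 2)
  var url = "https://learn.canvas.net/api/v1/courses/33/discussion_topics?access_token="+token;
  var topics = JSON.parse(UrlFetchApp.fetch(url).getContentText());
  for (var i = 0; i < topics.length; i++) {
    // each topic id can then be used to fetch the full topic data including replies
    Logger.log(topics[i].id + ": " + topics[i].title);
  }
  return topics;
}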

Using NodeXL to analyse Canvas Discussions

There are a number of different questions we could ask of the LAK13 data. The particular one I want to look at is who are the core community members stimulating/facilitating discussion (e.g. applying a connectivist theory, who are the people you might want to connect with). To do this we need to take the following steps (I’m assuming you’ve already installed NodeXL):

  1. Download the data extracted to the Google Spreadsheet (File > Download as > Microsoft Excel). [If you just want the data I’ve extracted here’s the download link – the data is automatically refreshed nightly]
  2. Open the download file in Excel and in the created_at column select all and Format Cells as General (I needed to do this because NodeXL was misformatting dates on import)
  3. Start a new NodeXL Template (I use the NodeXL Excel Template option from my windows Start menu)
  4. From the NodeXL ribbon you want to Import > From Open Workbook
    Import > From Open Workbook
  5. In the import dialog vert1_name and vert2_name are edges; anything else prefixed with ‘vert’ is assigned to the corresponding Vertex n property column, and everything else is a Vertex 1 property:
    import dialog
  6. Once imported you can open the Edges sheet, select the created_at column and use Format Cells to reassign a date/time format.
  7. In Prepare Data choose ‘Count and merge duplicate edges’ and select Count and Vertex 1 and Vertex 2
    Count and merge duplicate edges
  8. In the Graph section of the NodeXL ribbon we want to make this a directed graph (replies are directed) and choose a layout algorithm (I usually go Harel-Koren Fast Multiscale)
    make this a directed graph
  9. Next we want to prepare the data we want to analyse. In the Autofill Columns (within Visual Properties portion of the ribbon) set Edge Visibility to ‘topic_id’ and in Edge Visibility Options set ‘If the source column number is: Not equal to 558’ Show otherwise Skip (this will skip edges that are responses to the Pre-course discussion forum – I’ll let you question this decision in the comments/forum)
    Edge Visibility Options
  10. Click Ok then Autofill
  11. Next open the Vertices sheet and select all the rows (Ctrl+A) and from the Visibility option select ‘Show if in an Edge’
    Show if in an Edge
  12. Now find the row in the Vertices sheet for George Siemens and set its Visibility to Skip (doing this we’re creating a ‘what if George wasn’t there’ scenario)
    ‘what if George wasn’t there’ scenario
  13. Open the Graph Metrics window and add Vertex in-degree, Vertex out-degree, Vertex betweenness and closeness centrality, and Top items (in the Top items options you’ll need to add these as the metrics you want top 10s for), finally clicking Calculate metrics.
    Graph Metrics window

At this point you could use the calculated metrics to weight nodes in a graph, but for now I’m going to skip that. You should now have a Top Items sheet with some useful information. In the Betweenness Centrality list you should have these names:

  • Martin Hawksey
  • Simon Knight
  • Alex Perrier
  • Khaldoon Dhou
  • Rosa Estriégana Valdehita
  • Maha Al-Freih
  • Suzanne Shaffer
  • Maxim Skryabin
  • Bryan Braul
  • Peter Robertso

Excluding the pre-course discussion forum and George Siemens, the discussions these people have engaged with provide the shortest paths to other people engaging in discussions on the Canvas site. Strategically these are potentially useful people within the network that you might want to follow, question or engage with.

Getting to this point obviously hasn’t been straightforward, and had SNAPP been available in this instance it would have turned this into a far shorter post. Programmatically, using tools like R we could have arrived at the same answer with a couple of lines of code (that might be my challenge for next week ;). What it has hopefully illustrated is that if you have data in an edge format (two column relationships), tools like NodeXL make it possible for you to use SNA modelling techniques to gain insight. (I’m sure it also illustrates that data wrangling isn’t always straightforward, but guess what, that’s life.)

NodeGL for LAK13 discussions

So at this point some of you might be wondering what the LAK13 reply network looks like. I could give you a flat image, but why don’t you use my NodeXL online graph viewer to explore it yourself or download the data from the NodeXL Graph Gallery.