
You may have noticed I missed an analysis of week 5 of CFHE12, but hopefully I’ll capture what I wanted to say and more in this post. Here I want to pull together a couple of ideas around some of the measurable user activity generated as part of CFHE12. This will mainly focus on Twitter and blogs and will ignore other channels like the course discussion forum, webinar chat or other spaces participants might have discovered or created. I conclude that there are some simple opportunities to incorporate data from Twitter into other channels, such as summaries of questions and retweets.

Twitter Activity Overview

The headline figures for tweets matching the search ‘#CFHE12 OR edfuture.net OR edfuture.mooc.ca OR edfuture.desire2learn.com’  for 8th October to 18th November 2012 (GMT):

  • 1,914 Tweets from 489 different Twitter accounts
  • 1,066 links shared
  • 472 retweets
  • 10% (n=206) of the tweets were in @reply to another Twitter account
  • Contributions from accounts in 45 countries (top 5: United States of America – 166; Canada – 50; United Kingdom – 43; Australia – 28; Germany – 12) [1]

Map of #cfhe12 participants

Looking at the week-by-week distribution of contributors and contributions, it can be seen that after the first two weeks the number of tweets posted each week remained consistent at around 200. 75% of the Twitter accounts (n=374) contributed tweets in only one week of CFHE12.

Twitter contributors and tweets Number of weeks participants contributed in

Comparing Twitter activity with blog posts aggregated by gRSSHopper doesn’t reveal any correlation, but it’s interesting to note that whilst the volume of tweets remains relatively consistent for weeks 3 to 6 there is a significant drop in blog posts between weeks 4 and 5.

CFHE12 number of tweets and blogs

Looking at the distribution of tweets by day and time shows a lull on Thursdays, but a peak around 1900hrs GMT, which would appear to coincide with the usual time slot used for tweetchats.

CFHE12 Tweets by day of week CFHE12 Tweets by time of day

Getting more technical: having collected the friend/follower relationships for Twitter accounts using #CFHE12 for each of the weeks, it is possible to analyse new connections between community members. At the end of week 1 the top 52 contributors were joined by 286 follow relationships. By the end of the course 45 new follow relationships were created, increasing the graph density [0.103774 –> 0.120101597] and reducing the geodesic distance [2.008136 –> 1.925296].
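
As an aside, if you wanted to reproduce this sort of calculation yourself rather than rely on a dedicated SNA tool, a minimal Apps Script-style sketch might look like the one below. This isn’t the code used for the figures above; it assumes follow relationships stored as ‘follower>followed’ strings and uses the standard directed-graph definition of density, E/(N(N-1)).

function graphDensity(numberOfAccounts, followEdges) {
  // density of a directed graph: actual follow relationships divided by the
  // number of possible ordered pairs (no self-follows)
  var possibleEdges = numberOfAccounts * (numberOfAccounts - 1);
  return followEdges.length / possibleEdges;
}

function newFollows(earlierEdges, laterEdges) {
  // edges are assumed to be encoded as "follower>followed" strings; returns the
  // relationships present in the later snapshot but not the earlier one
  var seen = {}, added = [], i;
  for (i = 0; i < earlierEdges.length; i++) { seen[earlierEdges[i]] = true; }
  for (i = 0; i < laterEdges.length; i++) {
    if (!seen[laterEdges[i]]) { added.push(laterEdges[i]); }
  }
  return added;
}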

The graph below highlights the new relationships (bold lines). Nodes are sized by the number of new connections (as part of the archive the friend/follower count is captured with each tweet, so it may be possible to do further analysis). It’s interesting to note that BarnetteAndrew is an isolated node in G7.

cfhe12 interconnection growth

Refining the signals from the Twitter feed

Adam Cooper (CETIS) has recently posted some tips from a presentation by John Campbell on the development of Signals at Purdue, which includes using a spreadsheet as a starting point as a way to find out what you need. I’ve already got some basic tools to overview a Twitter archive but used the CFHE12 data to experiment with some more.

By week

Adding to the existing Twitter Activity sparklines, it’s been possible to extract a summary of week-by-week activity (a basic traffic light). Whilst this in a way duplicates the data rendered in the sparkline, it has been useful for filtering the participants with queries like ‘who has contributed in week 1’ and ‘who has contributed in all the weeks’. If you wanted to take this to the next level you’d combine it with the community friendship graph and pay extra attention to the activity of your sub-community bridges (for more info on this see Visualizing Threaded Conversation Networks: Mining Message Boards and Email Lists for Actionable Insights).

CFHE12 Activity data

Conversation matrix

Graphing the conversations (@reply, @mentions and retweets) using TAGSExplorer gives this ball of mess.

CFHE12 TAGSExplorer View

Trying to find a more structured way of presenting the data, I’ve experimented with an adjacency matrix (shown below, or interactive version here)[2]. Each cell is colour coded to indicate the number of interactions (replies, mentions and retweets) between users. For example, we can see that gsiemens has had the most interactions with barrydahl. Scanning along the rows gives you a sense of whether a person was interacting with a large number or a select few other accounts. For example, pgsimoes has interactions mostly with suifaijohnmak. Filtering tweets for pgsimoes (possible from the link in the left column), it looks like it’s an automatic syndication of suifaijohnmak’s blog posts.

CFHE12 conversation adjacency matrix
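
Note [2] below describes how this particular matrix was built (a TAGSExplorer export taken through NodeXL and back into a Google Spreadsheet). If you wanted to skip the NodeXL step, a rough sketch that tallies interaction counts straight from archive rows might look like this; the [from_user, text] row layout and the crude @name matching are my assumptions rather than the TAGS schema:

function interactionMatrix(users, rows) {
  // users: array of screen names to include; rows: array of [from_user, text]
  // returns matrix[a][b] = times users[a] @mentioned/replied to/retweeted users[b]
  var index = {}, matrix = [], i, j;
  for (i = 0; i < users.length; i++) {
    index[users[i].toLowerCase()] = i;
    matrix[i] = [];
    for (j = 0; j < users.length; j++) { matrix[i][j] = 0; }
  }
  for (i = 0; i < rows.length; i++) {
    var from = index[String(rows[i][0]).toLowerCase()];
    if (from === undefined) { continue; } // tweeter not in the user list
    var mentions = String(rows[i][1]).match(/@\w+/g) || [];
    for (j = 0; j < mentions.length; j++) {
      var to = index[mentions[j].substring(1).toLowerCase()];
      if (to !== undefined && to !== from) { matrix[from][to]++; }
    }
  }
  return matrix;
}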

Do you have any questions?

CFHE12 Any Questions?

At the beginning of CFHE12 I posted Any Questions? Filtering a Twitter hashtag community for questions and responses. This is a crude tool which filters tweets containing ‘?’, which might indicate they are a question. By counting the number of tweets in the archive which reply to a question you get the following breakdown:

  • Total questions 309
  • Questions with replies 44

An important point to note is that as only tweets that meet the search criteria are archived there may be more responses ‘off tag’. The danger in a medium like Twitter used in courses like CFHE12 is that questions may go unanswered, misconceptions are not corrected, feedback is never given.

Part of the issue in the case of CFHE12 is that participants are free to use tools like Twitter as they please. While this suits ‘visitors and residents’, some additional structure may be beneficial. Two simple approaches would be to direct participants to include the course tag in their replies, and to highlight current questions, either by using the ‘Any Questions’ tool or, in the case of courses using gRSSHopper, by filtering the search for ‘CFHE12 AND ?’ rather than including all the latest course tweets in the daily email alert.
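
By way of illustration, here’s a minimal sketch of the kind of question filter I mean. It isn’t the code behind the Any Questions tool, and it assumes archive rows of [id_str, from_user, text, in_reply_to_status_id], which is only roughly what a TAGS archive holds:

function questionSummary(rows) {
  // crude question detection: any archived tweet containing a '?'
  var questions = {}, answered = {}, totalQuestions = 0, questionsWithReplies = 0, i, id;
  for (i = 0; i < rows.length; i++) {
    if (String(rows[i][2]).indexOf("?") > -1) {
      questions[rows[i][0]] = true; // keyed by tweet id
    }
  }
  // a question counts as answered if another archived tweet replies to its id
  for (i = 0; i < rows.length; i++) {
    var replyTo = rows[i][3];
    if (replyTo && questions[replyTo]) { answered[replyTo] = true; }
  }
  for (id in questions) { totalQuestions++; }
  for (id in answered) { questionsWithReplies++; }
  return { totalQuestions: totalQuestions, questionsWithReplies: questionsWithReplies };
}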

Retweet-me

For a while I’ve had a basic function that extracted the tweets with the most retweets, but <sigh> I think it’s on the list of developments I’ve never got around to blogging about. The routine is a relatively simple bean counter that goes through a list of tweets, removes any hyperlinks, crudely shortens the text (keeping roughly the first 90%, to account for any annotations) and counts the number of matches before returning a list of the top 12. Slicing the data for each week I get these tables. There is probably more analysis required of what is being retweeted before making a decision about how this data could be used. My main question for the #cfhe12 Twitter community is: do they have a sense of what is being retweeted the most? The other angle is pushing some of this data into other communication channels like the Daily Newsletter or discussion forums.
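
For the curious, a rough reconstruction of that bean counter (a sketch of the approach rather than the actual routine, and happy to double-count duplicates of the same popular tweet) would be something like:

function topRetweets(texts, howMany) {
  // texts: array of tweet texts from the archive
  var cleaned = [], counts = [], i, j;
  for (i = 0; i < texts.length; i++) {
    var t = String(texts[i])
      .replace(/https?:\/\/\S+/g, "")   // remove hyperlinks
      .replace(/^RT @\w+:?\s*/i, "")    // remove a leading "RT @someone:"
      .replace(/\s+/g, " ");
    // crudely keep the first 90% of the text to allow for trailing annotations
    cleaned.push(t.substring(0, Math.floor(t.length * 0.9)));
  }
  for (i = 0; i < cleaned.length; i++) {
    var count = 0;
    for (j = 0; j < cleaned.length; j++) {
      if (cleaned[i] && cleaned[j].indexOf(cleaned[i]) > -1) { count++; }
    }
    counts.push([count, texts[i]]);
  }
  counts.sort(function (a, b) { return b[0] - a[0]; }); // most matched first
  return counts.slice(0, howMany || 12);
}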

Summary

So hopefully the summary data I extracted and the experimentation with new data views have been useful. Something that has been reinforced in my own mind is that more value could easily be gained from Twitter by either providing guidance on its use and/or incorporating data from Twitter into other channels more effectively (rather than dumping everything into a daily email, select some key data). Now that I’ve got a template which splits some of the data into weekly slices it should be easier to deploy and to pass data into other systems by changing some of the dates on the summary sheet.

But what do you think? Are there any particular data views you found useful? If you’ve participated in a community that uses Twitter a lot what additional tools would you find useful to keep track of what is going on?

Get the Data

Notes

[1] Countries were reconciled by extracting the location recorded in each Twitter account profile and generating geo-coordinates using the recipe here. Locations were extracted for 404 accounts. Co-ordinates were uploaded to GeoCommons and analysed with a boundary aggregation to produce this dataset.

[2] The matrix was generated by exporting conversation data from TAGSExplorer by adding the query &output=true to the url (e.g. like this), importing into NodeXL and then filtering vertices based on a list of top contributors. This was exported as a Matrix Workbook and imported into the Google Spreadsheet. Conditional formatting was used to heatmap the cells.


I started writing this last week so the intro doesn’t make sense. Slides from the presentation I did are here (all pictures, so they probably also make very little sense).

To paraphrase Stephen Heppell (I’ve misquoted this before):

Content is king, but community is sovereign

The UKOER Programme is having its final meeting next week and, while the final set of projects comes to a close, a strong community has formed which I’m sure will continue. Something I was interested in doing is looking at how the community has evolved over time. I’ve previously looked at data around the #ukoer hashtag, but not everyone uses Twitter so I thought I’d look for another data source. As email is still a strong component in most people’s everyday lives I started poking around the OER-DISCUSS JISCMail (Listserv) list:

A public list for discussion about the release, use, remix and discovery of Open Educational Resources (OER). Managed jointly by OU SCORE Project, and the JISC / HE Academy OER Programme.

As far as I could see there are limited options for getting data out of JISCMail (some limited RSS/Atom feeds), so cue the music for a good old fashioned scrape and refine. Whilst I’ll walk you through this for OER-DISCUSS, the same recipe can be used for other public lists.

Source survey

Instead of going straight into the recipe I wanted to record some of the initial thought processes in tackling the problem. This usually begins with looking at what you’ve got to work with. Starting with the list homepage I can see some RSS/Atom feeds that don’t take me far, so instead I turn my attention to the list of links for each month’s archives. Clicking through to one of these and poking around the HTML source (I mainly use Chrome, so a right click in the page gives the option to Inspect Element) I can see that the page uses a table template structure to render the results – good. Next I checked if the page would render even when I was logged out of JISCMail, which it does – good. So far so good.

page source

Next a step back. This looks scrapable, so has anyone done this before? A look on Scraperwiki turns up nothing on Listserv or JISCMail, so next a general Google search. Searching for terms like ‘listserv data scrape’ is problematic because there are lots of listserv lists about data scraping in general. So we push on. We’ve got a page with links to each month’s archives and we know each archive uses a table to lay out results. Next it’s time to start thinking about how we get data out of the tables. Back in the Chrome Element Inspector we can see that the source contains a lot of additional markup for each table cell and in places cells contain tables within them. At this point I’m thinking OpenRefine (née Google Refine).

Scraping list of archive links

A feature of OpenRefine I use a lot is fetching data from a url. To do this we need a list of urls to hit. Back on the list homepage I start looking at how to get the links for each month’s archive. Hovering over the links I can see they use a standard sequence with a 2-digit year {yy} and month {mm}:

/cgi-bin/webadmin?A1=ind{yy}{mm}&L=OER-DISCUSS

I could easily generate these in a spreadsheet but I’m lazy, so I just point a Chrome extension I use called Scraper at the part of the page I want and import it to a Google Spreadsheet.

Scraper

[Another way of doing this is to create a Google Spreadsheet and enter the formula =ImportXml("https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=OER-DISCUSS","//tr[2]/td/ul/li/a/@href")]

Fetching and Refining the Data

Import project

Finally we can fire up OpenRefine. You could create a project using the Google Data option, which is used to import data from your Google Docs, but as it’s not a huge amount of data I use the Clipboard option instead. At this point the preview will probably separate the data using ‘/’ and use the first row as a column heading, so you’ll want to switch the separator to comma and de-select the ‘Parse next … line(s) as column headers’ option.

  1. Next we want to fetch each month’s archive page: use the Column 1 dropdown to Edit column > Add column by fetching url with the GREL expression "https://www.jiscmail.ac.uk"+value and the column name month_raw
  2. This pulls in each month’s archive page as raw html. Now we want to parse out each row of data in a new column: from the month_raw dropdown select Edit column > Add column based on this column using the GREL expression forEach(value.parseHtml().select("table.tableframe")[1].select("tr"),v,v).join(";;;") with the column name rows_raw – this selects the second table with class ‘tableframe’ and joins each row with a ‘;;;’
  3. Next from the rows_raw column use Edit cells > Split multi-valued cells using ;;; as the separator
  4. Again from the rows_raw column dropdown select Edit column > Add column based on this column using the GREL expression forEach(value.parseHtml().select("td"),v,v).join(";;;") with the column name rows_parsed – this joins each <td> with a ;;; which will let us split the values into new columns in the next step
  5. Now from the rows_parsed column select Edit column > Split into several columns using the separator ;;;

Column split

You should now have something similar to the above with columns and rows split out, but still messy with html in the cells. We can clean these up using Edit cells > Transform with variations of value.parseHtml().htmlText()

Here are the steps I used (the complete operation history you can use in Undo/Redo is here – using this, apply all the actions starting from the list of monthly urls):

  1. Text transform on cells in column rows_parsed 4 using expression grel:value.parseHtml().htmlText().replace(" lines","").toNumber()
  2. Rename column rows_parsed 4 to lines
  3. Text transform on cells in column rows_parsed 3 using expression grel:value.parseHtml().htmlText().toDate("EEE, dd MMM y H:m:s")
  4. Rename column rows_parsed 3 to date
  5. Text transform on cells in column rows_parsed 2 using expression grel:value.parseHtml().htmlText().replace(" <[log in to unmask]>","")
  6. Rename column rows_parsed 2 to from
  7. Create column snippet at index 4 based on column rows_parsed 1 using expression grel:value.split("showDesc('")[1].split("','")[0].unescape("html").parseHtml().htmlText()
  8. Create column link at index 4 based on column rows_parsed 1 using expression grel:"http://jiscmail.ac.uk"+value.parseHtml().select("a")[0].htmlAttr("href")
  9. Text transform on cells in column rows_parsed 1 using expression grel:value.parseHtml().htmlText()
  10. Rename column rows_parsed 1 to subject
  11. Create column subject_normal at index 4 based on column subject using expression grel:trim(value.replace(/^Re:|^Fwd:/i,""))

You’ll probably notice some of the rows don’t contain the data we need. An easy way to remove these is to use a timeline facet on the date column, selecting non-time, blank and error, and then from the All column dropdown menu select Edit rows > Remove all matching rows.

Tony Hirst has a great post on faceting tricks. Something not covered is clustering data using facets. We use this as a way to join rows where authors have multiple logins, e.g. Tony Hirst and Tony.Hirst.

To do this add a text facet to the author/from column and click Cluster:

Facet view

Finding the right settings is a bit of trial and error combined with a bit of knowledge about your dataset. In this example I saw that there were rows for Pat Lockley and Patrick Lockley, so I tried some settings until I got a hit (in this case using nearest neighbour and PPM – which by all accounts is the last resort). You might also need to run clustering a couple of times to catch most of the variations.

Clustering columns

What can we do with this data?

In Dashboarding activity on public JISCMail lists using Google Sheets (Spreadsheets) I was able to get an indication of the overall activity of the list. Now that author names are clustered I can get a more accurate picture of the top contributors using a text facet on the author column:

Pat Lockley 269, Amber Thomas 122, Phil Barker 82 

I was tempted to mine the individual posts further using the techniques and code posted by Adam Cooper (CETIS), but it didn’t look like I could easily swap the data source. A couple of questions posed by Phil Barker (CETIS) were:

The easiest way I found to get this was to use the Export > Custom tabular exporter (using this settings file), import into Google Sheets (Spreadsheets) and use a couple of formulas to get this summary page (opening the summary page will let you see the formulas I’ve used):

And there is much more you could do, but I’ll leave it there. If nothing else I hope you get an idea of some of the thought processes involved in extracting data. As always if something is unclear get in touch.


I’ve written a very long blog post, which I’ll publish soon, on text-mining public JISCMail (Listserv) lists using OpenRefine. It concludes with displaying list activity, posts over time and individual activity. The technique I used isn’t straightforward, but as the output might be of benefit to other people like Brian Kelly, who reported the Decline in JISCMail Use Across the Web Management Community, I wondered if there was a better way of doing it. Here’s my answer:

*** JISCMail Public List Activity Overview Template ***
[Give it 30 seconds to render the results] 

JISCMail Public List Activity Overview Template

By making a copy of this spreadsheet and entering the url of the homepage of a public JISCMail list like OER-DISCUSS, it goes off and collects each month’s archives for almost the last 3 years, graphs the overall list activity and lets you see individual contributions (a limitation is matching variations in display names, so in the OER-DISCUSS example Pat Lockley and Patrick Lockley get counted separately even though they are the same person).

How it works

On the data sheet, cell A2 uses importXML to grab all the archive links. In cell B2 the importHTML function is used to grab the table of posts on each month’s archive page and a QUERY returns the post author names, the values being turned into a string from an array using JOIN. In cell A53 a UNIQUE list of author names (minus ‘) is generated using a combination of SPLIT and JOIN. This data is then used on the Dashboard sheet (to get the SPARKLINEs I had to write a custom function using Google Apps Script):

function getActivity(user, source) {
  // source is a range from the data sheet where each row holds a single cell
  // containing all the author names for one month joined into one string
  var output = [];
  for (var i in source) {
    var rows = source[i][0];
    // count how many times this user's name appears in the month's string
    var count = rows.match(new RegExp(user, "g"));
    if (count) {
      output.push([count.length]);
    } else {
      if (source[i][0] != "") {
        output.push([0]); // month has data but no posts from this user
      }
    }
  }
  // reverse so the counts run from oldest month to newest for the sparkline
  output.reverse();
  return output;
}
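
If you’re poking around the template, the custom function is called from the Dashboard sheet in something like =SPARKLINE(getActivity($A2, data!$B$2:$B$40)) – the cell references here are illustrative rather than the exact ranges used, so check the Dashboard sheet formulas – with the returned column of per-month counts being what the sparkline draws.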

If you are interested in learning more about the functions used I recently posted Feeding Google Spreadsheets: Exercises in using importHTML, importFeed, importXML, importRange and importData (with some QUERY too). You should be able to use this template with any other public JISCMail list. Any questions get in touch.

Search Archive feature in TAGSExplorer

At IWMW12 I made a searchable/filterable version of TAGS Spreadsheets. This feature lets you use the Google Visualisation API to filter tweets stored in a Google Spreadsheet (more about TAGS). It has been available via a separate web interface for some time but I’ve never got around to publicizing it. As TAGSExplorer also uses the Google Visualisation API to wrap the same data in a different visualisation tool (predominantly d3.js), it made sense to merge the two. So in any existing TAGSExplorer archive (like this one for #jiscel12) you should now also have a button to ‘Search Archive’.

The archive view has some basic text filtering on the tweeted text and on who tweeted the message, as well as a time range filter (drag the handles indicated). The scattered dots indicate when messages were tweeted: the denser the dots, the more tweets were made.

I’ve hastily thrown this together so feedback very welcome.


It’s here folks. The most advanced aggregation and visualisation of tweets for the JISC Innovating e-Learning 2012 online conference taking place next week. Over two years ago I started developing a Google Spreadsheet to archive tweets, and since then not only have I been evolving the code, I’ve also been creating tools which use the spreadsheet as a data source. It’s pleasing to see these tools being used for a wide range of projects, from citizen journalism to a long list of academics, students and community groups, and even TV broadcasters.

I’ve been a little remiss in posting some of the latest developments and I’ll have to cover those soon. For now here’s your #jiscel12 Twitter basecamp.

Overview of features

 #jiscel12 Twitter basecamp

Whilst it probably just looks like another spreadsheet, you should explore:

A. The ability to easily filter archive by person

The ability to easily filter archive by person
[Still need to document]

B. The TAGSExplorer conversation overview

TAGSExplorer conversation overview
[TAGSExplorer: Interactively visualising Twitter conversations archived from a Google Spreadsheet]

C. The entire searchable/filterable archive

entire searchable/filterable archive
[Still need to document]

D. The question and answer filter

question and answer filter
[Any Questions? Filtering a Twitter hashtag community for questions and responses]

Dashboard

[Contains a number of summaries – I find ‘most RTs in last 24hrs’ one of the most useful (how this works also needs documenting)]

Currently these are automatically updating every hour, but I’ll probably crank up the frequency next week. Your thoughts on these are always gratefully received ;)


This week saw me submit my application to the Shuttleworth Foundation to investigate and implement cMOOC architecture solutions. It seems rather fitting that an element of this is included in this week’s CFHE12 exploration. It’s in part inspired by a chance visit from Professor David Nicol to our office on Friday. It didn't take long for the conversation to get around to assessment and David’s latest work on the cognitive processes of feedback, particularly student generated feedback, which sounds like it directly aligns with cMOOCs. It was David’s earlier work on assessment and feedback principles, which I was involved in, that got me thinking about closing the feedback loop. In particular, the cMOOC model promotes participants working in their own spaces; the danger is that in this distributed network participants can potentially become isolated nodes, producing content but not receiving any feedback from the rest of the network.

Currently within gRSSHopper course participants are directed to new blog posts from registered feeds via the Daily Newsletter. Below is a typical entry:

CFHE12 Week 3 Analysis: Exploring the Twitter network through tweets
Martin Hawksey, JISC CETIS MASHe
Taking an ego-centric approach to Twitter contributions to CFHE12 looking at how activity data can be extracted and used [Link] Sun, 28 Oct 2012 15:35:17 +0000 [Comment]

One of the big advantages of blogging is that most platforms provide an easy way for readers to feed back their own views via comments. In my opinion this is slightly complicated when using gRSSHopper as it provides its own commenting facility, the danger being that discussions can get broken (I imagine what gRSSHopper is trying to do is cover the situation when you can’t comment at source).

Even so, commenting activity, either on source posts or within gRSSHopper itself, isn't included in the daily gRSSHopper email. This means it’s difficult for participants to know where the active nodes are: the posts receiving lots of comments, which could be useful for vicarious learning or for making their own contributions. Likewise it might be useful to know where the inactive nodes are, so that moderators can either respond or direct others to comment.

[One of the dangers here is information overload, which is why I think it’s going to start being important to personalise daily summaries, either by profiling or some other recommendation type system. One for another day.]

To get a feel for blog post comment activity I thought I’d have a look at what data is available and possible trends, and provide some notes on how this data could be systematically collected and used.

Overview of cfhe12 blog post comment activity

Before I go into the results it’s worth saying how the data was collected. I need to write this up as a full tutorial, but for now I’ll just give an outline and highlight some of the limitations.

Data source

An OPML bundle of feeds extracted in week 2 was added to an installation of FeedWordPress. This has been collecting posts from 71 feeds, filtering for posts that contain ‘cfhe12’ using the Ada FeedWordPress Keyword Filters plugin. In total 120 posts were collected between 5th October and 3rd November 2012 (this compares to the 143 links included in Daily Newsletters). Data from FeedWordPress was extracted from the MySQL database as a .csv file using the same query used in the ds106 data extraction.

This was imported to Open (née Google) Refine. As part of the data FeedWordPress collects a comment RSS feed per post (a dedicated feed for comments made only on a particular post – a number of blogging platforms also have a general comment feed which outputs comments for all posts). 31 records from FeedWordPress included ‘NULL’ values (this appears to happen if FeedWordPress cannot detect a comment feed, or the original feed comes from a Feedburner feed with links converted to feedproxy). Using Refine, each comment feed was fetched and the comment authors and post dates extracted. In total 161 comments were extracted and downloaded into MS Excel for analysis.

Result

Below is a graph of cfhe12 posts and comments (the Excel file is also available on Skydrive). Not surprisingly there’s a tail off in blog posts.

CFHE12 Blog Posts and Comments

Initially looking at this on a per-post basis (shown below left) showed that three of the posts were being commented on for over 15 days. On closer inspection it was apparent this was due to pingbacks (comments automatically left on a post as a result of it being linked to from another post). Filtering out pingbacks produced the graph shown on the bottom right.

CFHE12 blog comments timeline  CFHE12 blog comments timeline (without pingbacks)

With pingbacks removed, comments on average stopped 3.5 days after a post was published, but in this data there is a wide range, from 0.2 days to 17 days. It was also interesting to note that some of the posts have high velocity: Why #CFHE12 is not a MOOC! received 8 comments in 1.3 days and Unfit for purpose – adapting to an open, digital and mobile world (#oped12) (#CFHE12) gained 7 comments in 17 days (in part because the post author took 11 days to respond to a comment).

Looking at who the comment authors are is also interesting. Whilst initially it appears 70 authors have made comments, it’s apparent that some of these are the same author using different credentials, making them ‘analytically cloaked’ (H/T @gsiemens).

analytically cloaked

Technical considerations when capturing comments

There are technical considerations when monitoring blog post comments and my little exploration around the #cfhe12 data has highlighted a couple:

  • multiple personas - analytically cloaked
  • pingbacks in comments – there are a couple of patterns you could use to extract these but I’m not sure there is a 100% reliable technique
  • comment feed availability – FeedWordPress appears to happily detect WordPress and Blogger comment feeds if not passed through a Feedburner feedproxy. Other blogging platforms look problematic. Also not all platforms provide a facility to comment
  • 3rd party commenting tools – commenting tools like Disqus provide options to maintain a comment RSS feed but it may be down to the owner to implement and it’s unclear if FeedWordPress registers the url
  • maximum number of comments – most feeds limit to the last 10 items. Reliable collection would require aggregating data on a regular basis.

This last point also opens the question of whether it would be better to regularly collect all comments from a target blog and do some post-processing to match comments to the posts you’re tracking, rather than hit a lot of individual comment feed urls. This is key if you want to reliably track and reuse comment data both during and after a cMOOC course. You might want to refine this and extract comments for specific tags using the endpoints outlined by Jim Groom, but my experience from the OERRI programme is that getting consistent use of tags by others is very difficult.
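
To make that concrete, here’s a rough sketch of the ‘collect everything then post-process’ approach for a WordPress blog. The /comments/feed/ path and the assumption that a comment link starts with its post’s url are WordPress conventions rather than something that will work everywhere, and because most feeds only expose the last ~10 comments it would need to run on a schedule:

function commentsForTrackedPosts(blogUrl, trackedPostUrls) {
  // fetch the blog-wide comments feed and count comments against tracked post urls
  var xml = UrlFetchApp.fetch(blogUrl.replace(/\/$/, "") + "/comments/feed/").getContentText();
  var items = XmlService.parse(xml).getRootElement()
                        .getChild("channel").getChildren("item");
  var counts = {}, i, j;
  for (i = 0; i < trackedPostUrls.length; i++) { counts[trackedPostUrls[i]] = 0; }
  for (i = 0; i < items.length; i++) {
    var link = items[i].getChildText("link") || "";
    for (j = 0; j < trackedPostUrls.length; j++) {
      // WordPress comment permalinks are typically postUrl#comment-123
      if (link.indexOf(trackedPostUrls[j]) === 0) { counts[trackedPostUrls[j]]++; }
    }
  }
  return counts;
}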

Discovering TEL-Map Mediabase

Twitter hasn’t completely abolished 3rd party clients just yet. The text in the red circle is generated from the details a user/company submits when they create an application that uses the Twitter API. As part of the registration the user has to provide a url for the application. In this example ‘TEL-Map Mediabase’ redirects to Learning Frontiers, which is powered by TEL-Map. I should probably know more about TEL-Map because one of the partners is CETIS (before my time).

But what about ‘Mediabase’? Well, a search for ‘TEL-Map Mediabase’ returns the page ‘MediaBase - Collecting, Analysing and Visualizing Large Collections of Social Software Artifacts’, which contains the presentation by Ralf Klamma.

So you basically have a system that can crawl predefined source feeds, analyse the collected data and either manually or automatically tag links, which can be pushed to a portal or tweeted from the system (and much more). Anyone else thinking cMOOC infrastructure?

[If the system is as good as I think it is I’m expecting another tweet from TEL-Map Mediabase]

Here are some posts which have caught my attention this month:

Automatically generated from my Diigo Starred Items.


For week 3 of cfhe12 analysis I thought I’d turn back to the Twitter data. I’m currently trying to prepare a Shuttleworth Fellowship application which has got me thinking more about the general premise of cMOOCs that “knowledge is distributed across a network of connections, and therefore that learning consists of the ability to construct and traverse those networks”  (from week 1 of cck11).

The aspect which features in my Shuttleworth application is providing mechanisms that aggregate data from distributed sub-networks, which can then be processed to produce actionable insights for tutors or participants. The process I plan to adopt is to look at the data using heavyweight tools like NodeXL, or just to apply a bit of curiosity (this person has stopped tweeting, why? etc.), and then convert some of these patterns into very lightweight applications or views to remove the complexity and highlight key parts of the data.

Some examples for you:

Summary of CFHE12 participant activity

Tweets

Tweets from CFHE12 are being collected in this Google Spreadsheet. As part of this template there are a number of summary views, one of these being a breakdown of individual participant activity, in which sparklines are used to display someone's Twitter activity. Looking at gsiemens you can see there is steady activity, posting 45 tweets tagged #cfhe12. Towards the bottom of the graph is ViplavBaxi, who after initial high activity is no longer contributing to the hashtag. So what has happened to ViplavBaxi? There are a number of possible answers, but let me highlight a couple which also highlight the limitations of the technique:

  • they have lost interest in the course or time commitments prevent them from contributing (high drop-out rates aren’t unexpected in MOOCs)
  • no longer using the #cfhe12 hashtag – the archive is only of #cfhe12, so if they have joined a sub-community communicating without the hashtag it’s not recorded
  • found a different communication channel – this technique is only looking at Twitter activity, the person may have moved to another network channel like the discussion forum

Another interesting activity summary is for dieGoerelebt. They are one of the top 5 contributors in terms of number of tweets, but recently their activity has trailed off. You can also see that the ‘@s’ column, the number of times they’ve been mentioned in tweets, is one of the lowest. Is the decline in activity a result of the lack of engagement?

The next question that springs to my mind is: what did these people say? Within the spreadsheet it’s easy to filter what they said. To let you see too, I’ve got this simple web interface primed with filtered tweets (I modified an existing tool I’ve developed to do this – unfortunately I’ve never documented it, but as I use it more and more I must get around to it):

Summary of CFHE12 participant activity with RT percentage

From visual inspection dieGoerelebt had a high proportion of retweets. This was confirmed when I added the percentage of each person's tweets that are retweets.
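
The calculation itself is trivial; a sketch, assuming retweets can be spotted with a crude ‘starts with RT @’ test (so native retweets and modified tweets may be missed):

function retweetPercentage(userTweets) {
  // userTweets: array of one person's tweet texts from the archive
  if (!userTweets.length) { return 0; }
  var rts = 0;
  for (var i = 0; i < userTweets.length; i++) {
    if (/^RT @/i.test(String(userTweets[i]))) { rts++; }
  }
  return Math.round(100 * rts / userTweets.length);
}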

Something I noted in the filtered view of a person's tweets was that a lot of the context is lost (I can see they are @replying to someone, but I don’t know what the other person said).

To help with this I started looking at modifying the twitter questions filter I built to enable a view of the conversation.

This is a start, but as I noted when I published the question filter, clicking through messages like the one shown below reveals that more of the conversation is missing.

 Part of the conversation

Bigger picture

Summary

So again I start exploring some ideas that branch off into many more avenues to follow. One thought is that the micro analysis of tweets might not be beneficial or practical, and given the issues with extracting a full conversation from Twitter a macro view might be better. Providing a summary of overall activity and the mode in which Twitter is being used by people may be of the most use to tutors and participants in identifying people they might want to connect with. As always your thoughts are greatly appreciated.

In this post I’ve taken an ego-centric approach to contributions. In the next couple of days I’ll share an ego-centric approach to community connections.


On Monday I’ll be helping Tony (Hirst) deliver a workshop at Internet Librarian International … W2 – Appropriating IT: innovative uses of emerging technologies. Tony has already posted a draft outline of the day. For the event I wanted to prepare some notes on using Google Spreadsheets/Sheets (I’m very confused about what they are called now; there was a very quiet announcement here of a change).

I’ve come up with some practical exercises for participants to try. There will be a lot of contextualisation going on as part of the workshop; the main idea is to seed some ideas of what’s possible. Here’s the Google Docs version of the guide. The focus is on what you can do with the importHTML, importFeed, importXML, importRange and importData spreadsheet formulas, with a few others, including QUERY, thrown into the mix. Using QUERY isn’t straightforward but it opens up lots of possibilities in how you reshape data (extra examples of these are included in the Google Docs version).

Note: All text and images with the exception of text in grey boxes is licensed Creative Commons Attribution 3.0 Unported License. CC-BY Hirst & Hawksey (if this isn’t clear check the source or get in touch)

importHtml

Syntax: ImportHtml(URL, query, index)

URL is the URL of the HTML page.

Query is either “list” or “table” and indicates what type of structure to pull in from the webpage. If it’s “list,” the function looks for the contents of <UL>, <OL>, or <DL> tags; if it’s “table,” it just looks for <TABLE> tags.

Index is the 1-based index of the table or the list on the source web page. The indices are maintained separately so there might be both a list #1 and a table #1.

Example: =ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table"; 4). This function returns demographic information for the population of India.

Note: The limit on the number of ImportHtml functions per spreadsheet is 50.

Exercise 1: Importing a html table and graphing the result

  1. Create a new Google Spreadsheet
  2. In cell A1 enter the formula to import a table from Wikipedia: =importHTML("http://en.wikipedia.org/wiki/2012_Summer_Olympics_medal_table","table",3)
  3. Select Insert > Chart and then select the data ranges for country name and total medals
    Selecting chart data
  4. While still in the Chart Editor select the Charts tab, then Maps > Geo charts - regions
  5. Still in the Chart Editor select Customise and change the No value colour to blank.
  6. Finally click Insert

You should now have a chart that looks like this:

World map with missing data

Notice that the chart has No values (black) for most of the countries. To fix this we need to remove the country codes in brackets. One way to do this is to trim the text from the left up to the first bracket “(“. This can be done using a combination of LEFT, FIND and ARRAYFORMULA (ARRAYFORMULA allows you to apply formulas to multiple cells).

  1. In cell H2 enter the formula =ARRAYFORMULA(LEFT(B2:B86,FIND("(",B2:B86)-2)) (this should create a column of country names without brackets)
  2. Click on your chart and select Advanced edit.
  3. Make sure you are on the Start tab in the Chart Editor and edit the data range from Sheet1!B2:B86 to Sheet1!H2:H86, then click Update

Your chart should now look like this (my solution http://goo.gl/8qUI9):

World map without missing data

importFeed

Syntax: =ImportFeed(URL; [feedQuery | itemQuery]; [headers]; [numItems]). The arguments to the function are as follows:

URL is the url of the RSS or ATOM feed.

feedQuery/itemQuery is one of the following query strings: "feed", "feed title", "feed author", "feed description", "feed url", "items", "items author", "items title", "items summary", "items url", or "items created". The feed queries return properties of the feed as a whole: the feed's title, the feed's author, etc. Note: To get the data included in the feed, you need to do an "items" request.

  1. the "feed" query returns a single row with all of the feed information.
  2. the "feed <type>" query returns a single cell with the requested feed information.
  3. the "items" query returns a full table, with all of the item information about each item in the feed.
  4. the "items <type>" query returns a single column with the requested information about each item.
  5. if a query is given that begins with "feed", the numItems parameter isn't necessary and is replaced by the option headers param.
  6. if a query is given that begins with "items", the numItems parameter is expected as the 3rd param, and headers as the 4th.
  7. headers - "true" if column headers is desired. This will add an extra row to the top of the output labeling each column of the output.

Example: =ImportFeed("http://news.google.com/?output=atom")

Note: The limit on the number of ImportFeed functions per spreadsheet is 50.

Exercise 2: Importing a RSS feed and getting social share counts

  1. Open http://goo.gl/aai2p
  2. In cell B5 enter the RSS feed url for a blog (or you can use http://feeds.feedburner.com/MASHe) and hit enter

You should end up with something like:

RSS social counts

An aside: Spreadsheet Addiction by Patrick Burns (http://goo.gl/P6pQP) highlights the dangers of using spreadsheets for analytics. Particular issues include the ambiguity of a cell being a value or a formula. For example, if I sort cells on the value in the Twitter count column, all the data is lost because cells are sorted as values but actually contain formulas which get broken.

How it works

In cell A11 is the formula =IF(ISBLANK(B5),,IMPORTFEED(B5,"items",FALSE)). If the feed url is not blank this fetches the RSS feed defined in B5. Results are returned in cells A11:E30. You may have noticed that column E is hidden; this is because it contains the feed item description.

The social share counts are returned by a custom function written in Google Apps Script (https://script.google.com). Google Apps Script is similar to Excel Macros, written using a JavaScript syntax. If you open Tools > Script editor in your spreadsheet you can see some of the custom script powering the spreadsheet. This includes the getSharedCount formula used in cells F11:F30 which passes the post url to the SharedCount.com API and returns social share counts. The code used is:

function getSharedCount(sourceLink){
  // example post url: "https://mashe.hawksey.info/2012/02/oer-visualisation-project-fin-day-40-5/"
  var url = extractLink(sourceLink);
  var cache = CacheService.getPublicCache(); // using Cache service to prevent too many urlfetches
  var cached = cache.get("C"+url);
  if (cached != null) { // if the value is in the cache return it
    return cached.split(",");
  }
  try {
    var options = {
      "method" : "get",
      "contentType" : "application/json"
    };
    // ask the SharedCount.com API for the share counts of this url
    var response = UrlFetchApp.fetch("http://api.sharedcount.com/?url="+encodeURI(url), options);
    var data = Utilities.jsonParse(response.getContentText());
    var output = [];
    for (var i in data){
      if (i == "Facebook"){
        output.push(data[i].total_count); // Facebook returns an object, so take the total
      } else {
        output.push(data[i]);
      }
    }
    cache.put("C"+url, output, 86400); // cache (as a comma-joined string) for 1 day
    return output;
  } catch(e){
    Logger.log(e);
  }
}

For more examples of Google App Script see http://scoop.it/t/gas.

importXML

Syntax: =ImportXML(URL, query)

URL is the URL of the XML or HTML file.

Query is the XPath query to run on the data given at the URL. Each result from the XPath query is placed in its own row of the spreadsheet. For more information about XPath, please visit http://zvon.org/xxl/XPathTutorial/Output/.

Example: =importXml("http://www.toysrus.com"; "//a/@href"). This returns all of the href attributes (the link URLs) in all the <a> tags on http://www.toysrus.com homepage.

Note: The limit on the number of ImportXML functions per spreadsheet is 50.

Exercise 3: Turn a page of RSS Feeds into an OPML file

  1. Create a new spreadsheet
  2. In cell A1 enter the text ‘Title’ and in cell B1 ‘Url’
  3. Now in cell A2 enter the formula =ImportXML("http://edfuture.mooc.ca/feeds.htm","//b/a")
  4. In cell B2 enter =ImportXML("http://edfuture.mooc.ca/feeds.htm","//a[.='XML']/@href")
  5. File > Publish to the web and click ‘Start publishing’, copy the link in the bottom box then ‘Close’
  6. Visit http://opml-generator.appspot.com/, paste your spreadsheet link in the box and copy the generated link into your browser address bar

How it works

Using XPath we can identify the parts of an XML (including HTML) page we want to extract. The screenshot below shows how parts of the page are identified. [I always struggle with XPath so I use browser extensions to help (Scraper and XPath Helper).] The results are pulled into the spreadsheet as live data, so if the source page is updated the data in the spreadsheet will also be updated.

XPath parts

ImportRange

Syntax: =ImportRange(spreadsheet-key, range)

Spreadsheet-key is a STRING which is the key value from the spreadsheet URL.

Range is a STRING representing the range of cells you want to import, optionally including the sheet name (defaults to first sheet). You can also use a range name if you prefer. Given that the two arguments are STRINGs, you need to enclose them in quotes or refer to cells which have string values in them.

Example: =importrange("abcd123abcd123", "sheet1!A1:C10")

"abcd123abcd123" is the value in the "key=" attribute on the URL of the target spreadsheet and "sheet1!A1:C10" is the range which is desired to be imported.

Note: In order to use ImportRange, you need to have been added as a viewer or collaborator to the spreadsheet from which ImportRange is pulling the data. Otherwise, you'll get this error: "#REF! error: The requested spreadsheet key, sheet title, or cell range was not found."

[Here’s http://goo.gl/b8FXC is a copy of the completed spreadsheet used in exercises 4, 5 and 6]

Exercise 4: Importing data from Guardian Datastore

  1. Visit http://goo.gl/j6RBU and click Get the Data, then DATA: download the full spreadsheet
  2. Keep this window open and create a new spreadsheet.
  3. In cell A1 enter =ImportRange("0AonYZs4MzlZbdFdrRGthS3dLVzlBdWVrV2lIbzZKY0E","Times Higher Education rankings 2012!A1:D104") - notice how the key and range are entered

Let’s now reshape the data so we can generate some graphs. First let’s calculate the change in rank between 2011 and 2012:

  1. In cell E2 enter the formula =B2-A2
  2. Fill this formula down for the rest of the rows (there are a couple of ways of doing this, including copying cell E2, highlighting the other empty cells in column E and pasting, or, whilst E2 is active, grabbing and dragging the bottom right corner of the cell)
  3. Now we want to get a list of the countries included in column D. To do this, in cell G2 enter the formula =UNIQUE(D2:D102)
  4. Now we want to sum the rank difference per country by entering the following formula in cell H2: =SUMIF(D$2:D$102,G2,E$2:E$102)
  5. Copy this value down for the remaining rows
  6. Select the data range G2:H16 and Insert > Chart > Bar chart

University world ranking delta

Graph of change in top 100 world university ranking.

Is the process for producing this chart valid? Is it displaying a meaningful representation of the data? Is this chart a lie?

Important: Notice that the calculation for France has an error:

Broken value

This is because on row 93 the source data doesn’t have a number value. Because we’ve used ImportRange to get the data we can’t edit it, as our changes get overwritten by the ImportRange formula in cell A1. In our scenario we can remove the calculated value in E93 or use a formula to handle the error. Another way around this is to flatten the imported data by copying all of it and pasting as values (other solutions exist which we cover later).

ImportData

Syntax: =ImportData(URL)

URL is the URL of the CSV or TSV file. This imports a comma- or tab-separated file.

Note: The limit on the number of ImportData functions per spreadsheet is 50.

Exercise 5: Importing CSV data from Google Maps

  1. In the spreadsheet you created for exercise 4 Insert > New Sheet
  2. In cell A1 of the new sheet enter the formula =FILTER(Sheet1!A:E,Sheet1!D:D="United Kingdom") to filter the data on sheet1 where column D is the United Kingdom
  3. Now in cell F1 enter the formula =ImportData("http://maps.google.com/maps/geo?output=csv&q="&C2) and press return
  4. Copy this cell down for the remaining rows

You should now have a table that looks a little like this:

ImportData from Google Maps

Import... and QUERY

Syntax: =QUERY(data, query, headers)

Data - An array of data. For example: A1:B6, data!B2:H6, ImportRange(spreadsheet_key, [sheet!]range),FILTER(sourceArray, arrayCondition_1, …)

Query - A query string for applying data operations. The query operates on column IDs directly from the input range and uses a subset of the SQL language. For example, "select E", "select A, B", "sum(B), C group by C", "select D where D < 'Nick'". In certain instances, for example when using FILTER as a data source, column identifiers are Col1, Col2 etc. For more information on creating queries see Google Visualization API Query Language.

Headers (optional) - A number specifying the number of header rows in the input range. If omitted, or set to -1, the number of header rows is guessed from the input range. This parameter enables transformation of multi-header rows range input to be transformed to a single row header input which the QUERY supports.

Exercise 6: Import and QUERY

  1. In the spreadsheet you created for exercise 4 Insert > New Sheet
  2. From your original sheet (Sheet1) copy the importRange formula in cell A1
  3. In your new sheet (Sheet3) paste the formula you just copied inside a QUERY formula shown below

    =QUERY(ImportRange("0AonYZs4MzlZbdFdrRGthS3dLVzlBdWVrV2lIbzZKY0E","Times Higher Education rankings 2012!A1:D104"), "SELECT Col1, Col2, Col3, Col4, Col2-Col1 LABEL Col2-Col1 'Difference'")

You should now have a table that looks like this:

QUERY for difference

How it works

The QUERY function imports the data and, using the Google query language, selects columns 1 to 4 and also adds a fifth by taking the difference between Col2 and Col1; this new column is labelled Difference. Notice that on row 93 the French entry no longer has an error, but is blank.

We could continue exercise 4 and get a summary chart using UNIQUE and SUMIF. An alternative would be to use the QUERY formula again to do this for us by:

  1. In Sheet 3 enter the following formula in cell G1 =QUERY(A:E,"SELECT D, SUM(E) WHERE D <> '' GROUP BY D ORDER BY D")

This time we are setting the data source as all the data in columns A to E in Sheet3. Then we are creating a query that selects column D and a sum of column E where D is not blank, grouped by the value in column D (the country names).

A reminder that here’s a copy of the completed spreadsheet used in exercises 4, 5 and 6 http://goo.gl/b8FXC

Summary/Keypoints

Hopefully you’ve seen that Google Sheets (Spreadsheets) can be a useful tool for importing data from other sources and reshaping it to fit your needs. The last set of exercises is more advanced, so don’t worry if you don’t fully understand what is happening; they are there as an indication of what is possible and will hopefully inspire you to find out more.

Three points worth remembering when using import formula:

  • The data is a live link - if the data source changes or disappears, your imported data will also change or disappear (this can be both an advantage and a disadvantage).
  • The ambiguity of a cell - because a spreadsheet cell can be a value or a formula, sorting values generated by formulas can lead to unforeseen problems.
  • There are other ways of getting data into Google Sheets - this guide has only looked at ‘Import...’ formulas for pulling data in. There are other ways of populating sheets, using Google Forms, Google Apps Script and 3rd party services like IFTTT (for an example of this last one see ‘IFTTT: IF I do THAT on {insert social network/rss feed/other} THEN add row to Google Spreadsheet’ - http://goo.gl/nB0vD)

A reminder that the Google Docs version of the guide is here and contains more examples. If you get stuck leave a comment.