Archive for the 'Analytics' Category

Page 2 of 5

WordPress Stats in R: Generating postview sparklines using SparkTable

After posting Analysing WordPress post velocity and momentum stats with Google Sheets (Spreadsheet) Tony quickly came back with WordPress Stats in R, which includes a handy custom function for querying the WordPress Stats API. Something I wanted to do in my original post was to generate sparklines for post activity. I wasn’t able to find a way of doing this in Google Sheets  because of the number of data points (n. 366) and didn’t think Google Apps Script would handle the dataset because of timeouts. So given the start Tony had made with R it seemed like the ideal opportunity to refresh my R skills.

So below is the result (here’s the interactive version), which is followed by an explanation of how it was made.

SparkTable of postview 

Processing the data in R

Thanks to the code provided by Tony it’s easy to get postviews data back from WordPress using:

wp.postviews=wordpress.getstats.demo(APIKEY, BLOGURL, 'postviews', days='366', end='2012-12-31',  period='day', limit=-1)

This returns a data frame that looks like this:

Note that 478 different post_titles are returned compared to 178 post_permalinks showing some of the data is missing and in fact wrong. This isn’t important for this particular project. I’ve uploaded a sample of the returned data here. The data returned includes a day stamp, post title and number of views. Posts with no views on that day are not returned.

For the sparklines I initially headed off down a dead end (lines 99-107) making a pivot table in R. This wasn’t entirely wasted effort because the process of casting this and handling null values was used later in the project.  Having got stuck at this point searching for more information on sparklines in R turned up this question on stackoverflow and this set of slides referencing the sparkTable package. Initially I tried dropping and renaming some of the columns from the original data but on line 96 when trying the reshapeExt I got:

Error in `[<-.data.frame`(`*tmp*`, , attr(dat, "reshapeLong")[["timevar"]],  : 
  replacement has 32504 rows, data has 32538

Searching for variations around the error message didn’t turn anything up and if you know what is wrong I’d be very grateful for guidance. Instead I decided to follow a hunch and instead of using partial time series data filled in the blanks by casting and then melting i.e. from this:

data from worpress

to this (filling in missing data with ‘0’)

data from worpress casted

before going back to this

data from worpress then melted

Doing this got me past reshapeExt and outputting a sparkTable as html (the other option is LaTeX) embedding 478 sparkBars as png images. The final step was to wrap the html using jQuery DataTables (the main thing was to add <thead> and <tbody> tags). 

Was it worth it? I’ll let you decide if the product is any good, the process of doing it was definitely useful. So how would you improve it?

Analysing WordPress post velocity and momentum stats with Google Sheets (Spreadsheet)

Like other wordpress.com and self-hosted wordpress blogs using Jetpack I got my annual report. Some of the headlines include:

The report then lists the top posts (minus the totals, which are available from the stats panel)

These are the posts that got the most views on JISC CETIS MASHe in 2012.

  1. Export Twitter Followers and Friends using a Google Spreadsheet 77 COMMENTS March 2011
  2. Twitter Archiving Google Spreadsheet TAGS v3 17 COMMENTS January 2012
  3. The best Google Spreadsheet Event Manager (w/h Sites, Contact, Calendar integration) ever, ever, ever 4 COMMENTS November 2010
  4. Twitter: How to archive event hashtags and create an interactive visualization of the conversation 13 COMMENTS November 2011
  5. Using Google Apps Script for a event booking system (Spreadsheet to Calendar & Site | Form to Spreadsheet, Email and possible Contacts) 23 COMMENTS March 2010

These results are biased towards posts that have had the most time to collect views. What about the posts made later in the year? Or, other posts like IFTTT that attracted a lot of traffic in one day?

Remembering Tony Hirst’s Making Use of WordPress Stats which pulls data from the WordPress Stats API into a Google Sheet (Spreadsheet) I created a modification that pulls in a years worth of stats and lets me explore some of these questions. Below is a summary of WordPress Postviews Stats Analysis for MASHe (columns are sortable by clicking on the column heading).

Notes

  • AVERAGE of Views is the average post views per day. Days with no views are not counted
  • MAX of Views is the maximum post views in one day
  • SUM of Views is the total number of views a post has received
  • COUNTA of Views is the total number of days a post received at least one view

This reveals posts like Feeding Google Spreadsheets (from October 2012) and Mozilla Open Badges Issuer Gadget for Google Sites (December 2012) have good average daily views basic projections putting them in the top 10 for 2013.

There’s more digging to be done, but in the meantime if you want to have a look at your own wordpress stats grab a copy of the template below and have a go (and if you make any improvements/useful insights leave a note ;)

*** WordPress Postviews Stats Analysis Template ***
[File > Make a copy for an editable version]

Creation notes

The code comments highlight the main gotchas:

function wordpressStatPostviews(api_key, blog_uri, end, days, limit) {
  // build url for api. Found better results using json rather than csv (csv didn't get  years worth)
  var url = "http://stats.wordpress.com/csv.php?api_key="+api_key+"&blog_uri="+blog_uri+"&end="+Utilities.formatDate(end, "GMT", "yyyy-MM-dd")+"&days="+days+"&limit="+limit+"&table=postviews&format=json";
  var output = [["Link","Title","Title - Short","Date","Views"]]; // initialise return array
  try {
    var options = { "method" : "get" }; // initialise options for UrlFetchApp
    var response = UrlFetchApp.fetch(url, options); // fetch url
    if (response.getResponseCode() == 200) { // if response
      var raw = Utilities.jsonParse(response.getContentText()); // parse text respose into json
      raw.reverse(); // reorder to oldest first - no real need but helped with debugging
      for (i in raw){ // data is returned as object per day so iterate across
        var postdate = raw[i].date; // pull post date (as string) considered converting to date object but not required
        for (j in raw[i].postviews){ // in each date object there is array of views per post, interate across to push into returned 2d array 
          var apost = raw[i].postviews[j]; // abreviating
          // noticed a bug in returned data. Views for homepage have correct post_id but post_title is pulled from last interation
          var title = apost.post_id === 0 ? "Homepage" : apost.post_title; 
          // creating a short title for published table (no cell wrapping)
          var titleShort = title.length > 50 ? title.substring(0,50)+"..." : title;
          // push row to output
          output.push([blog_uri+"/?p="+apost.post_id,title, titleShort, postdate,  apost.views]);
        }
      }
      return output;
    }
  } catch(e) {
    throw e;
  }
}

Twitter provides proof of real-time engagement with the public: How to record it as evidence

Alistair Brown has written an interesting post on the LSE Impact of Social Sciences blog – Proving dissemination is only one half of your impact story: Twitter provides proof of real-time engagement with the public. The post highlights the case of how a journal paper was picked up by a university media office, which ended in the author being interviewed on BBC Radio 4’s Today programme. As part of this Alistair highlights that:

REF impact involves an assessment of “significance” as well as “reach,” so the mere fact that research has been disseminated to a wide audience does not constitute an impact by itself; one has also to show the effect it has on those to whom it is disseminated. For this reason, citing the fact that a researcher has appeared on a primetime radio show with several million potential listeners might be one element of an impact statement, but one needs also to evidence that the audience has actively listened to what was being put out, and that it has affected, changed or benefitted them in some way

In the age of the second screen Alistair goes on to highlight how Twitter can be used as evidence of engagement, listeners tweeting personal reflections, feedback or just disseminating the information more widely. But as Alistair points out:

When a piece of academic work receives broadcast media coverage, then, it is useful to have a strategy in place to gather emerging responses, and it is also far easier to do this as it happens rather than retrospectively.

A strategy is required because, as Alistair points, out the Twitter search is limited to the last 7 days. While there are ways to view this activity in realtime how do you capture the evidence.  Here’s my response to the problem:

Here is the spreadsheet template I mention. So have you got a strategy for recording impact evidence from Twitter?

Guest Post for Big Data Week #bdw13: Getting Creative with Big Data and Google Apps

Getting Creative with Big Data and Google Apps I was recently asked to write a guest post for Big Data Week on using Google Apps as an interface for Big Data. For the post I decided to revisit an old recipe which uses Google Sheets (Spreadsheets) and Google Apps Script to interface the Twitter and Google Analytics API. One of the results is the bubble graph shown below which shows who has been tweeting my blog posts, how many visits their tweet generated, the number of retweets and how many followers the person has (click on the image for the interactive version). You can read more about his this was done and get a copy of the template in Getting Creative with Big Data and Google Apps

Visits, rewteets bubble graph

CFHE12 Analysis: Summary of Twitter activity

You may have noticed I missed an analysis of week 5 of CFHE12, but hopefully I’ll capture what I wanted to say and more in this post. In this post I want to pull together a couple of ideas around some of the measurable user activity generated as part of CFHE12. This will mainly focus around Twitter and blogs and will ignore other channels like the course discussion forum, webinar chat or other spaces participants might have discovered or created. I conclude that there are some simple opportunities to incorporate data from twitter into  other channels, such as, summary of questions and retweets.

Twitter Activity Overview

The headline figures for tweets matching the search ‘#CFHE12 OR edfuture.net OR edfuture.mooc.ca OR edfuture.desire2learn.com’  for 8th October to 18th November 2012 (GMT):

  • 1,914 Tweets from 489 different Twitter accounts
  • 1,066 links shared
  • 472 retweets
  • 10% (n=206) of the tweets were in @reply to another Twitter account
  • Contributions from accounts in 45 countries (top 5  United States of America – 166; Canada – 50; United Kingdom – 43; Australia – 28; Germany – 12) [1]

Map of #cfhe12 participants

Looking at week-by-week distribution of contributors and contributions it can be seen that after the initial first 2 weeks the number of tweets posted each week remained consistent around 200. 75% of the Twitter accounts (n=374) contributed tweets to 1 week of CFHE12.

Twitter contributors and tweets Number of weeks participants contributed in

Comparing Twitter activity with blog posts aggregated with gRSSHopper doesn’t reveal any correlation but it’s interesting to note that whilst the volume of tweets remain relatively consistent for weeks 3 to 6 there is a significant drop in blog posts between week 4 and 5.

CFHE12 number of tweets and blogs

Looking at distribution of tweets over day and time shows a lull on Thursdays, but a peak around 1900hrs GMT which would appear to coincide with the usual time slot used for tweetchats.

CFHE12 Tweets by day of week CFHE12 Tweets by time of day

Getting more technical having collected the friend follower relationships for Twitter accounts using #CFHE12 for each of the weeks it possible to analyse new connections between community members. At the end of week 1 the top 52 contributors were joined by 286 follow relationships. By the end of the course 45 new follow relationships were created increasing the graph density [0.103774 –> 0.120101597] and reducing geodesic distance [2.008136 –> 1.925296]

The graph below highlights the new relationships (bold line). Nodes are sized by the number of new connections (as part of the archive the friend/follower count is captured with each tweet so it may be possible to do further analysis). Its interesting to note that BarnetteAndrew is an isolated node in G7.

cfhe12 interconnection growth

Refining the signals from the Twitter feed

Adam Cooper (CETIS) has recently posted some tips from a presentation John Campbell on development of Signals at Purdue, which includes using a spreadsheet as a starting point as a way to find out what you need. I’ve already got some basic tools to overview a Twitter archive but used the CFHE12 data to experiment with some more.

By week

Adding to the existing Twitter Activity sparklines it’s been possible to extract a summary of week-by-week activity (a basic traffic light). Whilst this is in a way a duplication of the data rendered in the sparkline it has been useful to filter the participants based on queries like ‘who has contributed in week 1’ and ‘who as contributed in all the weeks’. If you were wanting to take this to the next level you’d combine it with the community friendship graph and pay extra attention to the activity of your sub community bridges (for more info on this see Visualizing Threaded Conversation Networks: Mining Message Boards and Email Lists for Actionable Insights).

CFHE12 Activity data

Conversation matrix

Graphing the conversations (@reply, @mentions and retweets) using TAGSExplorer gives this ball of mess.

CFHE12 TAGSExplorer View

Trying to find a more structured way of presenting the data I’ve experimented with an adjacency matrix (shown below or interactive version here)[2]. Each cell is colour coded to indicate the number of interactions (replies, mentions and retweets) between users. For example we can see that gsiemens has had the most interactions with barrydahl. Scanning along the rows gives you a sense of whether a person was interacting with a large number or select few other accounts. For example, pgsimoes has interactions mostly with suifaijohnmak. Filtering tweets for pgsimoes (possible from the link in the left column) it looks like it’s an automatic syndication of suifaijohnmak’s blog posts.

CFHE12 conversation adjacency matrix

Do you have any questions?

CFHE12 Any Questions?

At the beginning of CFHE12 I posted Any Questions? Filtering a Twitter hashtag community for questions and responses. This is a crude tool which filters out tweets with ‘?’ which might indicate they are a question. By counting the number of tweets in the archive which reply to a question you get the following breakdown

  • Total questions 309
  • Questions with replies 44

An important point to note is that as only tweets that meet the search criteria are archived there may be more responses ‘off tag’. The danger in a medium like Twitter used in courses like CFHE12 is that questions may go unanswered, misconceptions are not corrected, feedback is never given.

Part of the issue in the case of CFHE12 is participants are allowed to use tools like Twitter as they please. While this suits ‘visitors and residents’ some additional structure may be beneficial. Two simple approaches would be to direct participants to include the course tag in their reply and highlighting current questions by either using the ‘Any Questions’ tool or in the case of courses using gRSSHopper instead of including all the latest course tweets in the daily email alert filter the search for ‘CFHE12 AND ?

Retweet-me

retweetsFor a while I’ve had a basic function that extracted the tweets with the most retweets, but <sigh> I think it on the list of developments I’ve never got to blog about. The routine is a relatively simple bean counter that goes through a list of tweets, removes any hyperlinks and crudely shortens the text by 90% (to account for any annotations) and counts the number of matches before returning a list of the top 12. Slicing the data for each week I get these tables. There is probably more analysis required of what is being retweeted before making a decision about how this data could be used. My main question for the #cfhe12 twitter community is doing they have a sense of what is being retweeted the most. The other angle is pushing some of this data into other communication channels like the Daily Newsletter or discussion forums.

Summary

So hopefully the summary data I extracted and experimentation with new data views has been useful. Something that has been reinforced in my own mind that more value could be easily gained using Twitter by either providing guidance on use and/or incorporating data from Twitter in other channels more effective (rather than dumping everything into a daily email select some key data). Now that I’ve got a template which splits some of the data into weekly slices it should be easier to deploy and pass data into other systems buy changing some of the dates on the summary sheet.

But what do you think? Are there any particular data views you found useful? If you’ve participated in a community that uses Twitter a lot what additional tools would you find useful to keep track of what is going on?

Get the Data

Notes

[1] Countries were reconciled by extracting location recorded in twitter account profile and generating geo-coordinates using recipe here. Locations were extracted for 404 accounts. Co-ordinates were uploaded to GeoCommons and analysed with a boundary aggregation to produce this dataset.

[2] the matrix was generated by exporting conversation data from TAGSExplorer by adding the query &output=true to the url e.g. like this, importing into NodeXL then filtering vertices based on a list of top contributors. This was exported as a Matrix Workbook and imported into the Google Spreadsheet. Conditional formatting was used to heatmap the cells.

Summary of social monitoring tools and recipes I use at JISC CETIS

scapetate.com data model
scapetate.com data model
Originally uploaded by psd
I thought it would be useful to give a summary of some of the tools I use/developed at CETIS to monitor the pulse of the web around our and JISC work. All of these are available for reuse and documented to varying degrees. All of the tools also use Google Spreadsheets/Apps Script which is free for anyone to use with a Google account, and all the recipes use free tools (the exception being owning a copy of Excel, but most institutions have this as standard).

Tools

Hashtag archiving, analysis and interfacing

Hashtag archiving, analysis and interfacingUsed with: CETIS COMMS, OERRI Programme, #UKOER, …
What it does: It’s a Google Spreadsheet template which can be setup to automatically archive Twitter searches. The template includes some summaries to show top contributors and frequency or tweets. There are a number of add-on interfaces that can be used to navigate the data in different ways, including TAGSExplorer and TAGSArc.

More info: http://mashe.hawksey.info/2011/11/twitter-how-to-archive-event-hashtags-and-visualize-conversation/

Monitoring Twitter searches and combining with Google Analytics

Monitoring Twitter searches and combining with Google AnalyticsUsed with: CETIS COMMS
What it does: Archives all tweets linking to to the .cetis.ac.uk domain and combines with our Google Analytics data to monitor influential distributors of our work.

More info: http://mashe.hawksey.info/2012/03/combine-twitter-and-google-analytics-data-to-find-your-top-content-distributors/

RSS Feed Activity Data Monitoring

RSS Feed Activity Data MonitoringUsed with: CETIS COMMS, OERRI Programme
What it does: Gives a dashboard view of the total social shares from a range of services (Facebook, Twitter, Google+ for a single or combination of RSS feeds. At CETIS we also monitor the social popularity of blogs referencing .cetis.ac.uk by using a RSS feed from Google’s Blog Search e.g. http://www.google.com/search?q=link:cetis.ac.uk&hl=en&tbm=blg&output=rss&num=20

More info: http://mashe.hawksey.info/2012/06/rss-feed-social-share-counting/

Post Activity

Blog Activity Data Feed Template OverviewUsed with: CETIS COMMS
What it does: Gives more detailed activity data around socially shared urls combining individual tweets from Topsy, Delicious, and post comments.

More info: http://mashe.hawksey.info/2012/08/blog-activity-data-feed-template/

Post Directory

Post DirectoryUsed with: OERRI Programme
What it does: Dashboards all the project blogs from the OERRI Programme and monitors when they release blog posts with predefined tags/categories. The dashboard also combines the social monitoring techniques mentioned above so that projects and the programme support team can monitor social shares for individual blog posts.

More info: http://mashe.hawksey.info/2012/08/how-jisc-cetis-dashboard-social-activity-around-blog-posts-using-a-splash-of-data-science/

Automatic final report generation

OERRI DashboardUsed with: OERRI Programme
What is does: As an extension to the Post Directory this tool combines project blog posts from a predefined set of tags/categories into a final report as an editable MS Word/HTML document. Currently only the original post content, including images, is compiled in individual reports but it would be easy to also incorporate some of the social tracking and/or post comments data.

More info: http://mashe.hawksey.info/2012/09/converting-blog-post-urls-into-ms-word-documents-using-google-apps-script-oerri/

Recipes

As well as standalone tools I’ve documented a number of recipes to analysis monitoring data.

Twitter conversation graph

Twitter conversation graphUsed with: #moocmooc, #cfhe12
What it does: Using data from the Twitter Archiving Google Spreadsheet template (TAGS) this recipe shows you how you can use a free Excel add-on, NodeXL, to graph threaded conversations. I’m still developing this technique but my belief is there are opportunities to give a better overview of conversations within hashtag communities, identifying key moments.

More info: http://mashe.hawksey.info/2012/08/first-look-at-analysing-threaded-twitter-discussions-from-large-archives-using-nodexl-moocmooc/

Community blogosphere graph

Community blogosphere graphUsed with: #ds106
What it does: Outlines how data from blog posts (in this case a corpus collected by the FeedWordPress plugin used in DS106) can be refined and graphed to show blog post interlinking within a community. An idea explored in this recipe is using measures used in social network analysis to highlight key posts.

More info: http://mashe.hawksey.info/2012/10/visualizing-cmooc-data-extracting-and-analysing-data-from-feedwordpress-part-1-ds106-nodexl/

Activity data visualisation (gource)

Activity data visualisation (gource)Used with: #ukoer
What it does: Documents how data can be extracted (in this case records from Jorum) and cleaned using Google Refine (soon to be renamed OpenRefine). This data is then exported as a custom log file which can be played in an open source visualisation tool called Gource. The purpose of this technique is to give the viewer a sense of the volume and size of data submitted or created by users within a community.

More info: http://mashe.hawksey.info/2011/12/google-refining-jorum-ukoer/

So now go forth and reuse!

Visualizing cMOOC data: Extracting and analysing data from FeedWordPress part 1 #ds106 #NodeXL

In a recent post I showed that how given some urls it was easy to generated a templated report as a word document. This was partly done to show how blog posts from a JISC funded programme could be used to generate reports. One of the issues with this solution is not all the projects used WordPress as their blogging platform making it harder to get at some of the data. One thought I had was rather than dictating a particular platform JISC could aggregate all programme posts in a central database. This isn’t entirely new as at CETIS we already aggregate links to posts for selected programmes via our PROD database, the difference here would be as well as a link the post content would also be ingested making it easy to query and analysis the data from a single source. As I recently wrote about in Notes on technology behind cMOOCs: Show me your aggregation architecture and I’ll show you mine this technique is common in cMOOCs, and in particular highlighted how Digital Storytelling (DS106) do this already using the FeedWordPress plugin.

As part of a double hit I thought I’d see if there was a case for JISC implementing this type of infrastructure, and because I’ve got an interest in applying learning anlaytics to open online courses, have a look at the data stored by ds106 and see what analysis I could do. In the next series of posts I share my notes and sketches around extracting FeedWordPress data, supplement it with 3rd party sources and analysing dialog/activity around posts. In this first post I’m going to outline how all the data was imported into a local WordPress database, queried to pull out a data slice, refined and mined, before visualisation/analysis. What we end up with is a view of how ds106 blog posts are interlinked by hyperlinks within the post body. For the next in the series I record some notes on comment extraction and finally how to get more activity data from social network sites.

The data source

If JISC were hosting the data then in theory it would be  a lot easier to get to. As my test data comes from ds106 some extra preparatory work was required. Fortunately #ds106er Alan Levine (@cogdog) was able to provide me with a MySQL dump of data from their WordPress install. The dump contained over 700,000 lines of SQL commands and was over 500Mb. Problem one was then how to get this back into a local MySQL database to make slice, dice and export easier.

For a local Apache/MySQL/PHP server I used  XAMPP Lite. Importing the data wasn’t straight forward as despite tweaking the config kept having ‘allocated memory’ errors. The solution was ‘BigDump: Staggered MySQL Dump Importer’, which was able to automatically split and run the import. Even with this I got a couple of ‘MySQL has gone away’ errors so used Vim (a text editor that can handle very large files) to drop 5 blog post inserts (given ds106 had over 20,000 posts I can live with the lose).

The selected data export

Wordpress Database DiagramThose familiar with the backend of WordPress will know it has a relational structure with a number of tables holding different parts of the data. Because I knew I wanted to explore ‘ds106 dialog’ and that to do this the data would have to be compiled with other sources I needed to export a slice of the data. Whilst I was does this it also made sense to make some of the data more meaningful. For example, the wp_posts table which has most of the data I needed uses an author id number where as it would be better to get this as a display name by joining it with the wp_users data. Fortunately because MySQL is … well a ‘query language’ this is relatively easy to do by using the command below. To talk you through it I’m selecting a couple of columns from the wp_posts table and joining it with some other data on the condition that it’s a published post. As well as author display name you’ll see that I’m also returning a wfw:commentRSS. This is the comment RSS feed caught by the FeedWordPress plugin (I think it comes from here). I got wind of this metadata after reading Jim Groom’s Displaying Distributed Comments on the Hardboiled Blog. I’ll show how this is used later.

SELECT wp_posts.id, 
       wp_posts.post_author, 
       wp_posts.post_date, 
       wp_posts.post_content, 
       wp_posts.post_title, 
       wp_posts.post_name, 
       wp_posts.guid, 
       wp_postmeta.meta_value AS comment_feed, 
       wp_users.display_name 
FROM   wp_posts 
       LEFT JOIN wp_postmeta 
              ON wp_posts.id = wp_postmeta.post_id 
                 AND wp_postmeta.meta_key = 'wfw:commentRSS' 
       LEFT JOIN wp_users 
              ON wp_posts.post_author = wp_users.id 
WHERE  wp_posts.post_type = 'post' 
       AND wp_posts.post_status = 'publish'

Running this query in phpMyAdmin gives us an option to exporting as a csv giving almost 20,000 blog posts (one per row) to play with which is plenty.

The refinement

apply-operationsUsing Google Refine we can import the csv file choosing comma separated and headers in first row. Next we want to process the post content to extract outbound links. Below a the rough steps I used which you can download and apply to your own project (the steps in the file vary slightly as I noticed that the guid links had ?p={number} instead of the post permalink. I ended getting these when fetch the comment feed and extracting the link. More detail about this in the next post. Note to self: issue with non-permalinks).

  1. Create column post_as_date at index 3 based on column post_date using expression grel:value.toDate(“Y-m-d H:m:s”)
  2. Create column outbound_links at index 5 based on column post_content using expression grel:forEach(value.parseHtml().select(“a[href~=]“),e,e.htmlAttr(“href”)).join(“|||”)
  3. Split multi-valued cells in column outbound_links
  4. Create column source_links at index 9 based on column guid using expression grel:row.record.cells["guid"].value[0]

The key step is 3. which mines all the posts for <a href> tags and extracts them into a new column joining them together with ‘|||’

image

Using Refine we can split the outbound_links column to put one link per row (column dropdown, Edit cells > Split multi-valued cells). The last step is to make sure the post source link is included in each row using fill down the right and secure way. Filtering out blank outbound_link rows and Export using Custom tabular exporter (setting I used here) I get some data that looks like this:

The data meets Gephi and NodeXL

With this data we can open in Excel and then import into NodeXL using Import > From Open Workbook, more about that later, or alternatively using Gephi change the column headings in the .csv version to source and target and import via the ‘data laboratory’ tab (for graph manipulation in Gephi Tony Hirst has a great tutorial).

Sticking with Gephi for now you can generate the images below. On the left is the entire graph which depicts the ~59,000 links contained in ~20,000 ds106 blog posts. The image on the right is zoomed in to the inset where we can start seeing each webpage referenced as a dot connected by a line which is a link in a post. Whilst this is pretty, it’s also pretty useless.

All links from ds106 links Close-up of post links

So where to go with this? In the paper A Theoretical Proposal of Learning Communities Management Through the Analysis of the Organizational Blogosphere the authors propose using the measure of ‘betweenness centrality’ (BC) as a way of identifying influential learners within a blogging collective:

The betweeness centrality has been considered in literature (Marsden, 2002) as a way to find the most valuable nodes within a social network. The strategic function of these characters is fundamental for the system of learners, because they have the widest view of what happens in the net. This is due to the fact that they intercept the majority of the knowledge and information flows within the learning community. The people with the highest betweeness centrality can share their vision on the main knowledge claims of the community in an open debate.

Extrapolating this slightly lets consider individual blog posts rather than individual learners, using BC as a way to highlight posts that may bridge concepts, or support the community of learners in some way. Rather than examine all links in the blog posts lets first consider internal linking within the ds106 community.

To do this lets turn back to NodeXL which I find a lot easier to get data in and manipulate. Opening our export file from Google Refine and then using Import > From Open Workbook we can import all the data source and target being edge properties, the rest vertex 1 properties. This gives you over 75,000 connections between over 59,000 links, back at square one. A nice feature of NodeXL is to tell it to ‘skip’ some data (edge, vertex or group). The data is kept in the spreadsheet so you can include it later but it’s not used for any calculations. I only want to analysis ds106 posts that link to other ds106 posts. Here’s brief notes on the way I did it (other ways are possible and might even be better).

Autofill options

On the ‘Edges’ sheet in the ‘Other columns’ section I created 4 columns named: Community Link, Self Link, Combined and

Internal Link, then added the following formula in each column:

  • =IF(ISERROR(VLOOKUP([@[Vertex 2]], Vertices!A:AF,30 ,FALSE)),0,IF(LEN(VLOOKUP([@[Vertex 2]], Vertices!A:AF,30, FALSE))>12,1 ,0)) – because we only have dates, titles, names for posts made by the ds106 community, if this detail is blank on the vertices sheet then the edge is outwith the community (if it exists enter 1, otherwise 0) Tip: after doing this I copied all the Community Link values and Paste Special, Values to save recalculation on edit/open.
  • =IF([@[Vertex 1]]=[@[Vertex 2]],0,1) – test if edge is self-link
  • =[@[Community Link]]*[@[Self Link]] – multiplying these together as a simple operator
  • =IF(ISERROR(OR(FIND(“/”,[@[Vertex 1]],12), FIND(“/”,[@[Vertex 2]],12))), 0, IF(LEFT([@[Vertex 1]], FIND(“/”,[@[Vertex 1]], 12))=LEFT([@[Vertex 2]], FIND(“/”,[@[Vertex 2]],12)), 1,0)) – used to test if edge is two posts by the same author

 

Using the Autofill we can chose to skip edges based on the ‘Combined’ value (and while we are at it set the edge style based on the ‘Self Link’ value). Once this is done you can use NodeXL to calculate Groups and Metrics knowing that it’ll skip the edges it doesn’t need.

Once we’ve done all of this and applied some chart options this is the graph we get.

ds106 interlinked blog posts

Taking a closer look at one of the groupings (G1) shown below we can see how ds106 blog posts (the dots) link to each other, a dotted line indicating it’s the same post author referring to their own work.

Group G1

The first impression

We now have a dataset of ds106 blog posts and the webpages that they individually link to. There are some interesting patterns and more work to be done to interpret these. Whilst this technique has surfaced interlinks within the community there is no context in which they are made. Ranking the posts by betweenness centrality we get the following top 10:

  1. http://bavatuesdays.com/the-daily-shoot-and-manic-ravings-about-ds106/
  2. http://bavatuesdays.com/innovation-in-elearning-interview/
  3. http://cogdogblog.com/2011/04/13/do-you-feel-lucky-wabbit/
  4. http://thisevilempire.com/blog/?p=442
  5. http://bavatuesdays.com/the-ds106-99-15-suburbia/
  6. http://bavatuesdays.com/the-ds106-99-13-valley-girl/
  7. http://gforsythe.ca/2011/04/15/ds106radio-memory-lane/
  8. http://bavatuesdays.com/week-12-13-assignment-el-mashup/
  9. http://cogdogblog.com/2012/02/19/week-5-in-review/
  10. http://cogdogblog.com/2012/02/14/calling-card-bullitt/

It’s interesting that most of these come from tutors (although given the amount of co-creation the role of tutor/student is very blurred), which in some ways isn’t that surprising given they will be summarising and highlighting key work.

The Limitations

Hopefully by this point you are screaming at the screen because of the limitations of this analysis. The focus has been on blog post content, which I hope I’ve shown can be used to identify interesting moments. The big limitation here is it’s looking at a narrow slice of activity – how students are linking to blog posts without any real context. Another staple of blogs is comments. Comments provide an easy way for a reader to feedback to the author continuing the dialogue sharing thoughts, ideas and  reflections. It’s also all well and good me doing this but the real interesting thing would be to put this graphical overview in the hands of tutors and students to see if it helps them. I imagine it wouldn’t be hard to wrap the MySQL query initially used in some PHP and visualisation library like d3.js and provide users with some situational awareness of what is going on in their community. If you would like to explore the data the NodeXL file is here for download.

As part of this post was also about the feasibility of using FeedWordPress to aggregate JISC project blogs here are some of my notes:

  • Exporting post content was possible with MySQL access (interfaces would be easy to design)
  • Issue encountered with how post links stored (blogspot/blogger feeds use a internal guid=, where as other guids are links to the original posts). This was solved by fetching the comment feed and parsing the page link from the response
  • Need to be aware that FeedWordPress can’t get content for partial rss feeds but plugins are available
  • Registering a feed is a manual process
  • Issue with mixed-mode feeds (if you are automatically pulling in all content if the feed source has a mixed use e.g. blog is mixture of personal and work, then data isn’t as clean)
  • Getting the data into Google Refine made it easy to mine post content e.g. in one expression I’d calculated that ds106 has published over 4 million words

There are other technicalities of FeedWordPress I’d want to explore like how post edits are handled. Overall it was a lot easier having a single data source.

Your Thoughts

At this point it would be very useful to have your thoughts on this technique. And remember I’m a ‘maker of things’ so your suggestions may get incorporated into a functional tool ;)

In my next post in this series I’ll share my notes on building and analysing comments data.

Notes from the Twitter backchannel at eAssessment Scotland 2012 #eas12

On Friday (31st August) I was at eAssessment Scotland 2012. This is an event I’ve had a long running involvement with and it’s always a pleasure to travel up to Dundee with around 300 other delegates to attend one of the premier eAssessment events in the UK (if not Europe the world). Sheila was also at the conference at has already posted some notes and there isn’t much for me to add (although I do want to write something on Open Badges after attending Doug Belshaw’s session). Instead I wanted to have a quick look at the conference twitter backchannel.

The archive of 1196 tweets taken between 23rd August and 3rd September has contributions from 184 twitter accounts. The median number of tweets per account was 2.

imagechart4

As well as monitoring the number of tweets being posted during the day, following on from my work on analysing threaded Twitter discussions from large archives using NodeXL #moocmooc, I included the number threaded replies* in the archive. Of the 1,196 #eas12 tweets 10% (n.115) were threaded replies. I was able to monitor this in real-time with the graph below which is embedded in the dashboard.

*the metadata from Twitter includes a ‘in reply to id’ field which identifies if the tweet is part of a thread. This isn’t 100% accurate as it is only recorded if the user uses a reply button.

Graph of tweets and threaded replies at #eas12
Graph of tweets and threaded replies at #eas12

At the time I commented:

interesting that #eas12 tweeters are in broadcast mode (red line bottom left graph) docs.google.com/spreadsheet/cc…

— Martin Hawksey (@mhawksey) August 31, 2012

which was responded to (as a threaded reply which creates all sorts of problems from Twitter’s embedded tweet option):

@mhawksey What does that say about us then… (/makes effort to think, edit and RT…) :D #eAS12

— Derek Jones (@plug103) August 31, 2012

@plug103 perhaps if the person at the front is in broadcast mode then it’s okay for us to be our own individual mini-broadcasters ;) #eas12

— Martin Hawksey (@mhawksey) August 31, 2012

So how does this compare with #moocmooc. Hopefully the graph below illustrates this (interactive version here), but in numbers the ‘#moocmooc archive contains 6,883 tweets where 44% (n.3046) are conversation threads.

Graph of tweets and threaded replies at #moocmooc
Graph of tweets and threaded replies at #moocmooc

It’s perhaps not surprising that different modes of twitter usage produce different conversation patterns. Next week I’ll be at ALT-C 2012 in Manchester where I’ll also be monitoring the twitter backchannel. This is a 3 day event aimed at learning technologists so it’ll be interesting to see if a different pattern of usage emerges.

Something else I’m reminded preparing the TAGSExplorer view of eas12 (graph of replies – solid lines, and in this case mentions – dashed lines, between twitter account using eas12 hashtag) is that there is more interaction going on that could be analysed (there’s a PhD in all of this – if Twitter keep the data flowing).

image

Other links

How JISC CETIS dashboard social activity around blog posts using a splash of data science

Warning: Very techie post, the noob version is here Google Spreadsheet Template for getting social activity around RSS feeds HT @dajbelshaw

In my last post announcing a Google Spreadsheet Template for getting social activity around RSS feeds I mentioned it was built from spare parts. It came out of on-going work at CETIS exploring activity data around CETIS blogs (including this one) and JISC funded projects. In this post I’ll highlight some of the current solutions we are using and how they were developed.

To put this work into context it aligns with my dabbling’s as a data scientist.

A data scientist … is someone who: wants to know what the question should be; embodies a combination of curiosity, data gathering skills, statistical and modelling expertise and strong communication skills. … The working environment for a data scientist should allow them to self-provision data, rather than having to rely on what is formally supported in the organisation, to enable them to be inquisitive and creative. [See this post by Adam Cooper for the origins of this definition]

It’s perhaps not surprising that the main tool I use for these explorations with data is Google Spreadsheets/Google Apps Script. The main affordances are a collaborative workspace which can store and manipulate tablature data using built-in and custom functionality. To illustrated this let me show how easy it is for me to go from A to B.

Example 1: OER Rapid Innovation Projects – Blog post social activity

I’ve already posted how we are Using Google Spreadsheets to dashboard project/course blog feeds #oerri. For this we have a matrix of project blog feeds and a set predefined categories/tags. As projects make different posts a link to them appears on the dashboard.

OERRI Project Post Directory

As part of the programme completion JISC are piloting a new project survey which will include reporting the ‘reach of the papers/articles/newsitems/blogposts out of the project’. Out of curiosity I wanted to see what social share and other activity data could be automatically collected from project blogs. The result is the page shown below, which is also now part of the revised OERRI Project Post Directory Spreadsheet. The important bit is the columns on the right with social counts from 10 different  social network services. So we can see a post by Xenith got 29 tweets, a post by bebop got 5 comments, etc.

Caveat: these counts might not include modified urls with campaign tracking etc …

OERRI Post Social Activity

How it works

The three key aspects are:

  • Getting a single sheet of blog posts for a list of RSS feeds.
  • Getting social counts (Likes, Tweets, Saves) for a range of services.
  • Allowing columns sorting on arrays of results returned by custom

Getting a single sheet of blog posts
You can use built in formula like importFeed to do this, but you are limited to a maximum 20 post items and you need to do some juggling to get all the results from 15 projects in one page. An alternative, which I quickly ruled out was fetching the rss xml using Google Apps Script and parsing the feed. The issue with this one is RSS feeds are usually limited to the last 10 posts and I wanted them all.

The solution was to use the Google Feed API (can’t remember if this is on Google cull list). The two big pluses from this are it can pull historic results (up to 100 items), and results can be returned in JSON which is easy to work with in App Script.

Here’s what the function looks like which is called as a custom formula in cell Data!A4 (formula used looks like =returnPostLinks(Dashboard!A3:B17,”postData”)):

function returnPostLinks(anArray,cacheName){
  var output = [];
  var cache = CacheService.getPublicCache(); // using Cache service to prevent too many urlfetch 
  var cached = cache.get(cacheName);
  if (cached != null) { // if value in cache return it
    output = Utilities.jsonParse(cached);
    for (i in output){
      output[i][3] = new Date(output[i][3]); // need to recast the date
    }
    return output; // if cached values just return them
  }
  // else make a fetch
  var options = {"method" : "get"};
  try { 
    for (i in anArray){ // for each feed url
      var project = anArray[i][0]; // exact the project name from 1st col
      if (project != ""){ // if it's not blank
        var url = "https://ajax.googleapis.com/ajax/services/feed/load?v=1.0&num=100&q="+anArray[i][1];
        var response = UrlFetchApp.fetch(url , options);
        var item = Utilities.jsonParse(response.getContentText()).responseData.feed.entries; // navigate to returned items
        for (j in item){
          // for each item insert project name and pull item title, link, date, author and categories (other available are content snippet and content)
          output.push([project,item[j].title,item[j].link,new Date(item[j].publishedDate),item[j].author,item[j].categories.join(", ")]);
        }
      }
    }
    // cache the result to prevent fetch for 1 day
    cache.put(cacheName, Utilities.jsonStringify(output), 86400); // cache set for 1 day
    return output; // return result array
  } catch(e) {
    //output.push(e);
    return output;
  }
}

 

Getting social share counts

Collecting social share counts is something I’ve written about a couple of times so I won’t go into too much detail. The challenge is this project was to get results back from over 100 urls. I’m still not entirely sure if I’ve cracked it and I’m hoping more result caching helps. The issue is because the custom formula used in this appears in each row of column H on the Data sheet, when the spreadsheet opens it simultaneously tries to do 100 UrlFetches at the same time which Apps Script doesn’t like (Err: Service has been invoked too many times). {I’m wondering if away around this would be to set a random sleep interval (Utilities.sleep(Math.floor(Math.random()*10000))… just tried it and it seems to work}

Sorting columns when arrays of results are returned by custom formula

Sort order selectionWhen you use spreadsheet formula that write results to multiple cells sorting becomes a problem because the cells are generated using the CONTINUE formula which evaluates a result from the initiating cell. When you sort this reference is broken. The solution I use is to pull the data into a separate sheet (in this case Data) and then use the SORT formula in the ‘Post SN Count’ sheet to let the user choose different sort orders. The formula used in ‘Post SN Count’ to do this is in cell A4: =SORT(Data!A4:Q,D1,IF(D2=”Ascending”,TRUE,FALSE)). To try and prevent this being broken an ‘Item from list’ data validation is used with the ‘Allow invalid data’ unchecked.

Time to hack together: functional in 2hrs, tweaks/cache fixes 4hrs

Example 2: JISC CETIS Blog Post Dashboard

Having got social share stats for other peoples feeds it made sense for the JISC CETIS blogs. At the same time because I keep my blog on my own domain and Scott (Wilson) is on wordpress.com it was also an opportunity to streamline our monthly reporting and put all our page views in one place.

Below is a screenshot of this de-page viewed version of the CETIS dashboard (the code is all there I’ve just haven’t authenticated it with our various Google Analytic accounts). With this version the user can enter a date range and get a filtered view of the posts published in that period. Sort and social counts make a re-appearance with the addition of an ‘Examine Post Activity’ button. With a row highlighted, clicking this gives individual post activity (shown in the second screenshot). The code for this part got recycled into my last post Google Spreadsheet Template for getting social activity around RSS feeds.

CETIS Post Dashboard

Individual post activity

Getting to B

To get to this end point there are some steps to pull the data together. First the ‘Sources’ sheet pulls all our blog urls from the CETIS contacts page using the importHtml formula. Next on the ‘FeedExtract’ sheet the array of blog urls is turned into a sheet of posts using the same code in example 1. Social counts are then collect on the ‘Data’ sheet which is read by the Dashboard sheet.

A detour via page views

Having some existing code to get stats from Google Analytics made it easy to pass a url to one of two GA accounts and get stats back. Because Scott is on wordpress.com another way was required. Fortunately Tony Hirst has already demonstrated a method to pull data from the WordPress Stats API into Google Spreadsheets. Following this method stats are imported to a ‘Scott’sStats’ sheet and tallied up with a code loop (WordPress stats come in on per post, per day).

WordPress Stats API import

Here’s the custom formula code to return page views for a url:

function getCETISPageViews(url, startDate, endDate){
  // format dates
  var startDate = Utilities.formatDate(startDate, "BST", "yyyy-MM-dd");
  var endDate = Utilities.formatDate(endDate, "BST", "yyyy-MM-dd") ;
  // extract domain and path
  var matches = url.match(/^https?\:\/\/([^\/:?#]+)(?:[\/:?#]|$)/i);
  var domain = matches && matches[0];
  var path = url.replace(domain,"/");
  // switch for google analytic accounts
  var ids = false;
  if (domain == "http://mashe.hawksey.info/"){
    ids = ""//ga-profile-id  
  } else if (domain == "http://blogs.cetis.ac.uk/"){
    ids = ""//ga-profile-id;
  } else if (domain == "http://scottbw.wordpress.com/"){
    // code to compily stats imported to sheet
    // get all the values
    var doc = SpreadsheetApp.getActiveSpreadsheet();
    var sheet = doc.getSheetByName("Scott'sStats")
    var data = sheet.getRange(1, 4, sheet.getLastRow(), 2).getValues();
    var count = 0;
    // count on url match
    for (i in data){
      if (data[i][0] == url) count += data[i][1];
    }
    return parseInt(count);
  }
  if (ids){
    // GA get data using sub function
    return parseInt(getGAPageViews(startDate, endDate, path, 1, ids));
  }
}

Note: I use a custom Google Analytics function available in the spreadsheet script editor. There’s a new Analytics Service in Apps Script as an alternative method to connect to Google Analytics.

Time to hack together: functional in 3hrs, tweaks fixes 1hr

Summary

So there you go two examples of how you can quickly pull together different data sources to help us record and report the social reach of blog posts. You’ll notice I’ve conveniently ignored whether social metrics are important, the dangers of measurement leading to gaming, the low recorded social activity around OERRI projects. I look forward to your comments ;)

Google Spreadsheet Template for getting social activity around RSS feeds

The ‘Deprecated notice’ is because the access to the Topsy API (which is used to extract Twitter activity) is now behind a developer key.

A post from Alan Levine (@cogdog) on Syndication Machines (plus syndication of syndication) which details how feeds for the cMOOC course ds106 are aggregated and resyndicated got me thinking if there was additional data around student blog posts that could be usefully captured. Previously for something like that I would have turned to PostRank Analytics, which allowed you to specify any rss feed and it would aggregate social activity like tweets, bookmarks from a wide range of services and let you see it all in one place. Unfortunately PostRank were bought by Google and while some of this data is now accessible from Google Analytics it’s restricted to your account and data from social networks Google isn’t directly competing with.

So I thought it would be useful/interesting to start looking at what social activity you could pull together without authenticated access or API keys. So far I’ve identified couple and given some overlap with other work (and some late night play time), I’ve come up with a Google Spreadsheet template which pulls data together from comment feeds, Twitter and Delicious (with social counts for these plus Facebook, LinkedIn and Google+). You can give the spreadsheet a try with the link below:

*** Blog Activity Data Feed Template ***

Blog Activity Data Feed Template Overview

Features

  • Collects last 10 posts from a RSS feed
  • Uses sharedcount.com to get overall post share counts from Facebook, Twitter, Google+, LinkedIn and Delicious
  • For supported blogs (mainly manila WordPress) extracts comment snippets
  • Uses Topsy to collect tweets mentioning the post
  • Fetches all the Delicious bookmarks for the post url
  • Summarises activity from comments, tweets and bookmarks on the dashboard

You can make as many copies of the template as you like to track other RSS feeds.

If you prefer a slightly different overview then the RSS Feed Social Share Counting Google Spreadsheet gives a table of share counts for a feed

*** RSS Feed Social Share Counter ***

Social share count matrix

Technical highlights

There are a couple of cool things under-the-hood for the Activity Data template worth noting.

Managed Library = Easy to add more sources

So far I’ve limited the template to sources that have easy access. Previously if an API disappeared, changed or I discover another service I would have to modify the template and users would have to make a fresh copy. With Managed Libraries the Google Spreadsheet template now only needs 11 lines of code (shown below) for 3 custom formulas (see cells Dashboard!C8, and #n!G2 and #n!B8).

function getSharedCountActivity(url){
  return socialActivity.getSharedCount(url);
}

function getItemFeed(url, num){
  return socialActivity.getItemRSSFeed(url, num)
}

function getActivityFeed(url){
  return socialActivity.getActivityFeed(url);
}

Because the template uses my own socialActivity library and because I’ve left the template in Development Mode when I update the code changes should (I hope) filter to all the existing templates in use. This means if I can add more services with open APIs details should start appearing on the post activity pages (denoted with # at the beginning).

Fun with sparklines

Activity SparklinesIf you were wondering how I did the activity sparklines, here’s how. Sparklines are a built-in Google Spreadsheet formula (here’s the  reference). To get them to work you need some data if you want some options. For the data all you need is an array of values and sparkline will plot them equally spaced.

In this use case I’ve got comments, tweets, and bookmarks with date/times. To plot the sparkline I need to know how many of these were made between specific time periods. Another function we can use to get this data is FREQUENCY. This lets us specify some intervals and source data and returns an array of the frequency of data points.

One issue is that for each post the activity range varies, so one post might only have activity in July, and another in August. There are other formula (MIN/MAX) to get the range values. Here’s how the final solution was pieced together.

On each post sheet (prefixed with ‘#’) in cells A1:A20 are some formula to calculate date intervals based on the minimum and maximum date ranges. Frequency data used for each sparkline is then calculated inside the sparkline formula by indirectly looking at the data in each post sheet. The indirectly part is handled by the INDIRECT formula which evaluates a cell reference from other cell values. So on row 9 of Dashboard INDIRECT(B9&”!D9:D”) evaluates to ‘#1!D9:D’, row 10 is ‘#2!D9:D and so on. I could have hardcoded these cell references but I prefer the ‘clever’ way. Here’s part of the final formula with a breakdown of what it does:

SPARKLINE(FREQUENCY(INDIRECT(B9&”!D9:D”),INDIRECT(B9&”!A1:A20″)),{“charttype”,”line”;”ymin”,-1;”ymax”,MAX(FREQUENCY(INDIRECT(B9&”!D9:D”),INDIRECT(B9&”!A1:A20″)))+1} )

  • INDIRECT(B9&”!D9:D”) – Frequency data source built using INDIRECT which evaluates a cell reference from other cell values. On row 9 of Dashboard this evaluates to ‘#1!D9:D’, row 10 is ‘#2!D9:D
  • INDIRECT(B9&”!A1:A20″) – similar trick to get an interval array for the FREQUENCY formula
  • FREQUENCY(INDIRECT(B9&”!D9:D”),INDIRECT(B9&”!A1:A20″)) – gives our sparkline data array
  • MAX(FREQUENCY(… – get the maximum value from the frequency data just so we can add 1 to it to give a margin on the sparkline

Summary

Hopefully you’ll enjoy and if you can think of any improvements or other services to tap into just get in touch.

About

This blog is authored by Martin Hawksey Google+

JISC CETIS Learning Technology Advisor (OER Programme Support)
jisc cetis logo

The MASHezine (tabloid)

It's back! A tabloid edition of the latest posts in PDF format (complete with QR Codes). Click here to view the MASHezine

Preview powered by:
Bluga.net Webthumb

The MASHebook

You can also download this post as:

Subscribe to monthly email digest of posts

Loading...Loading...


Subscribe to per post email updates

Enter your email address:

Delivered by FeedBurner

Copyright License

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 3.0 Unported License. CC-BY mhawksey

Privacy /Cookies

This blog uses Google Analytics (which makes use of 'cookie' technologies) to provide information on usage. Here's an overview of Google Analytics Privacy and how to opt-out (other 3rd party services like Twitter might also be tracking you via this site, but as far as possible I try and prevent this by removing official tweet buttons).

Badges

. . .