1 Comment

Twitter has recently frustrated a number of developers and mashup artists moving to tighter restrictions on it’s latest API. Top of the list for many are all Twitter Search API requests need to be authenticated (you can’t just grab and run, a request has to be via a Twitter account), removal of XML/Atom feeds and reduced rate limits. There are some gains which don’t appear to be widely written about so I’ll share here

#1 Get the last 18,000 tweets instead of 1,500

Reading over the notes for the latest release discussion/notes for NodeXL I spotted that

you now specify how many tweets you want to get from Twitter, up to a maximum of 18,000 tweets

Previously in the old API the hard limits were 1,500 tweets from the last 7 days. This meant of you requested a very popular search term you’d only get the last 1,500 tweets making any tweets made earlier in the day inaccessible. In the new API there is still the ‘last 7 days’ limit but you can page back a lot further. Because the API limits to 100 tweets per call and 180 calls per hour this means you could potentially get 18,000 tweets in one hit. If you cache the maximum tweet id, wait an hour for the rate limit to refresh you could theoretically get even more (I’ve removed the 1.5k limit in TAGSv5.0, but haven’t fully tested how much of the 18k you can get before hit by script timeouts).

#2 Increased metadata with a tweet

Below is an illustration of the data returned in a single search result comparing the old and new search API.

Old and new Search API responses

If you look at the old data and the new data the main addition is a lot more profile data. A lot of this isn’t of huge interest (unless you wanted to do a colour analysis of profile colours), but there is some useful stuff. For example in this example I have profile information for the original and retweeter. as well as friend/follower counts, location and more (I’ve already shown how you can combine this data with Google Analytics for comparative analysis).

Whilst I’m sure this won’t appease the hardcore Twitter devs/3rd party for hackademics like myself grabbing extra tweets and more rich data has it’s benefits.

1 Comment

At 4pm GMT today (see in your own timezone) I'll be doing a Google Hangout On Air with Marc Smith one of NodeXL’s creators and coordinators. NodeXL is a handy free add-in for MS Excel (Windows) allowing you to generate, explore and analyse network graphs. I started using NodeXL a couple of years ago as I found it was an easy way to extract a social graph (friend/follower relationships) from Twitter. In the session we'll go through extracting a social graph from a Twitter hashtag and analysing the results.

You can view/participate in a number of ways. I'll embed the livestream in this post, or you can see/join the hangout when it appears in my Google+ timeline. You can ask question via Twitter or Google+ using the combined hashtags #lak13 and #nodexl  e.g. ‘What the best place to find out more about NodeXL? #lak13 #nodexl’ (this is a LAK13 fringe event)


In my last post on Canvas Network Discussion Activity Data I mentioned I was a little disappointed to not be able to use social network analysis (SNA) modelling techniques on the Canvas Network discussion boards. My main barrier was accessing the data via the Canvas API using my preferred toolset. Fortunately Brian Whitmer at (the company behind Canvas) posted a comment highlighting that as a student on the course it was easy for me to get access to this data using a token generated on my Canvas profile page. With this new information in this post I’ll cover three main areas:

  • a very quick introduction into techniques/opportunities for analysing threaded networks using SNA;
  • how I retrieved data from the Canvas platform for the #LAK13 discussions; and finally
  • some analysis using the NodeXL add-on for Microsoft Excel (Windows).

On Friday 1st March at 4pm GMT I’ll also be having a live Hangout on Air with Marc Smith, one of the original creators and continued project coordinator of NodeXL. The live feed will be embedded below embedded here and you can ask question via Twitter or Google+ using the combined hashtags #lak13 and #nodexl  e.g. ‘What the best place to find out more about NodeXL? #lak13 #nodexl’. For the hangout session we’ll look at how easy it is to use NodeXL to analyse a Twitter hashtag community in a couple of clicks. [The rest of this post is introducing a more advanced use of NodeXL so if I lose you in the rest of this post fear not as the session will be a lot easier going]

Opportunities for analysing threaded networks using SNA

Hello graph

WiredUK friend/follower graphOne of the dangers of SNA is people see images like the one to the right and miss the point of using this modelling technique (the image is from one of my first goes at using NodeXL so I’d be the first to admit there’s room for improvement). If you do there are couple of things to bear in mind. Often these images are exhaust, generated as a snapshot of a visual and/or exploratory analytic. A certain level of literacy is required to understand the underlying structure of the graph. Taking this last point a threshold concept for me was understanding that a basic concept of these graphs are more often than not constructed from an edge list which is essentially two columns of data representing and start point and end point to a node. For example, all the names in columns Vertex 1 and Vertex 2 are nodes and each row represents an edge between the nodes so in this case Beth –> Adam generates the highlighted edge.

Basic edge list and force layout

There is a lot more to understand about these types of graphs, but this basic concept means I know if I have any relationship data its easy to graph and explore.


At this point it’s worth mentioning the browser plugin SNAPP.

The Social Networks Adapting Pedagogical Practice (SNAPP) tool performs real-time social network analysis and visualization of discussion forum activity within popular commercial and open source Learning Management Systems (LMS). SNAPP essentially serves as a diagnostic instrument, allowing teaching staff to evaluate student behavioural patterns against learning activity design objectives and intervene as required a timely manner.

Valuable interaction data is stored within a discussion forum but from the default threaded display of messages it is difficult to determine the level and direction of activity between participants. SNAPP infers relationship ties from the post-reply data and renders a social network diagram below the forum thread. The social network visualization can be filtered based upon user activity and social network data can be exported for further analysis in NetDraw. SNAPP integrates seamlessly with a variety of Learning Management Systems (Blackboard, Moodle and Desire2Learn) and must be triggered while a forum thread is displayed in a Web browser.

The social network diagrams can be used to identify:

  1. isolated students
  2. facilitator-centric network patterns where a tutor or academic is central to the network with little interaction occurring between student participants
  3. group malfunction
  4. users that bridge smaller clustered networks and serve as information brokers

The paper referencing SNA research supporting these areas was presented at LAK11 (if you don’t have access also available in the presentation’s slidedeck).  The paper Visualizing Threaded Conversation Networks: Mining Message Boards and Email Lists for Actionable Insights (Hansen, Shneiderman & Smith, 2010) also highlights simple ways to identify question people, answer people and discussion starters which are all potentially very useful within courses for identifying network clusters individuals might want to join/follow.

Retrieving data from Canvas

Hopefully with that quick intro you can see there might be some value in using SNA from threaded discussion analysis. Reading the SNAPP overview hopefully you spotted that it currently doesn’t support extracting data from Canvas discussion boards. This is an opportunity to understand some of the analysis SNAPP is doing behind the scenes.

Hello Google Apps Script

If you have been following my posts you’ll see that I favour using Google Apps Script as a lightweight tool for extracting data. Thanks to Brian (Instructure) I’ve got a way to access the Discussion Topics API. Looking at the API documents I decided the best way to proceed was to get all of the LAK13 discussion topics (top level information) and use this to get the full topic data. If you speak JSON we are essentially turning this:

Data as rendered in Canvas

into this:

Data in JSON

finally getting this (web version here):

Data in spreadsheet

The code to do this is available here. I’m going to spare you the details of the code but here are the instructions is you’d like to export data from other Canvas hosted discussion boards. If you’re not interested in that you can just jump to the next section.

Generating an edge list (extracting data) from Canvas to Google Sheets

  1. Create a new Google Spreadsheet and then in Tool > Script editor copy in the code from here
  2. If you are not pulling data from LAK13 you need to edit values in lines 2-4. If you visit your course homepage hopefully you can decode the url pattern based on the example for LAK13 (I should also point out you need to be enrolled on the course to receive data. Also read Canvas API Policy)
  3. Next you need an access token which is generated from your Canvas Profile Settings page. Scroll down to the bottom and click New Access Token, filling in a purpose and leaving expires blank. Make a copy of the token as it’s needed for the next step (I added a copy to a .txt file just in case the next step didn’t work
  4. Back in the Script Editor in Google Spreadsheets click File > Project Properties. In the ‘Project properties’ tab click  ‘+ Add row’ and replace (name) with access_token and (value) with the token you got from Canvas before clicking Save
    Project Properties
  5. Make sure everything is saved in the Script Editor and then Run > getCanvasDiscussionEdges, wait for the script to finish and on Sheet1 you should have a bunch of data to play with.

Using NodeXL to analyse Canvas Discussions

There are a number of different questions we could ask of the LAK13 data. The particular one I want to look at is who are the core community members stimulating/facilitating discussion (e.g. applying a connectivist theory who are the people you might want to connect with). To do this we need to (I’m assuming you’ve already installed NodeXL):

  1. Download the data extracted to the Google Spreadsheet (File > Download as > Microsoft Excel). [If you just want the data I’ve extracted here’s the download link – the data is automatically refreshed nightly]
  2. Open the download file in Excel and in the created_at column select all and Format Cells as General (I needed to do this because NodeXL was miss formating dates on import)
  3. Start a new NodeXL Template (I use the NodeXL Excel Template option from my windows Start menu)
  4. From the NodeXL ribbon you want to Import > From Open Workbook
    Import > From Open Workbook
  5. In the import dialog vert1_name and vert2_name are edges, anything else prefixed with ‘vert’ is assigned to the corresponding Vertex n property column and everything else is an  Vertex 1 property:
    import dialog
  6. Once imported you can open the Edges sheet, select the created_at column and Format Cells reassigns a date/time format.
  7. In Prepare Data chose ‘Count and merge duplicate edges’ and select Count and Vertex1 and Vertex 2
    Count and merge duplicate edges
  8. In the Graph section of the NodeXL ribbon we want to make this a directed graph (replies are directed) and choose you layout algorithm (I usually go Harel-Koren Fast Multiscale)
    make this a directed graph
  9. Next we want to prepare the data we want to analyse. In the Autofill Columns (within Visual Properties portion of the ribbon) set Edge Visibility to ‘topic_id’ and in Edge Visibility Options set ‘If the source column number is: Not equal to 558’ Show otherwise Skip (this will skip edges that are responses to the Pre-course discussion forum – I’ll let you question this decision in the comments/forum)
    Edge Visibility Options
  10. Click Ok then Autofill
  11. Next open the Vertices sheet and select all the rows (Ctrl+A) and from the Visibility option select ‘Show if in an Edge’
    Show if in an Edge
  12. Now find the row in the Vertices sheet for George Siemens and Skip (doing this were creating a ‘what if George wasn’t there’ scenario
    ‘what if George wasn’t there’ scenario
  13. Open the Graph Metrics window and add Vertex in-degree, vertex out-degree, Vertex betweenness and closeness centrality and Top items (in the Top items options you’ll need to add these as the metrics you want top 10s for), finally click Calculate metrics.
    Graph Metrics window

At this point you could use the calculated metrics to weight nodes in a graph, but for now I’m going to skip that. You should now have a Top Items sheet with some useful information. In the Betweenness Centrality list you should have these names:

  • Martin Hawksey
  • Simon Knight
  • Alex Perrier
  • Khaldoon Dhou
  • Rosa Estriégana Valdehita
  • Maha Al-Freih
  • Suzanne Shaffer
  • Maxim Skryabin
  • Bryan Braul
  • Peter Robertso

Excluding the pre-course discussion forum and George Siemens the discussions these people have engaged with provide the shortest paths to other people engaging in discussions on the Canvas site. Strategically these are potentially useful people  within the network that you might want to follow, question or engage with.

Getting to this point obliviously hasn’t been straight forward and had SNAPP been available in this instance it would have turned this in to a far shorter post. Programmatically using tools like R we could have arrived at the same answer with a couple of lines of code (that might be my challenge for next week ;). What it has hopefully illustrated is if you have data in an edge format (two column relationships) tools like NodeXL make it possible for you use SNA modelling techniques to gain insight. (I’m sure it also illustrates that data wrangling isn’t always straight forward, but guess what that’s life).

NodeGL for LAK13 discussionsSo at this point some of you might be wondering what does the LAK13 reply network look like. I could give you a flat image but why don’t you use my NodeXL online graph viewer to explore it yourself or download the data from the NodeXL Graph Gallery.


In a recent post I showed that how given some urls it was easy to generated a templated report as a word document. This was partly done to show how blog posts from a JISC funded programme could be used to generate reports. One of the issues with this solution is not all the projects used WordPress as their blogging platform making it harder to get at some of the data. One thought I had was rather than dictating a particular platform JISC could aggregate all programme posts in a central database. This isn’t entirely new as at CETIS we already aggregate links to posts for selected programmes via our PROD database, the difference here would be as well as a link the post content would also be ingested making it easy to query and analysis the data from a single source. As I recently wrote about in Notes on technology behind cMOOCs: Show me your aggregation architecture and I’ll show you mine this technique is common in cMOOCs, and in particular highlighted how Digital Storytelling (DS106) do this already using the FeedWordPress plugin.

As part of a double hit I thought I’d see if there was a case for JISC implementing this type of infrastructure, and because I’ve got an interest in applying learning anlaytics to open online courses, have a look at the data stored by ds106 and see what analysis I could do. In the next series of posts I share my notes and sketches around extracting FeedWordPress data, supplement it with 3rd party sources and analysing dialog/activity around posts. In this first post I’m going to outline how all the data was imported into a local WordPress database, queried to pull out a data slice, refined and mined, before visualisation/analysis. What we end up with is a view of how ds106 blog posts are interlinked by hyperlinks within the post body. For the next in the series I record some notes on comment extraction and finally how to get more activity data from social network sites.

The data source

If JISC were hosting the data then in theory it would be  a lot easier to get to. As my test data comes from ds106 some extra preparatory work was required. Fortunately #ds106er Alan Levine (@cogdog) was able to provide me with a MySQL dump of data from their WordPress install. The dump contained over 700,000 lines of SQL commands and was over 500Mb. Problem one was then how to get this back into a local MySQL database to make slice, dice and export easier.

For a local Apache/MySQL/PHP server I used  XAMPP Lite. Importing the data wasn’t straight forward as despite tweaking the config kept having ‘allocated memory’ errors. The solution was ‘BigDump: Staggered MySQL Dump Importer’, which was able to automatically split and run the import. Even with this I got a couple of ‘MySQL has gone away’ errors so used Vim (a text editor that can handle very large files) to drop 5 blog post inserts (given ds106 had over 20,000 posts I can live with the lose).

The selected data export

Wordpress Database DiagramThose familiar with the backend of WordPress will know it has a relational structure with a number of tables holding different parts of the data. Because I knew I wanted to explore ‘ds106 dialog’ and that to do this the data would have to be compiled with other sources I needed to export a slice of the data. Whilst I was does this it also made sense to make some of the data more meaningful. For example, the wp_posts table which has most of the data I needed uses an author id number where as it would be better to get this as a display name by joining it with the wp_users data. Fortunately because MySQL is … well a ‘query language’ this is relatively easy to do by using the command below. To talk you through it I’m selecting a couple of columns from the wp_posts table and joining it with some other data on the condition that it’s a published post. As well as author display name you’ll see that I’m also returning a wfw:commentRSS. This is the comment RSS feed caught by the FeedWordPress plugin (I think it comes from here). I got wind of this metadata after reading Jim Groom’s Displaying Distributed Comments on the Hardboiled Blog. I’ll show how this is used later.

       wp_postmeta.meta_value AS comment_feed, 
FROM   wp_posts 
       LEFT JOIN wp_postmeta 
              ON = wp_postmeta.post_id 
                 AND wp_postmeta.meta_key = 'wfw:commentRSS' 
       LEFT JOIN wp_users 
              ON wp_posts.post_author = 
WHERE  wp_posts.post_type = 'post' 
       AND wp_posts.post_status = 'publish'

Running this query in phpMyAdmin gives us an option to exporting as a csv giving almost 20,000 blog posts (one per row) to play with which is plenty.

The refinement

apply-operationsUsing Google Refine we can import the csv file choosing comma separated and headers in first row. Next we want to process the post content to extract outbound links. Below a the rough steps I used which you can download and apply to your own project (the steps in the file vary slightly as I noticed that the guid links had ?p={number} instead of the post permalink. I ended getting these when fetch the comment feed and extracting the link. More detail about this in the next post. Note to self: issue with non-permalinks).

  1. Create column post_as_date at index 3 based on column post_date using expression grel:value.toDate("Y-m-d H:m:s")
  2. Create column outbound_links at index 5 based on column post_content using expression grel:forEach(value.parseHtml().select("a[href~=]"),e,e.htmlAttr("href")).join("|||")
  3. Split multi-valued cells in column outbound_links
  4. Create column source_links at index 9 based on column guid using expression grel:row.record.cells["guid"].value[0]

The key step is 3. which mines all the posts for <a href> tags and extracts them into a new column joining them together with ‘|||’


Using Refine we can split the outbound_links column to put one link per row (column dropdown, Edit cells > Split multi-valued cells). The last step is to make sure the post source link is included in each row using fill down the right and secure way. Filtering out blank outbound_link rows and Export using Custom tabular exporter (setting I used here) I get some data that looks like this:

The data meets Gephi and NodeXL

With this data we can open in Excel and then import into NodeXL using Import > From Open Workbook, more about that later, or alternatively using Gephi change the column headings in the .csv version to source and target and import via the ‘data laboratory’ tab (for graph manipulation in Gephi Tony Hirst has a great tutorial).

Sticking with Gephi for now you can generate the images below. On the left is the entire graph which depicts the ~59,000 links contained in ~20,000 ds106 blog posts. The image on the right is zoomed in to the inset where we can start seeing each webpage referenced as a dot connected by a line which is a link in a post. Whilst this is pretty, it’s also pretty useless.

All links from ds106 links Close-up of post links

So where to go with this? In the paper A Theoretical Proposal of Learning Communities Management Through the Analysis of the Organizational Blogosphere the authors propose using the measure of ‘betweenness centrality’ (BC) as a way of identifying influential learners within a blogging collective:

The betweeness centrality has been considered in literature (Marsden, 2002) as a way to find the most valuable nodes within a social network. The strategic function of these characters is fundamental for the system of learners, because they have the widest view of what happens in the net. This is due to the fact that they intercept the majority of the knowledge and information flows within the learning community. The people with the highest betweeness centrality can share their vision on the main knowledge claims of the community in an open debate.

Extrapolating this slightly lets consider individual blog posts rather than individual learners, using BC as a way to highlight posts that may bridge concepts, or support the community of learners in some way. Rather than examine all links in the blog posts lets first consider internal linking within the ds106 community.

To do this lets turn back to NodeXL which I find a lot easier to get data in and manipulate. Opening our export file from Google Refine and then using Import > From Open Workbook we can import all the data source and target being edge properties, the rest vertex 1 properties. This gives you over 75,000 connections between over 59,000 links, back at square one. A nice feature of NodeXL is to tell it to ‘skip’ some data (edge, vertex or group). The data is kept in the spreadsheet so you can include it later but it’s not used for any calculations. I only want to analysis ds106 posts that link to other ds106 posts. Here’s brief notes on the way I did it (other ways are possible and might even be better).

Autofill options

On the ‘Edges’ sheet in the ‘Other columns’ section I created 4 columns named: Community Link, Self Link, Combined and

Internal Link, then added the following formula in each column:

  • =IF(ISERROR(VLOOKUP([@[Vertex 2]], Vertices!A:AF,30 ,FALSE)),0,IF(LEN(VLOOKUP([@[Vertex 2]], Vertices!A:AF,30, FALSE))>12,1 ,0)) – because we only have dates, titles, names for posts made by the ds106 community, if this detail is blank on the vertices sheet then the edge is outwith the community (if it exists enter 1, otherwise 0) Tip: after doing this I copied all the Community Link values and Paste Special, Values to save recalculation on edit/open.
  • =IF([@[Vertex 1]]=[@[Vertex 2]],0,1) – test if edge is self-link
  • =[@[Community Link]]*[@[Self Link]] – multiplying these together as a simple operator
  • =IF(ISERROR(OR(FIND("/",[@[Vertex 1]],12), FIND("/",[@[Vertex 2]],12))), 0, IF(LEFT([@[Vertex 1]], FIND("/",[@[Vertex 1]], 12))=LEFT([@[Vertex 2]], FIND("/",[@[Vertex 2]],12)), 1,0)) – used to test if edge is two posts by the same author


Using the Autofill we can chose to skip edges based on the ‘Combined’ value (and while we are at it set the edge style based on the ‘Self Link’ value). Once this is done you can use NodeXL to calculate Groups and Metrics knowing that it’ll skip the edges it doesn’t need.

Once we’ve done all of this and applied some chart options this is the graph we get.

ds106 interlinked blog posts

Taking a closer look at one of the groupings (G1) shown below we can see how ds106 blog posts (the dots) link to each other, a dotted line indicating it’s the same post author referring to their own work.

Group G1

The first impression

We now have a dataset of ds106 blog posts and the webpages that they individually link to. There are some interesting patterns and more work to be done to interpret these. Whilst this technique has surfaced interlinks within the community there is no context in which they are made. Ranking the posts by betweenness centrality we get the following top 10:


It’s interesting that most of these come from tutors (although given the amount of co-creation the role of tutor/student is very blurred), which in some ways isn’t that surprising given they will be summarising and highlighting key work.

The Limitations

Hopefully by this point you are screaming at the screen because of the limitations of this analysis. The focus has been on blog post content, which I hope I’ve shown can be used to identify interesting moments. The big limitation here is it’s looking at a narrow slice of activity - how students are linking to blog posts without any real context. Another staple of blogs is comments. Comments provide an easy way for a reader to feedback to the author continuing the dialogue sharing thoughts, ideas and  reflections. It’s also all well and good me doing this but the real interesting thing would be to put this graphical overview in the hands of tutors and students to see if it helps them. I imagine it wouldn’t be hard to wrap the MySQL query initially used in some PHP and visualisation library like d3.js and provide users with some situational awareness of what is going on in their community. If you would like to explore the data the NodeXL file is here for download.

As part of this post was also about the feasibility of using FeedWordPress to aggregate JISC project blogs here are some of my notes:

  • Exporting post content was possible with MySQL access (interfaces would be easy to design)
  • Issue encountered with how post links stored (blogspot/blogger feeds use a internal guid=, where as other guids are links to the original posts). This was solved by fetching the comment feed and parsing the page link from the response
  • Need to be aware that FeedWordPress can't get content for partial rss feeds but plugins are available
  • Registering a feed is a manual process
  • Issue with mixed-mode feeds (if you are automatically pulling in all content if the feed source has a mixed use e.g. blog is mixture of personal and work, then data isn't as clean)
  • Getting the data into Google Refine made it easy to mine post content e.g. in one expression I'd calculated that ds106 has published over 4 million words

There are other technicalities of FeedWordPress I'd want to explore like how post edits are handled. Overall it was a lot easier having a single data source.

Your Thoughts

At this point it would be very useful to have your thoughts on this technique. And remember I’m a ‘maker of things’ so your suggestions may get incorporated into a functional tool ;)

In my next post in this series I’ll share my notes on building and analysing comments data.

1 Comment

As I mentioned in Filtering a Twitter hashtag community for questions and responses I’ve been asked to do some analysis of the Current/Future State of Higher Education (CFHE12) course. Week 1 has mainly been about creating a toolchain that makes it easier to hit a button and get some insight. The focus has mainly been on tweets with the #cfhe12 hashtag. I’m still scratching my head as to what this all means but there are already discussions to extend the scope trying to establish more context by also looking at blog and discussion forum posts. The danger I also have as a ‘maker of things’ as questions emerge I want to make things to help find the answers.

To easy into this lets start with an overview here are some key stats for 7-13th October 2012 (BST) (and already I resisting the temptation to create an overview template):

  • 762 Tweets
  • 305 Links
  • 172 RTs
  • 244 Unique twitter accounts
  • 14% (n.104) of tweets were in @reply to another person using #cfhe12

This sheet contains more details including a summary of who tweeted the most and got the most @mentions and the ‘Dashboard’ sheet which let me know that this was the most retweeted tweet:

Below are two graphs summarising the Twitter activity for week 1 of #cfhe12 (LHS) and another course earlier in the year #moocmooc (you can click on both of these for interactive versions).

summary of #cfhe12 tweets for week 1
#cfhe12 week 1 tweets

Summary of tweets from #moocmooc
#moocmooc tweets

It’s notable that the volume and proportion of tweets and @replies is higher in #moocmooc. Part of this could be down to the fact that #moocmooc was a condensed course that was one week long. Other factors may include the chosen infrastructure and how this was promoted, size of course and who was participating.

Extracting a conversation graph, which is shown below, there isn’t a great deal of @replies for week 1. In the graph each dot represents a single tweet and dots are joined if the person is @replying that tweet. I probably need to find a way for you to interact with this graph, but for now I’ve prepared these pages with conversations for groups G1-G4:

cfhe12 week 1 conversation graph
[The above graph data can be downloaded from the NodeXL Graph Gallery]

Exploring G3 and G4 some of the limitations of this technique become apparent. For example clicking on the date in the first tweet in G4 reveals the full text from Twitter, which includes text from G3 i.e. they are the same conversation and should be grouped together.

So more work to do, more things to think about, more tools needed to make sense of this easier. In the meantime any of your observations are greatly welcome.

Yesterday I got stuck into the first week of the Coursera course on Computing for Data Analysis. The course is about:

learning the fundamental computing skills necessary for effective data analysis. You will learn to program in R and to use R for reading data, writing functions, making informative graphs, and applying modern statistical methods.

You might be asking given that I’ve already dabbled in R why am I taking an introductory course? As I sat watching the lectures on my own (if anyone wants to do a Google Hangout and watch next weeks lectures together let me know) I reminisced about how I learned to swim. The basic story is 6 year old boy is staying a posh hotel for first time, nags parents to take him to the swimming pool, when they get there gets changed runs off and jumps in at the deep end. When I eventually came back to the surface I assumed the doggy paddle and was swimming’ … well ‘swimming’ in the sense that I wasn’t drowning.

The method of ‘throwing myself in’ is replicated throughout my life, particularly when it comes to learning. So whilst I’ve already thrown myself into R I can survive but only just and what I’ve produced is mainly as a result of trying not to drown. This revelation was particularly clear when learning about subsetting (reshaping data)

I’ve got an example where I’ve been practicing my subsetting skills with NodeXL data later in this post, but first some quick reflections about my experience on the course so far.

MOOCing about in Coursera

So hopefully you’ve already got the picture that I’m a fairly independent learner so I haven’t bothered with the built-in discussion boards, instead opting to view the lectures (I’m finding x1.5 speed suits me) and take this weeks quiz. The assignment due for week 2 is already announced and people are racing ahead to get it done (which appears to have forced the early release of next weeks content).

Something apparent to me in the Coursera  site is the lack of motivational cues. I’ve got no idea how I’m doing in relationship with my fellow 40,000 other students in terms of watching the lectures or in this weeks quiz. Trying to get my bearings in using the #compdata Twitter hashtag hasn’t been that successful because in the last 7 days there have only been 65 people using or mentioned with the tag (and of the 64 tweets 29 were ‘I just signed up for Computing for Data Analysis #compdata …’)

Things are looking up on the Twitter front though as some recent flares have gone up:

and also @ @hywelm has made himself known ;)

Will there be much community building in the remaining 3 weeks?

Mucking about with NodeXL and R

In the section above I’ve mentioned various Twitter stats. To practice this week’s main compdata topics of reading data and subsetting I thought I’d have a go at getting the answers from a dataset generated in NodeXL (I could have got them straight from NodeXL but where is the fun in that ;).

Step 1 was to fire up NodeXL and import a Twitter Search for #compdata with all of the boxes ticked except Limit to… .

Twitter Search Import from NodeXL

As a small aside I grabbed the the NodeXL Options Used to Create the Graph used in this MOOC search by Marc Smith, hit the automate button and came up with the graph shown below (look at those isolates <sigh>):

The #compdata graph

To let other people play along I then uploaded the NodeXL spreadsheet file in .xlsx to Google Docs making sure the ‘Convert documents …’ was checked and here it is as a Google Spreadsheet. By using File > Publish to the web… I can get links for .csv versions of the sheets.

In R I wrote the following script:

If you run the script you should see various answers pop out. As I’m learning this if anyone would like to suggest improvements please do. My plan is to keep adding to the data and extending the script as the weeks go buy to practices my skills and see what other answers I can find


This post is a bit messy. I got caught trying out too many ideas at once, but hopefully you'll still find it useful

Sheila recently posted Analytics and #moocmooc in which she collects some thoughts on the role of analytics in courses and how some of the templates I’ve developed can give you an overview of what is going on.  As I commented in the post I still think there is more work to make archives from event hashtags more useful even if just surfacing tweets that got most ‘reaction’.

There are three main reactions that are relatively easy to extract from twitter: retweets, favouring and replies. There are issues with what these actions actually indicate as well as the reliability of the data. For example users will use ‘favouring’ in different ways, and not everyone uses a twitter client that can or uses a reply tweet (if you start a message @reply without clicking a reply button Twitter looses the thread).

But lets ignore these issues for now and start with the hypothesis that a reaction to a tweet is worth further study. Lets also, for now, narrow down on threaded discussions. How might we do this? As mentioned in Sheila's post we’ve been archiving #moocmooc tweets using Twitter Archiving Google Spreadsheet TAGS v3. As well as the tweet text other metadata is recorded including a tweet unique identifier and, where available the id of the tweet it is replying to.

Google Spreadsheet columns

We could just filter the spreadsheet for rows with reply ids but lets take a visual approach. Downloading the data as a Excel file we can open it using the free add-in NodeXL.

NodeXL allows us to graph connections, in this case conversation threads. NodeXL allows use to do other useful things like group conversations together to make further analysis easier. Skipping over the detail here’s what you get if you condense 6,500 #moocmooc tweets into grouped conversations.

 moocmooc grouped converstations

This is more than just a pretty picture. In NodeXL I’ve configured it so that when I hover over each dot which represents and individual tweet I get a summary of what was said by who and when (shown below).

NodeXL being used to examine nodes

It’s probably not too surprising to see strings of conversations, but by graphing what was an archive of over 6500 tweets we can start focusing on what might be interesting subsets and conversation shapes. There are some interesting patterns that emerge:

conversation group 1 conversation group 2conversation group 3

Within NodeXL I can extract these for further analysis. So the middle image can be viewed as:

Examination of conversation group 2

There’s a lot more you can do with this type of data, start looking at how many people are involved in conversations, number of questions per conversations and lots more. I should also say before I forget that NodeXL can be configured to collect twitter search results with it’s built-in twitter search tool. It can also be configured to do the collection on a regular basis (hmm I should really have a go at doing that myself). So potentially you’ve got a nice little tool to analysis twitter conversations in real-time …

If you’d like to explore the data more it’s available from the NodeXLGraphGallery. I’m going off to play some more ;)


I recently had a chance to spend some time with Marc Smith co-founder of the Social Media Research Foundation which is behind the Microsoft Excel networks add-in NodeXL. I’ve done a couple of blog posts now with NodeXL and after a prompted by Marc I thought it was worth a revisit. So in this post I’ll highlight some of the new features of NodeXL's Twitter Search tools that make it a useful tool for community/resource detection and analysis.

Importing a Twitter Search – expanding all the urls

Before going too far I should also point out there has been a separate social network importer for Facebook Fan pages for a while now. On the new Twitter Search import there is now an option to ‘Expand URLs in tweets’. This is useful because Twitter now  wraps all links in it’s own shortening service The shortened urls are also unique for each tweet* even if the target url is the same. Having a feature that expands these is useful to see what people are linking to (it makes it easier to see if people are sharing the same resources or resources from the same domain).  And as you’ll see later makes it easier data to use in mashups.

*except new style RTs which use the same link

Expand URLs options

Did you know you can use urls and website domains in your search? This is a trick I’ve been using for a long time and I’m not sure how widely known it is. For example here is everyone who has been sharing the new Creative Commons License chooser at or just everyone sharing a link that has anything that links to the Creative Commons website domain. In Tweetdeck I use a search column with ‘ OR’ to pickup any chatter around these sites.

Replicable research and the (almost) one button expert

NodeXL has been a great tool for me to start learning about network analysis, but as I play with various settings I’m conscious that I’m missing some of the basic tricks to get the data into a meaningful shape. For a while now people have been able to upload and share their network analysis in the NodeXLGraphGallery. This includes downloading the NodeXL data as an Excel Workbook or GraphML (this is a nice way to allow replicable research).

An even newer feature is to download the NodeXL Options the graph author used. This means a relative amateur like myself with no sociological background, and unlike Marc unaware of what the increasing popularity of zombie films might be saying about our society (although they can be used to explain betweenness centrality), can tap into their expertise and format a graph in a meaningful way with a couple of clicks. There’s still the danger that you don’t understand the graph, but it can still be a useful jumpstart.

Import NodeXL Options

Twitter Search Top Items Summary

The next new thing is a Twitter Search Network Top Items page. I did a search for ‘#oer OR #ukoer’ to pull the last 7 days  tweets. By importing the options from this NodeXL Graph Gallery example and running the ‘Automate’ you can reuse my settings on your own search result. By running Graph Metrics > Twitter search network top items (part of my Automate options) I get this sheet which I’ve uploaded to Google Spreadsheet

Twitter Search Network Top Items page

This sheet lets you quickly see overall and group level:

  • Top Replied-To in Entire Graph
  • Top Mentioned in Entire Graph
  • Top URLs in Tweet in Entire Graph
  • Top Hashtags in Tweet in Entire Graph
  • Top Tweeters in Entire Graph


These are useful summaries to look at who is most active in the community, what urls are most being shared, overlapping tag communities. I admit that it can look like a scary dashboard of stuff which not all of you will like, but NodeXL is a network graphing tool so it’s easy to visually explore the data.

So looking at macro level we can quickly graph the ripples typical within a Twitter community which mainly showing the effects of retweets (this view was extracted from my online NodeXL Google Spreadsheet Graph Viewer). This can help you quickly see the smaller clusters within the community who are generating retweets and conversations.

Retweet ripples

Community (group) in a box

Because my data was also on the NodeXL Graph Gallery Marc kindly created this view which groups sub-communities using an algorithm and overlays the most used hashtags used by the sub-community (Marc’s version on NodeXL Graph Gallery). The group hashtag labels, which are in frequency order, are very useful in this situation because the search term I used was pulling in overlapping hashtag communities (#oer and #ukoer). So looking for boxes where ‘ukoer’ is near the beginning would indicate they are from the uk crowd.

oer/ukoer graph by Marc Smith

Getting more from the data

Earlier I mentioned that having expanded urls was useful for further analysis. Something I quickly played with that I’m not entirely sure how to get the most out of (if anything) is reusing my RSS Feed Social Share Counting Google Spreadsheet code to get social share data from the most tweeted links. Here’s the result (embedded below). Let me know if you have any thoughts on how it can be used:


I keep getting this nagging voice in my head to do something with Google+ and Google Spreadsheets. I've resisted until now as I don't think there is enough of an official API there to satisfy my current interest in social network analysis. This hasn't stopped other people from getting off the beaten track and developing some really useful stuff.

Who folk who put me in a g+ circle followA while ago Tony (Hirst) posted a Fragment: Looking up Who’s in Whose Google+ Circles…,which highlights how friend and follower information can be extracted from Google using an undocumented API. Tony followed this up with So Where Am I Socially Situated on Google+? in which he used a genius bit of Python scripting to extract a large fragment of his extended Google+ network which he then processed in Gephi.

Here's my take on the problem, a Google Spreadsheet template to extract Google Plus network information for analysis in the tool of your choice, in my case NodeXL.  

If you want to go and play straight away here's a link to:

*** Get Google Plus Friend/Follower Information ***
(File > Make a copy to use yourself)

Below is an example output from my network generated in NodeXL (and a copy in my online NodeXL viewer ;):

mhawksey-googleplus friend follower

How to use

Instructions for use are contained in the Spreadsheet Readme sheet. My top tip is be patient, it can take a while to grab the data so don’t be alarmed if a minute goes by without a notification from the sub-level run. If nothing is happening after 5 minutes something has gone wrong to try running ‘Google+ > Start/continue collection’ again (it should continue from where it left off). Also this script is hot off the press so if you are having problems comments are very welcome.

One other very important caveat is looking at the execute logs the script isn’t getting friend data for everyone. Not entirely sure why but might be Google trying to prevent use of an undocumented API. This means some nodes (in one sample 10%) are lacking outbound data.

Getting data into NodeXL

I'm not familiar with getting the data into tools like Gephi (other than via a NodeXL export) so if someone wants to post or link to something to do this for other tools leave a comment or get in touch. Instead here are some basic instructions for getting the data into NodeXL:

  1. Once you collected the data from the Google Spreadsheet select File > Download As > Excel
  2. Open the downloaded file in Excel making sure you the Links sheet is active
  3. Start a new NodeXL template and from the NodeXL ribbon menu select Import form Open Workbook
  4. Make sure the dialog is pulling data from the Links sheet and ‘Columns have headers’ is not ticked
  5. Make sure Type is set to Directed and click Prepare Data > Get Vertices from Edge List
  6. Back in the Excel spreadsheet with your downloaded data  right-click on the Nodes tab and select ‘Move or Copy…’ and change the To book dropdown to your new NodeXL template
  7. Back in your NodeXL template open the Vertices sheet and add two ‘Other columns’ with the headings ‘Name’ and ‘ImageUrl’
  8. In the cell beneath the new Name column enter the formula =IF(VLOOKUP([@Vertex],Nodes!A:C,2,FALSE)=0,[@Vertex],VLOOKUP([@Vertex],Nodes!A:C,2,FALSE))
  9. In the cell beneath the new ImageUrl column enter the formula =VLOOKUP([@Vertex],Nodes!A:C,3,FALSE)
  10. In the cell beneath the Visual Properties Image File column you can enter the formula =[@ImageUrl] then fill the Shape column with Image (you may of course decide not to use images for your nodes, it’s up to you)
  11. Do your other analysis: graph metrics, groups, autofill etc

Some technical notes on using Google Apps Script to collect the data  

Processing time – using ScriptProperties as a cookie

Using Google Spreadsheet/Google Apps Script I can’t the same extent of information (where Tony got a sample of who his friends were also following I’m just getting which of my friends follow each other). This is because Apps Scripts are executed server-side and automatically timeout after around five minutes. Even with a more focused data set in mind I was finding I could processed all the data in 5 minutes so had to resort to breaking the process into chucks using the build-in ScriptProperties (it’s like a script cookie) to store where the process was up to.

Caching in the Cloud with Google Apps Script

Google recently announced the addition of a CacheService in Google Apps Script. In Tony’s original code he was also caching responses from APIs to prevent multiple lookups. In my version there is less need for caching as I only make one grab for each friend or follower. I still implemented the CacheService as a way to backup a sublevel processing run. The scenario being you get most of the way through the 50 calls per block when the script fails, the next time it can just re-read the data it already got from the cache speeding things up. Doing this I did however find the limits of the CacheService:

  • cache names have a size limit  (Solution: I trimmed mine to 240 characters),
  • cache size looks like it might have a 100k limit (Solution: caching only the data I needed and wrapping put in a try/catch exception – if I couldn’t put it I’d just make a fresh call if I needed the data)
  • anything cached is stored as a string (Solution: some array splitting and joining in the right places)

Below is how the function for this ended up looking (I’m I’m sure it can be refined – I’m just a hobbyist coder). The full code is here.

function getCachedDataOids(url) {
  var cache = CacheService.getPublicCache(); // initialize
  var oids = cache.get("f"+url.substr(0,240)); // get data if any (url trimmed to prevent error)
  if (oids == null || oids == "") { // if null or empty means there is no cached data or last fetch failed
    var requestData = {"method":"GET", "headers": { "User-Agent": ""}}; // prepare request
    var response = UrlFetchApp.fetch(url, requestData); // try fetch
    if (response.getResponseCode() == 200) { // if response then
      var resp = response.getContentText(); // get response text 
      var reobg = new RegExp("\\d{21}","g"); // only interested in Google Plus user ID so prepare regexp
      oids = resp.match(reobg); // get an array of ids
      if (!oids == null){ // if something stringify it
        oids = oids.join(",");
      try {
        oids = oids.join(",");
        cache.put("f"+url.substr(0,240), oids, defCache); // try and put it in cache
      } catch(e){
        return oids; // if too big just return it 
  return oids;

Enjoy ;)


Recently Tony (Hirst) tipped me off about a new viewer for Gephi graphs. Developed by Raphaël Velt it uses JavaScript to parse Gephi .gefx files and output the result on a HTML5 canvas. The code for the viewer is on github available under a MIT license if you want to download and remash, I've also put an instance here if you want to play. Looking for a solution to render NodeXL data from a Google Spreadsheet in a similar way here is some background in the development of NodeGL – an online viewer of NodeXL graphs hosted on Google Spreadsheets

In the beginning there was Gexf-JS, and it was really good …

Gexf-JS ViewerThere are several features of Gexf-JS I really like: it rendering positions of nodes as generated in Gephi so can handle large datasets; some great UI features including node summary and highlighting connections; and a search box to find nodes.

If you have followed some of my previous work you'll know I'm interest in getting output, mainly from NodeXL, online to allow users to explore and interact with the data (e.g.  A template for rendering small NodeXL visualisations on the web ... or EDGESExplorer: Simple force layout diagrams from edge lists stored ...)

Most of these solutions have centred around using the d3.js library to render force layout graphs. A drawback of this method is that with my implementations d3.js calculates node positions based on node links in realtime so you have limited control node placement and you also start experiencing performance issues with large datasets. Gephi and NodeXL on the other hand let the user choose from number of different layout algorithms to calculate node positions and even if needed make manual adjustments. What Gexf-JS is doing is taking Gephi generated layout data and rendering it in the browser in a scalable format.

One of the reasons I was interested in Gexf-JS is one of the improvements I'd like to make to my online Twitter conversation visualisation tool, TAGSExplorer, is the option to display a hashtag community graph (who is friends with who) as a different way to explore the data. In a previous experiment rendering a twitter community graph for #IIE2011 using d3.js the result wasn't brilliant especially when you compare it to the same data processed in Gephi and viewed using gexf-js.

So one workflow I'm thinking for TAGSExplorer is to display the conversation graph using d3.js with an option for users to upload a community graph processed in NodeXL to the same Google Spreadsheet. As a first step towards this I was interested in modifying Raphaël’s Gexf-JS code to render NodeXL data uploaded to a Google Spreadsheet, mainly as proof of concept and to also familiarise myself with Raphaël’s code.

The good news is it works and rather than just having the code sit on my test server I’ve released NodeGL as an online service and also put the source code back into the community. So below is an example graph generated in NodeXL and here’s a link to the version generated from the same data stored on Google Spreadsheets


If you want to try out your own here’s how:

  1. Generate your graph in NodeXL and File > Save As > Excel 97-2003 Workbook (*.xls) clicking Continue and Yes in the dialog popups
  2. Upload the generated .xls file to Google Docs making sure you tick 'Convert documents, presentations, spreadsheets and drawings to the corresponding Google Docs format'
  3. Open the uploaded spreadsheet and select File > Publish to the web... and publish all the sheets
  4. Copy the spreadhseet key from the link box (it will be similar to the example key highlighted in this link
  5. To view your graph replace the highlight part of this url with your own spreadsheet key

You should know that …


NodeGL uses the canvas element, which might cause compatibility issues with older browsers. It has been tested with the latest Chrome, Firefox and Internet Explorer versions. It doesn't work with Internet Explorer 8 or older.


Only part of the NodeXL graph schema has been implemented so features like node labels and shape are ignored.

And finally … the code

I’ve only done limited testing so feedback is very welcome or you can Download NodeGL Code and tweak yourself

And really finally …

NodeXL has it’s own GraphGallery which allows users to submit graphs in GraphML format. These can be downloaded and viewed within NodeXL but I thought it would be useful to also have an online viewer to preview these graphs. I’ve started work on a NodeXL Graph Gallery Viewer which can turn uploaded GraphML like one for IIE2011 in results like this (I’ve let the folks at NodeXL know about this, they seem keen to develop further ;)