Visualisation

As part of some work I'm doing with the Open University around the OER Research Hub project I developed this high fidelity prototype which lets users explore survey responses collected by the project (the short video below highlights the main features):

In the guest post I wrote on the 'OER Survey Exploratorium' I outlined the problem:

When presented with over 4,000 survey responses the challenge was how to let people explore the data set. Faced with this challenge, the first thought invariably is: what is the shape of the data? In this case the survey responses were collected in Survey Monkey. After considering options like pulling the data into the OER Impact Map via the Survey Monkey API, the overhead of developing user interfaces and squeezing everything into a WordPress data structure led me to explore other options. The approach that looked like it would squeeze the most functionality out of little development time was to use Google Fusion Tables to host the data and put a visualisation layer over the top. The reason for choosing Fusion Tables is that it allows a Guardian Datastore model of letting the user easily access and reuse the source data, either in Fusion Tables itself or by exporting it into another tool. If you would like to peek at the data behind this there are two tables: one with the survey questions and another with the survey data.

I’ve extracted the main part of the code into this gist so you can get a sense of what’s going on. If this is something you are interested in doing yourself there is some documentation on using the Google Visualisation API to fetch data from Google Fusion Tables, and that page has an example of how to do it. It’s worth noting, however, that as Google Fusion Tables implements the Chart Tools Datasource Protocol you can query the data as a datasource. This allows you to use the Google Visualization API Query Language with its SQL-like syntax. The gist below is a reworking of this query example in the Google Code Playground, which you can use to see the differences. The main one is how the query is set: you specify in the query itself which table the data comes from (a minimal query sketch follows these notes). A couple of notes I have on using Google Fusion Tables as a datasource in this way:

  • Data returned is limited to 500 rows. If you want more you can turn to the full Google Fusion Tables API, which has a separate SQL-like query language. Using this API is rate limited and requires OAuth and/or an API key. I got more than 500 rows by using LIMIT and OFFSET in my queries. (The full Google Fusion Tables API is worth bookmarking for cross-referencing.)
  • Using back-quotes (`) as specified in the Visualisation API to wrap column names with spaces doesn’t appear to work. You do specify columns by their name rather than A, B, C etc. as used in Google Sheets. (The Google Fusion Tables API specifies single quotes, which I don’t think work in this scenario – this is an example of where cross-referencing helps.)
  • Google Fusion Tables doesn’t implement the OR operator (related issue ticket marked Won’t Fix). When I mentioned this to Tony Hirst (@psychemedia) he suggested De Morgan’s laws as an alternative.
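To make that concrete, here is a minimal sketch of querying a Fusion Table as a datasource (the table id and column names are placeholders, not those of the survey tables):

// Query a Google Fusion Table via the Chart Tools Datasource Protocol.
// The table id (123456) and column names are placeholders.
google.load('visualization', '1', {packages: ['table']});
google.setOnLoadCallback(function () {
  var query = new google.visualization.Query(
      'http://www.google.com/fusiontables/gvizdata?tq=');
  // Unlike a normal datasource the table is named in the FROM clause.
  // Remember: results cap at 500 rows, so page with LIMIT and OFFSET.
  query.setQuery('SELECT Question, Response FROM 123456 LIMIT 500 OFFSET 0');
  query.send(function (response) {
    if (response.isError()) {
      console.log('Error: ' + response.getMessage());
      return;
    }
    var table = new google.visualization.Table(
        document.getElementById('table_div'));
    table.draw(response.getDataTable());
  });
});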

Hope you enjoy, and I look forward to seeing your Google Fusion/Visualization mashups ;)


Update: Nicola Osborne (EDINA) has kindly live-blogged the session, so extensive notes are here.

Later today I’ve been invited by the University of Edinburgh Data Library team to talk about data visualisation. The abstract I submitted and slides are below. Putting a slidedeck like this together is always useful as you mentally sort through the pieces of knowledge you’ve obtained, which in my case is only from the last year or so. It’s also a little unnerving to think how much more is still out there (known unknowns and unknown unknowns). The slides contain links to sources (when you get to the data/vis matrix some of the thumbnails are live links); here’s also the bundle of top-level links.

There are a number of examples throughout history where visualisations have been used to explore or explain problems. Notable examples include Florence Nightingale's 'Mortality of the British Army' and John Snow's Cholera Map of London. Recently the increased availability of data, and of software for analysing and generating various views on this data, has made it easier to generate data visualisations. In this presentation Martin Hawksey, advisor at the Jisc Centre for Educational Technology and Interoperability Standards (Cetis), will demonstrate simple techniques for generating data visualisations: using tools (including MS Excel and Google Spreadsheets), drawing packages (including Illustrator and Inkscape) and software libraries (including d3.js and timeline.js). As part of this participants will be introduced to basic visual theories and the concepts of exploratory and explanatory analytics. The presentation will also highlight some of the skills required for discovering and reshaping data sources.


“Sankey diagrams are a specific type of flow diagram, in which the width of the arrows is shown proportionally to the flow quantity.” – Wikipedia.

I first came across Sankey diagrams via (one of the OKFN’s latest School of Data contributors ;) Tony Hirst in d3.js Powered Sankey Diagram. Subsequently Bruce McPherson showed how to create Sankey diagrams from Excel, also using d3.js.

Having collected some survey data for Analytics and Institutional Capabilities at #cetis13 (here’s a copy of the form) we were looking for a way to go beyond the Google Form reporting and gain extra insight. In particular I was interested in seeing whether there were any relationships between the multiple choice questions. Using a Sankey diagram seemed like a possible solution, and my colleague David Sherlock quickly came up with a modification of Mike Bostock’s Sankey example to accept a csv input (I’ll link if/when it gets written up).

Seeing this I thought it might be useful to make a wrapper to generate Sankey diagrams for data stored in a Google Sheet. The solution was relatively straightforward, using the Google Visualisation API to get the data in the right shape for d3.js. An example of the prototype is here.

There’s no interface yet for you to select a spreadsheet, sheets, columns etc., but you can take your Google Spreadsheet, ‘publish to the web’, and then add the following data to the url.

One of the issues with this solution is you might not want to make all your data available. To get around this I’ve written a Google Apps Script that lets you use a custom formula to preformat the data. To see this in action this Spreadsheet contains an example. The formula is in cell A1 and uses the format =setSankey(datarange, cols, separator); a sketch of what such a function might look like follows the parameter list:

  • datarange – sheet/cell reference for the source data e.g. 'Form Responses - Edit'!A2:D. Note: it must start with column A
  • cols – comma separated list of columns to generate a chart from e.g. "B,C,D"
  • separator {optional} – used to split multi-value cells, defaults to ", " e.g. " | "
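As a rough illustration (this is a sketch, not the actual gist code – the column handling and output shape are assumptions), a custom function along these lines reshapes the chosen columns into the source/target/value rows the d3.js Sankey layout expects:

/**
 * Sketch of a Sankey-preformatting custom function.
 * Counts transitions between each pair of adjacent chosen columns.
 * @customfunction
 */
function setSankey(datarange, cols, separator) {
  separator = separator || ', ';
  // Column letters to 0-based indexes; assumes datarange starts at column A.
  var idx = cols.split(',').map(function (c) {
    return c.trim().toUpperCase().charCodeAt(0) - 65;
  });
  var counts = {};
  datarange.forEach(function (row) {
    for (var i = 0; i < idx.length - 1; i++) {
      // Split multi-value cells so every combination is counted.
      String(row[idx[i]]).split(separator).forEach(function (s) {
        String(row[idx[i + 1]]).split(separator).forEach(function (t) {
          var key = s + '\t' + t;
          counts[key] = (counts[key] || 0) + 1;
        });
      });
    }
  });
  var out = [['source', 'target', 'value']];
  for (var key in counts) {
    var parts = key.split('\t');
    out.push([parts[0], parts[1], counts[key]]);
  }
  return out;
}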

To use this in your own spreadsheets open Tools > Script editor and copy the code from this gist. Here’s an example url using pre-processed data; the main difference is the addition of &output=1 to the querystring.

Obviously creating your own querystrings to render the data isn’t ideal, and it would be relatively straightforward to create a UI wrapper similar to the one used in EDGESExplorer, but it’s something I’m reluctant to do unless there is enough demand. The other consideration is the question: does the Sankey diagram provide useful insight for this type of data, or is it just more ‘damn lies’?

It would of course have been nice to write a Google Gadget to include this in a Spreadsheet … but Google are discontinuing those :(.

I should say this post contains a lot of technical information, doesn't give much background and is mainly for my hard-core followers.

This is a very loose sketch of an experiment, which I might refine, that uses Jason Davies’ wordcloud script (an add-on for d3.js) as a way to filter data hosted in a Google Spreadsheet. I was essentially interested in the Twitter account descriptions of the community using the Social Media Week – Glasgow hashtag, but a minor detour has reminded me that you can:

  • get json data straight from a Google Spreadsheet
  • build dynamic queries to get exactly what you want

So I fired up NodeXL this morning and got this pretty graph of how people using the #smwgla hashtag at least twice follow each other.

people using the #smwgla hashtag at least twice follow each other

One of the features of NodeXL is to add stats columns to your data which includes friend/follower counts, location and profile descriptions.

NodeXL Stats

Uploading the data from NodeXL (Excel) to Google Spreadsheets allows me to render an interactive version of the community graph using my NodeXL Google Spreadsheet Graph Viewer.

interactive version of the #smwgla community graph

All this is doing is grabbing data from Google Spreadsheets using their Visualization API and rendering it with javascript/HTML5 canvas. You can use the query language part of this API to get very specific data back (if you want a play, try Tony Hirst’s Guardian Datastore Explorer). Using Tony’s tool I got this query built. One thing you might notice is I’m selecting a column of twitter descriptions WHERE it contains(‘’) <- a blank. If it’s a blank, why did I bother with the WHERE statement?

Switching to Jason Davies’ wordcloud demo, you can play with custom data sources if you have some JSON. In Tony’s tool you have options to get the data in html (tqx=out:html) and csv (tqx=out:csv). There is a third, undocumented option for json: tqx=out:json. Using this we can get a url for the wordcloud generator https://spreadsheets.google.com/tq?tqx=out:json&tq=select%20AH%20where%20AH%20contains%28%27%27%29&key=0Ak-iDQSojJ9adGNUUXZnU2k3V1FRTjR3eFp0RmRNZWc&gid=118

To make the wordcloud interactive, so that clicking on a term filters the data on that term, we can use the option to include {word} in our source url e.g. https://spreadsheets.google.com/tq?tqx=out:json&tq=select%20AH%20where%20AH%20contains%28%27{word}%27%29&key=0Ak-iDQSojJ9adGNUUXZnU2k3V1FRTjR3eFp0RmRNZWc&gid=118
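That blank contains('') is the trick: it matches every row, and the interactive version swaps the clicked word into the same slot. A small sketch of building these urls (column AH and the key are taken from the example above):

// Build the wordcloud datasource url; pass '' to fetch every description.
// The {word} placeholder is left unencoded in the demo url so the
// wordcloud script can substitute the clicked term.
var key = '0Ak-iDQSojJ9adGNUUXZnU2k3V1FRTjR3eFp0RmRNZWc';
function queryUrl(word) {
  var tq = "select AH where AH contains('" + word + "')";
  return 'https://spreadsheets.google.com/tq?tqx=out:json' +
         '&tq=' + encodeURIComponent(tq) +
         '&key=' + key + '&gid=118';
}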

And here is the final result, an interactive wordcloud of #smwgla Twitter account descriptions [Note: you need to hit the Go button when you click-through]:

interactive wordcloud of #smwgla Twitter account descriptions

Is the end result useful? Not sure, but how the data is extracted is (to me anyway).


I recently had a chance to spend some time with Marc Smith, co-founder of the Social Media Research Foundation, which is behind the Microsoft Excel network add-in NodeXL. I’ve done a couple of blog posts with NodeXL now and, after a prompt from Marc, I thought it was worth a revisit. So in this post I’ll highlight some of the new features of NodeXL's Twitter Search tools that make it a useful tool for community/resource detection and analysis.

Importing a Twitter Search – expanding all the urls

Before going too far I should also point out there has been a separate social network importer for Facebook Fan pages for a while now. On the new Twitter Search import there is now an option to ‘Expand URLs in tweets’. This is useful because Twitter now wraps all links in its own shortening service, t.co. The shortened urls are also unique for each tweet*, even if the target url is the same. Having a feature that expands these makes it easier to see what people are linking to (whether people are sharing the same resources, or resources from the same domain). And, as you’ll see later, it makes the data easier to use in mashups.

*except new style RTs which use the same t.co link

Expand URLs options

Did you know you can use urls and website domains in your search? This is a trick I’ve been using for a long time and I’m not sure how widely known it is. For example here is everyone who has been sharing the new Creative Commons License chooser at http://creativecommons.org/choose/, or everyone sharing anything that links to the Creative Commons website domain. In Tweetdeck I use a search column with ‘hawksey.info OR cetis.ac.uk’ to pick up any chatter around these sites.

Replicable research and the (almost) one button expert

NodeXL has been a great tool for me to start learning about network analysis, but as I play with various settings I’m conscious that I’m missing some of the basic tricks to get the data into a meaningful shape. For a while now people have been able to upload and share their network analysis in the NodeXL Graph Gallery. This includes downloading the NodeXL data as an Excel Workbook or GraphML (a nice way to enable replicable research).

An even newer feature is to download the NodeXL Options the graph author used. This means a relative amateur like myself, with no sociological background and, unlike Marc, unaware of what the increasing popularity of zombie films might be saying about our society (although they can be used to explain betweenness centrality), can tap into that expertise and format a graph in a meaningful way with a couple of clicks. There’s still the danger that you don’t understand the graph, but it can still be a useful jumpstart.

Import NodeXL Options

Twitter Search Top Items Summary

The next new thing is a Twitter Search Network Top Items page. I did a search for ‘#oer OR #ukoer’ to pull the last 7 days’ tweets. By importing the options from this NodeXL Graph Gallery example and running ‘Automate’ you can reuse my settings on your own search results. By running Graph Metrics > Twitter search network top items (part of my Automate options) I get this sheet, which I’ve uploaded to a Google Spreadsheet.

Twitter Search Network Top Items page

This sheet lets you quickly see, overall and at group level:

  • Top Replied-To in Entire Graph
  • Top Mentioned in Entire Graph
  • Top URLs in Tweet in Entire Graph
  • Top Hashtags in Tweet in Entire Graph
  • Top Tweeters in Entire Graph

Detection

These are useful summaries for looking at who is most active in the community, which urls are being shared most, and overlapping tag communities. I admit it can look like a scary dashboard of stuff which not all of you will like, but NodeXL is a network graphing tool, so it’s easy to visually explore the data.

So looking at the macro level we can quickly graph the ripples typical within a Twitter community, which mainly show the effects of retweets (this view was extracted from my online NodeXL Google Spreadsheet Graph Viewer). This can help you quickly see the smaller clusters within the community who are generating retweets and conversations.

Retweet ripples

Community (group) in a box

Because my data was also on the NodeXL Graph Gallery, Marc kindly created this view which groups sub-communities using an algorithm and overlays the hashtags most used by each sub-community (Marc’s version on NodeXL Graph Gallery). The group hashtag labels, which are in frequency order, are very useful in this situation because the search term I used was pulling in overlapping hashtag communities (#oer and #ukoer). So looking for boxes where ‘ukoer’ is near the beginning would indicate they are from the UK crowd.

oer/ukoer graph by Marc Smith

Getting more from the data

Earlier I mentioned that having expanded urls is useful for further analysis. Something I quickly played with, though I’m not entirely sure how to get the most out of it (if anything), is reusing my RSS Feed Social Share Counting Google Spreadsheet code to get social share data for the most tweeted links. Here’s the result (embedded below). Let me know if you have any thoughts on how it can be used:


Yesterday I got a query about search term clustering, not a topic I’ve got a huge amount of experience with, so a quick look at the Wikipedia section on Free open-source data mining software and applications turned up ‘Carrot2: Text and search results clustering framework’. Carrot2 (C2) has a desktop application (Carrot2 Workbench) and its clustering engine is used in a number of other tools including Apache Solr. You can also have a play with an online version of Carrot2. Out of the box you can use the desktop application to query and cluster a number of existing sources including Bing and Wikipedia. If you want to play with other data sources you can point C2 at other XML feeds or even XML documents, as long as they are in the Carrot2 format. The structure of this is relatively straightforward: all you need is a title, url and snippet (the url appears to be used by one of the clustering algorithms and by the application interface to let you navigate to documents, so you could probably fill it with junk if you don’t have a valid link).

To have a quick play with this I thought I’d see what would happen if I passed a twitter archive for #or2012 into C2 and here’s the result.

Getting the data into C2

There are a number of ways I could have got the data out in C2 XML format, like exporting a range to csv, converting it to xml with an XSLT stylesheet, or using the new Apps Script Content Service to generate a custom xml file. Instead, for speed, I decided to use Google Refine to import the Spreadsheet straight from Google Docs:

Google Refine Create Project from Google Doc

... and then use the Templating Export tool to generate the C2 xml.

Google Refine Templating Export

For ease, here are the values I used for the prefix, row template, row separator and suffix if you want to copy and paste.

Prefix

<?xml version="1.0" encoding="UTF-8"?>
  <searchresult>
    <query>#or2012</query>

Row template

    <document id="{{row.index}}">
      <title>{{cells["text"].value}}</title>
      <url>{{cells["status_url"].value}}</url>
      <snippet>
        {{cells["text"].value}}
      </snippet>
    </document>

Row separator

[single carriage return]

Suffix

</searchresult>

If you want to play along, here is a copy of the #or2012 archive in C2 xml.

Processing the data in Carrot2

Open Carrot2 Workbench and in the search panel set the source to XML, pick an algorithm (STC gave a decent result) and set the XML resource (which can be the url of the dropbox file included above or a local copy), then scroll the panel down a bit to hit the Process button.

Carrot2 Workbench Search Panel

The results

Switching to ‘Visualization’ mode using the button at the top right of Workbench I get this view (click to enlarge):

Carrot2 Workbench Visualization Mode

The interface is designed for ‘exploratory analytics’. Clicking on elements like the ‘Clusters’ folder list updates the view in the Aduna Cluster Map and Circles Visualisation, as well as listing the ‘Documents’ (tweets) related to the cluster. Clicking on a tweet from the Documents list or one of the dots in the Circles Visualization opens it in a tab within Workbench (hopefully that keeps Twitter happy about liberating their data ;s).

View a tweet in Carrot2

Useful/meaningful?

This has only been a quick first play, so I’m sure I’m missing loads in terms of tuning the processing. The STC algorithm appears to be very coarse grained, detecting 16 or so clusters. It’s useful to have a cluster of retweets which could be exported and further analysed. Switching to the Lingo algorithm generates 102 clusters, a number of these being RT+screenname. In some ways it would be useful to define some stopwords like ‘or2012’ and ‘RT’ (I’m sure an option must be in there). Part of the reason for publishing the raw data is the hope that someone who actually knows what they are doing can show me what is possible. So over to you; I’ve shown you how to get the data in, help me get something meaningful out ;)


For the last couple of days I’ve been at IWMW12, hosted this year at the University of Edinburgh. I’ve already posted Data Visualisation Plenary/Workshop Resources, which has my slides from the plenary. I was teaming up with Tony Hirst (OU) and have added his slides to that page.

Because I live 'almost locally' and had other family commitments I missed out on most of the social events; instead I got drunk on data, working into the early hours to find what stories I could uncover from the #IWMW12 stream. In this post I’ll show you what I’ve come up with and some of the highlights in trying to turn raw data into something interesting/meaningful (or pointless if you prefer). Interestingly, a lot of what I cover here uses the same techniques as my recent The story data tells us about #CitizenRelay guest post, so I’ve got a templated workflow emerging which I can deploy at events, which makes me wonder if I should be getting organisers to pay my travel/accommodation as an event data amplifier?

UK University Twitter Account Community

On day one Brian Kelly mentioned some work by Craig Russell to collate a table of UK University social media accounts, which featured in a guest post on Brian’s blog titled Further Evidence of Use of Social Networks in the UK Higher Education Sector. You can get the data Craig has compiled from a Google Spreadsheet. Looking at this, two things immediately sprang to mind: first, the document could be made more ‘glanceable’ using some simple conditional formatting, and second, there was a nice list of Twitter accounts to do something with.


Here’s a link to my modified version of Craig’s spreadsheet. It uses the importRange formula to pull the data in, so it creates a live link to the source document. For the conditional formatting I looked for text containing ‘http’, turning the cell text and background green. The HTML view of this is a lot cleaner looking.

On the Twitter Accounts sheet I extract the account screen names by pulling everything after the last ‘/’ and remove most of the blank rows using a unique formula.

Putting this list into the free MS Excel add-in NodeXL and using Import > From Twitter List Network lets you get data on which of these accounts follow each other. I played around with visualising the network in NodeXL but found it easier in the end to put the data into Gephi, getting the image below. These ‘hairballs’ have limited value and you’re best having a play with the interactive version, which is an export from Gephi visualised using the gexf-js tool by Raphaël Velt. (De-hairballing is something Clement Levallois (@seinecle) is working on, and he kindly sent me a link to a new tool he’s creating called Gaze.)

UK HEI Twitter Accounts

The #IWMW12 Twitter Archive Two More Ways

As part of #iwmw12 I was collecting an archive of tweets, which already gives you the TAGSExplorer view. I also use the Topsy API and a Google Spreadsheet to extract tweets, which are then passed into Timeline by Vérité to give you a nice sense of the event. [If anyone else would like to make their own twitter media timeline there is a template in this post (it is as easy as making a copy of the template, entering your search terms and publishing the sheet).]

Searchable archive

New way number one is a filterable/searchable archive of IWMW12 tweets. Using the Google Visualisation API I can create a custom interface to the Google Spreadsheet of tweets. This solution uses some out-of-the-box functionality including table paging, string filtering and pattern formatting. Using the pattern formatter was the biggest achievement, as it allows you to insert Twitter Web Intents functionality (so if you click to reply to a tweet it pulls up Twitter’s reply box).
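Roughly, those pieces fit together like this (the spreadsheet key and column indexes below are placeholders, not the real archive):

// Wire a StringFilter control to a paged Table chart, using a
// PatternFormat to inject Twitter Web Intents reply links.
google.load('visualization', '1', {packages: ['controls']});
google.setOnLoadCallback(function () {
  var query = new google.visualization.Query(
      'https://spreadsheets.google.com/tq?key=SPREADSHEET_KEY&gid=0');
  query.send(function (response) {
    var data = response.getDataTable();
    // Column 0 = tweet id, column 1 = tweet text (assumed layout).
    var formatter = new google.visualization.PatternFormat(
        '<a href="https://twitter.com/intent/tweet?in_reply_to={0}">reply</a>');
    formatter.format(data, [0]);
    var dashboard = new google.visualization.Dashboard(
        document.getElementById('dashboard'));
    var stringFilter = new google.visualization.ControlWrapper({
      controlType: 'StringFilter',
      containerId: 'filter_div',
      options: {filterColumnIndex: 1}
    });
    var table = new google.visualization.ChartWrapper({
      chartType: 'Table',
      containerId: 'table_div',
      options: {allowHtml: true, page: 'enable', pageSize: 25}
    });
    dashboard.bind(stringFilter, table).draw(data);
  });
});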

I also processed the archive using R to get term frequencies for a d3-based wordcloud. (I’ve started looking at how this can be put into a more general tool. Here’s my current draft, which you should be able to point any TAGS spreadsheet at; this version also includes a Chart Range Filter letting you view a time range. I definitely need to write more about how this was done!)

Filter by time

Mappable archive

One of the last things I did was to filter the twitter archive for tweets with geo-location data. Using the Spreadsheet Mapper 3.0 template I was able to dynamically pull the data in to generate a time-stamped KML file. The timestamps are ignored when you view it in Google Maps, but if you download the kml file it can be played back in Google Earth (you’ll have to adjust the playback control to separate the playback heads – I tried doing this in the code but the documentation is awful!). A sample placemark from the file follows the screenshot below.

Google Earth playback
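For reference, each geo-tweet in the generated file is a KML Placemark along these lines (the user, text, time and coordinates here are invented):

<Placemark>
  <name>@example_user</name>
  <description>An example #IWMW12 tweet</description>
  <TimeStamp><when>2012-06-19T10:15:00Z</when></TimeStamp>
  <Point><coordinates>-3.1883,55.9533,0</coordinates></Point>
</Placemark>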

Or if you have the Google Earth browser plugin, a web version of the IWMW12 geo-tweets is here (also embedded below):

So there you go … or as said by Sharon Steeples

Originally posted on CitizenRelay

Telling stories with data is a growing area for journalism and there is already a strong community around Data Driven Journalism (DDJ). I’m not a journalist; by day I’m a Learning Technology Advisor for JISC CETIS, but my role does allow me to explore how data can be used within education. Often this interest spills into the evenings, where I ‘play’ with data and ways to visualise the hidden stories. Here are a couple of ways I’ve been playing with data from the CitizenRelay:

A time

One of the first things I did was produce a Timeline of the CitizenRelay videos and images shared on Twitter. This uses the Topsy web service to find appropriate tweets, which are stored in this Google Spreadsheet template and then displayed in the Timeline by Vérité tool (an open source tool for displaying media in a timeline). The result is a nice way to navigate material shared as part of CitizenRelay and an indication of the amount of media shared by people.

 Timeline of the CitizenRelay videos and images shared on Twitter

A time and place

As part of the CitizenRelay, Audioboo was used to record and share interviews. For a data wrangler like myself Audioboo is a nice service to work with because they provide a way to extract data from their service in a readable format. One of the undocumented options is to extract all the clips with a certain tag in a format which includes data about where the clip was recorded. Furthermore this format is readable by other services, so with a couple of clicks we can get a Google Map of CitizenRelay Boos which you can click on to find the audio clips.

 Google Map of CitizenRelay Boos

One experiment I tried which didn’t entirely work out the way I wanted was to add date/time to the Audioboo data and also embed the audio player. This datafile (generated using this modified Google Spreadsheet template) can be played in Google Earth, allowing you to see where and when Boos were created with a timeslider animation, and to play back the clips directly. This experiment was only partially successful because I would prefer the embedded player worked without having to download Google Earth.

 Google Earth of CitizenRelay Boos

A look at who #CitizenRelay reporters were

So far we have mainly focused on the content, but let’s now look at the many eyes and ears of the CitizenRelay who helped share and create stories on Twitter.

CitizenRelay Many eyes

The image shows the profile pictures of over 600 people who have used the #CitizenRelay tag on Twitter so far this year. This image was generated using a free add-in for Microsoft Excel called NodeXL (read more about getting started with NodeXL). What that image doesn’t show you is how interconnected this community is. Using another free tool called Gephi and the right data we can plot the relationships in this twitter community, who is friends with whom (read more about getting started with Gephi). In the image below pictures of people are replaced with circles and friendships are depicted by drawing a line between circles.

CitizenRelay Community

There are almost 7,000 relationships shown in the image, so it can be a bit overwhelming to digest. Using Gephi it is possible to interactively explore individual relationships. For example, the image below shows the people I’m friends with who used the #CitizenRelay tag.

CitizenRelay Sub-community

A look at what #CitizenRelay reporters said

Using the same technique for plotting relationships it’s also possible to do something similar with what people said using the #CitizenRelay tag. By plotting tweets that mention or reply to other people we get:

citizenrelay-conversation

This image is evidence that #CitizenRelay wasn’t just a broadcast, but a community of people sharing their stories. Visualising Twitter conversations is one of my interests and I’ve developed this interactive tool which lets you explore the #CitizenRelay tweets.

CitizenRelay Interactive Archive

So there you go, some examples of what you can do with free tools and a bit of data. I’m sure there are many more stories to be found in CitizenRelay.


On Tuesday 19th June I’ll be presenting at the Institutional Web Manager Workshop (IWMW) in Edinburgh … twice! Tony Hirst and I are continuing our tour, which started at the JISC CETIS Conference 2012, before hitting the stage at GEUG12. For IWMW12 we are doing a plenary and workshop around data visualisation (the plenary being a taster for our masterclass workshop). I’ll be using this post as a holder for all the session resources.

Update: I've also added Tony Hirst's (OU) slides. Tony went on first to introduce some broad data visualisation themes before I went into a specific case study.

The draft slides for my part of the plenary are embedded below and available from Slideshare and Google Presentation (the slides are designed for use with pptPlex, but hopefully they still make sense). For the session I’m going to use the OER Visualisation Project to illustrate the processes required to get a useful dataset and how the same data can be visualised in a number of ways depending on audience and purpose. Update: I should have said the session should be streamed live; details will appear on the IWMW site.

Update: As a small aside I've come up with a modified version of Craig Russell's UK Universities Social Media table, as mentioned in the Further Evidence of Use of Social Networks in the UK Higher Education Sector guest post on UKWebFocus (something more 'glanceable'). Using the Twitter account list as a starting point I've looked at how University accounts follow each other and come up with this (click on the image for an interactive version).

If you have any questions feel free to leave a comment or get in touch.


Lou McGill from the JISC/HEA OER Programme Synthesis and Evaluation team recently contacted me as part of the OER Review, asking if there was a way to analyse and visualise the Twitter followers of @SCOREProject and @ukoer. Having recently extracted data for the @jisccetis network of accounts I knew it was easy to get the information, but making it meaningful was another question.

There are a growing number of sites like twiangulate.com and visual.ly that make it easy to generate numbers and graphics. One of the limitations I find with these tools is that they produce flat images, and all opportunities for ‘visual analytics’ are lost.

Click to see twiangulate comparison of SCOREProject and UKOER
Twiangulate data
Click to see visual.ly comparison of SCOREProject and UKOER
create infographics with visual.ly

So here’s my take on the problem: a template constructed with free and open source tools that lets you visually explore the @SCOREProject and @ukoer Twitter following.

Comparison of @SCOREProject and @ukoer

In this post I’ll give my narrative on the SCOREProject/UKOER Twitter followership and give you the basic recipe for creating your own comparisons (I should say the solution isn’t production quality, but I need to move on to other things so someone else can tidy it up).

Let’s start with the output. Here’s a page comparing the Twitter following of SCOREProject and UKOER. At the top each bubble represents someone who follows SCOREProject or UKOER (hovering over a bubble shows who they are and clicking filters the summary table at the bottom).

Bubble size matters

There are three options for how the bubbles are sized:

  • Betweenness Centrality (a measure of community bridging capacity; see Sheila’s post on this);
  • In-Degree (how many other people who follow SCOREProject or ukoer also follow the person represented by the bubble); and
  • Followers count (how many people follow the person represented by the bubble).

Clicking on the ‘Grouped’ button lets you see how bubbles/people follow either the SCOREProject, UKOER or both. By switching between betweenness, in-degree and followers we can visually spot a couple of things:

  • Betweenness Centrality: SCOREProject has 3 well connected intercommunity bubbles: @GdnHigherEd, @gconole and @A_L_T. UKOER has the SCOREProject following them, which unsurprisingly makes them a great bridge to the SCOREProject community (if you are wondering where UKOER is: as they don’t follow SCOREProject they don’t appear).
  • In-Degree: Switching to In-Degree we can visually see that the overall volume of the UKOER group grows more, despite the SCOREProject bubble in this group decreasing substantially. This suggests to me that the UKOER following is more interconnected.
  • Followers count: Here we see SCOREProject is the biggest winner thanks to being followed by @douglasi, who has over 300,000 followers. So whilst SCOREProject is followed by fewer people than UKOER, it has a potentially greater reach if @douglasi ever retweeted a message.

Colourful combination

Sticking with the grouped bubble view we can see different colour groupings within the clusters for SCOREProject, UKOER and both. The most noticeable is the light green used to identify Group 4, which has 115 people following SCOREProject compared to 59 following UKOER. The groupings are created using the community structure detection algorithm proposed by Joerg Reichardt and Stefan Bornholdt. To give a sense of who these sub-groups might represent, individual wordclouds have been generated based on the individual Twitter profile descriptions. Clicking on a word within these clouds filters the table, so for example you can explore who has used the term manager in their twitter profile (I have to say the update isn’t instant, but it’ll get there).

wordclouds

Behind the scenes

The bubble chart is coded in d3.js and based on the Animated Bubble Chart by Jim Vallandingham. The modifications I made were to allow bubble resizing (lines 37-44), which also required handling the bubble charge slightly differently (line 118). I got the idea of using the bubble chart for comparison from the Twitter Abused post Rape Culture and Twitter Abuse. It also made sense to reuse Jim’s template, which uses Twitter Bootstrap. The wordclouds are also rendered using d3.js, via the d3.wordcloud extension by Jason Davies. Finally, the table at the bottom is rendered using the Google Visualisation API/Google Chart Tools.
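The resizing modification boils down to something like this sketch (the metric field names on each node are assumptions, not Jim’s actual variable names):

// Re-scale bubble radii when the user picks a different metric.
// 'betweenness', 'indegree' and 'followers' are assumed node fields.
function resizeBubbles(nodes, metric) {
  var rScale = d3.scale.sqrt()  // sqrt so area, not radius, tracks the value
      .domain([0, d3.max(nodes, function (d) { return d[metric]; })])
      .range([2, 40]);
  d3.selectAll('circle.node')
    .transition().duration(750)
    .attr('r', function (d) { return rScale(d[metric]); });
}
// A larger radius also needs a larger repulsive charge in the force
// layout, which is why the charge handling had to change too.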

All the components play nicely together although the performance isn’t great. If I have more time I might play with the load sequencing, but it could be I’m just asking too much of things like the Google Table chart rendering 600 rows. 

How to make your own

I should say that this recipe probably won’t work for accounts with over 5,000 followers. It also involves using R (in my case RStudio), which is used to do the network analysis/community detection side. You can download a copy of the script here. There’s probably an easier recipe that skips this part, which is worth revisiting.

  1. We start with taking a copy of Export Twitter Friends and Followers v2.1.2 [Network Mod] (as featured in Notes on extracting the JISC CETIS twitter follower network).
  2. Authenticate the spreadsheet with Twitter (instructions in the spreadsheet) and then get the followers of the accounts you are interested in using the Twitter > Get followers menu option 
  3. Once you’ve got the followers run Twitter > Combine follower sheets Method II
  4. Move to the Vertices sheet and sort the data on the friends_count column
  5. In batches of around 250 rows select values from the id_str column and run TAGS Advanced > Get friend IDs – this will start populating the friends_ids column with data. For users with over 5,000 friends reselect their id_str and rerun the menu option until the ‘next_cursor’ equals 0 
    next cursor position
  6. Next open the Script editor and open the TAGS4 file and then Run > setup.
  7. Next select Publish > Publish as a service… and allow anyone to invoke the service anonymously. Copy the service URL and paste it into the R script downloaded earlier (also add the spreadsheet key to the R script and, within your spreadsheet, File > Publish to the web) 
    publish as service window
  8. Run the R script! ...  and fingers crossed everything works.

The files used in the SCOREProject/UKOER comparison can be downloaded from here. Changes you’ll need to make are adding the output csv files to the data folder, changing the references in js/gtable.js and js/wordcloud.js, and changing the labels used in coffee/coffee.vis.

So there you go. I’ve spent way too much of my own time on this and haven’t really explained what is going on. Hopefully the various comments in the source code remove some of the magic (I might revisit the R code, as in some ways I think it deserves a post of its own). If you have any questions or feedback leave them in the comments ;)