Visualisation


Update: Nicola Osborne (EDINA) has kindly live-blogged the session so extensive notes are here

Later today I’ve been invited by the University of Edinburgh Data Library team to talk about data visualisation. The abstract I submitted and slides are below. Putting a slidedeck like this together is always useful as you mentally sort through the pieces of knowledge you’ve obtained, which in my case is only from the last year or so. It’s also a little unnerving to think how much more is still out there (known unknowns and unknown unknowns). The slides contain links to sources (when you get to the data/vis matrix some of the thumbnails are live links); here’s also a bundle of the top-level links.

There are a number of examples throughout history where visualisations have been used to explore or explain problems. Notable examples include Florence Nightingale's 'Mortality of the British Army' and John Snow's Cholera Map of London. Recently the increased availability of data, and of software for analysing and generating various views on this data, has made it easier to generate data visualisations. In this presentation Martin Hawksey, advisor at the Jisc Centre for Educational Technology and Interoperability Standards (Cetis), will demonstrate simple techniques for generating data visualisations: using tools (including MS Excel and Google Spreadsheets), drawing packages (including Illustrator and Inkscape) and software libraries (including d3.js and timeline.js). As part of this participants will be introduced to basic visual theories and the concepts of exploratory and explanatory analytics. The presentation will also highlight some of the skills required for discovering and reshaping data sources.


Sankey Diagram

“Sankey diagrams are a specific type of flow diagram, in which the width of the arrows is shown proportionally to the flow quantity.” – Wikipedia.

I first came across Sankey diagrams via (one of the latest contributors to the OKFN’s School of Data ;) Tony Hirst in d3.js Powered Sankey Diagram. Subsequently Bruce McPherson showed how to create Sankey diagrams from Excel, also using d3.js.

Having collected some survey data for Analytics and Institutional Capabilities at #cetis13 (here’s a copy of the form) we were looking for a way to take us beyond the Google Form reporting and gain extra insight. In particular I was interested in trying to see if there were any relationships between the multiple choice questions. Using a Sankey diagram seemed like a possible solution and my colleague David Sherlock quickly came up with a modification of Mike Bostock’s Sankey example to accept a csv input (I’ll link if/when it gets written up).

Seeing this I thought it might be useful to make a wrapper to generate Sankey diagrams for data stored in a Google Sheet. The solution was relatively straightforward, using the Google Visualisation API to get the data in the right shape for d3.js. An example of the prototype is here.
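To give a flavour of what the wrapper does (this is only a sketch under assumptions: the function and variable names are illustrative rather than the prototype's actual code), it queries the published sheet with the Visualisation API and reshapes the rows into the nodes/links arrays the d3.js Sankey plugin expects:

// Sketch only: pull source/target/value rows from a published Google Spreadsheet
// and reshape them for the d3.js Sankey plugin (names are illustrative).
function drawSankey(spreadsheetUrl) {
  var query = new google.visualization.Query(spreadsheetUrl);
  query.send(function (response) {
    if (response.isError()) { return; }
    var data = response.getDataTable();
    var nodes = [], links = [], index = {};
    function nodeId(name) {
      if (!(name in index)) { index[name] = nodes.length; nodes.push({name: name}); }
      return index[name];
    }
    for (var r = 0; r < data.getNumberOfRows(); r++) {
      links.push({
        source: nodeId(data.getValue(r, 0)), // first column: source label
        target: nodeId(data.getValue(r, 1)), // second column: target label
        value: data.getValue(r, 2)           // third column: flow quantity
      });
    }
    // {nodes: nodes, links: links} can then be handed to d3.sankey()
    // exactly as in Mike Bostock's example.
  });
}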

There’s no interface yet for you to select a spreadsheet, sheets, columns etc., but you can take your Google Spreadsheet, ‘publish to the web’, and then add the following data to the url.

One of the issues with this solution is you might not want to make all your data available. To get around this I’ve written a Google Apps Script that lets you use a custom formula to preformat the data. To see this in action this Spreadsheet contains an example. The formula is in cell A1 and uses the format =setSankey(datarange, cols, separator); a rough sketch of how such a function might work follows the parameter list below.

  • datarange – sheet/cell reference for the source data e.g. 'Form Responses - Edit'!A2:D (note: must start with column A)
  • cols – comma separated list of columns used to generate the chart e.g. "B,C,D"
  • separator {optional} – used to split multi-value cells; defaults to ", " e.g. " | "
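The gist has the working code, but here's a rough sketch of the idea (the sheet access, column mapping and counting below are simplified assumptions, not a copy of the gist): the custom function reads the source range, splits multi-value cells and counts the transitions between each pair of adjacent chosen columns, returning source/target/value rows.

// Sketch of a setSankey-style custom function (illustrative; see the linked gist for the real thing).
function setSankey(datarange, cols, separator) {
  separator = separator || ", ";
  var parts = datarange.split("!");
  var sheet = SpreadsheetApp.getActiveSpreadsheet()
      .getSheetByName(parts[0].replace(/'/g, ""));
  var values = sheet.getRange(parts[1]).getValues();
  // Map column letters to indexes (works because the range starts at column A).
  var colIdx = cols.split(",").map(function (c) {
    return c.trim().toUpperCase().charCodeAt(0) - 65;
  });
  var counts = {};
  values.forEach(function (row) {
    for (var i = 0; i < colIdx.length - 1; i++) {
      row[colIdx[i]].toString().split(separator).forEach(function (from) {
        row[colIdx[i + 1]].toString().split(separator).forEach(function (to) {
          if (!from || !to) { return; }
          var key = from + "\t" + to;
          counts[key] = (counts[key] || 0) + 1;
        });
      });
    }
  });
  var output = [["source", "target", "value"]];
  for (var key in counts) {
    output.push(key.split("\t").concat(counts[key]));
  }
  return output; // written into the sheet starting at the formula cell
}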

To use this in your own spreadsheets open Tools > Script editor and copy the code from this gist. Here’s an example url using pre-processed data. The main difference is the addition of &output=1 to the querystring.

Obviously creating your own querystrings to render the data isn’t ideal and it would be relatively straightforward to create a UI wrapper similar to the one used in EDGESExplorer, but it’s something I’m reluctant to do unless there is enough demand. The other consideration is the question – does the Sankey diagram provide useful insight for this type of data or is it just more ‘damn lies’?

It would have of course been nice to write a Google Gadget to include this in a Spreadsheet … but Google are discontinuing those :(.

I should say this post contains a lot of technical information, doesn't give much background and is mainly for my hard-core followers.

This is a very loose sketch of an experiment I might refine, which uses Jason Davies’ wordcloud script (an add-on for d3.js) as a way to filter data hosted in a Google Spreadsheet. I was essentially interested in the Twitter account descriptions of the community using the Social Media Week – Glasgow hashtag, but a minor detour has reminded me you can:

  • get json data straight from a Google Spreadsheet
  • build dynamic queries to get exactly what you want

So I fired up NodeXL this morning and got this pretty graph of how people using the #smwgla hashtag at least twice follow each other.

people using the #smwgla hashtag at least twice follow each other

One of the features of NodeXL is the ability to add stats columns to your data, including friend/follower counts, location and profile descriptions.

NodeXL Stats

Uploading the data from NodeXL (Excel) to Google Spreadsheets allows me to render an interactive version of the community graph using my NodeXL Google Spreadsheet Graph Viewer.

interactive version of the #smwgla community graph

All this is doing is grabbing data from Google Spreadsheets using their Visualization API and rendering it visually using javascript/HTML5 canvas. You can use the query language part of this API to get very specific data back (if you want a play try Tony Hirst’s Guardian Datastore Explorer). Using Tony’s tool I got this query built. One thing you might notice is I’m selecting a column of twitter description WHERE it contains(‘’) <- a blank – if it’s a blank why did I bother with the WHERE statement?
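For reference, the query behind that link boils down to something like the following (the column letter comes from the spreadsheet; the search term is just an illustration):

select AH where AH contains('teacher')

which is then URL-encoded into the tq= parameter of the spreadsheets.google.com/tq endpoint.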

Switching to Jason Davies’ wordcloud demo, you can use custom data sources if you have some JSON. In Tony’s tool you have options to get the data in html (tqx=out:html) and csv (tqx=out:csv). There is a third undocumented option for json, tqx=out:json. Using this we can get a url for the wordcloud generator https://spreadsheets.google.com/tq?tqx=out:json&tq=select%20AH%20where%20AH%20contains%28%27%27%29&key=0Ak-iDQSojJ9adGNUUXZnU2k3V1FRTjR3eFp0RmRNZWc&gid=118

To make the wordcloud interactive, so that when you click on a term it filters the data on that term, we can use the option to include {word} in our source url e.g. https://spreadsheets.google.com/tq?tqx=out:json&tq=select%20AH%20where%20AH%20contains%28%27{word}%27%29&key=0Ak-iDQSojJ9adGNUUXZnU2k3V1FRTjR3eFp0RmRNZWc&gid=118

And here is the final result, an interactive wordcloud of #smwgla Twitter account descriptions [Note: you need to hit the Go button when you click-through]:

interactive wordcloud of #smwgla Twitter account descriptions

Is the end result useful? Not sure, but how the data is extracted is (to me anyway).


I recently had a chance to spend some time with Marc Smith, co-founder of the Social Media Research Foundation, which is behind the Microsoft Excel network analysis add-in NodeXL. I’ve done a couple of blog posts now with NodeXL and after a prompt from Marc I thought it was worth a revisit. So in this post I’ll highlight some of the new features of NodeXL's Twitter Search tools that make it a useful tool for community/resource detection and analysis.

Importing a Twitter Search – expanding all the urls

Before going too far I should also point out there has been a separate social network importer for Facebook Fan pages for a while now. On the new Twitter Search import there is now an option to ‘Expand URLs in tweets’. This is useful because Twitter now wraps all links in its own shortening service, t.co. The shortened urls are also unique for each tweet* even if the target url is the same. Having a feature that expands these is useful to see what people are linking to (it makes it easier to see if people are sharing the same resources or resources from the same domain). And, as you’ll see later, it makes the data easier to use in mashups.

*except new style RTs which use the same t.co link

Expand URLs options

Did you know you can use urls and website domains in your search? This is a trick I’ve been using for a long time and I’m not sure how widely known it is. For example here is everyone who has been sharing the new Creative Commons License chooser at http://creativecommons.org/choose/, or everyone sharing a link to anything on the Creative Commons website domain. In Tweetdeck I use a search column with ‘hawksey.info OR cetis.ac.uk’ to pick up any chatter around these sites.

Replicable research and the (almost) one button expert

NodeXL has been a great tool for me to start learning about network analysis, but as I play with various settings I’m conscious that I’m missing some of the basic tricks to get the data into a meaningful shape. For a while now people have been able to upload and share their network analysis in the NodeXL Graph Gallery. This includes downloading the NodeXL data as an Excel Workbook or GraphML (a nice way to allow replicable research).

An even newer feature is to download the NodeXL Options the graph author used. This means a relative amateur like myself with no sociological background, and unlike Marc unaware of what the increasing popularity of zombie films might be saying about our society (although they can be used to explain betweenness centrality), can tap into their expertise and format a graph in a meaningful way with a couple of clicks. There’s still the danger that you don’t understand the graph, but it can still be a useful jumpstart.

Import NodeXL Options

Twitter Search Top Items Summary

The next new thing is a Twitter Search Network Top Items page. I did a search for ‘#oer OR #ukoer’ to pull the last 7 days’ tweets. By importing the options from this NodeXL Graph Gallery example and running ‘Automate’ you can reuse my settings on your own search result. By running Graph Metrics > Twitter search network top items (part of my Automate options) I get this sheet, which I’ve uploaded to a Google Spreadsheet.

Twitter Search Network Top Items page

This sheet lets you quickly see overall and group level:

  • Top Replied-To in Entire Graph
  • Top Mentioned in Entire Graph
  • Top URLs in Tweet in Entire Graph
  • Top Hashtags in Tweet in Entire Graph
  • Top Tweeters in Entire Graph

Detection

These are useful summaries to look at who is most active in the community, which urls are being shared most, and overlapping tag communities. I admit that it can look like a scary dashboard of stuff which not all of you will like, but NodeXL is a network graphing tool so it’s easy to visually explore the data.

So looking at the macro level we can quickly graph the ripples typical within a Twitter community, which mainly show the effects of retweets (this view was extracted from my online NodeXL Google Spreadsheet Graph Viewer). This can help you quickly see the smaller clusters within the community who are generating retweets and conversations.

Retweet ripples

Community (group) in a box

Because my data was also on the NodeXL Graph Gallery, Marc kindly created this view which groups sub-communities using an algorithm and overlays the hashtags most used by each sub-community (Marc’s version on NodeXL Graph Gallery). The group hashtag labels, which are in frequency order, are very useful in this situation because the search term I used was pulling in overlapping hashtag communities (#oer and #ukoer). So looking for boxes where ‘ukoer’ is near the beginning would indicate they are from the UK crowd.

oer/ukoer graph by Marc Smith

Getting more from the data

Earlier I mentioned that having expanded urls was useful for further analysis. Something I quickly played with that I’m not entirely sure how to get the most out of (if anything) is reusing my RSS Feed Social Share Counting Google Spreadsheet code to get social share data from the most tweeted links. Here’s the result (embedded below). Let me know if you have any thoughts on how it can be used:


Yesterday I got a query about search term clustering, not a topic I’ve got a huge amount of experience with, so a quick look at the Wikipedia section on free open-source data mining software and applications turned up ‘Carrot2: Text and search results clustering framework’. Carrot2 (C2) has a desktop application (Carrot2 Workbench) and its clustering engine is used in a number of other tools including Apache Solr. You can also have a play with an online version of Carrot2. Out of the box you can use the desktop application to query and cluster a number of existing sources including Bing and Wikipedia. If you want to play with other data sources you can point C2 at other XML feeds or even XML documents, as long as they are in the Carrot2 format. The structure of this is relatively straightforward: all you need is a title, url and snippet (the url appears to be used for one of the clustering algorithms and by part of the application interface to let you navigate to documents, so you could probably fill it with junk if you don’t have a valid link).

To have a quick play with this I thought I’d see what would happen if I passed a twitter archive for #or2012 into C2 and here’s the result.

Getting the data into C2

There are a number of ways I could have got the data out in C2 XML format, like exporting a range to csv and converting it to xml with an XSLT stylesheet, or using the new Apps Script ContentService to generate a custom xml file. Instead, for speed, I decided to use Google Refine to import the Spreadsheet straight from Google Docs:

Google Refine Create Project from Google Doc


... and then use the Templating Export tool to generate the C2 xml.

Google Refine Templating Export


For ease, here are the values I used for the prefix, row template, row separator and suffix if you want to copy and paste.

Prefix

<?xml version="1.0" encoding="UTF-8"?>
  <searchresult>
    <query>#or2012</query>

Row template

    <document id="{{row.index}}">
      <title>{{cells["text"].value}}</title>
      <url>{{cells["status_url"].value}}</url>
      <snippet>
        {{cells["text"].value}}
      </snippet>
    </document>

Row separator

[single carriage return]

Suffix

</searchresult>

If you want to play along here is a copy of the #or2012 archive in C2 xml
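As an aside, the Apps Script ContentService route mentioned earlier would look roughly like this (a sketch only: the sheet and column names are assumptions based on the archive template, and the script would need to be published as a service/web app):

// Rough sketch (not the route I took) of serving a TAGS archive sheet as
// Carrot2-format XML via Apps Script's ContentService. Sheet/column names assumed.
function doGet() {
  var rows = SpreadsheetApp.getActiveSpreadsheet()
      .getSheetByName("Archive").getDataRange().getValues();
  var header = rows.shift();
  var textCol = header.indexOf("text");
  var urlCol = header.indexOf("status_url");
  var xml = '<?xml version="1.0" encoding="UTF-8"?>\n<searchresult>\n  <query>#or2012</query>\n';
  rows.forEach(function (row, i) {
    xml += '  <document id="' + i + '">\n' +
           '    <title><![CDATA[' + row[textCol] + ']]></title>\n' +
           '    <url>' + row[urlCol] + '</url>\n' +
           '    <snippet><![CDATA[' + row[textCol] + ']]></snippet>\n' +
           '  </document>\n';
  });
  xml += '</searchresult>';
  return ContentService.createTextOutput(xml)
      .setMimeType(ContentService.MimeType.XML);
}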

Processing the data in Carrot2

Open Carrot2 Workbench and in the search panel set the source to XML, pick an algorithm (STC gave a decent result), and set the XML resource (which can be the url to the dropbox file included above or a local copy), then scroll the panel down a bit and hit the Process button.

Carrot2 Workbench Search Panel


The results

Switching to ‘Visualization’ mode using the button at the top right of Workbench I get this view (click to enlarge):

Carrot2 Workbench Visualization Mode


The interface is designed to be used as ‘exploratory analytics’. Clicking on elements like the ‘Clusters’ folder list updates the view in the Aduna Cluster Map and Circles Visualisation as well as listing ‘Documents’ (tweets) related to the cluster. Clicking on a tweet from the Documents list or one of the dots in the Circles Visualization opens it in a tab within Workbench (hopefully that keeps Twitter happy for liberating their data ;s).

View a tweet

View a tweet in Carrot2

Useful/meaningful?

This has only been a quick first play so I’m sure I’m missing loads in terms of tuning the processing. The STC algorithm appears to be very coarse-grained, detecting 16 or so clusters. It’s useful to have a cluster of retweets which could be exported and further analysed. Switching to the Lingo algorithm generates 102 clusters, a number of these being RT+screenname. In some ways it would be useful to define some stopwords like ‘or2012’ and ‘RT’ (I’m sure an option must be in there). Part of the reason for publishing the raw data for this is in the hope that someone who actually knows what they are doing can show me what is possible. So over to you, I’ve shown you how to get the data in, help me get something meaningful out ;)


For the last couple of days I’ve been at IWMW12, hosted this year by the University of Edinburgh. I’ve already posted Data Visualisation Plenary/Workshop Resources which has my slides from the plenary. I was teaming up with Tony Hirst (OU) and have added his slides to that page.

Because of living 'almost locally' and other family commitments I missed out on most of the social events; instead I got drunk on data, working into the early hours to find what stories I could uncover from the #IWMW12 stream. In this post I’ll show you what I’ve come up with and some of the highlights in trying to turn raw data into something interesting/meaningful (or pointless if you prefer). Interestingly a lot of what I cover here uses the same techniques as my recent The story data tells us about #CitizenRelay guest post, so I’ve got an emerging templated workflow which I can deploy at events, which makes me wonder if I should be getting organisers to pay my travel/accommodation as an event data amplifier?

UK University Twitter Account Community

On day one Brian Kelly mentioned some work by Craig Russell to collate a table of UK University Social Media accounts which featured in a guest post on Brian’s blog titled Further Evidence of Use of Social Networks in the UK Higher Education Sector. You can get the data Craig has compiled from a Google Spreadsheet. Looking at this two things immediately sprung to mind. First that the document could be made more ‘glanceable’ just using some simple conditional formatting, and second there was a nice list of Twitter accounts to do something with.


Here’s a link to my modified version of Craig’s spreadsheet. It uses the importRange formula to pull the data in, so it creates a live link to the source document. For the conditional formatting I looked for text containing ‘http’, turning the cell text and background green. The HTML view of this is a lot cleaner looking.

On the Twitter Accounts sheet I extract the account screen names by pulling everything after the last ‘/’ and remove most of the blank rows using a unique formula.
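One way of doing that in current Google Sheets (the cell references are assumptions and these may not be the exact formulas used):

=REGEXEXTRACT(B2, "([^/]+)/*$")  (pulls the screen name after the last "/"; filled down the column)
=UNIQUE(C2:C)  (collapses the extracted names into a de-duplicated list)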

Putting this list into the free MS Excel add-in NodeXL and using Import > From Twitter List Network lets you get data on which of these accounts follow each other. I played around with visualising the network in NodeXL but found it easier in the end to put the data into Gephi, getting the image below. These ‘hairballs’ have limited value and you’re best having a play with the interactive version, which is an export of Gephi visualised using the gexf-js tool by Raphaël Velt. (De-hairballing is something Clement Levallois (@seinecle) is working on, and he kindly sent me a link to a new tool he’s creating called Gaze.)

UK HEI Twitter Accounts

The #IWMW12 Twitter Archive Two More Ways

Timeline

As part of #iwmw12 I was collecting an archive of tweets which already gives you the TAGSExplorer view. I also use the Topsy API and a Google Spreadsheet to extract tweets which are then passed into Timeline by Vérité, which gives you a nice sense of the event. [If anyone else would like to make their own twitter media timeline there is a template in this post (it is as easy as making a copy of the template, entering your search terms and publishing the sheet).]

Searchable archive

New way number one is a filterable/searchable archive of IWMW12 tweets. Using the Google Visualisation API I can create a custom interface to the Google Spreadsheet of tweets. This solution uses some out-of-the-box functionality including table paging, string filtering and pattern formatting. Using the pattern formatter was the biggest achievement as it allows you to insert Twitter Web Intents functionality (so if you click to reply to a tweet it pulls up Twitter’s reply box).
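The pieces fit together roughly like this (a sketch only: the container ids, column indexes and the publishedSheetUrl variable are assumptions, not the archive's actual code):

// Sketch of the out-of-the-box pieces: a paged Table chart, a StringFilter
// control, and a PatternFormat column injecting a Twitter Web Intents reply link.
google.load("visualization", "1", {packages: ["controls", "table"]});
google.setOnLoadCallback(function () {
  var query = new google.visualization.Query(publishedSheetUrl); // assumed url variable
  query.send(function (response) {
    var data = response.getDataTable();
    // Build a reply link from an assumed tweet id column (column 0 here).
    var formatter = new google.visualization.PatternFormat(
        '<a href="https://twitter.com/intent/tweet?in_reply_to={0}">reply</a>');
    formatter.format(data, [0], 0);

    var dashboard = new google.visualization.Dashboard(document.getElementById("dash"));
    var filter = new google.visualization.ControlWrapper({
      controlType: "StringFilter",
      containerId: "filter",
      options: {filterColumnIndex: 1} // assumed column holding the tweet text
    });
    var table = new google.visualization.ChartWrapper({
      chartType: "Table",
      containerId: "table",
      options: {page: "enable", pageSize: 25, allowHtml: true}
    });
    dashboard.bind(filter, table).draw(data);
  });
});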

I also processed the archive using R to get term frequencies to make a d3-based wordcloud. (I’ve started looking at how this can be put into a more general tool. Here’s my current draft, which you should be able to point any TAGS spreadsheet at; this version also includes a Chart Range Filter letting you view a time range. I definitely need to write more about how this was done!)

Filter by time

Mappable archive

One of the last things I did was to filter the twitter archive for tweets with geo-location. Using the Spreadsheet Mapper 3.0 template I was able to dynamically pull the data to generate a time stamped KML file. The timestamps are ignored when you view in Google Maps, but if you download the kml file it can be played in Google Earth (you’ll have to adjust the playback control to separate the playback heads – I tried doing this in the code but the documentation is awful!)

Google Earth playback

Or if you have the Google Earth browser plugin a web version of IWMW12 geo-tweets is here (also embedded below):

So there you go … or as said by Sharon Steeples

Originally posted on CitizenRelay

Telling stories with data is a growing area for journalism and there is already a strong community around Data Driven Journalism (DDJ). I’m not a journalist; by day I’m a Learning Technology Advisor for JISC CETIS, but my role does allow me to explore how data can be used within education. Often this interest spills into the evenings where I ‘play’ with data and ways to visualise the hidden stories. Here are a couple of ways I’ve been playing with data from the CitizenRelay:

A time

One of the first things I did was produce a Timeline of the CitizenRelay videos and images shared on Twitter. This uses the Topsy web service to find appropriate tweets, which are stored in this Google Spreadsheet template and then displayed in the Timeline by Vérité tool (an open source tool for displaying media in a timeline). The result is a nice way to navigate material shared as part of CitizenRelay and an indication of the amount of media shared by people.

 Timeline of the CitizenRelay videos and images shared on Twitter

A time and place

As part of the CitizenRelay, Audioboo was used to record and share interviews. For a data wrangler like myself Audioboo is a nice service to work with because they provide a way to extract data from their service in a readable format. One of the undocumented options is to extract all the clips with a certain tag in a format which includes data about where the clip was recorded. Furthermore this format is readable by other services, so with a couple of clicks we can get a Google Map of CitizenRelay Boos which you can click on to find the audio clips.

 Google Map of CitizenRelay Boos

One experiment I tried which didn’t entirely work out the way I wanted was to add date/time to the Audioboo data and also embed the audio player. This datafile (generated using this modified Google Spreadsheet template) can be played in Google Earth, allowing you to see where Boos were created and when they were created with a timeslider animation, and to play back the clips directly. This experiment was only partially successful because I would prefer the embedded player to work without having to download Google Earth.

 Google Earth of CitizenRelay Boos

A look at who #CitizenRelay reporters were

So far we have mainly focused on the content, but let’s now look at the many eyes and ears of the CitizenRelay who helped share and create stories on Twitter.

CitizenRelay Many eyes

The image shows the profile pictures of over 600 people who used the #CitizenRelay tag on Twitter so far this year. This image was generated using a free add-in for Microsoft Excel called NodeXL, read more about getting started with NodeXL. What that image doesn’t show you is how interconnected this community is. Using another free tool called Gephi and with the right data we can plot the relationships in this twitter community, who is friends with who (read more about getting started with Gephi). In the image below pictures of people are replaced with circles and friendships are depicted by drawing a line between circles.

CitizenRelay Community

There are almost 7,000 relationships shown in the image so it can be a bit overwhelming to digest. Using Gephi it is possible to interactively explore individual relationships. For example, the image below shows the people I’m friends with who used the #CitizenRelay tag.

CitizenRelay Sub-community

A look at what #CitizenRelay reporters said

Using the same technique for plotting relationships it’s also possible to do something similar with what people said using the #CitizenRelay tag. By plotting tweets that mention or reply to other people we get:

citizenrelay-conversation

This image is evidence that #CitizenRelay wasn’t just a broadcast, but a community of people sharing their stories. Visualising Twitter conversations is one of my interests and I’ve developed this interactive tool which lets you explore the #CitizenRelay tweets.

CitizenRelay Interactive Archive

So there you go: some examples of what you can do with free tools and a bit of data. I’m sure there are many more stories to be found in CitizenRelay.



On Tuesday 19th June I’ll be presenting at the Institutional Web Manager Workshop (IWMW) in Edinburgh … twice! Tony Hirst and I are continuing our tour, which started at the JISC CETIS Conference 2012, before hitting the stage at GEUG12. For IWMW12 we are doing a plenary and workshop around data visualisation (the plenary being a taster for our masterclass workshop). I’ll be using this post as a holder for all the session resources.

Update: I've also added Tony Hirst's (OU) slides. Tony went on first to introduce some broad data visualisation themes before I went into a specific case study.

The draft slides for my part of the plenary are embedded below and available from Slideshare and Google Presentation (the slides are designed for use with pptPlex, but hopefully they still make sense). For the session I’m going to use the OER Visualisation Project to illustrate the processes required to get a useful dataset and how the same data can be visualised in a number of ways depending on audience and purpose. Update: I should have said the session should be streamed live; details will appear on the IWMW site.

Update: As a small aside I've come up with a modified version of Craig Russell's UK Universities Social Media table as mentioned in Further Evidence of Use of Social Networks in the UK Higher Education Sector guest post on UKWebFocus (something more 'glanceable'). Using the Twitter account list as a starting point I've looked at how University accounts follow each other and come up with this (click on the image for an interactive version).

If you have any questions feel free to leave a comment or get in touch.


Lou McGill from the JISC/HEA OER Programme Synthesis and Evaluation team recently contacted me as part of the OER Review, asking if there was a way to analyse and visualise the Twitter followers of @SCOREProject and @ukoer. Having recently extracted data for the @jisccetis network of accounts I knew it was easy to get the information, but making it meaningful was another question.

There are a growing number of sites like twiangulate.com and visual.ly that make it easy to generate numbers and graphics. One of the limitations I find with these tools is that they produce flat images and all opportunities for ‘visual analytics’ are lost.

Click to see twiangulate comparison of SCOREProject and UKOER
Twiangulate data
Click to see visual.ly comparison of SCOREProject and UKOER
create infographics with visual.ly

So here’s my take on the problem: a template constructed with free and open source tools that lets you visually explore the @SCOREProject and @ukoer Twitter following.

Comparison of @SCOREProject and @ukoer

In this post I’ll give my narrative on the SCOREProject/UKOER Twitter followership and give you the basic recipe for creating your own comparisons (I should say that the solution isn’t production quality, but I need to move on to other things so someone else can tidy up).

Let’s start with the output. Here’s a page comparing the Twitter following of SCOREProject and UKOER. At the top each bubble represents someone who follows SCOREProject or UKOER (hovering over a bubble we can see who they are, and clicking filters the summary table at the bottom).

Bubble size matters

There are three options to change how the bubbles are sized:

  • Betweenness Centrality (a measure of the community bridging capacity; see Sheila’s post on this);
  • In-Degree (how many other people who follow SCOREProject or ukoer also follow the person represented by the bubble); and
  • Followers count (how many people follow the person represented by the node).

Clicking on the ‘Grouped’ button lets you see whether bubbles/people follow the SCOREProject, UKOER or both. By switching between betweenness, degree and followers we can visually spot a couple of things:

  • Betweenness Centrality: SCOREProject has 3 well connected intercommunity bubbles: @GdnHigherEd, @gconole and @A_L_T. UKOER has the SCOREProject following them, which unsurprisingly makes them a great bridge to the SCOREProject community (if you are wondering where UKOER is, as they don’t follow SCOREProject they don’t appear).
  • In-Degree: Switching to In-Degree we can visually see that the overall volume of the UKOER group grows more, despite the SCOREProject bubble in this group decreasing substantially. This suggests to me that the UKOER following is more interconnected.
  • Followers count: Here we see SCOREProject is the biggest winner thanks to being followed by @douglasi, who has over 300,000 followers. So whilst SCOREProject is followed by fewer people than UKOER it has a potentially greater reach if @douglasi ever retweeted a message.

Colourful combination

Sticking with the grouped bubble view we can see different colour groupings within the clusters for SCOREProject, UKOER and both. The most noticeable is light green, used to identify Group 4, which has 115 people following SCOREProject compared to 59 following UKOER. The groupings are created using the community structure detection algorithm proposed by Joerg Reichardt and Stefan Bornholdt. To give a sense of who these sub-groups might represent, individual wordclouds have been generated based on the individual Twitter profile descriptions. Clicking on a word within these clouds filters the table. So for example you can explore who has used the term manager in their twitter profile (I have to say the update isn’t instant but it’ll get there).

wordclouds

Behind the scenes

The bubble chart is coded in d3.js and based on the Animated Bubble Chart by Jim Vallandingham. The modifications I made were to allow bubble resizing (lines 37-44). This also required handling the bubble charge slightly differently (line 118). I got the idea of using the bubble chart for comparison from a Twitter Abused post, Rape Culture and Twitter Abuse. It also made sense to reuse Jim’s template which uses the Twitter Bootstrap. The wordclouds are rendered with d3.js using the wordcloud extension by Jason Davies. Finally the table at the bottom is rendered using the Google Visualisation API/Google Chart Tools.
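For anyone wanting to see roughly how the wordcloud piece hangs together, here's a minimal d3-cloud sketch (it assumes d3.js and Jason Davies' d3.layout.cloud script are loaded; the word list, sizes and the '#cloud' container are placeholders rather than the template's real data):

// Minimal sketch of the d3-cloud layout; the actual template feeds it the
// profile-description term counts for each sub-group.
var words = [{text: "manager", size: 40}, {text: "learning", size: 25}]; // assumed input shape
d3.layout.cloud()
    .size([600, 400])
    .words(words)
    .rotate(0)
    .fontSize(function (d) { return d.size; })
    .on("end", draw)
    .start();

function draw(laidOutWords) {
  d3.select("#cloud").append("svg")
      .attr("width", 600).attr("height", 400)
    .append("g")
      .attr("transform", "translate(300,200)")
    .selectAll("text")
      .data(laidOutWords)
    .enter().append("text")
      .style("font-size", function (d) { return d.size + "px"; })
      .attr("text-anchor", "middle")
      .attr("transform", function (d) {
        return "translate(" + [d.x, d.y] + ")rotate(" + d.rotate + ")";
      })
      .text(function (d) { return d.text; });
}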

All the components play nicely together although the performance isn’t great. If I have more time I might play with the load sequencing, but it could be I’m just asking too much of things like the Google Table chart rendering 600 rows. 

How to make your own

I should say that this recipe probably won’t work for accounts with over 5,000 followers. It also involves using R (in my case RStudio). R is used to do the network analysis/community detection side. You can download a copy of the script here. There’s probably an easier recipe that skips this part, which is worth revisiting.

  1. We start with taking a copy of Export Twitter Friends and Followers v2.1.2 [Network Mod] (as featured in Notes on extracting the JISC CETIS twitter follower network).
  2. Authenticate the spreadsheet with Twitter (instructions in the spreadsheet) and then get the followers of the accounts you are interested in using the Twitter > Get followers menu option 
  3. Once you’ve got the followers run Twitter > Combine follower sheets Method II
  4. Move to the Vertices sheet and sort the data on the friends_count column
  5. In batches of around 250 rows select values from the id_str column and run TAGS Advanced > Get friend IDs – this will start populating the friends_ids column with data. For users with over 5,000 friends reselect their id_str and rerun the menu option until the ‘next_cursor’ equals 0 
    next cursor position
  6. Next open the Script editor and open the TAGS4 file and then Run > setup.
  7. Next select Publish > Publish as a service… and allow anyone to invoke the service anonymously. Copy the service URL and paste it into the R script downloaded earlier (also add the spreadsheet key to the R script and within your spreadsheet File > Publish to the web) 
    publish as service window
  8. Run the R script! ...  and fingers crossed everything works.

The files used in the SCOREProject/UKOER can be downloaded from here. Changes you’ll need to make are adding the output csv files to the data folder, changing references in js/gtable.js and js/wordcloud.js and the labels used in coffee/coffee.vis

So there you go. I’ve spent way too much of my own time on this and haven’t really explained what is going on. Hopefully the various comments in the source code remove some of the magic (I might revisit the R code as in some ways I think it deserves a post of its own). If you have any questions or feedback leave them in the comments ;)


This morning I finished listening to Episode 5 of Data Stories: How To Learn Data Visualization. Data Stories is a bi-weekly podcast on data visualisation produced by Enrico Bertini and Moritz Stefaner, episode 5 also featuring Andy Kirk. For anyone interested in Data Visualisation I’d highly recommend you give it a listen.

Like many others I’m at the beginning of my data visualisation journey; one of the things this episode highlighted was that there is a whole world of data visualisation experts out there that I’ve yet to start stealing learning from. Fortunately today another visualisation expert, Nathan Yau (FlowingData), posted his list of Data and visualization blogs worth following. Perfect!

I could’ve gone through the list and individually subscribed to each of the blogs’ feeds but I’m lazy (so lazy that a 15 minute hack has turned into a 3 hour write-up <sigh>) and just wanted to dump them into my Google Reader. This is a problem Tony Hirst has encountered in Feed-detection From Blog URL Lists, with OPML Output. One thing that is not clear is how Tony got his two-column CSV of source urls. There are various tools Tony could have used to do this. Here’s my take on converting a page of blog urls into an OPML bundle.

Step 1 Extracting blogs urls: Method 1 using Scraper Chrome Extension

“Scraper is a Google Chrome extension for getting data out of web pages and into spreadsheets.”

Chrome users can grab a copy of Scraper here. Once installed, go to Nathan Yau's Data and visualization blogs worth following, right-click on the first link in the list and select ‘Scrape similar’.

Scraper menu

In the window that opens you should get something similar to the one below. Scraper has two options for identifying the parts of the page you want to extract: XPath or JQuery selectors. Both of these have similar coding structures but for this example I’m going to stick with XPath. XPath basically provides a way to identify parts of the XML/HTML structure and extract content (if you are not familiar with XPath, the w3schools tutorial is a great starting point).

Scraper dialog

In this example Scraper should default to ‘//div[1]/div[2]/ul[1]/li/a’. Here’s a quick explanation of how I read this query. Because it starts with // it will select “nodes in the document from the current node that match the selection no matter where they are”; for me this is the trigger to read the query from right to left, as we are matching an endpoint pattern. So:

match all <a> in all <li> in the first <ul> of the second <div> (<div class="entry-content">) of the first <div> (<div class="entry">)

this gives us the links from the first block of bullet points. We want the links from all of the bullet-point lists, so the pattern we want is

match first <a> in all <li> in all <ul> of second <div> of first <div>

So basically we need to switch a to a[1] and ul[1] to ul, e.g. ‘//div[1]/div[2]/ul/li/a[1]’. Edit the XPath query and in the columns section beneath change the order by clicking and dragging so that @href/URL comes first. Click on the ‘Scrape’ button to get a new preview which should now contain a list of 37 urls. Click on Export to Google Docs … You are now ready to move to Step 2 Auto-discovering feed urls below.

Step 1 Extracting blogs urls: Method 2 using Google Spreadsheet importXML function

Another way to get this data is to scrape it directly using Google Spreadsheets and the importXML function. This function also uses XPath to extract parts of a webpage so we can reuse the query used in Method 1 but get the data straight into a spreadsheet (it’s also a live link, so if Nathan adds a new link the spreadsheet will automatically update to include it). Let’s give it a go.

Create a new spreadsheet and in cells A1 to C1 enter the column headings Link, Title and Url. Next in cell A2 enter:

=ImportXML("http://flowingdata.com/2012/04/27/data-and-visualization-blogs-worth-following/","//div[1]/div[2]/ul/li/a[1]/@href")

Note the addition of @href. This is included to extract the href attribute of the <a>. You should now have a similar list of 37 urls from Nathan’s post. To get titles we could enter another importXML function in cell B2 using the XPath ‘//div[1]/div[2]/ul/li/a[1]’, which will extract the text between <a></a>. Another way is to actually scrape the data from the target url. So in cell B2 enter:

=ImportXML(A2,"//title")

So this will go to the url in A2 (http://infosthetics.com/) and extract anything wrapped in <title>.

Now select cell B2 and fill the column down to get titles for all the urls. Finally we need to select the entire B column and Copy/Paste values only. The reason we do this is Google Spreadsheets only allows 50 importXML functions per spreadsheet and we’ll need 37 more to get the RSS feeds for these sites.

Copy Paste as Values

Step 2 Auto-discovering feed urls

Initially I tried using Feed Autodiscovery With YQL with importXML using an XPath of "//link/@href" but I was not getting any results. So instead I decided to auto-detect the feed directly using importXML. In cell C2 enter:

=ImportXML(A2,"/html/head/link[@rel='alternate'][1]/@href")

This time the XPath starts at the XML tree root (<html>) and looks in the <head> for the first link with the attribute rel='alternate'. From Tony’s post:

Remember, feed autodiscovery relies on a web page containing the following construction in the HTML <head> element:
<link rel="alternate" type="application/rss+xml" href="FEED_URL" title="FEED_NAME" />

[I tried using //link[@rel="alternate" and @type="application/rss+xml"] but Google Spreadsheet didn’t like it, instead grabbing the first rel=alternate link]

Fill cell C2 down the rest of the column to get RSS feeds for the other urls. You’ll notice that there’s an #N/A for http://neoformix.com/; this is because their feed isn’t auto-discoverable. Visiting their site there is an XML link (http://neoformix.com/index.xml) that we can just paste into our spreadsheet (tidying data is a usual process in data visualisation).

Step 3 Generating an OPML bundle

Spreadsheet -> OPML Generator

You should now have a spreadsheet like this one with 3 columns of data (if you used the Scraper extension in Step 1/Method 1 you’ll need to make sure your columns are headed Link, Title and Url for the next step). Next we turn our spreadsheet of feeds into an OPML bundle. Fortunately this step has been made super easy by the Spreadsheet -> OPML Generator. Just follow the instructions on this site and seconds later you’ve got:

OPML File of Nathan Yau’s recommended Data and Visualisation Blogs

And because I’ve imported these into Google Reader here’s an aggregated page of their posts.

Update:

Tony Hirst said:

I said:

and the how to

Feed autodiscovery scraper - based on http://bit.ly/IDuJLu CC-BY psychemedia