Graphs can be a powerful way to represent relationships between data, but they are also a very abstract concept, which means that they run the danger of meaning something only to the creator of the graph. Often, simply showing the structure of the data says very little about what it actually means, even though it’s a perfectly accurate means of representing the data. Everything looks like a graph, but almost nothing should ever be drawn as one. Ben Fry in ‘Visualizing Data’
I got that quote from Dan Brickley’s post Linked Literature, Linked TV – Everything Looks like a Graph and like Dan I think Ben Fry has it spot on. When I started following Tony’s work on network analysis (here’s a starting point of posts), my immediate response was ‘Where’s Wally?’, where was I in relationship to my peers, who was I connected to, or even who wasn’t I connected to.
As I start my exploration of tools like NodeXL it’s very clear that being able to filter, probe and wander through the data provides far more insights to what’s going on. This is why when I, and I’m sure Tony as well, show our tangled webs it’s designed as a teaser to inspire you to follow our recipes and get stuck into the data yourself. This isn’t however always practical.
But what does this graph actually mean? I could start highlighting parts of the story, but that would be my interpretation of the data. I could give you the NodeXL file to download and look at, but you might not have this software installed or be proficient at using it. I could try looking at the raw data in the Google Spreadsheet, but it lacks ‘scanability’. So I’ve come up with a halfway house. A re-useable interface to the TAGS spreadsheet which starts presenting some of the visual story, with interactivity to let you drilldown into the data. I give you:
TAGSExplorer is a result of a couple of days code bashing (so a little rough around the edges) which mainly uses the DataTable part of the Google Visualization API to read data from a TAGS spreadsheet and format it to use with d3.js graphing library. By chucking some extra JavaScript/JQuery code (partly taken from johman’s Twitter Smash example) I’ve been able to reformat the raw Twitter data from the Google Spreadsheet and reformat it returning Twitter functionality like reply/retweet by using their Web Intents API.
What is displayed:
A node for each Twitterer who used the #studentexp hashtag and is stored in the spreadsheet archive.
Solid lines between nodes are conversations eg @ernestopriego tweeted “@easegill I agree completely. Learning how to use social media tools is part of digital literacy and fluency; part of education. #studentexp” creating a connection between @ernestopriego and @easegill.
Dotted lines are not direct replies but mentions eg @theREALwikiman tweeted “If you’re an academic librarian it might be worth following @GdnHigherEd‘s #studentexp tag right now, if you have time. Interesting stuff.” For performance by default these are turned off but enabled by following the instructions below.
Node text size based on he number of @replies and @mentions
How to make your own?
If you haven’t already you need to capture some tweets into a TAGS spreadsheet
When you have some data from the spreadsheet File > Publish to the web …
Head over to TAGSExplorer and enter you spreadsheet key (or just paste the entire spreadsheet url HT to Tony Hirst for this code)
Click ‘get sheet names’ and select the sheet of the data you want to use (if you are doing a continuous collection the default is archive)
Click ‘go’
If you want to share with others, click the ‘link for this’ at the top right which gives you a permanent url – the permanent link also hides the spreadsheet selection interface. By default mention lines are off but can be enabled by adding &mentions=true to the link (see example above)
Some examples
If you don’t have your own data yet here’s some examples from data I’ve already collected:
I’ve got some ideas, I’m interested in integrating the sentiment scores from ViralHeat, but more importantly where do you think I should go next with this?
It would be great if NodeXL had a way of publishing graphs whilst maintaining some of this interactivity, a bit like the way I can embed basic Twitter networks using the Hirst-Hawksey Protovis (Friendviz) Google Gadget. After I tip from Tony I had a look at the D3.js library which has superseded Protovis. I had a go at changing the data source in this example by adding a custom column to the Edges sheet in NodeXL with ="{source: """&[@[Vertex 1]]&”"”, target: “”"&[@[Vertex 2]]&”"”, type: “”licensing”"},” but which generated something – more tweaking required
On the Vertices sheet insert a new column in ‘Other Columns’ and call it ‘index’. Insert the following formula into the first cell beneath the column heading (on row 3) =ROW([@Vertex])-3 . This should fill the column with sequential numbers. In row 1 of the index column enter =COLUMN()and take a note of the number it calculates
Insert another column in the Other columns and call it something like d3 data and insert the formula ="{""id"":"&[@index]&", ""name"": """&[@Label]&""", ""url"":"""&[@[Custom Menu Item Action]]&""", ""nodeSize"":"&ROUND([@Size],2)&"}," where:
[@index] – is the number column you just created
[@Label] – is the node label you’ve applied
[@[Custom Menu Item Action]] – is designed to be filled with urls for menu actions. If you don’t have a custom menu action you should replace "&[@[Custom Menu Item Action]]&" (including 1st set of quotes with # eg &""", ""url"":""#"", ""nod
[@Size] – is the visual properties size
Move to the Edges sheet and insert another ‘Other columns’, call it d3 data and insert the formula ="{""source"": "&VLOOKUP([@[Vertex 1]],Vertices!A:AC,29,FALSE)&", ""target"": "&VLOOKUP([@[Vertex 2]],Vertices!A:AC,29,FALSE)&"},"Important: you need to replace AC and 29 with your own ‘index’ column letters and numbers so if you index column number is 30 replace AC,29 with AD,30 Also note there are two instances of this range in the formula.
Step 2: Edit the data file
So far what we’ve done is prepare the spreadsheet to dump some data. The next part is to insert this into a data file for a template html page to render. For this I’m going to show you how to do it using the code repository GitHub and the Bl.ocks.org viewer but if you prefer you can download the project files, edit them offline before uploading somewhere of your own.
Visit this page and click on the ‘fork’ button (if you’re not signed up registration is free)
Once signed in and back on the code page click on ‘edit’ button
In the force.json window paste the values from d3 data column on the Vertices sheet overwriting the ‘paste your vertices here’ text. On the last row of the pasted data remove the last comma from the pasted data
Next paste the d3 data from your edges sheet where it says ‘paste your edges data here’ and again remove the last comma from the pasted data (here’s an example of what your file should look like)
And that should be you. Because the d3.js library is rendering the data live in your browser there’s a limit to the number of nodes/edges you can render (I reckon 150 nodes, 2000 edges is safe, more than that might be a problem. You can now go off be merry and prosper, but if you want stick around to find out on how this process could be streamlined and some d3.js tricks I picked up.
Building blocks for a macro
Your still here yeah! The process of creating the the force.json data file is a bit cumbersome but could be streamlined using a macro. I’m not that familiar with Visual Basic so won’t be doing this myself just yet but here is the pseudocode and code snippets I’ve found if I were to do it.
node size – [line 90] pulls the nodeSize attribute from the json data for a node and in this case multiples by 3 to get a radius (attr(“r”))
marker position – [lines 73] because we have a variable node size the marker end position needs to be dynamically adjusted. This is done by pulling the nodeSize again from the data and adjusting the attr(“refX”) (I think this value is based on diameter rather rather radius, but I’m very unsure about that it’s the radius + the marker height + a bit more for line width)
marker duplication/attachment to path – [lines 71 and 85] the thing I’m getting my head around is d3 is basically an interface for pure SVG so it’s not enough to just now javascript, you need to know how svg markup works. If I had nodes all of the same size I could create a SVG marker and append it to every path as a marker-end by using it’s marker id url(#markerName). As the refX varies I need to make a unique marker for each edge, then attach it using it’s id. This is what lines 71 and 85 do, create an id then attach that marker to a path. There is also a whole markup language for marker and path shapes. Here’s where I started learning about markers.
And if you are still reading this thank you for sticking with it ;). What do you think is displaying force diagrams from NodeXL with d3.js practical? The limit to the number of nodes and edges is very restrictive. Perhaps it would just be better if NodeXL just generated some static SVG markup for users to embed on websites, after all, all the major browsers now support this format.
Like any fake PhD candidate it’s important to follow the work of your supervisor, after all they will be marking your imaginary neverending thesis. So after much toil and many pointers from Tony here’s what I’ve come up with – a collision of the Guardian Platform API and visualisation with the d3.js library – GuardianTagExplorer.
In this post I’ll highlight a couple of features of the interface and then try to recall many of the lessons learned. Below is a short clip to show how it’s supposed to work or you can have a play yourself via the link above (because it uses SVG the 9% of you who use Internet Explorer 8 or less won’t see anything):
What does it do
When you enter a search term it asks the Guardian Open Platform if there are any articles associated with that term. Each of these articles has some metadata attached including a list of tags used to categorise the piece. Using a ported version of Tony’s python code these tags are collected and the number of other articles from the search result with the same tag are counted. The page then renders this information as a force layout diagram using the d3.js visualisation library (tags and links = nodes and edges) and a histogram by putting the same data into the Google Visualization API.
I didn’t show it in the video but you can create predefined searches for linking and embedding. For example, here’s one for the term ‘JISC’ and if your RSS reader hasn’t stripped out the iframe the same page is embedded below:
How it was made/What I learned
I mentioned to my unsupervisor that I was thinking of doing something with the a live version of the Guardian Open Platform with d3 based on his Friendviz example and he immediately spotted a couple of problems, the biggest being the Guardian API prefer it if you keep your api key a secret.
Yahoo Pipes as a proxy service
Fortunately Tony also had the answer of using Yahoo Pipes with a private string block as a proxy service (I’m not sure if there is much benefit to doing this as while the API key is still hidden anyone can access the pipe. The API is rate limited anyway and I hope the Guardian people see I’m keeping to the spirit of the terms and conditions.
So data source, check. Porting python to JavaScript. Relatively straight forward apart from no combinations mapping function, but having sketched out what was going on I think I’ve got an equivalent.
ddd dd dd d3.js
Big headache! even though I’ve churned out a fair bit of code I’m not or never have been a professional programmer so getting my head around d3.js has been a big challenge. There were a couple of examples I spent a lot of time picking over trying to understand what was going on. The main ones were:
[These examples are renderings of GitHub Gists using the bl.ocks.org service created by … mbostock and is a great way to publish little snippets of stuff]
I also got a peak at the generated code for Tony’s Visualising New York Times Article API Tag Graphs Using d3.js. You’ll notice that my offering is similar in appearance and functionality to yohman’s example (I’m quietly ignoring his copyright mark – fair use etc :-s).
Its hard to convey exactly what I learned from the last couple of days of pushing pixels. The big difference between d3 and the similar protovis library I used here is there is a lot more setting up to do in the code. The payoff is you have far more control of the end result. Having spent days trying to understand d3, it was contrasted by the minutes needed to create the tag histogram using the Google Visualization API.
One thing I never got working is a zoom/pan effect. I’ve seen tiny snippets of code that does this for charts. Unfortunately the API Reference for this behaviour is still to be written.
Where next
Now that I’ve got a basic framework for visualising tag/category information I interested in refining this by trying out some other examples. So if you have an API you want me to play with drop me a line ;)
Recently I’ve started exploring extended Twitter conversations using data captured from my Twitter Archiver Google Spreadsheet (think I going to rename this TAGS) and one of the problems was “The user ids in the Search API are different from those in the REST API … This defect is being tracked by Issue 214. This means that the to_user_id and from_user_id field vary from the actualy [sic] user id on Twitter.com”
Twitter have announced [H/T @psychemedia] that this problem is now fixed so it made sense to start getting the right user ids as part of the opt-in now rather than weight for November. While I was updating the code it also made sense to also include an option to get the Tweet Entities, which additional pieces of metadata like urls and hashtags detected by twitter. These don’t appear to be fully enabled yet, but I’ve already added a new column entities_str in the TMP sheet to capture these (if you don’t want these just delete the column from TMP). Because entities are objects of varying length for now I’m just recording them as a JSON string.
A couple more updates are the addition of a ‘time’ column. This just converts the tweet date/time into a spreadsheet friendly format to make doing calculations easier.
There are also some easter eggs in the code. The main ones are the code I use to grab ViralHeat sentiment analysis of tweets and some code to generate data for my Protovis gadget.
One of the great things about open development is people take your stuff and do interesting things with it that you never thought about. As you’ll see from the embedded tweet above that’s exactly what Saqib Ali did, taking my Google Spreadsheet Twitter Archive mashup and combining it with the Embed.ly oEmbed API to create this curated page of tweets, brilliant!
Inspired by this, in this post I highlight why using a spreadsheet as a data source for embedded objects might be useful and how I made the SpreadEmbed prototype which allows you to do this.
On a technical level SpreadEmbed reads a column using the Google Visualization API, passes the links to Embed.ly and generates a page from the result.
Why use a Google Spreadsheet as a data source?
Of course you could use a tool like Storify (which also is powered by Embed.ly) to do this but as I highlighted in this recent post there are limitations to far back you can go in the twitter search. I also like the concept of reusing a spreadsheet of tweets because as I’m discovering in my own analysis of event hashtags there are some very complex and interesting stories to tell which would benefit from some actual quotations. So for me I can filter a large archive of tweets and render it in a more visually appealing way (rather than as in this post directing people to a spreadsheet of data).
Putting this prototype together took about 4 hours to get it working and about the same again with tweaks. Fortunately there was already an example on the Embed.ly github, which embeds YouTube clips from a Twitter search. As my prototype iterates across cells most of the work has already been done.
The remaining headache was providing a way to access cell data from a user defined spreadsheet. The solution was using Tony Hirst’s Google Spreadsheets as a Database code (just for fun … (more because I wanted to see if I could do it) I wrote some code to generate a dropdown list of sheet names to allow the user to indicate where there links were). The source code is here, feel free to re-use/mod.
Limitations and better ways
One of the reasons I’m presenting SpreadEmbed as a proof of concept/4 hr hack and not a service is if you are making more than 10,000 calls a month the the embed.ly API you gotta pay. Also there are probably better ways of doing it. Saqib hasn’t said how exactly he did his version, but I’m guessing you can just copy and paste a list of urls into a post and if you’ve got an embed.ly type plugin it’ll do the rest of your work for you.
For a while I’ve been curating ‘The GAS’ which is a collection of Google Apps Script examples and news pulled together using scoop.it. I started it mainly because I’m a ‘patterns programmer’, that is I have limited formal coding skills so learn and produce things from other people’s examples. Because others getting started with Apps Script may also benefit learning this way I’m advertising the link here.
I also want to keep adding to this resource and while scoop.it has some nice article discovery searches I’m aware that I’m missing a lot more. So if you have found or made an Apps Script example you would like to share in The GAS there are a couple of ways you can do this.
If you are already a scoop.it user you can suggest resources for other scoops you follow
I was recently rediscovering an old Hirst post on Google Spreadsheets as a Database in which he demos an “interface for constructing URIs to query Google spreadsheets using the Google query language” . I found myself at this post after researching a ‘cunning idea’ I’ve in development and also because Google Search ‘probably’ knows best.
Anyway this post is unrelated to the ‘cunning idea’ but I got thinking one of the aspects of databases is getting data in. I tackled this when I was at Dev8D back in February with the post Collecting any data in a Google Spreadsheet using GET or POST. At the time I was planning a follow-up post to show a practical application but never got around to it.
So here is a more refined piece of code which actually records data submitted from your own form or custom code:
There are a couple of ways you can submit data. You can use a basic HTML form (this example shows how you can avoid redirecting to the Google macro/service page using a hidden iframe (I tried using ajax to submit the data without refresh but you run into cross domain/XHTTP security issues. I imagine you could get around this by having a intermediary PHP page that passes the data as POST or GET. If you are going down either of these solutions remember to allow anonymous access when you publish as a service in Google Spreadsheets)). Here’s the spreadsheet where the data from the basic form example is being collected.
[This text originally appeared in Live Twitter data from FOTE #fote11 post but I’m extracting here to provide a separate space for comment (and hit RSS aggregators with something I think is quite interesting]
The graph shows that over 80% (n.296) of #fote11 hash taggers (n.355) posted tweets with an overall positive sentiment detection.
Somethings worth noting with this data. Sentiment analysis is being analysed using machine detection (ie it might be wrong). Someone with overall negative sentiment doesn’t necessarily indicate that they had a bad event experience. If the person was reflecting on issues being presented or quoting others who had a negative experience this will be reflected in their sentiment score. The bottom line is the graph gives an overview of a more complex story. If you want to start unpicking that story yourself the GraphML data is available on the NodeXL GraphGallery.
Not at FOTE11 today (you can follow the live stream) but was interested in what data I could extract from the #fote11 twitter stream and process as the event unfolds. Most of the data here comes from a mashup of my Google Spreadsheet mashup. (This is perhaps putting Google Spreadsheets to the limit in terms of fetching data but lets see how it goes)
Update: Some of these graphs may take a couple of seconds to appear [H/T @johnmclear]
Update 2: The interactive #fote community was freezing Internet Explorer so click through to see interactive version [H/T James Swansburg]
Update 3: Added a second snapshot of the Twitter conversations to cover whole day
Update 4: Coded the last snapshot with sentiment data from ViralHeat
The graph shows that over 80% (n.296) of #fote11 hash taggers (n.355) posted tweets with an overall positive sentiment detection.
Somethings worth noting with this data. Sentiment analysis is being analysed using machine detection (ie it might be wrong). Someone with overall negative sentiment doesn’t necessarily indicate that they had a bad event experience. If the person was reflecting on issues being presented or quoting others who had a negative experience this will be reflected in their sentiment score. The bottom line is the graph gives an overview of a more complex story. If you want to start unpicking that story yourself the GraphML data is available on the NodeXL GraphGallery.
This blog uses Google Analytics (which makes use of 'cookie' technologies) to provide information on usage. Here's an overview of Google Analytics Privacy and how to opt-out (other 3rd party services like Twitter might also be tracking you via this site, but as far as possible I try and prevent this by removing official tweet buttons).