Data

4 Comments

Screenshot from TAGSExplorer

Given the number of researchers who ask me about access to historic Twitter data, and who end up disappointed to hear that free access to search results is limited to the last 7 days, I’m sure they will be pleased to hear about the Twitter Data Grants:

we’re introducing a pilot project we’re calling Twitter Data Grants, through which we’ll give a handful of research institutions access to our public and historical data.

This was an area I’d hoped the Library of Congress would have solved long ago, given they were gifted the data in April 2010. Unfortunately, despite the announcement in Jan 2013 that access was weeks away, nothing has appeared.

It’s worth stressing that Twitter’s initial pilot will be limited to a small number of proposals, but those who do get access will have the opportunity to “collaborate with Twitter engineers and researchers”. This isn’t the first time Twitter have opened data to researchers, having made data available for a Jisc funded project to analyse the London Riot, and while I expect Twitter will end up with a handful of elite researchers/institutions, hopefully the pilot will be extended.

Proposals for this pilot need to be in by 15 March. A link is included in the Introducing Twitter Data Grants page.

Posted in Data, Research, Twitter.

4 Comments

Update: Nicola Osborne (EDINA) has kindly live-blogged the session so extensive notes are here

I’ve been invited by the University of Edinburgh Data Library team to talk about data visualisation later today. The abstract I submitted and slides are below. Putting a slidedeck like this together is always useful as you mentally sort through the pieces of knowledge you’ve obtained, which in my case is only from the last year or so. It’s also a little unnerving to think how much more is still out there (known unknowns and unknown unknowns). The slides contain links to sources (when you get to the data/vis matrix some of the thumbnails are live links); here’s also the bundle of top-level links.

There are a number of examples throughout history where visualisations have been used to explore or explain problems. Notable examples include Florence Nightingale's 'Mortality of the British Army' and John Snow's Cholera Map of London. Recently the increased availability of data, and of software for analysing and generating various views on this data, has made it easier to generate data visualisations. In this presentation Martin Hawksey, advisor at the Jisc Centre for Educational Technology and Interoperability Standards (Cetis), will demonstrate simple techniques for generating data visualisations: using tools (including MS Excel and Google Spreadsheets), drawing packages (including Illustrator and Inkscape) and software libraries (including d3.js and timeline.js). As part of this, participants will be introduced to basic visual theories and the concepts of exploratory and explanatory analytics. The presentation will also highlight some of the skills required for discovering and reshaping data sources.

4 Comments

The 'Deprecated notice' is because the access to the Topsy API (which is used to extract Twitter activity) is now behind a developer key.

A post from Alan Levine (@cogdog) on Syndication Machines (plus syndication of syndication), which details how feeds for the cMOOC course ds106 are aggregated and resyndicated, got me thinking about whether there was additional data around student blog posts that could be usefully captured. Previously for something like that I would have turned to PostRank Analytics, which allowed you to specify any RSS feed and would aggregate social activity like tweets and bookmarks from a wide range of services and let you see it all in one place. Unfortunately PostRank were bought by Google, and while some of this data is now accessible from Google Analytics it’s restricted to your account and to data from social networks Google isn’t directly competing with.

So I thought it would be useful/interesting to start looking at what social activity you could pull together without authenticated access or API keys. So far I’ve identified a couple of sources and, given some overlap with other work (and some late night play time), I’ve come up with a Google Spreadsheet template which pulls data together from comment feeds, Twitter and Delicious (with social counts for these plus Facebook, LinkedIn and Google+). You can give the spreadsheet a try with the link below:

*** Blog Activity Data Feed Template ***

Blog Activity Data Feed Template Overview

Features

  • Collects the last 10 posts from an RSS feed
  • Uses sharedcount.com to get overall post share counts from Facebook, Twitter, Google+, LinkedIn and Delicious
  • For supported blogs (mainly vanilla WordPress) extracts comment snippets
  • Uses Topsy to collect tweets mentioning the post
  • Fetches all the Delicious bookmarks for the post url
  • Summarises activity from comments, tweets and bookmarks on the dashboard

You can make as many copies of the template as you like to track other RSS feeds.

If you prefer a slightly different overview then the RSS Feed Social Share Counting Google Spreadsheet gives a table of share counts for a feed

*** RSS Feed Social Share Counter ***

Social share count matrix

Technical highlights

There are a couple of cool things under-the-hood for the Activity Data template worth noting.

Managed Library = Easy to add more sources

So far I’ve limited the template to sources that have easy access. Previously, if an API disappeared or changed, or I discovered another service, I would have to modify the template and users would have to make a fresh copy. With Managed Libraries the Google Spreadsheet template now only needs 11 lines of code (shown below) for 3 custom formulas (see cells Dashboard!C8, #n!G2 and #n!B8).

function getSharedCountActivity(url){
  return socialActivity.getSharedCount(url);
}

function getItemFeed(url, num){
  return socialActivity.getItemRSSFeed(url, num);
}

function getActivityFeed(url){
  return socialActivity.getActivityFeed(url);
}

Because the template uses my own socialActivity library, and because I’ve left the template in Development Mode, when I update the code the changes should (I hope) filter through to all the existing templates in use. This means that if I add more services with open APIs, their details should start appearing on the post activity pages (denoted with # at the beginning).
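To illustrate why the thin wrappers help, here is a minimal sketch in plain JavaScript rather than Apps Script. The `socialActivity` object below is a hypothetical stand-in for the shared library (its stubbed return values are invented for illustration, not what the real library returns); the point is only that the wrapper kept in the template never changes when the library behind it does.

```javascript
// Hypothetical stand-in for the shared library. In the real template the
// library is attached to the spreadsheet rather than defined inline.
const socialActivity = {
  version: 1,
  getSharedCount(url) {
    // Stub: a real implementation would call a share-count service.
    return [["URL", "Library version"], [url, this.version]];
  },
};

// Thin wrapper of the kind kept in the template: it only delegates.
function getSharedCountActivity(url) {
  return socialActivity.getSharedCount(url);
}

const before = getSharedCountActivity("http://example.com/post");

// "Updating the library": swap in a new implementation.
// The wrapper above is left untouched.
socialActivity.version = 2;
socialActivity.getSharedCount = function (url) {
  return [["URL", "Library version", "Source"], [url, this.version, "stub"]];
};

const after = getSharedCountActivity("http://example.com/post");

console.log(before[1]);
console.log(after[1]);
```

The same call from the spreadsheet picks up the extra column without anyone making a fresh copy of the template, which is the behaviour the Development Mode setting is hoped to give.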

Fun with sparklines

Activity Sparklines

If you were wondering how I did the activity sparklines, here’s how. Sparklines are a built-in Google Spreadsheet formula (here’s the reference). To get them to work you need some data and, if you want, some options. For the data all you need is an array of values and sparkline will plot them equally spaced.

In this use case I’ve got comments, tweets, and bookmarks with date/times. To plot the sparkline I need to know how many of these were made between specific time periods. Another function we can use to get this data is FREQUENCY. This lets us specify some intervals and source data and returns an array of the frequency of data points.

One issue is that for each post the activity range varies, so one post might only have activity in July, and another in August. There are other formulas (MIN/MAX) to get the range values. Here’s how the final solution was pieced together.

On each post sheet (prefixed with ‘#’), cells A1:A20 contain formulas to calculate date intervals based on the minimum and maximum dates of the activity. The frequency data used for each sparkline is then calculated inside the sparkline formula by indirectly looking at the data in each post sheet. The indirect part is handled by the INDIRECT formula, which evaluates a cell reference built from other cell values. So on row 9 of Dashboard INDIRECT(B9&"!D9:D") evaluates to ‘#1!D9:D’, row 10 is ‘#2!D9:D’ and so on. I could have hardcoded these cell references but I prefer the ‘clever’ way. Here’s part of the final formula with a breakdown of what it does:

SPARKLINE(FREQUENCY(INDIRECT(B9&"!D9:D"),INDIRECT(B9&"!A1:A20")),{"charttype","line";"ymin",-1;"ymax",MAX(FREQUENCY(INDIRECT(B9&"!D9:D"),INDIRECT(B9&"!A1:A20")))+1} )

  • INDIRECT(B9&"!D9:D") – Frequency data source built using INDIRECT, which evaluates a cell reference from other cell values. On row 9 of Dashboard this evaluates to ‘#1!D9:D’, row 10 is ‘#2!D9:D’
  • INDIRECT(B9&"!A1:A20") – similar trick to get an interval array for the FREQUENCY formula
  • FREQUENCY(INDIRECT(B9&"!D9:D"),INDIRECT(B9&"!A1:A20")) – gives our sparkline data array
  • MAX(FREQUENCY(… – get the maximum value from the frequency data just so we can add 1 to it to give a margin on the sparkline
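For anyone who finds the nested formula hard to read, the same steps can be sketched in plain JavaScript. This is an illustration, not the spreadsheet's implementation: `makeIntervals` and `frequency` are names I've made up, the break points are assumed to be evenly spaced between the minimum and maximum (the post only says they are derived from them), and small integers stand in for date/times.

```javascript
// Build n evenly spaced break points from min to max inclusive (n >= 2).
// Assumption: the interval cells are evenly spaced.
function makeIntervals(min, max, n) {
  const step = (max - min) / (n - 1);
  return Array.from({ length: n }, (_, i) => min + i * step);
}

// Mimics the spreadsheet FREQUENCY function: bins must be ascending and the
// result has bins.length + 1 counts. A value is counted in the first bin
// whose upper bound it does not exceed; the last slot counts the overflow.
function frequency(data, bins) {
  const counts = new Array(bins.length + 1).fill(0);
  for (const value of data) {
    let i = 0;
    while (i < bins.length && value > bins[i]) i++;
    counts[i]++;
  }
  return counts;
}

// Stand-in activity timestamps (comments, tweets, bookmarks).
const activity = [1, 2, 2, 5, 9, 10];

const bins = makeIntervals(Math.min(...activity), Math.max(...activity), 4);
const counts = frequency(activity, bins);

// Same margin trick as the formula: ymax is one above the tallest bin.
const yMax = Math.max(...counts) + 1;

console.log(bins);   // [1, 4, 7, 10]
console.log(counts); // [1, 2, 1, 2, 0]
console.log(yMax);   // 3
```

The `counts` array is what the sparkline plots, one equally spaced point per bin.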

Summary

Hopefully you’ll enjoy it, and if you can think of any improvements or other services to tap into just get in touch.

Jorum has a Dashboard Beta (for exposing usage and other stats about OER in Jorum) up for the community to have a play with: we would like to get your feedback!

For more information see the blog post here: http://www.jorum.ac.uk/blog/post/38/collecting-statistics-just-got-a-whole-lot-sweeter

Pertinent info: the Dashboard has live Jorum stats behind it, but the stats have some irregularities, so the stats themselves come with a health warning. We’re moving from quite an old version of DSpace to the most recent version over the summer, at which point we will have more reliable stats.

We also have a special project going over the summer to enhance our statistics and other paradata provision, so we’d love to get as much community feedback as possible to feed into that work. We’ll be doing a specific blog post about that as soon as we have contractors finalised!

Feedback by any of the mechanisms suggested in the blog post, or via discussion here on the list, all welcome.

The above message came from Sarah Currier on the [email protected] list. This was my response:

It always warms my heart to see a little more data being made openly available :)

I imagine (and I might be wrong) that the main users of this data might be repository managers wanting to analyse how their institutional resources are doing. So to be able to filter uploads/downloads/views for their resources and compare with overall figures would be useful.

Another (perhaps equally important) use case would be individuals wanting to know how their resources are doing, so a personal dashboard of resources uploaded, downloads, views would also be useful. This is an area Lincoln's Bebop project were interested in so it might be an idea to work with them to find out what data would be useful to them and in what format (although, saying that, I think I only found one #ukoer record for Lincoln {hmm I wonder if anyone else would find it useful if you pushed data to Google Spreadsheets a la Guardian datastore (here's some I captured as part of the OER Visualisation Project)}).

I'm interested to hear what the list think about these two points

You might also want to consider how the data is licensed on the developer page. Back to my favourite example, Gent use the Open Data Commons licence  http://opendatacommons.org/licenses/odbl/summary/

So what do you think of the beta dashboard? Do you think the two use cases I outline are valid or is there a more pertinent one? (If you want to leave a comment here I’ll make sure they are passed on to the Jorum team, or you can use other means).

[I’d also like to add a personal note that I’ve been impressed with the recent developments from Jorum/Mimas. There was a rocky period when I was at the JISC RSC when Jorum didn’t look aligned to what was going on in the wider world, but since then they’ve managed to turn it around and developments like this demonstrate a commitment to a better service]

Update: Bruce Mcpherson has been working some Excel/Google Spreadsheet magic and has links to examples in this comment thread

Posted in API, Data, Jorum, OER.

3 Comments

For the last couple of days I’ve been at IWMW12, hosted this year at the University of Edinburgh. I’ve already posted Data Visualisation Plenary/Workshop Resources, which has my slides from the plenary. I was teaming up with Tony Hirst (OU) and have added his slides to the page.

Because of living 'almost locally' and other family commitments I missed out on most of the social events; instead I got drunk on data, working into the early hours to find what stories I could uncover from the #IWMW12 stream. In this post I’ll show you what I’ve come up with and some of the highlights in trying to turn raw data into something interesting/meaningful (or pointless if you prefer). Interestingly, a lot of what I cover here uses the same techniques used in my recent The story data tells us about #CitizenRelay guest post, so I’ve got a templated workflow emerging which I can deploy at events, which makes me wonder if I should be getting organisers to pay my travel/accommodation as an event data amplifier?

UK University Twitter Account Community

On day one Brian Kelly mentioned some work by Craig Russell to collate a table of UK University Social Media accounts, which featured in a guest post on Brian’s blog titled Further Evidence of Use of Social Networks in the UK Higher Education Sector. You can get the data Craig has compiled from a Google Spreadsheet. Looking at this, two things immediately sprang to mind. First, that the document could be made more ‘glanceable’ just using some simple conditional formatting, and second, that there was a nice list of Twitter accounts to do something with.


Here’s a link to my modified version of Craig’s spreadsheet. It uses the importRange formula to pull the data in so it creates a live link to the source document. For the conditional formatting I looked for text containing ‘http’, turning the cell text and background green. The HTML view of this is a lot cleaner looking.

On the Twitter Accounts sheet I extract the account screen names by pulling everything after the last ‘/’ and remove most of the blank rows using a unique formula.
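The screen name extraction is simple enough to sketch outside the spreadsheet. The JavaScript below is an illustration of the same idea rather than the formula used in the sheet: the function names and sample URLs are made up, and trimming trailing slashes and de-duplicating case-insensitively are extra choices of mine.

```javascript
// Everything after the last "/" in a profile URL is taken as the screen name.
function screenName(url) {
  const trimmed = url.trim().replace(/\/+$/, ""); // drop trailing slashes
  return trimmed.slice(trimmed.lastIndexOf("/") + 1);
}

// Skip blank cells and keep the first occurrence of each account.
function uniqueScreenNames(urls) {
  const seen = new Set();
  const names = [];
  for (const url of urls) {
    if (!url || !url.trim()) continue; // blank row
    const name = screenName(url);
    const key = name.toLowerCase(); // treat names as case-insensitive
    if (name && !seen.has(key)) {
      seen.add(key);
      names.push(name);
    }
  }
  return names;
}

// Made-up sample column of account links, including a blank and a duplicate.
const sampleAccounts = [
  "https://twitter.com/ExampleUni",
  "",
  "http://twitter.com/#!/exampleuni",
  "https://twitter.com/AnotherUni/",
];

const screenNames = uniqueScreenNames(sampleAccounts);
console.log(screenNames); // ["ExampleUni", "AnotherUni"]
```

A list like `screenNames` is the kind of input that can then be pasted into NodeXL.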

Putting this list into the free MS Excel add-in NodeXL and using the Import > From Twitter List Network option lets you get data on which of these accounts follow each other. I played around with visualising the network in NodeXL but found it easier in the end to put the data into Gephi, getting the image below. These ‘hairballs’ have limited value and you’re best having a play with the interactive version, which is an export of Gephi visualised using the gexf-js tool by Raphaël Velt (de-hairballing is something Clement Levallois (@seinecle) is working on, and he kindly sent me a link to a new tool he’s creating called Gaze).

UK HEI Twitter Accounts

The #IWMW12 Twitter Archive Two More Ways

Timeline

As part of #iwmw12 I was collecting an archive of tweets, which already gives you the TAGSExplorer view. I also use the Topsy API and Google Spreadsheet to extract tweets which are then passed into Timeline by Vérité, which gives you a nice sense of the event. [If anyone else would like to make their own twitter media timeline there is a template in this post (it is as easy as making a copy of the template, entering your search terms and publishing the sheet).]

Searchable archive

New way number one is a filterable/searchable archive of IWMW12 tweets. Using the Google Visualisation API I can create a custom interface to the Google Spreadsheet of tweets. This solution uses some out-of-the-box functionality including table paging, string filtering and pattern formatting. Using the pattern formatter was the biggest achievement as it allows you to insert Twitter Web Intents functionality (so if you click to reply to a tweet it pulls up Twitter’s reply box).
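To give a flavour of the pattern formatting trick, here is a small plain JavaScript sketch. It does not use the Google Visualisation API: `applyPattern` is a made-up stand-in that fills `{0}`, `{1}` placeholders in the way a pattern formatter does, the tweet is invented, and the reply link follows the Web Intents URL scheme as I understand it (`intent/tweet` with an `in_reply_to` parameter).

```javascript
// Stand-in for a pattern formatter: replaces {0}, {1}, ... with values.
function applyPattern(pattern, values) {
  return pattern.replace(/\{(\d+)\}/g, (match, index) => {
    const i = Number(index);
    return i < values.length ? String(values[i]) : match;
  });
}

// Escape tweet text before it is placed inside HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// {0} = tweet id, {1} = tweet text.
const REPLY_PATTERN =
  '{1} <a href="https://twitter.com/intent/tweet?in_reply_to={0}">reply</a>';

// Invented example row from the archive.
const tweet = { id: "123456789", text: "Great session at #iwmw12 <3" };

const cellHtml = applyPattern(REPLY_PATTERN, [tweet.id, escapeHtml(tweet.text)]);
console.log(cellHtml);
```

Each table cell then carries its own reply link, so the archive page needs no extra scripting per tweet.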

I also processed the archive using R to get a term frequency to make a d3 based wordcloud (I’ve started looking at how this can be put into a more general tool. Here’s my current draft which you should be able to point any TAGS spreadsheet at (this version also includes a Chart Range Filter letting you view a time range). I definitely need to write more about how this was done!)
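I used R for the term frequency step, so the following is not the code behind the wordcloud. It is a rough JavaScript sketch of the same idea: count how often each term appears across the tweets and output `{text, size}` pairs, a shape commonly fed to d3 word cloud layouts. The stop word list and sample tweets are made up for illustration.

```javascript
// Tiny illustrative stop word list; a real one would be much longer.
const STOP_WORDS = new Set(["the", "a", "an", "and", "of", "to", "in", "is", "at", "rt"]);

function termFrequency(tweets) {
  const counts = new Map();
  for (const text of tweets) {
    const words = text
      .toLowerCase()
      .replace(/https?:\/\/\S+/g, " ") // drop links
      .split(/[^a-z0-9#@_]+/) // keep hashtags and @names intact
      .filter((w) => w.length > 1 && !STOP_WORDS.has(w));
    for (const w of words) {
      counts.set(w, (counts.get(w) || 0) + 1);
    }
  }
  // Most frequent first; ties broken by plain string comparison.
  return [...counts.entries()]
    .map(([text, size]) => ({ text, size }))
    .sort((a, b) => b.size - a.size || (a.text < b.text ? -1 : 1));
}

// Invented sample tweets.
const terms = termFrequency([
  "RT @example: Data visualisation at #iwmw12",
  "Loving the data talk at #iwmw12 http://example.com/slides",
]);

console.log(terms.slice(0, 2));
```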

Filter by time

Mappable archive

One of the last things I did was to filter the twitter archive for tweets with geo-location. Using the Spreadsheet Mapper 3.0 template I was able to dynamically pull the data to generate a time stamped KML file. The timestamps are ignored when you view in Google Maps, but if you download the kml file it can be played in Google Earth (you’ll have to adjust the playback control to separate the playback heads – I tried doing this in the code but the documentation is awful!)
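The Spreadsheet Mapper template does the KML generation for you, but the underlying idea can be sketched in a few lines of JavaScript. This is my own illustration, not the template's output: the tweet objects and field names are invented, and only the basic KML elements (Placemark, TimeStamp/when, Point/coordinates) are used. Note that KML coordinates are written longitude first.

```javascript
// Escape text placed inside XML elements.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// One time-stamped placemark per geo-located tweet.
function placemark(tweet) {
  return [
    "  <Placemark>",
    `    <name>${escapeXml(tweet.user)}</name>`,
    `    <description>${escapeXml(tweet.text)}</description>`,
    `    <TimeStamp><when>${tweet.time}</when></TimeStamp>`,
    `    <Point><coordinates>${tweet.lon},${tweet.lat}</coordinates></Point>`,
    "  </Placemark>",
  ].join("\n");
}

// Keep only tweets with numeric coordinates, then wrap them in a document.
function toKml(tweets) {
  const located = tweets.filter(
    (t) => Number.isFinite(t.lat) && Number.isFinite(t.lon)
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<kml xmlns="http://www.opengis.net/kml/2.2">',
    "<Document>",
    ...located.map(placemark),
    "</Document>",
    "</kml>",
  ].join("\n");
}

// Invented sample rows: one with a location, one without.
const kml = toKml([
  { user: "example", text: "Arrived at #iwmw12 & ready", time: "2012-06-19T10:30:00Z", lat: 55.9533, lon: -3.1883 },
  { user: "other", text: "No location on this one", time: "2012-06-19T11:00:00Z" },
]);

console.log(kml);
```

It is the `when` values that a time slider can use for playback.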

Google Earth playback

Or if you have the Google Earth browser plugin a web version of IWMW12 geo-tweets is here (also embedded below):

So there you go … or as said by Sharon Steeples

Originally posted on CitizenRelay

Telling stories with data is a growing area for journalism and there is already a strong community around Data Driven Journalism (DDJ). I’m not a journalist; by day I’m a Learning Technology Advisor for JISC CETIS, but my role does allow me to explore how data can be used within education. Often this interest spills into the evenings where I ‘play’ with data and ways to visualise the hidden stories. Here are a couple of ways I’ve been playing with data from the CitizenRelay:

A time

One of the first things I did was produce a Timeline of the CitizenRelay videos and images shared on Twitter. This uses the Topsy web service to find appropriate tweets which are stored in this Google Spreadsheet template which are then displayed in the Timeline by Vérité tool (an open source tool for displaying media in a timeline). The result is a nice way to navigate material shared as part of CitizenRelay and an indication of the amount of media shared by people.

 Timeline of the CitizenRelay videos and images shared on Twitter

A time and place

As part of the CitizenRelay, Audioboo was used to record and share interviews. For a data wrangler like myself Audioboo is a nice service to work with because they provide a way to extract data from their service in a readable format. One of the undocumented options is to extract all the clips with a certain tag in a format which includes data about where the clip was recorded. Furthermore this format is readable by other services, so with a couple of clicks we can get a Google Map of CitizenRelay Boos which you can click on to find the audio clips.

 Google Map of CitizenRelay Boos

One experiment I tried which didn’t entirely work out the way I wanted was to add date/time to the Audioboo data and also embed the audio player. This datafile (generated using this modified Google Spreadsheet template) can be played in Google Earth, allowing you to see where Boos were created, when they were created (with a timeslider animation) and to play back the clips directly. This experiment was only partially successful because I would prefer the embedded player to work without having to download Google Earth.

 Google Earth of CitizenRelay Boos

A look at who #CitizenRelay reporters were

So far we have mainly focused on the content, but let’s now look at the many eyes and ears of the CitizenRelay who helped share and create stories on Twitter.

CitizenRelay Many eyes

The image shows the profile pictures of over 600 people who used the #CitizenRelay tag on Twitter so far this year. This image was generated using a free add-in for Microsoft Excel called NodeXL, read more about getting started with NodeXL. What that image doesn’t show you is how interconnected this community is. Using another free tool called Gephi and with the right data we can plot the relationships in this twitter community, who is friends with who (read more about getting started with Gephi). In the image below pictures of people are replaced with circles and friendships are depicted by drawing a line between circles.

CitizenRelay Community

There are almost 7,000 relationships shown in the image so it can be a bit overwhelming to digest. Using Gephi it is possible to interactively explore individual relationships. For example, the image below shows the people I’m friends with who used the #CitizenRelay tag.

CitizenRelay Sub-community

A look at what #CitizenRelay reporters said

Using the same technique for plotting relationships it’s also possible to do something similar with what people said using the #CitizenRelay tag. By plotting tweets that mention or reply to other people we get:

citizenrelay-conversation

This image is evidence that #CitizenRelay wasn’t just a broadcast, but a community of people sharing their stories. Visualising Twitter conversations is one of my interests and I’ve developed this interactive tool which lets you explore the #CitizenRelay tweets.
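For the curious, the conversation graph boils down to an edge list: one edge from the person tweeting to each person they mention or reply to. The JavaScript below is a rough sketch of that step with invented tweets and a made-up function name. It is not the code behind the image, and it uses a simple `@name` pattern that a production version would want to tighten.

```javascript
// Build weighted edges: tweet author -> each @mentioned account.
function mentionEdges(tweets) {
  const weights = new Map();
  for (const { from, text } of tweets) {
    const source = from.toLowerCase();
    const mentions = text.match(/@(\w+)/g) || [];
    for (const mention of mentions) {
      const target = mention.slice(1).toLowerCase();
      if (target === source) continue; // ignore self-mentions
      const key = `${source}->${target}`;
      weights.set(key, (weights.get(key) || 0) + 1);
    }
  }
  return [...weights].map(([key, weight]) => {
    const [source, target] = key.split("->");
    return { source, target, weight };
  });
}

// Invented sample tweets.
const edges = mentionEdges([
  { from: "alice", text: "@bob great interview" },
  { from: "alice", text: "thanks @Bob and @carol" },
  { from: "bob", text: "no mentions here" },
]);

console.log(edges);
```

An edge list with source, target and weight columns is the kind of table that tools like NodeXL and Gephi can import.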

CitizenRelay Interactive Archive

So there you go: some examples of what you can do with free tools and a bit of data. I’m sure there are many more stories to be found in CitizenRelay.

Posted in Data, Twitter, Visualisation.

3 Comments

On Tuesday 19th June I’ll be presenting at the Institutional Web Manager Workshop (IWMW) in Edinburgh … twice! Tony Hirst and I are continuing our tour, which started at the JISC CETIS Conference 2012, before hitting the stage at GEUG12. For IWMW12 we are doing a plenary and workshop around data visualisation (the plenary being a taster for our masterclass workshop). I’ll be using this post as a holder for all the session resources.

Update: I've also added Tony Hirst's (OU) slides. Tony went on first to introduce some broad data visualisation themes before I went into a specific case study.

The draft slides for my part of the plenary are embedded below and available from Slideshare and Google Presentation (the slides are designed for use with pptPlex, but hopefully they still make sense). For the session I’m going to use the OER Visualisation Project to illustrate the processes required to get a useful dataset and how the same data can be visualised in a number of ways depending on audience and purpose. Update: I should have said the session should be streamed live, details will appear on IWMW site.

Update: As a small aside I've come up with a modified version of Craig Russell's UK Universities Social Media table as mentioned in Further Evidence of Use of Social Networks in the UK Higher Education Sector guest post on UKWebFocus (something more 'glanceable'). Using the Twitter account list as a starting point I've looked at how University accounts follow each other and come up with this (click on the image for an interactive version).

If you have any questions feel free to leave a comment or get in touch.

3 Comments

A couple of weeks ago it was Big Data Week, “a series of interconnected activities and conversations around the world across not only technology but also the commercial use case for Big Data”.

Big data consists of data sets that grow so large and complex that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing – BY Wikipedia

In O’Reilly Radar there was a piece on Big data in Europe which had Q&A from Big Data Week founder/organizer Stewart Townsend, and Carlos Somohano both of whom are big in Big Data.

Maybe I’m being naïve but I was surprised that there was no reference to what the universities/research sector is doing with handling and analysing large data sets. For example at the Sanger Institute alone each of their DNA sequencers is generating 1 terabyte (1024 gigabytes) of data a day, and the institute is storing over 17 petabytes (17 million gigabytes), a figure which is doubling every year.
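To put those numbers in perspective, here is a quick back-of-the-envelope sketch in JavaScript. It assumes binary units throughout (1 TB = 1024 GB, as quoted above, and 1 PB = 1024 TB), which makes 17 petabytes slightly more than the rounded "17 million gigabytes", and it takes "doubling every year" literally. The function names are mine.

```javascript
const GB_PER_TB = 1024;
const TB_PER_PB = 1024; // assumption: binary units throughout

function petabytesToGigabytes(pb) {
  return pb * TB_PER_PB * GB_PER_TB;
}

// Storage after a number of years if it doubles every year.
function projectedPetabytes(startPb, years) {
  return startPb * 2 ** years;
}

// How many "sequencer days" of output a given store represents.
function sequencerDays(pb, tbPerSequencerPerDay) {
  return (pb * TB_PER_PB) / tbPerSequencerPerDay;
}

console.log(petabytesToGigabytes(17)); // 17825792
console.log(projectedPetabytes(17, 3)); // 136
console.log(sequencerDays(17, 1)); // 17408
```

So three more years of doubling would take 17 petabytes to 136, and the current store is equivalent to about 17,400 days of output from a single sequencer at 1 terabyte a day.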

Those figures trip off my tongue because last week I was at the Eduserv Symposium 2012: Big Data, Big Deal? which had many examples of how institutions are dealing with ‘big data’. There were a couple of things I took away from this event, like the prevalence of open source software as well as the number of vendors wrapping open source tools with their own systems to sell as a service. Another clear message was a lack of data scientists who can turn raw data into information and knowledge.

As part of the Analytics Reconnoitre we are undertaking at JISC CETIS, in this post I want to summarise some of the open source tools and ‘as a service’ offerings in the Big Data scene.

[Disclaimer: I should say first that I’m coming to this area cold. I’m not an information systems expert so what you’ll see here is a very top-level view, more often than not me joining the dots from things I’ve learned 5 minutes ago. So if you spot anything I’ve got wrong or bits I’m missing let me know]

Open source as a Service

some of the aaS’s
CaaS – Cluster as a Service
IaaS – Infrastructure as a Service
SaaS – Software as a Service
PaaS – Platform as a Service

I’ve already highlighted how the open source R statistical computing environment is being used as an analytics layer. Open source is alive and well in other parts of the infrastructure. First up at the symposium was Rob Anderson from Isilon Systems (division of EMC) talking about Big Data and implications for storage. Rob did a great job introducing Big Data, and a couple of things I took away were the message that there is a real demand for talented ‘data scientists’ and the need to get organisations to think differently about data.

If you look at some of the products/services EMC offer you’ll find EMC Greenplum Database and HD Community Editions (Greenplum are a set of products to handle ‘Big Data’). You’ll see that these include the open source Apache Hadoop ecosystem. If, like me, you’ve heard of Hadoop but don’t really understand what it is, here is a useful post on Open source solutions for processing big data and getting Knowledge. This highlights components of the Hadoop ecosystem, most of which appear in the Greenplum Community Edition (I was very surprised to see that the NoSQL database Cassandra, now an Apache project often grouped with the Hadoop ecosystem, was originally developed by Facebook and released as open source code – more about NoSQL later).

Open algorithms, machines and people

amplab - state of the art

The use of open source in big data was also highlighted by Anthony D Joseph, Professor at the University of California, Berkeley, in his talk. Anthony was highlighting UC Berkeley’s AMPLab, which is exploring “Making Sense at Scale” by tightly integrating algorithms, machines and people (AMP). The slide (right) from Anthony’s presentation summarises what they are doing, combining 3 strands to solve big data problems.

They are achieving this by combining existing tools with new components. In the slide below you have the following pieces developed by AMPLab:

  • Apache Mesos – an open source cluster manager
  • Spark – an open source interactive data analysis system
  • SCADS – consistency adjustable data store (license unknown)
  • PIQL – Performance (predictive) Insightful Query Language (part of SCADS; there’s also a PIQL-on-RAILS plugin under the MIT license)

amplab - machines

In the Applications/tools box is: Advanced ML algorithms; Interactive data mining; Collaborative visualisation. I’m not entirely sure what these are but in Anthony’s presentation he mentioned more open source tools are required particularly in ‘new analysis environments’.

Here are the real applications of AMPLab Anthony mentioned:

[Another site mentioned by Anthony worth bookmarking/visiting is DataKind – ‘helping non-profits through pro bono data collections, analysis and visualisation’]

OpenStack

Another cloud/big data/open source tool I know of but not mentioned at the event is OpenStack. This was initially developed by commercial hosting service Rackspace and NASA (who it has been said are ‘the largest collector of data in human history’). Like Hadoop OpenStack is a collection of tools/projects rather than one product. OpenStack contains OpenStack Compute, OpenStack Object Storage and OpenStack Image Service.

NoSQL

In computing, NoSQL is a class of database management system identified by its non-adherence to the widely-used relational database management system (RDBMS) model … It does not use SQL as its query language … NoSQL database systems are developed to manage large volumes of data that do not necessarily follow a fixed schema – BY Wikipedia

NoSQL came up in Simon Metson’s (University of Bristol), Big science, Big Data session. This class of database is common in big data applications but Simon underlined that it’s not always the right tool for the job:

This view is echoed by Nick Jackson (University of Lincoln) who did an ‘awesome’ introduction to MongoDB (one of the many open source NoSQL solutions) as part of the Managing Research Data Hack Data organised by DevCSI/JISC MRD. I strongly recommend you look at the resources that came out of this event, including other presentations from University of Bristol on data.bris.

[BTW the MongoDB site has a very useful page highlighting how it differs from another open source NoSQL solution, CouchDB. So even NoSQL solutions come in many flavours. Also Simon Hodson, Programme Manager, JISC MRD, gave a lightning talk on JISC and Big Data at the Eduserv event]

Summary

The number of open source solutions in this area is perhaps not surprising as the majority of the web (65% according to the last Netcraft survey) is run on the open source Apache server. It’s interesting to see that code is not only being contributed by the academic/research community but also by companies like Facebook who deal with big data on a daily basis. Assuming the challenge isn’t technical, it then becomes about organisations understanding what they can do with data and having the talent in place (data scientists) to turn data into ‘actionable insights’.

Here are videos of all the presentations (including links to slides where available)

BTW Here is an archive of tweets from #esym12

For those of you who have made it this far through my dearth of links, please feel free to now leave this site and watch some of the videos from the Data Scientist Summit 2011 (I’m still working my way through but there are some inspirational presentations).

Update: Sander van der Waal at OSS Watch, who was also at #esym12, has also posted The dominance of open source tools in Big Data