Archive for the 'Analytics' Category

Analytics Reconnoitre: Notes on Open Solutions in Big Data from #esym12

A couple of weeks ago it was Big Data Week, “a series of interconnected activities and conversations around the world across not only technology but also the commercial use case for Big Data”.

Big data consists of data sets that grow so large and complex that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing – by Wikipedia

In O’Reilly Radar there was a piece on Big data in Europe which had Q&A from Big Data Week founder/organizer Stewart Townsend and Carlos Somohano, both of whom are big in Big Data.

Maybe I’m being naïve but I was surprised that there was no reference to what the universities/research sector is doing with handling and analysing large data sets. For example, at the Sanger Institute alone each of their DNA sequencers is generating 1 terabyte (1024 gigabytes) of data a day, and the institute is storing over 17 petabytes (17 million gigabytes), a figure which is doubling every year.

Those figures trip off my tongue because last week I was at the Eduserv Symposium 2012: Big Data, Big Deal? which had many examples of how institutions are dealing with ‘big data’. There were a couple of things I took away from this event, like the prevalence of open source software as well as the number of vendors wrapping open source tools with their own systems to sell as a service. Another clear message was a lack of data scientists who can turn raw data into information and knowledge.

As part of the Analytics Reconnoitre we are undertaking at JISC CETIS, in this post I want to summarise some of the open source tools and ‘as a service’ offerings in the Big Data scene.

[Disclaimer: I should say first I’m coming to this area cold. I’m not an information systems expert so what you’ll see here is a very top-level view, more often than not me joining the dots from things I’ve learned 5 minutes ago. So if you spot anything I’ve got wrong or bits I’m missing let me know]

Open source as a Service

some of the aaS’s
CaaS – Cluster as a Service
IaaS – Infrastructure as a Service
SaaS – Software as a Service
PaaS – Platform as a Service

I’ve already highlighted how the open source R statistical computing environment is being used as an analytics layer. Open source is alive and well in other parts of the infrastructure. First up at the event was Rob Anderson from Isilon Systems (division of EMC) talking about Big Data and implications for storage. Rob did a great job introducing Big Data and a couple of things I took away were the message that there is a real demand for talented ‘data scientists’ and getting organisations to think differently about data.

If you look at some of the products/services EMC offer you’ll find EMC Greenplum Database and HD Community Editions (Greenplum are a set of products to handle ‘Big Data’). You’ll see that these include the open source Apache Hadoop ecosystem. If like me you’ve heard of Hadoop but don’t really understand what it is, here is a useful post on Open source solutions for processing big data and getting Knowledge. This highlights components of the Hadoop ecosystem, most of which appear in the Greenplum Community Edition (I was very surprised to see that the NoSQL database Cassandra, which is now an Apache project, was originally developed by Facebook and released as open source code – more about NoSQL later).

Open algorithms, machines and people

amplab - state of the art

The use of open source in big data was also highlighted by Anthony D Joseph, Professor at the University of California, Berkeley, in his talk. Anthony was highlighting UC Berkeley’s AMPLab which is exploring “Making Sense at Scale” by tightly integrating algorithms, machines and people (AMP). The slide (right) from Anthony’s presentation summarises what they are doing, combining 3 strands to solve big data problems.

They are achieving this by combining existing tools with new components. In the slide below you have the following pieces developed by AMPLab:

  • Apache Mesos – an open source cluster manager
  • Spark – an open source iterative and interactive data analysis system
  • SCADS – consistency adjustable data store (license unknown)
  • PIQL – Performance (predictive) Insightful Query Language (part of SCADS. There’s also a PIQL-on-RAILS plugin under the MIT license)

amplab - machines

In the Applications/tools box is: Advanced ML algorithms; Interactive data mining; Collaborative visualisation. I’m not entirely sure what these are but in Anthony’s presentation he mentioned more open source tools are required particularly in ‘new analysis environments’.

Here are the real applications of AMPLab Anthony mentioned:

[Another site mentioned by Anthony worth bookmarking/visiting is DataKind – ‘helping non-profits through pro bono data collections, analysis and visualisation’]

OpenStack

Another cloud/big data/open source tool I know of but not mentioned at the event is OpenStack. This was initially developed by commercial hosting service Rackspace and NASA (who it has been said are ‘the largest collector of data in human history’). Like Hadoop OpenStack is a collection of tools/projects rather than one product. OpenStack contains OpenStack Compute, OpenStack Object Storage and OpenStack Image Service.

NoSQL

In computing, NoSQL is a class of database management system identified by its non-adherence to the widely-used relational database management system (RDBMS) model … It does not use SQL as its query language … NoSQL database systems are developed to manage large volumes of data that do not necessarily follow a fixed schema – by Wikipedia

NoSQL came up in Simon Metson’s (University of Bristol), Big science, Big Data session. This class of database is common in big data applications but Simon underlined that it’s not always the right tool for the job:

This view is echoed by Nick Jackson (University of Lincoln) who did an ‘awesome’ introduction to MongoDB (one of the many open source NoSQL solutions) as part of the Managing Research Data Hack Day organised by DevCSI/JISC MRD. I strongly recommend you look at the resources that came out of this event including other presentations from University of Bristol on data.bris.

[BTW the MongoDB site has a very useful page highlighting how it differs from another open source NoSQL solution, CouchDB. So even NoSQL solutions come in many flavours. Also Simon Hodson, Programme Manager, JISC MRD, gave a lightning talk on JISC and Big Data at the Eduserv event]

Summary

The number of open source solutions in this area is perhaps not surprising as the majority of the web (65% according to the last Netcraft survey) is run on the open source Apache server. It’s interesting to see that code is not only being contributed by the academic/research community but also by companies like Facebook who deal with big data on a daily basis. Assuming the challenge isn’t technical, it then becomes about organisations understanding what they can do with data and having the talent in place (data scientists) to turn data into ‘actionable insights’.

Here are videos of all the presentations (including links to slides where available)

BTW Here is an archive of tweets from #esym12

For those of you who have made it this far through my dearth of links please feel free to now leave this site and watch some of the videos from the Data Scientist Summit 2011 (I’m still working my way through but there are some inspirational presentations).

Update: Sander van der Waal at OSS Watch, who was also at #esym12, has also posted The dominance of open source tools in Big Data

Visual Analytics: Comparison of @SCOREProject and @UKOER (and template for making your own)

Lou McGill from the JISC/HEA OER Programme Synthesis and Evaluation team recently contacted me as part of the OER Review asking if there was a way to analyse and visualise the Twitter followers of @SCOREProject and @ukoer. Having recently extracted data for the @jisccetis network of accounts I knew it was easy to get the information, but making it meaningful was another question.

There are a growing number of sites like twiangulate.com and visual.ly that make it easy to generate numbers and graphics. One of the limitations I find with these tools is they produce flat images and all opportunities for ‘visual analytics’ are lost.

Click to see twiangulate comparison of SCOREProject and UKOER
Twiangulate data
Click to see visual.ly comparison of SCOREProject and UKOER
create infographics with visual.ly

So here’s my take on the problem. A template constructed with free and open source tools that lets you visually explore the @SCOREProject and @ukoer Twitter following.

Comparison of @SCOREProject and @ukoer

In this post I’ll give my narrative on the SCOREProject/UKOER Twitter followership and give you the basic recipe for creating your own comparisons (I should say that the solution isn’t production quality, but I need to move onto other things so someone else can tidy up).

Let’s start with the output. Here’s a page comparing the Twitter Following of SCOREProject and UKOER. At the top each bubble represents someone who follows SCOREProject or UKOER (hovering over a bubble we can see who they are and clicking filters the summary table at the bottom).

Bubble size matters

There are three options to change how the bubbles are sized:

  • Betweenness Centrality (a measure of the community bridging capacity); (see Sheila’s post on this)
  • In-Degree (how many other people who follow SCOREProject or ukoer also follow the person represented by the bubble); and
  • Followers count (how many people follow the person represented by the node)
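To make the first two sizing options a little more concrete, here is a minimal Python sketch of how in-degree and betweenness centrality could be computed for a tiny follower network. It is not the R script used for the page: the edge list and account names (alice, bob, carol) are invented, the betweenness is the unnormalised, directed version calculated with Brandes' algorithm, and the real tools may normalise or treat direction differently.

```python
from collections import defaultdict, deque

# Hypothetical edges written as (follower, followed); names are invented.
EDGES = [
    ("alice", "SCOREProject"),
    ("bob", "SCOREProject"),
    ("bob", "ukoer"),
    ("carol", "ukoer"),
    ("alice", "bob"),
    ("carol", "bob"),
]


def in_degree(edges):
    """Count how many accounts in the network follow each account."""
    counts = defaultdict(int)
    for follower, followed in edges:
        counts[followed] += 1
        counts[follower] += 0  # make sure pure followers appear with 0
    return dict(counts)


def betweenness(edges):
    """Unnormalised betweenness centrality on a directed, unweighted graph."""
    adjacency = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        adjacency[src].append(dst)
        nodes.update((src, dst))

    score = dict.fromkeys(nodes, 0.0)
    for start in nodes:
        # Breadth-first search counting shortest paths from `start`.
        order = []
        predecessors = {n: [] for n in nodes}
        paths = dict.fromkeys(nodes, 0)
        paths[start] = 1
        distance = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        # Walk back from the furthest nodes, sharing out path dependencies.
        dependency = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != start:
                score[w] += dependency[w]
    return score


degrees = in_degree(EDGES)
bridges = betweenness(EDGES)
```

In this toy network bob is the only account that sits on a shortest path between two others, which is the 'bridging' idea behind sizing bubbles by betweenness.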

Clicking on the ‘Grouped’ button lets you see how the bubbles/people follow either SCOREProject, UKOER or both. By switching between betweenness, degree and followers we can visually spot a couple of things:

  • Betweenness Centrality: SCOREProject has 3 well connected intercommunity bubbles @GdnHigherEd, @gconole and @A_L_T. UKOER has the SCOREProject following them which unsurprisingly makes them a great bridge to the SCOREProject community (if you are wondering where UKOER is, as they don’t follow SCOREProject they don’t appear).
  • In-Degree: Switching to In-Degree we can visually see that the overall volume of the UKOER group grows more despite the SCOREProject bubble in this group decreasing substantially. This suggests to me that the UKOER following is more interconnected.
  • Followers count: Here we see SCOREProject is the biggest winner thanks to being followed by @douglasi who has over 300,000 followers. So whilst SCOREProject is followed by fewer people than UKOER it has a potentially greater reach if @douglasi ever retweeted a message.

Colourful combination

Sticking with the grouped bubble view we can see different colour groupings within the clusters for SCOREProject, UKOER and both. The most noticeable is the light green used to identify Group 4, which has 115 people following SCOREProject compared to 59 following UKOER. The groupings are created using the community structure detection algorithm proposed by Joerg Reichardt and Stefan Bornholdt. To give a sense of who these sub-groups might represent, individual wordclouds have been generated based on the individual Twitter profile descriptions. Clicking on a word within these clouds filters the table. So for example you can explore who has used the term manager in their twitter profile (I have to say the update isn’t instant but it’ll get there).
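As a rough illustration of the wordcloud step (not the community detection itself), the sketch below counts the words used in a handful of profile descriptions and then finds which profiles mention a given term. The descriptions, the stop word list and the function names are all made up for the example; the real page does this with R and d3.js.

```python
import re
from collections import Counter

# A deliberately tiny stop word list, assumed for this example only.
STOPWORDS = {"a", "an", "and", "at", "for", "i", "in", "of", "the", "to"}

# Hypothetical profile descriptions for one detected group.
GROUP = [
    "Project manager for open education",
    "Learning technologist and OER manager",
    "Lecturer in education",
]


def word_counts(descriptions):
    """Count how often each non-stop word appears across the descriptions."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


def profiles_mentioning(term, descriptions):
    """Return the descriptions containing the term as a whole word."""
    term = term.lower()
    return [d for d in descriptions
            if term in re.findall(r"[a-z']+", d.lower())]


counts = word_counts(GROUP)
managers = profiles_mentioning("manager", GROUP)
```

The counts are what would drive word size in a cloud, and `profiles_mentioning` is the same idea as clicking a word to filter the table.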

wordclouds

Behind the scenes

The bubble chart is coded in d3.js and based on Animated Bubble Chart by Jim Vallandingham. The modifications I made were to allow bubble resizing (lines 37-44). This also required handling the bubble charge slightly differently (line 118). I got the idea of using the bubble chart for comparison from a Twitter Abused post Rape Culture and Twitter Abuse. It also made sense to reuse Jim’s template which uses the Twitter Bootstrap. The wordclouds are also rendered using d3.js by using the d3.wordcloud extension by Jason Davies. Finally the table at the bottom is rendered using the Google Visualisation API/Google Chart Tools.

All the components play nicely together although the performance isn’t great. If I have more time I might play with the load sequencing, but it could be I’m just asking too much of things like the Google Table chart rendering 600 rows. 

How to make your own

I should say that this recipe probably won’t work for accounts with over 5,000 followers. It also involves using R (in my case RStudio). R is used to do the network analysis/community detection side. You can download a copy of the script here. There’s probably an easier recipe that skips this part worth revisiting.

  1. We start with taking a copy of Export Twitter Friends and Followers v2.1.2 [Network Mod] (as featured in Notes on extracting the JISC CETIS twitter follower network).
  2. Authenticate the spreadsheet with Twitter (instructions in the spreadsheet) and then get the followers of the accounts you are interested in using the Twitter > Get followers menu option 
  3. Once you’ve got the followers run Twitter > Combine follower sheets Method II
  4. Move to the Vertices sheet and sort the data on the friends_count column
  5. In batches of around 250 rows select values from the id_str column and run TAGS Advanced > Get friend IDs – this will start populating the friends_ids column with data. For users with over 5,000 friends reselect their id_str and rerun the menu option until the ‘next_cursor’ equals 0 
    next cursor position
  6. Next open the Script editor and open the TAGS4 file and then Run > setup.
  7. Next select Publish > Publish as a service… and allow anyone to invoke the service anonymously. Copy the service URL and paste it into the R script downloaded earlier (also add the spreadsheet key to the R script and within your spreadsheet File > Publish to the web)
    publish as service window
  8. Run the R script! …  and fingers crossed everything works.
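The 'rerun until next_cursor equals 0' part of step 5 is the usual cursor paging pattern, sketched here in Python. This is an illustration only and not the spreadsheet's Apps Script: `fetch_page` is a hypothetical stand-in for whatever makes the API call, and the fake pages are invented. The starting cursor of -1 and the `ids`/`next_cursor` fields follow the way Twitter's cursored ID endpoints behaved at the time.

```python
def fetch_all_ids(fetch_page):
    """Collect IDs page by page until the service reports next_cursor == 0.

    `fetch_page` is a hypothetical callable taking a cursor and returning
    a dict with an "ids" list and a "next_cursor" value.
    """
    ids = []
    cursor = -1  # -1 asks for the first page
    while cursor != 0:
        page = fetch_page(cursor)
        ids.extend(page["ids"])
        cursor = page["next_cursor"]
    return ids


# Invented pages standing in for real API responses.
FAKE_PAGES = {
    -1: {"ids": [1, 2, 3], "next_cursor": 1001},
    1001: {"ids": [4, 5], "next_cursor": 0},
}


def fake_fetch(cursor):
    """Pretend API call used only to exercise the loop."""
    return FAKE_PAGES[cursor]


all_ids = fetch_all_ids(fake_fetch)
```

A real version would also need to respect rate limits between calls, which is part of why the recipe works in batches.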

The files used in the SCOREProject/UKOER can be downloaded from here. Changes you’ll need to make are adding the output csv files to the data folder, changing references in js/gtable.js and js/wordcloud.js and the labels used in coffee/coffee.vis

So there you go. I’ve spent way too much of my own time on this and haven’t really explained what is going on. Hopefully the various commenting in the source code removes some of the magic (I might revisit the R code as in some ways I think it deserves a post on its own). If you have any questions or feedback leave them in the comments ;)

Analytics Reconnoitre: Notes on R in education and industry

As part of my role at JISC CETIS I’ve been asked to contribute to our ‘Analytics Reconnoitre’ which is a JISC commissioned project looking at the data and analytics landscape. One of my first tasks is to report on the broad landscape and trends in analytics service and data providers. Whilst I’m still putting this report together it’s been interesting to note how one particular analytics tool, R, keeps pinging on my radar. I thought it would be useful to loosely join these together and share.

Before R, the bigger ‘data science’ picture 

Before I go into R there is some more scene setting required. As part of the Analytics Reconnoitre Adam Cooper (JISC CETIS) has already published Analytics and Big Data – Reflections from the Teradata Universe Conference 2012 and Making Sense of “Analytics”.

The Analytics and Big Data post is an excellent summary of the Teradata Universe event and Adam is also able to note some very useful thoughts on ‘What this Means for Post-compulsory Education’. This includes identifying pathways for education to move forward with business intelligence and analytics. One of these I particularly liked was:

Experiment with being more analytical at craft-scale
Rather than thinking in terms of infrastructure or major initiatives, get some practical value with the infrastructure you have. Invest in someone with "data scientist" skills as master crafts-person and give them access to all data but don’t neglect the value of developing apprentices and of developing wider appreciation of the capabilities and limitations of analytics.

[I’m biased towards this path because it encapsulates a lot of what I aspire to be. The craft model was one introduced to me by Joss Winn at this year’s Dev8D and coming from a family of craftsmen it makes me more comfortable to think I’m continuing the tradition in some way.]

Here are Adam’s observations and reflections on ‘data science’ from the same blog post:

"Data Scientist" is a term which seems to be capturing the imagination in the corporate big data and analytics community but which has not been much used in our community.

A facetious definition of data scientist is "a business analyst who lives in California". Stephen Brobst gave his distinctions between data scientist and business analyst in his talk. His characterisation of a business analyst is someone who: is interested in understanding the answers to a business question; uses BI tools with filters to generate reports. A data scientist, on the other hand, is someone who: wants to know what the question should be; embodies a combination of curiosity, data gathering skills, statistical and modelling expertise and strong communication skills. Brobst argues that the working environment for a data scientist should allow them to self-provision data, rather than having to rely on what is formally supported in the organisation, to enable them to be inquisitive and creative.

Michael Rappa from the Institute for Advanced Analytics doesn’t mention curiosity but offers a similar conception of the skill-set for a data scientist in an interview in Forbes magazine. The Guardian Data Blog has also reported on various views of what comprises a data scientist in March 2012, following the Strata Conference.

While it can be a sign of hype for new terminology to be spawned, the distinctions being drawn by Brobst and others are appealing to me because they are putting space between mainstream practice of business analysis and some arguably more effective practices. As universities and colleges move forward, we should be cautious of adopt the prevailing view from industry – the established business analyst role with a focus on reporting and descriptive statistics – and miss out on a set of more effective practices. Our lack of baked-in BI culture might actually be a benefit if it allows us to more quickly adopt the data scientist perspective alongside necessary management reporting. Furthermore, our IT environment is such that self-provisioning is more tractable.

R in data science and in business

For those that don’t know, R is an open source statistical programming language. If you want more background about the development of R, Information Age covers this in their piece Putting the R in analytics. An important thing to note, which is covered in the story, is that R was developed by two academics at the University of Auckland and continues to have a very strong and active academic community supporting it. Whilst initially used as an academic tool, the article highlights how it is being adopted by the business sector.

I originally picked up the Information Age post via the Revolutions blog (hosted by Revolution Analytics) in the post Information Age: graduates driving industry adoption of R, which includes one of the following quotes from Information Age:

This popularity in academia means that R is being taught to statistics students, says Matthew Aldridge, co-founder of UK- based data analysis consultancy Mango Solutions. “We’re seeing a lot of academic departments using R, versus SPSS which was what they always used to teach at university,” he says. “That means a lot of students are coming out with R skills.”

Finance and accounting advisory Deloitte, which uses R for various statistical analyses and to visualise data for presentations, has found this to be the case. “Many of the analytical hires coming out of school now have more experience with R than with SAS and SPSS, which was not the case years ago,” says Michael Petrillo, a senior project lead at Deloitte’s New York branch.

Revolutions have picked up other stories related to R in big data and analytics. The first of two I have bookmarked is Yes, you need more than just R for Big Data Analytics, in which Revolutions editor David Smith underlines that having tools like R isn’t enough and a wider data science approach is needed because “it combines the tool expertise with statistical expertise and the domain expertise required to understand the problem and the data applicable to it”.

Smith also reminds us that:

The R software is just one piece of software ecosystem — an analytics stack, if you will — of tools used to analyze Big Data. For one thing R isn’t a data store in its own right: you also need a data layer where R can access structured and unstructured data for analysis. (For example, see how you can use R to extract data from Hadoop in the slides from today’s webinar by Antonio Piccolboni.) At the analytics layer, you need statistical algorithms that work with Big Data, like those in Revolution R Enterprise. And at the presentation layer, you need the ability to embed the results of the analysis in reports, BI tools, or data apps.

[Revolutions also has a comprehensive list of R integrated throughout the enterprise analytics stack which includes vendor integrations from IBM, Oracle, SAP and more]

The second post from Revolutions is R and Foursquare’s recommendation engine which is another graphic illustration of how R is being used in the business sector separately from vendor tools.

Closing thoughts

At this point it’s worth highlighting another of Adam’s thoughts on directions for academia in Analytics and Big Data:

Don’t focus on IT infrastructure (or tools)
Avoid the temptation (and sales pitches) to focus on IT infrastructure as a means to get going with analytics. While good tools are necessary, they are not the right place to start.

I agree about not being blinkered by specific tools and, as pointed out earlier, R can only ever be just one piece of software in the ecosystem and any good data scientist will use the right tool for the job. It’s interesting to see an academic tool being adopted by, and arguably driving, part of the commercial sector. Will academia follow where they have led – if you see what I mean?

Notes on extracting the JISC CETIS twitter follower network

As recently mentioned on Sheila’s work blog, the way the @jisccetis twitter account is used is evolving. Up until recently this account was used as a broadcast channel, pushing out latest news to followers and not following back. This was balanced by members of staff having personal twitter accounts, engaging with the community. As with any community there’s going to be overlap with common friendships and Phil Barker (@philbarker) suggested it would be good to see the extended JISC CETIS twitter follower network.

In this post I’ll introduce some sketches* with results to explore and show you how the data was extracted.

*this is a term I’ve picked up from Tony Hirst along with explanatory and exploratory visualisations, both presented in More Thoughts on a Content Strategy for Data. The other thing I have sitting heavily in my thoughts is Eric Berlow’s TEDTalk where he shows complex doesn’t always mean complicated (H/T @PaulHollins). My fear is I’m going to dump you with complicated exploratory sketches, when I should be giving you simple explanatory answers.

Dump #1 Blooming great

Blooming great

For this first dump I’ve deliberately left it as low resolution as I only want to give you an overview and not analyse each node. In the graph you’ll spot dense patches of purple [A]; these are made of the individual twitter screen names of people following one of the CETIS twitter accounts. So at the very top of the image there is a cluster of people following just me [B]. Other dense patches represent other groups of people following other CETIS Twitter accounts. In the centre of the main group [C] are Twitter users who follow 2 or more CETIS accounts. In Gephi by rolling over nodes it’s easy to explore who people follow. To the right of the graph [D] is the @ArchimateTool account. This cluster has fewer connections to the main CETIS following. Finally around the centre of the graph are loose groups [E] of users who follow 2 CETIS staff.

Update: Some other stats. The average out-degree in the network is 1.424 and 81% of the people in graph only follow one of the CETIS accounts. It would be interesting to see how this compares with other organisations. It’s important to also remember it’s not just about twitter (email probably still has the best reach and conversion)
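For anyone wanting to reproduce this kind of figure outside Gephi, here is a small Python sketch of one way to get the average number of accounts followed and the share of people following only one account from an edge list. The edges and account names are invented, and this averages over followers only, which may not be exactly how Gephi arrives at its average degree.

```python
from collections import Counter

# Hypothetical (follower, followed account) pairs; all names are invented.
EDGES = [
    ("u1", "account_a"),
    ("u2", "account_a"),
    ("u2", "account_b"),
    ("u3", "account_b"),
    ("u4", "account_a"),
    ("u5", "account_c"),
]


def follower_stats(edges):
    """Return (unique followers, average accounts followed, share following one)."""
    # set() removes duplicate rows so each follow is counted once.
    followed_per_user = Counter(follower for follower, _ in set(edges))
    total = len(followed_per_user)
    average = sum(followed_per_user.values()) / total
    share_one = sum(1 for n in followed_per_user.values() if n == 1) / total
    return total, average, share_one


unique_followers, average_followed, share_following_one = follower_stats(EDGES)
```

The first value is also the 'unique followers' count, since each person appears once however many accounts they follow.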

[If you are desperate to explore an interactive version of this I’ve put a copy on my install of Gexf-JS viewer.]

Dump #2 Many Eyes

Overall there are over 3,500 unique Twitter accounts that follow one or more CETIS staff accounts. 3,500 pairs of eyes looking at what CETIS or staff members are doing, with the potential to spread our message even further through their own networks. Here’s what a lot of those eyes look like (click for larger version on zoom.it):

Many eyes (click to see on zoom.it)

I suppose the next question is do we have the right Twitter audience watching us.  A quick wordcloud of the profile description of the staff following us:

CETIS Follower Description Wordle

Getting the data

My regular top traffic generating blog post is Export Twitter Followers and Friends using a Google Spreadsheet which allows users to easily grab details of up to 5,000 (more if you don’t mind some code tinkering) Twitter account friend/followers. I don’t know how widely known it is but Twitter doesn’t just let you get your own friends/followers, you can get the data for any public Twitter account. So that’s what I did, snaffled details of who was following @jisccetis and JISC CETIS staff with public twitter accounts.

The way the spreadsheet is set up it generates a separate sheet for each person’s follower details. To make it easy to import into Gephi/NodeXL I wrote this short script:

Here’s a copy of the modded spreadsheet. To use File > Make a copy, run through the authentication instructions, grab some follower details from different accounts then run Twitter > Combine follower sheets. If you’re going to be using Gephi the last thing you should do before downloading as csv is change the column heading on the ‘combined’ sheet from screen_name to source.
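If you'd rather do the combining step outside the spreadsheet, the idea can be sketched in a few lines of Python. To be clear, this is not the spreadsheet's script: the data structure, the account names and the `target` column name are assumptions for the example, with `source` holding the follower's screen_name as described above.

```python
import csv
import io

# Hypothetical per-account follower sheets; names are invented.
SHEETS = {
    "account_a": [{"screen_name": "u1"}, {"screen_name": "u2"}],
    "account_b": [{"screen_name": "u2"}],
}


def combine_followers(followers_by_account):
    """Flatten {account: [follower rows]} into one list of edge rows."""
    rows = []
    for account, followers in followers_by_account.items():
        for follower in followers:
            rows.append({"source": follower["screen_name"], "target": account})
    return rows


def to_csv(rows):
    """Write the edge rows as CSV text with a source,target header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["source", "target"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


edge_rows = combine_followers(SHEETS)
csv_text = to_csv(edge_rows)
```

Each row reads as 'source follows target', which is the shape an edge list import generally expects.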

Using Gephi

The best way I’ve found to get the data in Gephi is start a new project and then use the Import Spreadsheet option in the Data Laboratory pointing it to the csv file downloaded from Google Spreadsheet. I’ll let you play with manipulating the data. If you come up with any nice recipes please share ;)

Using NodeXL

Open a blank NodeXL template and then open the downloaded csv in Excel as a new workbook, then from the NodeXL ribbon Import > Open workbook. It’s worth ticking the extra columns as vertex 1 properties. Again I’ll let you play, any recipes please share (the many eyes image was generated by switching the nodes to image and using the profile_image_url extracted using the Google Spreadsheet and using a grid layout. If anyone has worked out how to use images as nodes in Gephi I’d be very interested to hear).

So what

I avoided going into any deep analysis with this as there are probably internal discussions to be had, such as, should we be targeting college staff more? What I hope this post illustrates is it’s relatively easy to extract this type of data and start to get the very beginnings of some answers (e.g. how many unique followers do we have). There’s still a lot to unpick in this area so I’m sure I’ll be revisiting. My question to you is if you were doing this type of study what answers would you be looking for?

Google Analytics rolling out social network activity streams: Paradata heaven?

Today in Capturing The Value Of Social Media Using Google Analytics Google announced some new features that will be appearing in Google Analytics. The post is mainly focused around ‘social value’ of defining and monitoring goals for getting people coming to your site from social networks to do something on your site (click a button, view a certain page).

The bit that is really interesting (for me anyway) is the announcement on ‘activity streams’. These will include information on:

how people are engaging socially with your content off your site across the social web. For content that was shared publicly, you can see the URLs they shared, how and where they shared (via a “reshare” on Google+ for example), and what they said. Currently, activities are reported for Google+ and across a growing list of our Social Data Hub partners including recently signed brands Badoo, Disqus, Echo, Hatena and Meetup.

Example Activity Stream

There is obvious overlap here with some of my recent work extracting ‘activity data’ from social networks for sites and repositories, but before I pack my bags there are a number of things to consider.

Twitter and Facebook probably won’t come to the party
Google’s access to activity data is limited to those who want to join the Analytics Social Data Hub. While there are already some reasonably big names signed up given the Twitter/Facebook/Google+ social network war it’s unlikely that you are going to see individual tweet analytics as I achieved here in the near future.

Access to the data
It’ll be interesting to see if Google will make ‘activity stream’ data available for download or access via their API. There’s very little information on the Social Data Hub website about what 3rd party services are signing up to and if there is any compensation for making their data available. A number of the existing signups already have their own public APIs so they may be happy for this data to be made available. Only time will tell.

Not everyone uses Google Analytics
I’m also trying to take comfort in the fact that not everyone uses Google Analytics, so there is hopefully still value in surfacing and centralising activity data for non-Analytics users.

So interesting times, but does anyone actually care about this type of data yet?

Turning Google+ Search results into a RSS feed (for Google Reader)

In Using Google Reader to create a searchable archive of Twitter mentions Alan Cann commented:

Subscribing to RSS feeds in Google Reader is my bog standard way of archiving Twitter feeds. Now to figure out how to get an RSS feed from a Google+ hashtag…

Let’s look at how it might be possible. So there’s no visible RSS feed from the Google+ Search page. Looking at the API documentation there is documentation on Activities: search. So we could have a query like:

but there are a couple of problems. Data is returned in JSON and would need remapping to RSS. The real deal breaker, which is highlighted if you click on the link above, is you need to register for an API key from Google’s API Console to get the data. So at this point I could setup a service to convert Google+ Searches into RSS feeds (and someone may have already done this), show you how to do it via the Console or show you some other way. For now I’m opting for ‘another way’.
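For the curious, the 'remapping to RSS' route might look something like this Python sketch, which turns a JSON list of activities into a minimal RSS 2.0 document. The sample JSON is invented, and the field names (items, title, url, published) and the timestamp format are assumptions about the response shape rather than something taken from the API documentation.

```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import format_datetime

# Invented sample response; field names are assumed for illustration.
SAMPLE_JSON = json.dumps({
    "items": [
        {"title": "First result", "url": "https://example.org/1",
         "published": "2012-03-20T10:15:00+00:00"},
        {"title": "Second result", "url": "https://example.org/2",
         "published": "2012-03-21T08:00:00+00:00"},
    ]
})


def activities_to_rss(json_text, feed_title, feed_link):
    """Map a JSON list of activities onto RSS 2.0 channel items."""
    data = json.loads(json_text)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Search results: " + feed_title
    for activity in data.get("items", []):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = activity["title"]
        ET.SubElement(item, "link").text = activity["url"]
        # RSS expects RFC 822 style dates, so convert the ISO timestamp.
        published = datetime.fromisoformat(activity["published"])
        ET.SubElement(item, "pubDate").text = format_datetime(published)
    return ET.tostring(rss, encoding="unicode")


rss_text = activities_to_rss(SAMPLE_JSON, "#esym12", "https://example.org/search")
```

A hosted version of this would still need the API key mentioned above, which is the reason for going a different way here.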

Publishing any XML format using Google Spreadsheets

Using the same trick in Tracking activity: Diigo site bookmark activity data in a Google Spreadsheet (who is saving your stuff) we can extract some information from a Google+ Search page like this one into a Google Spreadsheet using the importXML function and XPath queries to pull out part of the page (here are parts of the same search pulled into a Google Spreadsheet). There is an option to publish a Google Spreadsheet as RSS or ATOM but it’s not structured in the same way as for a blog feed (title is a cell reference etc. like this).
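To show what an XPath query is doing in that importXML call, here is a Python sketch that pulls link text and URLs out of a small, well-formed page. The markup, class names and XPath expression are invented for the example and bear no relation to the actual Google+ page structure; real pages are often not well-formed XML, so this standard library parser would not cope with them as-is.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup standing in for a search results page.
PAGE = """<html><body>
<div class="result"><a href="https://example.org/post/1">First post</a></div>
<div class="result"><a href="https://example.org/post/2">Second post</a></div>
<div class="advert"><a href="https://example.org/ad">Ignore me</a></div>
</body></html>"""


def extract_links(page_xml, xpath=".//div[@class='result']/a"):
    """Return (text, href) pairs for every element matching the XPath."""
    root = ET.fromstring(page_xml)
    return [(a.text, a.get("href")) for a in root.findall(xpath)]


links = extract_links(PAGE)
```

The expression only matches links inside the 'result' blocks, which also shows why this approach is fragile: if the page's class names change, the query silently returns nothing.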

What we need is a way to trick Google into formatting the data in a different way. As part of the Google Earth Outreach project a Spreadsheet Mapper was developed. This spreadsheet template allows you to enter geographic data which is automatically formatted as KML data (KML is another XML language for presenting geo data). This is achieved by creating a KML template within the spreadsheet and using the plain text output as KML. 

So using the same trick here’s a:

*** Google Spreadsheet Template to Turn Google+ Search into an RSS Feed ***

Google+ Search in Google Reader

Entering a search term and publishing the spreadsheet gives me a custom RSS feed of activity data. This feed works in Google Reader (haven’t tested any others), and with Reader we have the benefit of the results being cached (still not sure what the limitations are).

Important: Some things to be aware of. Because the data for this is extracted using XPath, when Google change the page styling this solution probably won’t work anymore. Also the RSS feed being produced is for the last 10 search items. If you’ve got an active term then data might get lost.

So yet more resource based activity/paradata for you to digest!

Tracking activity: Diigo site bookmark activity data in a Google Spreadsheet (who is saving your stuff)

Recently I’ve been interested in tracking activity around resources. This comes off the back of the OER Visualisation project where I started looking at social share data around educational resources, the beginnings of a PostRank style RSS social engagement tracker, and more recently Using Google Spreadsheets to combine Twitter and Google Analytics data to find your top content distributors (it’s been an eye-opener to see how much individual activity data there is … if you know where to look).

Working on the vague use case of ‘academic finds an interesting resource and bookmarks it for later’ my assumption is there might be more social bookmarking rather than shares via services like Twitter. To see what data is accessible I turned my attention to Diigo. For those that don’t know, Diigo started as an online bookmarking service but has kept adding sharing, notetaking and highlighting type features and continues to try and steal the Delicious crowd.

Diigo does have an official API but it is based around individual users rather than sites. Site level data is available and here is an example for my hawksey.info domain. The page returns the last 20 bookmarks made by users for my site. Clicking on a bookmark lets you see how many people have also publicly bookmarked the page, the date and how they tagged it (there’s probably more here to do on crowdsourced metadata … for another day).

Diigo bookmark details

Back to the top level data. Obviously you could visit this page each day to see who has been bookmarking your material or maybe even find a service that emails the webpage to you each day. I’m more interested in how this data might be centralised in one place so that you can combine it with other information.  It probably won’t be a surprise that I chose Google Spreadsheets to have a crack at this.

Below is embedded this Diigo Site Tracker Google Spreadsheet <- click on the link and File > Make a copy for your own version and enter your site url in cell B3

In the spreadsheet you can see the Diigo profile url for the person who has bookmarked a link, what was bookmarked and by scraping the details page how many times the link has already been saved.

How it was made

If you have been following my other work you might think this is powered by Google Apps Script, but you’d be wrong. The spreadsheet is entirely powered by the built-in importXML function. As you’ll see from the documentation the function can handle a range of markup languages including HTML. So we can point the function at a webpage, but how do we get back the parts we want? This is usually the bit I trip over. To query the part of the page you want back you need to use XPath.

XPath lets you drill down into the part of the page you want. The key I’ve found to unlocking XPath is a browser extension (I’m currently using this one) which lets me see the XPath for part of the page I’m looking at. I then use this information in the importXML function (it’s worth noting that Google Spreadsheets limits you to 50 imports per spreadsheet, so to scale this solution to get data from other services I’d probably have to switch to Apps Script or something else).
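
If you want to experiment with XPath outside a spreadsheet, the same idea can be sketched in Python. This is an illustration only: the markup below is invented (it is not Diigo’s real page structure), `import_xml` is a hypothetical stand-in for importXML, and Python’s built-in ElementTree only supports a subset of XPath and needs well-formed markup, which real web pages often aren’t.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup standing in for a page of bookmarks.
PAGE = """
<html><body>
  <div class="item"><a href="https://example.com/post-1">Post one</a><span class="user">alice</span></div>
  <div class="item"><a href="https://example.com/post-2">Post two</a><span class="user">bob</span></div>
  <div class="footer"><a href="https://example.com/about">About</a></div>
</body></html>
"""


def import_xml(markup, xpath):
    """Return the elements of `markup` matched by an XPath expression."""
    root = ET.fromstring(markup.strip())
    return root.findall(xpath)


# Only links inside 'item' blocks are wanted, not the footer link.
links = [a.get("href") for a in import_xml(PAGE, ".//div[@class='item']/a")]
users = [s.text for s in import_xml(PAGE, ".//div[@class='item']/span[@class='user']")]
print(links)
print(users)
```

Each expression plays the role of the second argument to importXML: one query per column you want to pull out of the page.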

So that was Diigo, your homework is to do something similar with Delicious and I’ll give you my answer tomorrow ;) [I might even be able to show you how you can link this to Google Analytics data].   

Using Google Spreadsheets to combine Twitter and Google Analytics data to find your top content distributors

When you share a link on Twitter there are a number of services, like bit.ly, which allow you to track the impact of the url in terms of the number of clicks it attracts from other users. At the same time there are a number of ways to monitor people sharing links to your site, the most basic being using a Twitter search like this one for hawksey.info. Using these search results you could start extracting the follower information from the person tweeting, work out potential reach and so on, but wouldn’t you like to know, as with your own bit.ly account, how many visits someone else’s tweet generated to your site? Fortunately there is a way to do this and in this post I give you two tools to help you do it and look at why this information might be useful, but let’s first look at how it is possible.

Referral Traffic

In August 2011 Twitter started automatically wrapping links over 19 characters in its own shortening service t.co, and later in October started wrapping all links in t.co. When you navigate around the web, at each page you visit the server generally knows which page you came from (exceptions include direct traffic and when coming from https://). When you click on a t.co link the site can track where you came from, known as referral traffic (interestingly it looks like Twitter also bypasses your url shortener of choice by following the redirects until it reaches a final destination and uses that link in the t.co redirect). So when you click on a t.co link posted on Twitter I can detect that that’s where you came from. Furthermore, each time a person tweets a link it gets wrapped in a unique t.co url even if that url has been shortened before, with the exception being new style retweets. This means that when someone clicks on a link tweeted by someone I can trace it back to a single person.
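
As a rough illustration of what the server (or an analytics package) is doing with that referrer information, here is a small Python sketch that counts visits per t.co link from a list of referrer strings. The referrer list is made up and the function names are hypothetical; Google Analytics does this grouping for you.

```python
from urllib.parse import urlparse


def tco_path(referrer):
    """Return the t.co path (e.g. '/wEbXrPah'), or None for any other referrer."""
    if not referrer:
        return None  # direct traffic carries no referrer
    parsed = urlparse(referrer)
    if parsed.netloc.lower() != "t.co" or parsed.path in ("", "/"):
        return None
    return parsed.path


def visits_per_link(referrers):
    """Count visits for each distinct t.co referral path."""
    counts = {}
    for referrer in referrers:
        path = tco_path(referrer)
        if path:
            counts[path] = counts.get(path, 0) + 1
    return counts


# Made-up referrer log: two t.co links, one search engine, one direct visit.
log = [
    "http://t.co/wEbXrPah",
    "http://t.co/wEbXrPah",
    "http://t.co/abc123",
    "http://www.google.com/search?q=tags",
    "",
]
counts = visits_per_link(log)
print(counts)
```

Because each tweet gets its own t.co url, each key in the result corresponds to a single tweet (plus its new style retweets).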

Let’s see how this works in practice. When I fire up my Google Analytics account and look at referral traffic I can see it’s dominated by t.co sources. Drilling down into the t.co data I can actually see how many visits each t.co link generated.

Referral source Referral path

Searching Twitter for say http://t.co/wEbXrPah allows me to trace it back to this tweet:

So we can say this tweet and its subsequent 5 retweets generated 42 visits to my blog. At this point you might be saying but this tweet above has a bit.ly link, there’s no reference to t.co. It may say bit.ly but underneath the hyperlink is t.co:

Tweet html source

For some this isn’t news. In fact, Tom Critchlow was writing about how Twitter’s t.co link shortening service is game changing – here’s why way back in August 2011. His post probably has a better explanation of what is happening and also includes a bookmarklet (appears broken after recent Analytics overhaul) which takes the t.co referral path from your Google Analytics report and searches it on Twitter to find out which person’s tweet is sending you the traffic.

I thought this was a neat idea but wanted to get the full impact of seeing the visit count associated with a named person. I tried a couple of ways to inject the data using a bookmarklet with no joy so turned to Google Spreadsheets (and in particular Google Apps Script) to marry data from Twitter with Google Analytics. So I give you:

Method 1: Quick 7 day search

*** Twitter/Google Analytics Top Distributor Sheet v1.0 ***
[File > Make a copy for your own version]

With this Google Spreadsheet I can authenticate with my Google Analytics account which then allows me to extract t.co data using the Google Analytics Core Reporting API. I then pass each t.co link to the Twitter Search API to find out who tweeted it first. This is all wrapped in a custom formula getGATwitterRef(startDate, endDate, optional numberOfResults) which generates a table of results like this:

Twitter/Google Analytics Top Distributor Sheet v1.0

So big thanks go to Brian E. Bennett (@bennettscience) for generating 16 visits, Alberto Cottica (@alberto_cottica), @futuresoup and others for also generating traffic. But who generated the 14 visits? The reason this row is blank is that, while the link is still generating traffic to my site, it was first tweeted over 7 days ago, meaning it is outwith the capabilities of the Twitter Search API.

What a terrible shame. Never mind, 9 out of 10 isn’t bad … ah, but hold on, I’ve got a Google Spreadsheet template that can archive Twitter searches (TAGS). So I give you:

Method 2: TAGS v4.0 with Google Analytics integration

*** TAGS v4.0 with Google Analytics integration ***
[File > Make a copy for your own version]

So by using a search term for your domain you can start collecting tens of thousands of tweets over months and years, which you can then query against your Google Analytics data.

‘What the flip’ you might be asking. Here’s the explanation. By using the same search query at the beginning of this post for all tweets containing ‘hawksey.info’ (and because Twitter is wrapping everything in t.co it knows these even if they began life as a bit.ly or goo.gl) I can build up a corpus of tweets containing links to my site. If you also look at this archive I’m building you’ll see in the column labelled ‘text’ there is not a bit.ly or goo.gl in sight, all the links are t.co.

So all I need to do is extract a t.co referral path I’m interested in from my Google Analytics data and find who first tweeted it in my archive giving me the number of visits that link generated.   
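
The matching step can be sketched in a few lines of Python. This is not the Apps Script in the template, just an illustration of the join: the archive rows, usernames, t.co codes and the column names `from_user`, `text` and `created_at` are all assumptions made up for the example.

```python
import re
from datetime import datetime

# Matches the code part of any t.co link in a tweet's text.
TCO_LINK = re.compile(r"https?://t\.co/(\w+)")


def first_tweeter(referral_path, archive):
    """Return who first tweeted the t.co link, or None if it isn't archived."""
    code = referral_path.strip("/")
    matches = [t for t in archive if code in TCO_LINK.findall(t["text"])]
    if not matches:
        return None
    return min(matches, key=lambda t: t["created_at"])["from_user"]


def top_distributors(ga_rows, archive):
    """Pair each (referral path, visits) row with its first tweeter, busiest first."""
    table = [(first_tweeter(path, archive), path, visits) for path, visits in ga_rows]
    return sorted(table, key=lambda row: row[2], reverse=True)


# Made-up archive: a new style retweet shares the original tweet's t.co link.
archive = [
    {"from_user": "user_b", "text": "RT @user_a: Nice post http://t.co/AAAA1111",
     "created_at": datetime(2012, 3, 2, 9, 0)},
    {"from_user": "user_a", "text": "Nice post http://t.co/AAAA1111",
     "created_at": datetime(2012, 3, 1, 8, 0)},
    {"from_user": "user_c", "text": "Worth a read http://t.co/BBBB2222",
     "created_at": datetime(2012, 3, 3, 12, 0)},
]
# Made-up Google Analytics rows: (t.co referral path, visits).
ga_rows = [("/BBBB2222", 5), ("/AAAA1111", 42), ("/CCCC3333", 14)]

for user, path, visits in top_distributors(ga_rows, archive):
    print(user, path, visits)
```

A `None` in the first column is the equivalent of the #N/A rows: the link is sending traffic but the tweet isn’t in the archive.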

So now I can say thank you @BillMew for the mystery 14 visits, thank you @TweetSmarter for the 38 visits generated from your tweet last month (you’ll see some #N/A I think because my collection went offline for a bit, working sweet now)

TAGS v4.0 with Google Analytics integration

Why

Whilst there could be a seriously creepy side to this (let’s not forget people like Google have made serious bucks knowing where you go and what you share), there are a couple of reasons why I was interested in following up this concept. One was in relation to the Learning Registry/JLeRN experiment (background info here) which is trying to create a framework beyond metadata to include activity/paradata around resources. The idea is this data can be used to provide feedback and improve discoverability of resources. So potentially you’ve got some rich data to push into a node … errm I think, maybe #haventplayedwiththespecyet

Another thought was during the OER Visualisation project I discovered that social sharing of resources appears, insert caveats, to be rare. If you could recognise and reward the people pushing content it might encourage them to distribute more (and as I highlighted in How do I ‘like’ your course? The value of Facebook recommendation there is real and measurable value in people distributing information through their networks) – the flip side to this being as soon as you start measuring something someone else will start gaming the system.

There’s also a degree of profiling you could do. If someone ends up at your resource having clicked on a link shared by A, that person may have something in common with A, so you could target additional resources to them based on what A might like.

I’m sure there are others. As I’ve shared the tools to do this it’s only fair that you share your ideas in the comments.

But there is potentially more …

Data Feed Query Explorer

I will leave you with one last thought. I haven’t mentioned much about the code, which is available and open source (my bits anyway) via the Script Editor. To help construct the Google Analytics query I used the Data Feed Query Explorer. Here’s a permalink to the main query structure I used. If you open the link and hit the ‘Authenticate with Google Analytics’ button choosing one of your own analytics ids you can see what data comes back. I’ve been conservative only pulling what I need but if you click on the ‘dimensions’ box you can see I could also be pulling where the visits were coming from, time of day, and more. All potentially valuable intelligence to give you a picture of how a resource is being shared, if you can unlock it of course.

Introducing a RSS social engagement tracker in Google Apps Script #dev8d

For my session at Dev8D I got delegates building an RSS social engagement tracker similar to PostRank (Slides here) [Note to self: Too much coding for the room. Doh!]. Initially I was going to use my Fast-tracking feedback example for this session but, forever wanting to make my life difficult, decided late on to come up with something entirely new. Part of this decision was influenced by being featured in Mashable’s 5 Essential Spreadsheets for Social Media Analytics (yeah), combined with the fact that my PostRank daily email notifications are broken.

For those not familiar with PostRank their service (now owned by Google) would allow you to enter a blog RSS feed and then they would monitor ‘social engagement’ around that feed recording tweets, likes, saves etc.

Here is my solution that does something similar:

*** FeedRankSheet Google Spreadsheet v0.1 ***

This is still very beta and doesn’t entirely work as I’d like. I’ll be making improvements based on your feedback ;)

How to use it

Enter the RSS feed of the posts and comments you want to track, blog address and optionally a comma separated list of Twitter usernames you want to remove from the search. Then open the Script Editor and Run doFeedRank (you’ll need to authorise), finally add a trigger to run daily.

What it does

Example output

Each time the script runs it gets the latest share counts for posts via sharedcount.com, combines them with the comment feed and Twitter search results and generates an email to send to designated people.

How it works

Most of the script is just data reading and writing. The clever bit is using Google Sites pages as a template for the email. Here is the page for the email wrapper and this page is the post share count template.

The thing that really surprised me is that SitesApp.getPageByUrl can get any public Google Sites page allowing you to do Page calls like .getHtmlContent() even if you don’t own it.

Things to improve

  • Exceeding maximum execution – I might need to optimise the code as I was getting timeouts when running as a trigger.
  • Deltas – It would be useful to include individual share count increases on daily updates (eg Twitter 9(+3))
  • I also have a sneaking suspicion that reading the posts from the spreadsheet rather than accessing the raw feed xml using apps script might be a problem. I need to run over a period of time to get data.
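
The ‘Deltas’ item above is simple enough to sketch. The Python below is only an illustration of the formatting I have in mind (the tracker itself is Apps Script): it assumes yesterday’s and today’s counts are held as service-to-count mappings, and the counts and the function name are made up.

```python
def format_with_deltas(previous, current):
    """Format share counts as 'Service count(+change)', omitting zero changes."""
    parts = []
    for service in sorted(current):
        count = current[service]
        delta = count - previous.get(service, 0)
        label = f"{service} {count}"
        if delta:
            label += f"({delta:+d})"  # '+d' forces an explicit sign
        parts.append(label)
    return ", ".join(parts)


# Made-up counts from two consecutive daily runs.
yesterday = {"Twitter": 6, "Facebook": 2}
today = {"Twitter": 9, "Facebook": 2, "Delicious": 1}
summary = format_with_deltas(yesterday, today)
print(summary)
```

This means storing the previous run’s counts somewhere, which in the spreadsheet would just be another column.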

Integrating Google Spreadsheet/Apps Script with R: Enabling social network analysis in TAGS

Increasingly I find myself creating Twitter hashtag archives using my TAGS Google Spreadsheet template as a means to identify who in that community has the most influence and ultimately use this intelligence to target people that might be able to help me disseminate my work. Marc Smith at the Social Media Research Foundation has a useful overview on ‘How to build a collection of influential followers in Twitter using social network analysis and NodeXL’.

I don’t go to the extreme of seeking people to follow and gaining influence with retweets, I usually just follow interesting people who follow me, but the post introduces the important concept of:

  “betweenness centrality” – a measure of how much a person acts as a bridge between others.

(betweenness centrality (BC) was a big turning point in my interest and understanding of social network analysis, a moment captured by Sheila MacNeill)
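
To give a feel for what the number means, here is a small self-contained Python sketch that calculates betweenness centrality for an undirected, unweighted network using Brandes’ algorithm. It is not what NodeXL or igraph run internally, just a compact illustration; the four-person network is made up and the scores are unnormalised.

```python
from collections import deque


def betweenness_centrality(edges):
    """Unnormalised betweenness for an undirected, unweighted graph (Brandes)."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    centrality = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search counting shortest paths from the source.
        order = []
        predecessors = {node: [] for node in graph}
        paths = {node: 0 for node in graph}
        paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)

        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in graph}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += (paths[v] / paths[w]) * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]

    # Each pair is counted from both ends in an undirected graph.
    return {node: score / 2 for node, score in centrality.items()}


# Made-up chain of four people: ann - bob - cat - dan.
scores = betweenness_centrality([("ann", "bob"), ("bob", "cat"), ("cat", "dan")])
print(scores)
```

In the chain, bob and cat sit on every shortest path between the people either side of them, so they score highest, while ann and dan at the ends bridge nobody.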

To date the only way I could calculate BC on an archive of tweets was to download the data to my desktop, run it through NodeXL and review the data. This isn’t ideal as the data becomes orphaned. I have experimented with calculating BC using Google Apps Script using a modified version of some PHP code put together by Jonathan Cummings, but kept hitting timeouts before I could get anything back.

I forgot about pursuing this angle until, that is, I saw Tony Hirst’s A Quick View Over a MASHe Google Spreadsheet Twitter Archive of UKGC12 Tweets in which he uses the statistical computing and graphing tool ‘R’ to read a spreadsheet of archived tweets and produce some quick summary views (I highly recommend you read this post and also check the contribution from Ben Marwick in the comments). Reading this post made me think: if it is that easy to read and analyse data using R, could you not also somehow push the results back?

Fortunately, and I do mean fortunately, I have no experience of R, R Script, R Studio (I like having no preconceived ideas of what new tools can do – it’s far more rewarding to throw yourself into the unknown and see if you make it out the other side), but I do know a lot about Google Apps Script giving me a destination – just no way of getting there.

The idea, I think, is ingeniously simple. Read data, as Tony did, process it in R and then use Apps Script’s ability to be published as a service to simply POST the data back to the original spreadsheet.

As that is quite complicated I’ll recap. Fetch a Google Spreadsheet as a *.csv, do something with the data and then push the data back in the same way that you post a web form (and if you skipped the link the first time POST the data back to the original spreadsheet).
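
The shape of that round trip can be sketched without R. The Python below is an illustration under several assumptions: the CSV column names (`source`, `target`), the form field names (`secret`, `sheet`, `data`) and the use of JSON for the payload are made up for the example and are not necessarily what my R script and Apps Script use. The network calls are left as comments so the sketch runs offline.

```python
import csv
import io
import json
from collections import Counter
from urllib.parse import urlencode


def read_edges(csv_text):
    """Parse the published sheet's CSV into (source, target) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["source"], row["target"]) for row in reader]


def degree_table(edges):
    """The 'do something with the data' step: count connections per person."""
    degrees = Counter()
    for source, target in edges:
        degrees[source] += 1
        degrees[target] += 1
    return [[name, count] for name, count in sorted(degrees.items())]


def build_post_body(secret, sheet, rows):
    """Encode the results the same way a web form would be submitted."""
    return urlencode({"secret": secret, "sheet": sheet, "data": json.dumps(rows)})


# In practice csv_text would come from the published spreadsheet, e.g.
# urllib.request.urlopen(csv_url).read().decode("utf-8")
csv_text = "source,target\nann,bob\nbob,cat\n"
rows = degree_table(read_edges(csv_text))
body = build_post_body("my-secret", "SNA Metrics", rows)
print(body)

# Sending it back would then be a plain form POST to the service URL, e.g.
# urllib.request.urlopen(service_url, data=body.encode("utf-8"))
```

The service on the spreadsheet side only has to check the secret, decode the payload and write the rows to the named sheet.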

Having sunk a day of my own time (and it is my own time because I get paid for the OER Visualisation project for the hours I work on it), I’m not going to go into the details of how to set up R (or in my case RStudio) to do this – hey I learned it in a couple of hours so can you – instead I’ll give you the bits and pieces you need and general instructions. Before I start you might want to see if the result is worth it so here’s a sheet of SNA stats for the #ukgc12 archive.

SNA Stats 

Playing with some test data

To make it easier I start with a partially complete dataset. The scenario is I’ve got my archive and run options 1-3 in the TAGS – Advanced menu to get an Edges sheet of friend/follower information.

  1. Open this Google Spreadsheet and File > Make a copy (this is a fully functioning, if I haven’t broken it, copy of the next version of TAGS, so if you clear the Archive and setup you can start collecting and using this with your own data).
  2. Once you’ve copied select File > Publish to the web and publish the spreadsheet
  3. In the new spreadsheet open Tools > Script editor.. and Run > Setup (this gets a copy of the spreadsheet id needed to run as a service – in the normal scenario this is collected when the user authenticates the script with Twitter)
  4. Open Share > Publish as service..  and check ‘Allow anyone to invoke’ with ‘anonymous access’, not forgetting to ‘enable service’. You’ll need a copy of the service URL for later on. Click ‘Save’
    Publish as service
  5. Back in the script editor on line 57 enter a ‘secret’ – this will prevent anyone from uploading data while in anonymous mode (you can choose to only enable the service when required for extra security).
  6. Open your install of R and load a copy of this script.
  7. There are four things to edit in this script
    1. key – spreadsheet key, the bit after https://docs.google.com/spreadsheet/ccc?key= and before the &hl… junk
    2. gid – the sheet number of the Edges sheet; unless you insert/use a different sheet this should always be 105 for a TAGS spreadsheet
    3. serviceUrl – the url you got in step 4
    4. secret -  the same secret you entered in step 5
  8. You might also need to install the packages used – most of them are standard but you may need to get igraph – used to get all social network data
  9. Run the R script – it may take some time to read and write to Google Spreadsheets so be patient

That’s it. If you go back to the spreadsheet (you may need to refresh) the SNA Metrics and Vertices sheets should be populated with data generated from R
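
If picking the key out of the address bar by eye (step 7) feels error-prone, it can be extracted programmatically. A small Python sketch, assuming the spreadsheet URL has the `ccc?key=…&hl=…` shape described above; the key value in the example is a made-up placeholder.

```python
from urllib.parse import urlparse, parse_qs


def spreadsheet_key(url):
    """Return the value of the 'key' query parameter from a spreadsheet URL."""
    query = parse_qs(urlparse(url).query)
    keys = query.get("key")
    if not keys:
        raise ValueError("no key parameter found in URL")
    return keys[0]


# Placeholder URL in the format described in step 7.
example = "https://docs.google.com/spreadsheet/ccc?key=EXAMPLEKEY123&hl=en_GB#gid=105"
key = spreadsheet_key(example)
print(key)
```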

The Apps Script Magic

Here’s the Google Apps Script snippet used to handle the data being pushed from R:

I’ve commented most of it so you can see what is happening. While Apps Script has a debugger which lets you monitor execution and variables it can’t intercept the POST so I used the original POST/GET code to dump the data into some cells then tweaked the script to read it from there to work out what needed to be done.

Final thoughts

I think this is a powerful model of reading selected data, processing it and then uploading it back to the source. I’m also only using the very basics of igraph and I’m sure much more could be done to detect neighbourhoods, clusters and more. Also I wonder if more of the friendship data collection could be done in R with TwitteR (you R people really know how to make it hard to find info/help/support for your stuff ;). Right now I can get friend/follower info for a list of 250 users.

The intriguing aspect is just how much data you can push back to Apps Script and, as there is a long list of Services, whether you could also handle binary data like chart images (perhaps down the Blob and then Document Service route, or maybe just straight into Charts).

I welcome any comments you have about this technique and particularly value any feedback (I’m not a SNA expert so if there are errors in calculation or better measures I would welcome these)

About

This blog is authored by Martin Hawksey+ JISC CETIS Learning Technology Advisor (OER Programme Support)

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 3.0 Unported License. CC-BY mhawksey