#CETIS12 Social Network Analysis & Data Visualisation: Past, Present and Future #ukoer #infovis Presentation

This week I’ll be at the CETIS Conference in Nottingham presenting some of the work done as part of the OER Visualisation Project. Here are my OER Visualisation slides (embedded below)

To add some context, there were a couple of things floating in my head when I put this together. First, you’ll see the influence of Data Driven Journalism (DDJ). Tapping in to DDJ is useful because it’s still an emerging discipline and I feel there is a rice vein of innovation and exploration around distilling data into stories so there is a lot to learn/borrow from this area. Second, David Sherlock (CETIS) kindly blogged about the conference session which includes some questions he would like answered (most of these questions were asked at Tony Hirst’s Visualisation session at Dev8D). Third, Tony’s session also highlighted that the type of visualisations used is very dependant on the audience using them (you may want something shiny to impress which removes a lot of the data or something more practical).

The presentation is in two parts, communication and data.

Communicate

Slides 4-9 are designed to show how the same base data can be rendered in different ways:

  • It starts with the snowflake video (slide 4), which is visually impressive but at the cost of removing detailed information. For example, it’s hard to see which subjects are getting the post deposits or a sense of who is making the most deposits.
  • Slide 5 tries to address the ‘who is making deposits’ question with an interactive map of Jorum submissions reconciled to location.
  • Next, (slide 6) removes the element of time, condensing over 9,000 individual records into one image (as this was generated in NodeXL there is an opportunity to highlight that there is an interactive version).
  • In slide 7, the dimension of time is reintroduced with an interactive bubble diagram. There’s an opportunity to highlight some ‘glance-ability’ features of this graph, but also raise concerns about information overload. 
  • Slide 8 is a heatmap of the data in Google Spreadsheets. At this point we are moving away from the visually impressive focusing on situated practical information.
  • Slide 9 shows how this information can be shown in a different way. This slide is also an opportunity to highlight the continual struggle between representing whilst not diluting the information
  • In slide 10 is the New York Times ‘The Jobless Rate for People Like You’ example which is included as an example of a way to win this battle. The caveat to this is visuals like this are still very bespoke needing an amount of focused development (also remembering to highlight that there are a number of libraries that can help towards getting the job done).

Communication rules

Slide 11 – Whilst I recognise there are ‘rules’ and guidance with graphics and typesetting and perhaps I would be a better person if I took time to understand them, for now I am happy in my ignorance the payback hopefully being to think differently about the problems.

Damn that dirty data

The second part of the presentation is the story behind how the data used in part one was compiled, cleaned, contextualised and combined. It’ll be a whilst stop tour of consuming OAI-PMH data in Google Refine, the issues of reconciling records to institutions, and the delights of Google Refine to clean and combine data. 

Other things

Opportunity to flag other things done as part of the project.

Summary

Mainly just to highlight the range of free/open source tools and libraries out there. May mention some others. Finally just to remind the folks that data use is a great way to validate data. There were a number of examples from the project where using the data turned up unexpected results which were traced back to issues with the datasets. 

What do you think?

So this is your opportunity to feedback on the presentation. Do you want a different focus? Are there other tools used in the project you’d like more information on? Should I have included ‘101 ways Google Spreadsheets can annoy you’?  

Introducing a RSS social engagement tracker in Google Apps Script #dev8d

For my session at Dev8D I got delegates building a RSS social engagement tracker similar to PostRank (Slides here) [Note to self: Too much coding for the room. Doh!]. Initially I was going to use my Fast-tracking feedback example for this session but forever wanting to make my life difficult decided late on to come up with something entirely new. Part of this decision was influenced by being featured in Mashable’s  5 Essential Spreadsheets for Social Media Analytics (yeah), combined with the fact that my PostRank daily email notifications are broken.

For those not familiar with PostRank their service (now owned by Google) would allow you to enter a blog RSS feed and then they would monitor ‘social engagement’ around that feed recording tweets, likes, saves etc.

Here is my solution that does something similar:

*** FeedRankSheet Google Spreadsheet v0.1 ***

This is still very beta and doesn’t entirely work as I’d like. I’ll be making improvements based on your feedback ;)

How to use it

Enter the RSS feed of the posts and comments you want to track, blog address and optionally a comma separated list of Twitter usernames you want to remove from the search. Then open the Script Editor and Run doFeedRank (you’ll need to authorise), finally add a trigger to run daily.

What it does

Example output [click to enlarge]Each time the script runs it gets the latest share counts for posts via the sharedcount.com, combines it with the comment feed and Twitter search results and generated an email to send to designated people (click on image for example output).

How it works

Most of the script is just data reading and writing. The clever bit is using Google Sites pages as a template for the email. Here is the page for the email wrapper and this page is the post share count template.

The thing that really surprised me is that SitesApp.getPageByUrl can get any public Google Sites page allowing you to do Page calls like .getHtmlContent() even if you don’t own it.

Things to improve

  • Exceeding maximum execution – I might need to optimise the code as I was getting timeouts when running as a trigger.
  • Deltas – It would be useful to include individual share count increases on daily updates (eg Twitter 9(+3))
  • I also have a sneaking suspicion that reading the posts from the spreadsheet rather than accessing the raw feed xml using apps script might be a problem. I need to run over a period of time to get data.

Dev8D Introduction to Google Apps Script Session: This is what you’ll be making (well the start of)

Today at Dev8D from 3-4pm participants of my introduction to Google Apps Script workshop will be making the beginnings of a PostRank style social engagement tracker (example report embedded below). I’ll be releasing slides and code after the session (Slides here and full code here).

OER Visualisation Project: Fin [day 40.5]

This is the final official post (a little late than expected) for the OER Visualisation Project. In this post I’ll summaries the work and try and answer the questions originally raised in the project specification.

[This post is draft until this message is removed]

Before I remove the draft message above I welcome any comments and feedback you have.

Over the course of the project there have been 17 blog posts, including this one, listed at the end and accessible via the ooh-er category. The project was tasked with two work packages: PROD Data Analysis (WP1); and Content Outputs (WP2). The nature of these packages means there is overlap in deliverables but I will attempt to deal with them separately.

PROD Data Analysis (WP1)

  • Examples of enhanced data visualisations from OER Phase 1 and 2.

This part of the project mainly focused around process rather than product. It’s rewarding to see that CETIS staff are already using some of documented processes to generate there own visualisations (David’s post | Sheila’s post). Visualisations that were produced include: OER Phase 1 and 2 maps [day 20], timelines [day 30], wordclouds [day 36] and project relationship [day 8]     

OER MapOER TimelineOER WordcloudOER Project relates force layout

  • Recommendations on use and applicability of visualisation libraries with PROD data to enhance the existing OER dataset.

The main recommendation is that visualisation libraries should be integrated into the PROD website or workflows. In particular it is perceived that it would be most useful to incorporate summary visualisations in general queries for programme/strand information. The types of visualisations that are most useful would be the automatic generation of charts already used by CETIS staff in their summaries of programme information (histograms, bubble charts and word clouds). Most of these are possible using the Google Visualisation API. This may emerge as the most preferential library given Wilbert’s existing work with the Sgvizler library. The disadvantage of the Google Visualisation API is that the appearance of the charts is very standard and investment in a more glossy solution such as raphaelJS (used in WP2) might be considered. [Although I never got around to it I’m also still interested in investigating an entirely new navigation/exploration solution based on my Guardian Tag Explorer]

  • Recommendations and example workflows including sample data base queries used to create the enhanced visualisations.

In the blog post on day 8 an example spreadsheet was created (which has been further developed) and shared for reuse. As part of this process an online document of SPARQL queries used in the spreadsheet was also created and maintained. Throughout the project this resource has been revisited and reused as part of other outputs. It has also been used to quickly filter PROD data.

This technique of importing SPARQL data has already been highlighted by CETIS staff but to move the idea forward it might be worth considering releasing a set of prepared spreadsheets to the public. The main advantage of this is there is a steep learning curve when using linked data and SPARQL queries. Having a set of spreadsheets ready for people to use should make PROD data more accessible, allowing people to explore the information in a familiar environment.

An issue to consider if promoting the PROD Datastore Spreadsheet is weather to make a live link to the PROD data or create as fixed data. For example, the mapping solutions from day 20 broke after PROD data was updated. Removing whitespace from some institution names meant that location data which was looked-up using a combination of fixed and live data failed to find the amended institution names. This is not so much an issue with PROD data but in the way the spreadsheet was developed. Consequently if CETIS were to go down the route of providing ready made Google Spreadsheets it should make clear to users that data might change at any point.  Alternatively instead of providing spreadsheets with live data CETIS might consider release versions of spreadsheets with fixed results (this could be provided for programmes or strands with have been completed). Production of the static sheets could be achieved manually by uploading a collection of csv reports (sheets for project information, technologies, standards etc) or be automated using Google Apps Script.

In day 36 an additional workflow using a combination of R, Sweave and R2HTML was also explored. Using R opens the prospect of enhanced analysis of data being directly pulled via SPARQL queries to generate standard reports. The main question is whether there would be enough return in investment to set this up, or should the focus be of more lightweight solutions useable by a wider user base 

  • Issues around potential workflows for mirroring data from our PROD database and linking it to other datasets in our Kasabi triple store.

It was hoped that stored queries could be used with the Kasabi triple store but a number of issues prevented this.  The main reason was the issue of exposing Kasabi API keys when using client side visualisations and in shared Google Spreadsheets. There was also an issue with custom APIs within Kasabi not working when they included OPTIONAL statements in queries. This issue is being investigated by Kasabi.

During the course of the project it was also discovered that this is missing metadata on comments stored as linked data. Currently general comments are the only ones to include author and timestamp information. For deeper analysis of projects it might be worth also including this data in related projects comments and comments associated with technology and standards choices [day 30b]

  • Identification of other datasets that would enhance PROD queries, and some exploration of how transform and upload them.

No other data sets were identified.

  • General recommendations on wider issues of data, and observed data maintenance issues within PROD.

Exploring PROD data has been a useful opportunity to explore it’s validity and coverage. As mentioned earlier geo data for institutions, particularly for the OER Programme, is patchy (a list of missing institutions was given to Wilbert). Another observation was whitespaces on data values prevented some lookups from working. It might be worth seeing if these could be trimmed on data entry.

Another consideration if CETIS want PROD data to be attractive to the data visualisation community is to natively provide output in csv or JSON format. This is already partially implemented in the PROD API and already available on some stored Kasabi stored queries using a csv XSLT stylesheet.

Content Outputs (WP2)

Before I mention some of the outputs it’s worth highlighting some of the processes to get there. This generally followed Paul Bradshaw’s The inverted pyramid of data journalism highlighted in day 16 of compile, clean, context and combine.

Two datasets were compiled and cleaned as part of this project: UKOER records on Jorum; and an archive of #ukoer tweets from April 2009* to January 2012 (the archive from April 2009 – March 2010 is only partial, more complete data recovered from TwapperKeeper exists for the remaining period.

UKOER records on Jorum

As the Jorum API was offline for the duration of this project an alternative method for compiling  UKOER records had to be found. This resulted in the detailed documentation of a recipe for extracting OAI service records using Google Refine [day 11]. Using Google Refine proved very advantageous as not only were ukoer records extracted, but it was possible to clean and combine the dataset with other information. In particular three areas of enhancement were achieved:

  • reconciling creator metadata to institutional names – as this data is entered by the person submitting the record there can be numerous variations in format which can make this processes difficult. Fortunately with Google Refine it was possible to extract enough data to match organisations stored in PROD data (made possible via the Kasabi Reconciliation API).
  • extract Jorum record view counts by scraping record pages – a request was made for this data but at the time wasn’t available. Google Refine was used to lookup each Jorum record page and parse the view count that is publically displayed.
  • return social share counts for Jorum records – using the sharedcount.com API social shares for a range of services (Facebook, Twitter, Google+ etc) were returned for individual Jorum records

The processed Jorum UKOER dataset is available in this Google Spreadsheet (the top 380 Jorum UKOER social share counts is available separately)

#ukoer Twitter Archive

The process of extracting an archive of #ukoer tweets for March 2010 to January 2012 was more straight forward as this data was stored on TwapperKeeper. Tweets prior to this date were more problematic as no publically complied archive could be found. The solution was to extract partial data from the Topsy Otter API (the process for doing this is still to be documented).   

The #ukoer Twitter archive is available in this Google Spreadsheet

In the following sections the visualisations produced are summarised.

  • Collections mapped by geographical location of the host institution

UKOER submissions

Title: Jorum UKOER Geo-Collections
About: Having reconciled the majority of Jorum UKOER records to an institution name, the geographic location of these institutions was obtained from PROD data. For institutions without this data a manual lookup was done. Results are published on a custom Google Map  using a modified MarkerCluster Speed Test Example [day 20]
Type: Interactive web resource
Link: http://hawksey.info/maps/oer-records.html

  • Collections mapped by subject focus/Visualisations of the volume of collections

snowflake

Title: Snowflake
About: Using data exported from Google Refine a custom log file was rendered using the open source visualisation package Gource. The visualisation shows institutional deposits to Jorum over time. Individual deposits (white dots) are clustered around Jorum subject classifications [day 11]
Type: Video
Link: http://www.youtube.com/watch?v=ekqnXztr0mU

Subject Wheel

Title: Jorum Subject Wheel
About: Generated in NodeXL this image illustrates subject deposits by institutions. Line width indicates the number of deposits from the institution to the subject area and node size indicates the number of different subject areas the institution has deposited to. For example, Staffordshire University has made a lot of deposits to HE – Creative Arts and Design and very few other subjects, while Leeds Metropolitan University has made deposits to lots of subjects. 
Type: Image/Interactive web resource   
Link: http://mashe.hawksey.info/wp-content/uploads/2012/02/SubjectCircle.jpg 
Link: http://hawksey.info/nodegl/#0AqGkLMU9sHmLdFQ2RkhMc3hvbFRUMHJCdGU3Ujh3aGc (Interactive web resource)
Link: https://docs.google.com/spreadsheet/ccc?key=0AqGkLMU9sHmLdFQ2RkhMc3hvbFRUMHJCdGU3Ujh3aGc#gid=0 (Source Data)

deposit dots

Title: Jorum Records Institutional Deposits Bubble Diagram
About: Creating a pivot report from the Jorum UKOER records in Google Spreadsheet, the data was then rendered using a modification of a raphaelJS library dots example. Bubble size indicates the number of records deposited by the institution per month.
Type: Interactive web resource
Link: http://hawksey.info/labs/raphdot.html

  • Other visualisations

Capret great circle

Title: CaPRéT ‘great circle’ Tracking Map
About: Raw CaPRéT OER tracking data was processed in Google Refine converting IP log data for target website and copier location into longitude and latitude. The results were then processed in R. The map uses ‘great circle’ lines to illustrate the location of the source data and the location of the person taking a copy [day 32].
Type: Image
Link: http://mcdn.hawksey.info/wp-content/uploads/2012/01/capret.jpg

capret timemap

Title: CaPRéT timemap
About: Using the same data from the CaPRéT ‘great circle’ Tracking Map the data was rendered from a Google Spreadsheet using a modification of a timemap.js project example. Moving the time slider highlights the locations of people copying text tracked using CaPRéT. Clicking on a map pin opens an information popup with details of what was copied [day 32]
Type: Interactive web resource
Link: http://hawksey.info/maps/CaPReT.html
Link: https://docs.google.com/spreadsheet/ccc?key=0AqGkLMU9sHmLdDd5UXFGdEJGUjEyN3M5clU1X2R5V0E#gid=0 (Source Data)

heart of ukoer
Title: The heart of #ukoer
About: Produced using a combination of Gephi and R ‘the heart of #ukoer’ depicts the friend relationships between 865 twitter accounts who have used the #ukoer hashtag since April 2009 to January 2012. The image represents over 24,000 friendships and node size indicates the persons weighted ‘betweenness centrality’ (how much of a community bridge that person is). Colours indicate internal community groups (programmatically detected). The wordclouds round the visualisation are a summary of that sub-groups Twitter profile descriptions [day 37, revisited day 40]. 
Type: Image/Interactive web resource 
Link: http://hawksey.info/labs/ukoer-community3-weighted-BC.jpg
Link: http://zoom.it/6ucv5

pulse of #ukoer

Title: The pulse of #ukoer
About: Produced using Gephi this image is a summary of the conversations between people using the #ukoer hashtag. Connecting lines are colour coded with green showing @replies, blue are @mentions and red are reweets [day 40].
Type: Image/Interactive web resource
Link: http://hawksey.info/labs/ukoer-conversation.jpg
Link: http://zoom.it/xpRG

ball of stuff

Title: Interactive ball of stuff
About: Using the same data from the ‘pulse of #ukoer’ an interactive version of the #ukoer twitter archive is rendered in the experimental TAGSExplorer. Click on nodes allows the user to see all the tweets that person has made in the archive and replay part of the conversation [day 40].
Type: Interactive web resource
Link: http://hawksey.info/tagsexplorer/?key=0AqGkLMU9sHmLdHRhaEswb0xaLUJyQnFNSTVZZmVsMFE&sheet=od6

still a pulse

Title: Is there still a pulse
About: As part of the process of preserving #ukoer tweets a number of associated graphs used to detect the health of the #ukoer hashtag were produced. These are available in the Google Spreadsheet [day 40].
Type: Spreadsheet
Link: https://docs.google.com/spreadsheet/ccc?key=0AqGkLMU9sHmLdHRhaEswb0xaLUJyQnFNSTVZZmVsMFE#gid=3

Recommendations/Observations/Closing thoughts

The project has documented a number of recipes for data processing and visualisation, but in many ways has only exposed the tip of the iceberg. It is likely with current financial constraints repository managers will increasingly be required to illustrate value for money and impact. Data analysis and visualisation can help with both aspects, helping monitor repository use, but equally be used in an intelligence mode to identifying possible gaps and proactively leveraging repository resources. It was interesting to discover a lack of social sharing of Jorum records (day 24 | day 30) and perhaps more can be done in the area of frictionless sharing (see Tony Hirst’s draft WAR bid)

This project has mainly focused on Jorum and the #ukoer Twitter hashtag, due to time constraints as well as the amount of time required to compile a useful dataset. It would be useful if these datasets were more readily available, but I imagine this is less of a problem for internal repository analysis as data is easier to access.

Towards the end of the project focus shifted towards institutional repositories, some work being done to assist University of Oxford (and serendipitously Leeds Metropolitan University). If this work is to be taken forward, and this may already be a part of the OER Rapid Innovation projects, more work needs to be done with institutional repository managers to surface the tools and recipes they need to help them continue to push their work forward. 

Whilst not a purposeful aim of this project it’s very fitting that all of tools and visualisation libraries used in this project are open source or freely available. This rich selection of high quality tools and libraries also means that all of the project outputs are replicable without any software costs.

An area that is however lacking is documented uses of these tools for OER analysis. Recipes for extracting OAI data for desktop analysis were, until this project, none existent, and as mentioned more work potentially needs to be done in this area to streamline the processes for compiling, cleaning and communicating repository data.

To help the sharing of OER data I would encourage institutions to adopted an open data philosophy including information on how this data can be accessed (the Ghent University Downloads/API page is an example of good practice). There is also a lack of activity data being recorded around OER usage. This is a well established issue and hopefully projects like Learning Registry/JLeRN can address this. It’s however worth remembering that these projects are very unlikely to be magic bullets and projects like CaPRéT still have an important role.   

Was this project a success? I would say partial. Looking back at the selection of visualisation produced I feel there could have been more. So much time was spent creating recipes for data extraction and analysis, that it left little time for visualisation. I hope what has been achieved is of benefit to the sector and it’s reassuring that outputs from this project is already being used elsewhere.    

Project Posts

OER Visualisation Project: The heart and pulse of #ukoer [day 40]

It’s the last day of the OER Visualisation Project and this is my penultimate ‘official’ post. Having spent 40 days unlocking some of the data around the OER Programme there are more things I’d like to do with the data, some loose ends in terms of how-to’s I still want to document and some ideas I want to revisit. In the meantime here are some  of the outputs from my last task, looking at the #ukoer hashtag community. This follows on from day 37 when I looked at ‘the heart of #ukoer’, this time looking at some of the data pumping through the veins of UKOER. It’s worth noting that the information I’m going to present is a snapshot of OER activity, only looking at a partial archive of information tweeted using the #ukoer hashtag from April 2009 to the beginning of January 2012, but hopefully gives you an sense of what is going on.

The heart revisited

I revisited the heart after I read Tony Hirst’s What is the Potential Audience Size for a Hashtag Community?. In the original heart nodes were sized using ‘betweenness centrality’ which is a social network metric to identify nodes which are community bridges, nodes which provide a pathway to other parts of the community. When calculating betweenness centrality on a friendship network it takes no account of how much that person may have contributed. So for example someone like John Robertson (@KavuBob) was originally ranked has having the 20th highest betweenness centrality in the #ukoer hashtag community, while JISC Digital Media (@jiscdigital) is ranked 3rd. But if you look at how many tweets John has contributed (n.438) compared to JISC Digital Media (n.2) isn’t John’s potential ‘bridging’ ability higher?

Weighted Betweenness CentrailityThere may be some research in this area, and I have to admit I haven’t had the chance to look, but for now I decided to weight betweenness centrality based on the volume of the archive the user has contributed. So John goes from ranked 20th to 3rd and JISC Digital Media goes from 3rd to 55th. Here’s a graph on the winners and losers (click on the image to enlarge).

Here is the revised heart on zoom.it (and if zoom.it doesn’t work for you the heart as a .jpg

The 'heart' of #ukoer (click to enlarge)

[In the bottom left you’ll notice I’ve included a list of top community contributors (based on weighted betweenness – a small reward for those people (I was all out of #ukoer t-shirts).]

These slides also show the difference in weighted betweenness centrality (embedded below). You should ignore the change in colour palette, the node text size is depicting betweenness centrality weight [Google presentation has come on a lot recently – worth a look at if you are sick of the clutter of slideshare]:

 

The ‘pulse’ of #ukoer

In previous work I’ve explored visualising Twitter conversations using my TAGSExplorer.  Because of the way I reconstructed the #ukoer twitter archive (a story for another day) it’s compatible with this tool so you can see and explorer the #ukoer archive of the 8300 tweets I’ve saved here. One of the problems I’m finding with this tool is it takes a while to get the data from the Google Spreadsheet for big archives.

TAGSExplorer - ballofstuffThis problem was also encountered in Sam’s Visualising Twitter Networks: John Terry Captaincy Controversy. As TAGSExplorer internally generates a graph of the conversation, rather than scratching my head on some R Script it was easy to expose this data so that it can be imported into Gephi. So now if you add &output=true to a TAGSExplorer url you get a comma separated edge list to use with you SNA package of choice (the window may be blocked as a pop-up, so you need to enable). Here is the link for the #ukoer archive with edges for replies, mentions and retweets (which generates ‘a ball of awesome stuff’ (see insert above) but will eat your browser performance)

ukoer conversation (click to enlarge)Processing the data in Gephi you get a similar ball of awesome stuff (ukoer conversation on zoom.it | ukoer conversation .jpg). What does it all mean I hear you ask. These flat images don’t tell you a huge amount. Being able to explore what was said is very powerful  (hence coming up with TAGSExplorer). You can however see a lot of mentions (coloured blue and line width indicating volume) in the centre between a small number of people. It’s also interesting to contrast OLNet top right and 3d_space mid left. OLNet has a number of green lines radiating out indicating @replies indicating they are in conversations with individuals using the #ukoer tag. This compares to 3d_space which has red lines indicating retweets suggesting they are more engaged in broadcast.

Is there still a pulse?

UKOER Community StatsWhen looking at the ‘ball of awesome stuff’ it’s important to remember that this is a depiction of over 8,000 tweets from April 2009 to January 2012. How do we know if this tag is alive and kicking or not just burned out like a dwarf star?

The good news is there is still a pulse within #ukoer, or more accurately lots of individual pulses. The screenshot to the right is an extract from this Google Spreadsheet of #UKOER. As well as including 8,300 tweets from #ukoer it also lists the twitter accounts that have used this tag. On this sheet are sparklines indicating the number of tweets in the archive they’ve made and when. At the top of the list you can see some strong pulses from UKOER, xpert_project and KavuBob. You can also see others just beginning or ending their ukoer journey.

The good news is the #ukoer hashtag community is going strong December 2011 having the most tweets in one month and the number of unique Twitter accounts using the tag has probably by now tipped over the 1,000 mark.

#ukoer community growth

There is more for you to explore in this spreadsheet but alas I have a final post to write so you’ll have to be your own guide. Leave a comment if you find anything interesting or have any questions

[If you would like so explorer both the ‘heart’ and ‘pulse’ graphs more closely I’ve upload them to my installation of Raphaël Velt's Gexf-JS Viewer (it can  take 60 seconds to render the data). This also means the .gexf files are available for download:]

OER Visualisation Project: The heart of #ukoer [day 37]

UKOER Hashtag CommunityLast week I started to play with the #ukoer hashtag archive (which has generated lots of useful coding snippets to processes the data that I still need to blog … doh!). In the meantime I thought I’d share an early output. Embedded below is a zoom.it of the #ukoer hashtag community. The sketch (HT @psychemedia) is from a partial list* of twitters (n. 865) who have used the #ukoer hashtag in the last couple of years and who they currently follow. The image represents over 24,000 friendships, the average person having almost 30 connections to other people in the community.

 

3D Heart SSD
3D Heart SSD
Originally uploaded by Generation X-Ray
Publishing an early draft of this image generated a couple of ‘it looks like’ comments (HT @LornaMCampbell @glittrgirl). To me it looks like a heart, hence the title of this post. The other thing that usually raises questions is how the colour grouping are defined (HT @ambrouk). The answer in this case is it’s  generated from a modularity algorithm which tries to automatically detect community structure.

As an experiment I’ve filtered the Twitter profile information used for each of these groupings and generated a wordcloud using R (The R script used is a slight modification of one I’ve submitted to the Twitter Backchannel Analysis repository Tony started – something else I need to blog about. The modification is to SELECT a column WHERE modclass=somthing).

Right all this post has done is remind me of my post backlog and I’ve got more #ukoer visualisation to do so better get on with it.

*it’s a partial list because as far as I know there isn’t a complete archive of #ukoer tweets. The data I’m working from is from an export from TwapperKeeper for March 2010-Jan 2012  topped up with some data from Topsy for April 2009-March 2010

OER Visualisation Project: Exploring automated reporting using linked data and R/Sweave/R2HTML [day 36]

OER Phase 1 & 2 project descriptions wordcloudI’m in the final stretch of the OER Visualisation project. Recently reviewing the project spec I’m fairly happy that I’ll be able to provide everything asked for. One of the last things I wanted to explorer was automated/semi-automated programme reporting from the PROD database. From early discussions with CETIS programme level reporting, particularly of technology and standards used by projects,  emerged as a common task. CETIS already have a number of stored SPARQL queries to help them with this, but I wondered if more could be done to optimise the process. My investigations weren’t entirely successful, and at times I was thwarted by misbehaving tools, but I thought it worth sharing my discoveries to save others time and frustration.

My starting point was the statistical programming and software environment R (in my case the more GUI friendly RStudio). R is very powerful in terms of reading data, processing it and producing data analysis/visualisations. Already CETIS’s David Sherlock has used R to produce a Google Visualisation of Standards used in JISC programmes and projects over time and CETIS’s Adam Cooper has used R for Text Mining Weak Signals, so there is some in-house skills which could have built on this idea.

Two other main factors for looking at R as a solution are:

  • the modular design of the software environment makes it easy to add functionality through existing packages (as I pointed out to David there is a SPARQL package for R which means he could theoretically consume linked data directly from PROD); and
  • R has a number of ways to produce custom reports, most notably the Sweave function allows the integration of R output in LaTeX documents allowing the generation of dynamic reports  

So potentially a useful combination of features. Lets start looking at some of the details to get this to work.

Getting data in

Attempt 1 – Kasabi custom API query

Kasabi is a place where publishers can put there data for people like me to come along and try and do interesting stuff with it. One of the great things you can do with Kasabi is make custom APIs onto linked data (like this one from Wilbert Kraan) and add one of the Kasabi authored XSLT stylesheets to get the data returned in a different format, for example .csv which is easily digestible by R.

Problem: Either I’m doing something wrong or there is an issue with the data on Kasabi or an issue with Kasabi itself because I keep getting 400 errors on queries I know work like this OER Projects with location but not when converted to an API

Attempt 2 – Query the data directly in R using the SPARQL package

As I highlight to David there is a SPARQL package for R which in theory lets you construct a query in R, collect the data and put it in a data frame.

Problem: Testing the package with this query returns: 

Error in data.frame(projectID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,  :
  arguments imply differing number of rows: 460, 436, 291, 427, 426

My assumption is the package doesn’t like empty values.

Attempt 3 – Going via a sparqlproxy

For a lot of the other PROD SPARQL work I’ve done I’ve got csv files by using the Rensselaer SPARQL proxy service. R is quite happy reading a csv via the proxy service (code included later), it just means you are relying on an external service, which isn’t really necessary as Kasabi should be able to do the job and it would be better if the stored procedures were in one place (I should also say I looked at just using an XML package to read data from Talis/Kasabi but didn’t get very far. 

Processing the data

This is where my memory gets a little hazy and I wished I took more notes of the useful sites I went to. I’m pretty sure I got started with Tony Hirst’s How Might Data Journalists Show Their Working? Sweave, I know also that I looked at Vanderbilt’s Converting Documents Produced by Sweave, Nicola Sartori’s An Sweave Tutorial, Charlie Geyer’s An Sweave Demo Literate Programming in R Reproducible Research, Greg Snow’s Automating reports with Sweave and Jim Robison-Cox’s Sweave Intro (this one included instructions on installing the MikTeX latex engine for windows which with out none of this would had worked).

The general idea with R/Sweave/LaTeX is you markup a document inserting R script which can be executed to include data tables and visualisations. Here’s a very basic of an output (I think I’ve been using R for a week now so don’t laugh) example which pulls in two data sets (Project Descriptions | Project Builds On), includes some canned text and generates a wordcloud from project descriptions and a force directed graph of project relationships.

The code used do this is embedded below (also available from here):

The main features are the \something e.g. \section which is LaTeX markup for the final document and the <<>>=  and @ R script code wrappers. I’ve even less experience of LaTeX than R so I’m sure there are many things I don’t know yet/got wrong, but hopefully you can see the potential power of the solution. Things I don’t like are being locked into particular LaTeX styles (although you can create your own) and the end product being a .pdf (as Sweave/R now go hand in hand a lot of the documentation and coding examples end up in .pdf which can get very frustrating when you are trying to copy ad paste code snippets, which also makes me wonder how accessible/screen reader friendly sweave/latex pdfs are).

Looking for something that gives more flexibility in output I turned to R2HTML which includes “a driver for Sweave allows to parse HTML flat files containing R code and to automatically write the corresponding outputs (tables and graphs)” . Using a similar style of markup (example of script here) we can generate a similar report in html. The R2HTML package generates all the graph images and html so in this example it was a case of uploading the files to a webserver. Because it’s html the result can easily be styled with a CSS sheet or opened in a word processor for some layout tweaking (here’s the result of a 60 second tweak.       

Is it worth it?

After a day toiling with SPARQL queries and LaTeX markup I’d say no, but its all very dependant on how often you need to produce the reports, the type of analysis you need to do and your anticipated audience. For example, if you are just going to present wordclouds and histograms R is probably overkill and it might be better to just use some standard web data visualisation libraries like mootools or Google Visualisation API to create live dashboards. Certainly the possibilities of R/Sweave are worth knowing about.

Integrating Google Spreadsheet/Apps Script with Google Refine to update existing spreadsheets

Working on some OER Visualisation Project work today I found I needed to get some additional information from an external web service for a Google Spreadsheet I was working on. I could have of course just used Google Apps Script, a native feature within Google Spreadsheets, to do a UrlFetch and get the information but my source data was a bit messy and timeouts on Apps Script prevent you from getting lots of data in one hit. Instead it made more sense to download the data into Google Refine, clean it up, call external web services to get additional information, then somehow get it back into the original spreadsheet.

As I recently posted Integrating Google Spreadsheet/Apps Script with R: Enabling social network analysis in TAGS it wasn’t much of a cognitive leap to try a similar trick of publishing the spreadsheet as a service to allow a basic API write. In the R example, however, I used the POST method to pass data and whilst I’m pretty sure with the right Jython libraries you could do something similar in Refine, I wanted to keep it simple so here’s my method for using GET instead.

  1. In a Google Spreadsheet open Tools >  Script editor.. and paste the following code (there is some tweaking required to where you want the data to go and the secret passphrase – in this example I want to pass back a value named geo 
  2. Once tweaked Run > Setup (this get a copy of the spreadsheet id need to run as a service)
  3. Open Share > Publish as service..  and check ‘Allow anyone to invoke’ with ‘anonymous access’, not forgetting to ‘enable service’. You’ll need a copy of the service URL for later on. Click ‘Save’
    Publish as service
  4. Back in Google Refine select a column with some data you want to pass and choose Edit column > Add column by fetching URLs…
  5. The expression you use will be something like the one below where we have the service url from step 3, the ‘secret’ to match the one set in step 1 and what ever data you want to include 
    "https://docs.google.com/macros/exec?service=AKfycbw1VL_kNYEwhqELgccMgEvM9gJ3t1zhBr6bhLs84g&secret=lemon&idx="+cells['idx'].value+"&geo="+escape(value,"url")
    Add column dialog

Note: You’ll see from the script my import included an idx number which matched a row number, but as long as I didn’t add any additional rows or change the order I could have used row.index in the expression instead.

You also might want to play around with the throttling. I had problems with my first go because this was either set too low or my values weren’t url encoded.

Instead if anyone finds this useful or if they show how to do in Jython. The amount of Google Spreadsheet work I do I’m sure I’ll use this trick again ;)  

OER Visualisation Project: Visualising CaPRéT OER tracking data [day 32] #ukoer

In May 2009 JISC CETIS announced the winners of the OER Technical Mini-Projects. These projects were designed:

to explore specific technical issues that have been identified by the community during CETIS events such as #cetisrow and #cetiswmd and which have arisen from the JISC / HEA OER Programmes

JISC CETIS OER Technical Mini Projects Call
Source :
http://blogs.cetis.ac.uk/philb/2011/03/02/jisc-cetis-oer-technical-mini-projects-call/ 
License:
http://creativecommons.org/licenses/by/3.0/ 
Author: Phil Barker, JISC CETIS

One of the successfully funded projects was CaPRéT – Cut and PAste reuse and Tracking from Brandon Muramatsu, MIT OEIT and Justin Ball and Joel Duffin, Tatemae. I’ve already touched upon OER tracking in day 24 and day 30 briefly looking at social shares of OER repository records. Whilst projects like the Learning Registry have the potential to help it still early days and tracking still seems to be an afterthought, which has been picked up in various technical briefings. CaPReT tries to address part of this problem, as stated in introduction to their final report:

Teachers and students cut and paste text from OER sites all the time—usually that’s where the story ends. The OER site doesn’t know what text was cut, nor how it might be used. Enter CaPRéT: Cut and Paste Reuse Tracking. OER sites that are CaPRéT-enabled can now better understand how their content is being used.

When a user cuts and pastes text from a CaPRéT-enabled site:

  • The user gets the text as originally cut, and if their application supports the pasted text will also automatically include attribution and licensing information.
  • The OER site can also track what text was cut, allowing them to better understand how users are using their site.

The code and other resources can be found on their site. You can also read Phil Barker’s (JISC CETIS) experience testing CaPReT and feedback and comments about the project on the OER-DISCUSS list.

One of the great things about CaPReT is the activity data is available for anyone to download (or as summaries Who’s using CaPRéT right now? | CaPRéT use in the last hour, day and week | CaPRéT use by day).

One of the challenges set to me by Phil Barker was to see what I could do with the CaPReT data. Here’s what I’ve come up with. First a map of CaPReT (great circles) usage plotting source website and where in the world some text was copied from (click on image for full scale):

capret - source target map

An an interactive timeline which renders where people copied text and pop-ups with a summary of what they copied

capret timemap

Both these examples rely on the same refined data source rendered in different ways and in this post I’ll tell you how it was done. As always it would be useful to get you feedback as to whether these visualisations are useful, things you’d improve or other ways you might use the recipes. 

How was it made – getting geo data

  1. Copied the CaPReT tabular data into Excel (.csv download didn’t work well for me, columns got mixed up on unescaped commas), and saved as .xls
  2. .xls imported in Google Refine. The main operation was to convert text source and copier IP/domains to geo data using www.ipinfodb.com. An importable set of routines can be downloaded from this gist [Couple of things to say about this -  CaPReT have a realtime map, but I couldn’t see any locations come through – if they are converting IP to geo it would be useful if this was recorded in the tabular results. IP/domain location lookups can also be a bit misleading, for example, my site is hosted in Canada, I’m typing in France and soon I’ll be back to Scotland
  3. the results were then exported to Excel and duplicates removed based on ‘text copied on’ dates and then uploaded to a Google Spreadsheet   

Making CaPReT ‘great circles’ map

 

Var1

Freq

1

blogs.cetis.ac.uk

35

2

capret.mitoeit.org

8

3

dl.dropbox.com

2

4

freear.org.uk

43

5

translate.googleusercontent.com

1

6

www.carpet-project.net

4

7

www.icbl.hw.ac.uk

1

8

www.mura.org

19

This was rendered in RStudio using Nathan Yau’s (Flowing Data) How to map connections with great circles. The R code I used is here. The main differences are reading the data from a Google Spreadsheet and handling the data slightly differently (Who would have thought as.data.frame(table(dataset$domain)) would turn the source spreadsheet into (#stilldiscoveringR) –>

I should also say that there was some post production work done on the map. For some reason some of the ‘great circles’ weren’t so great and wrapped around the map a couple of times. Fortunately these anomalies can easily be removed using Inkscape (and while I was there added a drop shadow.

 

capret before post production

Making CaPReT ‘timemap’

Whilst having a look around SIMILE based timelines for day 30 I came across the timemap.js project which:

is a Javascript library to help use online maps, including Google, OpenLayers, and Bing, with a SIMILE timeline. The library allows you to load one or more datasets in JSON, KML, or GeoRSS onto both a map and a timeline simultaneously. By default, only items in the visible range of the timeline are displayed on the map.

Using the Basic Example, Google v3 (because it allows custom styling of Google Maps) and Google Spreadsheet Example I was able to format the refined data already uploaded to a Google Spreadsheet and then plug it in as a data source for the visualisation (I did have a problem with reading data from a sheet other than the first one, which I’ve logged as an issue including a possible fix).

A couple of extras I wanted to do with this example is also show and allow the user to filter based on source. There’s also an issue of scalability. Right now the map is rendering 113 entries, if CaPReT were to take off the spreadsheet would suddenly fill up and the visualisation will probably grind to a halt. 

[I might be revisiting timemap.js as they have another example for a temporal heatmap, which might be could for showing Jorum UKOER deposits by institution.] 

So there you go, two recipes for converting IP data into something else. I can already see myself using both methods in other aspects of the OER Visualisation and other project. And all of this was possible because CaPReT had some open data.

I should also say JISC CETIS have a wiki on Tracking OERs: Technical Approaches to Usage Monitoring for UKOER and Tony also recently posted Licensing and Tracking Online Content – News and OERs.

What I’ve starred this month: January 28, 2012

Here's some posts which have caught my attention this month:

Automatically generated from my Diigo Starred Items.

About

This blog is authored by Martin Hawksey+

Twitter Friendviz

The visualisation below is the network of people I've mentioned or chatted with on Twitter recently (refreshed daily) and how we are connected. You can use your mouse to pan and scroll.

I made this

The MASHezine (tabloid)

It's back! A tabloid edition of the latest posts in PDF format (complete with QR Codes). Click here to view the MASHezine

Preview powered by:
Bluga.net Webthumb

The MASHebook

You can also download this post as:

Subscribe to monthly email digest of posts

Loading...Loading...


Subscribe to per post email updates

Enter your email address:

Delivered by FeedBurner

Creative Commons Licence
Unless otherwise stated this work is licensed under a Creative Commons Attribution 2.5 UK: Scotland License