
For a while at JISC CETIS we’ve been keeping an eye on an open source platform called Booktype, designed to help in the creation of books for print and electronic distribution in a range of formats.

As part of this week’s Dev8Ed, Adam Hyde, who is the project lead on Booktype, gave an overview of the project highlighting some of the technical wizardry. Kirsty Pitkin has posted an overview of the session here. Because Dev8Ed was an ‘unconference’ event there was an opportunity for Adam to put on one of his other hats and talk about Book Sprints.

A Book Sprint brings together a group to produce a book in 3-5 days. There is no pre-production and the group is guided by a facilitator from zero to published book. The books produced are high quality content and are made available immediately at the end of the sprint via print-on-demand services and e-book formats. - www.booksprints.net

You can read more about book sprints from the official site, but I thought it was worth sharing some of my notes and reflections on Adam’s session.

Who is already book sprinting?

Book sprints are already widely used by FLOSS Manuals, a community project to produce manuals for free and open source software.

Creating the right environment

Adam stressed that people should meet in a ‘real space’ for the duration of the sprint, which is usually 3-5 days. The space is usually a shared house where people can work, sleep, prepare food and eat. As well as the physical space, mental preparation is deliberately light. Avoiding the traditional publication model as a mindset appears to be key. Adam also mentioned that pre-preparing a structure can make the process harder, as more time is spent explaining it to team members than would be spent collaboratively working it out on the first day. He also noted that each day you should start work at 9 and finish at 5.

Timetable

Each day splits around lunch into a morning (am) and afternoon (pm) session:

  • Mon: (am) Table of contents – sometimes takes longer. (pm) Critical point – getting people into the creative flow (finding the right author for each chapter is key – look at what people are interested in).
  • Tue: (am) Start with a review – show text, discuss. (pm) Iterative process – discuss, write, move on; switching roles (proof, tidy).
  • Wed: (am) Iterative process – discuss, write, move on; switching roles (proof, tidy). (pm) Finish writing new chapters.
  • Thu: (am) Iterative process – discuss, write, move on; switching roles (proof, tidy). (pm) No new content.
  • Fri: Finish up layout.

 

Picture copyright Adam Hyde, flossmanuals.net
[License: GNU GPL2]

Above is a rough timetable for a sprint. To elaborate slightly, the first morning is spent all working together on a table of contents. Use post-it notes to write areas to cover, groupings, conflicting terms, what's missing, etc. Get people writing things down as quickly as possible. Once this has been drafted the facilitator has a key responsibility in assigning the chapters to the right people, using cues from the TOC session, like someone with particular knowledge dealing with a specific chapter. The facilitator has to have a strong hand but doesn’t have to be a topic specialist; they need to drive production forward. Chapters don’t have to be done in chronological order, the main thing is to get things rolling. The facilitator should encourage discussion, and if someone is struggling with something move them on, but don’t pass partial chapters to other people as it slows things down. At 5 all stop and relax.

Tuesday starts with a review. Text is shared and discussed. This iterative process, which includes breaking up tasks with switched roles, continues into Wednesday. On Thursday no new chapters are written and existing work is tidied up. On Friday the focus is finishing and layout.

Adam mentioned one technique for removing structural roadblocks was printing the chapters and laying them on the floor, giving people scissors and markers to let them do a manual cut’n’paste job. In writing this post I found other tips, case studies and material here.

So book sprints look like a great way to get content out. At JISC CETIS we are in the early stages of planning our own book sprint, so hopefully soon I’ll be able to share my personal experiences of some rapid content development. One thing I’m interested to find out is if the technique suits particular disciplines. Do you think a small group of academics could publish a textbook on something like ‘introduction to microbiology’ in 5 days? Is this a way JISC should fund some content?

[Update: Some comment on this on Google+]



Last week was the latest event in the DevCSI series, Dev8Ed. As part of the event I put my name forward for a session on ‘hacking stuff together with Google Spreadsheets’. The session appeared to go down well as I was asked to repeat it on the second day and, along with Alex Bilbie’s HTML5/CSS3 session, it got lots of votes for the ‘best of Dev8Ed’.

It was a bit weird talking to a room of mainly hardcore coders about hacking stuff with spreadsheets, but hopefully they got an insight into a product that is quickly losing its shackles as a glorified calculator and becoming something else (one of my favourite blogs right now is Bruce Mcpherson’s Excel Ramblings, which as well as slipping into Google Apps Script has some wonderful posts on subverting MS Excel).

Below is the video of the first (shorter) version of my presentation, followed by slides hosted on Slideshare and Google Drive, and here is the delicious stack of links.


Hacking stuff together with Google Spreadsheets


Here is some text I prepared for a possible Google Apps Developer blog guest post. It doesn’t look like it’s going to get published so rather than letting it go to waste I thought I’d publish here:


Martin Hawksey is a Learning Technology Advisor for the JISC funded Centre for Educational Technology and Interoperability Standards (JISC CETIS) based in the UK. Prior to joining JISC CETIS, and in his spare time, Martin has been exploring the use of Google Apps and Apps Script for education. In this post Martin highlights some features of a Google Apps Script solution which combines Google Spreadsheet and Google Documents to speed up and standardise personal feedback returned to students at Loughborough College.

One of the things that drew me to Apps Script over two years ago was the ease with which you could interact with other Google services. I also found the combination of Google Spreadsheets and a coding syntax I recognised ideal as a ‘hobbyist’ programmer.

Late last year when I was approached by Loughborough College to take part in their ‘Fast Tracking Feedback’ project, I saw it as an ideal opportunity to get staff using Apps Script and showcase the possibilities of Apps Script to the Google Apps for Education community.

The goal of the project was to produce a mechanism that allows tutors to input assignment grades using a custom UI that mirrors the final feedback sheet, or to enter details directly into a Google Spreadsheet. These details are then pushed out as individually personalised Google Documents shared with each student. This sounds relatively simple, but the complication is that each assignment needs to map to a predefined set of rubrics which vary between units. For example, in one course alone there are over 40 units, and every unit can be assessed using multiple assignments with any combination of predefined criteria across pass, merit and distinction.
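To make that concrete, the mapping for a single assignment might look something like this (a purely hypothetical illustration of the shape of the data, not the project’s actual structure):

// hypothetical shape of one assignment's criteria mapping
var assignment = {
  unit: 'Unit 18 - Database Design', // one of 40+ units in the course
  criteria: [
    {code: 'P1', grade: 'pass',        description: 'Explain ...'},
    {code: 'M2', grade: 'merit',       description: 'Compare ...'},
    {code: 'D1', grade: 'distinction', description: 'Evaluate ...'}
  ]
};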

Below is an example student feedback form highlighting the regions that are different for each assignment.


The video below shows a demonstration of how the current version of the ‘Fast Tracking Feedback’ system is set up and used:

Solution highlights

A number of Apps Script Services have been used as part of this project. Let’s look at how some of these have been implemented.

DocList Service – The self-filing Google Spreadsheet

The eventual plan is to rollout the Fast Tracking Feedback system to teaching teams across the College. To make the life of support staff easier it was decided to use a common filing structure. Using a standardised structure will help tutors stay organised and aid creation of support documentation.

When a tutor runs the setup function on a new feedback spreadsheet it checks that the correct folder structure exists (creating it if not) and moves the current spreadsheet into the pre-defined collection.


Self-generating folder structure and organization

The code that does this is:

// code to generate folder structure and move spreadsheet into right location
// + ROOT_FOLDER
// |- SPREADSHEET_FOLDER
// |- DRAFT_FOLDER
// |- RELEASED_FOLDER
var rootFolder = folderMakeReturn(ROOT_FOLDER); // get the system root folder (if it doesn't exist, make it)
// create/get draft and release folders
var draftFolder = folderMakeReturn(DRAFT_FOLDER, rootFolder, ROOT_FOLDER+"/"+DRAFT_FOLDER);
var releaseFolder = folderMakeReturn(RELEASED_FOLDER, rootFolder, ROOT_FOLDER+"/"+RELEASED_FOLDER);
var spreadsheetFolder = folderMakeReturn(SPREADSHEET_FOLDER, rootFolder, ROOT_FOLDER+"/"+SPREADSHEET_FOLDER);

// move spreadsheet to the spreadsheet folder
var file = DocsList.getFileById(SpreadsheetApp.getActiveSpreadsheet().getId());
file.addToFolder(spreadsheetFolder);

// function to see if a folder exists in DocsList and return it
// (optionally, if it doesn't exist then make it)
function folderMakeReturn(folderName, optFolder, optFolderPath){
  try {
    // fetch by full path if given, otherwise by name
    if (optFolderPath != undefined){
      var folder = DocsList.getFolder(optFolderPath);
    } else {
      var folder = DocsList.getFolder(folderName);
    }
    return folder;
  } catch(e) {
    // folder doesn't exist yet, so create it (at the root or inside optFolder)
    if (optFolder == undefined) {
      var folder = DocsList.createFolder(folderName);
    } else {
      var folder = optFolder.createFolder(folderName);
    }
    return folder;
  }
}

UI Service – Hybrid approach

A central design consideration was to make the Fast Tracking Feedback system easy for College staff to support and change. Consequently, wherever possible the Apps Script GUI Builder was used to create as much of the user interface as possible. Because of the dynamic nature of the assessment rubrics, part of the form is built in code by selecting an element holder and adding labels, select lists and textareas. Other parts of the form, like the student information at the top, can be added and populated with data by using the GUI Builder to insert textfields which are named using normalized names matching the spreadsheet column headers. The snippet of code that does this is:

app.getElementById(NORMHEADER[i]).setText(row[NORMHEADER[i]]);

Where NORMHEADER is an array of the normalized spreadsheet column names and row is a JavaScript Object of the row data generated based on the Reading Spreadsheet data Apps Script Tutorial.
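To illustrate the dynamic part, here is a minimal sketch (not the project’s actual code) of how criteria widgets might be added to a panel placed in the GUI Builder; the ‘criteriaHolder’ id and the criteria array are hypothetical names:

function addCriteriaRows(app, criteria) {
  // panel placed in the GUI Builder to hold the dynamic rubric rows
  var holder = app.getElementById('criteriaHolder');
  for (var i = 0; i < criteria.length; i++) {
    // label describing the criterion (e.g. "P1 - Explain ...")
    holder.add(app.createLabel(criteria[i].code + ' - ' + criteria[i].description));
    // select list to record whether the criterion was achieved
    holder.add(app.createListBox().setName('grade_' + criteria[i].code)
                  .addItem('Achieved').addItem('Not achieved'));
    // textarea for the tutor's comment against this criterion
    holder.add(app.createTextArea().setName('comment_' + criteria[i].code));
  }
  return app;
}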

Hybrid UI construction using GUI Builder and coding

Document Services – Master and custom templates

The process for filling in personalized feedback forms has three main steps. First a duplicate of the Master Template is made, giving it a temporary name (DocsList Service). Next the required assessment criteria are added to the form using the Document Services, mainly using the TableCell Class. Parts of the document that are going to be filled with data from the spreadsheet are identified using a similar technique to the Apps Script Simple Mail Merge Tutorial. Finally for each student the assignment-specific template is duplicated and filled with their personalised feedback.

if (email && (row.feedbackLink =="" || row.feedbackLink == undefined)){
  // Get document template, copy it as a new temp doc, and save the Doc's id
  var copyId   = DocsList.getFileById(newTemplateId)
                         .makeCopy(file_prefix+" - "+email)
                         .getId();
  var copyDoc  = DocumentApp.openById(copyId);
  // move doc to tutors folder
  var file = DocsList.getFileById(copyId);
  var folder = DocsList.getFolder(ROOT_FOLDER+"/"+DRAFT_FOLDER);
  file.addToFolder(folder);

  // select the document body
  var copyBody = copyDoc.getActiveSection();

  // find the editable placeholder keys in the document
  var keys = createKeys(copyDoc);

  // loop through elements replacing text with values from spreadsheet
  for (var j in keys) {
    var text = keys[j].text;
    var replacementText = ""; // set the default replacement text to blank
    if (row[keys[j].id] != undefined){ // if the column value is defined get the text
      replacementText = row[keys[j].id];
    }
    copyBody.replaceText('{%'+text+'%}', replacementText); // replace the placeholder
  }
  copyDoc.saveAndClose();

  // create a link to the document in the spreadsheet
  FEEDSHEET.getRange("B"+(parseInt(i)+startRow)).setFormula('=HYPERLINK("'+file.getUrl()+'", "'+copyId+'")');
  FEEDSHEET.getRange("C"+(parseInt(i)+startRow)).setValue("Draft");
  // you can do other things here like email a link to the document to the student
}

Currently the system is configured to place generated feedback forms into a draft folder. Once the tutor is happy for the feedback to be released either individual or class feedback forms are distributed to students from a menu option in the feedback spreadsheet for the assignment, a record being kept of the status and location of the document.
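Purely as an illustration (a hedged sketch, not the project’s actual release code, and assuming DocsList’s addViewer for sharing), distributing a single form might look something like this, reusing the folder constants from earlier; releaseFeedback and the row handling are hypothetical:

function releaseFeedback(copyId, studentEmail, rowIndex) {
  var file = DocsList.getFileById(copyId);
  // move the form from the draft to the released folder
  file.addToFolder(DocsList.getFolder(ROOT_FOLDER + "/" + RELEASED_FOLDER));
  file.removeFromFolder(DocsList.getFolder(ROOT_FOLDER + "/" + DRAFT_FOLDER));
  // share the document with the student (view only)
  file.addViewer(studentEmail);
  // record the new status next to the document link
  FEEDSHEET.getRange("C" + rowIndex).setValue("Released");
}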

Easy record keeping

Next steps/Get the code

The Fast Tracking Feedback System is currently being piloted with a small group of staff at Loughborough College. Comments from staff will be used to refine the system over the next couple of months. The current source code and associated template files are available from here.

The Fast Tracking Feedback project is funded by the UK’s LSIS Leadership in Technology (LIT) grant scheme.



Another post related to my ‘Hacking stuff together with Google Spreadsheets’ session at Dev8eD (A free event for building, sharing and learning cool stuff in educational technology for learning and teaching!!) next week. In this example Google Apps Script is used to create a custom user interface for Google Spreadsheets, allowing tutors to enter feedback and grades based on individual assessment criteria pulled from a central data source (another Google Spreadsheet). The system then generates personalised feedback forms (Google Documents) based on the data and distributes them to students.


As part of my work outwith JISC CETIS I have been helping staff at Loughborough College with their LSIS funded Fast Tracking Feedback project, creating a system that standardises and speeds up the return of assignment feedback to students. This project has generated a number of outputs for the wider community including training material and some code snippets (sending free SMS | generating Google Documents from Spreadsheets).

As my official involvement in the project comes to a close there is another chunk of code and resources to push out into the wild. It’s a complete copy of the beta system currently being piloted with staff at Loughborough College. If you want to get an idea of how it works, here’s a short video demonstrating the system which I did as a lightning talk at GEUG12.

The code

If you want to pick over the code for this I’ve dumped a copy in github. This is more for reference as the code makes use of the Apps Script GUI Builder for parts of the interface, which can’t be extracted from the Spreadsheet. For a functional version you’ll need to make a copy of the four documents linked to below (followed by some instructions on setup and usage). I should also point out that this system has been built around the British BTEC qualifications. An example of the assessment and grading criteria is on page 3 of this document. Hopefully there is enough reusable code for other qualification systems.

Files you’ll need

  • Master Spreadsheet – this is the main document with all the Apps Script in it.
  • Master Template (you’ll need to File > Make a copy) – this is a Google Document used as a template for the form
  • Criteria Sheet – spreadsheet of units and courses with the related assessment criteria/rubric
  • Student Lookup – spreadsheet of student full names and related Google Apps ids (used because the Apps Script Group Services can only return ids and email addresses)

The basic setup

  1. Place all four copied files into a folder in Google Docs (you can name the folder anything you like).
  2. Change the share setting on the folder so that either ‘Anyone with the link’ or ‘People at your Google Apps domain with the link’ can view.
  3. Open your copy of the Master Spreadsheet and open Tools > Script Editor…
  4. On lines 17-19 copy and paste the document ids/keys for the Master Template, Criteria Sheet and Student Lookup (see the sketch after this list). You can get these by opening the documents and looking at the highlighted bits of the browser url.
    Where to find spreadsheet/document keys
  5. From the Script Editor you can also open File > Build a user interface…, open the importStudentList component, and click on the ‘Enter the group …’ label to edit its text in the property pane on the right-hand side (and similarly for the textfield beneath it). There’s also the option to customise/add logos to the entryForm GUI.
  6. In the Criteria Sheet create your list of units/courses and associated assessment criteria
  7. In the Student Lookup sheet import a list of Google Apps Ids and names
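For illustration, the three lines you are editing in step 4 will look something like this (the variable names here are made up – use whatever names the script actually has on lines 17-19):

var MASTER_TEMPLATE_ID = 'document-key-of-your-copy-of-the-master-template';
var CRITERIA_SHEET_KEY = 'spreadsheet-key-of-your-copy-of-the-criteria-sheet';
var STUDENT_LOOKUP_KEY = 'spreadsheet-key-of-your-copy-of-the-student-lookup';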

There are additional options in the code to change folder names and the pass/merit/distinction colouring.
That’s it. Enjoy, and any of your thoughts are welcomed in the comments (I can’t make any guarantees to respond to all).

The Fast Tracking Feedback project is funded by the UK’s LSIS Leadership in Technology (LIT) grant scheme.

Another post related to my ‘Hacking stuff together with Google Spreadsheets’ (other online spreadsheet tools are available) session at Dev8eD (A free event for building, sharing and learning cool stuff in educational technology for learning and teaching!!) next week. This time an example to demonstrate importHtml. Rather than reinventing the wheel I thought I’d revisit Tony Hirst’s Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets (hmm, are we allowed to use the word ‘Olympic’ in the same sentence as Google as they are not an official sponsor ;s).

Almost 4 years on the recipe hasn’t changed much. The Winter Olympics 2010 medals page on wikipedia is still there and we can still use the importHTML formula to grab the table [=importHTML("http://en.wikipedia.org/wiki/2010_Winter_Olympics_medal_table","table",3)]

The useful thing to remember is that importHtml and its cousins importFeed, importXML, importData, and importRange create live links to the data, so if the table on wikipedia were to change the spreadsheet would also eventually update.

Where I take a slight detour from the recipe is that Google now have a chart heatmap that doesn’t need ISO country codes. Instead it is happy to try to resolve country names.

heatmap missing data

Once the data is imported from Wikipedia, if you select Insert > Chart and choose heatmap, using the Nation and Total columns as the data range, you should get a chart similar to the one shown to the right. The problem with this is it’s missing most of the countries. To fix this we need to remove the country codes in brackets. One way to do this is to trim the text from the left up to the first bracket ‘(’. This can be done using a combination of the LEFT and FIND formula.

In your spreadsheet at cell H2 if you enter =LEFT(B2,FIND("(",B2)-2) this will return all the text in ‘Canada (CAN)’ up to ‘(‘ minus two characters to exclude the ‘(‘ and the space. You could manually fill this formula down the entire column but I like using the ARRAYFORMULA which allows you to use the same formula in multiple cells without having to manually fill it in. So our final formula in H2 is:

=ARRAYFORMULA(LEFT(B2:B27,FIND("(",B2:B27)-2))

Using the new column of cleaned country names we now get our final map

Interactive map: Click for interactive version

To recap, we used one of the import formulas to pull live data into a Google Spreadsheet, stripped some unwanted text and generated a map. Because all of this is sitting in the ‘cloud’ it’ll quite happily refresh itself if the data changes.

The final spreadsheet used in this example is here

Tony has another Data Scraping Wikipedia with Google Spreadsheets example here


Next week I’ll be presenting at Dev8eD (A free event for building, sharing and learning cool stuff in educational technology for learning and teaching!!) doing a session on ‘Hacking stuff together with Google Spreadsheets’ – other online spreadsheet tools are available. As part of this session I’ll be rolling out some new examples. Here’s one I’ve quickly thrown together to demonstrate the UNIQUE and FILTER spreadsheet formulas. It’s yet another example of me visiting the topic of electronic voting systems (clickers). The last time I did this it was gEVS – An idea for a Google Form/Visualization mashup for electronic voting, which used a single Google Form as a voting entry page and rendered the data using the Google Chart API on a separate webpage. This time round everything is done in the spreadsheet, which makes it easier to use/reuse. Below is a short video of the template in action followed by a quick explanation of how it works.

Here’s the:

*** Quick Clicker Voting System Template ***

The instructions on the ‘Readme’ tell you how to set it up. If you are using this on a Google Apps domain (Google Apps for Education), then it’s possible to also record the respondent’s username.

record the respondent’s username

All the magic is happening in the FILTERED sheet. In column A cell 1 (which is hidden) there is the formula =UNIQUE(ALL!C:C). This returns a list of unique question ids from the ALL sheet. If you now highlight cell D2 of the FILTERED sheet and select Data > Validation you can see these values are used to create a select list.

create a select list

The last bit of magic is in cells D4:D8. The first half of the formula [IF(ISNA(FILTER(ALL!D:D,ALL!C:C=$D$2,ALL!D:D=C4))] checks if there is any data. The important bit is:

COUNTA(FILTER(ALL!D:D,ALL!C:C=$D$2,ALL!D:D=C4))

This FILTERs column D of the ALL sheet using the condition that column C of the ALL sheet matches what is in D2 and column D matches the right response option. On its own this formula would return the rows of data that match the query, so if there are three A responses for a particular question, three As would be inserted, one on each row. All we want is the number of rows the filter would return, so it is wrapped in COUNTA, which counts the values in the returned array.
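Putting the ISNA guard and the COUNTA together, the complete formula in D4 is presumably along these lines (reconstructed from the two halves quoted above):

=IF(ISNA(FILTER(ALL!D:D,ALL!C:C=$D$2,ALL!D:D=C4)),0,COUNTA(FILTER(ALL!D:D,ALL!C:C=$D$2,ALL!D:D=C4)))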

Simple, yes?


In the original JISC OER Rapid Innovation call one of the stipulations, due to the size and duration of grants, is that the main reporting process is blog-based. Amber Thomas, who is the JISC Programme Manager for this strand and a keen blogger herself, has long supported projects adopting open practices, blogging progress as they go. Brian Kelly (UKOLN) also has an interest in this area with some posts including Beyond Blogging as an Open Practice, What About Associated Open Usage Data?

For the OERRI projects the proposal discussed at the start-up meeting was that projects adopt a taxonomy of tags to indicate key posts (e.g. project plan, aims, outputs, nutshell etc.). For the final report projects would then compile all posts with specific tags and submit them as an MS Word or PDF document.

There are a number of advantages to this approach, one of which, for people like me anyway, is that it exposes machine-readable data that can be used in a number of ways. In this post I’ll show how I’ve created a quick dashboard in Google Spreadsheets which takes a list of blog RSS feeds and filters for specific tags/categories. Whilst I demonstrate this with the OERRI projects, the same technique could be used in other scenarios, such as tracking student blogs. As part of this solution I’ll highlight some of the issues/affordances of different blogging platforms and introduce some future work to combine post content using a template structure.

Screenshot of OERRI post dashboard

The OERRI Project Post Directory

If you are not interested in how this spreadsheet was made and just want to grab a copy to use with your own set of projects/class blogs then just:

*** open the OERRI Project Post Directory ***
File > Make a copy if you want your own editable version

The link to the document above is the one I’ll be developing throughout the programme so feel free to bookmark the link to keep track of what the projects are doing.

The spreadsheet is structured so that the tags/categories the script uses to filter posts are in cells D2:L2, and urls are constructed from the values in columns O-Q. The basic technique being used here is building urls that look for specific posts and returning links (made pretty with some conditional formatting).

Blogging platforms used in OERRI

So how do we build a url to look for specific posts? With this technique it comes down to whether the blogging platform supports tag/category filtering, so let’s first look at the platforms being used in OERRI projects.

This chart (right) breaks down the blogging platforms. You’ll see that most (12 of 15) are using WordPress in two flavours: ‘shared’, indicating that the blog is also a personal or team blog containing other posts not related to OERRI, and ‘dedicated’, set up entirely for the project.

The 3 other platforms are 2 MEDEV blogs and the OU’s project on Cloudworks. I’m not familiar with the MEDEV platform and only know a bit about Cloudworks, so for now I’m going to ignore these and concentrate on the WordPress blogs.

WordPress and Tag/Category Filtering

One of the benefits of WordPress is you can get an RSS feed for almost everything by adding /feed/ or ?feed=rss2 to urls (other platforms also support this; I have a vague recollection of doing something similar in Blogger(?)). For example, if you want a feed of all my Google Apps posts you can use http://mashe.hawksey.info/category/google-apps/feed/.

Even better is you can combine tags/categories with a ‘+’ operator so if you want a feed of all my Google Apps posts that are also categorised with Twitter you can use http://mashe.hawksey.info/category/google-apps+twitter/feed/.

So to get the Bebop ‘nutshell’ categorised post as a RSS item we can use: http://bebop.blogs.lincoln.ac.uk/category/nutshell/feed/

Looking at one of the shared WordPress blogs, to get the ‘nutshell’ from RedFeather you can use: http://blogs.ecs.soton.ac.uk/oneshare/tag/redfeather+nutshell/feed/

Using Google Spreadsheet importFeed formula to get a post url

The ‘import’ functions in Google Spreadsheet must be my favourites and I know lots of social media professionals who use them to pull data into a spreadsheet and produce reports for clients from the data. With importFeed we can go and see if a blog post under a certain category exists and then return something back, in this case the post link. For my first iteration of this spreadsheet I used the formula below:

importFeed formula
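The screenshot of the formula hasn’t survived here, but it was along these lines (the cell references are illustrative – in the real sheet the feed url is built from the values in columns O-Q and the tag in row 2):

=IFERROR(IMPORTFEED($O3&D$2&"/feed/","items url",FALSE,1),"")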

This works well but one of the drawbacks of importFeed is we can only have a maximum of 50 of them in one spreadsheet. With 15 projects and 9 tags/categories the maths doesn’t add up.

To get around this I switched to Google Apps Script (the macro-like scripting for Google Spreadsheets I write a lot about). This doesn’t have an importFeed function built in, but I can do a UrlFetch and parse the XML. Here’s the code which does this (included in the template):
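The embedded code hasn’t survived here; below is a minimal reconstruction of what it does (the original used the older Xml service, whereas this sketch uses current Apps Script services, so treat it as a sketch rather than the template’s exact code):

function fetchUrlfromRSS(aUrl) {
  if (!aUrl) return "";
  var cache = CacheService.getScriptCache();
  var cached = cache.get(aUrl); // return the cached link if we fetched this feed recently
  if (cached != null) return cached;
  var link = "";
  try {
    var response = UrlFetchApp.fetch(aUrl);
    var doc = XmlService.parse(response.getContentText());
    // the first <item> in the feed is the matching post (if any)
    var item = doc.getRootElement().getChild('channel').getChild('item');
    if (item) link = item.getChild('link').getText();
  } catch(e) {
    // feed doesn't exist or the post isn't tagged yet, so return nothing
  }
  cache.put(aUrl, link, 3600); // cache for an hour to stay inside UrlFetch quotas
  return link;
}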

Note this code also uses the Cache Service to improve performance and make sure I don’t go over my UrlFetch quota.

We can call this function like any other spreadsheet formula using ‘=fetchUrlfromRSS(aUrl)’.

Trouble at the tagging mill

So we have a problem getting data from non-WordPress blogs, which I’m quietly ignoring for now; the next problem is people not tagging/categorising posts correctly. For example, I can see Access to Math have 10 posts including a ‘nutshell’, but none of these are tagged. From the machine side there’s not much I can do about this, but at least from the dashboard I can spot something isn’t right.

Tags for a template

I’m sure once projects are politely reminded to tag posts they’ll oblige. One incentive might be to say that if posts are tagged correctly then the code above could easily be extended to pull not just post links but the full post text, which could then be used to generate the project’s final submission.

Summary

So stay tuned to the OERRI Project Post Directory spreadsheet to see if I can incorporate MEDEV and Cloudworks feeds, and also if I can create a template for final posts. Given Brian’s post on usage data mentioned at the beginning, should I also be tracking post activity data on social networks or is that a false metric?

I’m sure there was something else but it has entirely slipped my mind …

BTW here’s the OPML file for the RSS feeds of the blogs that are live (also visible here as a Google Reader bundle)


A couple of weeks ago it was Big Data Week, “a series of interconnected activities and conversations around the world across not only technology but also the commercial use case for Big Data”.

big data consists of data sets that grow so large and complex that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing – Wikipedia

In O’Reilly Radar there was a piece on Big data in Europe which had Q&A from Big Data Week founder/organizer Stewart Townsend, and Carlos Somohano both of whom are big in Big Data.

Maybe I’m being naïve, but I was surprised that there was no reference to what the university/research sector is doing with handling and analysing large data sets. For example, at the Sanger Institute alone each of their DNA sequencers is generating 1 terabyte (1024 gigabytes) of data a day, and they are storing over 17 petabytes (17 million gigabytes), which is doubling every year.

Those figures trip off my tongue because last week I was at the Eduserv Symposium 2012: Big Data, Big Deal?, which had many examples of how institutions are dealing with ‘big data’. There were a couple of things I took away from this event, like the prevalence of open source software as well as the number of vendors wrapping open source tools with their own systems to sell as a service. Another clear message was a lack of data scientists who can turn raw data into information and knowledge.

As part of the Analytics Reconnoitre we are undertaking at JISC CETIS, in this post I want to summarise some of the open source tools and ‘as a service’ offerings in the Big Data scene.

[Disclaimer: I should say first that I’m coming to this area cold. I’m not an information systems expert, so what you’ll see here is a very top-level view, more often than not me joining the dots from things I’ve learned 5 minutes ago. So if you spot anything I’ve got wrong or bits I’m missing let me know]

Open source as a Service

some of the aaS’s
CaaS – Cluster as a Service
IaaS – Infrastructure as a Service
SaaS – Software as a Service
PaaS – Platform as a Service

I’ve already highlighted how the open source R statistical computing environment is being used as an analytics layer. Open source is alive and well in other parts of the infrastructure. First up at the event was Rob Anderson from Isilon Systems (a division of EMC) talking about Big Data and the implications for storage. Rob did a great job introducing Big Data, and a couple of things I took away were the message that there is a real demand for talented ‘data scientists’ and the need to get organisations to think differently about data.

If you look at some of the products/services EMC offer you’ll find EMC Greenplum Database and HD Community Editions (Greenplum is a set of products to handle ‘Big Data’). You’ll see that these include the open source Apache Hadoop ecosystem. If like me you’ve heard of Hadoop but don’t really understand what it is, here is a useful post on Open source solutions for processing big data and getting Knowledge. This highlights components of the Hadoop ecosystem, most of which appear in the Greenplum Community Edition (I was very surprised to see that the NoSQL database Cassandra, now part of the Hadoop ecosystem, was originally developed by Facebook and released as open source code – more about NoSQL later).

Open algorithms, machines and people

amplab - state of the art

The use of open source in big data was also highlighted by Anthony D. Joseph, Professor at the University of California, Berkeley, in his talk. Anthony was highlighting UC Berkeley’s AMPLab, which is exploring “Making Sense at Scale” by tightly integrating algorithms, machines and people (AMP). The slide (right) from Anthony’s presentation summarises what they are doing, combining 3 strands to solve big data problems.

They are achieving this by combining existing tools with new components. In the slide below you have the following pieces developed by AMPLab:

  • Apache Mesos – an open source cluster manager
  • Spark – an open source interactive and iterative data analysis system
  • SCADS – consistency adjustable data store (license unknown)
  • PIQL – Performance (predictive) Insightful Query Language (part of SCADS; there’s also a PIQL-on-RAILS plugin, MIT license)

amplab - machines

In the Applications/tools box are: advanced ML algorithms; interactive data mining; collaborative visualisation. I’m not entirely sure what these are, but in Anthony’s presentation he mentioned more open source tools are required, particularly in ‘new analysis environments’.

Here are the real applications of AMPLab Anthony mentioned:

[Another site mentioned by Anthony worth bookmarking/visiting is DataKind – ‘helping non-profits through pro bono data collections, analysis and visualisation’]

OpenStack

Another cloud/big data/open source tool I know of but which wasn’t mentioned at the event is OpenStack. This was initially developed by the commercial hosting service Rackspace and NASA (who it has been said are ‘the largest collector of data in human history’). Like Hadoop, OpenStack is a collection of tools/projects rather than one product. OpenStack contains OpenStack Compute, OpenStack Object Storage and OpenStack Image Service.

NoSQL

In computing, NoSQL is a class of database management system identified by its non-adherence to the widely-used relational database management system (RDBMS) model … It does not use SQL as its query language … NoSQL database systems are developed to manage large volumes of data that do not necessarily follow a fixed schema – Wikipedia

NoSQL came up in Simon Metson’s (University of Bristol) ‘Big science, Big Data’ session. This class of database is common in big data applications, but Simon underlined that it’s not always the right tool for the job.

This view is echoed by Nick Jackson (University of Lincoln), who did an ‘awesome’ introduction to MongoDB (one of the many open source NoSQL solutions) as part of the Managing Research Data Hack Day organised by DevCSI/JISC MRD. I strongly recommend you look at the resources that came out of this event, including other presentations from the University of Bristol on data.bris.

[BTW the MongoDB site has a very useful page highlighting how it differs from another open source NoSQL solution, CouchDB. So even NoSQL solutions come in many flavours. Also Simon Hodson, Programme Manager JISC MRD, gave a lightning talk on JISC and Big Data at the Eduserv event]

Summary

The amount of open source solutions in this area is perhaps not surprising, as the majority of the web (65% according to the last Netcraft survey) is run on the open source Apache server. It’s interesting to see that code is being contributed not only by the academic/research community but also by companies like Facebook who deal with big data on a daily basis. Assuming the challenge isn’t technical, it then becomes about organisations understanding what they can do with data and having the talent in place (data scientists) to turn data into ‘actionable insights’.

Here are videos of all the presentations (including links to slides where available)

BTW Here is an archive of tweets from #esym12

For those of you who have made it this far through my deluge of links, please feel free to now leave this site and watch some of the videos from the Data Scientist Summit 2011 (I’m still working my way through, but there are some inspirational presentations).

Update: Sander van der Waal at OSS Watch, who was also at #esym12, has also posted The dominance of open source tools in Big Data.


Lou McGill from the JISC/HEA OER Programme Synthesis and Evaluation team recently contacted me as part of the OER Review, asking if there was a way to analyse and visualise the Twitter followers of @SCOREProject and @ukoer. Having recently extracted data for the @jisccetis network of accounts I knew it was easy to get the information, but making it meaningful was another question.

There are a growing number of sites like twiangulate.com and visual.ly that make it easy to generate numbers and graphics. One of the limitations I find with these tools is they produce flat images, and all opportunity for ‘visual analytics’ is lost.

Click to see twiangulate comparison of SCOREProject and UKOER
Twiangulate data
Click to see visual.ly comparison of SCOREProject and UKOER
create infographics with visual.ly

So here’s my take on the problem: a template constructed with free and open source tools that lets you visually explore the @SCOREProject and @ukoer Twitter following.

Comparison of @SCOREProject and @ukoer

In this post I’ll give my narrative on the SCOREProject/UKOER Twitter followership and give you the basic recipe for creating your own comparisons (I should say that the solution isn’t production quality, but I need to move onto other things so someone else can tidy up).

Let’s start with the output. Here’s a page comparing the Twitter following of SCOREProject and UKOER. At the top each bubble represents someone who follows SCOREProject or UKOER (hovering over a bubble we can see who they are, and clicking filters the summary table at the bottom).

Bubble size matters

There are three options to change how the bubbles are sized:

  • Betweenness Centrality (a measure of community bridging capacity – see Sheila’s post on this);
  • In-Degree (how many other people who follow SCOREProject or ukoer also follow the person represented by the bubble); and
  • Followers count (how many people follow the person represented by the node).

Clicking on the ‘Grouped’ button lets you see whether bubbles/people follow the SCOREProject, UKOER or both. By switching between betweenness, degree and followers we can visually spot a couple of things:

  • Betweenness Centrality: SCOREProject has 3 well connected intercommunity bubbles: @GdnHigherEd, @gconole and @A_L_T. UKOER has the SCOREProject following them, which unsurprisingly makes them a great bridge to the SCOREProject community (if you are wondering where UKOER is: as they don’t follow SCOREProject, they don’t appear).
  • In-Degree: Switching to In-Degree we can visually see that the overall volume of the UKOER group grows more, despite the SCOREProject bubble in this group decreasing substantially. This suggests to me that the UKOER following is more interconnected.
  • Followers count: Here we see SCOREProject is the biggest winner thanks to being followed by @douglasi, who has over 300,000 followers. So whilst SCOREProject is followed by fewer people than UKOER, it has a potentially greater reach if @douglasi ever retweeted a message.

Colourful combination

Sticking with the grouped bubble view we can see different colour groupings within the clusters for SCOREProject, UKOER and both. The most noticeable is the light green used to identify Group 4, which has 115 people following SCOREProject compared to 59 following UKOER. The groupings are created using the community structure detection algorithm proposed by Joerg Reichardt and Stefan Bornholdt. To give a sense of who these sub-groups might represent, individual wordclouds have been generated based on the individual Twitter profile descriptions. Clicking on a word within these clouds filters the table. So for example you can explore who has used the term manager in their twitter profile (I have to say the update isn’t instant, but it’ll get there).

wordclouds

Behind the scenes

The bubble chart is coded in d3.js and based on the Animated Bubble Chart by Jim Vallandingham. The modifications I made were to allow bubble resizing (lines 37-44). This also required handling the bubble charge slightly differently (line 118). I got the idea of using the bubble chart for comparison from a Twitter Abused post, Rape Culture and Twitter Abuse. It also made sense to reuse Jim’s template, which uses the Twitter Bootstrap. The wordclouds are also rendered using d3.js via the d3.wordcloud extension by Jason Davies. Finally the table at the bottom is rendered using the Google Visualisation API/Google Chart Tools.

All the components play nicely together although the performance isn’t great. If I had more time I might play with the load sequencing, but it could be I’m just asking too much of things like the Google Table chart rendering 600 rows.

How to make your own

I should say that this recipe probably won’t work for accounts with over 5,000 followers. It also involves using R (in my case RStudio); R is used to do the network analysis/community detection side. You can download a copy of the script here. There’s probably an easier recipe that skips this part, which is worth revisiting.

  1. We start with taking a copy of Export Twitter Friends and Followers v2.1.2 [Network Mod] (as featured in Notes on extracting the JISC CETIS twitter follower network).
  2. Authenticate the spreadsheet with Twitter (instructions in the spreadsheet) and then get the followers of the accounts you are interested in using the Twitter > Get followers menu option
  3. Once you’ve got the followers run Twitter > Combine follower sheets Method II
  4. Move to the Vertices sheet and sort the data on the friends_count column
  5. In batches of around 250 rows select values from the id_str column and run TAGS Advanced > Get friend IDs – this will start populating the friends_ids column with data. For users with over 5,000 friends reselect their id_str and rerun the menu option until the ‘next_cursor’ equals 0 
    next cursor position
  6. Next open the Script editor and open the TAGS4 file and then Run > setup.
  7. Next select Publish > Publish as a service… and allow anyone to invoke the service anonymously. Copy the service URL and paste it into the R script downloaded earlier (also add the spreadsheet key to the R script, and within your spreadsheet File > Publish to the web)
    publish as service window
  8. Run the R script! ...  and fingers crossed everything works.

The files used in the SCOREProject/UKOER example can be downloaded from here. Changes you’ll need to make are adding the output csv files to the data folder, changing references in js/gtable.js and js/wordcloud.js, and the labels used in coffee/coffee.vis.

So there you go. I’ve spent way too much of my own time on this and haven’t really explained what is going on. Hopefully the various commenting in the source code removes some of the magic (I might revisit the R code, as in some ways I think it deserves a post of its own). If you have any questions or feedback leave them in the comments ;)