A couple of weeks ago it was Big Data Week, “a series of interconnected activities and conversations around the world across not only technology but also the commercial use case for Big Data”.
big data consists of data sets that grow so large and complex that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing – BY Wikipedia
In O’Reilly Radar there was a piece on Big data in Europe which had Q&A from Big Data Week founder/organizer Stewart Townsend, and Carlos Somohano both of whom are big in Big Data.
Maybe I’m being naïve but I was surprised that there was no reference to what universities/research sector is doing with handling and analysing large data sets. For example at the Sanger Institute alone each of their DNA sequencers are generating 1 terabyte (1024 gigabytes) of data a day, storing over 17 petabytes (17 million gigabytes) which is doubling every year.
Those figures trip off my tongue because last week I was at the Eduserv Symposium 2012: Big Data, Big Deal? which had many examples of how institutions are dealing with ‘big data’. There were a couple of things I took away from this event like the prevalence of open source software as well as the number of vendors wrapping open source tools with their own systems to sell as service. Another clear message was a lack of data scientists who can turn raw data into information and knowledge.
As part of the Analytics Reconnoitre we are undertaking at JISC CETIS in this post I want to summarise some of the open source tools and ‘as a service’ offering in the Big Data scene.
[Disclaimer: I should say first I coming to this area cold. I’m not an information systems expert so what you’ll see here is a very top-level view more often than not me joining the dots from things I’ve learned 5 minutes ago. So if you’ve spot anything I’ve got wrong or bits I’m missing let me know]
Open source as a Service
some of the aaS’s
CaaS – Cluster as a Service
IaaS – Infrastructure as a Service
SaaS – Software as a Service
PaaS – Platform as a Service
I’ve already highlighted how the open source R statistical computing environment is being used as an analytics layer. Open source is alive and well in other parts of the infrastructure. First up at the was Rob Anderson from Isilon Systems (division of EMC) talking about Big Data and implications for storage. Rob did a great job introducing Big Data and a couple of things I took away were the message that there is a real demand for talented ‘data scientists’ and getting organisations to think differently about data.
— Martin Hawksey (@mhawksey) May 10, 2012
If you look some of the products/services EMC offer you’ll find EMC Greenplum Database and HD Community Editions (Greenplum are a set of products to handle ‘Big Data’). You’ll see that these include the open source Apache Hadoop ecosystem. If like me you’ve heard of Hadoop but don’t really understand what it is, here is a useful post on Open source solutions for processing big data and getting Knowledge. This highlights components of the Hadoop most of which appear in the Greenplum Community Edition (I was very surprised to see the NoSQL database Cassandra which is now part of Hadoop was originally developed by Facebook and released as open source code – more about NoSQL later).
Open algorithms, machines and people
The use of open source in big data was also highlighted by Anthony D Joseph Professor at the University of California, Berkeley in his talk. Anthony was highlighting UC Berkeley’s AMPLab which is exploring “Making Sense at Scale” by tightly integrating algorithms, machines and people (AMP). The slide (right) from Anthony’s presentation summaries what they are doing, combining 3 strands to solve big data problems.
They are achieving this by combining existing tools with new components. In the slide below you have the following pieces developed by AMPLab:
- Apache Mesos – an open source cluster manager
- Spark – an open source interactive and interactive data analysis system
- SCADS – consistency adjustable data store (license unknown)
- PIQL – Performance (predictive) Insightful Query Language (part of SCADS. There’s also PIQL-on-RAILS plugin MIT license)
In the Applications/tools box is: Advanced ML algorithms; Interactive data mining; Collaborative visualisation. I’m not entirely sure what these are but in Anthony’s presentation he mentioned more open source tools are required particularly in ‘new analysis environments’.
AMPlab gamifying and crowdsourcing analysis to judge accuracy of machine learning algorithms #esym12
— Matt Johnson (@mhj_work) May 10, 2012
Here are the real applications of AMPLab Anthony mentioned:
- Mobile Millennium Project (traffic monitoring)
- Microsimulation of urban development
- Crowd based opinion formation – Opinion Space (broken site)
[Another site mentioned by Anthony worth bookmarking/visiting is DataKind – ‘helping non-profits through pro bono data collections, analysis and visualisation’]
Another cloud/big data/open source tool I know of but not mentioned at the event is OpenStack. This was initially developed by commercial hosting service Rackspace and NASA (who it has been said are ‘the largest collector of data in human history’). Like Hadoop OpenStack is a collection of tools/projects rather than one product. OpenStack contains OpenStack Compute, OpenStack Object Storage and OpenStack Image Service.
In computing, NoSQL is a class of database management system identified by its non-adherence to the widely-used relational database management system (RDBMS) model … It does not use SQL as its query language … NoSQL database systems are developed to manage large volumes of data that do not necessarily follow a fixed schema – BY wikipedia
NoSQL came up in Simon Metson’s (University of Bristol), Big science, Big Data session. This class of database is common in big data applications but Simon underlined that it’s not always the right tool for the job:
“When all you have is a hammer, everything looks like a nail” – great quote from Simon Metson #esym12
— Matt Johnson (@mhj_work) May 10, 2012
Ah, NoSQL isn’t a silver bullet. #esym12
— Brian Kelly (@briankelly) May 10, 2012
— Fiona Murphy (@DrFionaLM) May 10, 2012
This view is echoed by Nick Jackson (University of Lincoln) who did an ‘awesome’ introduction to MongoDB (one of the many open source NoSQL solutions) as part of the Managing Research Data Hack Data organised by DevCSI/JISC MRD. A strongly recommend you look at the resources that came out of this event including other presentations from University of Bristol on data.bris.
[BTW the MongoDB site has a very useful page highlighting how it differs from another open source NoSQL solution CouchDB. So even NoSQL solutions come in many flavours. Also Simon Hodson Programme Manager, JISC MRD gave a lightening talk on JISC and Big Data at the Eduserv event]
The amount of open source solutions in this area is perhaps not surprising as the majority of the web (65% according to the last netcraft survey) is run on the open source Apache server. It’s interesting to see that code is not only being contributed by the academic/research community but also companies like Facebook who deal with big data on a daily basis. Assuming the challenge isn’t technical it then becomes about organisations understanding what they can do with data and having the talent in place (data scientists) to turn data into ‘actionable insights’.
For those of you who have made it this far through my dearth on links please feel free to now leave this site and watch some of the videos from the Data Scientist Summit 2011 (I’m still working my way through but there are some inspirational presentations).
Update Sander van der Waal at OSS Watch who was also at #esym12 as also posted The dominance of open source tools in Big Data Published