Tag Archives: #jiscar

For the Analytics Reconnoitre I’ve been trying to get my head around ‘Analytics as a service’, asking myself what new “as-a-service” offerings are emerging. Let’s start by defining what ‘as-a-service’ is before looking at some of the analytics offerings. For this I’m going to use the five key characteristics used in the JISC CETIS Cloud Computing in Institutions briefing paper:

As a service: Key characteristics

Rapid elasticity: Capabilities can be rapidly and elastically provisioned to quickly scale up and rapidly released to quickly scale down.

Ubiquitous network access: Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones and laptops, etc.).

Pay per use: Capabilities are charged using a metered, fee-for-service, or advertising based billing model to promote optimisation of resource use.

On-demand self-service: A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed without requiring human interaction with each service’s provider.

Location independent data centres: The provider’s computing resources are usually pooled to serve all consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand.

Delivery Levels

The JISC CETIS briefing then goes on to name three delivery levels for as a service offerings in software (SaaS), platform (PaaS) and infrastructure (IaaS). Here are my suggestions for Analytics as a Service and Data as a Service:

Analytics as a Service (AaaS): The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure to extract “actionable insights through problem definition and the application of statistical models and analysis against existing and/or simulated future data”*. Examples include application specific solutions like Google Analytics and more general offerings like Amazon AWS.

*definition proposed by Adam Cooper in Analytics and Big Data. In many instances AaaS is a subset of SaaS

Data as a Service (DaaS): The capability provided to the consumer is to use the provider’s data on demand regardless of location and affiliation. Education specific data services are provided by HESA, UCAS and others (more examples in the JISC infoNet BI infoKit). Cost models include subscription and volume based. As well as DaaS options there are a growing number of Open Data providers, including Government initiatives like data.gov.uk. These fall outside the definition of an ‘as-a-service’ offering used here.


Web Analytics: Web analytics as a service is not a new phenomenon and the current market leader Google Analytics has been around since 2005. Google’s ‘as a service’ offering is available for free or as a paid for premium service. The service provides a number of standard web-based dashboards which allow administrators to analyse website traffic. Recently Google have also started recording and reporting social referrals from networks like Facebook, Twitter and their own Google+. Detailed social activity streams are also available from Google’s Social Data Hub partners. These streams extract conversations and social actions like bookmarking around website resources. As well as the web interface Google have options for downloading processed data and API access for use in other applications and services.

Customer Relationship Management: In the CETIS Cloud Computing briefing, Enrollment Rx was used to illustrate how a CRM solution offered as Software as a Service can in turn build upon the Platform as a Service offered by Salesforce. As part of this Enrollment Rx integrate Salesforce’s analytics tools and dashboards within their own product. Within Salesforce’s AppExchange there are over 100 other applications tagged with ‘analytics’, including SvimEdu which is a complete enterprise resource planning package targeted at the education sector.

BenchmarkinginHE: Benchmarking In HE is a HEFCE funded project which aims to offer benchmarking tools and data for universities and colleges. Many of the data sources (listed here) are Open Data provided by organisations like HESA but some are only available on a subscription basis. For example, the Higher Education Information Database for Institutions (heidi), which is managed by HESA, is operated on a not-for-profit subscription basis. The current tool available to institutions via BenchmarkinginHE is BenchmarkerHE, an online database of shared financial data with reporting options.

Big Data Analytics: Similar to the CRM illustration there are other examples of raw analytics services that are also relayered with 3rd party applications. An example of this is Amazon’s Elastic MapReduce (EMR). MapReduce is a programming framework for processing large datasets using multiple computers. It was originally developed by Google and now features in open source frameworks like Apache Hadoop. Elastic MapReduce was developed as one of the offerings in Amazon Web Services (AWS). It is based on Hadoop and is ‘elastic’ because it can easily scale. Karmasphere Analytics for Amazon EMR is a service which provides a graphical layer to interface with Amazon EMR, providing tools to create queries that generate reduced datasets which can be viewed visually or exported into other tools like MS Excel.
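To make the MapReduce idea a little more concrete, here is a minimal single-machine sketch in Python of the map, shuffle and reduce steps, using the usual word-count illustration. It is only meant to show the shape of the programming model: it does not use Hadoop or Amazon EMR, and the function names and sample documents are made up for this example. In a real cluster the map and reduce calls would be spread across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, purely for illustration.
documents = ["big data big deal", "open data"]
counts = word_count(documents)
print(counts)
```

Because each document is mapped independently and each key is reduced independently, the work can be split up and scaled out, which is the property that services like EMR build on.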

Spare notes

There is one more illustration I have in mind but it doesn’t entirely fit with the ‘as-a-service’ ethos. There are a growing number of sites that let you publish datasets for analysis. These services don’t include tools to process the data; instead they provide an infrastructure to set bounties. Examples include Kaggle and Amazon Mechanical Turk, the latter being a component of UC Berkeley’s AMPLab, which I’ve written about here.

Risks and Opportunities

A number of risks and opportunities are identified in the JISC CETIS Cloud Computing in Institutions briefing paper. One additional opportunity offered by analytics as a service is the argument that ‘as-a-service’ offerings can, to a degree, remove the need to have a dedicated data scientist. For example, a recent NY Times article asked ‘Will Amazon Offer Analytics as a Service?’, speculating whether Amazon will make and sell pattern-finding algorithms, removing the burden from customers to develop their own.

Available Products and Services

A range of analytics and data services are available. Here are a couple I’ve mentioned in this post topped up with some more.

Google Analytics: A free Google product that provides website analytics. Standard reporting includes analysis of: audience; advertising; traffic sources; content; and conversions. Data can be analysed via the Google Analytics web interface or downloaded/emailed to users. Analytics also has a Data API which can be used by 3rd party web services or in desktop applications. Website visitors are tracked in Google Analytics using a combination of cookies (rather than server logs) and, most recently, social activity. Google’s market share is reported to be around 50% but in a recent survey of 134 Universities UK websites 88% (n=118) were using Google Analytics. http://www.google.com/analytics/

Enrollment Rx (text from CETIS briefing): Is a relatively small company in the US that offers a Customer Relationship Management solution as Software as a Service. The service allows institutions to track prospective students through the application and enrollment process. The system is not free, but the combination of web delivery on the user end, and Platform as a Service at the backend, are intended to keep prices competitive. http://www.enrollmentrx.com/

Salesforce for Higher Education: Higher education institutions are using the salesforce.com platform for its instant scalability, ease of configuration, and support for multiple functional roles. Imagine a unified view of every interaction prospects, students, alumni, donors and affiliates have with your department or institution. Combine this with all of the tools you need to drive growth and success – campaign management, real-time analytics, web portals, and the ability to build custom applications without having to code – and you’re well on your way to getting your school to work smarter. http://www.salesforcefoundation.org/highered

Karmasphere Analytics for Amazon EMR: Karmasphere provides a graphical, high productivity solution for working with large structured and unstructured data sets on Amazon Elastic MapReduce. By combining the scalability and flexibility of Amazon Elastic MapReduce with the ease-of-use and graphical interface of Karmasphere desktop tools, you can quickly and cost-effectively build powerful Apache Hadoop-based applications to generate insights from your data. Launch new or access existing Amazon Elastic MapReduce job flows directly from the Karmasphere Analyst or Karmasphere Studio desktop tools, all with hourly pricing and no upfront fees or long-term commitments. http://aws.amazon.com/elasticmapreduce/karmasphere/

Kaggle: Kaggle is an innovative solution for statistical/analytics outsourcing. We are the leading platform for predictive modeling competitions. Companies, governments and researchers present datasets and problems - the world's best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model. http://www.kaggle.com/about


A couple of weeks ago it was Big Data Week, “a series of interconnected activities and conversations around the world across not only technology but also the commercial use case for Big Data”.

Big data consists of data sets that grow so large and complex that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing – Wikipedia

In O’Reilly Radar there was a piece on Big data in Europe which had a Q&A with Big Data Week founder/organizer Stewart Townsend and Carlos Somohano, both of whom are big in Big Data.

Maybe I’m being naïve but I was surprised that there was no reference to what the university/research sector is doing with handling and analysing large data sets. For example, at the Sanger Institute alone each of their DNA sequencers is generating 1 terabyte (1024 gigabytes) of data a day, and the institute is storing over 17 petabytes (17 million gigabytes), a figure which is doubling every year.
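To get a feel for what ‘doubling every year’ means, here is a back-of-the-envelope sketch in Python. It simply assumes the 17 petabyte figure quoted above and a clean doubling each year; both the function name and the idea that the growth rate stays constant are assumptions for illustration, not a forecast.

```python
def projected_storage_pb(current_pb, years):
    """Storage in petabytes after `years`, assuming the total doubles every year."""
    return current_pb * 2 ** years


# Start from the 17 PB quoted above and look a few years ahead.
for year in range(4):
    print(year, projected_storage_pb(17, year))
```

Under that assumption the store would pass 100 petabytes within three years, which gives some sense of why storage came up so often at the event.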

Those figures trip off my tongue because last week I was at the Eduserv Symposium 2012: Big Data, Big Deal? which had many examples of how institutions are dealing with ‘big data’. There were a couple of things I took away from this event, like the prevalence of open source software as well as the number of vendors wrapping open source tools with their own systems to sell as a service. Another clear message was a lack of data scientists who can turn raw data into information and knowledge.

As part of the Analytics Reconnoitre we are undertaking at JISC CETIS, in this post I want to summarise some of the open source tools and ‘as a service’ offerings in the Big Data scene.

[Disclaimer: I should say first I’m coming to this area cold. I’m not an information systems expert so what you’ll see here is a very top-level view, more often than not me joining the dots from things I’ve learned 5 minutes ago. So if you spot anything I’ve got wrong or bits I’m missing let me know]

Open source as a Service

some of the aaS’s
CaaS – Cluster as a Service
IaaS – Infrastructure as a Service
SaaS – Software as a Service
PaaS – Platform as a Service

I’ve already highlighted how the open source R statistical computing environment is being used as an analytics layer. Open source is alive and well in other parts of the infrastructure. First up at the event was Rob Anderson from Isilon Systems (division of EMC) talking about Big Data and implications for storage. Rob did a great job introducing Big Data and a couple of things I took away were the message that there is a real demand for talented ‘data scientists’ and the need to get organisations to think differently about data.

If you look at some of the products/services EMC offer you’ll find EMC Greenplum Database and HD Community Editions (Greenplum are a set of products to handle ‘Big Data’). You’ll see that these include the open source Apache Hadoop ecosystem. If like me you’ve heard of Hadoop but don’t really understand what it is, here is a useful post on Open source solutions for processing big data and getting Knowledge. This highlights components of Hadoop, most of which appear in the Greenplum Community Edition (I was very surprised to see that the NoSQL database Cassandra, which is now an Apache project in its own right, was originally developed by Facebook and released as open source code – more about NoSQL later).

Open algorithms, machines and people

[Slide: AMPLab - state of the art]

The use of open source in big data was also highlighted by Anthony D Joseph, Professor at the University of California, Berkeley, in his talk. Anthony was highlighting UC Berkeley’s AMPLab which is exploring “Making Sense at Scale” by tightly integrating algorithms, machines and people (AMP). The slide from Anthony’s presentation summarises what they are doing, combining 3 strands to solve big data problems.

They are achieving this by combining existing tools with new components. In the slide below you have the following pieces developed by AMPLab:

  • Apache Mesos – an open source cluster manager
  • Spark – an open source iterative and interactive data analysis system
  • SCADS – consistency adjustable data store (license unknown)
  • PIQL – Performance (predictive) Insightful Query Language (part of SCADS. There’s also PIQL-on-RAILS plugin MIT license)

[Slide: AMPLab - machines]

In the Applications/tools box is: Advanced ML algorithms; Interactive data mining; Collaborative visualisation. I’m not entirely sure what these are but in Anthony’s presentation he mentioned more open source tools are required particularly in ‘new analysis environments’.

Here are the real applications of AMPLab Anthony mentioned:

[Another site mentioned by Anthony worth bookmarking/visiting is DataKind – ‘helping non-profits through pro bono data collections, analysis and visualisation’]


Another cloud/big data/open source tool I know of but not mentioned at the event is OpenStack. This was initially developed by commercial hosting service Rackspace and NASA (who it has been said are ‘the largest collector of data in human history’). Like Hadoop OpenStack is a collection of tools/projects rather than one product. OpenStack contains OpenStack Compute, OpenStack Object Storage and OpenStack Image Service.


In computing, NoSQL is a class of database management system identified by its non-adherence to the widely-used relational database management system (RDBMS) model … It does not use SQL as its query language … NoSQL database systems are developed to manage large volumes of data that do not necessarily follow a fixed schema – Wikipedia
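As a rough illustration of what ‘do not necessarily follow a fixed schema’ means in practice, here is a small Python sketch that mimics a document-style store with plain dictionaries. It is not the API of MongoDB, CouchDB or any other real NoSQL product; the `find` helper and the sample records are hypothetical and only there to show that each record can carry its own set of fields.

```python
# Three made-up 'documents' in one collection, each with different fields.
documents = [
    {"_id": 1, "title": "Sequencer run", "size_tb": 1.0, "instrument": "sequencer-A"},
    {"_id": 2, "title": "Survey results", "respondents": 134},
    {"_id": 3, "title": "Tweet archive", "hashtag": "#esym12", "tags": ["big data", "events"]},
]


def find(collection, **criteria):
    """Return documents whose fields match every key/value pair in `criteria`.

    Documents that lack a field simply do not match; nothing breaks.
    """
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


matches = find(documents, hashtag="#esym12")
print(matches)

# The set of fields in use is discovered from the data, not declared up front.
fields = sorted({key for doc in documents for key in doc})
print(fields)
```

In a relational table every row would need the same columns, so adding `hashtag` or `tags` would mean changing the table definition first. Here a new field just appears on the documents that need it, which is the trade-off the NoSQL definition is pointing at.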

NoSQL came up in Simon Metson’s (University of Bristol), Big science, Big Data session. This class of database is common in big data applications but Simon underlined that it’s not always the right tool for the job:

This view is echoed by Nick Jackson (University of Lincoln) who did an ‘awesome’ introduction to MongoDB (one of the many open source NoSQL solutions) as part of the Managing Research Data Hack Data organised by DevCSI/JISC MRD. I strongly recommend you look at the resources that came out of this event, including other presentations from University of Bristol on data.bris.

[BTW the MongoDB site has a very useful page highlighting how it differs from another open source NoSQL solution, CouchDB. So even NoSQL solutions come in many flavours. Also Simon Hodson, Programme Manager, JISC MRD, gave a lightning talk on JISC and Big Data at the Eduserv event]


The number of open source solutions in this area is perhaps not surprising as the majority of the web (65% according to the last Netcraft survey) is run on the open source Apache server. It’s interesting to see that code is not only being contributed by the academic/research community but also by companies like Facebook who deal with big data on a daily basis. Assuming the challenge isn’t technical, it then becomes about organisations understanding what they can do with data and having the talent in place (data scientists) to turn data into ‘actionable insights’.

Here are videos of all the presentations (including links to slides where available)

BTW Here is an archive of tweets from #esym12

For those of you who have made it this far through my dearth of links please feel free to now leave this site and watch some of the videos from the Data Scientist Summit 2011 (I’m still working my way through but there are some inspirational presentations).

Update: Sander van der Waal at OSS Watch, who was also at #esym12, has also posted The dominance of open source tools in Big Data


As part of my role at JISC CETIS I’ve been asked to contribute to our ‘Analytics Reconnoitre’ which is a JISC commissioned project looking at the data and analytics landscape. One of my first tasks is to report on the broad landscape and trends in analytics service and data providers. Whilst I’m still putting this report together it’s been interesting to note how one particular analytics tool, R, keeps pinging on my radar. I thought it would be useful to loosely join these together and share.

Before R, the bigger ‘data science’ picture 

Before I go into R there is some more scene setting required. As part of the Analytics Reconnoitre Adam Cooper (JISC CETIS) has already published Analytics and Big Data - Reflections from the Teradata Universe Conference 2012 and Making Sense of “Analytics”.

The Analytics and Big Data post is an excellent summary of the Teradata Universe event and Adam is also able to note some very useful thoughts on ‘What this Means for Post-compulsory Education’. This includes identifying pathways for education to move forward with business intelligence and analytics. One of these I particularly liked was:

Experiment with being more analytical at craft-scale
Rather than thinking in terms of infrastructure or major initiatives, get some practical value with the infrastructure you have. Invest in someone with "data scientist" skills as master crafts-person and give them access to all data but don't neglect the value of developing apprentices and of developing wider appreciation of the capabilities and limitations of analytics.

[I’m biased towards this path because it encapsulates a lot of what I aspire to be. The craft model was one introduced to me by Joss Winn at this year’s Dev8D and coming from a family of craftsmen it makes me more comfortable to think I’m continuing the tradition in some way.]

Here are Adam’s observations and reflections on ‘data science’ from the same blog post:

"Data Scientist" is a term which seems to be capturing the imagination in the corporate big data and analytics community but which has not been much used in our community.

A facetious definition of data scientist is "a business analyst who lives in California". Stephen Brobst gave his distinctions between data scientist and business analyst in his talk. His characterisation of a business analyst is someone who: is interested in understanding the answers to a business question; uses BI tools with filters to generate reports. A data scientist, on the other hand, is someone who: wants to know what the question should be; embodies a combination of curiosity, data gathering skills, statistical and modelling expertise and strong communication skills. Brobst argues that the working environment for a data scientist should allow them to self-provision data, rather than having to rely on what is formally supported in the organisation, to enable them to be inquisitive and creative.

Michael Rappa from the Institute for Advanced Analytics doesn't mention curiosity but offers a similar conception of the skill-set for a data scientist in an interview in Forbes magazine. The Guardian Data Blog has also reported on various views of what comprises a data scientist in March 2012, following the Strata Conference.

While it can be a sign of hype for new terminology to be spawned, the distinctions being drawn by Brobst and others are appealing to me because they are putting space between mainstream practice of business analysis and some arguably more effective practices. As universities and colleges move forward, we should be cautious of adopting the prevailing view from industry - the established business analyst role with a focus on reporting and descriptive statistics - and missing out on a set of more effective practices. Our lack of baked-in BI culture might actually be a benefit if it allows us to more quickly adopt the data scientist perspective alongside necessary management reporting. Furthermore, our IT environment is such that self-provisioning is more tractable.

R in data science and in business

For those that don’t know, R is an open source statistical programming language. If you want more background about the development of R, Information Age covers this in their piece Putting the R in analytics. An important thing to note, which is covered in the story, is that R was developed by two academics at the University of Auckland and continues to have a very strong and active academic community supporting it. Whilst it was initially used as an academic tool, the article highlights how it is being adopted by the business sector.

I originally picked up the Information Age post via the Revolutions blog (hosted by Revolution Analytics) in the post Information Age: graduates driving industry adoption of R, which includes one of the following quotes from Information Age:

This popularity in academia means that R is being taught to statistics students, says Matthew Aldridge, co-founder of UK- based data analysis consultancy Mango Solutions. “We're seeing a lot of academic departments using R, versus SPSS which was what they always used to teach at university,” he says. “That means a lot of students are coming out with R skills.”

Finance and accounting advisory Deloitte, which uses R for various statistical analyses and to visualise data for presentations, has found this to be the case. “Many of the analytical hires coming out of school now have more experience with R than with SAS and SPSS, which was not the case years ago,” says Michael Petrillo, a senior project lead at Deloitte's New York branch.

Revolutions have picked up other stories related to R in big data and analytics, two of which I have bookmarked. The first is Yes, you need more than just R for Big Data Analytics, in which Revolutions editor David Smith underlines that having tools like R isn’t enough and a wider data science approach is needed because “it combines the tool expertise with statistical expertise and the domain expertise required to understand the problem and the data applicable to it”.

Smith also reminds us that:

The R software is just one piece of software ecosystem — an analytics stack, if you will — of tools used to analyze Big Data. For one thing R isn't a data store in its own right: you also need a data layer where R can access structured and unstructured data for analysis. (For example, see how you can use R to extract data from Hadoop in the slides from today's webinar by Antonio Piccolboni.) At the analytics layer, you need statistical algorithms that work with Big Data, like those in Revolution R Enterprise. And at the presentation layer, you need the ability to embed the results of the analysis in reports, BI tools, or data apps.

[Revolutions also has a comprehensive list of R integrated throughout the enterprise analytics stack which includes vendor integrations from IBM, Oracle, SAP and more]

The second post from Revolutions is R and Foursquare's recommendation engine which is another graphic illustration of how R is being used in the business sector separately from vendor tools.

Closing thoughts

At this point it’s worth highlighting another of Adam’s thoughts on directions for academia in Analytics and Big Data:

Don't focus on IT infrastructure (or tools)
Avoid the temptation (and sales pitches) to focus on IT infrastructure as a means to get going with analytics. While good tools are necessary, they are not the right place to start.

I agree about not being blinkered by specific tools and, as pointed out earlier, R can only ever be just one piece of software in the ecosystem and any good data scientist will use the right tool for the job. It’s interesting to see an academic tool being adopted by, and arguably driving, part of the commercial sector. Will academia follow where they have led – if you see what I mean?