As part of my role at JISC CETIS I’ve been asked to contribute to our ‘Analytics Reconnoitre’, a JISC-commissioned project looking at the data and analytics landscape. One of my first tasks is to report on the broad landscape and trends in analytics service and data providers. Whilst I’m still putting this report together it’s been interesting to note how one particular analytics tool, R, keeps pinging on my radar. I thought it would be useful to loosely join these sightings together and share them.
Before R, the bigger ‘data science’ picture
Before I go into R there is some more scene setting required. As part of the Analytics Reconnoitre Adam Cooper (JISC CETIS) has already published Analytics and Big Data - Reflections from the Teradata Universe Conference 2012 and Making Sense of “Analytics”.
The Analytics and Big Data post is an excellent summary of the Teradata Universe event and Adam is also able to note some very useful thoughts on ‘What this Means for Post-compulsory Education’. This includes identifying pathways for education to move forward with business intelligence and analytics. One of these I particularly liked was:
Experiment with being more analytical at craft-scale
Rather than thinking in terms of infrastructure or major initiatives, get some practical value with the infrastructure you have. Invest in someone with "data scientist" skills as master crafts-person and give them access to all data but don't neglect the value of developing apprentices and of developing wider appreciation of the capabilities and limitations of analytics.
[I’m biased towards this path because it encapsulates a lot of what I aspire to be. The craft model was one introduced to me by Joss Winn at this year’s Dev8D, and coming from a family of craftsmen it makes me more comfortable to think I’m continuing the tradition in some way.]
Here are Adam’s observations and reflections on ‘data science’ from the same blog post:
"Data Scientist" is a term which seems to be capturing the imagination in the corporate big data and analytics community but which has not been much used in our community.
A facetious definition of data scientist is "a business analyst who lives in California". Stephen Brobst gave his distinctions between data scientist and business analyst in his talk. His characterisation of a business analyst is someone who: is interested in understanding the answers to a business question; uses BI tools with filters to generate reports. A data scientist, on the other hand, is someone who: wants to know what the question should be; embodies a combination of curiosity, data gathering skills, statistical and modelling expertise and strong communication skills. Brobst argues that the working environment for a data scientist should allow them to self-provision data, rather than having to rely on what is formally supported in the organisation, to enable them to be inquisitive and creative.
Michael Rappa from the Institute for Advanced Analytics doesn't mention curiosity but offers a similar conception of the skill-set for a data scientist in an interview in Forbes magazine. The Guardian Data Blog has also reported on various views of what comprises a data scientist in March 2012, following the Strata Conference.
While it can be a sign of hype for new terminology to be spawned, the distinctions being drawn by Brobst and others are appealing to me because they are putting space between mainstream practice of business analysis and some arguably more effective practices. As universities and colleges move forward, we should be cautious of adopting the prevailing view from industry - the established business analyst role with a focus on reporting and descriptive statistics - and missing out on a set of more effective practices. Our lack of baked-in BI culture might actually be a benefit if it allows us to more quickly adopt the data scientist perspective alongside necessary management reporting. Furthermore, our IT environment is such that self-provisioning is more tractable.
R in data science and in business
For those who don’t know, R is an open source statistical programming language. If you want more background on the development of R, Information Age covers this in its piece Putting the R in analytics. An important thing to note, which is covered in the story, is that R was developed by two academics at the University of Auckland and continues to have a very strong and active academic community supporting it. Whilst it was initially used as an academic tool, the article highlights how it is being adopted by the business sector.
I originally picked up the Information Age post via the Revolutions blog (hosted by Revolution Analytics) in the post Information Age: graduates driving industry adoption of R, which includes one of the following quotes from Information Age:
This popularity in academia means that R is being taught to statistics students, says Matthew Aldridge, co-founder of UK- based data analysis consultancy Mango Solutions. “We're seeing a lot of academic departments using R, versus SPSS which was what they always used to teach at university,” he says. “That means a lot of students are coming out with R skills.”
Finance and accounting advisory Deloitte, which uses R for various statistical analyses and to visualise data for presentations, has found this to be the case. “Many of the analytical hires coming out of school now have more experience with R than with SAS and SPSS, which was not the case years ago,” says Michael Petrillo, a senior project lead at Deloitte's New York branch.
Revolutions has picked up other stories related to R in big data and analytics, two of which I have bookmarked. The first is Yes, you need more than just R for Big Data Analytics, in which Revolutions editor David Smith underlines that having tools like R isn’t enough and that a wider data science approach is needed because “it combines the tool expertise with statistical expertise and the domain expertise required to understand the problem and the data applicable to it”.
Smith also reminds us that:
The R software is just one piece of software ecosystem — an analytics stack, if you will — of tools used to analyze Big Data. For one thing R isn't a data store in its own right: you also need a data layer where R can access structured and unstructured data for analysis. (For example, see how you can use R to extract data from Hadoop in the slides from today's webinar by Antonio Piccolboni.) At the analytics layer, you need statistical algorithms that work with Big Data, like those in Revolution R Enterprise. And at the presentation layer, you need the ability to embed the results of the analysis in reports, BI tools, or data apps.
[Revolutions also has a comprehensive list of R integrated throughout the enterprise analytics stack which includes vendor integrations from IBM, Oracle, SAP and more]
The second post from Revolutions is R and Foursquare's recommendation engine which is another graphic illustration of how R is being used in the business sector separately from vendor tools.
At this point it’s worth highlighting another of Adam’s thoughts on directions for academia in Analytics and Big Data:
Don't focus on IT infrastructure (or tools)
Avoid the temptation (and sales pitches) to focus on IT infrastructure as a means to get going with analytics. While good tools are necessary, they are not the right place to start.
I agree about not being blinkered by specific tools and, as pointed out earlier, R can only ever be one piece of software in the ecosystem; any good data scientist will use the right tool for the job. It’s interesting to see an academic tool being adopted by, and arguably driving, part of the commercial sector. Will academia follow where they have led – if you see what I mean?