
Learning analytics appears to be an area of growing interest for institutions and I'm aware of a number of staff being asked to contribute on this topic in their teaching and learning strategies. I thought it would be useful to spotlight some resources produced by Cetis and Jisc in this area that might help. The list is in no way exhaustive and if you have any other resources you think worth highlighting either leave a comment or get in touch and I’ll add them to the post.

**New** Ferguson, R. (2013). Learning Analytics for Open and Distance Education. In S. Mishra (Ed.), CEMCA EdTech Notes. New Delhi, India: Commonwealth Educational Media Centre for Asia (CEMCA).
**More New** Society for Learning Analytics Research (SoLAR) has a collection of useful resources including these introductory articles.

SoLAR recently ran an open course, Strategy & Policy for Systemic Learning Analytics. Something worth looking at might be the recording of the session Belinda Tynan (OU’s PVC Learning & Teaching) did on Designing Systemic Analytics at The Open University.

**Even Newer** from SoLAR, looking more at a national strategy for learning analytics: Improving the Quality and Productivity of the Higher Education Sector: Policy and Strategy for Systems-Level Deployment of Learning Analytics.

Highlights from the Jisc Cetis Analytics Series include:

  • Analytics; what is changing and why does it matter?
    This paper provides a high-level overview of the CETIS Analytics Series. The series explores a number of key issues around the potential strategic advantages and insights which the increased attention on, and use of, analytics is bringing to the education sector. It is aimed primarily at managers and early adopters in Further and Higher Education who have a strategic role in developing the use of analytics in the following areas:

    • Whole Institutional Issues,
    • Ethical and Legal Issues,
    • Learning and Teaching,
    • Research Management,
    • Technology and Infrastructure.
  • Analytics for Learning and Teaching
    A broad view is taken of analytics for Learning and Teaching applications in Higher Education. In this we discriminate between learning analytics and academic analytics: uses for learning analytics are concerned with the optimisation of learning and teaching per se, while uses of academic analytics are concerned with optimisation of activities around learning and teaching, for example, student recruitment.
  • Legal, Risk and Ethical Aspects of Analytics in Higher Education
    The collection, processing and retention of data for analytical purposes has become commonplace in modern business, and consequently the associated legal considerations and ethical implications have also grown in importance. Who really owns this information? Who is ultimately responsible for maintaining it? What are the privacy issues and obligations? What practices pose ethical challenges?
    Also of interest is the LAK13 paper An evaluation of policy frameworks for addressing ethical considerations in learning analytics.
  • Institutional Readiness for Analytics
    This briefing paper is written for managers and early adopters in further and higher education who are thinking about how they can build capability in their institution to make better use of data that is held on their IT systems about the organisation and provision of the student experience. It will be of interest to institutions developing plans, those charged with the provision of analytical data, and administrators or academics who wish to use data to inform their decision making. The document identifies the capabilities that individuals and institutions need to initiate, execute, and act upon analytical intelligence.
  • Case Study: Acting on Assessment Analytics
    Over the past five years, as part of its overall developments in teaching and learning, The University of Huddersfield has been active in developing new approaches to assessment and feedback methodologies. This has included the implementation of related technologies such as e-submission and marking tools. In this case study Dr Cath Ellis shares with us how her interest in learning analytics began and how she and colleagues are making practical use of assessment data both for student feedback and overall course design processes.
    Aspects of this case study and other work in this area are available in this webinar recording on Learning analytics for assessment and feedback

Examples of Learning Analytic Tools

Taken from Dyckhoff, A. L., et al. "Supporting action research with learning analytics." Proceedings of the Third International Conference on Learning Analytics and Knowledge. ACM, 2013.

  • LOCO-Analyst [1, 4],
  • TADA-Ed [46],
  • Data Model to Ease Analysis and Mining [38],
  • Student Inspector [50],
  • MATEP [56–58],
  • CourseVis [43, 45],
  • GISMO [44],
  • Course Signals [3],
  • Check My Activity [25],
  • Moodog [54, 55],
  • TrAVis [41, 42],
  • Moodle Mining Tool [48],
  • EDM Vis [34],
  • AAT [29],
  • Teacher ADVisor [37],
  • E-learning Web Miner [26],
  • ARGUNAUT [30],
  • Biometrics-based Student Attendance Module [27],
  • CAMera and Zeitgeist Dashboard [51, 52],
  • Student Activity Meter [28],
  • Discussion Interaction Analysis System (DIAS) [8–11],
  • CoSyLMSAnalytics [49],
  • Network Visualization Resource and SNAPP [5, 6, 17, 18],
  • i-Bee [47],
  • iHelp [12], and
  • Participation Tool [32]

References for these tools are listed here

Here is a more general set of Analytics Tools and Infrastructure from the Analytics Series

A quick reminder that the Analytics in UK Further and Higher Education Survey is still open.

Posted in Analytics, JISC CETIS.


I’ve written a very long blog post which I’ll publish soon on text-mining public JISCMail (Listserv) lists using OpenRefine. It concludes with displaying list activity, posts over time and individual activity. The technique I used isn’t straightforward, but as the output might be of benefit to other people, like Brian Kelly who reported the Decline in JISCMail Use Across the Web Management Community, I wondered if there was a better way of doing it. Here’s my answer:

*** JISCMail Public List Activity Overview Template ***
[Give it 30 seconds to render the results] 


By making a copy of this spreadsheet and entering the url of the homepage of a public JISCMail list like OER-DISCUSS, it goes off and collects each month’s archives for almost the last 3 years, graphs the overall list activity and lets you see individual contributions (a limitation is matching variations in display names, so in the OER-DISCUSS example Pat Lockley and Patrick Lockley get counted separately even though they are the same person).

How it works

On the data sheet cell A2 uses importXML to grab all the archive links. In cell B2 the importHTML function is used to grab the table of posts on each month’s archive page and a QUERY returns the post author names, the values being turned into a string from an array using JOIN. In cell A53 a UNIQUE list of author names (minus quote characters) is generated using a combination of SPLIT and JOIN. As a rough sketch, those formulas look something like this (the XPath, ranges and column numbers here are illustrative; the template has the working versions):
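
A2:  =IMPORTXML(B1, "//a/@href")
B2:  =JOIN(",", QUERY(IMPORTHTML(A2, "table", 1), "select Col2", 1))
A53: =TRANSPOSE(UNIQUE(SPLIT(SUBSTITUTE(JOIN(",", B2:B50), "'", ""), ",")))

This data is then used on the Dashboard sheet; to get the SPARKLINEs I had to write a custom function using Google Apps Script: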

function getActivity(user, source) {
  // Returns a column of per-month post counts for the named user.
  // 'source' is a 2D range where each row holds one month's author
  // names joined into a single comma-separated string.
  var output = [];
  for (var i in source) {
    var rows = source[i][0];
    // count occurrences of the user's name in this month's string
    var count = rows.match(new RegExp(user, "g"));
    if (count) {
      output.push([count.length]);
    } else if (source[i][0] != "") {
      // month has activity but none from this user
      output.push([0]);
    }
  }
  // reverse so the counts run oldest-to-newest for charting
  output.reverse();
  return output;
}
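
The custom function can then be used on the Dashboard sheet like any built-in formula, e.g. =getActivity(A54, data!B2:B50) (cell references illustrative), returning one count per month for the sparkline to plot.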

If you are interested in learning more about the functions used, I recently posted Feeding Google Spreadsheets: Exercises in using importHTML, importFeed, importXML, importRange and importData (with some QUERY too). You should be able to use this template with any other public JISCMail list. Any questions, get in touch.

The JISC OER Rapid Innovation programme is coming to a close and as the 15 projects do their final tidying up it’s time to start thinking about the programme as a whole, emerging trends and lessons learned, and to help projects disseminate their outputs. One of the discussions we’ve started with Amber Thomas, the programme manager, is how best to go about this. Part of our support role at CETIS has been to record some of the technical and standards decisions taken by the projects. Phil Barker and I ended up having technical conversations with 13 of the 15 projects, which are recorded in the CETIS PROD Database in the Rapid Innovation strand. One idea was to see if there were any technology or standards themes we could use to illustrate what has been done in these areas. Here are a couple of ways to look at this data.

To start with, PROD has some experimental tools to visualise the data. By selecting the Rapid Innovation strand and ‘Stuff’ we get this tag cloud. We can see JSON, HTML5 and RSS are all very prominent. Unfortunately some of the context is lost as we don’t know, without digging deeper, which projects used JSON etc.

PROD Wordcloud

To get more context I thought it would be useful to put the data in a network graph (click to enlarge).

NodeXL Graph

NodeXL Graph - top selected

We can now see which projects (blue dots) used which tech/standards (brown dots) and again JSON, HTML5 and RSS are prominent. Selecting these (image right) we can see they cover most of the projects (10 of the 15), so these might be some technology themes we could talk about. But what about the remaining projects?

As a happy accident I put the list of technologies/standards into Voyant Tools (similar to Wordle but way more powerful – I need to write a post about it) and got the graph below:

Voyant Tools Wordcloud

Because the wordcloud is generated from words rather than phrases the frequency is different: api (16), rss (6), youtube (6), html5 (5), json (5). So maybe there is also a story about APIs and YouTube.

Posted in JISC, JISC CETIS.


I thought it would be useful to give a summary of some of the tools I use/developed at CETIS to monitor the pulse of the web around our work and JISC’s. All of these are available for reuse and documented to varying degrees. All of the tools use Google Spreadsheets/Apps Script, which is free for anyone to use with a Google account, and all the recipes use free tools (the exception being owning a copy of Excel, but most institutions have this as standard).

Tools

Hashtag archiving, analysis and interfacing

Used with: CETIS COMMS, OERRI Programme, #UKOER, …
What it does: It’s a Google Spreadsheet template which can be set up to automatically archive Twitter searches. The template includes some summaries to show top contributors and frequency of tweets. There are a number of add-on interfaces that can be used to navigate the data in different ways, including TAGSExplorer and TAGSArc.

More info: http://mashe.hawksey.info/2011/11/twitter-how-to-archive-event-hashtags-and-visualize-conversation/

Monitoring Twitter searches and combining with Google Analytics

Used with: CETIS COMMS
What it does: Archives all tweets linking to the .cetis.ac.uk domain and combines them with our Google Analytics data to monitor influential distributors of our work.

More info: http://mashe.hawksey.info/2012/03/combine-twitter-and-google-analytics-data-to-find-your-top-content-distributors/

RSS Feed Activity Data Monitoring

Used with: CETIS COMMS, OERRI Programme
What it does: Gives a dashboard view of the total social shares from a range of services (Facebook, Twitter, Google+) for a single RSS feed or a combination of feeds. At CETIS we also monitor the social popularity of blogs referencing .cetis.ac.uk by using an RSS feed from Google’s Blog Search e.g. http://www.google.com/search?q=link:cetis.ac.uk&hl=en&tbm=blg&output=rss&num=20

More info: http://mashe.hawksey.info/2012/06/rss-feed-social-share-counting/

Post Activity

Used with: CETIS COMMS
What it does: Gives more detailed activity data around socially shared urls, combining individual tweets from Topsy with Delicious shares and post comments.

More info: http://mashe.hawksey.info/2012/08/blog-activity-data-feed-template/

Post Directory

Used with: OERRI Programme
What it does: Dashboards all the project blogs from the OERRI Programme and monitors when they release blog posts with predefined tags/categories. The dashboard also combines the social monitoring techniques mentioned above so that projects and the programme support team can monitor social shares for individual blog posts.

More info: http://mashe.hawksey.info/2012/08/how-jisc-cetis-dashboard-social-activity-around-blog-posts-using-a-splash-of-data-science/

Automatic final report generation

Used with: OERRI Programme
What it does: As an extension to the Post Directory, this tool combines project blog posts from a predefined set of tags/categories into a final report as an editable MS Word/HTML document. Currently only the original post content, including images, is compiled in individual reports but it would be easy to also incorporate some of the social tracking and/or post comments data.

More info: http://mashe.hawksey.info/2012/09/converting-blog-post-urls-into-ms-word-documents-using-google-apps-script-oerri/

Recipes

As well as standalone tools I’ve documented a number of recipes to analyse monitoring data.

Twitter conversation graph

Used with: #moocmooc, #cfhe12
What it does: Using data from the Twitter Archiving Google Spreadsheet template (TAGS) this recipe shows you how you can use a free Excel add-on, NodeXL, to graph threaded conversations. I’m still developing this technique but my belief is there are opportunities to give a better overview of conversations within hashtag communities, identifying key moments.

More info: http://mashe.hawksey.info/2012/08/first-look-at-analysing-threaded-twitter-discussions-from-large-archives-using-nodexl-moocmooc/

Community blogosphere graph

Used with: #ds106
What it does: Outlines how data from blog posts (in this case a corpus collected by the FeedWordPress plugin used in DS106) can be refined and graphed to show blog post interlinking within a community. An idea explored in this recipe is using measures used in social network analysis to highlight key posts.

More info: http://mashe.hawksey.info/2012/10/visualizing-cmooc-data-extracting-and-analysing-data-from-feedwordpress-part-1-ds106-nodexl/

Activity data visualisation (gource)

Used with: #ukoer
What it does: Documents how data can be extracted (in this case records from Jorum) and cleaned using Google Refine (soon to be renamed OpenRefine). This data is then exported as a custom log file which can be played in an open source visualisation tool called Gource. The purpose of this technique is to give the viewer a sense of the volume and size of data submitted or created by users within a community.

More info: http://mashe.hawksey.info/2011/12/google-refining-jorum-ukoer/

So now go forth and reuse!


If you haven’t already you should check out Jorum's 2012 Summer of Enhancements and you’ll see it’s a lot more than a spring clean. In summary there are 4 major projects going on:

  • JDEP - Improving discoverability through semantic technology
  • JEAP - Expanding Jorum’s collection through aggregation projects
  • JPEP - Exposing activity data and paradata
  • JUEP - Improving the front-end UI and user experience (UI/UX)
SEO the Game by Subtle Network Design - The Apprentice Card
Image Copyright subtlenetwork.com

As I was tasked to write the chapter on OER Search Engine Optimisation (SEO) and Discoverability as part of our recent OER Booksprint I thought I’d share some personal reflections on the JDEP - Improving discoverability through semantic technology project (touching upon JEAP - Expanding Jorum’s collection through aggregation projects).

Looking through JDEP the focus appears to be mainly improving internal discoverability within Jorum with better indexing. There are some very interesting developments in this area most of which are beyond my realm of expertise.

Autonomy IDOL

The first aspect is deploying Autonomy IDOL, which uses “meaning-based search to unlock significant research material”. Autonomy is an HP-owned company, and IDOL (Intelligent Data Operating Layer) was recently used in a project by Mimas, JISC Collections and the British Library to unlock hidden collections. With Autonomy IDOL this means that:

rather than searching simply by a specific keyword or phrase that could have a number of definitions or interpretations, our interface aims to understand relationships between documents and information and recognize the meaning behind the search query.

This is achieved by:

  • clustering search results around related conceptual themes
  • full-text indexing of documents and associated materials
  • text-mining of full-text documents
  • dynamic clustering and serendipitous browsing
  • visualisation approaches to search results

An aspect of Autonomy IDOL that caught my eye was:

 conceptual clustering capability of text, video and speech

Will Jorum be able to index resources using Autonomy's Speech Analytics solution?

If so, that would be very useful; the issue may be how Jorum resources are packaged and where resources are hosted. If you would like to see Autonomy IDOL in action you can try the Institutional Repository Search, which searches across 160 UK repositories.

Will Jorum be implementing an Amazon-style recommendation system?

One thing it’ll be interesting to see (and this is perhaps more of a future aspiration) is the integration of an Amazon-style recommendation system. The CORE project has already published a similar documents plugin, but given Jorum already has single sign-on I wonder how easy it would be to integrate a solution to make resource recommendations based on usage data (here’s a paper on A Recommender System for the DSpace Open Repository Platform).

Elasticsearch

This is a term I’ve heard of but don’t really know enough about to comment on. I’m mentioning it here mainly to highlight the report Cottage Labs prepared, Investigating the suitability of Apache Solr and Elasticsearch for Mimas Jorum / Dashboard, which outlines the problem and solution for indexing and statistical querying.

External discoverability and SEO

Will Jorum be improving search engine optimisation?

From the forthcoming chapter on OER SEO and Discoverability:

Why SEO and discoverability are important

In common with other types of web resources, the majority of people will use a search engine to find open educational resources, therefore it is important to ensure that OERs feature prominently in search engine results. In addition to ensuring that resources can be found by general search engines it is also important to make sure they are easily discoverable in sites that are content or type specific e.g. iTunes, YouTube, Flickr.

Although search engine optimisation can be complex, particularly given that search engines may change their algorithms with little or no prior warning or documentation, there is growing awareness that if institutions, projects or individuals wish to have a visible web presence and to disseminate their resources efficiently and effectively, search engine optimisation and ranking cannot be ignored.

The statistics are compelling:

  • Over 80% of web searches are performed using Google [Ref 1]
  • Traffic from Google searches varies from repository to repository but ranges of 50-80% are not uncommon [Ref 2]
  • As an indication 83% of college students begin their information search in a search engine [Ref 3]

Given the current dominance of Google as the preferred search engine, it is important to understand how to optimise open educational resources to be discovered via Google Search. However SEO techniques are not specific to Google and are applicable to optimise resource discovery by other search engines.

By all accounts the only way for Jorum is up: it was recently reported in the JISCMail REPOSITORIES-LIST that “just over 5% of Jorum traffic comes directly from Google referrals”. So what is going wrong?

I’m not an SEO expert, but a quick check using a search for site:dspace.jorum.ac.uk returns 135,000 results, so content is being indexed (Jorum should have access to Google Webmaster Tools to get detailed index and ranking data). Resource pages include metadata such as DC.creator, DC.subject and more. One thing I noticed was missing from Jorum resource pages was <meta name="description" content="A description of the page" />. Why might this be important? Google will ignore meta tags it doesn't know (and here is the list of meta tags Google knows).

Another factor might be that Google apparently (can’t find a reference) trusts metadata that is human readable by using RDFa markup. So instead of hiding meta tags in the <head> of a page, Google might weight the data better if it was inline markup:

Current Jorum resource html source

With example of RDFa markup
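
The gist of the markup comparison, as a hand-rolled illustration (not Jorum’s actual markup): metadata hidden in the page <head> looks like

<meta name="DC.title" content="An Example Resource" />
<meta name="DC.creator" content="Jane Doe" />
<meta name="description" content="A short description of the resource." />

whereas the same information carried inline as RDFa sits in the visible page, matching what humans actually read:

<div xmlns:dc="http://purl.org/dc/elements/1.1/" about="/resource/123">
  <h1 property="dc:title">An Example Resource</h1>
  <p>By <span property="dc:creator">Jane Doe</span></p>
  <p property="dc:description">A short description of the resource.</p>
</div>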

[Taking this one step further Jorum might want to use schema.org to improve how resources are displayed in search results]

It will be interesting to see if JEAP - Expanding Jorum’s collection through aggregation projects will improve SEO because of backlink love.

Looking further ahead

Will there be an LTI interface to allow institutions to integrate Jorum into their VLE?

Final thought. It's been interesting to see Blackboard enter the repository marketplace with xpLor (see Michael Feldstein’s Blackboard’s New Platform Strategy for details). A feature of this cloud service that particularly caught my eye was the use of IMS Learning Tools Interoperability (LTI) to allow institutions to integrate a repository within their existing VLE (CETIS IMS Learning Tools Interoperability Briefing paper). As I understand it, with this institutions would be able to seamlessly deposit and search for resources. I wonder: is this type of solution on the Jorum roadmap, or do you feel there would be a lack of appetite within the sector for such a solution?

Fin

Those are my thoughts anyway. I know Jorum would welcome additional feedback on their Summer of Enhancements. I also welcome any thoughts on my thoughts ;)

BTW Here's a nice presentation on Improving Institutional Repository Search Engine Visibility in Google and Google Scholar

Jorum has a Dashboard Beta (for exposing usage and other stats about OER in Jorum) up for the community to have a play with: we would like to get your feedback!

For more information see the blog post here: http://www.jorum.ac.uk/blog/post/38/collecting-statistics-just-got-a-whole-lot-sweeter

Pertinent info: the Dashboard has live Jorum stats behind it, but the stats have some irregularities, so the stats themselves come with a health warning. We’re moving from quite an old version of DSpace to the most recent version over the summer, at which point we will have more reliable stats.

We also have a special project going over the summer to enhance our statistics and other paradata provision, so we’d love to get as much community feedback as possible to feed into that work. We’ll be doing a specific blog post about that as soon as we have contractors finalised!

Feedback by any of the mechanisms suggested in the blog post, or via discussion here on the list, all welcome.

The above message came from Sarah Currier on the [email protected] list. This was my response:

It always warms my heart to see a little more data being made openly available :)

I imagine (and I might be wrong) that the main users of this data might be repository managers wanting to analyse how their institutional resources are doing. So to be able to filter uploads/downloads/views for their resources and compare with overall figures would be useful.

Another (perhaps equally important) use case would be individuals wanting to know how their resources are doing, so a personal dashboard of resources uploaded, downloads and views would also be useful. This is an area Lincoln's Bebop project were interested in, so it might be an idea to work with them to find out what data would be useful to them and in what format (although saying that, I think I only found one #ukoer record for Lincoln). Hmm, I wonder if anyone else would find it useful if you pushed data to Google Spreadsheets a la Guardian datastore (here's some I captured as part of the OER Visualisation Project).

I'm interested to hear what the list think about these two points.

You might also want to consider how the data is licensed on the developer page. Back to my favourite example, Gent use the Open Data Commons licence http://opendatacommons.org/licenses/odbl/summary/

So what do you think of the beta dashboard? Do you think the two use cases I outline are valid or is there a more pertinent one? (If you want to leave a comment here I’ll make sure they are passed on to the Jorum team, or you can use other means).

[I’d also like to add a personal note that I’ve been impressed with the recent developments from Jorum/Mimas. There was a rocky period when I was at the JISC RSC when Jorum didn’t look aligned to what was going on in the wider world, but since then they’ve managed to turn it around and developments like this demonstrate a commitment to a better service]

Update: Bruce McPherson has been working some Excel/Google Spreadsheet magic and has links to examples in this comment thread.

Posted in API, Data, Jorum, OER.

Another post related to my ‘Hacking stuff together with Google Spreadsheets’ (other online spreadsheet tools are available) session at Dev8eD (a free event for building, sharing and learning cool stuff in educational technology for learning and teaching!!) next week. This time an example to demonstrate importHtml. Rather than reinventing the wheel I thought I’d revisit Tony Hirst's Creating a Winter Olympics 2010 Medal Map In Google Spreadsheets (hmm, are we allowed to use the word ‘Olympic’ in the same sentence as Google as they are not an official sponsor ;s).

Almost 4 years on the recipe hasn’t changed much. The Winter Olympics 2010 medals page on Wikipedia is still there and we can still use the importHTML formula to grab the table [=importHTML("http://en.wikipedia.org/wiki/2010_Winter_Olympics_medal_table","table",3)]

The useful thing to remember is that importHtml and its cousins importFeed, importXML, importData, and importRange create live links to the data, so if the table on Wikipedia were to change the spreadsheet would also eventually update.

Where I take a slight detour from the recipe is that Google now has a chart heatmap that doesn’t need ISO country codes. Instead it will happily try to resolve country names.

heatmap missing data

Once the data is imported from Wikipedia, if you select Insert > Chart and choose heatmap, using the Nation and Total columns as the data range, you should get a chart similar to the one shown to the right. The problem with this is it’s missing most of the countries. To fix this we need to remove the country codes in brackets. One way to do this is to trim the text from the left until the first bracket “(“. This can be done using a combination of the LEFT and FIND formula.

In your spreadsheet at cell H2 if you enter =LEFT(B2,FIND("(",B2)-2) this will return all the text in ‘Canada (CAN)’ up to ‘(‘ minus two characters to exclude the ‘(‘ and the space. You could manually fill this formula down the entire column but I like using the ARRAYFORMULA which allows you to use the same formula in multiple cells without having to manually fill it in. So our final formula in H2 is:

=ARRAYFORMULA(LEFT(B2:B27,FIND("(",B2:B27)-2))

Using the new column of cleaned country names we now get our final map

Interactive map: Click for interactive version

To recap, we used one of the import formulas to pull live data into a Google Spreadsheet, stripped some unwanted text and generated a map. Because all of this is sitting in the ‘cloud’ it’ll quite happily refresh itself if the data changes.

The final spreadsheet used in this example is here

Tony has another Data Scraping Wikipedia with Google Spreadsheets example here


Next week I’ll be presenting at Dev8eD (a free event for building, sharing and learning cool stuff in educational technology for learning and teaching!!) doing a session on ‘Hacking stuff together with Google Spreadsheets’ – other online spreadsheet tools are available. As part of this session I’ll be rolling out some new examples. Here’s one I’ve quickly thrown together to demonstrate the UNIQUE and FILTER spreadsheet formula. It’s yet another example of me revisiting the topic of electronic voting systems (clickers). The last time I did this it was gEVS – An idea for a Google Form/Visualization mashup for electronic voting, which used a single Google Form as a voting entry page and rendered the data using the Google Chart API on a separate webpage. This time round everything is done in the spreadsheet, which makes it easier to use/reuse. Below is a short video of the template in action followed by a quick explanation of how it works.

Here’s the:

*** Quick Clicker Voting System Template ***

The instructions on the ‘Readme’ tell you how to set it up. If you are using this on a Google Apps domain (Google Apps for Education), then it’s possible to also record the respondent’s username.


All the magic is happening in the FILTERED sheet. In cell A1 (which is hidden) there is the formula =UNIQUE(ALL!C:C). This returns a list of unique question ids from the ALL sheet. If you now highlight cell D2 of the FILTERED sheet and select Data > Validation you can see these values are used to create a select list.


The last bit of magic is in cells D4:D8. The first half of the formula [IF(ISNA(FILTER(ALL!D:D,ALL!C:C=$D$2,ALL!D:D=C4))] checks if there is any data. The important bit is:

COUNTA(FILTER(ALL!D:D,ALL!C:C=$D$2,ALL!D:D=C4))

This FILTERs column D of the ALL sheet using the condition that column C of the ALL sheet matches what is in D2 and column D matches the right response option. The formula returns the rows of data that match the query, so if there are three A responses for a particular question, three As would be returned, one on each row. All we want is the number of rows the filter returns, so it is wrapped in COUNTA, which counts the values in the returned array.
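
Putting the two halves together, the complete formula in D4 looks something like this (a reconstruction from the fragments above; the template has the canonical version):

=IF(ISNA(FILTER(ALL!D:D,ALL!C:C=$D$2,ALL!D:D=C4)),0,COUNTA(FILTER(ALL!D:D,ALL!C:C=$D$2,ALL!D:D=C4)))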

Simple, yes?


In the original JISC OER Rapid Innovation call one of the stipulations, due to the size and duration of grants, is that the main reporting process is blog-based. Amber Thomas, who is the JISC Programme Manager for this strand and a keen blogger herself, has been a long-time supporter of projects adopting open practices, blogging progress as they go. Brian Kelly (UKOLN) also has an interest in this area, with posts including Beyond Blogging as an Open Practice: What About Associated Open Usage Data?

For the OERRI projects the proposal discussed at the start-up meeting was that projects adopt a taxonomy of tags to indicate key posts (e.g. project plan, aims, outputs, nutshell etc.). For the final report projects would then compile all posts with specific tags and submit them as an MS Word or PDF document.

There are a number of advantages to this approach, one of them, for people like me anyway, being that it exposes machine-readable data that can be used in a number of ways. In this post I’ll show how I’ve created a quick dashboard in Google Spreadsheets which takes a list of blog RSS feeds and filters for specific tags/categories. Whilst I’ve demonstrated this with the OERRI projects, the same technique could be used in other scenarios, such as tracking student blogs. As part of this solution I’ll highlight some of the issues/affordances of different blogging platforms and introduce some future work to combine post content using a template structure.

OERRI Project Post Directory
Screenshot of OERRI post dashboard

The OERRI Project Post Directory

If you are not interested in how this spreadsheet was made and just want to grab a copy to use with your own set of projects/class blogs then just:

*** open the OERRI Project Post Directory ***
File > Make a copy if you want your own editable version

The link to the document above is the one I’ll be developing throughout the programme so feel free to bookmark the link to keep track of what the projects are doing.

The way the spreadsheet is structured, the tags/categories the script uses to filter posts are in cells D2:L2 and urls are constructed from the values in columns O-Q. The basic technique being used here is building urls that look for specific posts and returning links (made pretty with some conditional formatting).

Blogging platforms used in OERRI

So how do we build a url to look for specific posts? With this technique it comes down to whether the blogging platform supports tag/category filtering, so let’s first look at the platforms being used in OERRI projects.

This chart (right) breaks down the blogging platforms. You’ll see that most (12 of 15) are using WordPress in two flavours: ‘shared’, indicating that the blog is also a personal or team blog containing other posts not related to OERRI, and ‘dedicated’, set up entirely for the project.

The 3 other platforms are 2 MEDEV blogs and the OU’s project on Cloudworks. I’m not familiar with the MEDEV platform and only know a bit about Cloudworks, so for now I’m going to ignore these and concentrate on the WordPress blogs.

WordPress and Tag/Category Filtering

One of the benefits of WordPress is that you can get an RSS feed for almost everything by adding /feed/ or ?feed=rss2 to urls (other platforms also support this; I have a vague recollection of doing something similar in Blogger(?)). For example, if you want a feed of all my Google Apps posts you can use http://mashe.hawksey.info/category/google-apps/feed/.

Even better, you can combine tags/categories with a ‘+’ operator, so if you want a feed of all my Google Apps posts that are also categorised with Twitter you can use http://mashe.hawksey.info/category/google-apps+twitter/feed/.

So to get the Bebop ‘nutshell’ categorised post as an RSS item we can use: http://bebop.blogs.lincoln.ac.uk/category/nutshell/feed/

Looking at one of the shared WordPress blogs, to get the ‘nutshell’ from RedFeather you can use: http://blogs.ecs.soton.ac.uk/oneshare/tag/redfeather+nutshell/feed/
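
In the spreadsheet this is just string concatenation. A sketch of the kind of formula that builds these feed urls (the column layout and the shared/dedicated switch are assumptions about the template rather than copied from it):

=IF($P2="dedicated", $O2&"category/"&D$1&"/feed/", $O2&"tag/"&$Q2&"+"&D$1&"/feed/")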

Using Google Spreadsheet importFeed formula to get a post url

The ‘import’ functions in Google Spreadsheets must be my favourites and I know lots of social media professionals who use them to pull data into a spreadsheet and produce reports for clients from it. With importFeed we can go and see if a blog post under a certain category exists and then return something back, in this case the post link. For my first iteration of this spreadsheet I used the formula below:

importFeed formula

This works well but one of the drawbacks of importFeed is that we can only have a maximum of 50 of them in one spreadsheet. With 15 projects and 9 tags/categories the maths doesn’t add up.

To get around this I switched to Google Apps Script (the macro-style scripting for Google Spreadsheets I write a lot about). This doesn’t have an importFeed function built in, but I can do a UrlFetch and parse the XML. The code which does this is included in the template; a minimal sketch of the function is below:

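(In this sketch the parsing uses XmlService; everything beyond the fetch, parse and cache steps described here is illustrative rather than copied from the template.)

function fetchUrlfromRSS(aUrl) {
  // Returns the link of the first item in an RSS feed, or "" if the
  // feed doesn't exist (i.e. no post with that tag/category yet)
  if (!aUrl) return "";
  var cache = CacheService.getScriptCache();
  var cached = cache.get(aUrl); // reuse a recent result to save UrlFetch quota
  if (cached != null) return cached;
  var link = "";
  try {
    var response = UrlFetchApp.fetch(aUrl);
    var doc = XmlService.parse(response.getContentText());
    var item = doc.getRootElement().getChild("channel").getChild("item");
    if (item) link = item.getChildText("link");
  } catch (e) {
    // fetch or parse failed; fall through and return ""
  }
  cache.put(aUrl, link, 3600); // cache the result for an hour
  return link;
}
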
Note this code also uses the Cache Service to improve performance and make sure I don’t go over my UrlFetch quota.

We can call this function like any other spreadsheet formula using ‘=fetchUrlfromRSS(aUrl)’.

Trouble at the tagging mill

So we have a problem getting data from non-WordPress blogs, which I’m quietly ignoring for now; the next problem is people not tagging/categorising posts correctly. For example, I can see Access to Math have 10 posts, including a ‘nutshell’, but none of these are tagged. From the machine side there’s not much I can do about this, but at least from the dashboard I can spot something isn’t right.

Tags for a template

I’m sure once projects are politely reminded to tag posts they’ll oblige. One incentive might be to say that if posts are tagged correctly then the code above could easily be extended to pull not just post links but the full post text, which could then be used to generate the project’s final submission.

Summary

So stay tuned to the OERRI Project Post Directory spreadsheet to see if I can incorporate MEDEV and Cloudworks feeds, and also if I can create a template for final posts. Given Brian’s post on usage data mentioned at the beginning, should I also be tracking post activity data on social networks or is that a false metric?

I’m sure there was something else but it has entirely slipped my mind …

BTW here’s the OPML file for the RSS feeds of the blogs that are live (also visible here as a Google Reader bundle)


A couple of weeks ago it was Big Data Week, “a series of interconnected activities and conversations around the world across not only technology but also the commercial use case for Big Data”.

big data consists of data sets that grow so large and complex that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing – Wikipedia

In O’Reilly Radar there was a piece on Big data in Europe which had Q&A from Big Data Week founder/organizer Stewart Townsend and Carlos Somohano, both of whom are big in Big Data.

Maybe I’m being naïve but I was surprised that there was no reference to what the universities/research sector is doing with handling and analysing large data sets. For example, at the Sanger Institute alone each of their DNA sequencers generates 1 terabyte (1024 gigabytes) of data a day, and the institute is storing over 17 petabytes (17 million gigabytes), a figure that is doubling every year.

Those figures trip off my tongue because last week I was at the Eduserv Symposium 2012: Big Data, Big Deal?, which had many examples of how institutions are dealing with ‘big data’. There were a couple of things I took away from this event, like the prevalence of open source software as well as the number of vendors wrapping open source tools with their own systems to sell as a service. Another clear message was a lack of data scientists who can turn raw data into information and knowledge.

As part of the Analytics Reconnoitre we are undertaking at JISC CETIS, in this post I want to summarise some of the open source tools and ‘as a service’ offerings in the Big Data scene.

[Disclaimer: I should say first that I’m coming to this area cold. I’m not an information systems expert, so what you’ll see here is a very top-level view, more often than not me joining the dots from things I’ve learned 5 minutes ago. So if you spot anything I’ve got wrong or bits I’m missing let me know]

Open source as a Service

some of the aaS’s
CaaS – Cluster as a Service
IaaS – Infrastructure as a Service
SaaS – Software as a Service
PaaS – Platform as a Service

I’ve already highlighted how the open source R statistical computing environment is being used as an analytics layer. Open source is alive and well in other parts of the infrastructure. First up at the event was Rob Anderson from Isilon Systems (a division of EMC) talking about Big Data and implications for storage. Rob did a great job introducing Big Data, and a couple of things I took away were the message that there is a real demand for talented ‘data scientists’ and the need to get organisations to think differently about data.

If you look at some of the products/services EMC offer you’ll find the EMC Greenplum Database and HD Community Editions (Greenplum is a set of products to handle ‘Big Data’). You’ll see that these include the open source Apache Hadoop ecosystem. If, like me, you’ve heard of Hadoop but don’t really understand what it is, here is a useful post on Open source solutions for processing big data and getting Knowledge. This highlights components of Hadoop, most of which appear in the Greenplum Community Edition (I was very surprised to see that the NoSQL database Cassandra, which is now part of Hadoop, was originally developed by Facebook and released as open source code – more about NoSQL later).

Open algorithms, machines and people

amplab - state of the art

The use of open source in big data was also highlighted by Anthony D. Joseph, Professor at the University of California, Berkeley, in his talk. Anthony was highlighting UC Berkeley’s AMPLab, which is exploring “Making Sense at Scale” by tightly integrating algorithms, machines and people (AMP). The slide (right) from Anthony’s presentation summarises what they are doing, combining 3 strands to solve big data problems.

They are achieving this by combining existing tools with new components. In the slide below you have the following pieces developed by AMPLab:

  • Apache Mesos – an open source cluster manager
  • Spark – an open source interactive and iterative data analysis system
  • SCADS – consistency adjustable data store (license unknown)
  • PIQL – Performance (predictive) Insightful Query Language (part of SCADS. There’s also PIQL-on-RAILS plugin MIT license)

amplab - machines

In the Applications/tools box is: Advanced ML algorithms; Interactive data mining; Collaborative visualisation. I’m not entirely sure what these are but in Anthony’s presentation he mentioned more open source tools are required particularly in ‘new analysis environments’.

Here are the real applications of AMPLab Anthony mentioned:

[Another site mentioned by Anthony worth bookmarking/visiting is DataKind – ‘helping non-profits through pro bono data collections, analysis and visualisation’]

OpenStack

Another cloud/big data/open source tool I know of, but which wasn’t mentioned at the event, is OpenStack. This was initially developed by the commercial hosting service Rackspace and NASA (who it has been said are ‘the largest collector of data in human history’). Like Hadoop, OpenStack is a collection of tools/projects rather than one product. OpenStack contains OpenStack Compute, OpenStack Object Storage and OpenStack Image Service.

NoSQL

In computing, NoSQL is a class of database management system identified by its non-adherence to the widely-used relational database management system (RDBMS) model … It does not use SQL as its query language … NoSQL database systems are developed to manage large volumes of data that do not necessarily follow a fixed schema – Wikipedia

NoSQL came up in Simon Metson’s (University of Bristol) Big Science, Big Data session. This class of database is common in big data applications but Simon underlined that it’s not always the right tool for the job.

This view is echoed by Nick Jackson (University of Lincoln) who did an ‘awesome’ introduction to MongoDB (one of the many open source NoSQL solutions) as part of the Managing Research Data Hack Day organised by DevCSI/JISC MRD. I strongly recommend you look at the resources that came out of this event, including other presentations from the University of Bristol on data.bris.

[BTW the MongoDB site has a very useful page highlighting how it differs from another open source NoSQL solution, CouchDB. So even NoSQL solutions come in many flavours. Also, Simon Hodson, Programme Manager JISC MRD, gave a lightning talk on JISC and Big Data at the Eduserv event]

Summary

The amount of open source solutions in this area is perhaps not surprising as the majority of the web (65% according to the last Netcraft survey) is run on the open source Apache server. It’s interesting to see that code is being contributed not only by the academic/research community but also by companies like Facebook who deal with big data on a daily basis. Assuming the challenge isn’t technical, it then becomes about organisations understanding what they can do with data and having the talent in place (data scientists) to turn data into ‘actionable insights’.

Here are videos of all the presentations (including links to slides where available)

BTW Here is an archive of tweets from #esym12

For those of you who have made it this far through my deluge of links, please feel free to now leave this site and watch some of the videos from the Data Scientist Summit 2011 (I’m still working my way through but there are some inspirational presentations).

Update: Sander van der Waal at OSS Watch, who was also at #esym12, has also posted The dominance of open source tools in Big Data.