Learning Analytics appears to be increasingly on peoples mind’s and I’m aware of a number of staff being asked to contribute on this area in their teaching and learning strategies. I thought it would be useful to spotlight some resources produced by Cetis and Jisc in this area that might help. The list is in no ways exhaustive and if you have any other resources you think worth highlighting either leave a comment or get in touch and I’ll add them to the post.
Analytics; what is changing and why does it matter? This paper provides a high level overview to the CETIS Analytics Series. The series explores a number of key issues around the potential strategic advantages and insights which the increased attention on, and use of, analytics is bringing to the education sector. It is aimed primarily at managers and early adopters in Further and Higher Education who have a strategic role in developing the use of analytics in the following areas:
Whole Institutional Issues,
Ethical and Legal Issues,
Learning and Teaching,
Research Management,
Technology and Infrastructure.
Analytics for Learning and Teaching A broad view is taken of analytics for Learning and Teaching applications in Higher Education. In this we discriminate between learning analytics and academic analytics: uses for learning analytics are concerned with the optimisation of learning and teaching per se, while uses of educational analytics are concerned with optimisation of activities around learning and teaching, for example, student recruitment.
Institutional Readiness for Analytics This briefing paper is written for managers and early adopters in further and higher education who are thinking about how they can build capability in their institution to make better use of data that is held on their IT systems about the organisation and provision of the student experience. It will be of interest to institutions developing plans, those charged with the provision of analytical data, and administrators or academics who wish to use data to inform their decision making. The document identifies the capabilities that individuals and institutions need to initiate, execute, and act upon analytical intelligence
Case Study, Acting on Assessment AnalyticsOver the past five years, as part of its overall developments in teaching and learning, The University of Huddersfield has been active in developing new approaches to assessment and feedback methodologies. This has included the implementation of related technologies such as e-submission and marking tools.In this case study Dr Cath Ellis shares with us how her interest in learning analytics began and how she and colleagues are making practical use of assessment data both for student feedback and overall course design processes. Aspects of this case study and other work in this area are available in this webinar recording on Learning analytics for assessment and feedback
The term ‘dashboard’ currently seems to have a similar effect to marmite, you either love it or hate it. Fortunately it looks like the trend is towards increased personalisation of the way data is presented for both visual and exploratory purposes (an example of this is Thomson Reuters finance system Eikon), so if you don’t like marmite you can have something else instead.
When faced with a similar challenge for ocTEL it made sense to evolve this existing work rather than reinvent. As previously outlined in the ocTEL course recipe, we use the WordPress platform as a course hub, creating additional functionality using plugins. For the discussion forums we use the bbPress plugin. Like the rest of WordPress, bbPress creates a number of data feeds that we can use for forums, topics and replies. As far as I can tell though these feeds follow the WordPress global reading settings for the number of items available and are limited to RSS2.0.
[If you have your own bbPress forums you can use the code above to create your own data feed and File > Make a copy of the spreadsheet to dashboard your forums.]
So now we have some insight to the ocTEL forum activity it raises some interesting question. Besides turning the insight into action the questions at the top of my mind are: how do we promote this data endpoint? what other data feeds and formats would be useful? How do we license the data? Your thoughts as always are very welcome.
Just a note to say on 26th April at 4pm BST (in your timezone) I’ll be giving a webinar on using Google Apps Script to extract data from Canvas (LMS by Instructure). Even if you’re not a Canvas user hopefully there will be some general techniques you’ll find useful. It might also be useful for people thinking about applying basic learning analytic techniques to your courses. I’ll update this post with any resources from the session.
Martin Hawksey of Jisc CETIS started playing around with discussion usage data in Canvas when he was participating in the Learning Analytics course in Canvas Network. Using Google’s cloud scripting service Google Apps Script, free for Google Apps and Google account users, he could dig deep into the data and start doing some interesting (and valuable) analysis all from the comfort and familiarity of a spreadsheet.
Join us for this free 50-minute technical webinar on Friday, April 26th as Martin details his experience using the Canvas APIs and walks through building scripts to get meaningful Canvas data using Google Apps Script and the Canvas APIs.
Having collect some survey data for Analytics and Institutional Capabilities at #cetis13 (here’s a copy of the form) we were looking for a way to take us beyond the Google Form reporting and gaining extra insight. In particular I was interested in trying to see if there were any relationships between the multiple choice questions. Using a Sankey diagram seemed like a possible solution and my colleague David Sherlock quickly came up with a modification of Mike Bostock’s Sankey example to accept a csv input (I’ll link if/when it gets written up).
Seeing this I thought it might be useful to make a wrapper to generate Sankey diagrams for data stored in a Google Sheet. The solution was relatively straight forward, using the Google Visualisation API to get the data in the right shape for d3js. An example of the prototype is here
There’s no interface yet for you to select a spreadsheet, sheets, columns etc but you can take you Google Spreadsheet ‘publish to the web’ and then add the following data to the url.
One of the issues with this solution is you might not want to make all your data available. To get around this I’ve written a Google Apps Script that lets you use a custom formula to preformat the data. To see this in action this Spreadsheet contains an example. The formula is in cell A1 and uses the format =setSankey(datarange, cols , separator)
datarange – sheet/cell reference for source data eg ‘Form Responses – Edit’!A2:D Note must start with column A
cols – comma separated list of columns to use to generate a chart for eg “B,C,D”
separator {optional} – used to split multi value cells defaults to “, ” eg ” | “
Obviously creating your own querystrings to render the data isn’t ideal and it would be relatively straight forward to create a UI wrapper similar to the one used in EDGESExplorer, but it’s something I reluctant to do unless there is enough demand. The other consideration is the question – does the sankey diagram provide useful insight for the type of data or is it just more ‘damn lies’.
It would have of course been nice to write a Google Gadget to include this in a Spreadsheet … but Google are discontinuing those :(.
As part of LAK13 I’ve already written a series of blog posts highlighting a couple of ways to extract data from Canvas VLE. Prompted by a question by On and my colleague Sheila MacNeill I wanted to show you a way of getting feed data into a spreadsheet without using any code. The solution is to use Yahoo Pipes, but as this post will highlight this isn’t entirely straight forward and you need to be aware of several tricks to get the job done. As LAK13 isn’t using Google Groups for this post I’ll be using the Learning Analytics Google Group as a data source.
Sniffing for data
First we need to find a data source. Looking for an auto-detected RSS/Atom feed by visiting the group homepage reveals nothing. [I always forget browsers seem are moving away from telling you when they detect a feed. To get around this I use the Chrome RSS Subscription Extension which indicates with the orange RSS icon when a page has a feed.]
At this point I could grab the Atom links and with a bit of tweaking process it with my existing Google Apps Script Code, but lets look at a ‘no code’ solution.
Feeding Google Sheets with Yahoo Pipes
At this point it’s worth reminding you that you could use the importFeed formula in a Google Spreadsheet which would import the data from a Google Group. The issue however is it’s limited to the last 20 items so we need a better way of feeding the sheet.
A great tool for manipulating rss/atom (other data) feeds is Yahoo Pipes. Pipes gives you a drag and drop programming environment where you can use blocks to perform operations and wire the outputs together. I learned most of my pipework from the Pipemaster – Tony Hirst and if you are looking for a starting point this is a good one.
Here I’ve created a pipe that takes a Google Group shortname does some minor manipulation, which I’ll explain later, and output a result. When we run the pipe we get some export options:
The one I’m looking for is .csv because it’ll easily import into Google Sheets, but it’s not there … Just as we had to know the old Google Group interface has RSS feeds, with Yahoo Pipes we have to know the csv trick. Here’s the url for ‘Get as JSON’:
and if we swap &_render=json for &_render=csv by magic we have a csv version of the output (whilst we are here also notice the group name used when the pipe is run is also in the url. This means if we know the group shortname we don’t need to enter a name a ‘Run pipe’, we can build the url to get the csv.
Now in a Google Spreadsheet if you enter the formula =importData("http://pipes.yahoo.com/pipes/pipe.run?_id=14ffe600e0c0a9315007b922e41be8ad&_render=csv&group=learninganalytics")we get the groups last 100 messages in a spreadsheet.
Extra tricks
There were a couple of extra tricks I skipped worth highlighting. RSS/Atom feeds permit multilevel data, so an element like ‘author’ can have sub elements like ‘name’, ‘email’. CSVs on the other hand are 2D, rows and columns.
When Yahoo Pipes generates a csv file it ignores sub elements, so in this case it’ll generate an author column but won’t include name or email. To get around this we need to pull the data we want into the first level (doing something similar for content).
The next little trick is to get the feed dates in a format Google Spreadsheets recognise as a date rather than a string. In the feed dates are in ISO 8601 format e.g. 2013-03-05T13:13:06Z. By removing the ‘T’ and ‘Z’ Google Spreadsheets will automatically parse as a date. To do this we use a Regex block to look for T or Z (T|Z) replacing with a single space (which is why the ‘with’ box looks empty).
I’ve wrapped the data in a modified version of the dashboard used for the Canvas data feed.
Given we’ve had to dig out the data feeds from the old Google Group interface my suspicion is once this is shut off for good the feeds will also disappear. Who knows, Google may actually have a Group API by then ;s
Limited to last 100 messages
The eagle eyed amongst you will have spotted I was able to increase the number of messages returned to 100 by adding num=100 to the feed query. This is the limit though and you can’t use paging to get older results. There are a couple of ways you could store the feed results like using the FeedWordPress plugin. I’m experimenting with using IF I do THAT on {insert social network/rss feed/other} THEN add row to Google Spreadsheet, but as the data isn’t stored nicely (particularly dates which are saved like ‘February 15, 2013 at 01:48PM’ *sigh*) it makes it harder to use.
At 4pm GMT today (see in your own timezone) I’ll be doing a Google Hangout On Air with Marc Smith one of NodeXL’s creators and coordinators. NodeXL is a handy free add-in for MS Excel (Windows) allowing you to generate, explore and analyse network graphs. I started using NodeXL a couple of years ago as I found it was an easy way to extract a social graph (friend/follower relationships) from Twitter. In the session we’ll go through extracting a social graph from a Twitter hashtag and analysing the results.
You can view/participate in a number of ways. I’ll embed the livestream in this post, or you can see/join the hangout when it appears in my Google+ timeline. You can ask question via Twitter or Google+ using the combined hashtags #lak13 and #nodexl e.g. ‘What the best place to find out more about NodeXL? #lak13 #nodexl’ (this is a LAK13 fringe event)
In my last post on Canvas Network Discussion Activity Data I mentioned I was a little disappointed to not be able to use social network analysis (SNA) modelling techniques on the Canvas Network discussion boards. My main barrier was accessing the data via the Canvas API using my preferred toolset. Fortunately Brian Whitmer at instructure.com (the company behind Canvas) posted a comment highlighting that as a student on the course it was easy for me to get access to this data using a token generated on my Canvas profile page. With this new information in this post I’ll cover three main areas:
a very quick introduction into techniques/opportunities for analysing threaded networks using SNA;
how I retrieved data from the Canvas platform for the #LAK13 discussions; and finally
some analysis using the NodeXL add-on for Microsoft Excel (Windows).
On Friday 1st March at 4pm GMT I’ll also be having a live Hangout on Air with Marc Smith, one of the original creators and continued project coordinator of NodeXL. The live feed will be embedded belowembedded here and you can ask question via Twitter or Google+ using the combined hashtags #lak13 and #nodexl e.g. ‘What the best place to find out more about NodeXL? #lak13 #nodexl’. For the hangout session we’ll look at how easy it is to use NodeXL to analyse a Twitter hashtag community in a couple of clicks. [The rest of this post is introducing a more advanced use of NodeXL so if I lose you in the rest of this post fear not as the session will be a lot easier going]
Opportunities for analysing threaded networks using SNA
Hello graph
One of the dangers of SNA is people see images like the one to the right and miss the point of using this modelling technique (the image is from one of my first goes at using NodeXL so I’d be the first to admit there’s room for improvement). If you do there are couple of things to bear in mind. Often these images are exhaust, generated as a snapshot of a visual and/or exploratory analytic. A certain level of literacy is required to understand the underlying structure of the graph. Taking this last point a threshold concept for me was understanding that a basic concept of these graphs are more often than not constructed from an edge list which is essentially two columns of data representing and start point and end point to a node. For example, all the names in columns Vertex 1 and Vertex 2 are nodes and each row represents an edge between the nodes so in this case Beth –> Adam generates the highlighted edge.
There is a lot more to understand about these types of graphs, but this basic concept means I know if I have any relationship data its easy to graph and explore.
Hello SNAPP
At this point it’s worth mentioning the browser plugin SNAPP.
The Social Networks Adapting Pedagogical Practice (SNAPP) tool performs real-time social network analysis and visualization of discussion forum activity within popular commercial and open source Learning Management Systems (LMS). SNAPP essentially serves as a diagnostic instrument, allowing teaching staff to evaluate student behavioural patterns against learning activity design objectives and intervene as required a timely manner.
Valuable interaction data is stored within a discussion forum but from the default threaded display of messages it is difficult to determine the level and direction of activity between participants. SNAPP infers relationship ties from the post-reply data and renders a social network diagram below the forum thread. The social network visualization can be filtered based upon user activity and social network data can be exported for further analysis in NetDraw. SNAPP integrates seamlessly with a variety of Learning Management Systems (Blackboard, Moodle and Desire2Learn) and must be triggered while a forum thread is displayed in a Web browser.
The social network diagrams can be used to identify:
isolated students
facilitator-centric network patterns where a tutor or academic is central to the network with little interaction occurring between student participants
group malfunction
users that bridge smaller clustered networks and serve as information brokers
Hopefully with that quick intro you can see there might be some value in using SNA from threaded discussion analysis. Reading the SNAPP overview hopefully you spotted that it currently doesn’t support extracting data from Canvas discussion boards. This is an opportunity to understand some of the analysis SNAPP is doing behind the scenes.
Hello Google Apps Script
If you have been following my posts you’ll see that I favour using Google Apps Script as a lightweight tool for extracting data. Thanks to Brian (Instructure) I’ve got a way to access the Discussion Topics API. Looking at the API documents I decided the best way to proceed was to get all of the LAK13 discussion topics (top level information) and use this to get the full topic data. If you speak JSON we are essentially turning this:
The code to do this is available here. I’m going to spare you the details of the code but here are the instructions is you’d like to export data from other Canvas hosted discussion boards. If you’re not interested in that you can just jump to the next section.
Generating an edge list (extracting data) from Canvas to Google Sheets
Create a new Google Spreadsheet and then in Tool > Script editor copy in the code from here
If you are not pulling data from LAK13 you need to edit values in lines 2-4. If you visit your course homepage hopefully you can decode the url pattern based on the example for LAK13 https://learn.canvas.net/courses/33 (I should also point out you need to be enrolled on the course to receive data. Also read Canvas API Policy)
Next you need an access token which is generated from your Canvas Profile Settings page. Scroll down to the bottom and click New Access Token, filling in a purpose and leaving expires blank. Make a copy of the token as it’s needed for the next step (I added a copy to a .txt file just in case the next step didn’t work
Back in the Script Editor in Google Spreadsheets click File > Project Properties. In the ‘Project properties’ tab click ‘+ Add row’ and replace (name) with access_token and (value) with the token you got from Canvas before clicking Save
Make sure everything is saved in the Script Editor and then Run > getCanvasDiscussionEdges, wait for the script to finish and on Sheet1 you should have a bunch of data to play with.
Using NodeXL to analyse Canvas Discussions
There are a number of different questions we could ask of the LAK13 data. The particular one I want to look at is who are the core community members stimulating/facilitating discussion (e.g. applying a connectivist theory who are the people you might want to connect with). To do this we need to (I’m assuming you’ve already installed NodeXL):
Download the data extracted to the Google Spreadsheet (File > Download as > Microsoft Excel). [If you just want the data I’ve extracted here’s the download link – the data is automatically refreshed nightly]
Open the download file in Excel and in the created_at column select all and Format Cells as General (I needed to do this because NodeXL was miss formating dates on import)
Start a new NodeXL Template (I use the NodeXL Excel Template option from my windows Start menu)
From the NodeXL ribbon you want to Import > From Open Workbook
In the import dialog vert1_name and vert2_name are edges, anything else prefixed with ‘vert’ is assigned to the corresponding Vertex n property column and everything else is an Vertex 1 property:
Once imported you can open the Edges sheet, select the created_at column and Format Cells reassigns a date/time format.
In Prepare Data chose ‘Count and merge duplicate edges’ and select Count and Vertex1 and Vertex 2
In the Graph section of the NodeXL ribbon we want to make this a directed graph (replies are directed) and choose you layout algorithm (I usually go Harel-Koren Fast Multiscale)
Next we want to prepare the data we want to analyse. In the Autofill Columns (within Visual Properties portion of the ribbon) set Edge Visibility to ‘topic_id’ and in Edge Visibility Options set ‘If the source column number is: Not equal to 558’ Show otherwise Skip (this will skip edges that are responses to the Pre-course discussion forum – I’ll let you question this decision in the comments/forum)
Click Ok then Autofill
Next open the Vertices sheet and select all the rows (Ctrl+A) and from the Visibility option select ‘Show if in an Edge’
Now find the row in the Vertices sheet for George Siemens and Skip (doing this were creating a ‘what if George wasn’t there’ scenario
Open the Graph Metrics window and add Vertex in-degree, vertex out-degree, Vertex betweenness and closeness centrality and Top items (in the Top items options you’ll need to add these as the metrics you want top 10s for), finally click Calculate metrics.
At this point you could use the calculated metrics to weight nodes in a graph, but for now I’m going to skip that. You should now have a Top Items sheet with some useful information. In the Betweenness Centrality list you should have these names:
Martin Hawksey
Simon Knight
Alex Perrier
Khaldoon Dhou
Rosa Estriégana Valdehita
Maha Al-Freih
Suzanne Shaffer
Maxim Skryabin
Bryan Braul
Peter Robertso
Excluding the pre-course discussion forum and George Siemens the discussions these people have engaged with provide the shortest paths to other people engaging in discussions on the Canvas site. Strategically these are potentially useful people within the network that you might want to follow, question or engage with.
Getting to this point obliviously hasn’t been straight forward and had SNAPP been available in this instance it would have turned this in to a far shorter post. Programmatically using tools like R we could have arrived at the same answer with a couple of lines of code (that might be my challenge for next week ;). What it has hopefully illustrated is if you have data in an edge format (two column relationships) tools like NodeXL make it possible for you use SNA modelling techniques to gain insight. (I’m sure it also illustrates that data wrangling isn’t always straight forward, but guess what that’s life).
In my last post I looked at the data available around a course hashtag from Twitter. For this next post I want to start looking at what’s available around the Canvas Network platform which is being used to host Learning Analytics and Knowledge (LAK13). Sizing up what was available I did come across the Canvas LMS API documentation, which provides a similar method of accessing data as the Twitter API. I wasn’t sure if this extended to Canvas Network but because the authentication method it uses (oAuth2) isn’t possible using my dev tools of choice (mainly Google Apps Script) I looked for something else.
Whilst browsing on the discussion page for the course I noticed that my browser was auto-detecting a feed:
Looking for an easy way to consume this I first turned to the importFeed formula in Google Spreadsheet’s but unfortunately it only returned the last 20 results. A trick I’ve used in the past is to put feeds through Yahoo Pipes to get a JSON/CSV to work with, but as working with dates this way isn’t straight forward I opted for some Google Apps Script which would create a custom formula to fetch the feed from Canvas Network and enter the results into a sheet. The 12 lines of code for the main part of this are below:
function getCanvasDiscussions(url) {
var response = UrlFetchApp.fetch(url);
var contentHeader = response.getHeaders();
if (response.getResponseCode() == 200) {
var d = Xml.parse(response.getContentText()).feed.entry;
var output = [['published','updated','title','author','link','id','content']];
for (i in d){
output.push([getDateFromIso(d[i].published.Text),getDateFromIso(d[i].updated.Text),d[i].title.Text,d[i].author.name.Text,d[i].link.href,d[i].id.Text,d[i].content.Text]);
}
return output;
}
}
The getDateFromIso is a subfunction I use quite often and is available in this stackoverflow answer. Adding the above code to a Google Sheet (via Tools > Script editor..) allows me to use a custom formula to fetch the data.
Below is a quick look at the data returned (here it is published in a table). In the columns we have publish dates, title, author, link, uri and post content. As it goes this isn’t too bad. The big thing that is missing is whilst we can see which topic the message is in the reply threading is lost.
Even with this like the #lak13 Twitter dashboard from last week I can quickly add some formulas to process the data and get an overview of what is going on (for the live view visit this spreadsheet – File > Make a copy is you want to edit).
This obviously isn’t a complicated analytic and it wouldn’t surprise me if the course tutors didn’t have something similar on the backend of Canvas Network. As a student it’s useful for me to see how I’m doing compared to others on the course and get a sense of who else is contributing. [Adam Cooper has a great post on How to do Analytics Right... with some tips he picked up for John Campbell who is behind Purdue’s Signals Project which fits in nicely here.]
Summary
So with a bit of ken and a couple lines of code I can see how the #lak13 discussions are going. Again I’ve avoided any deep analytics such as analysing what has been said, to who, at what time, but hopefully now that I’ve highlighted and freed the data you can do something else with it. Not being able to extract the conversation thread is a little disappointing as it would have been nice to fire up SNAPP or NodeXL, but I’ll have to save those for another day ;)
I’m enrolled on the Learning Analytics and Knowledge (LAK13) which is an open online course introducing data and analytics in learning. As part of my personal assignment I thought it would be useful to share some of the data collection and analysis techniques I use for similar courses and take the opportunity to extend some of these. I should warn you that some of these posts will include very technical information. Please don’t run away as more often than not I’ll leave you with a spreadsheet where you fill in a cell and the rest is done for you. To begin with let’s start with Twitter.
Twitter basics
Like other courses LAK is using a course tag hashtag to allow aggregation of tweets, in this case #lak13. Participants can either watch the Twitter Search for #lak13, or depending on their Twitter application of choice, view the stream there. Until recently a common complaint of the Twitter search is it was limited to the last 7 days (Twitter are now rolling out search for a small percentage of older tweets). Whilst this limit is perhaps less of an issue given the velocity of the Twitter stream for course tutors and students having longitudinal data can be useful. Fortunately the Twitter API (API is a way for machines to talk to each other) gives developers a way to use Twitter’s data and use it in their applications. Twitter’s API is in transition from version 1 to 1.1, version 1 being switched off this March, which is making things interesting. The biggest impact for the part of the API handling search results is the:
removal of data returned in ATOM feed format; and
removal of access without login
This means you’ll soon no longer to be able to create a Twitter search which you can watch in an RSS Feed Aggregator like Google Reader like this one for #lak13.
All is not lost as the new version of the API still allows access to search results but only as JSON.
I don’t want to get too bogged down in JSON but basically it provides a structured way of sharing data and many websites and web services will have lots of JSON data being passed to your browser and rendered nicely for you to view. Let’s for example take a single tweet:
Whilst the tweet looks like it just has some text, links and a profile image underneath the surface there is so much more data. To give you an idea highlighted are 11 lines from 130 lines of metadata associated with a single tweet. Here is the raw data for you to explore for yourself. In it you’ll see information about the user including location and friend/follower counts; a breakdown of entities like other people mentioned and links; and ids for the tweet and in reply to.
One other Twitter basic that catches a lot of people out is the Search API is limited to the last 1500 tweets. So if you have a popular tag with over 1500 tweets in a day, at the end of the day only the last 1500 tweets are accessible via the Search API.
Archiving tweets for analysis
So there is potentially some rich data contained in tweets, but how can we capture this for analysis? There are a number of paid for services like eventifier that allow you to specify a hashtag for archive/analysis. As well as not being free the raw data isn’t also always available. My solution has been to develop a Google Spreadsheet to archive searches from Twitter (TAGS). This is just one of many other solutions like pulling data directly using R and Tableau the main advantage with this solution for me is I can set it up and it’s happy to automatically collect new data.
This makes it easy to get overviews of the data using the built-in templates:
… or, as I’d like to spend the rest of this post, quickly looking at ways to create different views.
As you will no doubt discover using a spreadsheet environment to do this has pros and cons. On the plus side it’s easy to use built-in charts and formula to analyse the data, identifying queries that might be useful for further analysis. The downside is you are limited in the level of complexity. For example, trying to do things like term extraction, n-grams etc is probably not going to work. All is not lost as Google Sheets makes it easy to extract and consume the data in other applications like R, Datameer and others.
If you open this spreadsheet and File > Make a copy it’ll give you a version that you can edit. In cell A1 of the Archive sheet you should see the following formula =importRange(“0AqGkLMU9sHmLdEZJRXFiNjdUTDJqRkNhLUxtZE5FZmc”,”Archive!A:K”)
What this does is pull the first couple of columns from this sheet where I’m already collecting LAK13 tweets (Note this techniques doesn’t scale well, so when LAK starts hitting thousands of tweets you are better doing manipulations in the source spreadsheet than using importRange. I’m doing it this way to get you started and try some things out).
FILTER, FREQUENCY and QUERY
On the Summary sheet I’ve extended the summary available in TAGS by including weekly breakdowns. The entire sheet is made with a handful of different formula used in slightly different ways with a dusting of conditional formatting. I’ve highlighted a couple of these:
FILTER – returns an array of dates the person named in cell B2 has made in the archive
FREQUENCY – calculates the frequency distribution of these dates based on the dates listed in S15:S22 and returns a count for each distribution in rows starting from the cell the formula is in
TRANSPOSE – converts the values from a vertical to horizontal response so it fills values across the sheet and not down
cell P2 =COUNTIF(H2:O2,">0")
counts if the values in row 2 from column H to O are greater than zero giving number of weeks the users has participated
cells H2:O – conditional formatting
cell B1 =QUERY(Archive!A:B," Select B, COUNT(A) WHERE B <> '' GROUP BY B ORDER BY COUNT(A) desc LABEL B 'Top Tweeters', COUNT(A) 'No.'",TRUE)
QUERY – allows you to use Google’s Query Language which is similar to SQL used in relational databases. In the example using the data source as columns A and B in the archive sheet we select columns B (screen name of tweeter) and count of A (could be any other column with a unique value) where B is not blank. The results are grouped by B (screen name) and ordered by count. The query also renames the columns.
QUERY Out
To give you some examples of possible queries you can use with data from Twitter in the spreadsheet you copied is a Query sheet with some examples. Included are some sample queries to filter tweets with ‘?’, which might indicate questions (even if rhetorical), time based filters and counts of messages between users.
The ability to export the data in this way opens up some other opportunities. Below is a screenshot of a ego/conversation centric view of #lak13 tweets rendered using the D3 javascript library. Whilst this view onto the archive is experimental hopefully it illustrates some of the opportunities.
Summary
Hopefully this post has highlighted some of the limitations of Twitter search, but also how data can be collected and the opportunities to rapidly prototype some basic queries. I’m conscious that I have provided any answers about how this can be used within learning analytics beyond the surface activity monitoring but I’m going to let you work that one out. If you want so see some of my work in this area you might want to check out the following posts:
After posting Analysing WordPress post velocity and momentum stats with Google Sheets (Spreadsheet) Tony quickly came back with WordPress Stats in R, which includes a handy custom function for querying the WordPress Stats API. Something I wanted to do in my original post was to generate sparklines for post activity. I wasn’t able to find a way of doing this in Google Sheets because of the number of data points (n. 366) and didn’t think Google Apps Script would handle the dataset because of timeouts. So given the start Tony had made with R it seemed like the ideal opportunity to refresh my R skills.
Note that 478 different post_titles are returned compared to 178 post_permalinks showing some of the data is missing and in fact wrong. This isn’t important for this particular project. I’ve uploaded a sample of the returned data here. The data returned includes a day stamp, post title and number of views. Posts with no views on that day are not returned.
For the sparklines I initially headed off down a dead end (lines 99-107) making a pivot table in R. This wasn’t entirely wasted effort because the process of casting this and handling null values was used later in the project. Having got stuck at this point searching for more information on sparklines in R turned up this question on stackoverflow and this set of slides referencing the sparkTable package. Initially I tried dropping and renaming some of the columns from the original data but on line 96 when trying the reshapeExt I got:
Error in `[<-.data.frame`(`*tmp*`, , attr(dat, "reshapeLong")[["timevar"]], :
replacement has 32504 rows, data has 32538
Searching for variations around the error message didn’t turn anything up and if you know what is wrong I’d be very grateful for guidance. Instead I decided to follow a hunch and instead of using partial time series data filled in the blanks by casting and then melting i.e. from this:
to this (filling in missing data with ‘0’)
before going back to this
Doing this got me past reshapeExt and outputting a sparkTable as html (the other option is LaTeX) embedding 478 sparkBars as png images. The final step was to wrap the html using jQuery DataTables (the main thing was to add <thead> and <tbody> tags).
Was it worth it? I’ll let you decide if the product is any good, the process of doing it was definitely useful. So how would you improve it?
This blog uses Google Analytics (which makes use of 'cookie' technologies) to provide information on usage. Here's an overview of Google Analytics Privacy and how to opt-out (other 3rd party services like Twitter might also be tracking you via this site, but as far as possible I try and prevent this by removing official tweet buttons).