Twitter has recently frustrated a number of developers and mashup artists moving to tighter restrictions on it’s latest API. Top of the list for many are all Twitter Search API requests need to be authenticated (you can’t just grab and run, a request has to be via a Twitter account), removal of XML/Atom feeds and reduced rate limits. There are some gains which don’t appear to be widely written about so I’ll share here
you now specify how many tweets you want to get from Twitter, up to a maximum of 18,000 tweets
Previously in the old API the hard limits were 1,500 tweets from the last 7 days. This meant of you requested a very popular search term you’d only get the last 1,500 tweets making any tweets made earlier in the day inaccessible. In the new API there is still the ‘last 7 days’ limit but you can page back a lot further. Because the API limits to 100 tweets per call and 180 calls per hour this means you could potentially get 18,000 tweets in one hit. If you cache the maximum tweet id, wait an hour for the rate limit to refresh you could theoretically get even more (I’ve removed the 1.5k limit in TAGSv5.0, but haven’t fully tested how much of the 18k you can get before hit by script timeouts).
#2 Increased metadata with a tweet
Below is an illustration of the data returned in a single search result comparing the old and new search API.
If you look at the old data and the new data the main addition is a lot more profile data. A lot of this isn’t of huge interest (unless you wanted to do a colour analysis of profile colours), but there is some useful stuff. For example in this example I have profile information for the original and retweeter. as well as friend/follower counts, location and more (I’ve already shown how you can combine this data with Google Analytics for comparative analysis).
Whilst I’m sure this won’t appease the hardcore Twitter devs/3rd party for hackademics like myself grabbing extra tweets and more rich data has it’s benefits.
As part of LAK13 I’ve already written a series of blog posts highlighting a couple of ways to extract data from Canvas VLE. Prompted by a question by On and my colleague Sheila MacNeill I wanted to show you a way of getting feed data into a spreadsheet without using any code. The solution is to use Yahoo Pipes, but as this post will highlight this isn’t entirely straight forward and you need to be aware of several tricks to get the job done. As LAK13 isn’t using Google Groups for this post I’ll be using the Learning Analytics Google Group as a data source.
Sniffing for data
First we need to find a data source. Looking for an auto-detected RSS/Atom feed by visiting the group homepage reveals nothing. [I always forget browsers seem are moving away from telling you when they detect a feed. To get around this I use the Chrome RSS Subscription Extension which indicates with the orange RSS icon when a page has a feed.]
At this point I could grab the Atom links and with a bit of tweaking process it with my existing Google Apps Script Code, but lets look at a ‘no code’ solution.
Feeding Google Sheets with Yahoo Pipes
At this point it’s worth reminding you that you could use the importFeed formula in a Google Spreadsheet which would import the data from a Google Group. The issue however is it’s limited to the last 20 items so we need a better way of feeding the sheet.
A great tool for manipulating rss/atom (other data) feeds is Yahoo Pipes. Pipes gives you a drag and drop programming environment where you can use blocks to perform operations and wire the outputs together. I learned most of my pipework from the Pipemaster – Tony Hirst and if you are looking for a starting point this is a good one.
Here I’ve created a pipe that takes a Google Group shortname does some minor manipulation, which I’ll explain later, and output a result. When we run the pipe we get some export options:
The one I’m looking for is .csv because it’ll easily import into Google Sheets, but it’s not there … Just as we had to know the old Google Group interface has RSS feeds, with Yahoo Pipes we have to know the csv trick. Here’s the url for ‘Get as JSON’:
and if we swap &_render=json for &_render=csv by magic we have a csv version of the output (whilst we are here also notice the group name used when the pipe is run is also in the url. This means if we know the group shortname we don’t need to enter a name a ‘Run pipe’, we can build the url to get the csv.
Now in a Google Spreadsheet if you enter the formula =importData("http://pipes.yahoo.com/pipes/pipe.run?_id=14ffe600e0c0a9315007b922e41be8ad&_render=csv&group=learninganalytics")we get the groups last 100 messages in a spreadsheet.
Extra tricks
There were a couple of extra tricks I skipped worth highlighting. RSS/Atom feeds permit multilevel data, so an element like ‘author’ can have sub elements like ‘name’, ‘email’. CSVs on the other hand are 2D, rows and columns.
When Yahoo Pipes generates a csv file it ignores sub elements, so in this case it’ll generate an author column but won’t include name or email. To get around this we need to pull the data we want into the first level (doing something similar for content).
The next little trick is to get the feed dates in a format Google Spreadsheets recognise as a date rather than a string. In the feed dates are in ISO 8601 format e.g. 2013-03-05T13:13:06Z. By removing the ‘T’ and ‘Z’ Google Spreadsheets will automatically parse as a date. To do this we use a Regex block to look for T or Z (T|Z) replacing with a single space (which is why the ‘with’ box looks empty).
I’ve wrapped the data in a modified version of the dashboard used for the Canvas data feed.
Given we’ve had to dig out the data feeds from the old Google Group interface my suspicion is once this is shut off for good the feeds will also disappear. Who knows, Google may actually have a Group API by then ;s
Limited to last 100 messages
The eagle eyed amongst you will have spotted I was able to increase the number of messages returned to 100 by adding num=100 to the feed query. This is the limit though and you can’t use paging to get older results. There are a couple of ways you could store the feed results like using the FeedWordPress plugin. I’m experimenting with using IF I do THAT on {insert social network/rss feed/other} THEN add row to Google Spreadsheet, but as the data isn’t stored nicely (particularly dates which are saved like ‘February 15, 2013 at 01:48PM’ *sigh*) it makes it harder to use.
I’m enrolled on the Learning Analytics and Knowledge (LAK13) which is an open online course introducing data and analytics in learning. As part of my personal assignment I thought it would be useful to share some of the data collection and analysis techniques I use for similar courses and take the opportunity to extend some of these. I should warn you that some of these posts will include very technical information. Please don’t run away as more often than not I’ll leave you with a spreadsheet where you fill in a cell and the rest is done for you. To begin with let’s start with Twitter.
Twitter basics
Like other courses LAK is using a course tag hashtag to allow aggregation of tweets, in this case #lak13. Participants can either watch the Twitter Search for #lak13, or depending on their Twitter application of choice, view the stream there. Until recently a common complaint of the Twitter search is it was limited to the last 7 days (Twitter are now rolling out search for a small percentage of older tweets). Whilst this limit is perhaps less of an issue given the velocity of the Twitter stream for course tutors and students having longitudinal data can be useful. Fortunately the Twitter API (API is a way for machines to talk to each other) gives developers a way to use Twitter’s data and use it in their applications. Twitter’s API is in transition from version 1 to 1.1, version 1 being switched off this March, which is making things interesting. The biggest impact for the part of the API handling search results is the:
removal of data returned in ATOM feed format; and
removal of access without login
This means you’ll soon no longer to be able to create a Twitter search which you can watch in an RSS Feed Aggregator like Google Reader like this one for #lak13.
All is not lost as the new version of the API still allows access to search results but only as JSON.
I don’t want to get too bogged down in JSON but basically it provides a structured way of sharing data and many websites and web services will have lots of JSON data being passed to your browser and rendered nicely for you to view. Let’s for example take a single tweet:
Whilst the tweet looks like it just has some text, links and a profile image underneath the surface there is so much more data. To give you an idea highlighted are 11 lines from 130 lines of metadata associated with a single tweet. Here is the raw data for you to explore for yourself. In it you’ll see information about the user including location and friend/follower counts; a breakdown of entities like other people mentioned and links; and ids for the tweet and in reply to.
One other Twitter basic that catches a lot of people out is the Search API is limited to the last 1500 tweets. So if you have a popular tag with over 1500 tweets in a day, at the end of the day only the last 1500 tweets are accessible via the Search API.
Archiving tweets for analysis
So there is potentially some rich data contained in tweets, but how can we capture this for analysis? There are a number of paid for services like eventifier that allow you to specify a hashtag for archive/analysis. As well as not being free the raw data isn’t also always available. My solution has been to develop a Google Spreadsheet to archive searches from Twitter (TAGS). This is just one of many other solutions like pulling data directly using R and Tableau the main advantage with this solution for me is I can set it up and it’s happy to automatically collect new data.
This makes it easy to get overviews of the data using the built-in templates:
… or, as I’d like to spend the rest of this post, quickly looking at ways to create different views.
As you will no doubt discover using a spreadsheet environment to do this has pros and cons. On the plus side it’s easy to use built-in charts and formula to analyse the data, identifying queries that might be useful for further analysis. The downside is you are limited in the level of complexity. For example, trying to do things like term extraction, n-grams etc is probably not going to work. All is not lost as Google Sheets makes it easy to extract and consume the data in other applications like R, Datameer and others.
If you open this spreadsheet and File > Make a copy it’ll give you a version that you can edit. In cell A1 of the Archive sheet you should see the following formula =importRange(“0AqGkLMU9sHmLdEZJRXFiNjdUTDJqRkNhLUxtZE5FZmc”,”Archive!A:K”)
What this does is pull the first couple of columns from this sheet where I’m already collecting LAK13 tweets (Note this techniques doesn’t scale well, so when LAK starts hitting thousands of tweets you are better doing manipulations in the source spreadsheet than using importRange. I’m doing it this way to get you started and try some things out).
FILTER, FREQUENCY and QUERY
On the Summary sheet I’ve extended the summary available in TAGS by including weekly breakdowns. The entire sheet is made with a handful of different formula used in slightly different ways with a dusting of conditional formatting. I’ve highlighted a couple of these:
FILTER – returns an array of dates the person named in cell B2 has made in the archive
FREQUENCY – calculates the frequency distribution of these dates based on the dates listed in S15:S22 and returns a count for each distribution in rows starting from the cell the formula is in
TRANSPOSE – converts the values from a vertical to horizontal response so it fills values across the sheet and not down
cell P2 =COUNTIF(H2:O2,">0")
counts if the values in row 2 from column H to O are greater than zero giving number of weeks the users has participated
cells H2:O – conditional formatting
cell B1 =QUERY(Archive!A:B," Select B, COUNT(A) WHERE B <> '' GROUP BY B ORDER BY COUNT(A) desc LABEL B 'Top Tweeters', COUNT(A) 'No.'",TRUE)
QUERY – allows you to use Google’s Query Language which is similar to SQL used in relational databases. In the example using the data source as columns A and B in the archive sheet we select columns B (screen name of tweeter) and count of A (could be any other column with a unique value) where B is not blank. The results are grouped by B (screen name) and ordered by count. The query also renames the columns.
QUERY Out
To give you some examples of possible queries you can use with data from Twitter in the spreadsheet you copied is a Query sheet with some examples. Included are some sample queries to filter tweets with ‘?’, which might indicate questions (even if rhetorical), time based filters and counts of messages between users.
The ability to export the data in this way opens up some other opportunities. Below is a screenshot of a ego/conversation centric view of #lak13 tweets rendered using the D3 javascript library. Whilst this view onto the archive is experimental hopefully it illustrates some of the opportunities.
Summary
Hopefully this post has highlighted some of the limitations of Twitter search, but also how data can be collected and the opportunities to rapidly prototype some basic queries. I’m conscious that I have provided any answers about how this can be used within learning analytics beyond the surface activity monitoring but I’m going to let you work that one out. If you want so see some of my work in this area you might want to check out the following posts:
For a couple of years now to support my research in Twitter community analysis/visualisation I’ve been developing my Twitter Archiving Google Spreadsheet (TAGS). To allow other to explore the possibilities of data generated by Twitter I’ve released copies of this template to the community.
What will happen to my existing TAGS sheets that aren’t version 5.0?
When Twitter turn off the old API (test outages this March) all authenticated and unauthenticated search requests will stop working.
How do I upgrade existing versions of TAGS spreadsheets (v3.x to v4.0) to keep collecting beyond March 2013?
As I can’t push an update to existing copies of TAGS you’ll have to manually update by opening your spreadsheet, then opening Tools > Script editor… and replacing the section of code that starts function getTweets() { and finishes 134 lines later (possiblly with the line function twDate(aDate){ ) with the code here. [And yes I know that’s a pain in the ass but best I could do] … or you can just start a new archive using TAGSv5.0
Abstract
A method and computer-readable medium for generating an activity stream is provided. The activity stream includes a ranked set of objects that are presented to one or more users. The ranking of objects is updated to reflect events associated with objects.
“The patent lists several “embodiments” of the concept – examples of approaches that could be pursued to implement the activity stream. These embodiments include re-ranking of a book chapter based on recent student comments or preferences and notifications when 75% of students have completed an assigned reading.”
Another example of ranking based on associated event I stumbled on today was in bookmarking site Diigo. As part of Diigo’s service you can create groups to collaboratively collect and curate resources. As well as a chronological view there is an option to sort by ‘popular’. Here’s an example of a Diigo Group for #etmooc.
I’m not entirely sure what the ranking is based on but it doesn’t appear to solely be on page views. I’d imagine number bookmarks by other users is another factor or the number of times someone hits the ‘Like’ button. Another example of prior art?
An aside: Using Yahoo Pipes to add social share counts to a feed
I had a quick go with pulling resources from the Diigo ‘popular’ page and using social share counts as another factor in ranking results. I got as far as this pipe which scrapes the top 20 popular items (no api/feed available) and then loops through the results using this sub-pipe, which uses the sharedcount.com API to take a url and see how many times it’s been shared across social networks (here’s a more general pipe that adds social counts to any RSS feed).
On one hand because a lot of the bookmarked resource aren’t brand new the counts might be a weak indicator of something (you may want to normalise the weighting based on age of resource), but overall this feels like a rabbit hole (not least because I couldn’t use the social share data to rank the results), so I’m tagging this as a failed idea.
It did get me wondering if within a open course (cMOOC) context pulling bookmarked resources into your course hub using FeedWordPress and then getting participants to rate might be … umm interesting (GD Star Ratings plugin looks good for this although I did spot on the etmooc hub they are using WP PostRatings).
Like a growing number of other people I’ve requested and got a complete archive of my tweets from Twitter … well almost complete. The issue is that while Twitter have done a great job of packaging the archives even going as far as creating a search interface powered by HTML and JavaScript as soon as you’ve requested the data it is stale. The other issue is unless you have some webhosting where can you share your archive to give other people access.
Fortunately as Google recently announced site publishing on Google Drive by uploading your Twitter archive to a folder and then sharing the folder so that it’s ‘Public on the web’ you can let other people explore your archive (here’s mine). Note: Mark Sample (@samplereality) has discovered that if you have file conversion on during upload this will break your archive. [You can also use the Public folder in Dropbox if you don’t want to use a Google account]
So next we need to keep the data fresh. Looking at how Twitter have put the archive together we can see tweets are stored in /data/js/tweets/ with a file for each months tweets and some metadata about the archive in /data/js/, the most important being tweet_index.js.
Fortunately not only does Google Apps Script provides an easy way to interface Drive and other Google Apps/3rd party services but the syntax is based on JavaScript making it easy to handle the existing data files. Given all of this it’s possible to read the existing data, fetch new status updates and write new data files keeping the archive fresh.
To do all of this I’ve come up with this Google Spreadsheet template:
Note: There is currently an open issue which is producing the error message ‘We’re sorry, a server error occurred. Please wait a bit and try again.’ Hopefully the ticket will be resolved soon
The video below hopefully explains how to setup and use:
A nice feature of this solution is that even if you don’t publically share your archive, if you are using the Google Drive app to syncs files with your computer the archive stays fresh on your local machine.
The model this solution uses is also quite interesting. There are a number of ways to create interfaces and apps using Google Apps Script. Writing data files to Google Drive and having a static html coded based interface is ideal for scenarios like this one where you don’t rely on heavy write processes or dynamic content (aware of course that there will be some sanitisation of code).
It would be easy to hook some extra code to push the refreshed files to another webserver or sync my local Google Drive with my webhost but for now I’m happy for Google to host my data ;s
It looks like Twitter are finally rolling out the option to download all your tweets. As well as providing a nice offline search interface it appears that “the archive also includes CSV and JSON files, the latter complete with each tweet’s metadata”. I’m looking forward to see the data visualisations/mashups people come up with around their data.
Here’s the template (File > Make a copy) and follow the instructions if you want to try (please be aware of the Twitter Developer Rules of the Road). I’ve updated the code to make it compatibly with version 1.1 of the Twitter API. One of the options I’ve added is a JSON dump which is saved to your Google Drive. It only took two lines of code using Google Apps Script HT +Romain Vialard
var blob = Utilities.newBlob(Utilities.jsonStringify(json), "application/json", filename);
DocsList.createFile(blob);
[The JSON dump is a bit buggy – some issues with character escaping somewhere]
REF impact involves an assessment of “significance” as well as “reach,” so the mere fact that research has been disseminated to a wide audience does not constitute an impact by itself; one has also to show the effect it has on those to whom it is disseminated. For this reason, citing the fact that a researcher has appeared on a primetime radio show with several million potential listeners might be one element of an impact statement, but one needs also to evidence that the audience has actively listened to what was being put out, and that it has affected, changed or benefitted them in some way
In the age of the second screen Alistair goes on to highlight how Twitter can be used as evidence of engagement, listeners tweeting personal reflections, feedback or just disseminating the information more widely. But as Alistair points out:
When a piece of academic work receives broadcast media coverage, then, it is useful to have a strategy in place to gather emerging responses, and it is also far easier to do this as it happens rather than retrospectively.
A strategy is required because, as Alistair points, out the Twitter search is limited to the last 7 days. While there are ways to view this activity in realtime how do you capture the evidence. Here’s my response to the problem:
I was recently asked to write a guest post for Big Data Week on using Google Apps as an interface for Big Data. For the post I decided to revisit an old recipe which uses Google Sheets (Spreadsheets) and Google Apps Script to interface the Twitter and Google Analytics API. One of the results is the bubble graph shown below which shows who has been tweeting my blog posts, how many visits their tweet generated, the number of retweets and how many followers the person has (click on the image for the interactive version). You can read more about his this was done and get a copy of the template in Getting Creative with Big Data and Google Apps
At IWMW12 I made a searchable/filterable version of TAGS Spreadsheets. This feature lets you use the Google Visualisation API to filter tweets stored in a Google Spreadsheet (more about TAGS). It has been available via a separate web interface for some time but I’ve never got around to publicizing it. As TAGSExplorer also uses the Google Visualisation API to wrap the same data in a different visualisation tool (predominantly d3.js) it made sense to merge the two. So now in any existing TAGSExplorer archive (like this one for #jiscel12) you have should now also have a button to ‘Search Archive’.
The archive view has some basic text filtering from tweeted text and who tweeted the message as well as a time range filter (dragging the handles indicated). The scattered dots indicate when messages were tweeted. The denser the dots, the more tweets made.
I’ve hastily thrown this together so feedback very welcome.
This blog uses Google Analytics (which makes use of 'cookie' technologies) to provide information on usage. Here's an overview of Google Analytics Privacy and how to opt-out (other 3rd party services like Twitter might also be tracking you via this site, but as far as possible I try and prevent this by removing official tweet buttons).