Twitter has recently frustrated a number of developers and mashup artists by moving to tighter restrictions in its latest API. Top of the list for many are that all Twitter Search API requests need to be authenticated (you can’t just grab and run, a request has to be made via a Twitter account), the removal of XML/Atom feeds and reduced rate limits. There are some gains which don’t appear to be widely written about, so I’ll share them here.
#1 You can now specify how many tweets you want to get from Twitter, up to a maximum of 18,000 tweets
Previously in the old API the hard limits were 1,500 tweets from the last 7 days. This meant that if you requested a very popular search term you’d only get the last 1,500 tweets, making any tweets made earlier in the day inaccessible. In the new API there is still the ‘last 7 days’ limit but you can page back a lot further. Because the API limits you to 100 tweets per call and 180 calls per hour, this means you could potentially get 18,000 tweets in one hit. If you cache the maximum tweet id and wait an hour for the rate limit to refresh you could theoretically get even more (I’ve removed the 1.5k limit in TAGSv5.0, but haven’t fully tested how much of the 18k you can get before being hit by script timeouts).
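To make the arithmetic concrete, here is a minimal sketch of paging back through results. It is not the TAGS code: makeFakeSearch and fetchPage are hypothetical stand-ins for an authenticated search request, and the two limits are simply the figures quoted above. The only real API detail it leans on is that paging works by asking for tweets at or below a maximum id, so each call steps one below the oldest id already seen.

```javascript
// Figures quoted in the text above: 100 tweets per call, 180 calls per window.
const TWEETS_PER_CALL = 100;
const CALLS_PER_WINDOW = 180;

// Hypothetical stand-in for an authenticated search request.
// It serves made-up tweets, newest first, with ids held as strings.
function makeFakeSearch(totalTweets) {
  const tweets = [];
  for (let i = totalTweets; i >= 1; i--) {
    tweets.push({ id_str: String(i), text: 'tweet ' + i });
  }
  // Returns up to `count` tweets whose id is <= maxId (or the newest if maxId is null).
  return function fetchPage(maxId, count) {
    const start = maxId === null
      ? 0
      : tweets.findIndex(t => BigInt(t.id_str) <= maxId);
    if (start === -1) return [];
    return tweets.slice(start, start + count);
  };
}

// Page backwards until the call budget is spent or the results run out.
function collectTweets(fetchPage, maxCalls) {
  const collected = [];
  let maxId = null;
  for (let call = 0; call < maxCalls; call++) {
    const page = fetchPage(maxId, TWEETS_PER_CALL);
    if (page.length === 0) break;
    collected.push(...page);
    // The maximum id is inclusive, so step one below the oldest id seen.
    // Real tweet ids are too big for ordinary numbers, hence BigInt.
    maxId = BigInt(page[page.length - 1].id_str) - 1n;
  }
  return collected;
}

const archive = collectTweets(makeFakeSearch(20000), CALLS_PER_WINDOW);
console.log(archive.length); // 18000
```

With 20,000 fake tweets available the loop stops at 18,000 because the call budget runs out first, which is the ceiling described above.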
#2 Increased metadata with a tweet
Below is an illustration of the data returned in a single search result comparing the old and new search API.
If you look at the old data and the new data the main addition is a lot more profile data. A lot of this isn’t of huge interest (unless you wanted to do a colour analysis of profile colours), but there is some useful stuff. For example, in this case I have profile information for both the original tweeter and the retweeter, as well as friend/follower counts, location and more (I’ve already shown how you can combine this data with Google Analytics for comparative analysis).
Whilst I’m sure this won’t appease the hardcore Twitter devs/3rd parties, for hackademics like myself grabbing extra tweets and richer data has its benefits.
I’m enrolled on Learning Analytics and Knowledge (LAK13), which is an open online course introducing data and analytics in learning. As part of my personal assignment I thought it would be useful to share some of the data collection and analysis techniques I use for similar courses and take the opportunity to extend some of these. I should warn you that some of these posts will include very technical information. Please don’t run away, as more often than not I’ll leave you with a spreadsheet where you fill in a cell and the rest is done for you. To begin with let’s start with Twitter.
Twitter basics
Like other courses LAK is using a course hashtag to allow aggregation of tweets, in this case #lak13. Participants can either watch the Twitter Search for #lak13 or, depending on their Twitter application of choice, view the stream there. Until recently a common complaint about Twitter search was that it was limited to the last 7 days (Twitter are now rolling out search for a small percentage of older tweets). Whilst this limit is perhaps less of an issue given the velocity of the Twitter stream, for course tutors and students having longitudinal data can be useful. Fortunately the Twitter API (an API is a way for machines to talk to each other) gives developers a way to use Twitter’s data in their applications. Twitter’s API is in transition from version 1 to 1.1, with version 1 being switched off this March, which is making things interesting. The biggest impact for the part of the API handling search results is the:
removal of data returned in ATOM feed format; and
removal of access without login
This means you’ll soon no longer be able to create a Twitter search which you can watch in an RSS feed aggregator like Google Reader, like this one for #lak13.
All is not lost as the new version of the API still allows access to search results but only as JSON.
I don’t want to get too bogged down in JSON but basically it provides a structured way of sharing data and many websites and web services will have lots of JSON data being passed to your browser and rendered nicely for you to view. Let’s for example take a single tweet:
Whilst the tweet looks like it just has some text, links and a profile image underneath the surface there is so much more data. To give you an idea highlighted are 11 lines from 130 lines of metadata associated with a single tweet. Here is the raw data for you to explore for yourself. In it you’ll see information about the user including location and friend/follower counts; a breakdown of entities like other people mentioned and links; and ids for the tweet and in reply to.
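If you’d like a feel for what digging into that data looks like, here is a small sketch. The tweet below is entirely made up (it is not the raw data linked above); only the field names are intended to follow the shape of a v1.1 search result, and summariseTweet is a hypothetical helper of my own naming.

```javascript
// Made-up sample tweet; only the field names follow the search result format.
const raw = JSON.stringify({
  id_str: '1000000000000000001',
  text: 'Looking forward to #lak13 with @example_tutor http://t.co/xyz',
  in_reply_to_status_id_str: null,
  user: {
    screen_name: 'example_student',
    location: 'Edinburgh',
    followers_count: 120,
    friends_count: 95
  },
  entities: {
    user_mentions: [{ screen_name: 'example_tutor' }],
    urls: [{ expanded_url: 'http://example.com/course' }]
  }
});

// Hypothetical helper: pull out the parts of a tweet mentioned in the text above.
function summariseTweet(json) {
  const tweet = JSON.parse(json);
  return {
    id: tweet.id_str,
    from: tweet.user.screen_name,
    location: tweet.user.location,
    followers: tweet.user.followers_count,
    friends: tweet.user.friends_count,
    // Entities break out the people mentioned and the links in the tweet.
    mentions: tweet.entities.user_mentions.map(m => m.screen_name),
    links: tweet.entities.urls.map(u => u.expanded_url),
    // A null "in reply to" id means the tweet is not a reply.
    isReply: tweet.in_reply_to_status_id_str !== null
  };
}

const summary = summariseTweet(raw);
console.log(summary);
```

The ids are read from the string version of the field because tweet ids are too large to survive as ordinary JavaScript numbers.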
One other Twitter basic that catches a lot of people out is the Search API is limited to the last 1500 tweets. So if you have a popular tag with over 1500 tweets in a day, at the end of the day only the last 1500 tweets are accessible via the Search API.
Archiving tweets for analysis
So there is potentially some rich data contained in tweets, but how can we capture this for analysis? There are a number of paid-for services like eventifier that allow you to specify a hashtag for archive/analysis. As well as not being free, the raw data isn’t always available. My solution has been to develop a Google Spreadsheet to archive searches from Twitter (TAGS). This is just one of many solutions, like pulling data directly using R and Tableau. The main advantage of this solution for me is that I can set it up and it’s happy to automatically collect new data.
This makes it easy to get overviews of the data using the built-in templates:
… or, as I’d like to spend the rest of this post, quickly looking at ways to create different views.
As you will no doubt discover using a spreadsheet environment to do this has pros and cons. On the plus side it’s easy to use built-in charts and formula to analyse the data, identifying queries that might be useful for further analysis. The downside is you are limited in the level of complexity. For example, trying to do things like term extraction, n-grams etc is probably not going to work. All is not lost as Google Sheets makes it easy to extract and consume the data in other applications like R, Datameer and others.
If you open this spreadsheet and File > Make a copy it’ll give you a version that you can edit. In cell A1 of the Archive sheet you should see the following formula =importRange("0AqGkLMU9sHmLdEZJRXFiNjdUTDJqRkNhLUxtZE5FZmc","Archive!A:K")
What this does is pull columns A to K from this sheet, where I’m already collecting LAK13 tweets (note this technique doesn’t scale well, so when LAK starts hitting thousands of tweets you are better doing manipulations in the source spreadsheet than using importRange. I’m doing it this way to get you started and to try some things out).
FILTER, FREQUENCY and QUERY
On the Summary sheet I’ve extended the summary available in TAGS by including weekly breakdowns. The entire sheet is made with a handful of different formulas used in slightly different ways with a dusting of conditional formatting. I’ve highlighted a couple of these:
FILTER – returns an array of the dates on which the person named in cell B2 tweeted in the archive
FREQUENCY – calculates the frequency distribution of these dates based on the dates listed in S15:S22 and returns a count for each distribution in rows starting from the cell the formula is in
TRANSPOSE – converts the values from a vertical to a horizontal array so it fills values across the sheet and not down
cell P2 =COUNTIF(H2:O2,">0")
counts the values in row 2 from column H to O that are greater than zero, giving the number of weeks the user has participated
cells H2:O – conditional formatting
cell B1 =QUERY(Archive!A:B," Select B, COUNT(A) WHERE B <> '' GROUP BY B ORDER BY COUNT(A) desc LABEL B 'Top Tweeters', COUNT(A) 'No.'",TRUE)
QUERY – allows you to use Google’s Query Language which is similar to SQL used in relational databases. In the example using the data source as columns A and B in the archive sheet we select columns B (screen name of tweeter) and count of A (could be any other column with a unique value) where B is not blank. The results are grouped by B (screen name) and ordered by count. The query also renames the columns.
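If spreadsheet formulas aren’t your thing, the same ideas can be sketched in a few lines of JavaScript. This is only an illustration of the logic, not what the spreadsheet does internally: the rows and dates are made up, the function names are mine, and weeklyCounts is a simplified take on FREQUENCY that ignores anything after the last boundary.

```javascript
// Made-up stand-in for Archive!A:B: [tweet id, screen name] pairs.
const rows = [
  ['1', 'alice'], ['2', 'bob'], ['3', 'alice'], ['4', ''],
  ['5', 'alice'], ['6', 'bob'], ['7', 'carol']
];

// Roughly: Select B, COUNT(A) WHERE B <> '' GROUP BY B ORDER BY COUNT(A) desc
function topTweeters(rows) {
  const counts = new Map();
  for (const [id, screenName] of rows) {
    if (screenName === '') continue; // WHERE B <> ''
    counts.set(screenName, (counts.get(screenName) || 0) + 1); // COUNT(A) GROUP BY B
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1]) // ORDER BY COUNT(A) desc
    .map(([name, n]) => ({ 'Top Tweeters': name, 'No.': n })); // LABEL
}

// In the spirit of FREQUENCY: count dates into bins that end on each boundary.
// Simplification: dates after the last boundary are ignored.
function weeklyCounts(dates, weekEnds) {
  const counts = new Array(weekEnds.length).fill(0);
  for (const d of dates) {
    const i = weekEnds.findIndex(end => d <= end);
    if (i !== -1) counts[i] += 1;
  }
  return counts;
}

// In the spirit of COUNTIF(H2:O2,">0"): number of weeks with any activity.
function weeksActive(counts) {
  return counts.filter(n => n > 0).length;
}

const weekEnds = ['2013-02-17', '2013-02-24', '2013-03-03'].map(s => new Date(s));
const tweetDates = ['2013-02-12', '2013-02-13', '2013-03-01'].map(s => new Date(s));
const perWeek = weeklyCounts(tweetDates, weekEnds);

console.log(topTweeters(rows));
console.log(perWeek, weeksActive(perWeek));
```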
QUERY Out
To give you some examples of possible queries you can use with data from Twitter, the spreadsheet you copied includes a Query sheet. Included are some sample queries to filter tweets with ‘?’, which might indicate questions (even if rhetorical), time-based filters and counts of messages between users.
The ability to export the data in this way opens up some other opportunities. Below is a screenshot of an ego/conversation centric view of #lak13 tweets rendered using the D3 JavaScript library. Whilst this view onto the archive is experimental, hopefully it illustrates some of the opportunities.
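For anyone who wants to try the same three ideas outside the spreadsheet, here is a rough JavaScript sketch. The tweets are made up, and the property names (from_user, text, time) are assumptions about how you have exported the archive rather than anything guaranteed.

```javascript
// Made-up tweets; from_user, text and time are assumed column names.
const tweets = [
  { from_user: 'alice', text: 'Is anyone else reading the week 1 paper?', time: '2013-02-11T09:00:00Z' },
  { from_user: 'bob', text: '@alice yes, nearly finished', time: '2013-02-11T09:05:00Z' },
  { from_user: 'alice', text: '@bob great, any thoughts?', time: '2013-02-12T10:00:00Z' },
  { from_user: 'carol', text: 'Slides are up #lak13', time: '2013-02-13T08:00:00Z' },
  { from_user: 'bob', text: '@alice @carol thanks both', time: '2013-02-13T08:30:00Z' }
];

// Tweets containing '?', which might indicate questions.
function questions(tweets) {
  return tweets.filter(t => t.text.includes('?'));
}

// Time-based filter: tweets made at or after a given moment.
function since(tweets, isoStart) {
  const start = new Date(isoStart);
  return tweets.filter(t => new Date(t.time) >= start);
}

// Counts of messages between users, keyed as "sender -> mentioned".
function messagesBetween(tweets) {
  const counts = {};
  for (const t of tweets) {
    const mentions = t.text.match(/@\w+/g) || [];
    for (const mention of mentions) {
      const key = t.from_user + ' -> ' + mention.slice(1);
      counts[key] = (counts[key] || 0) + 1;
    }
  }
  return counts;
}

console.log(questions(tweets).length);
console.log(since(tweets, '2013-02-12T00:00:00Z').length);
console.log(messagesBetween(tweets));
```

The mention matching here is deliberately crude (anything that looks like @name); the entities data returned with each tweet would be a more reliable source if you have it.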
Summary
Hopefully this post has highlighted some of the limitations of Twitter search, but also how data can be collected and the opportunities to rapidly prototype some basic queries. I’m conscious that I haven’t provided any answers about how this can be used within learning analytics beyond the surface activity monitoring, but I’m going to let you work that one out. If you want to see some of my work in this area you might want to check out the following posts:
For a couple of years now, to support my research in Twitter community analysis/visualisation, I’ve been developing my Twitter Archiving Google Spreadsheet (TAGS). To allow others to explore the possibilities of data generated by Twitter I’ve released copies of this template to the community.
What will happen to my existing TAGS sheets that aren’t version 5.0?
When Twitter turn off the old API (test outages this March) all authenticated and unauthenticated search requests will stop working.
How do I upgrade existing versions of TAGS spreadsheets (v3.x to v4.0) to keep collecting beyond March 2013?
As I can’t push an update to existing copies of TAGS you’ll have to manually update by opening your spreadsheet, then opening Tools > Script editor… and replacing the section of code that starts function getTweets() { and finishes 134 lines later (possibly with the line function twDate(aDate){ ) with the code here. [And yes I know that’s a pain in the ass but it’s the best I could do] … or you can just start a new archive using TAGSv5.0
Like a growing number of other people I’ve requested and got a complete archive of my tweets from Twitter … well almost complete. The issue is that while Twitter have done a great job of packaging the archives, even going as far as creating a search interface powered by HTML and JavaScript, as soon as you’ve requested the data it is stale. The other issue is, unless you have some webhosting, where can you share your archive to give other people access?
Fortunately, as Google recently announced site publishing on Google Drive, by uploading your Twitter archive to a folder and then sharing the folder so that it’s ‘Public on the web’ you can let other people explore your archive (here’s mine). Note: Mark Sample (@samplereality) has discovered that if you have file conversion on during upload this will break your archive. [You can also use the Public folder in Dropbox if you don’t want to use a Google account]
So next we need to keep the data fresh. Looking at how Twitter have put the archive together we can see tweets are stored in /data/js/tweets/ with a file for each month’s tweets and some metadata about the archive in /data/js/, the most important being tweet_index.js.
Fortunately not only does Google Apps Script provide an easy way to interface Drive and other Google Apps/3rd party services, but the syntax is based on JavaScript, making it easy to handle the existing data files. Given all of this it’s possible to read the existing data, fetch new status updates and write new data files, keeping the archive fresh.
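To give a flavour of the read, merge and write idea, here is a minimal sketch in plain JavaScript. It is not the code in the template. It assumes each monthly data file is a JavaScript assignment whose right-hand side is a JSON array of tweets; the variable name in the sample, the sample tweets and the function names are all hypothetical, and the Drive reading and writing is left out entirely.

```javascript
// Assumption: a monthly file looks like "someVariable = [ ...tweets... ]".
// The variable name and the tweets here are made up.
const existingFile = 'Archive.tweets_2013_01 = ' + JSON.stringify([
  { id_str: '290000000000000002', text: 'second tweet' },
  { id_str: '290000000000000001', text: 'first tweet' }
]);

// Split a data file into its assignment prefix and the parsed array of tweets.
function parseArchiveFile(source) {
  const start = source.indexOf('[');
  const end = source.lastIndexOf(']');
  return {
    prefix: source.slice(0, start),
    tweets: JSON.parse(source.slice(start, end + 1))
  };
}

// Merge freshly fetched tweets, dropping duplicates and keeping newest first.
function mergeTweets(existing, fresh) {
  const byId = new Map();
  for (const tweet of existing.concat(fresh)) {
    byId.set(tweet.id_str, tweet);
  }
  return [...byId.values()].sort((a, b) => {
    // Compare ids as BigInt because they are too large for ordinary numbers.
    const x = BigInt(a.id_str);
    const y = BigInt(b.id_str);
    return x < y ? 1 : x > y ? -1 : 0;
  });
}

// Produce the new contents for the data file.
function refreshArchiveFile(source, fresh) {
  const parsed = parseArchiveFile(source);
  const merged = mergeTweets(parsed.tweets, fresh);
  return parsed.prefix + JSON.stringify(merged, null, 1);
}

// Pretend these came back from a fresh request for status updates.
const fresh = [
  { id_str: '290000000000000003', text: 'new tweet' },
  { id_str: '290000000000000002', text: 'second tweet' }
];

const updated = refreshArchiveFile(existingFile, fresh);
console.log(parseArchiveFile(updated).tweets.map(t => t.id_str));
```

Keeping the original prefix means the rewritten file should still load in the archive’s own search interface, assuming the format guess above holds.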
To do all of this I’ve come up with this Google Spreadsheet template:
Note: There is currently an open issue which is producing the error message ‘We’re sorry, a server error occurred. Please wait a bit and try again.’ Hopefully the ticket will be resolved soon.
The video below hopefully explains how to set up and use it:
A nice feature of this solution is that even if you don’t publicly share your archive, if you are using the Google Drive app to sync files with your computer the archive stays fresh on your local machine.
The model this solution uses is also quite interesting. There are a number of ways to create interfaces and apps using Google Apps Script. Writing data files to Google Drive and having a static HTML based interface is ideal for scenarios like this one where you don’t rely on heavy write processes or dynamic content (aware of course that there will be some sanitisation of code).
It would be easy to hook some extra code to push the refreshed files to another webserver or sync my local Google Drive with my webhost but for now I’m happy for Google to host my data ;s
It looks like Twitter are finally rolling out the option to download all your tweets. As well as providing a nice offline search interface it appears that “the archive also includes CSV and JSON files, the latter complete with each tweet’s metadata”. I’m looking forward to seeing the data visualisations/mashups people come up with around their data.
Here’s the template (File > Make a copy) and follow the instructions if you want to try (please be aware of the Twitter Developer Rules of the Road). I’ve updated the code to make it compatible with version 1.1 of the Twitter API. One of the options I’ve added is a JSON dump which is saved to your Google Drive. It only took two lines of code using Google Apps Script HT +Romain Vialard
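// Serialise the search results and wrap them in a blob with a JSON content type
var blob = Utilities.newBlob(JSON.stringify(json), "application/json", filename);
// Save the blob as a new file in Google Drive
DriveApp.createFile(blob);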
[The JSON dump is a bit buggy – some issues with character escaping somewhere]
REF impact involves an assessment of “significance” as well as “reach,” so the mere fact that research has been disseminated to a wide audience does not constitute an impact by itself; one has also to show the effect it has on those to whom it is disseminated. For this reason, citing the fact that a researcher has appeared on a primetime radio show with several million potential listeners might be one element of an impact statement, but one needs also to evidence that the audience has actively listened to what was being put out, and that it has affected, changed or benefitted them in some way
In the age of the second screen Alistair goes on to highlight how Twitter can be used as evidence of engagement, listeners tweeting personal reflections, feedback or just disseminating the information more widely. But as Alistair points out:
When a piece of academic work receives broadcast media coverage, then, it is useful to have a strategy in place to gather emerging responses, and it is also far easier to do this as it happens rather than retrospectively.
A strategy is required because, as Alistair points out, the Twitter search is limited to the last 7 days. While there are ways to view this activity in realtime, how do you capture the evidence? Here’s my response to the problem:
I was recently asked to write a guest post for Big Data Week on using Google Apps as an interface for Big Data. For the post I decided to revisit an old recipe which uses Google Sheets (Spreadsheets) and Google Apps Script to interface the Twitter and Google Analytics APIs. One of the results is the bubble graph shown below which shows who has been tweeting my blog posts, how many visits their tweet generated, the number of retweets and how many followers the person has (click on the image for the interactive version). You can read more about how this was done and get a copy of the template in Getting Creative with Big Data and Google Apps
At IWMW12 I made a searchable/filterable version of TAGS Spreadsheets. This feature lets you use the Google Visualisation API to filter tweets stored in a Google Spreadsheet (more about TAGS). It has been available via a separate web interface for some time but I’ve never got around to publicizing it. As TAGSExplorer also uses the Google Visualisation API to wrap the same data in a different visualisation tool (predominantly d3.js) it made sense to merge the two. So now in any existing TAGSExplorer archive (like this one for #jiscel12) you should also have a button to ‘Search Archive’.
The archive view has some basic text filtering from tweeted text and who tweeted the message as well as a time range filter (dragging the handles indicated). The scattered dots indicate when messages were tweeted. The denser the dots, the more tweets made.
I’ve hastily thrown this together so feedback very welcome.
It’s here folks. The most advanced aggregation and visualisation of tweets for the JISC Innovating e-Learning 2012 online conference taking place next week. Over two years ago I started developing a Google Spreadsheet to archive tweets and since then not only have I been evolving the code, I’ve also been creating tools which use the spreadsheet as a data source. It’s pleasing to see these tools being used for a wide range of projects from citizen journalism, to a long list of academics, students and community groups, and even TV broadcasters.
I’ve been a little remiss in posting some of the latest developments and I’ll have to cover those soon. For now here’s your #jiscel12 Twitter basecamp.
Overview of features
Whilst it probably just looks like another spreadsheet you should explore:
[Contains a number of summaries – I find ‘most RTs in last 24hrs’ one of the most useful (how this works also needs documenting)]
Currently these are automatically updating every hour, but I’ll probably crank up the frequency next week. Your thoughts on these are always gratefully received ;)
For week 3 of cfhe12 analysis I thought I’d turn back to the Twitter data. I’m currently trying to prepare a Shuttleworth Fellowship application which has got me thinking more about the general premise of cMOOCs that “knowledge is distributed across a network of connections, and therefore that learning consists of the ability to construct and traverse those networks” (from week 1 of cck11).
The aspect which features in my Shuttleworth application is providing mechanisms that aggregate data from distributed sub-networks which can then be processed to produce actionable insights for tutors or participants. The process I plan to adopt is to look at the data using heavyweight tools, like NodeXL, or just applying a bit of curiosity (this person has stopped tweeting, why? etc), and then converting some of these patterns into very lightweight applications or views to remove the complexity and highlight key parts of the data.
Some examples for you:
Tweets
Tweets from CFHE12 are being collected in this Google Spreadsheet. As part of this template there are a number of summary views, one of these being a breakdown of individual participant activity. As part of this, sparklines are used to display someone’s Twitter activity. Looking at gsiemens you can see there is steady activity, posting 45 tweets tagged #cfhe12. Towards the bottom of the graph is ViplavBaxi, who after initial high activity is no longer contributing to the hashtag. So what has happened to ViplavBaxi? There are a number of possible answers but let me highlight a couple, which also highlight the limitations of the technique:
they have lost interest in the course or time commitments prevent them from contributing (high drop outs aren’t unexpected in MOOCs)
no longer using #cfhe12 hashtag – the archive is only of #cfhe12 so if they have joined a sub community communicating without the hashtag it’s not recorded
found a different communication channel – this technique is only looking at Twitter activity, the person may have moved to another network channel like the discussion forum
Another interesting activity summary is for dieGoerelebt. They are one of the top 5 contributors in terms of number of tweets, but recently their activity has trailed off. You can also see that the ‘@s’ column, which is the number of times they’ve been mentioned in tweets, is one of the lowest. Is the decline in activity a result of the lack of engagement?
The next question that springs to my mind is what did these people say. Within the spreadsheet it’s easy to filter what they said. To let you see too I’ve got this simple web interface primed with filtered tweets (I modified an existing tool I’ve developed to do this – unfortunately I’ve never documented it, but as I use it more and more I must get around to it):
From visual inspection dieGoerelebt had a high proportion of retweets. This was confirmed when I added the percentage of tweets that are retweets.
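A percentage like this is easy to reproduce outside the spreadsheet. The sketch below is only an illustration: the tweets and names are made up, retweetShare is a hypothetical helper, and it assumes a retweet can be spotted by its text starting with ‘RT @’, which will miss other styles of retweet.

```javascript
// Made-up tweets; from_user and text are assumed column names.
const sample = [
  { from_user: 'user_a', text: 'RT @user_b: week 3 readings are up' },
  { from_user: 'user_a', text: 'RT @user_c: notes from the webinar' },
  { from_user: 'user_a', text: 'My own reflections on week 3' },
  { from_user: 'user_a', text: 'RT @user_b: recording now available' },
  { from_user: 'user_b', text: 'week 3 readings are up' }
];

// Percentage of a person's tweets that look like retweets.
// Assumption: a retweet starts with "RT @".
function retweetShare(tweets, screenName) {
  const own = tweets.filter(t => t.from_user === screenName);
  if (own.length === 0) return 0;
  const retweets = own.filter(t => /^RT @/.test(t.text));
  return Math.round((retweets.length / own.length) * 100);
}

console.log(retweetShare(sample, 'user_a')); // 75
console.log(retweetShare(sample, 'user_b')); // 0
```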
Something I noted in the filtered view for a person’s tweets was that a lot of the context is lost (I can see they are @replying to someone, but I don’t know what they said).
So again I start exploring some ideas that branch off into many more avenues to follow. One thought is that the micro analysis of tweets might not be beneficial or practical, and given the issues with extracting a full conversation from Twitter a macro view might be better. Providing a summary of overall activity and the mode in which Twitter is being used by people may be of the most use to tutors and participants to identify people they might want to connect with. As always your thoughts are greatly appreciated.
In this post I’ve taken an ego-centric approach to contributions. In the next couple of days I’ll share an ego-centric approach to community connections.