Having experimented with the Twitter API for a couple of years I’m familiar with a number of the limitations, particularly when it comes to the Twitter Search API. The biggest being the maximum number of tweets (1,500) and timeframe (last 7 days) you can extract in one go. This means that whilst tools like NodeXL’s* the built-in import Twitter searches are convenient to grab network data for small and sort events you need to find an alternative data source for larger or extended timeframes.
*I recall reading Gephi was going to get a social network import plugin not sure of the status
In this post I highlight how you can use Google Spreadsheets to feed social network analysis tools like NodeXL and Gephi with larger conversation data sets. I’ll be working through this in NodeXL, but the core part of getting the edges data into a .csv format is applicable to other tools like Gephi.
The data source
There are a number of tools/services you can use to grab tweets over an extended time. You can install your own server based solutions like ThinkUp or the JISC funded YourTwapperKeeper, use desktop based tools like Archivist Desktop Version or even buy twitter data from datasift.
My solution is to use a Google Spreadsheet I’ve written, which sits in the cloud quietly consuming Twitter search terms. Here’s an example of a copy of that spreadsheet used to collect almost 8,000 tweets from ALT-C 2011 between the 31st August and 19th September. Using a Google Spreadsheet means a lot of the work has already been done for us. Items like who the made the tweet, what they said and when are all in separate cells and ready to be formatted into edge data, but some work is required as follows:
Formatting the data for import
- Whilst logged in to your Google account open the ALT-C Spreadsheet and File > Make a copy (this will allow you to edit your own version).
- On the Archive sheet insert 3 columns to the right of column A (from_user)
- In cell B2 insert the formula
=EXPAND(ArrayFormula(LOWER(A2:A)))– this fills the sheet with who the tweet is from_user in lowercase (Vertex 1 of our edge data)
An aside – Twitter @replies and bad data
For vertex 2 we need to extract if the tweet is a reply to another user and there are a couple of issues to be aware of.
- @replies and @mentions are different the conversations were extracting are @replies, that is between the tweeter and another user mentioned at the beginning of the tweet. Conversations that mention other users in the body of the tweet will be ignored (related post I did on Who can see your @reply in twitter)
The Twitter Search API is broken! In the spreadsheet you’ll see that Twitter has conveniently already returned a to_user_id_str if the tweet is an @reply. You’d think you could just lookup these user ids and get a username, but you can’t because as it says in the Search API “The user ids in the Search API are different from those in the REST API … This defect is being tracked by Issue 214 . This means that the to_user_id and from_user_id field vary from the actualy [sic] user id on Twitter.com”Update: twitter have fixed this problem so you can skip step 1 below if using the latest version (2.3) of my archiver spreadsheet
So to extract an @reply we have to extract this from the tweet. Here’s how:
- Back in your archive sheet enter the following formula into D2
=EXPAND(ArrayFormula(IF(LEFT(E2:E,1)="@",MID(E2:E,2,FIND(" ",E2:E)-2),)))– at first it may appear that nothing has happened but if you scroll down you should see some entries in the column. What the formula does is look if the tweet text in column E starts with an ‘@’ and if it does extract the text from the tweet from after the ‘@’ to the first space (this isn’t a full proof method but we can do some clean up later)
- To build our vertex 2 column we can then insert in cell C2
=EXPAND(ArrayFormula(LOWER(IF(D2:D="",A2:A,D2:D))))– what this does is look to see if there is a value in the D column and if there is insert this otherwise use the sender’s username (also converting it to lowercase)
- Finally, just so we can keep track of things put a column header in cells B1 and C1 (I used vertex_1 and vertex_2)
You can now export this sheet as a .csv by File > Download as > CSV (Current Sheet)
Formatting the data in NodeXL and making it useful
- Open the downloaded .csv file in Excel. At this point we need to do some tiding up of the data. Because not all users have left a space after the @username in their tweet and because formula we used wasn’t sensitive enough to parse only the username we need to remove some rouge characters. Selecting the vertex_2 column find and replace of non-alphanumeric (excluding ‘_’) characters e.g. ‘:’, ‘.’, ‘,’ and ‘#’ [If anyone would be willing to write a macro to automate all of this I would be very grateful ;)].
- Now open a new NodeXL template and from the NodeXL tab Import > From Open Workbook…
- Make sure your .csv file is selected, Columns have headers is ticked and vertex_1, vertex_2, text and date_time are ticked.
- Tick profile_image_url as ‘Is Vertex 1 Property Column’ as ‘Is Edge Column’.
- Finally make sure vertex_1 and vertex_2 are selected for Vertex 1 and 2, then click Import
You’re now free to analysis the data in any way you want.
Some more NodeXL tricks I use
After calculating the in/out degree I added 3 more Other columns to the Vertices called Tweets, Replies and Tooltip Build.
- In Tweets I used the formula
=COUNTIF(Edges!A:A,[@Vertex])which counts the number of times the vertex appears in the Vertex 1 of the Edge sheet ie how many tweets the user has made
- In Replies I used the formula =COUNTIFS(Edges!A:A,"<>"&[@Vertex],Edges!B:B,[@Vertex]) which counts if the vertex is mentioned in Vertex 2 but not Vertex 1 of the edges ie the number of @replies they got
- Finally in Tooltip Build I use
=[@Vertex]&" - Tweets: "&[@Tweets]&" @Replies: "&[@Replies]which I then use to autofill my tooltip for vertices labels
Some basic auto-filling later and here’s a graph of the #altc2011 twitter conversation (I need to do a follow-up post to look at this data in more detail ;):
So to briefly recap we took our big chunk of data which was already in a spreadsheet for us, chucked a couple of formula at it and got a network diagram with 7987 tweets in it … simples (he says having spent an *entire day* iterating this solution to this point so please share or comment to make it seem worthwhile ;)
Update: Here’s a copy of the NodeXL .xlsx file for the ALT-C 2011 data if you want to play (you’ll need NodeXL installed)