Free (and rebuild) the tweets! Export TwapperKeeper archives using Google Refine

Last month I posted Free the tweets! Export TwapperKeeper archives using Google Spreadsheet, which was a response to the announcement that TwapperKeeper would be removing public access to archives on the 6th January. This solution was limited to archives smaller than 15,000 tweets (although minor tweaking could probably get more). Since then Tony Hirst has come up with a couple of other solutions:

One of the limits these solutions have is they only collect the data stored on TwapperKeeper missing lots of other juicy data like in_reply_to, location, retweet_count (here’s what a tweet used to look like, now there is more data). Whilst this data is probably of little interest to most people for people like me it opens the opportunity to do other interesting stuff. So here’s a way you can make a copy of a Twapper Keeper archive and rebuild the data using Google Refine.

  1. You’re going to need a copy of Google Refine and install/run it
  2. Visit the Twapper Keeper page of the archive you want. On the page copy the RSS Permalink into the URL box in Google Refine adding &l=50000 to the end e.g. the ukoer archive is http://twapperkeeper.com/rss.php?type=hashtag&name=ukoer&l=50000 and click Next.
  3. In the preview window that appears switch ‘Parse data as’ to XML files. Scroll the top pane down to hover over the ‘item’ tag and click
    Refined parse xml 
  4. You should now have a preview with the data split in columns. Enter a Project name and click ‘Create Project’
    Refined Columns
  5. From the ‘item – title’ column dropdown select Edit column > Add column based on this column…
  6. In the dialog that opens enter a New column name ‘id_str’ and the Expression smartSplit(value," ")[-1] (this splits the cell and returns the last group of text)
  7. From the new id_str column dropdown select Edit column > Add column by fetching URLs… and enter a name ‘meta’, change throttle delay to 500 and enter the expression "https://api.twitter.com/1/statuses/show.json?id="+value+"&include_entities=true" (that’s with quotes), then click OK.  [What we’ve done is extract a tweet id and then asked refine to fetch details about it from the Twitter API]
  8. Wait a bit (it took about 4 hours to get data from 8,000 tweets)

Once you have the raw data you can use Refine to make new columns using the expression parseJson(value) then navigate the object namespace. You might find it useful to paste a cell value into http://jsonviewer.stack.hu/ to see the structure. So to make a column which extracts the tweets full name you’d use parseJson(value).user.name

So don’t delay ‘free the tweets’

15 thoughts on “Free (and rebuild) the tweets! Export TwapperKeeper archives using Google Refine

  1. Pingback:

    A Handful of Google Hacks… « OUseful.Info, the blog…

  2. Ralf

    What does “l=50000″ do? I expect it’s a limit on the number of tweets being fetched? Is there an upper limit you are aware of?

    1. Martin Hawksey

      Post author

      Yes and yes. It is the number returned and the max is 50000. You can get more by setting a last date of the last tweet. Tony Hirst used this technique used in the last link in the intro paragraph

  3. Hywel M

    I wrote a Python script last week to scrape the full details from a TwapperKeeper archive. Here’s a link: http://dl.dropbox.com/u/15813120/twkp_publiccode.py. When it runs it asks for the name of the archive (e.g. rhanidan) and the output path (e.g. C:\\Python26\\mytestfile.csv). Here’s a text file which will give you an idea of the format of the csv output file you’ll get: http://dl.dropbox.com/u/15813120/rhanidan.csv

    That csv file can be read straight into Refine and it will get split into sensibly named columns automatically.

    I hope that may be of some use. Thanks for your post (and previous ones). I’ve found them very useful.

    @hywelm

      1. Hywel M

        So many libraries! I can see how one could. Use a regex like:
        twtsrchid = re.compile(r’tweet_id=(\d+)’)
        tweetid = twtsrchid.search(dolen).group(1)
        tweet_stat = tweepy.api.get_status(tweetid)

        Then tweet_stat.text, tweet_stat.user_id etc would give the meta data which one could put in a list or dict and write out.

        My solution worked pretty well and given there’s only a couple of days left to grab the archives I’m not going to bother to revise. Thanks for the suggestion though.

        1. Martin Hawksey

          Post author

          Not a py coder myself but looks right. Main thing is you’ve got the raw data so you can get extra if need later.

          You might like the 3 links at the beginning of this post by Tony Hirst (ouseful.info) which all use python, the last one with a way to get archives bigger than 50k.

    1. Martin Hawksey

      Post author

      Hi Sarah – looks like its Refine tripping up on the import or a temporary glitch twapperkeeper side. You might want to open the rss feed in your browser and save the page source and then import the archive into refine as a local file
      Martin

      1. Sarah Marshall

        Thanks Martin
        I think Refine was conflicting with one of my browser extensions. It’s now working fine. Just waiting for the final stage. Thanks again!
        Sarah

  4. Pingback:

    Social Interest Positioning – Visualising Facebook Friends’ Likes « OUseful.Info, the blog…

  5. Pingback:

    Noch wenige Stunden mit TwapperKeeper! « LIBREAS.Library Ideas

  6. Hywel M.

    Yes, I’m aware of the good stuff Tony Hearst produces. I’d had a look at his most recent script and tried it. I’d not used feedparser before. If I recall correctly though, it didn’t get the meta data.

  7. Pingback:

    What can we learn from the #ukoer Twitter archive [day 24] – MASHe

Comments are closed.