Yesterday I got a query about search term clustering. It's not a topic I’ve got a huge amount of experience with, so a quick look at the Wikipedia section on free open-source data mining software and applications turned up ‘Carrot2: Text and search results clustering framework’. Carrot2 (C2) has a desktop application (Carrot2 Workbench) and its clustering engine is used in a number of other tools, including Apache Solr. You can also have a play with an online version of Carrot2. Out of the box you can use the desktop application to query and cluster a number of existing sources, including Bing and Wikipedia. If you want to play with other data sources you can point C2 at other XML feeds or even XML documents, as long as they are in Carrot2 format. The structure of this is relatively straightforward: all you need is a title, url and snippet. (The url appears to be used by one of the clustering algorithms and by the application interface to let you navigate to documents, so you could probably fill it with junk if you don’t have a valid link.)

To have a quick play with this I thought I’d see what would happen if I passed a Twitter archive for #or2012 into C2, and here’s the result.

Getting the data into C2

There are a number of ways I could have got the data out in C2 XML format, like exporting a range to CSV, converting it to XML and applying an XSLT stylesheet, or using the new Apps Script Content Service to generate a custom XML file. Instead, for speed, I decided to use Google Refine to import the spreadsheet straight from Google Docs:

Google Refine Create Project from Google Doc

... and then use the Templating Export tool to generate the C2 xml.

Google Refine Templating Export

For ease, here are the values I used for prefix, row template and suffix if you want to copy and paste.

Prefix

<?xml version="1.0" encoding="UTF-8"?>
  <searchresult>
    <query>#or2012</query>

Row template

    <document id="{{row.index}}">
      <title>{{cells["text"].value}}</title>
      <url>{{cells["status_url"].value}}</url>
      <snippet>
        {{cells["text"].value}}
      </snippet>
    </document>

Row separator

[single carriage return]

Suffix

</searchresult>
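
The row template above drops the tweet text into the XML as-is, so a tweet containing a bare & or < could leave the export malformed (that is a guess about what might be in the data, not something I hit here). A minimal Python sketch of a sanity check you could run on the exported file before handing it to C2; the function name and the sample document are made up for illustration:

```python
import xml.etree.ElementTree as ET

# The three fields the Carrot2 format needs for each document.
REQUIRED = ("title", "url", "snippet")

def check_carrot2_xml(xml_text):
    """Parse Carrot2-style XML and report documents with missing fields.

    Returns (document_count, problems), where problems is a list of
    (document id, missing field name) pairs. Raises ET.ParseError if the
    text is not well-formed XML (for example an unescaped '&' in a tweet).
    """
    root = ET.fromstring(xml_text)
    docs = root.findall("document")
    problems = []
    for doc in docs:
        for field in REQUIRED:
            element = doc.find(field)
            # Treat an absent element or an empty/whitespace-only one as missing.
            if element is None or not (element.text or "").strip():
                problems.append((doc.get("id"), field))
    return len(docs), problems

# Hypothetical sample in the same shape as the template output.
sample = """<searchresult>
  <query>#or2012</query>
  <document id="0">
    <title>Example tweet text</title>
    <url>http://example.com/status/1</url>
    <snippet>Example tweet text</snippet>
  </document>
</searchresult>"""

count, problems = check_carrot2_xml(sample)
print(count, problems)
```

If the parse fails, the error message includes the line and column, which should point at the offending tweet.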

If you want to play along, here is a copy of the #or2012 archive in C2 XML.
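
If you would rather skip Refine, the CSV route mentioned earlier can be sketched in a few lines of Python. This is only a rough sketch, not what I actually did: it assumes a CSV export with the same `text` and `status_url` columns used in the row template, the sample rows are invented, and it escapes special characters so the result stays well-formed.

```python
import csv
import io
from xml.sax.saxutils import escape

def rows_to_carrot2_xml(rows, query):
    """Build a Carrot2-style XML string from dicts with 'text' and 'status_url' keys."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        "<searchresult>",
        f"  <query>{escape(query)}</query>",
    ]
    for index, row in enumerate(rows):
        # escape() converts &, < and > so tweet text cannot break the XML.
        text = escape(row["text"])
        url = escape(row["status_url"])
        lines += [
            f'  <document id="{index}">',
            f"    <title>{text}</title>",
            f"    <url>{url}</url>",
            f"    <snippet>{text}</snippet>",
            "  </document>",
        ]
    lines.append("</searchresult>")
    return "\n".join(lines)

# Made-up sample rows standing in for a CSV export of the spreadsheet.
sample_csv = io.StringIO(
    "text,status_url\n"
    '"Slides & notes from my talk #or2012",http://example.com/status/1\n'
    '"RT @someone: great keynote #or2012",http://example.com/status/2\n'
)

xml_text = rows_to_carrot2_xml(csv.DictReader(sample_csv), "#or2012")
print(xml_text)
```

To use it on a real export, swap the `io.StringIO` sample for an open CSV file and write `xml_text` out with `encoding="utf-8"`.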

Processing the data in Carrot2

Open Carrot2 Workbench and, in the search panel, set the source to XML, pick an algorithm (STC gave a decent result) and set the XML resource (which can be the url to the Dropbox file included above or a local copy). Then scroll this panel down a bit to hit the Process button.

Carrot2 Workbench Search Panel

The results

Switching to ‘Visualization’ mode using the button at the top right of Workbench I get this view (click to enlarge):

Carrot2 Workbench Visualization Mode

The interface is designed for ‘exploratory analytics’. Clicking on elements like the ‘Clusters’ folder list updates the view in the Aduna Cluster Map and Circles Visualization, as well as listing the ‘Documents’ (tweets) related to the cluster. Clicking on a tweet from the Documents list or one of the dots in the Circles Visualization opens it in a tab within Workbench (hopefully that keeps Twitter happy for liberating their data ;s).

View a tweet in Carrot2

Useful/meaningful?

This has only been a quick first play, so I’m sure I’m missing loads in terms of tuning the processing. The STC algorithm appears to be very coarse-grained, detecting 16 or so clusters. It’s useful to have a cluster of retweets, which could be exported and further analysed. Switching to the Lingo algorithm generates 102 clusters, a number of these being RT+screen name. In some ways it would be useful to define some stopwords like ‘or2012’ and ‘RT’ (I’m sure an option must be in there). Part of the reason for publishing the raw data is the hope that someone who actually knows what they are doing can show me what is possible. So over to you: I’ve shown you how to get the data in, help me get something meaningful out ;)
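
One possible workaround for the stopword problem, until someone points me at the right Carrot2 setting, is to strip the noisy tokens from the tweet text before it goes into the snippet. A minimal Python sketch, assuming ‘RT’ and the hashtag itself are the only tokens worth dropping; the function name and token list are mine, not anything from Carrot2:

```python
# Tokens to drop before clustering, compared in lower case.
NOISE_TOKENS = {"rt", "#or2012", "or2012"}

def strip_noise(text, noise=NOISE_TOKENS):
    """Remove noise tokens such as 'RT' and the event hashtag from a tweet."""
    kept = []
    for token in text.split():
        # Ignore case and trailing punctuation, so 'RT:' and '#OR2012.' also match.
        bare = token.rstrip(":,.;!?").lower()
        if bare in noise:
            continue
        kept.append(token)
    return " ".join(kept)

print(strip_noise("RT @someone: Repository talk was great #or2012"))
```

This only cleans the text fed to the clustering; keeping the original tweet in the title would still let you read it in Workbench.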