Yesterday I got a query about search term clustering. It’s not a topic I’ve got a huge amount of experience with, so I had a quick look at the Wikipedia section on free open-source data mining software and applications, which turned up ‘Carrot2: Text and search results clustering framework’. Carrot2 (C2) has a desktop application (Carrot2 Workbench), and its clustering engine is used in a number of other tools, including Apache Solr. You can also have a play with an online version of Carrot2. Out of the box you can use the desktop application to query and cluster a number of existing sources, including Bing and Wikipedia. If you want to play with other data sources you can point C2 at other XML feeds, or even XML documents, as long as they are in Carrot2 format. The structure of this is relatively straightforward: all you need is a title, url and snippet for each document. The url appears to be used by one of the clustering algorithms and by the part of the application interface that lets you navigate to documents, so you could probably fill it with junk if you don’t have a valid link.
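To make the title/url/snippet structure concrete, here is a minimal Python sketch that builds a document in that shape. The element names (`searchresult`, `query`, `document`) are my understanding of the Carrot2 XML input format, so check them against the Carrot2 documentation; the function name, the sample data and the placeholder URL are all made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_carrot2_xml(docs, query=""):
    """Build a Carrot2-style XML string from a list of dicts.

    Assumption: the format is <searchresult> containing a <query> and
    one <document> per item, each with <title>, <url> and <snippet>.
    """
    root = ET.Element("searchresult")
    ET.SubElement(root, "query").text = query
    for i, doc in enumerate(docs):
        node = ET.SubElement(root, "document", id=str(i))
        ET.SubElement(node, "title").text = doc["title"]
        # No real link? Fall back to a junk placeholder URL (hypothetical).
        url = doc.get("url") or f"http://example.invalid/{i}"
        ET.SubElement(node, "url").text = url
        ET.SubElement(node, "snippet").text = doc["snippet"]
    # ElementTree escapes characters such as & and < for us.
    return ET.tostring(root, encoding="unicode")


sample = [
    {"title": "Tweet 1", "url": "http://example.com/1",
     "snippet": "First tweet about #or2012"},
    {"title": "Tweet 2", "snippet": "No link & some <markup>"},
]
xml_text = build_carrot2_xml(sample, query="or2012")
print(xml_text)
```

Building the tree with ElementTree rather than string concatenation means ampersands and angle brackets in tweets can’t break the XML.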
To have a quick play with this I thought I’d see what would happen if I passed a Twitter archive for #or2012 into C2, and here’s the result.
Getting the data into C2
There are a number of ways I could have got the data out in C2 XML format: exporting a range to CSV, converting it to XML and applying an XSLT style sheet, or using the new Apps Script Content Service to generate a custom XML file. Instead, for speed, I decided to use Google Refine to import the Spreadsheet straight from Google Docs:
... and then use the Templating Export tool to generate the C2 xml.
For ease, here are the values I used for prefix, row template and suffix if you want to copy and paste.
<?xml version="1.0" encoding="UTF-8"?>
[single carriage return]
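If you’d rather script it than use Refine, the same prefix / row template / row separator / suffix idea can be sketched in Python. This is only an illustration and not the template I used in Refine: the column names (`from_user`, `status_url`, `text`), the sample row and the element names are assumptions, so adjust them to match your own archive and the Carrot2 documentation.

```python
import csv
import io
from xml.sax.saxutils import escape

# Hypothetical template pieces, mirroring Refine's templating export fields.
PREFIX = ('<?xml version="1.0" encoding="UTF-8"?>\n'
          "<searchresult>\n<query>or2012</query>\n")
ROW_TEMPLATE = (
    '<document id="{index}">'
    "<title>{title}</title><url>{url}</url><snippet>{snippet}</snippet>"
    "</document>"
)
ROW_SEPARATOR = "\n"  # a single line break between rows
SUFFIX = "\n</searchresult>\n"


def rows_to_carrot2(rows):
    """Render an iterable of dict rows as Carrot2-style XML text."""
    rendered = [
        ROW_TEMPLATE.format(
            index=i,
            # escape() guards against &, < and > inside tweets.
            title=escape(row["from_user"]),
            url=escape(row["status_url"]),
            snippet=escape(row["text"]),
        )
        for i, row in enumerate(rows)
    ]
    return PREFIX + ROW_SEPARATOR.join(rendered) + SUFFIX


# Made-up CSV export with assumed column names.
csv_text = (
    "from_user,status_url,text\n"
    'alice,http://example.com/1,"RT @bob: talks & demos at #or2012"\n'
)
rows = list(csv.DictReader(io.StringIO(csv_text)))
output = rows_to_carrot2(rows)
print(output)
```

For a real export you would swap the `io.StringIO` sample for `open("archive.csv", newline="", encoding="utf-8")` and write `output` to a file that Workbench can read.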
If you want to play along here is a copy of the #or2012 archive in C2 xml
Processing the data in Carrot2
Open Carrot2 Workbench and, in the search panel, set the source to XML, pick an algorithm (STC gave a decent result) and set the XML resource, which can be the url to the dropbox file included above or a local copy. Then scroll this panel down a bit and hit the Process button.
Switching to ‘Visualization’ mode using the button at the top right of Workbench I get this view:
The interface is designed to be used for ‘exploratory analytics’. Clicking on elements like the ‘Clusters’ folder list updates the view in the Aduna Cluster Map and Circles Visualization, as well as listing the ‘Documents’ (tweets) related to the cluster. Clicking on a tweet from the Documents list or one of the dots in the Circles Visualization opens it in a tab within Workbench (hopefully that keeps Twitter happy for liberating their data ;s).
This has only been a quick first play, so I’m sure I’m missing loads in terms of tuning the processing. The STC algorithm appears to be very coarse-grained, detecting 16 or so clusters. It’s useful to have a cluster of retweets, which could be exported and further analysed. Switching to the Lingo algorithm generates 102 clusters, a number of these being RT+screenname. In some ways it would be useful to define some stopwords like ‘or2012’ and ‘RT’ (I’m sure an option must be in there). Part of the reason for publishing the raw data for this is the hope that someone who actually knows what they are doing can show me what is possible. So over to you: I’ve shown you how to get the data in, help me get something meaningful out ;)
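Until I find the stopword option, one workaround would be to clean the tweets before they go into the XML. The sketch below is plain Python preprocessing and has nothing to do with Carrot2’s own settings; the stop token list, the function name and the decision to drop @mentions are my own assumptions.

```python
import re

# Tokens that dominate every tweet in the archive (assumed list).
STOP_TOKENS = {"rt", "#or2012", "or2012"}


def clean_tweet(text):
    """Return (is_retweet, cleaned_text) for one tweet.

    Retweets are flagged rather than discarded so they can still be
    exported and analysed separately.
    """
    is_retweet = bool(re.match(r"\s*RT\b", text))
    # Drop @mentions so clusters don't form around RT+screenname.
    text = re.sub(r"@\w+:?", " ", text)
    tokens = [
        token for token in text.split()
        if token.lower().strip(".,:;!?") not in STOP_TOKENS
    ]
    return is_retweet, " ".join(tokens)


print(clean_tweet("RT @bob: Great keynote at #or2012!"))
```

The cleaned text would go into the snippet element, and the retweet flag could be used to split the archive into two files before clustering.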