
After posting Analysing WordPress post velocity and momentum stats with Google Sheets (Spreadsheet), Tony quickly came back with WordPress Stats in R, which includes a handy custom function for querying the WordPress Stats API. Something I wanted to do in my original post was to generate sparklines for post activity. I wasn’t able to find a way of doing this in Google Sheets because of the number of data points (366), and didn’t think Google Apps Script would handle the dataset because of timeouts. So given the start Tony had made with R it seemed like the ideal opportunity to refresh my R skills.

So below is the result (here’s the interactive version), which is followed by an explanation of how it was made.

SparkTable of postviews

Processing the data in R

Thanks to the code provided by Tony it’s easy to get postviews data back from WordPress using:

wp.postviews = wordpress.getstats.demo(APIKEY, BLOGURL, 'postviews', days='366', end='2012-12-31', period='day', limit=-1)

This returns a data frame that looks like this:

Note that 478 different post_titles are returned compared to 178 post_permalinks, showing that some of the data is missing and in fact wrong. This isn’t important for this particular project. I’ve uploaded a sample of the returned data here. The data returned includes a day stamp, post title and number of views. Posts with no views on that day are not returned.

For the sparklines I initially headed off down a dead end (lines 99-107), making a pivot table in R. This wasn’t entirely wasted effort because the process of casting this and handling null values was used later in the project. Having got stuck at this point, searching for more information on sparklines in R turned up this question on Stack Overflow and this set of slides referencing the sparkTable package. Initially I tried dropping and renaming some of the columns from the original data, but on line 96, when trying reshapeExt, I got:

Error in `[<-.data.frame`(`*tmp*`, , attr(dat, "reshapeLong")[["timevar"]],  : 
  replacement has 32504 rows, data has 32538

Searching for variations around the error message didn’t turn anything up, and if you know what is wrong I’d be very grateful for guidance. Instead I decided to follow a hunch: rather than using the partial time series data, I filled in the blanks by casting and then melting, i.e. going from this:

data from wordpress

to this (filling in missing data with ‘0’)

data from wordpress, cast

before going back to this

data from wordpress, then melted
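If you want to replicate the gap filling, here’s a minimal sketch of the cast-then-melt step, assuming the reshape2 package and a data frame shaped like the API result (columns date, post_title and views – the column names, and the code I actually used, may differ):

library(reshape2)

# example of the partial time series returned by the Stats API
wp.postviews <- data.frame(
  date = as.Date(c("2012-01-01", "2012-01-01", "2012-01-03")),
  post_title = c("Post A", "Post B", "Post A"),
  views = c(5, 2, 7)
)

# cast to a wide table (one column per day), filling missing
# day/post combinations with 0 ...
wide <- dcast(wp.postviews, post_title ~ date, value.var = "views", fill = 0)

# ... then melt back to long format, now a complete time series
long <- melt(wide, id.vars = "post_title",
             variable.name = "date", value.name = "views")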

Doing this got me past reshapeExt and let me output a sparkTable as HTML (the other option is LaTeX), embedding 478 sparkBars as PNG images. The final step was to wrap the HTML using jQuery DataTables (the main thing was to add <thead> and <tbody> tags).

Was it worth it? I’ll let you decide if the product is any good; the process of doing it was definitely useful. So how would you improve it?


Yesterday I got stuck into the first week of the Coursera course on Computing for Data Analysis. The course is about:

learning the fundamental computing skills necessary for effective data analysis. You will learn to program in R and to use R for reading data, writing functions, making informative graphs, and applying modern statistical methods.

You might be asking, given that I’ve already dabbled in R, why am I taking an introductory course? As I sat watching the lectures on my own (if anyone wants to do a Google Hangout and watch next week’s lectures together let me know) I reminisced about how I learned to swim. The basic story is: six year old boy, staying at a posh hotel for the first time, nags his parents to take him to the swimming pool, and when they get there he gets changed, runs off and jumps in at the deep end. When I eventually came back to the surface I assumed the doggy paddle and was swimming … well, ‘swimming’ in the sense that I wasn’t drowning.

The method of ‘throwing myself in’ is replicated throughout my life, particularly when it comes to learning. So whilst I’ve already thrown myself into R, I can survive but only just, and what I’ve produced is mainly the result of trying not to drown. This revelation was particularly clear when learning about subsetting (reshaping data).

I’ve got an example where I’ve been practicing my subsetting skills with NodeXL data later in this post, but first some quick reflections about my experience on the course so far.

MOOCing about in Coursera

So hopefully you’ve already got the picture that I’m a fairly independent learner, so I haven’t bothered with the built-in discussion boards, instead opting to view the lectures (I’m finding x1.5 speed suits me) and take this week’s quiz. The assignment due for week 2 has already been announced and people are racing ahead to get it done (which appears to have forced the early release of next week’s content).

Something apparent to me on the Coursera site is the lack of motivational cues. I’ve got no idea how I’m doing relative to my 40,000 fellow students, either in terms of watching the lectures or in this week’s quiz. Trying to get my bearings using the #compdata Twitter hashtag hasn’t been that successful because in the last 7 days there have only been 65 people using or mentioned with the tag (and of the 64 tweets, 29 were ‘I just signed up for Computing for Data Analysis #compdata …’).

Things are looking up on the Twitter front though as some recent flares have gone up:

and also @hywelm has made himself known ;)

Will there be much community building in the remaining 3 weeks?

Mucking about with NodeXL and R

In the section above I’ve mentioned various Twitter stats. To practice this week’s main compdata topics of reading data and subsetting I thought I’d have a go at getting the answers from a dataset generated in NodeXL (I could have got them straight from NodeXL but where is the fun in that ;).

Step 1 was to fire up NodeXL and import a Twitter Search for #compdata with all of the boxes ticked except ‘Limit to…’.

Twitter Search Import from NodeXL

As a small aside, I grabbed the NodeXL Options Used to Create the Graph used in this MOOC search by Marc Smith, hit the automate button and came up with the graph shown below (look at those isolates <sigh>):

The #compdata graph

To let other people play along I then uploaded the NodeXL spreadsheet file in .xlsx to Google Docs, making sure the ‘Convert documents …’ option was checked, and here it is as a Google Spreadsheet. By using File > Publish to the web… I can get links for .csv versions of the sheets.

In R I wrote the following script:
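(The script was embedded in the original post; as a flavour of the kind of thing it does, here’s a hedged sketch – the spreadsheet key, gid and column names below are placeholder assumptions, not the real values:)

# read the published Edges sheet straight from Google Spreadsheets
key <- "YOUR_SPREADSHEET_KEY"  # placeholder
url <- paste0("https://docs.google.com/spreadsheet/pub?key=", key,
              "&single=true&gid=0&output=csv")
edges <- read.csv(url, stringsAsFactors = FALSE)

# how many distinct people tweeted or were mentioned with #compdata?
length(unique(c(edges$Vertex.1, edges$Vertex.2)))

# subset the 'I just signed up ...' announcement tweets
signups <- edges[grepl("just signed up", edges$Tweet, ignore.case = TRUE), ]
nrow(signups)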

If you run the script you should see various answers pop out. As I’m learning, if anyone would like to suggest improvements please do. My plan is to keep adding to the data and extending the script as the weeks go by to practise my skills and see what other answers I can find.



Lou McGill from the JISC/HEA OER Programme Synthesis and Evaluation team recently contacted me as part of the OER Review asking if there was a way to analyse and visualise the Twitter followers of @SCOREProject and @ukoer. Having recently extracted data for the @jisccetis network of accounts I knew it was easy to get the information, but making it meaningful was another question.

There are a growing number of sites like twiangulate.com and visual.ly that make it easy to generate numbers and graphics. One of the limitations I find with these tools is that they produce flat images, and all opportunity for ‘visual analytics’ is lost.

Click to see twiangulate comparison of SCOREProject and UKOER
Twiangulate data
Click to see visual.ly comparison of SCOREProject and UKOER
create infographics with visual.ly

So here’s my take on the problem: a template constructed with free and open source tools that lets you visually explore the @SCOREProject and @ukoer Twitter following.

Comparison of @SCOREProject and @ukoer

In this post I’ll give my narrative on the SCOREProject/UKOER Twitter followership and give you the basic recipe for creating your own comparisons (I should say that the solution isn’t production quality, but I need to move on to other things so someone else can tidy up).

Let’s start with the output. Here’s a page comparing the Twitter following of SCOREProject and UKOER. At the top, each bubble represents someone who follows SCOREProject or UKOER (hovering over a bubble shows who they are, and clicking filters the summary table at the bottom).

Bubble size matters

There are three options to change how the bubbles are sized:

  • Betweenness Centrality (a measure of community bridging capacity; see Sheila’s post on this);
  • In-Degree (how many other people who follow SCOREProject or ukoer also follow the person represented by the bubble); and
  • Followers count (how many people follow the person represented by the node).

Clicking on the ‘Grouped’ button lets you see how the bubbles/people follow either SCOREProject, UKOER or both. By switching between betweenness, in-degree and followers count we can visually spot a couple of things:

  • Betweenness Centrality: SCOREProject has 3 well connected intercommunity bubbles: @GdnHigherEd, @gconole and @A_L_T. UKOER has the SCOREProject following them, which unsurprisingly makes them a great bridge to the SCOREProject community (if you are wondering where UKOER is, they don’t appear because they don’t follow SCOREProject).
  • In-Degree: Switching to In-Degree we can visually see that the overall volume of the UKOER group grows more, despite the SCOREProject bubble in this group decreasing substantially. This suggests to me that the UKOER following is more interconnected.
  • Followers count: Here we see SCOREProject is the biggest winner thanks to being followed by @douglasi, who has over 300,000 followers. So whilst SCOREProject is followed by fewer people than UKOER, it has a potentially greater reach if @douglasi ever retweeted a message.

Colourful combination

Sticking with the grouped bubble view we can see different colour groupings within the clusters for SCOREProject, UKOER and both. The most noticeable is the light green used to identify Group 4, which has 115 people following SCOREProject compared to 59 following UKOER. The groupings are created using the community structure detection algorithm proposed by Joerg Reichardt and Stefan Bornholdt. To give a sense of who these sub-groups might represent, individual wordclouds have been generated based on the individual Twitter profile descriptions. Clicking on a word within these clouds filters the table, so for example you can explore who has used the term ‘manager’ in their Twitter profile (I have to say the update isn’t instant, but it’ll get there).

wordclouds
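(As an aside for anyone curious, the Reichardt–Bornholdt algorithm is available in R via igraph’s spinglass.community – below is a tiny illustrative sketch using a made-up edge list, not the project code:)

library(igraph)

# toy follower edge list
edges <- data.frame(from = c("a", "b", "c", "a"),
                    to   = c("b", "c", "a", "c"))
g <- graph.data.frame(edges, directed = TRUE)

# spinglass needs a connected graph, so isolates/disconnected
# components would have to be dropped first
communities <- spinglass.community(as.undirected(g))
membership(communities)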

Behind the scenes

The bubble chart is coded in d3.js and based on the Animated Bubble Chart by Jim Vallandingham. The modifications I made were to allow bubble resizing (lines 37-44). This also required handling the bubble charge slightly differently (line 118). I got the idea of using the bubble chart for comparison from the Twitter Abused post Rape Culture and Twitter Abuse. It also made sense to reuse Jim’s template, which uses Twitter Bootstrap. The wordclouds are also rendered in d3.js using the d3.wordcloud extension by Jason Davies. Finally, the table at the bottom is rendered using the Google Visualisation API/Google Chart Tools.

All the components play nicely together although the performance isn’t great. If I have more time I might play with the load sequencing, but it could be that I’m just asking too much of things like the Google Table chart rendering 600 rows.

How to make your own

I should say that this recipe probably won’t work for accounts with over 5,000 followers. It also involves using R (in my case RStudio), which is used to do the network analysis/community detection side. You can download a copy of the script here. There’s probably an easier recipe that skips this part, which is worth revisiting.

  1. We start with taking a copy of Export Twitter Friends and Followers v2.1.2 [Network Mod] (as featured in Notes on extracting the JISC CETIS twitter follower network).
  2. Authenticate the spreadsheet with Twitter (instructions in the spreadsheet) and then get the followers of the accounts you are interested in using the Twitter > Get followers menu option.
  3. Once you’ve got the followers, run Twitter > Combine follower sheets Method II.
  4. Move to the Vertices sheet and sort the data on the friends_count column
  5. In batches of around 250 rows, select values from the id_str column and run TAGS Advanced > Get friend IDs – this will start populating the friends_ids column with data. For users with over 5,000 friends, reselect their id_str and rerun the menu option until the ‘next_cursor’ equals 0.
    next cursor position
  6. Next open the Script editor and open the TAGS4 file and then Run > setup.
  7. Next select Publish > Publish as a service… and allow anyone to invoke the service anonymously. Copy the service URL and paste it into the R script downloaded earlier (also add the spreadsheet key to the R script, and within your spreadsheet File > Publish to the web).
    publish as service window
  8. Run the R script! ...  and fingers crossed everything works.

The files used in the SCOREProject/UKOER comparison can be downloaded from here. Changes you’ll need to make are adding the output csv files to the data folder, changing references in js/gtable.js and js/wordcloud.js, and changing the labels used in coffee/coffee.vis.

So there you go. I’ve spent way too much of my own time on this and haven’t really explained what is going on. Hopefully the various comments in the source code remove some of the magic (I might revisit the R code as in some ways I think it deserves a post of its own). If you have any questions or feedback leave them in the comments ;)


Increasingly I find myself creating Twitter hashtag archives using my TAGS Google Spreadsheet template as a means to identify who in that community has the most influence and ultimately use this intelligence to target people that might be able to help me disseminate my work. Marc Smith at the Social Media Research Foundation has a useful overview on ‘How to build a collection of influential followers in Twitter using social network analysis and NodeXL’.

I don’t go to the extreme of seeking people to follow and gaining influence with retweets – I usually just follow interesting people who follow me – but the post introduces the important concept of:

  “betweenness centrality” – a measure of how much a person acts as a bridge between others.

(betweenness centrality (BC) was a big turning point in my interest and understanding of social network analysis, a moment captured by Sheila MacNeill)
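(If BC is new to you, a toy example in R might help: in a simple chain a – b – c, ‘b’ is the only bridge between ‘a’ and ‘c’, so it picks up all the betweenness.)

library(igraph)
g <- graph.formula(a - b, b - c)
betweenness(g)
#  a b c
#  0 1 0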

To date the only way I could calculate BC on an archive of tweets was to download the data to my desktop, run it through NodeXL and review the data. This isn’t ideal as the data becomes orphaned. I have experimented with calculating BC using Google Apps Script using a modified version of some PHP code put together by Jonathan Cummings, but kept hitting timeouts before I could get anything back.

I had forgotten about pursuing this angle until, that is, I saw Tony Hirst’s A Quick View Over a MASHe Google Spreadsheet Twitter Archive of UKGC12 Tweets, in which he uses the statistical computing and graphing tool ‘R’ to read a spreadsheet of archived tweets and produce some quick summary views (I highly recommend you read this post and also check the contribution from Ben Marwick in the comments). Reading this post made me think: if it is that easy to read and analyse data using R, could you not somehow push the results back?

Fortunately, and I do mean fortunately, I have no experience of R, R Script or RStudio (I like having no preconceived ideas of what new tools can do – it’s far more rewarding to throw yourself into the unknown and see if you make it out the other side), but I do know a lot about Google Apps Script, giving me a destination – just no way of getting there.

The idea, I think, is ingeniously simple. Read data, as Tony did, process it in R, and then use Apps Script’s ability to be published as a service to simply POST the data back to the original spreadsheet.

As that is quite complicated I’ll recap: fetch a Google Spreadsheet as a *.csv, do something with the data, and then push the data back in the same way that you post a web form (and if you skipped the link the first time, POST the data back to the original spreadsheet).
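(To make the round trip concrete, here’s a hedged sketch of the pattern using the RCurl package – the URLs, the ‘secret’ field and the idea of sending the results back as CSV text are my illustrative assumptions, not the actual script:)

library(RCurl)

# fetch the published sheet as csv
csvUrl <- "https://docs.google.com/spreadsheet/pub?key=YOUR_KEY&output=csv"
tweets <- read.csv(textConnection(getURL(csvUrl)), stringsAsFactors = FALSE)

# ... do something with the data ...

# POST the results back to the Apps Script service, like a web form
serviceUrl <- "https://script.google.com/YOUR_SERVICE_URL"  # placeholder
postForm(serviceUrl,
         secret = "YOUR_SECRET",  # matches the secret set in the script
         data = paste(capture.output(write.csv(tweets, row.names = FALSE)),
                      collapse = "\n"))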

Having sunk a day of my own time into this (and it is my own time because on the OER Visualisation project I’m only paid for the hours I work on it), I’m not going to go into the details of how to set up R (or in my case RStudio) to do this – hey, I learned it in a couple of hours so can you – instead I’ll give you the bits and pieces you need and general instructions. Before I start you might want to see if the result is worth it, so here’s a sheet of SNA stats for the #ukgc12 archive.

SNA Stats 

Playing with some test data

To make it easier I start with a partially complete dataset. The scenario is I’ve got my archive and run options 1-3 in the TAGS – Advanced menu to get an Edges sheet of friend/follower information.

  1. Open this Google Spreadsheet and File > Make a copy (this is a fully functioning – if I haven’t broken it – copy of the next version of TAGS, so if you clear the Archive and run setup you can start collecting and using this with your own data).
  2. Once you’ve made your copy, select File > Publish to the web and publish the spreadsheet.
  3. In the new spreadsheet open Tools > Script editor… and Run > Setup (this gets a copy of the spreadsheet id needed to run as a service – in the normal scenario this is collected when the user authenticates the script with Twitter).
  4. Open Share > Publish as service… and check ‘Allow anyone to invoke’ with ‘anonymous access’, not forgetting to ‘enable service’. You’ll need a copy of the service URL for later on. Click ‘Save’.
    Publish as service
  5. Back in the script editor, on line 57 enter a ‘secret’ – this will prevent just anyone from uploading data while the service is in anonymous mode (you can choose to only enable the service when required for extra security).
  6. Open your install of R and load a copy of this script.
  7. There are four things to edit in this script:
    1. key – the spreadsheet key, the bit after https://docs.google.com/spreadsheet/ccc?key= and before the &hl… junk
    2. gid – the sheet number of the Edges sheet; unless you insert/use a different sheet this should always be 105 for a TAGS spreadsheet
    3. serviceUrl – the url you got in step 4
    4. secret – the same secret you entered in step 5
  8. You might also need to install the packages used – most of them are standard but you may need to get igraph, which is used to generate all the social network data
  9. Run the R script! It may take some time to read and write to Google Spreadsheets, so be patient

That’s it. If you go back to the spreadsheet (you may need to refresh), the SNA Metrics and Vertices sheets should be populated with data generated from R.
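(For the curious, the SNA metrics themselves boil down to a few igraph calls; here’s a minimal sketch, assuming an edge list of follower relationships – not the actual script:)

library(igraph)

edges <- read.csv("edges.csv", stringsAsFactors = FALSE)  # hypothetical file
g <- graph.data.frame(edges, directed = TRUE)

metrics <- data.frame(
  vertex      = V(g)$name,
  betweenness = betweenness(g),
  in_degree   = degree(g, mode = "in"),
  out_degree  = degree(g, mode = "out")
)
head(metrics)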

The Apps Script Magic

Here’s the Google Apps Script snippet used to handle the data being pushed from R:

I’ve commented most of it so you can see what is happening. While Apps Script has a debugger which lets you monitor execution and variables, it can’t intercept the POST, so I used the original POST/GET code to dump the data into some cells, then tweaked the script to read it from there to work out what needed to be done.

Final thoughts

I think this is a powerful model: reading selected data, processing it, and then uploading it back to the source. I’m also only using the very basics of igraph and I’m sure much more could be done to detect neighbourhoods, clusters and more. Also, I wonder if more of the friendship data collection could be done in R with the twitteR package (you R people really know how to make it hard to find info/help/support for your stuff ;). Right now I can get friend/follower info for a list of 250 users.

The intriguing aspect is just how much data you can push back to Apps Script, and, as there is a long list of Services, whether you could also handle binary data like chart images (perhaps down the Blob and then Document Service route, or maybe just straight into Charts).

I welcome any comments you have about this technique and particularly value any feedback (I’m not an SNA expert, so if there are errors in calculation or better measures I would welcome these).