6 Comments

In a recent post I showed how, given some urls, it was easy to generate a templated report as a Word document. This was partly done to show how blog posts from a JISC funded programme could be used to generate reports. One of the issues with this solution is that not all the projects used WordPress as their blogging platform, making it harder to get at some of the data. One thought I had was that rather than dictating a particular platform, JISC could aggregate all programme posts in a central database. This isn’t entirely new: at CETIS we already aggregate links to posts for selected programmes via our PROD database; the difference here would be that as well as a link, the post content would also be ingested, making it easy to query and analyse the data from a single source. As I recently wrote in Notes on technology behind cMOOCs: Show me your aggregation architecture and I’ll show you mine, this technique is common in cMOOCs, and in particular I highlighted how Digital Storytelling (DS106) already does this using the FeedWordPress plugin.

As part of a double hit I thought I’d see if there was a case for JISC implementing this type of infrastructure and, because I’ve got an interest in applying learning analytics to open online courses, have a look at the data stored by ds106 and see what analysis I could do. In this series of posts I share my notes and sketches around extracting FeedWordPress data, supplementing it with 3rd party sources and analysing dialogue/activity around posts. In this first post I’m going to outline how all the data was imported into a local WordPress database, queried to pull out a data slice, refined and mined, before visualisation/analysis. What we end up with is a view of how ds106 blog posts are interlinked by hyperlinks within the post body. In the next in the series I record some notes on comment extraction and finally how to get more activity data from social network sites.

The data source

If JISC were hosting the data then in theory it would be a lot easier to get to. As my test data comes from ds106 some extra preparatory work was required. Fortunately #ds106er Alan Levine (@cogdog) was able to provide me with a MySQL dump of data from their WordPress install. The dump contained over 700,000 lines of SQL commands and was over 500MB. Problem one was then how to get this back into a local MySQL database to make slicing, dicing and exporting easier.

For a local Apache/MySQL/PHP server I used XAMPP Lite. Importing the data wasn’t straightforward: despite tweaking the config I kept getting ‘allocated memory’ errors. The solution was ‘BigDump: Staggered MySQL Dump Importer’, which was able to automatically split and run the import. Even with this I got a couple of ‘MySQL has gone away’ errors, so I used Vim (a text editor that can handle very large files) to drop 5 blog post inserts (given ds106 had over 20,000 posts I can live with the loss).

The selected data export

WordPress database diagram

Those familiar with the backend of WordPress will know it has a relational structure with a number of tables holding different parts of the data. Because I knew I wanted to explore ‘ds106 dialogue’ and that to do this the data would have to be combined with other sources, I needed to export a slice of the data. Whilst I was doing this it also made sense to make some of the data more meaningful. For example, the wp_posts table, which has most of the data I needed, uses an author id number, whereas it would be better to get this as a display name by joining it with the wp_users data. Fortunately, because SQL is … well, a query language, this is relatively easy to do using the command below. To talk you through it: I’m selecting a couple of columns from the wp_posts table and joining it with some other data on the condition that it’s a published post. As well as the author display name you’ll see that I’m also returning a wfw:commentRSS value. This is the comment RSS feed caught by the FeedWordPress plugin (I think it comes from here). I got wind of this metadata after reading Jim Groom’s Displaying Distributed Comments on the Hardboiled Blog. I’ll show how this is used later.

SELECT wp_posts.id, 
       wp_posts.post_author, 
       wp_posts.post_date, 
       wp_posts.post_content, 
       wp_posts.post_title, 
       wp_posts.post_name, 
       wp_posts.guid, 
       wp_postmeta.meta_value AS comment_feed, 
       wp_users.display_name 
FROM   wp_posts 
       LEFT JOIN wp_postmeta 
              ON wp_posts.id = wp_postmeta.post_id 
                 AND wp_postmeta.meta_key = 'wfw:commentRSS' 
       LEFT JOIN wp_users 
              ON wp_posts.post_author = wp_users.id 
WHERE  wp_posts.post_type = 'post' 
       AND wp_posts.post_status = 'publish'

Running this query in phpMyAdmin gives us the option of exporting as a CSV, giving almost 20,000 blog posts (one per row) to play with, which is plenty.

The refinement

Using Google Refine we can import the csv file, choosing comma separated and headers in first row. Next we want to process the post content to extract outbound links. Below are the rough steps I used, which you can download and apply to your own project (the steps in the file vary slightly as I noticed that the guid links had ?p={number} instead of the post permalink. I ended up getting the permalinks by fetching the comment feed and extracting the link; more detail about this in the next post. Note to self: issue with non-permalinks).

  1. Create column post_as_date at index 3 based on column post_date using expression grel:value.toDate("Y-m-d H:m:s")
  2. Create column outbound_links at index 5 based on column post_content using expression grel:forEach(value.parseHtml().select("a[href~=]"),e,e.htmlAttr("href")).join("|||")
  3. Split multi-valued cells in column outbound_links
  4. Create column source_links at index 9 based on column guid using expression grel:row.record.cells["guid"].value[0]

The key step is 2., which mines all the posts for <a href> tags and extracts them into a new column, joining them together with ‘|||’.
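
If you wanted to do this mining step outside Refine, a rough equivalent in R (assuming the phpMyAdmin export was saved as a hypothetical ds106_posts.csv with the column names from the query above) would be something like:

# Pull every href="..." value out of each post body (a loose regex sketch;
# posts using single-quoted hrefs would need a broader pattern)
posts <- read.csv("ds106_posts.csv", stringsAsFactors = FALSE)

extract_links <- function(html) {
  m <- regmatches(html, gregexpr('href="([^"]*)"', html))[[1]]
  gsub('^href="|"$', "", m)
}

outbound <- lapply(posts$post_content, extract_links)
head(outbound[[1]])  # links found in the first post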


Using Refine we can split the outbound_links column to put one link per row (column dropdown, Edit cells > Split multi-valued cells). The last step is to make sure the post source link is included in each row by filling down the guid (step 4 uses row.record to do this the right and secure way). Filtering out blank outbound_link rows and exporting using the Custom tabular exporter (settings I used here), I get some data that looks like this:

The data meets Gephi and NodeXL

With this data we can open it in Excel and then import it into NodeXL using Import > From Open Workbook (more about that later), or alternatively, for Gephi, change the column headings in the .csv version to source and target and import via the ‘Data Laboratory’ tab (for graph manipulation in Gephi Tony Hirst has a great tutorial).

Sticking with Gephi for now, you can generate the images below. On the left is the entire graph, which depicts the ~59,000 links contained in ~20,000 ds106 blog posts. The image on the right is zoomed in on the inset, where we can start to see each webpage referenced as a dot, connected by a line for each link in a post. Whilst this is pretty, it’s also pretty useless.

All links from ds106 links Close-up of post links

So where to go with this? In the paper A Theoretical Proposal of Learning Communities Management Through the Analysis of the Organizational Blogosphere the authors propose using the measure of ‘betweenness centrality’ (BC) as a way of identifying influential learners within a blogging collective:

The betweeness centrality has been considered in literature (Marsden, 2002) as a way to find the most valuable nodes within a social network. The strategic function of these characters is fundamental for the system of learners, because they have the widest view of what happens in the net. This is due to the fact that they intercept the majority of the knowledge and information flows within the learning community. The people with the highest betweeness centrality can share their vision on the main knowledge claims of the community in an open debate.

Extrapolating this slightly, let’s consider individual blog posts rather than individual learners, using BC as a way to highlight posts that may bridge concepts, or support the community of learners in some way. Rather than examine all links in the blog posts, let’s first consider internal linking within the ds106 community.

To do this let’s turn back to NodeXL, which I find a lot easier to get data into and manipulate. Opening our export file from Google Refine and then using Import > From Open Workbook we can import all the data, source and target being edge properties and the rest Vertex 1 properties. This gives you over 75,000 connections between over 59,000 links, back at square one. A nice feature of NodeXL is being able to tell it to ‘skip’ some data (edge, vertex or group). The data is kept in the spreadsheet so you can include it later but it’s not used for any calculations. I only want to analyse ds106 posts that link to other ds106 posts. Here are brief notes on the way I did it (other ways are possible and might even be better).

Autofill options

On the ‘Edges’ sheet in the ‘Other columns’ section I created 4 columns named Community Link, Self Link, Combined and Internal Link, then added the following formula in each column:

  • =IF(ISERROR(VLOOKUP([@[Vertex 2]], Vertices!A:AF,30 ,FALSE)),0,IF(LEN(VLOOKUP([@[Vertex 2]], Vertices!A:AF,30, FALSE))>12,1 ,0)) – because we only have dates, titles, names for posts made by the ds106 community, if this detail is blank on the vertices sheet then the edge is outwith the community (if it exists enter 1, otherwise 0) Tip: after doing this I copied all the Community Link values and Paste Special, Values to save recalculation on edit/open.
  • =IF([@[Vertex 1]]=[@[Vertex 2]],0,1) – test if edge is self-link
  • =[@[Community Link]]*[@[Self Link]] – multiplying these together as a simple operator
  • =IF(ISERROR(OR(FIND("/",[@[Vertex 1]],12), FIND("/",[@[Vertex 2]],12))), 0, IF(LEFT([@[Vertex 1]], FIND("/",[@[Vertex 1]], 12))=LEFT([@[Vertex 2]], FIND("/",[@[Vertex 2]],12)), 1,0)) – used to test if edge is two posts by the same author

 

Using the Autofill we can choose to skip edges based on the ‘Combined’ value (and while we are at it set the edge style based on the ‘Self Link’ value). Once this is done you can use NodeXL to calculate Groups and Metrics knowing that it’ll skip the edges it doesn’t need.

Once we’ve done all of this and applied some chart options this is the graph we get.

ds106 interlinked blog posts

Taking a closer look at one of the groupings (G1) shown below we can see how ds106 blog posts (the dots) link to each other, a dotted line indicating it’s the same post author referring to their own work.

Group G1

The first impression

We now have a dataset of ds106 blog posts and the webpages that they individually link to. There are some interesting patterns and more work to be done to interpret them. Whilst this technique has surfaced interlinks within the community, there is no context about why they were made. Ranking the posts by betweenness centrality we get the following top 10:

  1. http://bavatuesdays.com/the-daily-shoot-and-manic-ravings-about-ds106/
  2. http://bavatuesdays.com/innovation-in-elearning-interview/
  3. http://cogdogblog.com/2011/04/13/do-you-feel-lucky-wabbit/
  4. http://thisevilempire.com/blog/?p=442
  5. http://bavatuesdays.com/the-ds106-99-15-suburbia/
  6. http://bavatuesdays.com/the-ds106-99-13-valley-girl/
  7. http://gforsythe.ca/2011/04/15/ds106radio-memory-lane/
  8. http://bavatuesdays.com/week-12-13-assignment-el-mashup/
  9. http://cogdogblog.com/2012/02/19/week-5-in-review/
  10. http://cogdogblog.com/2012/02/14/calling-card-bullitt/

It’s interesting that most of these come from tutors (although given the amount of co-creation the role of tutor/student is very blurred), which in some ways isn’t that surprising given they will be summarising and highlighting key work.
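
NodeXL calculates the metric for us here, but as a sanity check (or for anyone working without NodeXL) a minimal sketch in R with the igraph package, assuming the filtered edge list has been exported to a hypothetical ds106_internal_links.csv with source and target columns, would be:

# Betweenness centrality of each post in the internal link graph
library(igraph)

edges <- read.csv("ds106_internal_links.csv", stringsAsFactors = FALSE)
g <- graph_from_data_frame(edges, directed = TRUE)

bc <- sort(betweenness(g, directed = TRUE), decreasing = TRUE)
head(bc, 10)  # the ten posts with the highest betweenness centrality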

The Limitations

Hopefully by this point you are screaming at the screen because of the limitations of this analysis. The focus has been on blog post content, which I hope I’ve shown can be used to identify interesting moments. The big limitation here is it’s looking at a narrow slice of activity - how students are linking to blog posts without any real context. Another staple of blogs is comments. Comments provide an easy way for a reader to feed back to the author, continuing the dialogue and sharing thoughts, ideas and reflections. It’s also all well and good me doing this, but the really interesting thing would be to put this graphical overview in the hands of tutors and students to see if it helps them. I imagine it wouldn’t be hard to wrap the MySQL query initially used in some PHP and a visualisation library like d3.js and provide users with some situational awareness of what is going on in their community. If you would like to explore the data the NodeXL file is here for download.

As this post was also partly about the feasibility of using FeedWordPress to aggregate JISC project blogs, here are some of my notes:

  • Exporting post content was possible with MySQL access (interfaces would be easy to design)
  • Issue encountered with how post links are stored (Blogspot/Blogger feeds use an internal guid, whereas other guids are links to the original posts). This was solved by fetching the comment feed and parsing the page link from the response
  • Need to be aware that FeedWordPress can't get content for partial RSS feeds, but plugins are available
  • Registering a feed is a manual process
  • Issue with mixed-mode feeds (if you are automatically pulling in all content and the feed source has mixed use, e.g. the blog is a mixture of personal and work, then the data isn't as clean)
  • Getting the data into Google Refine made it easy to mine post content, e.g. with one expression I calculated that ds106 has published over 4 million words (a rough sketch of this kind of count follows this list)
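
That word count was a single GREL expression in Refine; a very rough equivalent in R, assuming the same hypothetical ds106_posts.csv export used earlier, would be:

# Strip HTML tags then count whitespace-separated tokens across all posts
posts <- read.csv("ds106_posts.csv", stringsAsFactors = FALSE)
text <- gsub("<[^>]+>", " ", posts$post_content)
sum(sapply(strsplit(trimws(text), "\\s+"), length))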

There are other technicalities of FeedWordPress I'd want to explore like how post edits are handled. Overall it was a lot easier having a single data source.

Your Thoughts

At this point it would be very useful to have your thoughts on this technique. And remember I’m a ‘maker of things’ so your suggestions may get incorporated into a functional tool ;)

In my next post in this series I’ll share my notes on building and analysing comments data.

1 Comment

As I mentioned in Filtering a Twitter hashtag community for questions and responses I’ve been asked to do some analysis of the Current/Future State of Higher Education (CFHE12) course. Week 1 has mainly been about creating a toolchain that makes it easier to hit a button and get some insight. The focus has mainly been on tweets with the #cfhe12 hashtag. I’m still scratching my head as to what this all means but there are already discussions to extend the scope, trying to establish more context by also looking at blog and discussion forum posts. The danger I also have as a ‘maker of things’ is that as questions emerge I want to make things to help find the answers.

To ease into this let’s start with an overview. Here are some key stats for 7-13th October 2012 (BST) (and already I’m resisting the temptation to create an overview template):

  • 762 Tweets
  • 305 Links
  • 172 RTs
  • 244 Unique twitter accounts
  • 14% (n.104) of tweets were in @reply to another person using #cfhe12

This sheet contains more details including a summary of who tweeted the most and got the most @mentions and the ‘Dashboard’ sheet which let me know that this was the most retweeted tweet:

Below are two graphs summarising the Twitter activity for week 1 of #cfhe12 (LHS) and another course earlier in the year #moocmooc (you can click on both of these for interactive versions).

summary of #cfhe12 tweets for week 1
#cfhe12 week 1 tweets

Summary of tweets from #moocmooc
#moocmooc tweets

It’s notable that the volume and proportion of tweets and @replies is higher in #moocmooc. Part of this could be down to the fact that #moocmooc was a condensed course that was one week long. Other factors may include the chosen infrastructure and how this was promoted, size of course and who was participating.

Extracting a conversation graph, shown below, there isn’t a great deal of @reply activity for week 1. In the graph each dot represents a single tweet, and dots are joined if one tweet is an @reply to the other. I probably need to find a way for you to interact with this graph, but for now I’ve prepared these pages with conversations for groups G1-G4:

cfhe12 week 1 conversation graph
[The above graph data can be downloaded from the NodeXL Graph Gallery]
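
For anyone wanting to rebuild this kind of tweet-to-tweet reply graph outside NodeXL, here is a minimal sketch in R with igraph, assuming a TAGS-style archive exported as a hypothetical cfhe12_archive.csv with id_str and in_reply_to_status_id_str columns:

library(igraph)

tweets <- read.csv("cfhe12_archive.csv", colClasses = "character")
# Keep only tweets that are replies to another status
replies <- subset(tweets, !is.na(in_reply_to_status_id_str) & in_reply_to_status_id_str != "")

# One edge per reply: from the replying tweet to the tweet it replies to
g <- graph_from_data_frame(data.frame(from = replies$id_str,
                                      to   = replies$in_reply_to_status_id_str,
                                      stringsAsFactors = FALSE),
                           directed = TRUE)
components(g)$csize  # size of each conversation (weakly connected component)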

Exploring G3 and G4 some of the limitations of this technique become apparent. For example clicking on the date in the first tweet in G4 reveals the full text from Twitter, which includes text from G3 i.e. they are the same conversation and should be grouped together.

So more work to do, more things to think about, more tools needed to make sense of this easier. In the meantime any of your observations are greatly welcome.

9 Comments

In Notes on technology behind cMOOCs: Show me your aggregation architecture and I’ll show you mine I reached the point in my own mind that the key behind cMOOCs was how you aggregated and shared dispersed activity. At the time I also asked “Given the widespread use of Twitter in MOOCs are there tools/techniques required to aggregate and disseminate the course discussions?” and started looking at techniques to retrospectively analyse Twitter-based discussions. This activity hasn’t gone unnoticed and I was very grateful to be asked by Dave Cormier and George Siemens to do a weekly summary of Twitter data from their latest course Current/Future State of Higher Education (CFHE12) which started this week. This will be outwith my official CETIS work but given the increasing number of enquiries we are getting in this area it will undoubtedly feed in.

As I’ll be reporting on this course it made sense to sign up. On one of the registration pages I noticed a couple of different hashtags left over from an earlier course so asked the question:

Twitter status page

If you visit the Twitter status page for this tweet you’ll see I got a couple of responses from AJCann and Jez Cope. If I had not sent you to that page how would you have known I got an answer? Did Jez know that Alan had already responded to me?

Given that this type of dialogue, albeit at a higher level, is a key aspect of learning (many a Greek has dined out on ‘knowing that they know nothing’), I started wondering how this activity could be aggregated, whether that aggregation would increase the situational awareness of participants, and whether it would cause a shift in how the course community interacted with each other. (I had recently read Tony Hirst’s post on Conference Situational Awareness, and the example from the London 2012 Olympic Games “where it was identified that tweets relating to the congestion of the Olympic park entrances had a direct effect on crowd flow through the site” was still on my mind.)

So after some late night code bashing here’s what I’ve come up with (this is very beta so your feedback is welcome – particularly if it doesn’t work). A Filtered Aggregation of #CFHE12 questions and responses (embedded below if you are viewing this post on my site):

What you have here is an aggregation of possible questions from #cfhe12 with buttons to filter for messages with and without replies. Because it’s linked to Twitter’s own embed code users can do the usual Twitter actions (reply, retweet etc). As noted there are some limitations; perhaps the biggest is that it isn’t 100% reliable, in that I’ve got no way to include replies made without the #cfhe12 hashtag … in this version anyway.

I’ll let you go and play with it and hopefully you’ll share your thoughts. Two things that spring to mind for me are: it would be nice if this page had RSS feeds just to keep the aggregation juices flowing; and wouldn’t it be interesting to use tweet favouriting to let the community curate questions/answers, a favourite representing an upvote (see Techniques for Live Tweet Curation).

Make your own

*** Open and copy TAGS v3.1Q ***

Run through the Basic and Advanced setup used in the TAGS v3.1 (you need to authenticate with Twitter).

In the spreadsheet open Tools > Script editor and follow the ‘To use Filter Questions Interface’ instructions

Upgrading an existing TAGS v3.1+ Archive

  1. Open and copy TAGS v3.1Q and click on the ‘questionsFilter' sheet to make it active.
  2. Activate the sheet tab menu and choose ‘Copy to…’.
  3. Now find your existing TAGS archive spreadsheet and copy.
  4. Once it has copied open the destination and rename the new sheet from ‘Copy of questionsFilter’ to questionsFilter
  5. Open Tools > Script editor… in your old archive and select New > File > Script file. Call the new file TAGSExtras
  6. In the new script tab copy and paste the code from here, then save
  7. Run > setup twice (first time to authorise, second to run the function)
  8. File > Manage Versions and enter any description you like and Save New Version
  9. Publish > Deploy as web app... and click Update
  10. Run > getUrl and then open View > Logs... and copy the url into your browser address bar to view the result

How it was made (Non-techies you are free to leave ;)

The starting point was Twitter Archiving Google Spreadsheet TAGS v3. A hidden feature of this is to add a column to your Archive sheet called ‘possible_question’. When the archive collects tweets it looks for the text ‘? ‘ or a ‘?’ at the end to identify tweets that might be a question, and if so ‘TRUE’ is put in that column.
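
The check itself lives in the TAGS Apps Script, but the heuristic is simple enough to sketch, here in R:

# Flag tweets containing "? " or ending in "?" as possible questions
is_possible_question <- function(text) {
  grepl("\\? ", text) | grepl("\\?$", text)
}
is_possible_question(c("Most of us are using #cfhe12 ?", "No question here"))  # TRUE FALSE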

Having got a list of potential questions and associated tweet ids I could have put them in my failed lab experiment (and unfortunately titled) SpreadEmbed, but I noticed that the embed.ly API doesn’t return an in-reply-to message with its embed code. To expand on this, because it’s quite important: currently when you embed a tweet which is a reply you use something like this:

@mhawksey Most of us are using #cfhe12 ?

— AJCann (@AJCann) October 8, 2012

Although this text doesn’t include the message it is replying to, Twitter’s clever bit of JavaScript renders the original tweet above the reply, re-writing our little <blockquote> in the process.

Now you know why the page takes so long to render ;)

With this extra data we can use jQuery to find and filter tweets that have the class ‘twt-reply’.

To recap: using TAGS we can identify tweets that might be questions, and using a Twitter embed we can also automatically get the message a tweet is in reply to. So to display a question and answer together we only need to find the answer and Twitter will render the question it is in reply to (still with me?). The problem we’ve got is we can easily filter for questions (possible_question == TRUE) but not the answers. To do this I create a sheet of all the tweet id_strings that are questions (=QUERY(Archive!A:N,"select A WHERE N is not null LIMIT 50",FALSE)) and another where we know the tweet is in reply to something (=QUERY(Archive!A:N,"select A, K WHERE K starts with '2' LIMIT 50",FALSE)). For the last bit I need to write some Google Apps Script which replaces any question tweet id with the answer id, which gives us the ‘Combination of Qs and As’ column.
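
The id swapping is done with Google Apps Script in the spreadsheet; the logic itself is just a lookup, sketched here in R with toy ids (not the real archive):

# Toy data: three question ids, and one reply in the archive pointing at "101"
questions <- data.frame(id_str = c("100", "101", "102"), stringsAsFactors = FALSE)
replies   <- data.frame(id_str = c("200", "201"),
                        in_reply_to_status_id_str = c("101", "999"),
                        stringsAsFactors = FALSE)

# Where a question has a reply, use the reply's id so Twitter's embed
# renders the question above the answer
idx <- match(questions$id_str, replies$in_reply_to_status_id_str)
combined <- ifelse(is.na(idx), questions$id_str, replies$id_str[idx])
combined  # "100" "200" "102"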

Extracting question and answer ids

To render the tweets on a page we need to get the embed snippet using Twitter’s official oembed endpoint. Because getting the embed code needs authenticated access I again used Google Apps Script to fetch this data and cache the result. Using the Apps Script ContentService I can expose this by publishing the spreadsheet as a web app and serving up each tweet’s embed code in JSONP. For example here’s the JSONP-wrapped embed code for #CFHE12. The last part of the puzzle is some good old-fashioned HTML/JavaScript which renders the Twitter embed code and adds some UI (the code is here).

2 Comments

A little snippet of code I’m using to get share counts for a url from a number of services in one call. The idea is I pass a url and get some json back with counts from Facebook, Twitter, Delicious, Pinterest and Google +1s (if it’s a bad url or nothing is returned from a service then a null value is given).

json returned 

<?php
// Return share counts for a single url as JSON, e.g. ?url=http://example.com/
$url = $_GET['url'];
// URL-encode the target so it survives being embedded in each service's query string
$encodedUrl = urlencode($url);

// Facebook, Twitter, Delicious and Pinterest all offer simple count endpoints
$finfo = json_decode(file_get_contents('http://api.ak.facebook.com/restserver.php?v=1.0&method=links.getStats&urls='.$encodedUrl.'&format=json'));
$tinfo = json_decode(file_get_contents('http://urls.api.twitter.com/1/urls/count.json?url='.$encodedUrl));
$dinfo = json_decode(file_get_contents('http://feeds.delicious.com/v2/json/urlinfo/data?url='.$encodedUrl));
// Pinterest only returns JSONP, so strip the receiveCount() wrapper before decoding
$pinfo = json_decode(preg_replace('/^receiveCount\((.*)\)$/', "\\1", file_get_contents('http://api.pinterest.com/v1/urls/count.json?callback=receiveCount&url='.$encodedUrl)));
// Google +1 counts need a JSON-RPC POST (see function below)
$gplus = gplus_shares($url);

//http://papermashup.com/google-plus-php-function/
function gplus_shares($url){
    // G+ DATA
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "https://clients6.google.com/rpc?key=AIzaSyCKSbrvQasunBoV16zDH9R33D88CeLr9gQ");
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_POSTFIELDS, '[{"method":"pos.plusones.get","id":"p",
"params":{"nolog":true,"id":"' . $url . '","source":"widget","userId":"@viewer","groupId":"@self"},
"jsonrpc":"2.0","key":"p","apiVersion":"v1"}]');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-type: application/json'));
    $result = curl_exec ($ch);
    curl_close ($ch);
    return json_decode($result, true);
}

// Build the response; a service that failed or returned nothing gives NULL
$output = array(
    'facebook'=> isset($finfo[0]) ? $finfo[0]->total_count : NULL,
    'twitter'=> isset($tinfo->count) ? $tinfo->count : NULL,
    'delicious'=> isset($dinfo[0]) ? $dinfo[0]->total_posts : NULL,
    'pinterest'=> isset($pinfo->count) ? $pinfo->count : NULL,
    'googlePlus'=> isset($gplus[0]['result']) ? $gplus[0]['result']['metadata']['globalCounts']['count'] : NULL
);

// Serve the counts as a single-element JSON array
header('Content-Type: text/javascript');

echo "[".json_encode($output)."]";
?>

Usually I use sharedcount.com for this, but was worried that, given over 15k urls to get data on, I might trip their server (although I was given assurances that it was fine). I’ve deployed this code on a local webserver (XAMPP Lite) and it’s working well in parallel with Google Refine.

Here are some posts which have caught my attention this month:

Automatically generated from my Diigo Starred Items.

Yesterday I got stuck into the first week of the Coursera course on Computing for Data Analysis. The course is about:

learning the fundamental computing skills necessary for effective data analysis. You will learn to program in R and to use R for reading data, writing functions, making informative graphs, and applying modern statistical methods.

You might be asking, given that I’ve already dabbled in R, why am I taking an introductory course? As I sat watching the lectures on my own (if anyone wants to do a Google Hangout and watch next week’s lectures together let me know) I reminisced about how I learned to swim. The basic story is: 6 year old boy is staying at a posh hotel for the first time, nags his parents to take him to the swimming pool, and when they get there gets changed, runs off and jumps in at the deep end. When I eventually came back to the surface I assumed the doggy paddle and was ‘swimming’ … well, ‘swimming’ in the sense that I wasn’t drowning.

The method of ‘throwing myself in’ is replicated throughout my life, particularly when it comes to learning. So whilst I’ve already thrown myself into R, I can survive, but only just, and what I’ve produced is mainly the result of trying not to drown. This revelation was particularly clear when learning about subsetting (reshaping data).

I’ve got an example where I’ve been practicing my subsetting skills with NodeXL data later in this post, but first some quick reflections about my experience on the course so far.

MOOCing about in Coursera

So hopefully you’ve already got the picture that I’m a fairly independent learner. I haven’t bothered with the built-in discussion boards, instead opting to view the lectures (I’m finding x1.5 speed suits me) and take this week’s quiz. The assignment due for week 2 has already been announced and people are racing ahead to get it done (which appears to have forced the early release of next week’s content).

Something apparent to me in the Coursera site is the lack of motivational cues. I’ve got no idea how I’m doing in relation to my 40,000 fellow students in terms of watching the lectures or in this week’s quiz. Trying to get my bearings using the #compdata Twitter hashtag hasn’t been that successful because in the last 7 days there have only been 65 people using or mentioned with the tag (and of the 64 tweets, 29 were ‘I just signed up for Computing for Data Analysis #compdata …’).

Things are looking up on the Twitter front though as some recent flares have gone up:

and also @hywelm has made himself known ;)

Will there be much community building in the remaining 3 weeks?

Mucking about with NodeXL and R

In the section above I’ve mentioned various Twitter stats. To practice this week’s main compdata topics of reading data and subsetting I thought I’d have a go at getting the answers from a dataset generated in NodeXL (I could have got them straight from NodeXL but where is the fun in that ;).

Step 1 was to fire up NodeXL and import a Twitter Search for #compdata with all of the boxes ticked except Limit to… .

Twitter Search Import from NodeXL

As a small aside I grabbed the NodeXL Options Used to Create the Graph used in this MOOC search by Marc Smith, hit the automate button and came up with the graph shown below (look at those isolates <sigh>):

The #compdata graph

To let other people play along I then uploaded the NodeXL spreadsheet file in .xlsx to Google Docs making sure the ‘Convert documents …’ was checked and here it is as a Google Spreadsheet. By using File > Publish to the web… I can get links for .csv versions of the sheets.

In R I wrote the following script:

If you run the script you should see various answers pop out. As I’m learning this, if anyone would like to suggest improvements please do. My plan is to keep adding to the data and extending the script as the weeks go by to practise my skills and see what other answers I can find.
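
To give a flavour of the kind of subsetting involved, here is a minimal sketch; the URL is a placeholder for the ‘Publish to the web’ CSV of the Edges sheet, and the column names (Vertex.1, Vertex.2, Tweet) are assumptions about how read.csv munges the NodeXL headers:

edges <- read.csv("https://docs.google.com/spreadsheet/pub?key=YOUR-KEY&output=csv",
                  stringsAsFactors = FALSE)

# Unique accounts tweeting or being mentioned
length(unique(c(edges$Vertex.1, edges$Vertex.2)))

# How many tweets are "I just signed up ..." announcements?
sum(grepl("just signed up", edges$Tweet, ignore.case = TRUE), na.rm = TRUE)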

1 Comment

Via Doug Holton I spotted that there is a new YouTube beta feature to add multiple choice questions (MCQs) to your YouTube videos. Those who have already taken a course on Coursera (surely everyone has now ;) will know that video with MCQs is one of the staples. From the ‘Video Questions Editor Beta’ page it says:

Description

This is an opt-in beta for a simple Video Questions Editor on YouTube. Through this editor you can setup multiple questions to be displayed on top of your video during playback that a viewer can answer.

How can I use this feature?

The editor itself can be found on the video edit page, on the edit bar. After you have added several questions and the users have viewed them, you can see a summary of the interaction your users had with them through the analytics page, within the Annotations section.

Disclaimer

The feature represents work in progress, there is no plan for long-term support of the feature and may be removed at any time without prior notification. Your comments will help us improve and perfect the mixtures we're working on. So jump in, play around and send your feedback directly to the brains behind the scenes.

The page also includes a link to opt-in to the beta.

If you do, when you go to edit one of your videos you’ll get a ‘Questions’ button

Add question

Clicking on this lets you enter your question:

Question entry

Unfortunately when I click on save it hangs on ‘Your changes are being saved...’.

Bigger picture

It’s a shame that this feature doesn’t work yet. It’s interesting to put this development into the context of Google’s recent open source release of Course Builder, which was used as the technology behind their Power Searching with Google online course (xMOOC). So is Google trying to rock the boat in the mass online education market?

5 Comments

As part of the JISC OER Rapid Innovation Programme we’ve been experimenting with monitoring project blogs by gluing together some scripts in Google Spreadsheets. First there was Using Google Spreadsheets to dashboard project/course blog feeds #oerri which was extended to include social activity around blog posts.

As the programme comes to a close projects will soon be thinking about submitting their final reports. As part of this, projects agreed to submit a selection of their posts, using the pre-identified set of tags shown below, as an MS Word document.

  • projectplan – detailed project plan, either in the post or as an attachment
  • aims – reminder of the objectives, benefits and deliverables of your project
  • usecase – link to / reproduce the use case you provided in your bid
  • nutshell – 1-2 paragraph description in accessible language, an image, a 140 character description [1 post per project]
  • outputs – update posts on outputs as they emerge, with full links/details so that people can access them
  • outputslist – end of project: complete list of outputs, refer back to #projectplan and note any changes [1 post per project]
  • lessonslearnt – towards the end of the project, a list of lessons that someone like you would find useful
  • impact – end of project: evidence of benefits and impact of your project and any news on next steps
  • grandfinale – this is the follow-up to the nutshell post: a description in accessible language, and a 2 minute video [1 post per project]

 

OERRI Dashboard

When this was announced at the programme start-up, concerns were raised about the effort involved in extracting some posts into a document rather than just providing links. As part of the original experimental dashboard one thing I had in mind was to automatically detect the tag-specific posts and highlight which had been completed. Having got the individual post urls it hasn’t been too hard to throw a little more Google Apps Script at it to extract the content and wrap it in an MS Word document (well almost – if you have some HTML and switch the file extension to .doc it’ll open in MS Word). Here’s the code and template to do it:

And here are the auto-generated reports for each project:

Project (estimated posts) – PROD url – generated report / comments:

  • Attribute images (2) – http://prod.cetis.ac.uk/projects/attribute-image – no tagged posts
  • bebop (14) – http://prod.cetis.ac.uk/projects/bebop – Report Link
  • Breaking Down Barriers (10) – http://prod.cetis.ac.uk/projects/geoknowledge – Report Link
  • CAMILOE (1) – http://prod.cetis.ac.uk/projects/camiloe – no tagged posts
  • Improving Accessibility to Mathematics (15) – http://prod.cetis.ac.uk/projects/math-access – Report Link
  • Linked data approaches to OERs (15) – http://prod.cetis.ac.uk/projects/linked-data-for-oers – Report Link (partial RSS feed)
  • Portfolio Commons (10) – http://prod.cetis.ac.uk/projects/portfolio-commons – Report Link
  • RedFeather (18) – http://prod.cetis.ac.uk/projects/redfeather – Report Link
  • RIDLR (7) – http://prod.cetis.ac.uk/projects/ridlr – Report Link (not WP)
  • sharing paradata across widget stores (10) – http://prod.cetis.ac.uk/projects/spaws – Report Link
  • SPINDLE (17) – http://prod.cetis.ac.uk/projects/spindle – Report Link
  • SupOERGlue (6) – http://prod.cetis.ac.uk/projects/supoerglue – Report Link (not WP)
  • synote mobile (16) – http://prod.cetis.ac.uk/projects/synote-mobile – Report Link
  • TRACK OER (12) – http://prod.cetis.ac.uk/projects/track-oer – Report Link (not WP)
  • Xenith (4) – http://prod.cetis.ac.uk/projects/xenith – Report Link

Total estimated posts: 157

Issues

I should say that these are not issues I have with the OERRI projects, but my own issues I need to solve to make this solution work in a variety of contexts.

  • Missing tags/categories – you’ll see the dashboard has a number of blanks. In some cases it’s not the project’s fault (as the majority of projects used WordPress installs it was easier to focus on these), but in other cases projects mix tags/categories or just forget to include them
  • Non-WordPress – 3 of the projects don’t use WordPress, so other ways to grab the content are required
  • RSS summary instead of full feed – ‘Linked data approaches to OERs’ uses a summary in their RSS feed rather than full text. As this script relies on a full-text feed it can’t complete the report (one of my pet hates is RSS summary feeds – come on people, you’re supposed to be getting the word out, not putting up barriers)

Hopefully it’s not a bad start and if nothing else maybe it’ll encourage projects to sort out their tagging. So what have I missed … questions welcomed.

I should say this post contains a lot of technical information, doesn't give much background and is mainly for my hard-core followers.

This is a very loose sketch of an experiment I might refine, which uses Jason Davies’ wordcloud script (an add-on for d3.js) as a way to filter data hosted in a Google Spreadsheet. I was essentially interested in the Twitter account descriptions of the community using the Social Media Week – Glasgow hashtag, but a minor detour has reminded me that you can:

  • get json data straight from a Google Spreadsheet
  • you can build dynamic queries to get what you want

So I fired up NodeXL this morning and got this pretty graph of how people using the #smwgla hashtag at least twice follow each other.

people using the #smwgla hashtag at least twice follow each other

One of the features of NodeXL is to add stats columns to your data which includes friend/follower counts, location and profile descriptions.

NodeXL Stats

Uploading the data from NodeXL (Excel) to Google Spreadsheets allows me to render an interactive version of the community graph using my NodeXL Google Spreadsheet Graph Viewer.

interactive version of the #smwgla community graph

All this is doing is grabbing data from Google Spreadsheets using their Visualization API and rendering it visually using JavaScript/HTML5 canvas. You can use the query language part of this API to get very specific data back (if you want a play, try Tony Hirst’s Guardian Datastore Explorer). Using Tony’s tool I got this query built. One thing you might notice is I’m selecting a column of Twitter descriptions WHERE it contains(‘’) <- a blank – if it’s a blank why did I bother with the WHERE statement?

Switching to Jason Davies’ wordcloud demo, we can play with custom data sources if you have some JSON. In Tony’s tool you have options to get the data in html (tqx=out:html) and csv (tqx=out:csv). There is a third undocumented option for json, tqx=out:json. Using this we can get a url for the wordcloud generator https://spreadsheets.google.com/tq?tqx=out:json&tq=select%20AH%20where%20AH%20contains%28%27%27%29&key=0Ak-iDQSojJ9adGNUUXZnU2k3V1FRTjR3eFp0RmRNZWc&gid=118

To make the wordcloud interactive, so that when you click on a term it filters the data on that term, we can use the option to include {word} in our source url e.g. https://spreadsheets.google.com/tq?tqx=out:json&tq=select%20AH%20where%20AH%20contains%28%27{word}%27%29&key=0Ak-iDQSojJ9adGNUUXZnU2k3V1FRTjR3eFp0RmRNZWc&gid=118

And here is the final result, an interactive wordcloud of #smwgla Twitter account descriptions [Note: you need to hit the Go button when you click-through]:

interactive wordcloud of #smwgla Twitter account descriptions

The end result useful? Not sure, but how the data is extracted is (to me anyway).

4 Comments

importHTML is a fantastic formula you can use in Google Spreadsheets. Here’s Google’s support documentation for importHTML:

importHtml

Syntax: ImportHtml(URL, query, index)

URL is the URL of the HTML page. Either "list" or "table" indicates what type of structure to pull in from the webpage. If it's "list," the function looks for the contents of <UL>, <OL>, or <DL> tags; if it's "table," it just looks for <TABLE> tags. Index is the 1-based index of the table or the list on the source web page. The indices are maintained separately so there might be both a list #1 and a table #1.

Example: =ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4). This function returns demographic information for the population of India.

Note: The limit on the number of ImportHtml functions per spreadsheet is 50.

What’s even better is you can wrap this formula in other formulas to get the data in the shape you want. A case in point, I was recently asked:

Using TRANSPOSE

The answer is yes, you can TRANSPOSE an importHTML. Let’s use the Demographics of India table from the support documentation as an example. To switch columns into rows we can use =TRANSPOSE(ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4))

This lets us change the way the data is imported from this:

"=ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4)"

to this:

"=TRANSPOSE(ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4))"

Using QUERY

Let’s now say we are only interested in the population figures for 1991 and 2001. You could always just import all the data and then pull out what you need using a cell reference. Another way of doing this is to wrap our data in a QUERY formula.

The QUERY function is a built-in function that allows you to perform a query over an array of values using the Google Visualization API Query Language.

Anyone used to tinkering with databases will recognise the query language, which uses clauses like SELECT, WHERE, GROUP BY etc.

There are a couple of ways to query our data for the population of India in 1991 and 2001.

Using LIMIT and OFFSET

  • Limit - Limits the number of returned rows.
  • Offset - Skips a given number of first rows.

Using these we could use the query "SELECT * LIMIT 2 OFFSET 4". This selects all the columns (using *), skips the first 4 rows and returns the next 2. The order of limit/offset is important; using these the other way around won’t return any results.

"=QUERY(ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4),"SELECT * LIMIT 2 OFFSET 4 ")"

SELECT columns

  • Select - Selects which columns to return, and in what order. If omitted, all of the table's columns are returned, in their default order.

Because we are using importHTML as our datasource when selecting the columns we need to use the syntax Col1, Col2, Col3 …. So if you just want the year and population our query could be "SELECT Col1, Col2 LIMIT 2 OFFSET 4"

"=QUERY(ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4),"SELECT Col1, Col2 LIMIT 2 OFFSET 4 ")"

WHERE rows

  • Where - Returns only rows that match a condition. If omitted, all rows are returned.

One issue with using limit/offset is that if more data is inserted into the source table it might push your results out of the range. A way around this is to include a WHERE clause to only include data on certain conditions. WHERE allows various comparison operators like <=, =, >, multiple conditions (‘and’, ‘or’ and ‘not’) and more complex string comparisons like ‘contains’. More information on WHERE conditions here. So if we only want the population where the year is 1991 or 2001 we can use the query "SELECT Col1, Col2 where Col1='*1991*' or Col1='*2001*'"

For this last example lets also TRANSPOSE the result and remove the table header:

"=TRANSPOSE(QUERY(ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4),"SELECT Col1, Col2 WHERE Col1='*1991*' or Col1='*2001*'",0))"

So there you go, using the QUERY formula to be more selective with your HTML import into Google Spreadsheets. Here is a copy of the spreadsheet with all the examples I’ve used in this post. Any questions/clarifications, leave a comment.

PS Tony Hirst has also written about Using Google Spreadsheets Like a Database – The QUERY Formula, which is a good place to go if you want some more query examples.

PPS I’m on leave now which is why this post has very little to do with CETIS or OER.