Feeding Google Spreadsheets: Exercises in using importHTML, importFeed, importXML, importRange and importData (with some QUERY too)

On Monday I’ll be helping Tony (Hirst) deliver a workshop at Internet Librarian International … W2 – Appropriating IT: innovative uses of emerging technologies. Tony has already posted a draft outline of the day. For the event I wanted to prepare some notes on using Google Spreadsheets/Sheets (I very confused about what they are called now. There was a very quiet announcement here of a change).

I’ve come up with some practical exercises for participants to try. There will be a lot of contextualisation going on as part of the workshop, the main idea is to seed some ideas of what’s possible. Here’s the Google Docs version of the guide. The focus is on what you can do with the importHTML, importFeed, importXML, importRange and importData spreadsheet formula with a few others, including QUERY, thrown into the mix. Using QUERY isn’t straight forward but it opens up lots of possibilities in how you reshape data (extra examples of these are included in the Google Docs version).

Note: All text and images with the exception of text in grey boxes is licensed Creative Commons Attribution 3.0 Unported License. CC-BY Hirst & Hawksey (if this isn’t clear check the source or get in touch)

importHtml

Syntax: ImportHtml(URL, query, index)

URL is the URL of the HTML page.

Query is either “list” or “table” indicates what type of structure to pull in from the webpage. If it’s “list,” the function looks for the contents of <UL>, <OL>, or <DL> tags; if it’s “table,” it just looks for <TABLE> tags.

Index is the 1-based index of the table or the list on the source web page. The indices are maintained separately so there might be both a list #1 and a table #1.

Example: =ImportHtml(“http://en.wikipedia.org/wiki/Demographics_of_India“; “table”;4). This function returns demographic information for the population of India.

Note: The limit on the number of ImportHtml functions per spreadsheet is 50.

Exercise 1: Importing a html table and graphing the result

  1. Create a new Google Spreadsheet
  2. In cell A1 enter the formula to import a table from Wikipedia=importHTML("http://en.wikipedia.org/wiki/2012_Summer_Olympics_medal_table","table",3)
  3. Select Insert > Chart and then select the data ranges for country name and total medals
    Selecting chart data
  4. While still in the Chart Editor select the Charts tab, then Maps > Geo charts – regions
  5. Still in the Chart Editor select Customise and change the No value colour to blank.
  6. Finally click Insert

You should now have a chart that looks like this:

World map with missing data

Notice that the chart has No values (black) for most of the countries.  To fix this we need to remove the country codes in brackets. One way to do this is trim the text from the left until the first bracket “(“. This can be done using a combination of the LEFT, FIND and ARRAYFORMULA (ARRAYFORMULA allows you to apply formulas to multiple cells).

  1. In cell H2 enter the formula =ARRAYFORMULA(LEFT(B2:B86,FIND("(",B2:B86)-2)) this should create a column of country names without brackets)
  2. Click on your Chart and select Advanced edit.
  3. Make sure you are on the Start tab in the Chart Editor and edit the data range forSheet1!B2:B86 to Sheet1!H2:H86 then click Update

Your chart should now look like this (my solution http://goo.gl/8qUI9):

World map without missing data

importFeed

Syntax: =ImportFeed(URL; [feedQuery | itemQuery]; [headers]; [numItems]). The arguments to the function are as follows:

URL is the url of the RSS or ATOM feed.

feedQuery/itemQuery is one of the following query strings: "feed", "feed title", "feed author", "feed description", "feed url", "items", "items author", "items title", "items summary", "items url", or "items created". The feed queries return properties of the feed as a whole: the feed’s title, the feed’s author, etc.Note: To get the data included in the feed, you need to do an "items" request.

  1. the "feed" query returns a single row with all of the feed information.
  2. the "feed <type>" query returns a single cell with the requested feed information.
  3. the "items" query returns a full table, with all of the item information about each item in the feed.
  4. the "items <type>" query returns a single column with the requested information about each item.
  5. if a query is given that begins with "feed", the numItems parameter isn’t necessary and is replaced by the option headers param.
  6. if a query is given that begins with "items", the numItems parameter is expected as the 3rd param, and headers as the 4th.
  7. headers – "true" if column headers is desired. This will add an extra row to the top of the output labeling each column of the output.

Example: =ImportFeed("http://news.google.com/?output=atom")

Note: The limit on the number of ImportHtml functions per spreadsheet is 50.

Exercise 2: Importing a RSS feed and getting social share counts

  1. Open http://goo.gl/aai2p
  2. In cell B5 enter the RSS feed url for a blog (or yo
    u can use http://feeds.feedburner.com/MASHe) and hit enter

You should end up with something like:

RSS social counts

An aside: Spreadsheet Addiction by Patrick Burns (http://goo.gl/P6pQP) – highlights the dangers of using spreadsheets for analytics. Particular issues include the ambiguity of a cell being a value or a formula. For example, if I sort cells on the value in the Twitter count column all the data is lost because cells are sorted as values but actually contain formula which get broken.

How it works

In cell A11 is the formula =IF(ISBLANK(B5),,IMPORTFEED(B5,"items",FALSE)) If the feed url is not blank this fetches the RSS feed defined in B5. Results are returned in cells A11:E30. You may have noticed that column E is hidden this is because it contains the feed item description.

The social share counts are returned by a custom function written in Google Apps Script (https://script.google.com). Google Apps Script is similar to Excel Macros, written using a JavaScript syntax. If you open Tools > Script editor in your spreadsheet you can see some of the custom script powering the spreadsheet. This includes the getSharedCount formula used in cells F11:F30 which passes the post url to the SharedCount.com API and returns social share counts. The code used is:

function getSharedCount(sourceLink){
//var url = "https://hawksey.info/blog/2012/02/oer-visualisation-project-fin-day-40-5/"
 var url = extractLink(sourceLink);
 var cache = CacheService.getPublicCache(); // using Cache service to prevent too many urlfetch
 var cached = cache.get("C"+url);
 if (cached != null) { // if value in cache return it
   //var test = cached.split(",")
   return cached.split(",");
 }
 try {
   var options =
   {
     "method" : "get",
     "contentType" : "application/json"
   };
   var response = UrlFetchApp.fetch("http://api.sharedcount.com/?url="+encodeURI(url), options);
   var data = Utilities.jsonParse(response.getContentText());
   var output = [];
   for (i in data){
     if (i == "Facebook"){
       output.push(data[i].total_count)
     } else {
       output.push(data[i]);
     }
   }
   cache.put("C"+url, output, 86400); // cache set for 1 day
   return output;
 } catch(e){
   Logger.log(e);
 }
}

For more examples of Google App Script see http://scoop.it/t/gas.

importXML

Syntax: =ImportXML(URL, query)

URL is the URL of the XML or HTML file.

Query is the XPath query to run on the data given at the URL. Each result from the XPath query is placed in its own row of the spreadsheet. For more information about XPath, please visithttp://zvon.org/xxl/XPathTutorial/Output/.

Example: =importXml("http://www.toysrus.com"; "//a/@href"). This returns all of the href attributes (the link URLs) in all the <a> tags on http://www.toysrus.com homepage.

Note: The limit on the number of ImportXML functions per spreadsheet is 50.

Exercise 3: Turn a page of RSS Feeds into an OPML file

  1. Create a new spreadsheet
  2. In cell A1 enter the text ‘Title’ and in cell B2 ‘Url’
  3. Now in cell A2 enter the formula=ImportXML("http://edfuture.mooc.ca/feeds.htm","//b/a")
  4. Cell B2=ImportXML("http://edfuture.mooc.ca/feeds.htm","//a[.=’XML’]/@href")
  5. File > Publish to the web and click ‘Start publishing’, copy the link in the bottom box then ‘Close’
  6. Visit http://opml-generator.appspot.com/ and paste your spreadsheet link in the box and copy the generated link into your broswer address bar
How it works

Using XPath we can identify parts of a XML (including HTML) page we want to extract. The screenshow below shows how parts of the page are identified. [I always struggle with XPath so use browser extensions to help (Scraper and XPath Helper)]. The results are pulled into the spreadsheet as live data so if the source page is updated the data in the spreadsheet will also be updated.

XPath parts

ImportRange

Syntax: =ImportRange(spreadsheet-key, range)

Spreadsheet-key is a STRING which is the key value from the spreadsheet URL.

Range is a STRING representing the range of cells you want to import, optionally including the sheet name (defaults to first sheet). You can also use a range name if you prefer. Given that the two arguments are STRINGs, you need to enclose them in quotes or refer to cells which have string values in them.

Example: =importrange("abcd123abcd123", "sheet1!A1:C10")

"abcd123abcd123" is the value in the "key=" attribute on the URL of the target spreadsheet and "sheet1!A1:C10" is the range which is desired to be imported.

Note: In order to use ImportRange, you need to have been added as a viewer or collaborator to the spreadsheet from which ImportRange is pulling the data. Otherwise, you’ll get this error: "#REF! error: The requested spreadsheet key, sheet title, or cell range was not found."

[Here’s http://goo.gl/b8FXC is a copy of the completed spreadsheet used in exercises 4, 5 and 6]

Exercise 4: Importing data from Guardian Datastore

  1. Visit http://goo.gl/j6RBU and click Get the Data, then DATA: download the full spreadsheet
  2. Keep this window open and create a new spreadsheet.
  3. In cell A1 enter=ImportRange("0AonYZs4MzlZbdFdrRGthS3dLVzlBdWVrV2lIbzZKY0E","Times Higher Education rankings 2012!A1:D104") – notice how the key and range are entered

Lets now reshape the data so we can generate some graphs. Lets first calculate the change in rank between 2011 and 2012

  1. In cell E2 enter the formula =B2-A2
  2. Fill this formula for the rest of the rows (there are a couple of ways of doing this including copying cell E2, highlighting the other empty cells in Column E and pasting, or whilst E2 is active grab and drag the bottom right corner of the cell
  3. Now we want to get a list of the countries included in column D. To do this in cell G2 enter the formula =UNIQUE(D2:D102)
  4. Now we want to sum the rank difference per country by entering the following formula in cell H2:=SUMIF(D$2:D$102,G2,E$2:E$102)
  5. Copy this value down for the remaining rows
  6. Select the data range G2:H16 and Insert > Chart > Bar chart

University world ranking delta

Graph of change in top 100 world university ranking.

Is the process for producing this chart valid? Is it displaying a meaningful representation of the data? Is this chart a lie?

Important: Notice that the calculation for France has an error:

Broken value

This is because on row 93 the source data doesn’t have a number value. Because we’ve used ImportRange to get the data we can’t edit it as are changes get overwritten by the importRange formula in cell A1. In our scenario we can remove the calculated value in E93 or use a formula to handle the error. Other ways around this are to flatten the imported data by copying all of it and paste as values (other solutions exist which we cover later) 

ImportData

Syntax: =ImportData(URL)

URL is the URL of the CSV or TSV file. This imports a comma- or tab-separated file.

Note: The limit on the number of ImportData functions per spreadsheet is 50.

Exercise 5: Importing CSV data from Google Maps

  1. In the spreadsheet you created for exercise 4 Insert > New Sheet
  2. In cell A1 of the new sheet enter the formula =FILTER(Sheet1!A:E,Sheet1!D:D="United Kingdom") to filter the data on sheet1 where column D is the United Kingdom
  3. Now in column F1 enter the formula =ImportData("http://maps.google.com/maps/geo?output=csv&q="&C2) and press return
  4. Copy this cell down for the remaining rows

You should now have a table that looks a little like this:

ImportData from Google Maps

Import… and QUERY

Syntax: =QUERY(data, query, headers)

Data – An array of data. For example: A1:B6, data!B2:H6, ImportRange(spreadsheet_key, [sheet!]range),FILTER(sourceArray, arrayCondition_1, …)

Query – A query string for applying data operations. The query operates on column IDs directly from the input range and uses a subset of the SQL language. For example, "select E," "select A , B," "sum(B),C group by C," "select D where D < ‘Nick’ ." In certain instances, for example when using FILTER as a data source, column identifiers are Col1, Col2 etc. For more information on creating queries read see Google Visualization API Query Language

Headers (optional) – A number specifying the number of header rows in the input range. If omitted, or set to -1, the number of header rows is guessed from the input range. This parameter enables transformation of multi-header rows range input to be transformed to a single row header input which the QUERY supports.

Exercise 6: Import and QUERY

  1. In the spreadsheet you created for exercise 4 Insert > New Sheet
  2. Form your original sheet (Sheet1) copy the importRange formula in cell A1
  3. In your new sheet (Sheet3) paste the formula you just copied inside a QUERY formula shown below

    =QUERY(ImportRange("0AonYZs4MzlZbdFdrRGthS3dLVzlBdWVrV2lIbzZKY0E","Times Higher Education rankings 2012!A1:D104"), "SELECT Col1, Col2, Col3, Col4, Col2-Col1 LABEL Col2-Col1 ‘Difference’")

You should now have a table that looks like this:

QUERY for difference

How it works

The QUERY function imports the data and using the Google query language selects columns 1 to 4 and also adds a fifth by taking the difference between Col2 and Col1, this new column is labeled as Difference. Notice that on row 93 the French entry no longer has an error, but is blank.

We could continue exercise 4 and get a summary chart using UNIQUE and SUMIF. An alternative would be to use the QUERY formula again to do this for us by:

  1. In Sheet 3 enter the following formula in cell G1 =QUERY(A:E,"SELECT D, SUM(E) WHERE D <> ” GROUP BY D ORDER BY D")

This time we are setting the data source as all the data in columns A to E in Sheet3. Next we are creating a query that selects column D and a sum of column E where D is no blank and grouped by the value in column D (the country names).

A reminder that here’s a copy of the completed spreadsheet used in exercises 4, 5 and 6 http://goo.gl/b8FXC

Summary/Keypoints

Hopefully you’ve seen that Google Sheets (Spreadsheets) can be a useful tool for importing data from other sources and reshaping it to fit your needs. The last set of exercises are more advanced so don’t worry if you don’t fully understand what is happening, they are there as an indication of what is possible and hopefully inspire you to find out more.

Three points worth remembering when using import formula:

  • The data is a live link – if the data source changes or disappears  your imported data will also change or disappear (this can be both an advantage and disadvantage).
  • The ambiguity of a cell – because a spreadsheet cell can a value and a formula sorting values generated by formulas can lead to unforeseen problems.
  • There are other ways of getting data into Google Sheets – this guide has only looked at ‘Import…’ formula for getting pulling data in. There are other ways of populating sheets using Google Forms, Google Apps Script and 3rd party services like IFTTT (for an example of this last one see ‘IFTTT: IF I do THAT on {insert social network/rss feed/other} THEN add row to Google Spreadsheet http://goo.gl/nB0vD)

A reminder that the Google Docs version of the guide is here and contains more examples. If you get stuck leave a comment.

chevron_left
chevron_right

Join the conversation

comment 22 comments
  • Tina

    I imported an RSS feed into the spreadsheet & it shows Title, Author, URL, and Date Created.
    For the social shares, it just shows #NAME in the first column with an error of “Unknown function name GETSHAREDCOUNT”. Not sure what I’m doing wrong?

    • Martin Hawksey

      If you didn’t copy the template (or the template didn’t copy correctly) then you’ll be missing the custom formula that fetches the data. When you open Tools > Script editor what do you see?

  • Tina

    I just did another copy and used a different RSS feed. The same fields were populated.
    F9 and F11 both returned #NAME errors and there’s no formulas in the other social share columns.
    Tools > script editor only has this “function myFunction()” so I’m missing something
    Very confused with what I’m doing wrong.
    Thanks

  • Tina

    Working now. What a great tool. Thank you!!

  • Volkan

    Hello Thanks for Saved my Day. That really working and help. I want to ask you this. When i tried to import data from csv source. Spreadsheet shows all of data in a row. Data looks like
    – mpcbm(94,[[‘458′,’a380’,’aloha airways ‘,[32.8661098,26.588951,13363.54,120.3,-1.82,-1.04],1362712241588,’458’],[‘460′,’AN140′,’Alphalead 15(MTA)’,[25.798486,-80.2858274,5.24,19.2,-0.8,0.28],1362712210179,’460′],[‘462′,’a380′,’Octa Air TGBr UA CLX-770-PT’,[25.7839661,-80.2159159,394.47,133.79,-15.14,-25.32],1362712254640,’462′],[‘476′,’md11′,’Phil. Intl. Airline’,[25.7887856,-80.2642534,58.96,91.42,-7.84,-3.31],1362770679316,’476′],[‘480′,’hughes’,’CAPTAIN AIMAN’,[-22.990834,-43.3724926,29.33,-16.6,-3.96,13.21],1362712248528,’480′],[‘848′,’cub’,’MumbaiSky’,[19.1475676,72.9643679,442.9,86.76,-2.12,-2.89],1362712239782,’848′],[‘1502′,’a380′,’AAV VP (Rick)’,- Data not hava column names. So how can i show this data in columns ?
    Best wishes.

    • Martin Hawksey

      Looks like either a bad csv encoded source or Google isn’t parsing properly. Are you about to share the csv source and I might be able to show you a workaround?

  • Volkan

    Sorry for answered late and thanks for your help. source is http://net.gefs-online.com:8080/map
    yeah its bad enconded looks like navigation data..
    Best wishes

  • Volkan

    Hello, i think i found which character they are using for seperating. works fine when i tried ‘, and show data 3 colums is working fine. But i tried using this over a table generator. how can i configure spreadsheet to seperate data with ‘, ..
    Best Wishes

  • Volkan

    Thank you Martin i never knew if it was a JSON table. now i think my problem get easier.Anyways thanks for your help… Your step by step turtioals really helped .
    Best Wishes

  • #ocTEL Using Google Spreadsheets for a basic analytic to find your fledging bloggers Jisc CETIS MASHe

    […] If you look at Sheet 1 you’ll see there is a static list of Feed urls (Column B). This was obtained by using the Chrome Screen Scrape extension. The reason it is static is I wanted a column where tutors could make comments and a dynamic import would screw this up. Looking at Sheet 2 cell A1 you can see how to do a dynamic list using the importXML function (see this post for more examples of spreadsheet import options).    […]

  • Badging Wikipedia Contributions | Finding Knowledge

    […] Scrape tables from user contributions (e.g. using the gdocs scraper – various options, importHtml doesn’t seem to play well so I’ll have to look through, Martin Hawksey’s great guide) […]

  • Feeding Google Spreadsheets: Exercises in using...

    […] In this post I work through some exercises in using the importHTML, importFeed, importXML, importRange and importData formula available in Google Sheets/Spreadsheet. As part of this I show how you …  […]

Comments are closed.

css.php