Reshaping importHTML data in Google Spreadsheet using QUERY and TRANSPOSE formula

importHTML is a fantastic formula you can use in Google Spreadsheets. Here’s Google’s support documentation for importHTML:

importHtml

Syntax: ImportHtml(URL, query, index)

URL is the URL of the HTML page. Query is either “list” or “table” and indicates what type of structure to pull in from the webpage. If it’s “list,” the function looks for the contents of <UL>, <OL>, or <DL> tags; if it’s “table,” it just looks for <TABLE> tags. Index is the 1-based index of the table or the list on the source web page. The indices are maintained separately so there might be both a list #1 and a table #1.

Example: =ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4). This function returns demographic information for the population of India.

Note: The limit on the number of ImportHtml functions per spreadsheet is 50.

What’s even better is you can wrap this formula in other formulas to get the data in the shape you want. A case in point: I was recently asked whether you can TRANSPOSE the data returned by importHTML.

Using TRANSPOSE

The answer is yes, you can TRANSPOSE an importHTML. Let’s use the Demographics of India table from the support documentation as an example. To switch columns into rows we can use =TRANSPOSE(ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4))
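
If it helps to picture what TRANSPOSE is doing, here is a rough plain-Python model of it. This is only a sketch: the table is a made-up placeholder (not the real Wikipedia data) and the transpose helper is my own illustration, not anything Google Spreadsheets exposes.

```python
# Hypothetical stand-in for an imported table: each inner list is one row.
table = [
    ["Year", "Population"],
    ["1991", "pop_1991"],
    ["2001", "pop_2001"],
]


def transpose(rows):
    """Swap rows and columns of a rectangular list of rows."""
    return [list(column) for column in zip(*rows)]


transposed = transpose(table)
for row in transposed:
    print(row)
```

Each original column (the years, the population figures) becomes a single row in the result.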

This lets us change the way the data is imported from this:

"=ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4)"

to this:

"=TRANSPOSE(ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4))"

Using QUERY

Let’s now say we are only interested in the population figures for 1991 and 2001. You could always just import all the data and then pull out what you need using a cell reference. Another way of doing this is to wrap our data in a QUERY formula.

The QUERY function is a built-in function that allows you to perform a query over an array of values using the Google Visualization API Query Language.

Anyone used to tinkering with databases will recognise the query language, which uses clauses like SELECT, WHERE, GROUP BY etc.

There are a couple of ways to query our data for the population of India in 1991 and 2001.

Using LIMIT and OFFSET

  • Limit – Limits the number of returned rows.
  • Offset – Skips a given number of first rows.

Using these we could use the query "SELECT * LIMIT 2 OFFSET 4". This selects all the columns (using *), skips the first 4 rows of data and then returns the next 2. The order of limit/offset is important: the query language expects LIMIT before OFFSET, and using them the other way around won’t return any results.
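
As a rough way of picturing LIMIT and OFFSET, the pair behaves much like a list slice. The sketch below is plain Python rather than anything QUERY runs, the rows are made-up placeholders, and it assumes the header row has already been set aside so only data rows are counted.

```python
def limit_offset(data_rows, limit, offset):
    """Skip the first `offset` rows, then keep at most `limit` rows."""
    return data_rows[offset:offset + limit]


# Hypothetical data rows (header row assumed to be handled separately).
data_rows = [[f"year_{i}", f"pop_{i}"] for i in range(1, 9)]

picked = limit_offset(data_rows, limit=2, offset=4)
for row in picked:
    print(row)
```

With limit=2 and offset=4 this keeps the fifth and sixth data rows.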

"=QUERY(ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4),"SELECT * LIMIT 2 OFFSET 4 ")"

SELECT columns

  • Select – Selects which columns to return, and in what order. If omitted, all of the table’s columns are returned, in their default order.

Because we are using importHTML as our datasource, we need to refer to columns using the syntax Col1, Col2, Col3 … when selecting them. So if you just want the year and population, our query could be "SELECT Col1, Col2 LIMIT 2 OFFSET 4"
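
To make the Col1, Col2 naming concrete, here is a small plain-Python sketch of picking columns by their 1-based ColN name. The select_columns helper and the sample rows are hypothetical and only there for illustration; QUERY does this for you.

```python
import re


def select_columns(rows, *names):
    """Return only the named columns, where Col1 is the first column."""
    indexes = []
    for name in names:
        match = re.fullmatch(r"Col(\d+)", name)
        if match is None:
            raise ValueError(f"expected a name like Col1, got {name!r}")
        indexes.append(int(match.group(1)) - 1)  # ColN is 1-based
    return [[row[i] for i in indexes] for row in rows]


# Hypothetical rows with three columns; only the first two are wanted.
rows = [
    ["1991", "pop_1991", "other_1991"],
    ["2001", "pop_2001", "other_2001"],
]

selected = select_columns(rows, "Col1", "Col2")
for row in selected:
    print(row)
```

The order of the names also sets the order of the output columns, so "Col2", "Col1" would swap them.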

"=QUERY(ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4),"SELECT Col1, Col2 LIMIT 2 OFFSET 4 ")"

WHERE rows

  • Where – Returns only rows that match a condition. If omitted, all rows are returned.

One issue with using limit/offset is that if more data is inserted into the source table it might push your results out of the range. A way around this is to include a WHERE clause to only include data that meets certain conditions. WHERE allows various comparison operators like <=, =, >, multiple conditions (‘and’, ‘or’ and ‘not’) and more complex string comparisons like ‘contains’. More information on WHERE conditions here. So if we only want the population where the year is 1991 or 2001 we can use the query "SELECT Col1, Col2 where Col1='*1991*' or Col1='*2001*'"
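
To see why a condition is more robust than limit/offset, here is a rough plain-Python model of filtering rows on the first column. The rows are made-up placeholders, and the substring test stands in for a ‘contains’-style comparison; it is an assumption for illustration and is not meant to reproduce exactly how QUERY evaluates the condition above.

```python
def where_col1_contains(rows, needles):
    """Keep rows whose first column contains any of the given strings."""
    return [row for row in rows if any(needle in row[0] for needle in needles)]


# Hypothetical data rows standing in for the imported table.
rows = [
    ["1981", "pop_1981"],
    ["1991", "pop_1991"],
    ["2001", "pop_2001"],
]
matched = where_col1_contains(rows, ["1991", "2001"])

# An extra row in the source shifts positions but not the matches.
rows_with_extra = [["1971", "pop_1971"]] + rows
matched_after_insert = where_col1_contains(rows_with_extra, ["1991", "2001"])

print(matched == matched_after_insert)  # True
```

A fixed LIMIT/OFFSET window over rows_with_extra would have slipped by one row, whereas the condition still finds the same two years.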

For this last example let’s also TRANSPOSE the result and remove the table header:

"=TRANSPOSE(QUERY(ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4),"SELECT Col1, Col2 WHERE Col1='*1991*' or Col1='*2001*'",0))"

So there you go: using the QUERY formula you can be more selective with your HTML import to Google Spreadsheets. Here is a copy of the spreadsheet with all the examples I’ve used in this post. If you have any questions or need clarification, leave a comment.

PS Tony Hirst has also written about Using Google Spreadsheets Like a Database – The QUERY Formula, and this is a good place to go if you want some more query examples.

PPS I’m on leave now which is why this post has very little to do with CETIS or OER.

About

This blog is authored by Martin Hawksey Google+

JISC CETIS Learning Technology Advisor (OER Programme Support)

Copyright License

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 3.0 Unported License. CC-BY mhawksey
