importHTML is a fantastic formula you can use in Google Spreadsheets. Here’s Google’s support documentation for importHTML:
Syntax: ImportHtml(URL, query, index)
URL is the URL of the HTML page. Either "list" or "table" indicates what type of structure to pull in from the webpage. If it's "list," the function looks for the contents of <UL>, <OL>, or <DL> tags; if it's "table," it just looks for <TABLE> tags. Index is the 1-based index of the table or the list on the source web page. The indices are maintained separately so there might be both a list #1 and a table #1.
Example: =ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4). This function returns demographic information for the population of India.
Note: The limit on the number of ImportHtml functions per spreadsheet is 50.
What’s even better is you can wrap this formula in other formula to get the data in the shape you want. A case in point I was recently asked:
@mhawksey is it possible in GS/sheet to use ImportHTML and transpose table data on import into columns if the same source table format?
— jamestoon (@jamestoon) September 13, 2012
The answer is yes, you can TRANSPOSE a importHTML. Let use the Demographics of India table from the support documentation as an example. To switch columns into rows we can use
This lets us change the way the data is imported from this:
Lets now say we are only interested in the population figures for 1991 and 2001. You could always just import all the data then pull it using a cell reference. Another way of doing this is to wrap our data in a QUERY formula.
The QUERY function is a built-in function that allows you to perform a query over an array of values using the Google Visualization API Query Language.
Anyone used to tinkering with databases will recognise the query language which uses the clauses like SELECT, WHERE, GROUP_BY etc.
There are a couple of ways to query our data for the population of India in 1991 and 2001.
Using LIMIT and OFFSET
- Limit - Limits the number of returned rows.
- Offset - Skips a given number of first rows.
Using these we could use the query
"SELECT * LIMIT 2 OFFSET 4". This selects all the columns (using *) and then limits to 2 results starting from the 4th row. The order of limit/offset is important, using these the other way around won’t return any results.
- Select - Selects which columns to return, and in what order. If omitted, all of the table's columns are returned, in their default order.
Because we are using importHTML as our datasource when selecting the columns we need to use the syntax Col1, Col2, Col3 …. So if you just want the year and population our query could be
"SELECT Col1, Col2 LIMIT 2 OFFSET 4"
- Where - Returns only rows that match a condition. If omitted, all rows are returned.
One issue with using limit/offset is if more data is inserted into the source table it might push your results out of the range. A way around this is to include a WHERE clause to only include data on certain conditions. WHERE allows various comparison operators like <=, =, >, multiple conditions (‘and’, ‘or’ and ‘not’) and more complex string comparisons like ‘contains’. More information on WHERE conditions here. So if we only wan the population where the year is 1991 or 2001 we can use the query
"SELECT Col1, Col2 where Col1='*1991*' or Col1='*2001*'"
For this last example lets also TRANSPOSE the result and remove the table header:
So there you using the QUERY formula to be more selective on your html import to Google Spreadsheets. Here is a copy of the spreadsheet with all the examples I’ve used in this post Any questions/clarifications leave a comment.
PS Tony Hirst has also written about Using Google Spreadsheets Like a Database – The QUERY Formula and this is a place if you want some more query examples.
PPS I’m on leave now which is why this post has very little to do with CETIS or OER.