Reshaping importHTML data in Google Spreadsheet using QUERY and TRANSPOSE formula

importHTML is a fantastic formula you can use in Google Spreadsheets. Here’s Google’s support documentation for importHTML:

importHtml

Syntax: ImportHtml(URL, query, index)

URL is the URL of the HTML page. Either “list” or “table” indicates what type of structure to pull in from the webpage. If it’s “list,” the function looks for the contents of <UL>, <OL>, or <DL> tags; if it’s “table,” it just looks for <TABLE> tags. Index is the 1-based index of the table or the list on the source web page. The indices are maintained separately so there might be both a list #1 and a table #1.

Example: =ImportHtml(“http://en.wikipedia.org/wiki/Demographics_of_India“; “table”;4). This function returns demographic information for the population of India.
Note: The limit on the number of ImportHtml functions per spreadsheet is 50.

What’s even better is you can wrap this formula in other formula to get the data in the shape you want. A case in point I was recently asked:

Using TRANSPOSE

The answer is yes, you can TRANSPOSE a importHTML. Let use the Demographics of India table from the support documentation as an example. To switch columns into rows we can use =TRANSPOSE(ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4))
This lets us change the way the data is imported from this:

ImportHtml("http:

to this:

TRANSPOSE(ImportHtml("http:

Using QUERY

Lets now say we are only interested in the population figures for 1991 and 2001.  You could always just import all the data then pull it using a cell reference. Another way of doing this is to wrap our data in a QUERY formula.

The QUERY function is a built-in function that allows you to perform a query over an array of values using the Google Visualization API Query Language.

The syntax for the formula is QUERY(data, query, headers) where:

  • data – is an array of values (this can be a cell reference like data!B2:H6 or a imported range)
  • query – A query string for applying data operations. If you are using a cell reference or importRange then the ‘query operates on column IDs directly from the input range and uses a subset of the SQL language. For example, “select E,” “select A , B,” “sum(B),C group by C,” “select D where D < ‘Nick’ .”’ Note: When using importHTML column IDs are the column numbers using Col{number} e.g. “select Col1, Col2”
  • headersA number specifying the number of header rows in the input range. If omitted, or set to -1, the number of header rows is guessed from the input range. This parameter enables transformation of multi-header rows range input to be transformed to a single row header input which the QUERY supports.

Anyone used to tinkering with databases will recognise the query language which uses the clauses like SELECT, WHERE, GROUP_BY etc.
There are a couple of ways to query our data for the population of India in 1991 and 2001.

Using LIMIT and OFFSET

  • Limit – Limits the number of returned rows.
  • Offset – Skips a given number of first rows.

Using these we could use the query "SELECT * LIMIT 2 OFFSET 4". This selects all the columns (using *) and then limits to 2 results starting from the 4th row. The order of limit/offset is important, using these the other way around won’t return any results.

QUERY(ImportHtml("http:

SELECT columns

  • Select – Selects which columns to return, and in what order. If omitted, all of the table’s columns are returned, in their default order.

Because we are using importHTML as our datasource when selecting the columns we need to use the syntax Col1, Col2, Col3 …. So if you just want the year and population our query could be "SELECT Col1, Col2 LIMIT 2 OFFSET 4"

QUERY(ImportHtml("http:

WHERE rows

  • Where – Returns only rows that match a condition. If omitted, all rows are returned.

One issue with using limit/offset is if more data is inserted into the source table it might push your results out of the range. A way around this is to include a WHERE clause to only include data on certain conditions. WHERE allows various comparison operators like <=, =, >, multiple conditions (‘and’, ‘or’ and ‘not’) and more complex string comparisons like ‘contains’. More information on WHERE conditions here. So if we only wan the population where the year is 1991 or 2001 we can use the query "SELECT Col1, Col2 where Col1='*1991*' or Col1='*2001*'"
For this last example lets also TRANSPOSE the result and remove the table header:

"=TRANSPOSE(QUERY(ImportHtml("http://en.wikipedia.org/wiki/Demographics_of_India"; "table";4),"SELECT Col1, Col2 WHERE Col1='*1991*' or Col1='*2001*'",0))"

So there you using the QUERY formula to be more selective on your html import to Google Spreadsheets. Here is a copy of the spreadsheet with all the examples I’ve used in this post Any questions/clarifications leave a comment.
PS Tony Hirst has also  written about Using Google Spreadsheets Like a Database – The QUERY Formula and this is a place if you want some more query examples.
PPS I’m on leave now which is why this post has very little to do with CETIS or OER.

chevron_left
chevron_right

Join the conversation

comment 4 comments
  • Zohar

    Hi,
    when a login is required to a certain site with tables, is there a way to script the login into the ImportHtml syntex?
    Thanks,
    Z.

  • Nate

    I have a list of companies (apprx. 4,000) in a Google Spreadsheet. Is it possible to use this function (or some other) to search Google by (for example) cell A1 (with the company name) and cell e1 (zip code, or city name), and then simply put the FIRST google result into an empty cell? I’m hoping the first result will be a company website. The results wouldn’t be perfect, but it certainly would cut out the fat.
    I’m having trouble finding anything the can accomplish this (in Excel or Google).
    Thanks.

Comments are closed.

css.php