How to use expression subqueries to query nested and repeated fields in Google BigQuery

Posted by Pascal Landau on 2020-05-29 14:00:00

BigQuery allows to define nested and repeated fields in a table. Although this is very powerful, it makes it much more complex to retrieve the data if one is not used to such structures. Especially beginners tend to use an UNNEST statement on the nested fields, followed by a huge GROUP BY statement on the not-originally-repeated fields. Imho, using expression subqueries is oftentimes the better approach here.

Code

SELECT
  id,
  (SELECT value from t.repeated_fields LIMIT 1)
FROM
  table t  

Caution: When using expression subqueries, you need to make sure that the result is a single value (scalar or array), otherwise you will get the error message

Scalar subquery produced more than one element

In the example code above this is ensured by enforcing one result via LIMIT 1.

Working Example

Run on BigQuery

Open in BigQuery Console

BigQuery Console: How to use expression subqueries for nested and repeated fields example

Links

Use cases

The most prominent use case is probably the BigQuery export schema of Google Analytics. To be honest, I also feel that the schema is not very friendly for newcomers with its ~30 RECORD-type (nested) fields and 300+ columns.

In a nutshell, each row represents one session. A session consists of multiple hits. Those hits are also available in the nested and repeated hits field. But wait, there is more... Each hit can have a number of so called customDimensions (meta data that can be attached to each hit). So the resulting table structue looks something like this:

- field_1
- field_2
- hits
  - field_1 
  - field_2
  - customDimensions
    - index 
    - value 

The following example uses the public Google Analytics sample dataset for BigQuery and shows a couple of sample expression subqueries

SELECT 
  fullVisitorId,
  visitStartTime,
  TIMESTAMP_SECONDS(visitStartTime) as started_at,
  TIMESTAMP_SECONDS(visitStartTime + CAST( (SELECT time from t.hits ORDER BY hitNumber DESC LIMIT 1) /1000 AS INT64)) as ended_at,
  (SELECT COUNT(*) from t.hits) as hit_count,
  (SELECT page.hostname || page.pagePath from t.hits WHERE isEntrance = TRUE) as landing_page,
  (
    SELECT  
      (SELECT COUNT(*) from h.customDimensions) 
    FROM 
      t.hits h
    WHERE 
      hitNumber = 1
   ) as customDimension_count_of_first_hit, 
FROM 
  `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` t
ORDER BY
  visitStartTime asc

Wanna stay in touch?

Since you ended up on this blog, chances are pretty high that you're into Software Development (probably PHP, Laravel, Docker or Google Big Query) and I'm a big fan of feedback and networking.

So - if you'd like to stay in touch, feel free to shoot me an email with a couple of words about yourself and/or connect with me on LinkedIn or Twitter or simply subscribe to my RSS feed or go the crazy route and subscribe via mail and don't forget to leave a comment :)

Subscribe to posts via mail

We use Mailchimp as our newsletter provider. By clicking subscribe, you acknowledge that your information will be transferred to Mailchimp for processing. Learn more about Mailchimp's privacy practices here.
Waving bear

Comments