How to extract URL parameters as ARRAY in Google BigQuery

Posted by Pascal Landau on 2018-04-08 12:00:00

We're gonna use the REGEXP_EXTRACT_ALL function provided in the Standard SQL dialect of BigQuery to extract parameters from the query part of a URL and return them as an ARRAY.

Code

#standardSQL
SELECT
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)((?:[^=]+)=(?:[^&]*))') as params,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:([^=]+)=(?:[^&]*))') as keys,
  REGEXP_EXTRACT_ALL(query,r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values
FROM
  table

Working Example

Result

Row id query params keys values description
1 1 ?foo=bar foo=bar foo bar simple
2 2 ?foo=bar&bar=baz foo=bar foo bar multiple params
bar=baz bar baz
3 3 ?foo[]=bar&foo[]=baz foo[]=bar foo[] bar arrays
foo[]=baz foo[] baz
4 4 no query

Run on BigQuery

Open in BigQuery Console

BigQuery Console: Extract URL parameters example

Notes

  • REGEXP_EXTRACT_ALL only excepts 1 capturing group, hence we need to mark all other groups as non-capturing with (?:
  • if the URL contains a fragment part (e.g. https://example.org/?foo=bar#baz), the fragment is currently not removed. To do so, remove the fragment prior to extraction with REGEXP_REPLACE, e.g. like so:
    REGEXP_EXTRACT_ALL(
    REGEXP_EXTRACT(query, r'#.*', ''),
    r'(?:\?|&)(?:(?:[^=]+)=([^&]*))') as values

Links

Use cases

  • compile a list of all parameters from your log files
  • evaluate the frequency of parameters keys/values

Wanna stay in touch?

Since you ended up on this blog, chances are pretty high that you're into Software Development (probably PHP, Laravel, Docker or Google Big Query) and I'm a big fan of feedback and networking.

So - if you'd like to stay in touch, feel free to shoot me an email with a couple of words about yourself and/or connect with me on LinkedIn, Twitter or Facebook - or simply subscribe to my RSS feed and leave a comment ;)

Waving bear

Comments