How to calculate the MEDIAN in Google BigQuery

Posted by Pascal Landau on 2020-06-20 12:00:00

There is no MEDIAN() function in Google BigQuery, but we can still calculate the MEDIAN with the PERCENTILE_CONT(x, 0.5) or PERCENTILE_DISC(x, 0.5) functions. The difference between those two functions is the linear interpolation that is applied when using PERCENTILE_CONT(x, 0.5) - so that's probably what you want when dealing with numeric values. Take the numbers 1,2,3,4 for example:

  • PERCENTILE_CONT(x, 0.5) yields 2.5 (as the 50% percentile is exactly between 2 and 3)
  • PERCENTILE_DISC(x, 0.5) yields 2 (as the 50% percentile is >= 2)

Example

SELECT 
  PERCENTILE_CONT(x, 0.5) OVER() AS median_cont,
  PERCENTILE_DISC(x, 0.5) OVER() AS median_disc
FROM
  UNNEST([1,2,3,4]) as x 
LIMIT 1

Result

median_cont median_disc
2.5 2

Caution: As of today (2020-06-20), BigQuery only supports PERCENTILE_CONT and PERCENTILE_DISC for window functions (hence the OVER() clause and the LIMIT 1 in the example above):

PERCENTILE_CONT is under development, and we will publish the documentation once it is GA. We will support it as analytic function first, and we plan to support it as aggregate function (allowing GROUP BY) later.

Source: SO: percentile functions with GROUPBY in BigQuery

The more common use case is probably to calculate the median as a result of a GROUP BY statement. I.e. I would like to write something like this to get the median of quantity per product_id.

SELECT
  product_id, 
  PERCENTILE_CONT(quantity, 0.5) AS median
GROUP BY
  product_id

Right now, that is only possible for the average via AVG() but not for the median. But we can still work around that limitation by using the PERCENTILE_CONT function on a window partitioned by product_id, then group by the product_id (to get only one row per product_id) and resolve a single median value via ANY_VALUE().

Working Example

Run on BigQuery

Open in BigQuery UI

BigQuery UI: MEDIAN in BigQuery example

Links

Notes

There is also the APPROX_QUANTILES() function (mentioned here) that can by applied to a GROUP BY. I didn't have a practical use case for approximate functions yet, though. Thus, I don't know the implications of "not using an exact calculation" and rather mention this for the sake of completeness. Example:

SELECT
  product_id, 
  APPROX_QUANTILES(quantity, 100)[OFFSET(50)] as approx_median
GROUP BY
  product_id

Wanna stay in touch?

Since you ended up on this blog, chances are pretty high that you're into Software Development (probably PHP, Laravel, Docker or Google Big Query) and I'm a big fan of feedback and networking.

So - if you'd like to stay in touch, feel free to shoot me an email with a couple of words about yourself and/or connect with me on LinkedIn or Twitter or simply subscribe to my RSS feed or go the crazy route and subscribe via mail and don't forget to leave a comment :)

Subscribe to posts via mail

We use Mailchimp as our newsletter provider. By clicking subscribe, you acknowledge that your information will be transferred to Mailchimp for processing. Learn more about Mailchimp's privacy practices here.
Waving bear

Comments