Seldomly Used SQL: bool Aggregate Functions
Aggregate functions are the cornerstone of performing any type of analysis or reporting on your data that lives within Postgres. The
bool_and aggregate functions are ones that I’ve started to pull out of my toolbox more often. These functions have been particularly useful when I’ve needed to know if any or all values within a column meets a certain condition and to report on it apporpriately.
Before we can dive into the aggregate functions, we first need to discuss logical operators. If you’re already familiar with these types of operators, this section will likely be review. Any time you’re writing SQL, and are using multiple filters in your where clause, you’re using a logical operator.
SELECT products.id , products.cost , sum(orders.amount) FROM products INNER JOIN orders ON orders.product_id = products.id WHERE products.category = 'food' AND orders.ordered_at > '2016-01-01 00:00:00' AND ( orders.region = 'west' or orders.express_delivery = true ) GROUP BY 1, 2
In this query we’re using both an
AND operator and an
OR operator. The
AND operator means that both conditions need to be true to return a result. In our example, the products that need to have been ordered need to be food items and ordered after the first of the year. The
OR operator allows for either condition to be true to return a value as part of the result set. In the case of our example, this means orders can be express delivery or have been ordered in the west region. If you need more background on logical operators, the Postgres documentation has a great table on the possible scenarios for all logical operators.
Now imagine being able to take the concept of logical operators and apply that to the rows that are processed as part of a result set. That’s what the
bool_and aggregate functions do for us. Both functions must take a boolean value as it’s parameter. The
bool_or function will return
true if at least one of the input values is true. On the other hand, the
bool_and will return true only if all of the input values are true.
The best way I’ve found to explore these concepts is with some example usage. Let’s assume I own a warehouse and my warehouse will take orders and, based on the customer, will track the category of the products sold and whether the customer asked for express delivery. My orders table might look something like this:
CREATE TABLE orders ( id BIGSERIAL, customer_id BIGINT, category TEXT, express_delivery BOOLEAN );
Let’s assume this is what our initial dataset looks like:
Our boss has come in and was curious to know if, based on the category of product, the customers will always ask for express delivery. Also the boss wants to see this breakout against all categories in the database. No problem, let’s use the
bool_and function to make sure that all orders by category are always express delivered:
SELECT category , bool_and(express_delivery) as always_express FROM orders GROUP BY 1
Our output should look something like this:
The boss has come back to us and is curious why so few categories are express delivered. We astutely point out that, we’re only looking for instances where every delivery in a category is express delivered. A better analysis should be to see which product categories ever get express delivered. This means that if a category has express delivery, even once, we should indicate that it has been for the category. Our SQL and the result set should look like this:
SELECT category , bool_or(express_delivery) as ever_expressed FROM orders GROUP BY 1
We now see that
shoes have been added to the list of products that have been asked to be expressed delivered by customers. The boss has some sense of what to proritize within the warehouse so that food and shoes get out the warehouse door much more quickly.
What happens if our application that’s backing the ordering system has an error and we don’t record the express delivery field correctly? Let’s assume the record got created in the database but the
express_delivery field was set as
NULL. Ideally, we’d have a constraint on the field in the database, but for the purposes of this example let’s assume we didn’t. Let’s say our table now looks like this:
bool_or and the
bool_and functions will treat
NULL just as that. Our above queries will do one of two things:
- If other input values in set are not null, then the output will be based on the other values.
- If all input values in the set are null, then the output of the query will be null as well.
Keep this behavior in mind when you’re querying and there is possibility of null values in your data.
I hope you found the
bool_and functions as useful as I have. Any time you find yourself starting your analysis with
does any or
do all then checkout these functions. If you have other instances where you use these functions, let me know via twitter or email, I’d love to hear more.