Aggregate functions are the cornerstone of performing any type of analysis or reporting on data that lives within Postgres. The bool_or and bool_and aggregate functions are ones that I’ve started to pull out of my toolbox more often. They’ve been particularly useful when I’ve needed to know whether any or all values within a column meet a certain condition and to report on it appropriately.
Before we can dive into the aggregate functions, we first need to discuss logical operators. If you’re already familiar with these operators, this section will likely be review. Any time you’re writing SQL and using multiple filters in your WHERE clause, you’re using a logical operator.
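As a concrete sketch, here’s a hypothetical query that uses both kinds of operators; the orders table and its column names and values are invented for this example:

```sql
SELECT *
FROM orders
WHERE category = 'food'
  AND ordered_at >= '2016-01-01'
  AND (express_delivery = true OR region = 'west');
```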
In this query we’re using both an AND operator and an OR operator. The AND operator means that both conditions need to be true to return a result. In our example, the products that need to have been ordered need to be food items and ordered after the first of the year. The OR operator allows for either condition to be true to return a value as part of the result set. In the case of our example, this means orders can be express delivery or have been ordered in the west region. If you need more background on logical operators, the Postgres documentation has a great table on the possible scenarios for all logical operators.
Now imagine being able to take the concept of logical operators and apply it to the rows that are processed as part of a result set. That’s what the bool_or and bool_and aggregate functions do for us. Both functions take a boolean value as their input. The bool_or function returns true if at least one of the input values is true. The bool_and function, on the other hand, returns true only if all of the input values are true.
The best way I’ve found to explore these concepts is with some example usage. Let’s assume I own a warehouse that takes orders and, for each order, tracks the category of the products sold and whether the customer asked for express delivery. My orders table might look something like this:
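A minimal sketch of that table; the exact column names are my own guesses for this example:

```sql
CREATE TABLE orders (
    id serial PRIMARY KEY,
    customer_id integer NOT NULL,
    category text NOT NULL,
    express_delivery boolean
);
```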
Let’s assume this is what our initial dataset looks like:
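For instance, we might seed the table with a few orders across a handful of categories; the specific rows here are illustrative:

```sql
INSERT INTO orders (customer_id, category, express_delivery) VALUES
    (1, 'food',  true),   -- every food order is express
    (2, 'food',  true),
    (1, 'shoes', true),   -- shoes are express only sometimes
    (3, 'shoes', false),
    (2, 'books', false),  -- books are never express
    (3, 'books', false);
```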
Our boss has come in and is curious to know whether, based on the category of product, customers always ask for express delivery. The boss also wants to see this breakout across all categories in the database. No problem, let’s use the bool_and function to check whether all orders in a category are always express delivered:
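A sketch of that query, assuming an orders table with category and express_delivery columns:

```sql
SELECT category,
       bool_and(express_delivery) AS always_express
FROM orders
GROUP BY category;
```

With a dataset where every food order is express but shoe and book orders are only sometimes or never express, only the food category would come back as true.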
The boss has come back to us and is curious why so few categories are express delivered. We astutely point out that we’re only looking for instances where every delivery in a category is express delivered. A better analysis would be to see which product categories ever get express delivered. This means that if a category has had express delivery even once, we should flag the category for it. Our SQL and the result set should look like this:
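Assuming the same orders table, swapping in bool_or is all it takes:

```sql
SELECT category,
       bool_or(express_delivery) AS ever_express
FROM orders
GROUP BY category;
```

Any category with even a single express order now shows up as true.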
We now see that shoes have been added to the list of products that customers have asked to be express delivered. The boss now has some sense of what to prioritize within the warehouse so that food and shoes get out the warehouse door much more quickly.
What happens if our application that’s backing the ordering system has an error and we don’t record the express delivery field correctly? Let’s assume the record got created in the database but the express_delivery field was set as NULL. Ideally, we’d have a constraint on the field in the database, but for the purposes of this example let’s assume we didn’t. Let’s say our table now looks like this:
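For instance, a record like this might sneak in; the customer and category values are illustrative:

```sql
INSERT INTO orders (customer_id, category, express_delivery)
VALUES (4, 'books', NULL);
```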
Both the bool_or and the bool_and functions treat NULL as exactly that: an unknown value. Our queries above will do one of two things:
If other input values in the set are not null, then the output will be based on those non-null values.
If all input values in the set are null, then the output of the query will be null as well.
Keep this behavior in mind when you’re querying and there is a possibility of null values in your data.
I hope you found the bool_or and bool_and functions as useful as I have. Any time you find yourself starting your analysis with “does any” or “do all”, check out these functions. If you have other instances where you use these functions, let me know via twitter or email, I’d love to hear more.
Edit: Thanks to @epc and @floatingatoll for pointing out a small mistake on one of my headings on the query output. Fixed!
In my previous post, I introduced some very high-level conceptual differences between Postgres and Cassandra. I also referenced one pattern that I’ve seen where it makes sense to use the two systems together. When you have one very large table in Postgres whose size on disk is measured in terabytes, even after it’s been fully vacuumed, it might make sense to bring Cassandra into your architecture. Trust me, that one very large table can create operational headaches in Postgres over time. But how easy is it to model my schema in Cassandra if all I know is Postgres? Cassandra has its own SQL-like dialect called the Cassandra Query Language (CQL) that mirrors many of the semantics of SQL, but that’s where the similarities stop. You’ll need to know how SQL and CQL differ and how to model the data properly in Cassandra.
Moving the Large Events Table
For the purposes of this exercise, let’s assume that my data model in Postgres has three different tables: users, accounts and events. The data definition for this schema in SQL would look something like this:
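A sketch of what that DDL might look like; the exact columns are my own simplification for this series:

```sql
CREATE TABLE accounts (
    id serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE users (
    id serial PRIMARY KEY,
    account_id integer NOT NULL REFERENCES accounts (id),
    email text NOT NULL
);

CREATE TABLE events (
    id uuid PRIMARY KEY,
    user_id integer NOT NULL REFERENCES users (id),
    session_id uuid,
    name text NOT NULL,
    occurred_at timestamptz NOT NULL
);
```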
This is pretty straightforward. Our application has multiple users that belong to an account and, as our users do things inside of our application, we’ve instrumented their interactions to send events to the database to be stored in the events table. Over time this setup has worked well, but now the amount of data inside the events table has ballooned. The growth of a table like events is typically non-linear compared to the data growth of the other tables in the schema. Now let’s walk through how we move the events table out to Cassandra.
Creating the Initial Cassandra Cluster
Cassandra can be thought of as one cluster, but that one cluster can have many groupings of nodes called data centers. Let’s assume we’re going to create our Cassandra cluster using Amazon Web Services (AWS). We could have one cluster in AWS and then create data centers in different regions. For example, my topology could have two data centers, one in us-east and another in us-west, with five nodes in each.
To create the cluster, we’ll need to define some properties and stuff them into a couple of different configuration files. We also need to make sure that we create each of the nodes ahead of time and that Cassandra is installed on each of them. I’m going to gloss over those details for now, but if you need more help, a decent tutorial on setting up a multi-node Cassandra cluster should get you set up.
Assuming we’ve got the initial installation complete, for the sake of continuing my example, I’ve updated the file /etc/cassandra/cassandra-rackdc.properties to name the data center based on the AWS region I’m hosting each of the nodes in the data centers:
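For the nodes in US East, the relevant lines of that file would look something like this; the rack name is an assumption for the example:

```
dc=us-east
rack=rack1
```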
The data center names are important when we start building our Cassandra schema. For each of the nodes I’ve created in the US East region in AWS, I’ve added the rackdc.properties file with the appropriate dc. When creating this file for the nodes in my other region, US West, I’d set dc equal to us-west. Note that these names could be anything I wanted them to be, but I like them to mirror the AWS regions so that I know what I’m dealing with.
Designing the Cassandra Schema
Before we can create our events table in Cassandra, we need to create a keyspace. A keyspace is roughly analogous to a database in Postgres.
A Postgres instance can have many databases and, in turn, each of those databases can have multiple schemas. Schemas in Postgres can be defined, at a basic level, as a named collection of tables, views and other objects. A keyspace can be thought of as a Postgres database with only one schema. CQL even has CREATE SCHEMA as an alias for CREATE KEYSPACE. When we define the keyspace in Cassandra, we’re setting up the rules for the cluster topology as well as how data gets replicated between the nodes.
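A keyspace definition matching the topology above might look like this; the keyspace name is my own choice for the example:

```cql
CREATE KEYSPACE production
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3
  };
```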
In the above example, we’re using the network topology strategy and the data center of us-east has a replication factor of 3.
Creating a Cassandra Table
Now that we’ve got our keyspace created, it’s time to actually define the events table. The semantics of creating a table look the same in Cassandra as they do in Postgres, but that’s where the similarities end.
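A first pass at the events table in CQL might look something like this, mirroring the Postgres columns; the exact type choices are a sketch:

```cql
CREATE TABLE production.events (
    id uuid,
    user_id int,
    account_id int,
    session_id uuid,
    name text,
    occurred_at timestamp,
    PRIMARY KEY (id)
);
```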
Many of the data types from Postgres map to data types in Cassandra, but not all of them do, and the differences are worth a close look as you convert your schema.
One of the biggest challenges I’ve had converting Postgres models over to Cassandra has been Postgres’s JSON data types. Cassandra doesn’t really handle nested data structures very well. I take that back, it doesn’t at all. The best proxy is the collection type map, which is itself a collection of key/value pairs that can be used to store information in the column family. Typically, if you’re going to save JSON, the prevailing best practice is to serialize that information into a blob or use a single layer with the map collection type. Do check out the rest of the collection data types in Cassandra, there are types like set and list.
SQL Primary Key != CQL Primary Key
Cassandra primary keys are nothing like Postgres primary keys. In Postgres, when you create a primary key, you’re explicitly telling the database that the fields you’ve defined for the primary key contain unique, non-null values. On top of that, an index gets created based on the primary key to enforce uniqueness. Cassandra’s primary key doesn’t do any of that. Generally speaking, the purpose of the primary key in Cassandra is to order information within the cluster.
The primary key is a very loaded term in CQL. First and foremost, the primary key is used by Cassandra to distribute the data amongst the nodes in the cluster. This is called the partition key. If you desire to have a composite primary key, the first set of columns become the partition key. A composite primary key with multiple fields for the partition key would look like this, where the user_id and account_id are the partition key:
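As a sketch, using the events columns from earlier:

```cql
CREATE TABLE production.events (
    user_id int,
    account_id int,
    session_id uuid,
    occurred_at timestamp,
    name text,
    -- the inner parentheses make (user_id, account_id) the partition key;
    -- session_id and occurred_at become clustering columns
    PRIMARY KEY ((user_id, account_id), session_id, occurred_at)
);
```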
Second, the remaining columns within a primary key definition are used for clustering. Clustering in the Cassandra sense means how the data in the defined columns is ordered on disk. In the above example, Cassandra will sort by session_id first and then by the occurred_at field. You do have the ability to change the on-disk ordering by using the CLUSTERING ORDER BY syntax at table creation time.
The syntax for composite primary keys can be fairly complicated but if you only have one column as the primary key, that one column will be used as both the partition key and the clustering column. This double meaning for the primary key in Cassandra is why it’s such a loaded term.
I can’t stress enough that primary keys within Cassandra operate under eventual consistency. Unlike with Postgres primary keys, eventual consistency can lead to unexpected results with your data. Let’s walk through a contrived scenario to understand what could happen. Imagine that I have a table in my production keyspace that tracks the number of hits a page on my site gets. Again, this is contrived and not something you’d actually set up in a schema; it’s only to illustrate eventual consistency:
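Something like this, keyed only on the page name:

```cql
CREATE TABLE production.page_hits (
    page text PRIMARY KEY,
    hits int
);
```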
If we have multiple clients attached to the cluster and each of those clients is attached to a different node, this is where the problem starts to manifest itself. A client on node1 reads the number of hits for the page named ‘welcome’, which happens to be 10, and tries to increment it by 2 to 12. At the same time, a second client on node2 reads the value of the hits for the welcome page, which is still 10, and instead tries to increment the count by 1 to 11. Since the second client was the last one to write to Cassandra, if no other updates are made to that record, the hit count of 11 will get propagated to all of the nodes in the cluster. Even though this is a made-up example, being mindful of eventual consistency is super important.
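The race above, sketched in CQL and assuming a page_hits table keyed on the page name:

```cql
-- client on node1 reads hits = 10, then writes 10 + 2
UPDATE production.page_hits SET hits = 12 WHERE page = 'welcome';

-- client on node2 also read 10 before the first write propagated,
-- then writes 10 + 1, silently clobbering the first update
UPDATE production.page_hits SET hits = 11 WHERE page = 'welcome';
```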
A lot of careful thought needs to be put in upfront in your table design, because you need to make sure that the primary key is specific enough that it’s highly unlikely you’ll run into collisions. My general rule: if you need some level of consistency, don’t do it in Cassandra, or redesign the schema so you don’t run into problems.
I’ve gone through a very terse walkthrough of converting an events table in a Postgres database over to Cassandra. There’s still a lot more to talk about, including best practices and what happens under the hood. If you’re up for digging into some documentation, I’d recommend checking out the very complete docs over at Datastax. Otherwise, if you have questions, comments or other ways of thinking about data modeling in Cassandra, reach out to me via twitter or email. Cheers!
One of the biggest challenges I see product managers face in their career is
killing features and products. Much of the product management literature and
psyche on the web is spent talking about how to build new products. I did a
quick search for “kill a product feature” in Google and noticed that only the
top 6 results are about killing features. Most of those posts are spent paying
lip service to the exercise with little guidance around how to do it. This
signals one of two things for me. Either product managers never kill anything,
or killing features is the least glamorous part of being a product manager. I’m
inclined to think it’s the latter because in some ways it means that your
original hypothesis for building the feature was incorrect and that can feel
defeating. It doesn’t have to be that way.
Product management is about creating and curating great experiences for your
customers. A product that never removes anything will be guilty of
feature creep and bloat. To do a sunset properly, you’ll need to determine if
your feature is ready for retirement, plan your timelines, and execute with
empathy.
Is it time to retire the feature?
Deciding whether the feature or product needs to be retired can be the hardest part.
Ideally, before you’ve built the feature in the first place, you’ve created a
hypothesis and defined success metrics for that hypothesis. Should the feature
fail those success metrics, then the choice should be much easier. If you don’t
have those success metrics, the hard part is going to be defining what a success
metric should be. There are so many different questions that could be asked but
these are the ones that I usually ask of myself:
Does this product have strategic importance for the company?
Can revenue be directly attributed to this product?
Are we supplanting this product with another one?
Qualitatively, do our customers understand how to use this product?
As a word of caution, you need to make sure you socialize the shutdown of a
product within your company. Sometimes the biggest users of a feature will be your
co-workers and they have a direct line to you. This can cause a ton of problems
if you don’t give them some advanced warning. Think of this as the pre-game warm
up for shutting down this feature amongst your customers. You can test your
messaging around why you’re doing it and how to do a migration.
Build a Timeline
We’ve got a product that we need to kill. Now what? It’s time to put in the
work to do all the pre-planning to make this process seamless. Every sunset
that I’ve done boils down to four critical dates, at a minimum: immediate shutdown for new
users, start date of the sunset for current users, a soft end for killing the
feature, and the hard end date for when everyone absolutely has to be off.
1. Immediate Shutdown
Turn the feature off to new customers that sign up for your service immediately.
You need to stem the flow of new customers that will come to rely on this
feature. This assumes that you’ve removed all of the subsequent discussion of
the feature from all of your marketing materials and sales collateral. Don’t
forget to tell your sales teams about the shutdown!
2. Start Date for the Sunset Process
This is the date that you want to formally communicate the sunset to customers.
I’m a big fan of telling all users of your feature that it’s getting shut down
regardless of usage. The last thing that you want to have happen as a customer
is for something to get shut off without knowing about it. If you have product
documentation, I’d actually recommend you keep the information about the feature
in the documentation with warnings around when the feature will be sunset. Other
things you’ll want to have:
Migration path to the new feature, if there is one.
Documentation on why you’re doing this.
Recommendations for moving to another service.
3. Soft End Date
The soft end date is meant to be discussed openly and, from a customer
perspective, the final date before the feature will be cut off. Inevitably,
customers won’t read their email or they have their own priorities that don’t
mesh with your timelines for sunset. That’s ok. Customers will reach out to you
to ask for more time. Because we’ve built some slack into the schedule, your
customers will be grateful. You’ll still have a set of customers that won’t
reach out and at this point I usually assess the size of that population. If
it’s small enough, maybe it’s worthwhile to reach out to them individually,
maybe even by phone if this is a B2B product, rather than by mass email.
4. Hard End Date
This is the time when everyone has to be off. The product documentation is
removed and everyone has migrated somewhere else. Just like the time during the soft
end date, you’ll have to do some outreach. Also, I’d recommend doing more for
your customers. If you can do a migration on their behalf, do it. This will
result in less product debt over time. I’ve seen sunsets where product managers
tried to create forcing functions to move people to new plans or products.
I’d strongly recommend not doing this. Over time the migration will happen
slowly, but something will inevitably happen where you’ll have to force your
customers to move over; it could be a security vulnerability or retiring
infrastructure. It’ll be way easier to reduce your product debt immediately
than to have it hanging over your head.
Going the extra mile
The four dates I’ve defined are what you need at a minimum. Depending on the
length of the sunset, I’d recommend a time interval to send out reminders for
your customers. That regular cadence helps to remind customers that the time is
coming for the feature to be removed. Do make sure you talk up the seriousness
of the end date the closer you get to that time.
Execute with Empathy
Once everything is defined, it’s just time to execute the timeline. Depending on
the seriousness of the sunset, you’ll get a number of customers writing in or
taking their cause to social media. You have to be ready to handle this in the
best way possible. I can’t stress enough how important it is to have all of your
materials and messages built beforehand because you’ll have a ton of email and
communication to do. Treat each one of your customers as a person. Even if the
customer doesn’t use the feature that much or at all, they will still have
reasons for not wanting to see that feature go. Sometimes those reasons can be
emotional rather than rational and you need to keep that in mind. Craig Kerstiens
sums up this concept really well:
Be wary of
viewing your customers as buckets and cohorts and not individuals, especially
when you interact with them
Almost any product manager can build great experiences but the really good ones
will take care of their customers and build great sunset experiences. I’ve found
that there are precious few product managers who can do that. If you have a
different take on how to do sunsets, I’d love to hear it. Let’s chat via email.
I’m a huge fan of Postgres and I have been for many years along with many other
people. It’s one of the
best open-source communities dedicated to building a
fast, standards-compliant database. Despite my desire to use Postgres for
everything, some situations require a different style of data store especially
in situations where apps need very high write throughput. To be clear, I see
these two data stores as being complementary not mutually exclusive.
In this post, we’ll cover some history and high-level conceptual differences
between Cassandra and Postgres. In subsequent posts we’ll dive into the
specifics of creating tables, data modeling, and data types, just to name a few.
Throughout this series, I’ll do my best to apply a set of scenarios where using
Cassandra would make sense. Furthermore, I’ll use the most recent versions of
Cassandra Query Language (CQL) and Cassandra to illustrate the appropriate
concepts, versions 3 and 2.2 respectively.
From a historical perspective, Postgres, a featureful and standards-compliant
database, has been around since the early 1980s and is written in C. I’m not
going to spend a lot of time spelunking Postgres history but if you’re looking
for more, postgres.org has a great summary. Cassandra, on the other hand, is a
relative newcomer, having been released to the public in 2008, and is written in Java.
Cassandra was originally conceived by Avinash Lakshman and Prashant Malik at
Facebook. The original problem they were trying to solve was storing reverse
indices of messages for their users’ inboxes. But additional constraints were
added including the storage of a large amount of data, handling a large rate of
data growth, and to serve the information within strict limits. The initial
release was put up on Google Code in 2008. But it wasn’t until 2009, when the
first non-Facebook committer was added to the project, that Cassandra started
picking up steam.
The Cassandra codebase eventually moved from Google Code to an Apache incubator
project ultimately graduating to a top-level Apache project. The community is
still fairly young, as compared to Postgres, but is growing through the backing
of many individuals and corporate sponsors, like Datastax.
Why do you even need Cassandra?
Most applications that I’ve seen usually start out with a Postgres database and
it serves the application very well for an extended period of time. Typically,
based on the type of application, the data model of the app will have a table that
tracks some kind of state for either objects in the system or the users of the
application. For the sake of keeping things simple, let’s just call this table
“events”. The growth in the number of rows in this table is not linear as the
traffic to the app increases; it’s typically exponential.
Over time, the events table will increasingly become the bulk of the data volume
in Postgres, think terabytes, and become increasingly hard to query. In this
situation, it makes sense to move that table out of Postgres and into Cassandra.
Cassandra will be able to handle the nonlinear nature of the events that need to
be created and will scale with minimal changes to the application.
What Makes Cassandra So Special?
At a high level, relational databases, like Postgres, define the data model in
terms of two-dimensional tables with the dimensions being rows and columns. When
tables are defined, typically the intention is to reduce the amount of data
duplication by normalizing the data model. To illustrate this concept, let’s use
an example application that stores event information for users. Each user in the
system will belong to one account and users can have many events.
Cassandra, on the other hand, is a partitioned key-value store. In some
programming languages, a key-value structure is called a hash or a dictionary.
Each “row” is defined by a unique key, with the value being any kind of data.
While Postgres is typically run on just a single instance (I’ll save sharding
and clustering for another post), Cassandra requires that it be run as a cluster
of multiple machines. This is where the “partitioned” part of the definition comes
into play. Conceptually, Cassandra looks something like this:
Partitioning is done on a partition key. This key defines how data should be
distributed across the cluster. The simplest definition is that it’s the key of
the key-value pair we had defined earlier. But, the partition key can be more
complex than just a single field. Typically, the fields that are used to define
a partition key are hashed together by a partitioner, and the resulting value
determines which node in the cluster the data should live on. The best part about the
partitioner is that it takes care of the hashing behind the scenes. You can
loosely think of the partition key as you would a primary key in Postgres for
any particular row of information. I’ll dive more into the Cassandra primary key
later on in this series.
As for the value part of the key-value pair, it’s more than just one piece of
information. Instead, the value is actually a column family. A column family is
itself a series of names and values (tuples) that are associated with a row key.
In Postgres, depending on the table constraints, when a record is created, each
row has a defined value for each column. In a column family in Cassandra, only
those columns that have data as part of the column family are actually written
to the data store. I talk about column families here because that was what they
were originally called in versions of Cassandra prior to 3.0. From 3.0 forward,
column families are called tables. This brings the concept a little closer to
Postgres and SQL.
This was a very high level overview of Cassandra touching on the history and
conceptual architecture with one main use case for using Cassandra in concert
with Postgres. There’s way more that I didn’t cover, topics like data
modeling, querying and best practices, which could be lengthy posts in their own right.
If you have any questions or comments, please feel free to reach out to me via
twitter or email.
The best software products are those that give their customers great
experiences. The opportunity to delight customers or users of your products can
happen during any interaction. For example, Slack’s awesome onboarding experience
is great for setting expectations within the app. While web and mobile apps tend
to get all of the limelight relative to experience design, creating exceptional
command line experiences can be just as unique and challenging. Command line
tools range from large pieces of software, like the gcc compiler, to the
smallest of tools, like the ls command that lists the contents of the current working
directory on Unix systems. The famously titled “Art of Unix Programming”
distills a philosophy from the way programs and interfaces were built for Unix
systems. One of my favorite Unix philosophy tenets is building programs that do
one thing and do it well. While the Unix
philosophy is a good base for
defining a good command line experience, I’ve found that I’ve needed a few more tenets.
Building trust means doing what you say you’re going to do and if you can’t, you
need to be forthcoming and explicit. Being consistent with this behavior across
all of the possible interactions of your tool helps your customers easily
identify when the tool is not working as expected.
As an example of what not to do, let me use a fake command called credential
that will give a user temporary authorization against an API. It requires one
parameter: the duration, in hours, that I want the authorization to be valid.
$ credential 1
Credential successfully created!
Great! Based on the above command, I’d assume that my credential was created
successfully and I have 1 hour before the system will revoke my access. Now
what if I wanted to increase my duration?
$ credential 100
Credential successfully created!
Based upon this message, I’d assume that I got a credential that’s good for 100
hours. Little did I know that the maximum value for duration is 30 hours. Once I
go over 30 hours, I’ll get error messages interacting with my service saying
that I don’t have access any more. That’s a mismatch between what I expected and
what I actually got. Don’t make your users guess!
I’m a huge fan of the concept of progressive enhancement. I’m
borrowing this term from the web design sphere and applying it to command line
tools. Essentially, progressive enhancement is a way of providing the most basic
functionality to all users but if the user has a deeper understanding of the
underlying systems at play, more power and flexibility should be available.
For example, let’s use something as mundane as the ls command on unix systems.
This command tips its hat to the concept of progressive enhancement. At the
command’s most basic level you can just get the contents of your current working directory.
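A plain invocation might look something like this; the output shown is from my own machine, so yours will differ:

```
$ ls
Applications    pgadmin.log
```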
Knowing what you know about filesystems, you understand that hidden files can
have their names prefaced with a dot and that some of the items that you’ve
listed in this current directory might have sub-directories. Diving a little
deeper, I could issue the ls command with a few extra options to surface that information:
$ ls -ap
.vimrc Applications/ pgadmin.log
Be very careful about this concept. Understand the range of abilities of the
users that you expect to use your tool and make sure you target those levels and
nothing more. You may not need everything under the sun, from super beginner to power user.
Documentation can take many forms, from the help text that gets printed when a
command is run incorrectly, to the source code comments and even the website
that houses the extended information on how to operate the command, just to name
a few. Unfortunately, in practice, building good documentation tends to be one
of the areas that really gets neglected. Now there’s something to be said for a
user not having to read the full set of docs to figure out your command line
tool. If they have to go to the docs to figure it out, then the experience is
largely a failure. In that sense, I view documentation as addressing a few other
areas for the user. The first of which is discoverability around what your
command line tool can do and the second is building you or your company’s brand.
Web and mobile apps have it way easier when it comes to discoverability.
They have all sorts of navigation menus, notifications and graphical elements
like font sizes to work with. However, documentation can provide a level of
discoverability that you might not get when a user interacts with your tool on a
daily basis. Let’s use the ngrok tool as an example. Ngrok is a tool that allows
you to create secure introspected tunnels to your local machine. What I love
about it is how easy it is to get started and how it tells you what you need to
do via documentation. For example, you can go to the ngrok site, download the
binary, unzip it and issue the command ngrok. What happens next is fantastic: a
full set of information on what ngrok is, examples of how to use it, version
numbers and commands get published in the terminal:
ngrok - tunnel local ports to public URLs and inspect traffic
ngrok exposes local networked services behind NATs and firewalls
to the public internet over a secure tunnel. Share local websites,
build/test webhook consumers and self-host personal services.
Detailed help for each command is available with 'ngrok help
<command>'. Open http://localhost:4040 for ngrok's web interface
to inspect traffic.
ngrok http 80 # secure public URL for port 80 web server
ngrok http foo.dev:80 # tunnel to host:port instead of localhost
ngrok tcp 22 # tunnel arbitrary TCP traffic to port 22
From a discoverability perspective, I know now what ngrok can do and how to get extra help.
The second front where documentation can help is on the branding side. How
thorough are you in your docs? Do you provide examples? Do you tell users where
to go if they have more questions? Going back to our ngrok example, the
documentation on their website does a good job of answering all of the questions
above and more. To me, the docs convey a sense of quality and that contributes
to how I may perceive the quality of the command line tool itself and how the
company conducts itself. When it comes to branding, every touch point a customer
has with anything the company produces results in the customer building a
profile of what that company stands for. Docs are a great way of implicitly
communicating that value.
My Tenets of Great Command Line Experiences
The Unix philosophy is a great starting point for understanding what it takes to
build good command line experiences. I highly recommend reading through the “Art
of Unix Programming”. On top of that, I’ve found that building command line
tools takes a little bit more:
Trust - trust for a command line tool means that a developer can
reason about the behavior of your tool.
Progressive enhancement - understand who’s going to use your tool and give
them some options to get more out of it.
Documentation - use it to teach and delight your users on what your tool can do.
While these are the tenets that I follow, I’m sure there’s more out there. If
you have principles that you swear by when you create command line tools, drop
me a line, @neovintage, or comment over on