Bad data model design with Cassandra causes chronic pains as application scales. I had to re-read about data model design in “Cassandra – the Definitive Guide” and keep my notes and thoughts in this post.
The data modelling in the relational world is indoctrinated to every students out of university. It embraces several things:
- Entity-Relation: we typically start with tables that represents entities, and then tables that expresses relations;
- Query design after table design: we can join multiple tables, index certain fields for better query performance;
- Data normalization: several normal forms (NFs) are brought up to better organize data; de-normalization only occurs when 1) performance bottleneck reached; and 2) specific requirement on retaining snapshots of previous (un-updated) value in a field;
- referential integrity: we can specify foreign keys on a table to reference the primary key of a record in another table; we can configure cascading deletes, etc;
Anybody with years of experience with relational database may have all these built in their instinct. Unfortunately, Cassandra does not follow any of these patterns. For someone with relational database background, the trip entering the Cassandra design is very counter-intuitive.
In Cassandra you have very limited options to achieve what you can do with joins in relational realm. One option is to duplicate the data column on different tables (a pattern against the “normalization” best practice). The second table is denormalized and it represents the join results. The other option rarely applied is to do the work on the client side.
Query-driven data modelling
In relational database, you start writing queries after tables are laid out to pull together disparate data, using the relationship defined by the keys. The queries is secondary concern. It is assumed that you can always get the data you want as long as you have your tables modelled properly, even if you have to use several complex subqueries or join statements.
In Cassandra, You do not start with tables to represent entity. Instead, you would start with queries, and then organize data around the queries. This means an upfront effort must be made to investigate what queries the client application may perform, and work backwards with tables that answers those queries in the most efficient manner. Table names in Cassandra often takes names such as: hotels_by_poi, avaialbe_rooms_by_hotel_date, reservations_by_guest, reservations_by_hotel_date. On the tables the selection of partition key and clustering keys should also consider best query performance and avoids wide partitions. It should also ensure with best effort that a query should not have to travel across multiple partitions in order to return results.
Due to the query-driven modelling approach, Cassandra usually need to be designed with denormalization. The entire concept of normalization applies only to relational world and in Cassandra it’s perfectly normal to organize data that are against NFs.
Designing for optimal storage
Cassandra tables are each stored in separate files on disk. Its best practice to keep related columns defined together in the same table. We need to minimize the number of partitions that must be searched in order to satisfy a given query. Because the partition is a unit of storage that does not get divided across nodes, a query that searches a single partition will typically yield the best performance.
The book “Cassandra: the definitive guide” contains a great example of modelling hotel reservation system. This article is also a good guideline.