Simple Strategy for Data Modeling in Cassandra

Erarica Mehra
5 min read · Dec 9, 2020

Apache Cassandra is a fast, distributed database with high availability and linear scalability. Modern applications often choose Apache Cassandra over an RDBMS because of its ability to handle large volumes of data at very high speed. Apache Cassandra follows a decentralized peer-to-peer architecture that is reliable, easy to scale, and fault-tolerant, and it recovers from node failures without losing any data.

If you come from a relational background, CQL may look familiar, but the way you apply it can be very different. The goal of this article is to familiarize you with the simple strategies and practices I used when I first switched from relational databases to Cassandra.

Cassandra is a write-optimized database

Don’t be afraid of duplicating your data. Here’s why

The Write Path: Cassandra has an incredibly simple write path. A write can be sent to any node in the cluster; the node servicing your request is called the coordinator node. When the write reaches a node, it is first appended to the commit log (an append-only data structure on disk). The write is then applied to an in-memory representation of the table called a memtable. At this point, the coordinator node sends the acknowledgment back to the client. Memtables are asynchronously flushed to disk as SSTables. A partition can be spread across multiple SSTables.

The Read Path: Any node of the cluster can receive a read request; that node becomes the coordinator node. The coordinator contacts the nodes holding replicas of the requested key. On each node, data is pulled from the memtable and from multiple SSTables, merged on the basis of timestamps, and the response is sent back to the client via the coordinator node.
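You can watch the read path at work from cqlsh: the built-in TRACING command prints, for each query, the coordinator chosen, the replicas contacted, and the steps executed on each node. A minimal sketch (the keyspace and table names here are just placeholders):

-- enable tracing; cqlsh prints trace events after each query's result rows
TRACING ON;

-- run any read and inspect which nodes and sstables served it
select * from my_keyspace.my_table where id = 42;

-- disable tracing again
TRACING OFF;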

Cassandra has expensive reads but cheap writes: it is easier to write data into Cassandra than to fetch it back out. It is therefore important to make sure that data is evenly distributed across the cluster and that each read touches as few partitions as possible, ideally just one. To achieve this, you duplicate your data across multiple tables, each one shaped for a specific read.
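As a concrete illustration (a hypothetical users model, not from the examples below), the same user data can be kept in two tables, each keyed for a different lookup:

CREATE TABLE users_by_id (
user_id bigint,
email text,
name text,
PRIMARY KEY (user_id)
);

CREATE TABLE users_by_email (
email text,
user_id bigint,
name text,
PRIMARY KEY (email)
);

-- every write goes to both tables, so each lookup is a cheap
-- single-partition read; a logged batch keeps the two inserts atomic
BEGIN BATCH
insert into users_by_id (user_id, email, name) values (1, 'ada@example.com', 'Ada');
insert into users_by_email (email, user_id, name) values ('ada@example.com', 1, 'Ada');
APPLY BATCH;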

Designing a Data Model using simple rules

1. Determine the queries you need:

First of all, write down the queries you need to support. Be very careful when finalizing them, as the queries determine the primary key of each table.

2. Create tables to satisfy the queries:

Each table should be created to satisfy a specific query (or a small set of closely related queries).

Design the primary key of the table in such a way that the data is evenly distributed across the cluster, and make sure no single partition carries too much of the load. The primary key must also uniquely identify every record: in Cassandra, an insert with an already-existing primary key is an upsert, so the existing row is silently replaced by the new one.
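A quick demonstration of that upsert behavior (the table here is hypothetical, for illustration only):

CREATE TABLE songs_by_id (
song_id int,
title text,
PRIMARY KEY (song_id)
);

insert into songs_by_id (song_id, title) values (1, 'First Title');

-- same primary key: no error is raised, the previous row is silently replaced
insert into songs_by_id (song_id, title) values (1, 'Second Title');

-- returns a single row with title = 'Second Title'
select * from songs_by_id where song_id = 1;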

Here are some examples to illustrate the rules

Example 1: Display the total number of views on each product of your website

Let us consider an e-commerce website where you need to track the total number of views on each product of your website. This means that whenever a user visits a product, the count should be incremented by 1. In this example, we will use Counters to calculate the total number of views per product.

Step 1: Determine the query

select count from views_by_product_id where product_id = 7678;

Since the product_id of every product is unique, product_id can serve as the primary key.

Step 2: Create a table to satisfy the query

CREATE TABLE views_by_product_id (
product_id bigint,
count counter,
PRIMARY KEY (product_id)
) ;

Note: To load data into a counter column, use UPDATE. The counter data type cannot be used for a column that is part of the primary key or partition key, and in a table with a counter column, all non-counter columns must be part of the primary key. Read the Cassandra documentation on counters for more details.

update views_by_product_id set count = count + 1 where product_id = 7678;
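Conversely, a plain INSERT into a counter table is rejected; counters can only be changed through UPDATE, whether incrementing or decrementing:

-- decrementing works the same way as incrementing
update views_by_product_id set count = count - 1 where product_id = 7678;

-- an INSERT like the following would be rejected by Cassandra:
-- insert into views_by_product_id (product_id, count) values (7678, 1);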

Example 2: Check the latest book of an author

Let us consider an example where we need to view the latest book of an author from the book_by_author table.

Step 1: Determine the query

select * from book_by_author where author = 'Mark Twain';

Step 2: Create a table to satisfy the query

CREATE TABLE book_by_author (
author text,
year int,
genre text,
title text,
rating int,
price double,
PRIMARY KEY (author, year)
) WITH CLUSTERING ORDER BY (year DESC);

It is clear from the query that the author column should serve as the partition key. But we also need the results ordered so that the latest book appears first; therefore we use year as a clustering column with a descending clustering order.
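Because rows within each author's partition are stored newest-first, fetching only the latest book is just a matter of adding LIMIT 1 to the same query:

-- rows in the 'Mark Twain' partition are sorted by year DESC on disk,
-- so the first row returned is the latest book
select * from book_by_author where author = 'Mark Twain' limit 1;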

Example 3: Get a book by rating

Let us consider an example where you need to display books published in the year 2020 with a rating of 5.

Step 1: Determine the query

select * from book_by_rating where rating = 5 and year = 2020;
or
select * from book_by_rating where rating in (4, 5) and year = 2020;

Step 2: Create a table to satisfy the query

CREATE TABLE book_by_rating (
rating int,
year int,
title text,
author text,
price double,
PRIMARY KEY (rating, year, title)
) WITH CLUSTERING ORDER BY (year DESC, title DESC);

You might be wondering why I have included title in the primary key. It is because rating and year alone do not uniquely identify a record: there might be multiple records with the same rating and year, and as explained earlier, inserting them would silently overwrite one another. Adding title to the primary key makes each record unique.

It is also important to take into account the amount of data a partition might hold in the long run. In this example, partitions may become overburdened as the data grows. Let's assume the rating column can hold one of five values (1, 2, 3, 4, or 5). In that case, all the data would be distributed among only five partitions, making each of them enormous. To avoid this, we can use a composite partition key, one that is made up of more than one partition key column. Data would then be distributed across the cluster based on the hash value of rating and year taken together.

CREATE TABLE book_by_rating (
rating int,
year int,
title text,
author text,
price double,
PRIMARY KEY ((rating, year), title)
) WITH CLUSTERING ORDER BY (title DESC);

Note that year is now a partition key column, so it can no longer appear in the CLUSTERING ORDER BY clause; title is the only remaining clustering column.
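One trade-off to keep in mind: with the composite partition key, a query must supply both rating and year to locate a partition, so querying by rating alone is no longer possible:

-- works: both partition key columns are specified
select * from book_by_rating where rating = 5 and year = 2020;

-- rejected: the partition cannot be located from rating alone
-- select * from book_by_rating where rating = 5;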
