Apache Cassandra is a NoSQL database initially developed by Facebook to power its inbox search feature, and released as open source on Google Code in 2008. Cassandra is massively scalable and well suited to managing large amounts of semi-structured data. The most common use case is storing time-series data.
NoSQL databases are non-relational, distributed stores of key-value pairs, documents, graphs, or wide columns. They're easy to scale out because they scale horizontally, by adding database servers to a pool. These databases are an ideal choice for hierarchical data storage, with data in JSON or a similar format, and for large data sets, such as big data.
Other examples of NoSQL database products apart from Cassandra include MongoDB, Redis, and RavenDB. AWS provides a fully managed NoSQL database service called Amazon DynamoDB. However, Cassandra is a strong contender: it sits in the top ten of the most popular database engines (DB-Engines Ranking, https://db-engines.com/en/ranking) and is number one among wide column stores (https://db-engines.com/en/ranking/wide+column+store). According to the system vendor, a quarter of the Fortune 100 companies use it.
Cassandra provides availability, scalability, and simplicity across multiple distributed locations, with no single point of failure. Each node knows the cluster topology, and the nodes exchange this information with one another every second via the gossip protocol.
Let’s see Cassandra in action (in the cloud).
Deploying Cassandra in AWS
The largest unit of deployment in Cassandra is a cluster. Each cluster consists of nodes from one or more distributed locations – these are availability zones in AWS.
A single availability zone contains a collection of nodes that store partitions of data. To create a highly available, distributed Cassandra cluster in AWS, you need to leverage multiple availability zones within a single region. Although you can distribute Cassandra across regions, we'd recommend keeping data and application in the same region to minimize latency. Bear in mind that data in one region won't automatically replicate to another region, so you'll need to configure that replication yourself.
Spreading Cassandra nodes across multiple availability zones is easy, as Cassandra has a masterless architecture and all nodes play an identical role. An architecture with no single point of failure makes Cassandra highly fault-tolerant, which is a significant advantage. If you spread nodes across availability zones, Cassandra will remain available even if one of the availability zones suffers an outage.
Snitch will tell you all…
You can make Cassandra clusters Amazon EC2-aware by defining an appropriate snitch setting in the cassandra.yaml file. A snitch determines which data centers and racks Cassandra nodes belong to. In AWS terms, a data center is a region, and a rack is an availability zone.
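As a minimal sketch, the snitch is a single setting in cassandra.yaml; for a single-region EC2 deployment it could look like this:

```yaml
# cassandra.yaml (fragment) — make Cassandra EC2-aware.
# Ec2Snitch maps the AWS region to the data center name and the
# availability zone to the rack name.
endpoint_snitch: Ec2Snitch
```

Every node in the cluster must use the same snitch, so this setting has to be consistent across all cassandra.yaml files.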
Snitches also inform Cassandra about the network topology so that it can route requests efficiently and distribute replicas by grouping machines into regions and availability zones. The replication strategy places replicas based on the information provided by the snitch, and Cassandra assigns one replica per availability zone, if possible.
When deploying Cassandra in AWS, you can use Ec2Snitch and Ec2MultiRegionSnitch: the former for single-region clusters and the latter for multi-region clusters.
Another task of the snitch setting is to allow Cassandra to place data replicas in different availability zones during a write operation. If there are three availability zones, with a Cassandra cluster in each, and you set a replication factor of 3 at the keyspace level, then every write operation will be replicated across nodes in three AZs.
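For illustration, with Ec2Snitch the data center name is derived from the region (e.g. us-east for us-east-1), so a three-way replicated keyspace might be defined like this (the keyspace name is a hypothetical example):

```sql
-- Hypothetical keyspace with a replication factor of 3 in the
-- us-east data center. With one replica placed per AZ where
-- possible, every write lands in all three availability zones.
CREATE KEYSPACE sensor_data
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3
  };
```

NetworkTopologyStrategy is the strategy that consults the snitch's data-center and rack information when placing replicas.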
We recommend using multiple AZs and configuring Cassandra to replicate data across them. This will help ensure that your cluster remains available during an availability zone failure. The replication factor should be a multiple of the number of AZs to ensure even distribution of data across all nodes.
Don’t forget about the seed nodes
When a new node wants to join the existing cluster, it consults the so-called seed node. If the seed node is unavailable, the new node won’t be able to join. Therefore we recommend spreading seed nodes across multiple availability zones.
Seed node IP addresses are hardcoded in the .yaml configuration file. If the seed node fails, the yaml configuration on every node in the cluster needs to be updated with the IP of the new seed node that is to replace the failed one. To avoid such a scenario, you can create an Elastic Network Interface (ENI), which can be attached, detached, and reattached to any instance. All the attributes follow the ENI. When you move it to the new seed node, the traffic will be redirected. Of course, it can be automated, too.
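The seed list described above lives in cassandra.yaml; a sketch with hypothetical ENI addresses, one per availability zone, could look like this:

```yaml
# cassandra.yaml (fragment) — seed list. The IPs below are
# hypothetical private addresses of ENIs, one per availability
# zone. Because each address belongs to an ENI rather than to an
# instance, a replacement seed node inherits the IP when the ENI
# is reattached, and no other node's configuration changes.
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.1.10,10.0.2.10,10.0.3.10"
```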
Another recommendation regarding networking in Cassandra deployments is to enable the enhanced networking feature on the EC2 instance types that support it. That feature gives you better performance, lower latency, and lower jitter.
Achieving extreme High Availability with Cassandra
Cassandra's masterless design combined with multi-AZ clustering makes it perfect for horizontal scaling. It provides not only better performance but also significant fault tolerance. As AWS offers an almost unlimited pool of resources, you can scale out Cassandra clusters horizontally as much as you want.
The above diagram shows a basic high-availability configuration for a Cassandra cluster. It spans three availability zones within one region. Cassandra's nodes are placed in private subnets and can access the Internet for updates through a NAT Gateway. SSH access can be provided through bastion hosts.
Each availability zone contains a seed node. Not all nodes should be seeds, as too many seeds can affect performance. The rule of thumb is to have no more than two seed nodes per availability zone.
An essential factor for distributed database environments is data consistency, which refers to how up-to-date and synchronized a row of Cassandra data is across all of its replicas. Cassandra offers tunable consistency: for any given read or write operation, the client application decides how consistent the requested data must be.
Within a single AWS region, you can use LOCAL_QUORUM consistency. In a cluster with a replication factor of 3, a read or write operation is considered a success when two out of three replicas in the region report success. As the cluster spreads across three availability zones, local quorum reads and writes remain available even if an availability zone fails.
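The quorum arithmetic behind this is easy to check: a quorum is a majority of the replicas, i.e. floor(RF / 2) + 1. A minimal sketch:

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must acknowledge a LOCAL_QUORUM
    read or write: a majority, i.e. floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

# With RF = 3 spread over three AZs, two replicas must respond,
# so the operation survives the loss of any single AZ.
print(quorum(3))  # 2
```

This is also why even replication factors buy little extra availability: RF = 4 still tolerates only one lost replica, since its quorum is 3.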
Auto-scaling Cassandra cluster
To add another layer of high availability and avoid manual replacement of failed clusters, consider introducing some level of automation.
For a cluster, this requires creating an auto scaling group with the minimum, maximum, and desired sizes set to the same value. This allows auto scaling to bring up a new node in place of a failed one. If the launch configuration includes software installation and configuration, the new node will join the ring automatically, and Cassandra will take care of streaming data to it.
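As an illustrative CloudFormation fragment (the resource name, subnet IDs, and launch configuration reference are hypothetical placeholders), pinning minimum, maximum, and desired capacity to the same value turns the group into a self-healing, fixed-size pool:

```yaml
# CloudFormation fragment (hypothetical names). With
# MinSize = MaxSize = DesiredCapacity, auto scaling never grows
# or shrinks the group; it only replaces failed instances,
# keeping the Cassandra ring at a constant size.
CassandraNodeGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "3"
    MaxSize: "3"
    DesiredCapacity: "3"
    VPCZoneIdentifier:
      - subnet-az1a   # private subnet in AZ a (placeholder IDs)
      - subnet-az1b   # private subnet in AZ b
      - subnet-az1c   # private subnet in AZ c
    LaunchConfigurationName: !Ref CassandraLaunchConfig
```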
We already mentioned using ENI to preserve the IP addresses assigned to the seed nodes. You need to create an ENI with a static private IP address per seed node and add it to the seed list. Automation can then attach the ENI to the new instance provisioned by the auto scaling group instead of the failed one.
You may scale bastion hosts in the same manner as Cassandra nodes, but you cannot do so with the OpsCenter nodes. They have a master-slave configuration, and the master node has to have a static IP mapped to it. However, you can leverage an Elastic IP address for that.
For the failover, you can assign master and slave nodes individually to two different auto scaling groups with the minimum, maximum and desired size of 1. If either the master or the slave node fails, auto-scaling will restore it. The new node should become a new backup node. Automation should also detect a failover and remap the elastic IP to the new master.
To enable multi-region clusters, you need to allow multi-region communication. To achieve that, you can use a virtual private network or public IP addresses over the Internet. Remember to reconfigure the snitch: for a VPN setup, you can use either Ec2Snitch or GossipingPropertyFileSnitch; for communication over the Internet, set the snitch to Ec2MultiRegionSnitch.
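With GossipingPropertyFileSnitch, each node declares its own data center and rack in cassandra-rackdc.properties and gossips them to the rest of the cluster. To stay consistent with the EC2 snitches you can mirror the region/AZ naming (the values below are examples):

```properties
# cassandra-rackdc.properties (example values) — read by
# GossipingPropertyFileSnitch on each node. Mirroring the EC2
# region/AZ naming keeps the topology consistent if you ever
# switch between this snitch and the EC2 snitches.
dc=us-east
rack=1a
```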
Set the consistency level for multi-region clusters to LOCAL_QUORUM. This helps you avoid constructing a quorum across AWS regions and increasing latency.
Setting up a basic Cassandra cluster in AWS is not as hard as it may seem. The AWS cloud provides the tools to quickly deploy Cassandra in a scalable and fault-tolerant environment. That makes Cassandra an excellent alternative to Amazon DynamoDB. True, you have to manage Cassandra on your own; on the other hand, it offers a higher level of control than other NoSQL databases.
As usual, before you make the final choice of the tool, consider all pros and cons. But without a doubt, there’s no reason to be afraid of deploying Cassandra in the AWS cloud.