Tombstones in Apache Cassandra®#
Apache Cassandra manages deletion of data via a mechanism called tombstones. Because Cassandra is a distributed system, it cannot delete data immediately in the same way as a traditional relational database. At a high level, when a row is deleted, Cassandra does not remove it immediately but marks it with a tombstone. The row is actually removed later, as part of regularly scheduled maintenance called compaction. How long a tombstone must be kept before it can be removed is controlled by the table-level setting gc_grace_seconds. Any tombstone older than this setting will be removed completely during compaction (with some caveats - more details in the Cassandra documentation on compaction). The default value for this setting is 864000 seconds (10 days).
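The lifecycle described above can be sketched with a toy in-memory model. This is an illustrative simplification, not Cassandra's storage engine: the ToyTable and Cell classes, their method names, and the explicit now arguments are all hypothetical, and real compaction applies additional conditions before a tombstone is purged.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

GC_GRACE_SECONDS = 864_000  # the default quoted above: 10 days


@dataclass
class Cell:
    value: Optional[str]  # None means this entry is a tombstone
    written_at: int       # seconds since an arbitrary starting point


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table; not how Cassandra stores data."""
    gc_grace_seconds: int = GC_GRACE_SECONDS
    cells: Dict[str, Cell] = field(default_factory=dict)

    def insert(self, key: str, value: str, now: int) -> None:
        self.cells[key] = Cell(value, now)

    def delete(self, key: str, now: int) -> None:
        # A delete writes a tombstone marker instead of removing the entry.
        self.cells[key] = Cell(None, now)

    def read(self, key: str) -> Optional[str]:
        # Tombstoned entries look deleted to readers but still occupy space.
        cell = self.cells.get(key)
        return None if cell is None else cell.value

    def compact(self, now: int) -> None:
        # Purge only tombstones that are older than gc_grace_seconds.
        self.cells = {
            key: cell
            for key, cell in self.cells.items()
            if not (cell.value is None
                    and now - cell.written_at > self.gc_grace_seconds)
        }


table = ToyTable()
table.insert("item-1", "widget", now=0)
table.delete("item-1", now=100)

table.compact(now=200)                         # tombstone is too young, so it stays
print("item-1" in table.cells)                 # True
table.compact(now=100 + GC_GRACE_SECONDS + 1)  # now older than gc_grace_seconds
print("item-1" in table.cells)                 # False
```

In this sketch the row is invisible to reads as soon as it is deleted, but it only leaves storage after a compaction that runs once the tombstone has aged past gc_grace_seconds.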
Tombstone tradeoffs#
If your system has very large numbers of tombstone rows, this can lead to unexpected behaviour, since rows that seem deleted are in fact still on disk, only with a tombstone marker. If your workload includes a lot of data deletion, it is useful to be aware of the tradeoffs. Tombstones are only garbage-collected periodically, during compaction, and until then they add work that can affect cluster stability. The two main things affected are read performance and disk usage.
Tombstones and read performance#
If read queries have to scan large numbers of tombstones, query performance can degrade significantly. In particularly bad cases, the query can even time out. A few types of queries are more likely to be affected, all of which involve scanning all or a large part of a table:
Full table scans like SELECT * FROM inventory.items
Any query that requires adding ALLOW FILTERING
Range queries, i.e. queries with WHERE item_cost > threshold or similar
Tombstones and disk usage#
If you are rapidly filling up your cluster with data at the same time as you are doing a lot of deletions, you will reach size limits sooner. This is because deleted data is not removed right away: it stays on disk, together with its tombstone, and keeps taking up space until it is compacted away.
Identify when tombstones affect a query#
If you suspect problems caused by tombstones for your cluster, you can check the logs. By default, if a query encounters over 1000 tombstones (configured by tombstone_warn_threshold, see the documentation) it will generate a log entry. The entry will be in the format Read <X> live rows and <Y> tombstone cells for query <query> [...] (see tombstone_warn_threshold). If it encounters over 100 000 tombstones (configured by tombstone_failure_threshold), the query will be aborted with a TombstoneOverwhelmingException (or just time out). To investigate a query that is encountering tombstones, the easiest way is to connect with a cqlsh session and run TRACING ON followed by the query of interest. You can view values of Cassandra settings with SELECT * FROM system_views.settings WHERE name = '<setting name>';
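When scanning logs for these entries, a small parser can help pick out the affected queries. The sketch below assumes the message layout quoted in this section; the sample log line, the regular expression, and the function names are illustrative assumptions, and the exact wording may differ between Cassandra versions. The two thresholds are the default values mentioned in this section.

```python
import re
from typing import Optional, Tuple

WARN_THRESHOLD = 1000        # default tombstone_warn_threshold
FAILURE_THRESHOLD = 100_000  # default tombstone_failure_threshold

# Matches the message layout quoted in the text; the trailing hint is optional.
TOMBSTONE_LINE = re.compile(
    r"Read (?P<live>\d+) live rows and (?P<tombstones>\d+) tombstone cells "
    r"for query (?P<query>.+?)(?: \(see tombstone_warn_threshold\))?$"
)


def parse_tombstone_warning(line: str) -> Optional[Tuple[int, int, str]]:
    """Return (live_rows, tombstones, query), or None if the line does not match."""
    match = TOMBSTONE_LINE.search(line.strip())
    if match is None:
        return None
    return int(match["live"]), int(match["tombstones"]), match["query"]


def classify(tombstones: int) -> str:
    """Compare a tombstone count against the default thresholds."""
    if tombstones > FAILURE_THRESHOLD:
        return "abort"  # the query would be aborted
    if tombstones > WARN_THRESHOLD:
        return "warn"   # the query would produce a log entry
    return "ok"


# Made-up example line, for illustration only.
sample = (
    "Read 12 live rows and 4321 tombstone cells for query "
    "SELECT * FROM inventory.items LIMIT 100 (see tombstone_warn_threshold)"
)

parsed = parse_tombstone_warning(sample)
if parsed is not None:
    live, tombstones, query = parsed
    print(classify(tombstones), tombstones, query)
```

If the thresholds have been changed on your cluster, adjust the two constants to match the configured values.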
Tombstone best practice#
Designing your data models and query strategies to account for the tombstones your particular application is expected to produce can really help you get the best from Apache Cassandra. We’ve put together a list of strategies to help mitigate the effects of tombstones.
Review your data model and compaction strategy and consider implementing table-level time-to-live (TTL) or using TimeWindowCompactionStrategy (TWCS) as the compaction strategy if appropriate for your workload.
Avoid queries that end up running on all partitions in a table, such as queries with no WHERE clause, or queries that need ALLOW FILTERING.
Update your queries so that they don’t have to scan over tombstone rows in the same manner. For range queries, this might mean investigating if you can use a narrower range, or use a different approach to the query.
If you are planning to delete all the data in a table, you can truncate the table to avoid creating tombstones.
Allow tombstone deletion to happen automatically as part of regular operations rather than forcing the deletes. Once a tombstone is older than gc_grace_seconds and a compaction runs, the data with tombstone marks will be removed from disk.