aarondwi.github.io

View the Project on GitHub

Low Cardinality Indexing

This note focuses on OLTP setting, not OLAP. In OLAP setting, we can just spam more CPUs to do parallel scan/aggregate/etc. In OLTP, we can also do that, but also bottlenecked by contention for isolation, number of queries, latency target, etc.

Also assuming that the queries cant use other highly selective fields (e.g. unique key, date, etc). If it can, then the low cardinality field can be checked on filter step rather than access step.

For simple, relatively static pattern

  1. On status/enum field. Typical example is if a record is already processed or not. Either use normal index with the status/enum as first column (so combined together on search), or using partial index (e.g. WHERE status is NULL). Partial/Filtered index for example supported in PostgreSQL, VoltDB, Oracle, SQL Server, CockroachDB, Yugabyte
  2. For relatively close in number of records, statis + known value, e.g. VALUES IN ('A', 'B', 'C', 'D', ...). Basically the same as partitioning by list. Physically partitioned, so queries can directly filter by this partitioning. Supported by both PostgreSQL, MySQL, CockroachDB, YugaByte, Oracle, SQL Server, VoltDB

For heavy dynamic patterns

Typical in OTA industries, as most search result has many boolean check (e.g. HAS_SMOKE_AREA, HAS_AC, HAS_WIFI, etc). Need to use bitmap index, and use partitioning, so do not need to use bitmask over every single record. Tarantool and Lucene (and basically their users, Elasticsearch and Solr) has support for this. Pilosa is a full blown distributed store, specialized in bitmap indexing.

Can also use bitmap index libraries, such as Roaring Bitmap or those using SIMD instructions, such as kelindar/bitmap. For few patterns to be aware of, see here