Current Database Concurrency Control Algorithm Landscape

These are my thoughts on why academic CC algorithms are not widely adopted by the newest database products (and vendors).

Background

  1. Implementation-wise, lots of production databases adopt traditional 2PC/OCC + a clock-like ordering mechanism + snapshots for reads, not esoteric algorithms. While writes may hold locks for a long time (because of distribution, fsync latency, etc.), reads don’t have to suffer, and since the majority of workloads are read-heavy, this works well (see the sketch after this list).
  2. Lock-based schemes are usually easier to understand (because everyone is already used to them) and to implement (because single-node and distributed deployments follow the same pattern), which means fewer bugs, for both pessimistic 2PC and OCC.
  3. Even non-advanced CC algorithms already meet performance requirements. The problems faced in real production are usually more about schema-change stability, capacity (memory, disk, etc.), recovery, and so on.
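
To make point 1 concrete, here is a minimal, hypothetical sketch (in Python, with made-up names) of the common pattern: writes take a lock and bump a version, while reads just pick the latest version visible at their snapshot timestamp, so they never block behind writers.

```python
import threading

class VersionedStore:
    """Toy MVCC store: writers lock per key, readers read at a snapshot timestamp."""
    def __init__(self):
        self.lock = threading.Lock()          # coarse lock, stand-in for per-key/2PC locking
        self.clock = 0                        # stand-in for a clock-like ordering source (TSO/HLC)
        self.versions = {}                    # key -> list of (commit_ts, value)

    def begin_snapshot(self):
        # A read-only transaction just captures the current timestamp; no locks taken.
        return self.clock

    def read(self, key, snapshot_ts):
        # Return the newest version committed at or before the snapshot; never blocks.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

    def write(self, key, value):
        # Writers serialize through the lock (and in a real system also pay fsync/replication).
        with self.lock:
            self.clock += 1
            self.versions.setdefault(key, []).append((self.clock, value))
            return self.clock

store = VersionedStore()
store.write("stock", 100)
ts = store.begin_snapshot()
store.write("stock", 99)          # a concurrent write does not disturb the snapshot
assert store.read("stock", ts) == 100
```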

There are some techniques already used to work around the heavy-contention problem, such as:

  1. Some companies run a separate campaign. For example, an e-commerce site may split voucher redemption from the sales day itself, by asking users to manually redeem beforehand, or to invite others to get more discounts. This spreads the contention out not over 1 minute but over, say, 3 days. It also doubles as an easy promotion.
  2. In another case, if only 1 type of item is going to be sold (Xiaomi ran this kind of flash sale before), developers can also opt not to materialize the conflict at all: there is no limit that must be preserved, so they only need to record who bought the item, who has already paid, etc. (I don’t know whether this was the technique actually applied, but it should work; see the sketch after this list.)
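
A minimal sketch of point 2, assuming a hypothetical flash-sale service: instead of decrementing a shared stock counter (one hot row everyone conflicts on), each purchase is appended as its own record, so concurrent buyers never touch the same key.

```python
import uuid

class FlashSaleOrders:
    """Append-only purchase log: no shared counter, so no hot-row contention."""
    def __init__(self):
        self.orders = {}   # order_id -> order record; stand-in for an insert-only table

    def place_order(self, user_id, item_id):
        # Each buyer writes a brand-new row keyed by a unique id; writes never conflict.
        order_id = str(uuid.uuid4())
        self.orders[order_id] = {
            "user": user_id,
            "item": item_id,
            "paid": False,
        }
        return order_id

    def mark_paid(self, order_id):
        # Payment only touches that buyer's own row, still no cross-user contention.
        self.orders[order_id]["paid"] = True

sale = FlashSaleOrders()
oid = sale.place_order(user_id=42, item_id="phone-x")
sale.mark_paid(oid)
```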

Examples of the latest CCs used in well-known production DBs:

  1. FoundationDB -> OCC 2PC; conflict detection may abort transactions that did not actually conflict, but never misses a real conflict, and conflict ranges can be explicitly added/removed (see the sketch after this list).
  2. CockroachDB/YugabyteDB -> HLC OCC 2PC, like Spanner
  3. TiDB -> Percolator model, with backoff when meeting higher ids
  4. MongoDB -> Snapshot isolation, HLC, OCC
  5. AtlasDB -> Percolator model, then SSI for serializability
  6. Vitess and/or Citus/Hyperscale -> 2PC and/or best effort, with a bit more checking, like deadlock detection.
  7. FaunaDB -> an OCC variant of Calvin
  8. ING Bank’s Rebel -> PSAC
  9. VoltDB -> HStore model
  10. PostgreSQL -> Snapshot isolation + SSI
  11. Couchbase -> RAMP
  12. Facebook’s TAO -> RAMP-TAO
  13. SQL Server’s Hekaton
  14. GaussDB/openGauss MOT -> Silo
  15. NDB Cluster -> Lock-based 2PC hybrid: normal 2PC across partitions, linear 2PC across replicas, with backups getting locked first
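
To illustrate the FoundationDB-style OCC commit check from item 1, here is a simplified, hypothetical sketch (not FoundationDB's actual API): each transaction reads at a read version and records its read ranges; at commit, the resolver aborts it if any range written by a transaction committed after that read version overlaps its read ranges.

```python
class Resolver:
    """Simplified OCC conflict resolver: tracks recently committed write ranges per version."""
    def __init__(self):
        self.version = 0
        self.committed = []   # list of (commit_version, [(begin_key, end_key), ...])

    def get_read_version(self):
        return self.version

    def try_commit(self, read_version, read_ranges, write_ranges):
        # Abort if anything committed after our read version overlaps what we read.
        for commit_version, ranges in self.committed:
            if commit_version > read_version:
                for b1, e1 in ranges:
                    for b2, e2 in read_ranges:
                        if b1 < e2 and b2 < e1:   # half-open range overlap
                            return None           # conflict -> abort (client retries)
        self.version += 1
        self.committed.append((self.version, list(write_ranges)))
        return self.version

r = Resolver()
rv = r.get_read_version()
# txn A writes "k2".."k3" after we took our read version
r.try_commit(rv, read_ranges=[], write_ranges=[("k2", "k3")])
# our txn read "k1".."k5", which overlaps A's write -> aborted
assert r.try_commit(rv, read_ranges=[("k1", "k5")], write_ranges=[("k9", "k9\x00")]) is None
```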

More traditional designs, which allow explicitly asking for read/write locks:

  1. Traditional SQL RDBMSs (FOR SHARE / FOR UPDATE; see the sketch after this list)
  2. Neo4J
  3. Couchbase
  4. NDB/RonDB cluster
  5. InnoDB Cluster
  6. FoundationDB -> can explicitly add/remove locks (conflict ranges), as it is the optimistic variant of this
  7. SingleStore -> link 1, link 2
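
A small sketch of the explicit-lock pattern from item 1, using Python's DB-API style (the table and connection here are hypothetical): SELECT ... FOR UPDATE takes a row lock inside the transaction, so a concurrent writer blocks until we commit.

```python
# Hypothetical example: decrement stock safely by taking an explicit row lock.
# `conn` stands in for any DB-API connection (psycopg2, mysqlclient, etc.).

def buy_one(conn, item_id):
    with conn:                              # commit on success, rollback on exception
        with conn.cursor() as cur:
            # Row stays locked until this transaction commits; other buyers wait here.
            cur.execute(
                "SELECT stock FROM items WHERE id = %s FOR UPDATE", (item_id,)
            )
            (stock,) = cur.fetchone()
            if stock <= 0:
                raise RuntimeError("sold out")
            cur.execute(
                "UPDATE items SET stock = stock - 1 WHERE id = %s", (item_id,)
            )
```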

Why?

IMHO, it is because these other, more esoteric techniques have one or more of the following problems:

  1. Can’t be used in a distributed setting, because they are too expensive there (any in-memory gain is fully erased once distributed), or there is no explanation of how to do so at all
  2. No explanation of the management side (loading/updating ad-hoc data, adding indexes, etc.), focusing only on performance. These management operations basically should also go through the serializability check (or is that the only way?)
  3. Hard to implement correctly, needing lots of state tracking (and thus lots of memory)
  4. Most real-world problems have a clear, easy performance target (human speed), are easily shardable per user, and have low contention, with high contention only on specific aggregated metrics (total likes, etc.). Even Percolator easily reaches 2 million/s. The ones that can’t be handled this way are usually very specific (time series, HFT, etc.)
  5. Focus more on higher contention, by somehow rescheduling/reordering to remove it (but truly contended work is still sequential, so it is not really an improvement unless it is CRDT-like)
  6. Use 2PC for multi-partition transactions, so they are only scalable per partition (see the sketch after this list)
  7. Employ a non-snapshot algorithm, which performs badly for long-running read transactions, which are quite common (though they should be avoided anyway for high-throughput OLTP). Some have esoteric write-lock behavior to allow non-blocking reads, but this is far more complex than the standard snapshot algorithm
  8. Don’t use a single global TSO (or equivalent), which means very careful engineering is needed to not allow partial reads
  9. Benchmark only in memory, without disk write/sync and replication. It looks really fast, but that only holds if failures are not correlated
  10. Need static workload analysis, so dynamic/ad-hoc queries don’t receive the optimization
  11. Very wasteful on abort
  12. Don’t support dynamic transactions, which are the most typical DBMS usage (JDBC, etc.)
  13. Need a full rewrite, and are not easily adaptable to currently popular databases/storage engines
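
For point 6, here is a minimal, hypothetical sketch of why multi-partition 2PC limits scalability: the coordinator must get a prepare vote from every participant partition, and all of them hold their locks/state until the decision arrives, so throughput is gated by the slowest participant in every cross-partition transaction.

```python
class Participant:
    """One partition: holds prepared state until the coordinator decides."""
    def __init__(self, name):
        self.name = name
        self.prepared = {}   # txn_id -> pending writes, locked until commit/abort

    def prepare(self, txn_id, writes):
        # In a real system this also forces a log record to disk before voting yes.
        self.prepared[txn_id] = writes
        return True          # vote "yes"

    def commit(self, txn_id):
        self.prepared.pop(txn_id, None)   # apply + release locks (elided)

    def abort(self, txn_id):
        self.prepared.pop(txn_id, None)   # discard + release locks


def two_phase_commit(coordinator_log, participants, txn_id, writes_per_partition):
    # Phase 1: every involved partition must vote yes; one slow/failed vote stalls all.
    votes = [p.prepare(txn_id, writes_per_partition[p.name]) for p in participants]
    decision = "commit" if all(votes) else "abort"
    coordinator_log.append((txn_id, decision))   # decision must be made durable first
    # Phase 2: broadcast the decision; participants held their locks the whole time.
    for p in participants:
        p.commit(txn_id) if decision == "commit" else p.abort(txn_id)
    return decision

log = []
parts = [Participant("p1"), Participant("p2")]
assert two_phase_commit(log, parts, "t1", {"p1": {"a": 1}, "p2": {"b": 2}}) == "commit"
```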

Which means almost none of them implements everything expected of a proper production-ready database, and implementers need to fill in the gaps. This leaves lots of behaviors unknown, including how they will impact the design/performance once the algorithm is implemented.

About academic CCs

Note: every non-snapshot scheme doesn’t support non-blocking reads! That is bad for long-running read-only transactions (the HTAP case), though usually good enough for ad-hoc queries by a product team. And although these CCs are not widely used in production databases, some parts of their techniques can still be borrowed for improvements.

  1. SSI (and its derivatives SGSI and RSSI) is expensive because it needs to scatter-gather read ranges (basically back to something like 2PC in a distributed setting), and the final check has to be in a critical section
  2. BOHM, because of preallocation, needs to allocate everything up front when doing ad-hoc insertions, etc., though it may be good because it allows almost full parallelism. It doesn’t explain indexing, management, etc. It needs a custom language, because it is deterministic. Memory only, not including disk and replication.
  3. Orthrus/DORA become weird when a result from multiple partitions is needed before doing the work -> where should the work happen? Presumably on the last thread -> a new bottleneck. They also don’t handle management. They need a custom language, because they are deterministic. Memory only, not including disk and replication.
  4. Strife aims to handle high contention, but has a computationally heavy partitioning scheme, which takes a lot of time (hundreds of ms), and doesn’t handle management either. It needs a custom language, because it is deterministic. Memory only, not including disk and replication.
  5. MOCC means more network traffic for temperature checks, locks, etc. (it can’t go with read/write sets only at commit), but it can probably be improved by caching temperatures, or keeping temperatures local only. It is basically still OCC.
  6. Silo works both in memory and distributed, but the distributed variants use HLC/TrueTime/TSO. Variants are used in most NewSQL systems (see the validation sketch after this list). Foedus is a Silo variant that also utilizes NVM to manage durability and cold data, but with the same CC algorithm.
  7. Calvin, too, is a deterministic database framework: it needs a custom language and does a reconnaissance pass first for index lookups (expensive if there are many)
  8. H-Store becomes fully blocking on multi-partition transactions
  9. MV3C is expensive, because changes need to be communicated back and forth. Its optimizations can be borrowed, with local caching to alleviate the cost. It also needs a custom language.
  10. SSN uses a single global object (subject to contention). The fully concurrent version described is far more complex, full of CAS, retries, state tracking, etc.
  11. ERMIA/Hekaton are usable. For example, out-of-order writing closely mimics MySQL. SI is used by almost all NewSQL systems right now, and indirection is used to alleviate locks.
  12. TicToc provides nice abort reduction, like MOCC. It is serializable but not strictly serializable, which is already more than enough. For snapshots, it can utilize HLC or the like.
  13. Cicada is like a combination of TicToc/Silo/Foedus/MOCC/ERMIA/Hekaton, so it inherits most of their benefits, but a few useless parts remain. It is only good with very frequent garbage collection to remove unneeded versions (otherwise long traversals are needed), so it can’t support long-running read-only transactions without reducing performance. But it is supposedly the fastest possible of this kind of OCC (against MOCC, Silo, Foedus, etc.). Most optimizations only apply on a single node. Memory only, not including disk and replication.
  14. PSAC runs all possible cases directly without waiting, assuming both success and failure. It needs to finish an entire part of a transaction at once; it can’t wait for others or it becomes blocking again, just like a saga. Very wasteful on abort.
  15. Early Lock Release complicates read semantics, since reads also need to go through the log. It also allows holes in the log, as a later transaction may persist before older ones.
  16. Phaser/Doppel can achieve high throughput under contention, but only for CRDT-safe operations; they unnecessarily block reads (because of the phase synchronization between the joined and split phases), and operations can’t return data (to guarantee serializability).
  17. QURO needs full static analysis of all workloads, so dynamic queries can’t also be optimized (though it could possibly be made incremental)
  18. IC3/Transaction Chopping also need static analysis to create the dependency graph. New transactions can’t be added dynamically/incrementally, as that changes the analysis, unless you go back to traditional 2PC. They are basically based on a dependency graph between either full transactions or pieces of transactions, where some pieces need to wait to reduce aborts. All of the checks need to be done atomically, e.g. inside a critical section.
  19. BCC does a lot of special handling for read-only transactions (the most typical query) to reduce the dependency-graph size. Read-check reduction is done via a partitioned hash map to reduce the critical-section size, which is expensive in a distributed setting. Memory only, not including disk and replication.
  20. Callas/MCC is a derivative of IC3/TxChopping, but differentiates itself via nexus locks (across groups as typical locks, inside a group via early/deferred lock release). Groups are created mostly from the same hot transactions. In effect, it behaves as if there were a single thread per group. Across groups it still behaves as if blocking, but it already achieves a lot of parallelism inside a group. However, as lock release is deferred, depending on the logic (lock all then check all, or lock and check in steps), if the front transaction aborts it basically aborts everything else, which is wasteful. And it still needs static analysis to reduce aborts.
  21. RAMP is the most scalable option for Read Atomic. It is basically half of 2PC, with metadata passed to each partition so readers know which other partitions are included in the transaction (see the sketch after this list). The cost is only in latency. The programming effort can be abstracted away.
  22. RedBlue allows commutative ops (blue) to be executed in parallel, while non-commutative ones (red) run sequentially, which means only CRDT-like ops can go parallel (close to Phaser and MCC). It needs static analysis (or the developer creating generator/shadow ops manually).
  23. Transaction Healing requires static analysis; at runtime, instead of discarding the entire work, it recomputes only from the first conflict. Change checking is done in an optimized way directly via pointers. It also needs to be written in a custom language, so it can manage the code to re-run.
  24. Lazy Transactions require deterministic code to decide the now and later phases, to ensure execution is deferred correctly. A similar idea is already used in, for example, MySQL’s change buffer or TokuDB’s fractal tree (those two are not concurrency-control mechanisms, just disk optimizations in the form of data structures).
  25. Sundial is quite close to the TicToc algorithm. The use of logical leases allows a pretty simple caching mechanism.
  26. PPCC mostly optimizes the read-after-write and write-after-read use cases (these are quite common). The precedence check can be cheap in a distributed setting, because it can check locally first.
  27. [PLP](http://www.pandis.net/resources/pvldb11pandis.pdf) is a lightweight partitioning method, assigning tasks to data. An optimization for the single-machine use case.
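
To make the Silo-style OCC mentioned in item 6 concrete, here is a minimal, single-node sketch (my own simplification, not the paper's actual code): reads record each record's version, writes are buffered, and commit locks the write set, re-validates the read set, then installs the writes with a new version.

```python
import threading

class Record:
    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()

class SiloStyleTxn:
    """Simplified Silo-style OCC: buffer writes, validate read versions at commit."""
    def __init__(self, db):
        self.db = db                 # dict: key -> Record
        self.read_set = {}           # key -> version observed
        self.write_set = {}          # key -> new value

    def read(self, key):
        if key in self.write_set:
            return self.write_set[key]
        rec = self.db[key]
        self.read_set[key] = rec.version
        return rec.value

    def write(self, key, value):
        self.write_set[key] = value  # buffered; nothing is visible until commit

    def commit(self):
        # Phase 1: lock the write set in a deterministic order to avoid deadlock.
        locked = [self.db[k] for k in sorted(self.write_set)]
        for rec in locked:
            rec.lock.acquire()
        try:
            # Phase 2: validate that nothing we read has changed (or is being committed).
            for key, seen_version in self.read_set.items():
                rec = self.db[key]
                if rec.version != seen_version:
                    return False     # abort; caller retries
                if key not in self.write_set and rec.lock.locked():
                    return False     # someone else is committing a write to it
            # Phase 3: install writes and bump versions.
            for key, value in self.write_set.items():
                rec = self.db[key]
                rec.value = value
                rec.version += 1
            return True
        finally:
            for rec in locked:
                rec.lock.release()

db = {"x": Record(1), "y": Record(2)}
t = SiloStyleTxn(db)
t.write("y", t.read("x") + 10)
assert t.commit() is True
```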
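
A rough sketch of the RAMP idea from item 21 (a simplification of RAMP-Fast, with my own made-up interfaces): each write carries the transaction's timestamp and the set of sibling keys it wrote, so a reader that sees only part of a transaction can detect the gap from the metadata and fetch the missing versions in a second round.

```python
class Partition:
    """One RAMP-style partition: stores every version, plus the latest committed one."""
    def __init__(self):
        self.versions = {}   # (key, ts) -> (value, sibling_keys)
        self.latest = {}     # key -> ts of latest committed version

    def write(self, key, ts, value, siblings):
        self.versions[(key, ts)] = (value, siblings)
        if ts > self.latest.get(key, -1):
            self.latest[key] = ts

    def get_latest(self, key):
        ts = self.latest.get(key)
        return None if ts is None else (ts, *self.versions[(key, ts)])

    def get_version(self, key, ts):
        return self.versions.get((key, ts))


def ramp_read(partitions, keys):
    """Two-round read-atomic read across partitions (partition chosen by hash)."""
    part = lambda k: partitions[hash(k) % len(partitions)]
    # Round 1: fetch the latest committed version of each key.
    first = {k: part(k).get_latest(k) for k in keys}
    result, required = {}, {}
    for k, got in first.items():
        if got is None:
            continue
        ts, value, siblings = got
        result[k] = (ts, value)
        # Metadata says which timestamp each sibling key must be at, at minimum.
        for sib in siblings:
            if sib in keys:
                required[sib] = max(required.get(sib, -1), ts)
    # Round 2: if our copy of a key is older than a sibling's metadata demands,
    # fetch that exact newer version, restoring atomic visibility.
    for k, need_ts in required.items():
        if k not in result or result[k][0] < need_ts:
            value, _ = part(k).get_version(k, need_ts)
            result[k] = (need_ts, value)
    return {k: v for k, (ts, v) in result.items()}

# A transaction at ts=5 wrote x and y together; suppose the reader's first round
# happened to see the new x but the old y. The metadata on x repairs the read.
parts = [Partition(), Partition()]
pick = lambda k: parts[hash(k) % len(parts)]
pick("x").write("x", 1, "x_old", {"x"})
pick("y").write("y", 1, "y_old", {"y"})
pick("x").write("x", 5, "x_new", {"x", "y"})   # y's ts=5 version not yet visible...
pick("y").versions[("y", 5)] = ("y_new", {"x", "y"})  # ...but it is stored (prepared)
assert ramp_read(parts, {"x", "y"}) == {"x": "x_new", "y": "y_new"}
```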