Async Commit (initial design)

This document focuses on the initial increments of work to implement a correct, but non-performant, version of async commit.

Implementation is tracked in issue 36.

Goals

  • Reduce transaction latency by moving a client-to-server round trip from before reporting success to the user to after reporting success.
  • Avoid significantly reducing throughput or increasing the latency of reads or prewrites.

Requirements and constraints

  • Preserve existing SI, linearizability, and RC properties of transactions.
  • TiDB can report success to the user after all prewrites succeed, but before sending any further messages to TiKV.
  • Backwards compatible
    • Existing data should still be usable (including transient data such as locks)
    • We can assume that all TiKV nodes will be updated before any TiDB nodes (and before async commit is enabled)
    • Old and new TiKV nodes may exist in the same cluster

Known issues and solutions

Please see parallel-commit-known-issues-and-solutions.md.

Implementation

TiDB

  • The user should opt in to async commit; we assume the user has opted in for all nodes for the rest of this document.
  • Each prewrite response will include a min_commit_ts; TiDB will select the largest min_commit_ts as the final commit_ts for the transaction (see the sketch after this list).
  • TiDB can return success to the client before sending the commit message but after receiving success responses for all prewrite messages.
  • If an operation fails because a key is locked, TiDB must query the primary lock to find the list of secondaries (a CheckTxnStatus request). It then sends a CheckSecondaryLocks request to each region to get all secondary locks and the min_commit_ts in those locks. If all locks are in a success state, it can send the commit message as above. If any lock is not present or has been rolled back, then the transaction should be rolled back.
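
As a rough illustration of the flow above, here is a minimal sketch (in Rust, with made-up types such as PrewriteResponse rather than the real client API) of how the final commit_ts would be derived from the prewrite responses and when success can be reported:

```rust
// Hypothetical sketch only: PrewriteResponse and final_commit_ts are
// stand-ins for illustration, not the real TiDB/client API.

struct PrewriteResponse {
    min_commit_ts: u64,
}

/// Derive the final commit_ts as the maximum min_commit_ts over all
/// successful prewrite responses. Returns None if there were no prewrites.
fn final_commit_ts(prewrite_responses: &[PrewriteResponse]) -> Option<u64> {
    prewrite_responses.iter().map(|resp| resp.min_commit_ts).max()
}

fn main() {
    // Assume all prewrites have already returned successfully.
    let responses = vec![
        PrewriteResponse { min_commit_ts: 105 },
        PrewriteResponse { min_commit_ts: 103 },
        PrewriteResponse { min_commit_ts: 110 },
    ];
    if let Some(commit_ts) = final_commit_ts(&responses) {
        // Success can be reported to the user here, before any commit
        // message is sent to TiKV...
        println!("report success to user; commit_ts = {}", commit_ts);
        // ...and the commit messages (using commit_ts) are sent afterwards,
        // asynchronously.
    }
}
```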

TiKV

  • The information stored with each primary key lock should include all keys locked by the transaction and their status.
  • When a prewrite is received, lock the keys with a preliminary ts of the start ts. Query PD for a timestamp; this is returned to TiDB as the min_commit_ts and stored in the lock data for the primary key.
  • When a read is woken up, ensure we check the commit ts, since we might want the old value (I think we do this anyway).
  • Handling a CheckSecondaryLocks message means checking each specified lock, returning its min_commit_ts if the lock is in a success state, and rolling back the lock if not (including leaving a rollback tombstone); see the sketch below.
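
A simplified sketch of that CheckSecondaryLocks handling; the LockState/SecondariesResult types and the in-memory lock map are hypothetical simplifications of TiKV's actual MVCC lock storage:

```rust
// Hypothetical, simplified model of lock storage; real TiKV lock records and
// rollback handling are more involved.
use std::collections::HashMap;

#[derive(Clone)]
enum LockState {
    /// A live lock in a success state, carrying its min_commit_ts.
    Locked { min_commit_ts: u64 },
    /// A rollback tombstone.
    RolledBack,
}

enum SecondariesResult {
    /// All checked locks are in a success state; the commit_ts must be at
    /// least this value.
    AllLocked { min_commit_ts: u64 },
    /// Some lock was missing or rolled back; the transaction must be rolled back.
    RollBack,
}

fn check_secondary_locks(
    locks: &mut HashMap<Vec<u8>, LockState>,
    keys: &[Vec<u8>],
) -> SecondariesResult {
    let mut min_commit_ts = 0;
    for key in keys {
        match locks.get(key).cloned() {
            Some(LockState::Locked { min_commit_ts: ts }) => {
                min_commit_ts = min_commit_ts.max(ts);
            }
            // A missing or rolled-back lock means the transaction cannot
            // commit: leave a rollback tombstone so a late prewrite for this
            // key cannot succeed.
            Some(LockState::RolledBack) | None => {
                locks.insert(key.clone(), LockState::RolledBack);
                return SecondariesResult::RollBack;
            }
        }
    }
    SecondariesResult::AllLocked { min_commit_ts }
}
```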

Protobuf changes

See kvproto/637 and kvproto/651.

Evolution to long term solution

The framework of async commit will be in place at the end of the first iteration. In a later iteration we should improve the temporary locking mechanism; see parallel-commit-solution-ideas.md for possible improvements.

Open questions

  • There is a problem if the commit_ts calculated during a resolve lock is > PD's latest ts + 1; this is separate from the problem of non-unique timestamps (but has the same root cause). (#21)
  • Replica read. (#20)
  • Interaction with CDC. (#19)
  • Interaction with binlog. (#19)
  • Missing schema check in TiDB on commit.
  • Interaction with large transactions (there should be no problem, but we must ensure that other transactions don't push an async commit transaction's commit_ts).
  • Heartbeats for keeping async commit transactions alive.

Refinements

TiDB

  • TiDB should choose whether or not to use async commit based on the size of the transaction

TiKV

  • When using a preliminary lock, try to wake up readers when we have a commit ts.
  • Use a memlock rather than writing a preliminary lock.

Refinement 1

  • For each node, store a local_ts (aka max_ts); this is the largest ts seen or issued by the node, plus one. It could be kept per-region if there is lots of contention, but since it is a fairly simple lock-free (but not wait-free) update, I would not expect it to be too bad.
    • Note that this is basically a Lamport Clock version of the timestamp, i.e., it is an approximation to PD's current TS.
  • TiDB fetches a timestamp from PD for all prewrites (min_commit_ts). TiKV compares min_commit_ts to local_ts: if local_ts is greater than min_commit_ts, it must fetch a new timestamp from PD; otherwise it can reuse min_commit_ts (see the sketch below).
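
The node-wide clock can be maintained with a single atomic. The sketch below (hypothetical names; whether the eventual implementation uses std::sync::atomic or some other structure is an open choice) shows the lock-free update and the "reuse or refetch" decision:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Largest timestamp seen or issued by this node; local_ts as described in
/// the text is this value + 1. (Hypothetical sketch.)
static MAX_SEEN_TS: AtomicU64 = AtomicU64::new(0);

/// Bump the clock whenever a timestamp is observed on any request.
/// fetch_max gives the lock-free (but not wait-free) update mentioned above.
fn observe_ts(ts: u64) {
    MAX_SEEN_TS.fetch_max(ts, Ordering::SeqCst);
}

fn local_ts() -> u64 {
    MAX_SEEN_TS.load(Ordering::SeqCst) + 1
}

/// On prewrite: reuse the min_commit_ts supplied with the request unless this
/// node has already seen a larger timestamp, in which case fetch a fresh one
/// from PD (get_ts_from_pd is a stand-in for that RPC).
fn choose_min_commit_ts(proposed: u64, get_ts_from_pd: impl FnOnce() -> u64) -> u64 {
    let chosen = if local_ts() > proposed {
        get_ts_from_pd()
    } else {
        proposed
    };
    observe_ts(chosen);
    chosen
}
```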

Refinement 2

  • Use the local_ts as the min_commit_ts.
  • Note that this scheme causes duplicate timestamps, and requires one of the proposed solutions.

Refinement 3

1PC

If there is just a single prewrite, then TiDB can set a flag on the request; TiKV can then use the min_commit_ts as the commit_ts and commit the transaction without TiDB sending the final commit message (or taking the risk of )
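
A minimal sketch of that branch; the request/outcome types and the flag name are made up for illustration and may differ from the actual protobuf definitions:

```rust
// Hypothetical sketch: these types and the flag name are stand-ins for
// illustration, not the actual kvproto definitions.

struct PrewriteRequest {
    /// Set by TiDB when the transaction consists of a single prewrite.
    try_one_pc: bool,
}

enum PrewriteOutcome {
    /// Normal async-commit path: locks written; TiDB sends the commit later.
    Locked { min_commit_ts: u64 },
    /// 1PC path: committed directly, using min_commit_ts as the commit_ts,
    /// with no further commit message from TiDB.
    Committed { commit_ts: u64 },
}

fn handle_prewrite(req: &PrewriteRequest, min_commit_ts: u64) -> PrewriteOutcome {
    if req.try_one_pc {
        PrewriteOutcome::Committed { commit_ts: min_commit_ts }
    } else {
        PrewriteOutcome::Locked { min_commit_ts }
    }
}
```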

max_read_ts approach

The coarsest granularity is to maintain max_read_ts and min_commit_ts per region.

The per-range approach:

  • For each region, store in memory (note this general mechanism should be abstracted using a trait so it can be easily upgraded to per-key or other forms of locking):
    • A structure of min_commit_tss: a map from each in-progress transaction to the minimum ts at which it may be committed.
    • max_read_ts: the largest start_ts for any transactional read operation for the region (i.e., this value is potentially set on every read).
  • When a TiKV node is started up or becomes leader, max_read_ts is initialised from PD with a new timestamp.
  • When a prewrite is processed, TiKV records the current max_read_ts + 1 as the min_commit_ts for that transaction. The min_commit_ts is recorded in each key's lock data structure.
  • When a prewrite is finished, its entry is removed from the min_commit_ts structure. If the prewrite is successful, the min_commit_ts is returned to TiDB in the response.
  • When a read is processed, first it sets the max_read_ts, then it checks its start_ts against the smallest min_commit_ts of any in-progress transaction in the read range. It blocks until start_ts < min(min_commit_ts), since any in-progress transaction with min_commit_ts <= start_ts could still commit at or below the read's start_ts.
  • Use Mutex<Timestamp> for max_read_ts (see the sketch after this list).
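
A sketch of this per-region bookkeeping under the Mutex<Timestamp> approach; RegionTsTracker and its methods are hypothetical names, and the actual blocking/wake-up of readers is elided:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type Timestamp = u64;

/// Per-region state guarded by a single mutex (the coarse-grained variant;
/// per-key tracking or a lock-free max_read_ts are later refinements).
struct RegionTsState {
    /// Largest start_ts of any transactional read seen by this region.
    max_read_ts: Timestamp,
    /// In-progress prewrites: transaction start_ts -> min_commit_ts.
    min_commit_tss: HashMap<Timestamp, Timestamp>,
}

struct RegionTsTracker {
    state: Mutex<RegionTsState>,
}

impl RegionTsTracker {
    /// On start-up or leader transfer, max_read_ts is initialised with a
    /// fresh timestamp fetched from PD.
    fn new(ts_from_pd: Timestamp) -> Self {
        RegionTsTracker {
            state: Mutex::new(RegionTsState {
                max_read_ts: ts_from_pd,
                min_commit_tss: HashMap::new(),
            }),
        }
    }

    /// On prewrite: record max_read_ts + 1 as this transaction's
    /// min_commit_ts (it is also written into each key's lock data).
    fn on_prewrite(&self, start_ts: Timestamp) -> Timestamp {
        let mut s = self.state.lock().unwrap();
        let min_commit_ts = s.max_read_ts + 1;
        s.min_commit_tss.insert(start_ts, min_commit_ts);
        min_commit_ts
    }

    /// When the prewrite finishes, remove its entry; on success the value is
    /// returned to TiDB in the prewrite response.
    fn on_prewrite_finished(&self, start_ts: Timestamp) -> Option<Timestamp> {
        self.state.lock().unwrap().min_commit_tss.remove(&start_ts)
    }

    /// On read: bump max_read_ts first, then report whether the read must
    /// block (e.g. on a condvar, not shown) because some in-progress prewrite
    /// could still commit at or below its start_ts.
    fn read_must_block(&self, start_ts: Timestamp) -> bool {
        let mut s = self.state.lock().unwrap();
        s.max_read_ts = s.max_read_ts.max(start_ts);
        s.min_commit_tss.values().any(|&m| m <= start_ts)
    }
}
```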

Further refinement:

  • Per-key, rather than per-region, min_commit_ts and max_read_ts
  • Lock-free max_read_ts rather than using a mutex.

Handling non-unique timestamps

See parallel-commit-known-issues-and-solutions.md for discussion.

There is a possibility of two transactions having the same commit_ts, or of one transaction's start_ts being equal to another's commit_ts. We believe conflicts in the write CF between two commits are not possible. However, if one transaction's start_ts is another's commit_ts, then rolling back the first transaction would collide with committing the second. We believe this isn't too serious an issue, but we will need to find a backwards-compatible change to the write CF format. We do not know whether there are problems due to non-unique timestamps besides the conflict in the write CF.

Testing

Staffing

The following people are available for work on this project (as of 2020-06-15):

  • Zhenjing (@MyonKeminta): minimal time until CDC project is complete
  • Zhaolei (@youjiali1995): review + minimal time
  • Nick (@nrc): approximately full time
  • Yilin (@sticnarf): approximately 1/2 time
  • Fangsong (@longfangsong): approximately full time after apx one month (although work to be finalised)
  • Rui Xu (@cfzjywxk): apx 1/3 initially

RACI roles:

  • Responsible:
    • Nick
    • Rui Xu
    • Yilin
  • Accountable: Rui Xu
  • Consulted:
    • Zhenjing
    • Zhaolei
  • Informed:
    • Liu Wei
    • Liqi
    • Evan Zhou
    • #sig-transaction
    • this month in TiKV