Async Commit (initial design)
This document focuses on the initial increments of work to implement a correct, but non-performant, version of async commit.
Implementation is tracked in issue 36.
Goals
- Reduce the latency of transactions by moving a client-to-server round trip from before success is reported to the user to after it.
- Avoid significantly reducing throughput or increasing the latency of reads or prewrites.
Requirements and constraints
- Preserve existing SI, linearizability, and RC properties of transactions.
- TiDB can report success to the user after all prewrites succeed and before sending any further messages to TiKV.
- Backwards compatible
- Existing data should still be usable (including transient data such as locks)
- We can assume that all TiKV nodes will be updated before any TiDB nodes (and before async commit is enabled)
- Old and new TiKV nodes may exist in the same cluster
Known issues and solutions
Please see this doc.
Implementation
TiDB
- The user should opt-in to async commit; we assume the user has opted in for all nodes for the rest of this document.
- Each prewrite response will include a `min_commit_ts`; TiDB will select the largest `min_commit_ts` as the final `commit_ts` for the transaction (see the sketch after this list).
- TiDB can return success to the client before sending the commit message, but only after receiving success responses for all prewrite messages.
- If an operation fails because a key is locked, TiDB must query the primary lock to find the list of secondaries (a `CheckTxnStatus` request). It will then send a `CheckSecondaryLocks` request to each region to get all secondary locks and the `min_commit_ts` in those locks. If all locks are in a success state, it can send the commit message as above. If any lock is not present or has been rolled back, then the transaction should be rolled back.
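As a rough illustration of the `commit_ts` selection above, the sketch below (in Rust, with purely illustrative types rather than real TiDB client code) picks the final `commit_ts` as the largest `min_commit_ts` over all prewrite responses.

```rust
/// Purely illustrative response type; the real prewrite response is a protobuf message.
struct PrewriteResponse {
    min_commit_ts: u64,
}

/// The final commit_ts for an async-commit transaction is the largest min_commit_ts
/// returned by any prewrite. Once all prewrites have succeeded and this value is known,
/// success can be reported to the user before the commit message is sent.
fn final_commit_ts(responses: &[PrewriteResponse]) -> Option<u64> {
    responses.iter().map(|r| r.min_commit_ts).max()
}
```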
TiKV
- The information stored with each primary key lock should include all keys locked by the transaction and their status.
- When a prewrite is received, lock the keys with a preliminary ts of the start ts. Query PD for a timestamp; this is returned to TiDB as the `min_commit_ts` and stored in the lock data for the primary key.
- When a read is woken up, ensure we check the commit ts, since we might want the old value (I think we do this anyway).
- Handling a `CheckSecondaryLocks` message means checking each specified lock, returning its `min_commit_ts` if the lock is in a success state and rolling back the lock if not (including leaving a rollback tombstone). See the sketch below.
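A minimal sketch of the TiKV-side `CheckSecondaryLocks` handling described above. The `Lock` and `SecondaryStatus` types here are simplified stand-ins for the real TiKV structures, and writing the rollback tombstone is elided.

```rust
/// Simplified lock record; the real TiKV lock stores more fields. On a primary lock,
/// `secondaries` lists all other keys locked by the transaction (per the design above).
struct Lock {
    secondaries: Vec<Vec<u8>>,
    min_commit_ts: u64,
    rolled_back: bool,
}

enum SecondaryStatus {
    /// All checked locks exist and are in a success state; report the largest min_commit_ts.
    Locked { min_commit_ts: u64 },
    /// Some lock was missing or rolled back, so the transaction must be rolled back.
    RolledBack,
}

/// Check the secondary locks requested for one region. Locks that are missing or already
/// rolled back cause a rollback result (leaving the rollback tombstone is not shown).
fn check_secondary_locks(locks: &[Option<Lock>]) -> SecondaryStatus {
    let mut max_min_commit_ts = 0;
    for lock in locks {
        match lock {
            Some(l) if !l.rolled_back => {
                max_min_commit_ts = max_min_commit_ts.max(l.min_commit_ts)
            }
            _ => return SecondaryStatus::RolledBack,
        }
    }
    SecondaryStatus::Locked { min_commit_ts: max_min_commit_ts }
}
```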
Protobuf changes
See kvproto/637 and kvproto/651.
Evolution to long term solution
The framework of async commit will be in place at the end of the first iteration. In a later iteration we should improve the temporary locking mechanism; see parallel-commit-solution-ideas.md for possible improvements.
Open questions
- There is a problem if the `commit_ts` calculated during a resolve lock is > PD's latest ts + 1; this is separate from the problem of non-unique timestamps (but has the same root cause). (#21)
- Replica read. (#20)
- Interaction with CDC. (#19)
- Interaction with binlog. (#19)
- Missing schema check in TiDB on commit.
- Interaction with large transactions (there should be no problem, but we must ensure that other transactions don't push an async commit transaction's `commit_ts`).
- Heartbeats for keeping async commit transactions alive.
Refinements
TiDB
- TiDB should choose whether or not to use async commit based on the size of the transaction
TiKV
- When using a preliminary lock, try to wake up readers when we have a commit ts.
- Use a memlock rather than writing a preliminary lock
Refinement 1
- For each node, store a `local_ts` (aka `max_ts`); this is the largest ts seen or issued by the node, plus 1. It could be kept per-region if there is lots of contention, but since it is a fairly simple lock-free (but not wait-free) update, I would not expect it to be too bad.
  - Note that this is basically a Lamport clock version of the timestamp, i.e., it is an approximation to PD's current TS.
- TiDB fetches a timestamp from PD for all prewrites (`min_commit_ts`). TiKV compares `min_commit_ts` to `local_ts`; if `local_ts` is greater than `min_commit_ts`, it must fetch a new timestamp from PD, otherwise it can reuse `min_commit_ts` (see the sketch below).
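A minimal sketch of the per-node `local_ts` and the reuse-or-refetch decision above, assuming an atomic 64-bit timestamp. The names and the `fetch_from_pd` closure are illustrative stand-ins, not real TiKV or PD client APIs.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Per-node local_ts (aka max_ts): the largest timestamp seen or issued by this node.
struct LocalTs(AtomicU64);

impl LocalTs {
    /// Record a timestamp seen or issued by this node (a lock-free update).
    fn observe(&self, ts: u64) {
        self.0.fetch_max(ts, Ordering::SeqCst);
    }

    /// Decide the min_commit_ts for a prewrite: reuse the timestamp TiDB fetched from PD
    /// unless this node has already seen a larger one, in which case a fresh timestamp
    /// must be fetched from PD.
    fn min_commit_ts_for_prewrite(
        &self,
        ts_from_tidb: u64,
        fetch_from_pd: impl Fn() -> u64,
    ) -> u64 {
        if self.0.load(Ordering::SeqCst) > ts_from_tidb {
            let fresh = fetch_from_pd();
            self.observe(fresh);
            fresh
        } else {
            self.observe(ts_from_tidb);
            ts_from_tidb
        }
    }
}
```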
Refinement 2
- Use the `local_ts` as the `min_commit_ts`.
- Note that this scheme causes duplicate timestamps, and requires one of the proposed solutions.
Refinement 3
- The user can opt in to non-linearizability. In that case we use the extended timestamp format described in parallel-commit-known-issues-and-solutions.md.
- If the user does not opt in, then use the refinement 1 scheme.
1PC
If there is just a single prewrite, then TiDB can set a flag on the request and TiKV can use the `min_commit_ts` as the `commit_ts`, committing the transaction without TiDB sending the final commit message (or taking the risk of ).
`max_read_ts` approach
The coarsest granularity is to maintain `max_read_ts` and `min_commit_ts` per region.
The per-range approach:
- For each region, store in memory (note this general mechanism should be abstracted using a trait so it can be easily upgraded to per-key or other form of locking):
  - A structure of `min_commit_ts`s, a map from each in-progress transaction to the minimum ts at which it may be committed.
  - `max_read_ts`: the largest `start_ts` for any transactional read operation for the region (i.e., this value is potentially set on every read).
- When a TiKV node is started up or becomes leader, `max_read_ts` is initialised from PD with a new timestamp.
- When a prewrite is processed, TiKV records the current `max_read_ts + 1` as the `min_commit_ts` for that transaction. `min_commit_ts` is recorded in each key's lock data structure.
- When a prewrite is finished, its entry is removed from the `min_commit_ts` structure. If the prewrite is successful, the `min_commit_ts` is returned to TiDB in the response.
- When a read is processed, first it sets the `max_read_ts`, then it checks its `start_ts` against the smallest `min_commit_ts` of any current transaction in the read range. It will block while `start_ts >= min(min_commit_ts)`, i.e., until every in-progress prewrite must commit after the read's `start_ts`.
- Use `Mutex<Timestamp>` for `max_read_ts` (a sketch of this per-region structure follows below).
Further refinement:
- Per-key, rather than per-region, `min_commit_ts` and `max_read_ts`.
- Lock-free `max_read_ts` rather than using a mutex (see the sketch below).
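For the lock-free refinement, a compare-and-swap loop of roughly the following shape (illustrative only) could replace the mutex around `max_read_ts`. It is lock-free but not wait-free: a thread may retry if it races with other readers, but no thread ever blocks.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Advance max_read_ts to at least `start_ts` without taking a lock.
fn advance_max_read_ts(max_read_ts: &AtomicU64, start_ts: u64) {
    let mut current = max_read_ts.load(Ordering::SeqCst);
    while current < start_ts {
        match max_read_ts.compare_exchange_weak(
            current,
            start_ts,
            Ordering::SeqCst,
            Ordering::SeqCst,
        ) {
            // Successfully advanced the timestamp.
            Ok(_) => break,
            // Another thread moved it; retry only if it is still smaller than start_ts.
            Err(observed) => current = observed,
        }
    }
}
```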
Handling non-unique timestamps
See parallel-commit-known-issues-and-solutions.md for discussion.
There is a possibility of two transactions having the same `commit_ts`, or of one transaction's `start_ts` being equal to another's `commit_ts`. We believe conflicts in the write CF between two commits are not possible. However, if one transaction's `start_ts` is another's `commit_ts`, then rolling back the first transaction would collide with committing the second. We believe this isn't too serious an issue, but we will need to find a backwards compatible change to the write CF format. We do not know if there are problems due to non-unique timestamps besides the conflict in the write CF.
Testing
Staffing
The following people are available for work on this project (as of 2020-06-15):
- Zhenjing (@MyonKeminta): minimal time until CDC project is complete
- Zhaolei (@youjiali1995): review + minimal time
- Nick (@nrc): approximately full time
- Yilin (@sticnarf): approximately 1/2 time
- Fangsong (@longfangsong): approximately full time after approximately one month (although work to be finalised)
- Rui Xu (@cfzjywxk): approximately 1/3 time initially
RACI roles:
- Responsible:
- Nick
- Rui Xu
- Yilin
- Accountable: Rui Xu
- Consulted:
- Zhenjing
- Zhaolei
- Informed:
- Liu Wei
- Liqi
- Evan Zhou
- #sig-transaction
- this month in TiKV