Scheduling
This section introduces the scheduling mechanism.
The scheduling mechanism works mainly by aggregating the real-time cluster status collected from heartbeats (both region heartbeats and store heartbeats), deciding according to different strategies whether a schedule needs to be generated, and then producing a scheduling Operator that is sent to TiKV to carry out the schedule. The Operator struct looks like:
// Operator contains execution steps generated by scheduler.
type Operator struct {
    desc        string
    brief       string
    regionID    uint64
    regionEpoch *metapb.RegionEpoch
    kind        OpKind
    steps       []OpStep
    stepsTime   []int64 // step finish time
    currentStep int32
    status      OpStatusTracker
    level       core.PriorityLevel
}
Each scheduling Operator operates on a single region migration and consists of several OpSteps: adding a peer, transferring the Raft group leader, removing a peer, and so on. An Operator also records the ID of the target region, the name of the strategy that produced it, its priority level, etc. An Operator in PD can be generated by two components: one is the checker and the other is the scheduler. Generated operators are stored in a map and returned to the corresponding TiKV instance in the response to that region's next heartbeat. Let us first look at how the checker and the scheduler work.
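Before diving into those two components, the sketch below pictures the step sequence of a typical migration that moves one replica, and the leadership, of a region from store 1 to store 4. The step types here are simplified, hypothetical stand-ins rather than PD's real OpStep implementations:
// Illustrative sketch only: these step types are simplified stand-ins
// for PD's real OpStep implementations.
type OpStep interface {
    Description() string
}

type AddLearner struct{ ToStore uint64 }
type PromoteLearner struct{ ToStore uint64 }
type TransferLeader struct{ FromStore, ToStore uint64 }
type RemovePeer struct{ FromStore uint64 }

func (s AddLearner) Description() string     { return "add learner peer" }
func (s PromoteLearner) Description() string { return "promote learner to voter" }
func (s TransferLeader) Description() string { return "transfer raft leader" }
func (s RemovePeer) Description() string     { return "remove old peer" }

// Move one replica of a region (and its leadership) from store 1 to store 4.
var moveRegionSteps = []OpStep{
    AddLearner{ToStore: 4},
    PromoteLearner{ToStore: 4},
    TransferLeader{FromStore: 1, ToStore: 4},
    RemovePeer{FromStore: 1},
}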
Checker
After the PD server starts, a background worker polls all regions and checks the health status of each region:
func (c *coordinator) patrolRegions() {
    timer := time.NewTimer(c.cluster.GetPatrolRegionInterval())
    defer timer.Stop()
    // key records where the previous scan stopped, so successive
    // iterations walk through the whole key space.
    var key []byte
    for {
        select {
        case <-timer.C:
            timer.Reset(c.cluster.GetPatrolRegionInterval())
        case <-c.ctx.Done():
            log.Info("patrol regions has been stopped")
            return
        }

        regions := c.cluster.ScanRegions(key, nil, patrolScanRegionLimit)
        for _, region := range regions {
            if c.checkRegion(region) {
                break
            }
        }
        // key is advanced to the end key of the last scanned region
        // before the next iteration (omitted here for brevity).
    }
}
In this loop, checkRegion is executed to determine whether a region needs to be scheduled. If a schedule Operator is generated, it is sent to TiKV through the next heartbeat of that region. The initialization of the checkers can be found in coordinator.go; there are mainly four of them: RuleChecker, MergeChecker, JointStateChecker, and SplitChecker.
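The dispatch inside checkRegion can be pictured roughly as follows. This is a simplified sketch with hypothetical names (checkRegionSketch, Checker, RegionInfo), not the real implementation in coordinator.go:
// Simplified sketch of the dispatch idea behind checkRegion: ask each
// checker in turn, and queue the first operator that is produced.
type RegionInfo struct{ ID uint64 }

type Checker interface {
    Check(region *RegionInfo) *Operator // returns nil when no schedule is needed
}

func checkRegionSketch(checkers []Checker, region *RegionInfo, addOperator func(*Operator) bool) bool {
    for _, checker := range checkers {
        if op := checker.Check(region); op != nil {
            // The generated operator is queued and later attached to the
            // response of this region's next heartbeat.
            return addOperator(op)
        }
    }
    return false
}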
RuleChecker is the most critical checker. It checks whether a region has down peers or offline peers, and whether the number of replicas of the region matches the number specified in the Placement Rules. If these conditions are met, it triggers the corresponding logic to add the missing replicas or remove the redundant ones. In addition, RuleChecker checks whether the replicas of the region are currently placed in the most suitable stores, and moves them to better locations if they are not.
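The replica-count part of this decision can be sketched roughly as below; the helper and its fields are hypothetical simplifications, not RuleChecker's actual code:
// Rough sketch of the replica-count decision only; the real RuleChecker
// also handles rule fitting, orphan peers, leader placement, and more.
type ruleFitSketch struct {
    RequiredReplicas int
    HealthyPeers     int
    DownPeers        int
    OfflinePeers     int
}

func ruleCheckSketch(fit ruleFitSketch) string {
    switch {
    case fit.DownPeers > 0 || fit.OfflinePeers > 0:
        return "replace unhealthy peer" // create a new replica, then remove the bad one
    case fit.HealthyPeers < fit.RequiredReplicas:
        return "add replica"
    case fit.HealthyPeers > fit.RequiredReplicas:
        return "remove redundant replica"
    default:
        return "check placement, maybe move a peer to a better store"
    }
}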
MergeChecker checks whether a region meets the merge conditions: whether its size is less than max-merge-region-size, whether its number of keys is less than max-merge-region-keys, whether it has not been split recently, and so on. If these conditions are met, an adjacent region is selected and PD tries to merge the two regions.
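A minimal sketch of these preconditions, with hypothetical parameter names that mirror the configuration items above (the real MergeChecker checks more than this):
// Hypothetical helper that mirrors the merge preconditions described above.
// The real MergeChecker also checks placement rules, pending peers, etc.
func shouldTryMergeSketch(
    regionSize, maxMergeRegionSize int64, // approximate region size
    regionKeys, maxMergeRegionKeys int64,
    timeSinceLastSplit, splitMergeInterval time.Duration,
) bool {
    if regionSize >= maxMergeRegionSize {
        return false
    }
    if regionKeys >= maxMergeRegionKeys {
        return false
    }
    // A recently split region is given time to settle before merging.
    if timeSinceLastSplit < splitMergeInterval {
        return false
    }
    return true // pick an adjacent region and try to merge with it
}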
Scheduler
Let's first take a look at the code for the scheduler running process. Schedulers run concurrently; each scheduler has a scheduler controller, which runs in a background worker:
func (c *coordinator) runScheduler(s *scheduleController) {
    defer logutil.LogPanic()
    defer c.wg.Done()
    defer s.Cleanup(c.cluster)

    timer := time.NewTimer(s.GetInterval())
    defer timer.Stop()
    for {
        select {
        case <-timer.C:
            timer.Reset(s.GetInterval())
            if !s.AllowSchedule() {
                continue
            }
            if op := s.Schedule(); len(op) > 0 {
                added := c.opController.AddWaitingOperator(op...)
                log.Debug("add operator", zap.Int("added", added), zap.Int("total", len(op)), zap.String("scheduler", s.GetName()))
            }
        case <-s.Ctx().Done():
            log.Info("scheduler has been stopped",
                zap.String("scheduler-name", s.GetName()),
                errs.ZapError(s.Ctx().Err()))
            return
        }
    }
}
Similar to the checkers, when PD starts, the configured schedulers are added. Each scheduler runs in its own goroutine via go runScheduler and executes the Schedule() function at a dynamically adjusted time interval.
The Schedule() call has two jobs: it executes the scheduling logic of the corresponding scheduler to decide whether to generate an Operator, and it determines the time interval before the next execution of Schedule().
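The interval adjustment typically behaves like an exponential backoff: when a round of Schedule() produces operators, the interval resets to the minimum; otherwise it grows until it reaches a cap. A rough sketch of that idea, using an assumed growth factor and cap rather than PD's exact values:
// Illustrative backoff sketch; the real GetNextInterval implementations
// choose their own growth policy per scheduler.
const (
    intervalGrowthFactor = 1.3             // assumed factor, for illustration
    maxScheduleInterval  = 5 * time.Second // assumed cap, for illustration
)

func nextIntervalSketch(current time.Duration, producedOperator bool, minInterval time.Duration) time.Duration {
    if producedOperator {
        // Something was scheduled; poll again soon.
        return minInterval
    }
    next := time.Duration(float64(current) * intervalGrowthFactor)
    if next > maxScheduleInterval {
        next = maxScheduleInterval
    }
    return next
}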
PD contains many schedulers. For details, you can check the server/schedulers package, which contains the implementations of all schedulers. The schedulers that PD runs by default include balance-region-scheduler, balance-leader-scheduler, and balance-hot-region-scheduler. Let's take a look at what these schedulers do:
- The balance-region-scheduler calculates a score for each store based on the total region size on the store and its available space, and then, according to that score, generates balance-region Operators that distribute regions evenly across the stores (a simplified sketch of this scoring idea follows the list). Available space is taken into account because in practice different stores may have different storage capacities.
- The balance-leader-scheduler is similar to the balance-region-scheduler. It calculates a score based on the region count and then generates balance-leader Operators that distribute leaders as evenly as possible across the stores.
- The balance-hot-region-scheduler implements hot spot scheduling. For TiDB, if hot spots are concentrated on only a few stores, the overall resource utilization of the system is lowered and those stores easily become a bottleneck. Therefore, PD collects hot spot statistics and, by generating the corresponding schedules, scatters hot spots across all stores as much as possible, so that every store shares the read and write pressure.
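Below is a much-simplified sketch of the scoring idea behind balance-region; the formula and field names are illustrative only, and the real scoring is more elaborate:
// Much-simplified scoring sketch: a store that already holds many region
// bytes relative to its free space gets a higher score, and regions are
// moved from high-score stores to low-score stores.
type storeStatSketch struct {
    RegionSizeMiB int64 // total size of regions on this store
    AvailableMiB  int64 // remaining available space
    CapacityMiB   int64
}

func regionScoreSketch(s storeStatSketch) float64 {
    if s.CapacityMiB == 0 {
        return 0
    }
    used := float64(s.CapacityMiB - s.AvailableMiB)
    // Weight the raw region size by how full the store already is, so that
    // fuller stores are not asked to hold as much data.
    return float64(s.RegionSizeMiB) * (1 + used/float64(s.CapacityMiB))
}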
There are other schedulers to choose from as well. Every PD scheduler implements an interface called Scheduler:
// Scheduler is an interface to schedule resources.
type Scheduler interface {
    http.Handler
    GetName() string
    // GetType should be in accordance with the name passed to schedule.RegisterScheduler()
    GetType() string
    EncodeConfig() ([]byte, error)
    GetMinInterval() time.Duration
    GetNextInterval(interval time.Duration) time.Duration
    Prepare(cluster opt.Cluster) error
    Cleanup(cluster opt.Cluster)
    Schedule(cluster opt.Cluster) []*operator.Operator
    IsScheduleAllowed(cluster opt.Cluster) bool
}
The most important one is the Schedule() interface function, which implements the scheduling logic specific to each scheduler.
In addition, the interface function IsScheduleAllowed() determines whether the scheduler is currently allowed to execute its scheduling logic. Before running that logic, each scheduler first goes through this check to see whether the scheduling rate limit would be exceeded. Concretely, in the code the controller calls AllowSchedule(), which in turn calls the IsScheduleAllowed() method implemented by each scheduler.
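For example, a scheduler that produces region-migration operators might gate itself on the number of its in-flight operators, roughly like the sketch below (the function and parameter names are illustrative, not PD's exact API):
// Illustrative rate check for a scheduler that emits region-migration
// operators: scheduling is allowed only while the number of in-flight
// operators of that kind is below the configured limit.
func isRegionScheduleAllowedSketch(inflightRegionOps, regionScheduleLimit uint64) bool {
    return inflightRegionOps < regionScheduleLimit
}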
PD can control the speed at which schedulers generate operators by setting limits. A limit here is essentially a window size, and each operator type has its own window. For example, the balance-region and balance-hot-region schedulers generate operators that migrate regions, and such operators have the type OpRegion. We can control operators of the OpRegion type by adjusting the region-schedule-limit parameter. The operator type definitions can be found in the file operator.go. An operator may carry multiple types at once; for example, an operator generated by balance-hot-region may belong to both OpRegion and OpHotRegion.
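The operator kind is essentially a set of flags, which is why a single operator can carry several types. A minimal sketch of that representation follows; the constant names match the ones mentioned above, but the declarations here are illustrative rather than copied from operator.go:
// Minimal sketch of kind-as-bit-flags; PD defines the full set in operator.go.
type OpKind uint32

const (
    OpLeader    OpKind = 1 << iota // operator involves a leader transfer
    OpRegion                       // operator migrates region peers
    OpHotRegion                    // operator was produced by hot-region scheduling
)

// An operator produced by balance-hot-region can carry both kinds.
var hotRegionMoveKind = OpRegion | OpHotRegion

func hasKind(kind, flag OpKind) bool {
    return kind&flag != 0
}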
More
This section mainly introduced the main workflow of PD scheduling. For more details, you can continue to study the corresponding code. Contributions to good first issues are also welcome.
See more information about the PD implementation on its Wiki page.