MIT-6.824: Lab 4B: Key/value service with snapshots

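Lab 4B is relatively simple overall. It requires that a KVServer be able to quickly recover its previous state after it crashes and restarts. To make that possible, the Service needs to create snapshots and hand them to the Raft Server.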

Snapshot contents

A snapshot should contain the following:

All K-V pairs of the state machine
The mapping from ClientID to Epoch
The lastReply cache
applyIndex

type SnapshotData struct {
	StateMachine map[string]string
	ClientEpoch  map[int64]int64
	LastReply    map[int64]int64
}

applyIndex is carried implicitly as a parameter of rf.Snapshot, so it is not a field of SnapshotData.
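As a rough illustration of how such a struct could be turned into the byte slice that is handed to Raft, here is a minimal sketch. It uses the standard library's encoding/gob as a stand-in for the lab's own encoder, and the helper names encodeSnapshot and decodeSnapshot are hypothetical, not taken from the original implementation.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// SnapshotData mirrors the struct shown above.
type SnapshotData struct {
	StateMachine map[string]string
	ClientEpoch  map[int64]int64
	LastReply    map[int64]int64
}

// encodeSnapshot (hypothetical helper) serializes the snapshot to bytes.
func encodeSnapshot(s SnapshotData) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(s); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeSnapshot (hypothetical helper) is the inverse of encodeSnapshot.
func decodeSnapshot(data []byte) (SnapshotData, error) {
	var s SnapshotData
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&s)
	return s, err
}

func main() {
	snap := SnapshotData{
		StateMachine: map[string]string{"x": "1"},
		ClientEpoch:  map[int64]int64{7: 3},
		LastReply:    map[int64]int64{7: 42},
	}
	data, err := encodeSnapshot(snap)
	if err != nil {
		panic(err)
	}
	restored, err := decodeSnapshot(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(restored.StateMachine["x"], restored.ClientEpoch[7], restored.LastReply[7])
}
```

One detail worth knowing: gob omits empty maps when encoding, so an empty map comes back as nil after decoding and should be re-created before it is written to.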

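How to create a snapshot

The initial idea is:

Each KVServer runs a background goroutine: refreshSnapshot
refreshSnapshot periodically checks the state of the Raft instance
If persister.RaftStateSize() is larger than maxraftstate, it creates a snapshot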
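The idea above could be sketched roughly as follows. This is not the original implementation: the interfaces stateSizer and snapshotter, the encode field, the stop channel, and the fake types are all hypothetical stand-ins so that the example runs on its own. Treating a negative maxraftstate as "snapshots disabled" is also an assumption based on the lab's convention.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical stand-ins for the lab's Persister and Raft types;
// only the two methods used by this sketch are modeled.
type stateSizer interface{ RaftStateSize() int }
type snapshotter interface{ Snapshot(index int, snapshot []byte) }

type KVServer struct {
	mu           sync.Mutex
	persister    stateSizer
	rf           snapshotter
	maxraftstate int           // assumed: a negative value disables snapshots
	applyIndex   int           // index of the last applied command
	encode       func() []byte // hypothetical: serializes the SnapshotData
}

// maybeSnapshot performs one size check and reports whether a snapshot was taken.
func (kv *KVServer) maybeSnapshot() bool {
	kv.mu.Lock()
	if kv.maxraftstate < 0 || kv.persister.RaftStateSize() <= kv.maxraftstate {
		kv.mu.Unlock()
		return false
	}
	index, data := kv.applyIndex, kv.encode()
	kv.mu.Unlock()
	// Call into Raft without holding kv.mu so the two locks never nest.
	kv.rf.Snapshot(index, data)
	return true
}

// refreshSnapshot is the background loop; it runs the check on every tick
// until stop is closed.
func (kv *KVServer) refreshSnapshot(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			kv.maybeSnapshot()
		}
	}
}

// Fakes used only to exercise the sketch.
type fakePersister struct{ size int }

func (p *fakePersister) RaftStateSize() int { return p.size }

type fakeRaft struct{ lastIndex, calls int }

func (r *fakeRaft) Snapshot(index int, snapshot []byte) { r.lastIndex = index; r.calls++ }

func main() {
	p := &fakePersister{size: 2000}
	r := &fakeRaft{}
	kv := &KVServer{persister: p, rf: r, maxraftstate: 1000, applyIndex: 5,
		encode: func() []byte { return []byte("snapshot") }}
	fmt.Println(kv.maybeSnapshot(), r.lastIndex) // state too large: snapshot at index 5
	p.size = 10
	fmt.Println(kv.maybeSnapshot()) // state small enough: nothing to do
}
```

Splitting the single check (maybeSnapshot) from the loop keeps the size comparison easy to test without timers.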

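How to apply a snapshot

When a Raft Server restarts, it sends the persisted Snapshot to the KVServer through applyCh

So the KVServer's applier needs to be modified:

If a message received from applyCh has SnapshotValid set, apply the snapshot

Applying a snapshot means unconditionally resetting the KVServer from the snapshot, which covers the state machine, ClientEpoch, and LastReply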
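The unconditional reset described here might look like the sketch below. The ApplyMsg type models only the snapshot-related fields, the KVServer field names and the helpers encodeSnapshot and applySnapshot are hypothetical, and encoding/gob again stands in for the lab's encoder.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

type SnapshotData struct {
	StateMachine map[string]string
	ClientEpoch  map[int64]int64
	LastReply    map[int64]int64
}

// ApplyMsg models only the snapshot-related fields used in this sketch.
type ApplyMsg struct {
	SnapshotValid bool
	Snapshot      []byte
	SnapshotIndex int
}

// KVServer field names are assumptions made for illustration.
type KVServer struct {
	stateMachine map[string]string
	clientEpoch  map[int64]int64
	lastReply    map[int64]int64
	applyIndex   int
}

// encodeSnapshot (hypothetical helper) builds the bytes a snapshot message carries.
func encodeSnapshot(s SnapshotData) []byte {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(s); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// applySnapshot replaces all server state with the snapshot contents.
func (kv *KVServer) applySnapshot(msg ApplyMsg) error {
	var data SnapshotData
	if err := gob.NewDecoder(bytes.NewReader(msg.Snapshot)).Decode(&data); err != nil {
		return err
	}
	// gob decodes empty maps as nil, so re-create them before later writes.
	if data.StateMachine == nil {
		data.StateMachine = map[string]string{}
	}
	if data.ClientEpoch == nil {
		data.ClientEpoch = map[int64]int64{}
	}
	if data.LastReply == nil {
		data.LastReply = map[int64]int64{}
	}
	kv.stateMachine = data.StateMachine
	kv.clientEpoch = data.ClientEpoch
	kv.lastReply = data.LastReply
	kv.applyIndex = msg.SnapshotIndex
	return nil
}

func main() {
	kv := &KVServer{stateMachine: map[string]string{"old": "1"}, applyIndex: 3}
	msg := ApplyMsg{
		SnapshotValid: true,
		Snapshot:      encodeSnapshot(SnapshotData{StateMachine: map[string]string{"a": "b"}}),
		SnapshotIndex: 10,
	}
	if msg.SnapshotValid {
		if err := kv.applySnapshot(msg); err != nil {
			panic(err)
		}
	}
	fmt.Println(kv.stateMachine, kv.applyIndex)
}
```

Nothing from the old state survives the reset: the key "old" is gone, and applyIndex is taken from the snapshot message rather than from the server's previous value.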

Correctness verification

The results of running the 4B tests 100 times are as follows:

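Pitfalls

The KVServer should be able to detect out-of-order commands

I had long held a misconception: that a Service built on Raft should simply panic if it detects out-of-order commands

This is in fact wrong, because the way the Raft applier is implemented makes out-of-order commands inevitable

Why?

Look directly at the Raft applier code: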

for !rf.killed() {
    time.Sleep(getApplyTimeout())
    rf.mu.Lock()

    CurrentTerm, isLeader := rf.GetState()
    if isLeader {
        updateCommitIndex(CurrentTerm)
    }

    commitIndex := rf.commitIndex
    if rf.applyIndex + 1 > commitIndex {
        // no msgs to apply
        rf.mu.Unlock()
        continue
    }
    msgs := rf.getLogs(rf.Logs[rf.getIndex(rf.applyIndex + 1):rf.getIndex(commitIndex + 1)])

    rf.mu.Unlock()

    // apply msgs without lock
    for _, msg := range msgs {
        applyMsg := ApplyMsg {
            CommandValid: true,
            CommandIndex: msg.Index,
            Command: msg.Command,
        }
        rf.applyCh <- applyMsg
        DPrintf("{%v}%v: applied command which index is %v(term:%v)\n", CurrentTerm, rf.me, msg.Index, msg.Term)
    }

    rf.mu.Lock()
    // Why take the maximum value of rf.applyIndex and commitIndex?
    // Because we did not lock when applying,
    // and rf.applyIndex may increase due to InstallSnapshot RPC.
    // We do not want the updated rf.applyIndex to become smaller,
    // which may cause duplicate apply
    rf.applyIndex = max(rf.applyIndex, commitIndex)
    rf.mu.Unlock()
}

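The cause is that msgs are sent to applyCh without holding the lock

Suppose the Leader sends an InstallSnapshot RPC, and the Snapshot describes a "log prefix"

After the KVServer receives this snapshot, it resets the state machine from it and updates applyIndex. Note: the updated applyIndex will necessarily be smaller

Before the snapshot arrived, the Raft applier was already about to deliver a batch of log entries to the KVServer (the entries in that batch were chosen from the previous applyIndex and commitIndex), so the Index of those entries is necessarily larger than the updated applyIndex

So, from the upper-layer application's point of view, the Commands delivered by Raft are out of order

The correct behavior for the KVServer here is to refuse to apply such an entry and simply skip it

Why not panic?

Because although it looks out of order at this point, Raft will certainly apply the Command we are waiting for later, and by then the KVServer can apply it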
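A minimal sketch of this "skip instead of panic" rule is shown below. The Op type, the KVServer fields, and the applyCommand helper are hypothetical, and the check used (accept only the entry whose index is exactly applyIndex + 1) is one possible way to detect out-of-order commands, not necessarily the one used in the original code.

```go
package main

import "fmt"

// ApplyMsg models only the command-related fields used in this sketch.
type ApplyMsg struct {
	CommandValid bool
	Command      interface{}
	CommandIndex int
}

// Op is a hypothetical command type: a plain Put of Value under Key.
type Op struct {
	Key, Value string
}

type KVServer struct {
	stateMachine map[string]string
	applyIndex   int // index of the last applied command
}

// applyCommand applies msg only when it is the next expected entry.
// Anything else is skipped silently, and the return value says which happened.
func (kv *KVServer) applyCommand(msg ApplyMsg) bool {
	if !msg.CommandValid || msg.CommandIndex != kv.applyIndex+1 {
		return false // out of order: skip, do not panic
	}
	op, ok := msg.Command.(Op)
	if !ok {
		return false
	}
	kv.stateMachine[op.Key] = op.Value
	kv.applyIndex = msg.CommandIndex
	return true
}

func main() {
	// Pretend a snapshot has just reset the server to applyIndex = 5.
	kv := &KVServer{stateMachine: map[string]string{}, applyIndex: 5}

	// In-flight entries from the old batch arrive first and are skipped.
	fmt.Println(kv.applyCommand(ApplyMsg{CommandValid: true, Command: Op{"a", "8"}, CommandIndex: 8}))
	fmt.Println(kv.applyCommand(ApplyMsg{CommandValid: true, Command: Op{"a", "9"}, CommandIndex: 9}))

	// The expected entry shows up later and is applied normally.
	fmt.Println(kv.applyCommand(ApplyMsg{CommandValid: true, Command: Op{"a", "6"}, CommandIndex: 6}))
	fmt.Println(kv.stateMachine["a"], kv.applyIndex)
}
```

The same comparison also rejects entries whose index is at or below applyIndex, so a duplicate delivery would be skipped in the same way.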