Problem
_rollback_to_pre_run_checkpoint restores an earlier checkpoint by writing that checkpoint state back through the checkpointer, but it does not delete, hide, or supersede checkpoints that were created after the rollback target. If later code resolves the latest checkpoint by checkpoint ordering, it can still observe the newer post-run checkpoint instead of the restored state.
Impact
Rollback can appear to succeed while subsequent thread-state reads or resumed runs continue from a newer checkpoint. This makes rollback behavior nondeterministic and can leave the thread in a state the user explicitly tried to undo.
Suggested Fix
When rolling back, either remove (or mark as obsolete) all checkpoints and writes created after the rollback target, or create a new restoring checkpoint with a fresh, latest checkpoint id that clearly supersedes the later entries. In both cases, the latest-checkpoint lookup should resolve to the restored state after rollback.
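A minimal sketch of the second option (a superseding restore checkpoint), assuming checkpoint ids are ordered so that a larger id means newer. ToyCheckpointStore, Checkpoint, restored_from, and rollback_superseding are hypothetical names for illustration only; they are not the project's checkpointer API or the code in worker.py.

```python
import copy
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Checkpoint:
    checkpoint_id: int                   # assumption: larger id means newer
    state: Dict[str, Any]
    restored_from: Optional[int] = None  # set only on rollback checkpoints


class ToyCheckpointStore:
    """Hypothetical in-memory stand-in for a single thread's checkpoints."""

    def __init__(self) -> None:
        self._items: List[Checkpoint] = []
        self._next_id = 1

    def put(self, state: Dict[str, Any], restored_from: Optional[int] = None) -> Checkpoint:
        # Always allocate a fresh id so the new entry sorts after every existing one.
        cp = Checkpoint(self._next_id, copy.deepcopy(state), restored_from)
        self._next_id += 1
        self._items.append(cp)
        return cp

    def get(self, checkpoint_id: int) -> Checkpoint:
        for cp in self._items:
            if cp.checkpoint_id == checkpoint_id:
                return cp
        raise KeyError(checkpoint_id)

    def latest(self) -> Checkpoint:
        # Latest-checkpoint lookup by ordering, as described in the Problem section.
        return max(self._items, key=lambda cp: cp.checkpoint_id)

    def rollback_superseding(self, target_id: int) -> Checkpoint:
        # Copy the target state into a NEW checkpoint instead of rewriting the
        # old one in place, so ordering-based lookups see the restored state.
        target = self.get(target_id)
        return self.put(target.state, restored_from=target_id)
```

With this shape, the post-run checkpoint stays in history but no longer wins the latest lookup, and restored_from keeps an audit trail of what was undone.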
Tests
- Create multiple checkpoints for a thread.
- Roll back to an earlier checkpoint.
- Verify the next latest-state lookup returns the restored state, not a later checkpoint.
- Verify writes associated with retired checkpoints do not replay after rollback.
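The cases above could be sketched roughly as follows, here against the first option (retiring later checkpoints and their writes). ToyStore and all of its methods are hypothetical stand-ins, and the assumption that a resume replays the writes attached to the latest checkpoint is a simplification; a real test would exercise the actual checkpointer and _rollback_to_pre_run_checkpoint.

```python
from typing import Any, Dict, List, Tuple


class ToyStore:
    """Hypothetical in-memory store: checkpoints plus writes keyed by checkpoint id."""

    def __init__(self) -> None:
        self.checkpoints: Dict[int, Dict[str, Any]] = {}
        self.writes: Dict[int, List[Tuple[str, Any]]] = {}
        self._next_id = 1

    def put(self, state: Dict[str, Any]) -> int:
        cid = self._next_id
        self._next_id += 1
        self.checkpoints[cid] = dict(state)
        self.writes[cid] = []
        return cid

    def put_write(self, cid: int, channel: str, value: Any) -> None:
        self.writes[cid].append((channel, value))

    def latest_id(self) -> int:
        return max(self.checkpoints)

    def pending_writes(self) -> List[Tuple[str, Any]]:
        # Assumption: a resume replays the writes attached to the latest checkpoint.
        return list(self.writes[self.latest_id()])

    def rollback_retiring(self, target_id: int) -> None:
        if target_id not in self.checkpoints:
            raise KeyError(target_id)
        # Drop every checkpoint newer than the target, together with its writes.
        for cid in [c for c in self.checkpoints if c > target_id]:
            del self.checkpoints[cid]
            self.writes.pop(cid, None)


def test_rollback_makes_target_latest_and_drops_later_writes() -> None:
    store = ToyStore()
    pre_run = store.put({"step": 0})
    store.put({"step": 1})
    post_run = store.put({"step": 2})
    store.put_write(post_run, "messages", "partial output")

    store.rollback_retiring(pre_run)

    assert store.latest_id() == pre_run
    assert store.checkpoints[store.latest_id()] == {"step": 0}
    assert store.pending_writes() == []
    assert post_run not in store.writes
```

The same assertions on the latest lookup and on pending writes should hold if the fix uses a superseding restore checkpoint instead of deletion; only the expected latest id changes.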
References
backend/packages/harness/deerflow/runtime/runs/worker.py:303