[harness][runtime] Rollback writes an old checkpoint without retiring newer checkpoints #2554

@hetaoBackend

Problem

_rollback_to_pre_run_checkpoint restores an earlier checkpoint by writing that checkpoint state back through the checkpointer, but it does not delete, hide, or supersede checkpoints that were created after the rollback target. If later code resolves the latest checkpoint by checkpoint ordering, it can still observe the newer post-run checkpoint instead of the restored state.

Impact

Rollback can appear to succeed while subsequent thread-state reads or resumed runs continue from a newer checkpoint. This makes rollback behavior nondeterministic and can leave the thread in a state the user explicitly tried to undo.

Suggested Fix

When rolling back, either remove or mark as obsolete all checkpoints and writes created after the rollback target, or create a new restoring checkpoint with a fresh checkpoint id that clearly supersedes the later entries. In both cases, the latest-checkpoint lookup should resolve to the restored state after rollback.
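
A minimal sketch of the second option (a superseding restoring checkpoint), assuming checkpoint ids are monotonically increasing and the latest lookup picks the highest id. `InMemoryCheckpointer`, `Checkpoint`, and `rollback_with_superseding_checkpoint` are hypothetical names for illustration only; they are not the project's checkpointer API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Checkpoint:
    checkpoint_id: int
    state: dict[str, Any]
    restored_from: Optional[int] = None  # set only on restoring checkpoints


class InMemoryCheckpointer:
    """Hypothetical stand-in for the real checkpointer (assumption)."""

    def __init__(self) -> None:
        self._next_id = 1
        self._by_thread: dict[str, list[Checkpoint]] = {}

    def put(self, thread_id: str, state: dict[str, Any],
            restored_from: Optional[int] = None) -> Checkpoint:
        # Every write gets a fresh id that is larger than all earlier ids.
        cp = Checkpoint(self._next_id, dict(state), restored_from)
        self._next_id += 1
        self._by_thread.setdefault(thread_id, []).append(cp)
        return cp

    def get(self, thread_id: str, checkpoint_id: int) -> Checkpoint:
        for cp in self._by_thread.get(thread_id, []):
            if cp.checkpoint_id == checkpoint_id:
                return cp
        raise KeyError(f"no checkpoint {checkpoint_id} for thread {thread_id}")

    def latest(self, thread_id: str) -> Checkpoint:
        # Resolve "latest" purely by checkpoint ordering.
        return max(self._by_thread[thread_id], key=lambda cp: cp.checkpoint_id)


def rollback_with_superseding_checkpoint(
    store: InMemoryCheckpointer, thread_id: str, target_id: int
) -> Checkpoint:
    """Copy the target state into a new checkpoint with a fresh, highest id."""
    target = store.get(thread_id, target_id)
    return store.put(thread_id, target.state, restored_from=target.checkpoint_id)


store = InMemoryCheckpointer()
pre_run = store.put("thread-1", {"step": "pre-run"})
store.put("thread-1", {"step": "post-run"})
restored = rollback_with_superseding_checkpoint(store, "thread-1", pre_run.checkpoint_id)
```

Because the restoring checkpoint receives a new id rather than reusing the target's id, an ordering-based lookup resolves to it without deleting history. Pending writes attached to the post-run checkpoint would still need separate handling, which this sketch does not cover.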

Tests

  • Create multiple checkpoints for a thread.
  • Roll back to an earlier checkpoint.
  • Verify the next latest-state lookup returns the restored state, not a later checkpoint.
  • Verify writes associated with retired checkpoints do not replay after rollback.
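
The cases above could be sketched as a single test against a fake store that uses the first option (retiring newer checkpoints and their writes). `FakeCheckpointStore` and its methods, including `writes_to_replay`, are hypothetical and exist only to make the test self-contained; a real test would exercise the project's checkpointer and `_rollback_to_pre_run_checkpoint` instead.

```python
from typing import Any


class FakeCheckpointStore:
    """Hypothetical in-memory store (assumption), not the real checkpointer."""

    def __init__(self) -> None:
        self._next_id = 1
        self.checkpoints: dict[str, list[tuple[int, dict[str, Any]]]] = {}
        self.pending_writes: dict[tuple[str, int], list[Any]] = {}

    def put(self, thread_id: str, state: dict[str, Any]) -> int:
        checkpoint_id = self._next_id
        self._next_id += 1
        self.checkpoints.setdefault(thread_id, []).append((checkpoint_id, dict(state)))
        return checkpoint_id

    def put_writes(self, thread_id: str, checkpoint_id: int, writes: list[Any]) -> None:
        self.pending_writes.setdefault((thread_id, checkpoint_id), []).extend(writes)

    def latest(self, thread_id: str) -> tuple[int, dict[str, Any]]:
        # "Latest" is resolved by checkpoint ordering only.
        return max(self.checkpoints[thread_id], key=lambda item: item[0])

    def writes_to_replay(self, thread_id: str) -> list[Any]:
        # Assumption: a resumed run replays writes attached to the latest checkpoint.
        latest_id, _ = self.latest(thread_id)
        return list(self.pending_writes.get((thread_id, latest_id), []))

    def rollback(self, thread_id: str, target_id: int) -> None:
        """Retire every checkpoint and pending write newer than target_id."""
        existing = self.checkpoints.get(thread_id, [])
        if all(cid != target_id for cid, _ in existing):
            raise KeyError(f"no checkpoint {target_id} for thread {thread_id}")
        self.checkpoints[thread_id] = [(c, s) for c, s in existing if c <= target_id]
        stale = [k for k in self.pending_writes if k[0] == thread_id and k[1] > target_id]
        for key in stale:
            del self.pending_writes[key]


def test_rollback_retires_newer_checkpoints() -> None:
    store = FakeCheckpointStore()
    # Create multiple checkpoints for a thread.
    pre_run = store.put("thread-1", {"messages": ["hi"]})
    mid_run = store.put("thread-1", {"messages": ["hi", "thinking"]})
    post_run = store.put("thread-1", {"messages": ["hi", "thinking", "reply"]})
    store.put_writes("thread-1", mid_run, [("messages", "partial")])
    store.put_writes("thread-1", post_run, [("messages", "tool-call")])

    # Roll back to an earlier checkpoint.
    store.rollback("thread-1", pre_run)

    # The latest-state lookup returns the restored state, not a later checkpoint.
    latest_id, latest_state = store.latest("thread-1")
    assert latest_id == pre_run
    assert latest_state == {"messages": ["hi"]}
    # Writes tied to retired checkpoints are gone and cannot replay.
    assert store.writes_to_replay("thread-1") == []
    assert store.pending_writes == {}
```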

References

  • backend/packages/harness/deerflow/runtime/runs/worker.py:303
