Checkpoint & Resume¶
RushTI can save progress during long-running workflows. If something fails -- a TM1 crash, network timeout, or TI process error -- you can resume from where it left off instead of starting over.
Working Example¶
A Typical Scenario¶
You are running a 500-task month-end close that takes 2 hours. After 90 minutes and 450 completed tasks, the TM1 server restarts unexpectedly.
Without checkpoint: You re-run all 500 tasks from scratch. Another 2 hours lost.
With checkpoint: You resume from where it stopped. Only the remaining 50 tasks need to run.
# Original run (interrupted after 450 tasks)
rushti run --tasks monthly-close.json --max-workers 8
# Resume -- only the remaining tasks execute
rushti resume --tasks monthly-close.json
How to Enable¶
Add the [resume] section to config/settings.ini:
| Setting | Default | Range | Description |
|---|---|---|---|
| enabled | false | -- | Turn on checkpoint saving |
| checkpoint_interval | 60 | 10-600 seconds | How often to save progress |
| checkpoint_dir | ./checkpoints | -- | Directory for checkpoint files |
That is all the setup required. Once enabled, RushTI automatically saves checkpoints during every run.
How It Works¶
Saving Checkpoints¶
When checkpoints are enabled, RushTI saves a checkpoint file every 60 seconds by default (set by checkpoint_interval). The checkpoint records:
- Which tasks succeeded (with timing data)
- Which tasks failed (with error details)
- Which tasks were in progress when the checkpoint was saved
- Which tasks are still pending (not yet started)
Checkpoint files are saved as JSON in the checkpoint directory:
Resuming¶
When you run rushti resume, RushTI:
- Loads the checkpoint file for the specified workflow
- Validates it against the current task file (warns you if the file changed)
- Marks all previously succeeded tasks as complete (skips them)
- Runs only the failed, in-progress, and pending tasks
- Continues saving new checkpoints as it goes
# Resume from the most recent checkpoint
rushti resume --tasks monthly-close.json
# Resume from a specific checkpoint file
rushti resume --checkpoint checkpoints/monthly-close_20260209_143022.json
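How RushTI itself picks the "most recent" checkpoint is not described here. As an illustration only, the sketch below assumes checkpoint files follow the `<workflow>_<YYYYMMDD>_<HHMMSS>.json` pattern seen in the example path above, so the newest file for a workflow is the one with the greatest timestamp suffix. The helper name `latest_checkpoint` is hypothetical and not part of RushTI.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming pattern, inferred from the example path above:
#   <workflow>_<YYYYMMDD>_<HHMMSS>.json
_STAMP = re.compile(r"_(\d{8}_\d{6})\.json$")


def latest_checkpoint(checkpoint_dir: str, workflow: str) -> Optional[Path]:
    """Return the newest checkpoint file for `workflow`, or None if none exists."""
    candidates = []
    for path in Path(checkpoint_dir).glob(f"{workflow}_*.json"):
        match = _STAMP.search(path.name)
        # Keep only files named exactly <workflow>_<stamp>.json, so that
        # "monthly-close_extra_..." is not mistaken for "monthly-close".
        if match and path.name == f"{workflow}_{match.group(1)}.json":
            candidates.append((match.group(1), path))
    if not candidates:
        return None
    # Fixed-width timestamps sort chronologically as plain strings.
    return max(candidates, key=lambda item: item[0])[1]
```

You could pass the returned path to the --checkpoint option shown above if you want to be explicit about which file is used.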
What Gets Resumed¶
| Task Status at Checkpoint | What Happens on Resume |
|---|---|
| Succeeded | Skipped (already done) |
| Failed | Re-run |
| Pending | Run normally |
| In progress (safe_retry: true) | Re-run from the beginning |
| In progress (safe_retry: false) | Requires manual decision |
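The table above can be read as a small decision function. The sketch below is only a restatement of those rules, not RushTI's internal code; the status strings and the `Action` enum are hypothetical names chosen for the example.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip"
    RUN = "run"
    MANUAL = "manual decision"


def resume_action(status: str, safe_retry: bool = False) -> Action:
    """Map a task's status at checkpoint time to what happens on resume."""
    if status == "succeeded":
        return Action.SKIP  # already done
    if status in ("failed", "pending"):
        return Action.RUN  # failed tasks are re-run, pending tasks run normally
    if status == "in_progress":
        # Only tasks flagged as idempotent are re-run automatically.
        return Action.RUN if safe_retry else Action.MANUAL
    raise ValueError(f"unknown status: {status!r}")
```

Note that `safe_retry` defaults to `False`, matching the default described for the flag, so an interrupted task needs a manual decision unless it was explicitly marked as safe.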
The safe_retry Flag
Mark tasks as safe_retry: true when they are idempotent -- running them twice produces the same result. Examples: clearing and rebuilding a cube view, maintaining subsets, exporting a report. Tasks that append data or send emails should stay safe_retry: false (the default).
Checkpoint File Contents¶
A checkpoint file is a simple JSON document. Here is a shortened example:
{
"version": "1.0",
"workflow": "monthly-close",
"taskfile_path": "/rushti/tasks/monthly-close.json",
"run_started": "2026-02-09T14:00:00",
"checkpoint_created": "2026-02-09T15:30:00",
"total_tasks": 500,
"summary": {
"completed": 450,
"in_progress": 2,
"pending": 48,
"failed": 0,
"progress_percentage": 90.0
},
"completed_tasks": {
"1": { "success": true, "duration_seconds": 12.5 },
"2": { "success": true, "duration_seconds": 15.2 }
},
"in_progress_tasks": ["7", "8"],
"pending_tasks": ["11", "12"]
}
You can inspect a checkpoint any time to see progress:
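Because the checkpoint is plain JSON, any JSON-aware tool works. As one possible approach, the short Python sketch below prints a one-line progress summary. It relies only on the field names shown in the shortened example above; the function name `summarize_checkpoint` is hypothetical, and a real checkpoint file may contain more fields.

```python
import json
import sys


def summarize_checkpoint(path: str) -> str:
    """Build a one-line progress summary from a checkpoint file."""
    with open(path, encoding="utf-8") as handle:
        checkpoint = json.load(handle)
    summary = checkpoint["summary"]
    return (
        f"{checkpoint['workflow']}: "
        f"{summary['completed']}/{checkpoint['total_tasks']} completed "
        f"({summary['progress_percentage']}%), "
        f"{summary['in_progress']} in progress, "
        f"{summary['pending']} pending, "
        f"{summary['failed']} failed"
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the checkpoint file path as the first command-line argument.
    print(summarize_checkpoint(sys.argv[1]))
```

Saved as a script, this takes the checkpoint path as its only argument, for example the file name used in the resume example earlier.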
Best Practices¶
Enable for Long-Running Workflows¶
Any workflow that takes more than 5 minutes is a good candidate for checkpoints. The overhead is minimal (a small JSON file written to disk every 60 seconds).
Tune the Checkpoint Interval¶
| Workflow Type | Recommended Interval | Reason |
|---|---|---|
| Many short tasks (< 10s each) | 30 seconds | Capture fast progress |
| Mix of short and long tasks | 60 seconds (default) | Good balance |
| Few long tasks (> 5 min each) | 120-300 seconds | Less I/O, tasks are slow anyway |
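If you generate settings from a script, the recommendations above can be encoded directly. The sketch below is a literal restatement of the table plus the 10-600 second range from the settings table. Treating every typical task duration between 10 seconds and 5 minutes as the "mix" row is an assumption made for this example, and both function names are hypothetical.

```python
from typing import Tuple

# Allowed range for checkpoint_interval, taken from the settings table.
MIN_INTERVAL, MAX_INTERVAL = 10, 600


def recommended_interval(typical_task_seconds: float) -> Tuple[int, int]:
    """Return the (low, high) interval in seconds suggested by the table."""
    if typical_task_seconds < 10:
        return (30, 30)  # many short tasks: capture fast progress
    if typical_task_seconds > 300:
        return (120, 300)  # few long tasks: less I/O
    return (60, 60)  # assumed "mix" case: keep the default


def is_valid_interval(seconds: int) -> bool:
    """Check a value against the documented 10-600 second range."""
    return MIN_INTERVAL <= seconds <= MAX_INTERVAL
```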
Mark Idempotent Tasks¶
Set safe_retry: true on tasks that can safely re-run:
Good candidates for safe_retry: true:
- Clear-and-rebuild processes (dimension updates, view refreshes)
- Read-only exports and report generation
- Cache refresh and metadata operations
Keep safe_retry: false (default) for:
- Incremental data loads (appending transactions)
- Processes that send emails or trigger external systems
- Anything that creates new records with auto-generated IDs
Use --force to Start Fresh¶
If a checkpoint exists but you want to ignore it and start over:
The --force flag tells RushTI to discard any existing checkpoint and begin a full run.
Troubleshooting¶
"Checkpoint not found"¶
The checkpoint may have been cleaned up after a successful run (this is the default behavior). Check your checkpoint directory:
Checkpoint Cleanup
Checkpoint files remain in the checkpoint directory after a run completes. Clean them up manually or include cleanup in your scheduling scripts.
"Task file has been modified"¶
You changed the task file after the checkpoint was created. RushTI warns you because the checkpoint might not match the current tasks. Options:
- Force resume if your changes are compatible (e.g., you only added a new task):
- Start fresh if your changes are significant:
"Cannot automatically resume -- non-safe-retry tasks"¶
These tasks were running when the interruption happened, and RushTI cannot guarantee they are safe to re-run. You have two options:
- Manually verify the tasks completed or can be re-run, then force resume
- Mark them as safe_retry: true in the task file if they are actually idempotent
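To see which tasks need a manual check, you can read the in_progress_tasks list from the checkpoint file. The sketch below assumes the checkpoint layout shown in the shortened example earlier. Because the task file format is not covered in this section, the IDs of tasks you consider safe to retry are passed in by hand; `tasks_needing_review` is a hypothetical helper, not a RushTI function.

```python
from typing import Iterable, List


def tasks_needing_review(checkpoint: dict, safe_retry_ids: Iterable[str] = ()) -> List[str]:
    """List in-progress task IDs that are not known to be safe to re-run."""
    safe = set(safe_retry_ids)
    # A checkpoint without in-progress tasks needs no manual review.
    return [
        task_id
        for task_id in checkpoint.get("in_progress_tasks", [])
        if task_id not in safe
    ]
```

Load the checkpoint with `json.load` and pass the resulting dictionary in. An empty result means every interrupted task was one you already consider idempotent.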
Configuration Summary¶
All checkpoint settings in one place:
[resume]
enabled = true # Save checkpoints during execution
checkpoint_interval = 60 # Seconds between checkpoint saves
checkpoint_dir = ./checkpoints # Where to store checkpoint files
auto_resume = false # Automatically resume from last checkpoint on restart
Customize Further¶
- Settings Reference -- Complete [resume] settings documentation
- CLI Reference -- Full CLI options for the resume command
- DAG Execution -- How task scheduling and failure handling interact with checkpoints
- Exclusive Mode -- Prevent concurrent executions that could conflict with a resumed run