Performance Tuning¶
This guide covers practical techniques for getting the most out of RushTI's parallel execution engine. The goal is to minimize total workflow runtime by keeping TM1 servers busy without overloading them.
Choosing Worker Count¶
The max_workers setting controls how many TI processes RushTI executes concurrently. The right value depends on your TM1 server capacity and workload characteristics.
Rules of Thumb¶
| Scenario | Workers | Reasoning |
|---|---|---|
| Single TM1 instance, I/O-bound | 4--8 | Avoid overloading one server |
| Single TM1 instance, CPU-bound | 4--6 | CPU contention degrades all tasks |
| Multiple TM1 instances | 8--20 | Load distributes across servers |
| Read-only processes (exports) | 16--32 | Reads rarely contend |
| Write-heavy processes (loads) | 4--8 | Write locks serialize execution anyway |
| Mixed read/write workload | 8--12 | Balanced approach |
Finding Your Optimal Value¶
- Start conservative at `--max-workers 4`.
- Run the same workflow several times, increasing workers: 8, 12, 16.
- Watch for diminishing returns -- when doubling workers no longer cuts runtime significantly, you have hit a bottleneck (usually TM1 server threads or write locks).
- Use `rushti stats visualize` to inspect Gantt charts and identify idle workers.
- Use `rushti stats optimize` to get a data-driven recommendation based on runs at different worker levels.
# Collect timing data at different worker levels
rushti run --tasks workflow.json --max-workers 4
rushti run --tasks workflow.json --max-workers 8
rushti run --tasks workflow.json --max-workers 16
# Compare in dashboard
rushti stats visualize --workflow workflow
# Get a data-driven worker recommendation
rushti stats optimize --workflow workflow
Too Many Workers
Setting workers higher than your TM1 server can handle causes connection pool exhaustion, increased memory usage, and TM1 thread starvation. Symptoms include tasks waiting long periods before starting, sporadic connection timeouts, and TM1 server performance degradation for all users.
Connection Pooling¶
RushTI reuses TM1py connections across tasks. For each unique TM1 instance referenced in the task file, RushTI creates a pool of connections at startup and shares them among workers.
Key Behaviors¶
- One pool per instance: If your tasks reference `tm1-finance` and `tm1-reporting`, two independent connection pools are created (see the sketch below).
- Pool size matches workers: Each pool has up to `max_workers` connections.
- Connections are reused: When a worker finishes a task, its connection returns to the pool for the next task targeting the same instance.
- Session context: Each connection carries a session context string (`RushTI_{workflow}` or `RushTIX_{workflow}`) for exclusive mode detection.
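To make the pooling concrete, here is a simplified sketch of the idea -- not RushTI's actual implementation, and the class and function names are illustrative: one bounded pool per instance, sized to max_workers, with a connection borrowed for each task and returned afterwards.

import queue

class InstancePools:
    """Illustrative only: one bounded pool of connections per TM1 instance."""

    def __init__(self, instance_names, max_workers, connect):
        # connect(name) stands in for opening a TM1py session to that instance
        self.pools = {}
        for name in instance_names:
            pool = queue.Queue(maxsize=max_workers)   # pool size matches max_workers
            for _ in range(max_workers):
                pool.put(connect(name))               # connections created once at startup
            self.pools[name] = pool

    def run_task(self, instance_name, task):
        conn = self.pools[instance_name].get()        # borrow a connection from the pool
        try:
            return task(conn)                         # execute the TI process
        finally:
            self.pools[instance_name].put(conn)       # return it for the next task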
Optimizing Connections¶
In config.ini, enable SSL and async mode for better throughput:
[tm1-finance]
address = tm1server.company.com
port = 12354
user = admin
password = ${TM1_PASSWORD}
ssl = True
async_requests_mode = True
Async Polling (Backoff Strategy)¶
When async_requests_mode = True, TM1py submits TI processes asynchronously and polls the TM1 server until they complete. Rather than polling at a fixed interval, TM1py uses exponential backoff — starting with frequent polls and gradually slowing down:
Poll 1: wait 0.1s → Poll 2: wait 0.2s → Poll 3: wait 0.4s → Poll 4: wait 0.8s → Poll 5+: wait 1.0s (cap)
This means short-running processes get detected quickly (within ~0.1s of completion), while long-running processes don't flood the server with polling requests.
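As a rough sketch (not TM1py's actual code), the delay between polls grows by the backoff factor after each check until it reaches the cap:

import time

def wait_for_completion(is_complete, initial_delay=0.1, max_delay=1.0, backoff_factor=2):
    # Illustrative polling loop: waits of 0.1s, 0.2s, 0.4s, 0.8s, then capped at 1.0s
    delay = initial_delay
    while not is_complete():          # is_complete() stands in for the async status request
        time.sleep(delay)
        delay = min(delay * backoff_factor, max_delay)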
You can tune the polling behavior per instance in config.ini:
[tm1-finance]
address = tm1server.company.com
port = 12354
ssl = True
async_requests_mode = True
async_polling_initial_delay = 0.1
async_polling_max_delay = 1.0
async_polling_backoff_factor = 2
| Parameter | Default | Description |
|---|---|---|
| `async_polling_initial_delay` | 0.1 | Seconds to wait before the first poll |
| `async_polling_max_delay` | 1.0 | Maximum seconds between polls (cap) |
| `async_polling_backoff_factor` | 2 | Multiplier applied to the delay after each poll |
When to Adjust Polling
For workflows dominated by very short TI processes (under 1 second), keep the defaults — the 0.1s initial delay detects completion quickly. For workflows with mostly long-running processes (minutes), you can increase async_polling_initial_delay to 0.5 and async_polling_max_delay to 5.0 to reduce polling overhead on the TM1 server.
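For example, a slower polling profile for an instance that mostly runs long processes might look like this (the tm1-batch instance name is just a placeholder):

[tm1-batch]
address = tm1server.company.com
port = 12354
ssl = True
async_requests_mode = True
async_polling_initial_delay = 0.5
async_polling_max_delay = 5.0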
Connection Recovery (RemoteDisconnect Retry)¶
Network interruptions between RushTI and the TM1 server can cause RemoteDisconnected errors mid-execution. TM1py handles these automatically with exponential backoff retries — reconnecting without failing the task:
Attempt 1: wait 1s → Attempt 2: wait 2s → Attempt 3: wait 4s → Attempt 4: wait 8s → Attempt 5: wait 16s
If the connection is restored within any retry window, the request completes normally. If all retries are exhausted, the task fails (and RushTI's own retry logic takes over if retries > 0).
Configure retry behavior per instance in config.ini:
[tm1-finance]
address = tm1server.company.com
port = 12354
ssl = True
remote_disconnect_max_retries = 5
remote_disconnect_retry_delay = 1.0
remote_disconnect_max_delay = 30
remote_disconnect_backoff_factor = 2
| Parameter | Default | Description |
|---|---|---|
| `remote_disconnect_max_retries` | 5 | Maximum number of reconnection attempts |
| `remote_disconnect_retry_delay` | 1.0 | Seconds to wait before the first retry |
| `remote_disconnect_max_delay` | 30 | Maximum seconds between retries (cap) |
| `remote_disconnect_backoff_factor` | 2 | Multiplier applied to the delay after each retry |
Connection Recovery vs Task Retry
TM1py's connection recovery and RushTI's task retry (retries setting) are independent. Connection recovery handles transient network blips transparently — the task never knows the connection dropped. Task retry re-executes the entire TI process from scratch after a complete failure. For maximum resilience, use both: connection recovery handles brief outages, and task retry handles longer disruptions.
Reducing Connection Overhead¶
For workflows with many short-running tasks, connection setup time can dominate execution time. To reduce this:
- Group related tasks on the same TM1 instance when possible.
- Keep `max_workers` aligned with the number of TM1 instances to avoid creating excess connections that are rarely used.
Task Ordering¶
The order in which RushTI picks ready tasks matters. When multiple tasks are ready to run (all predecessors complete), RushTI must choose which to start first.
Default Behavior¶
Without optimization, ready tasks are started in the order they appear in the task file.
EWMA Optimization: Configurable Scheduling Algorithms¶
When optimization is enabled via --optimize <algorithm> or a JSON taskfile optimization_algorithm setting, RushTI sorts ready tasks by estimated runtime using EWMA (Exponentially Weighted Moving Average) estimates from historical execution data.
Two scheduling algorithms are available:
| Algorithm | Sort Order | Best For |
|---|---|---|
| `longest_first` | Descending (longest tasks start first) | Independent workloads with varied task durations. Minimizes total makespan by starting expensive tasks early. |
| `shortest_first` | Ascending (shortest tasks start first) | Shared-resource TM1 workloads where concurrent heavy tasks cause contention. Reduces resource pressure by completing quick tasks first. |
Why longest-first works: If a 10-minute task and a 1-minute task are both ready, starting the 10-minute task first means the 1-minute task runs in parallel during those 10 minutes. Starting the 1-minute task first wastes 9 minutes of potential parallelism.
Why shortest-first works: When tasks share TM1 server resources (memory, threads, write locks), running many heavy tasks simultaneously causes contention. Starting short tasks first reduces the number of concurrent heavy tasks, lowering resource pressure and improving overall throughput.
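Conceptually, both algorithms reduce to sorting the ready queue by each task's EWMA runtime estimate before dispatching. A minimal sketch, with illustrative names rather than RushTI internals:

def order_ready_tasks(ready_tasks, estimates, algorithm):
    # ready_tasks: task ids whose predecessors have all completed
    # estimates:   task id -> estimated seconds (EWMA of past runs)
    # algorithm:   "longest_first" or "shortest_first"
    longest_first = algorithm == "longest_first"
    # Tasks without history fall back to 0 here; the real fallback may differ.
    return sorted(ready_tasks, key=lambda t: estimates.get(t, 0.0), reverse=longest_first)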
Enabling Optimization¶
Optimization is off by default. Activate it per-run via the CLI or per-taskfile via JSON settings:
# CLI: opt-in to shortest-first scheduling
rushti run --tasks daily-etl.json --max-workers 20 --optimize shortest_first
# CLI: opt-in to longest-first scheduling
rushti run --tasks daily-etl.json --max-workers 20 --optimize longest_first
Or set a default algorithm in a JSON task file:
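{
  "settings": {
    "optimization_algorithm": "shortest_first"
  }
}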
CLI --optimize overrides the JSON setting. Omitting both means no optimization.
The EWMA tuning parameters live in settings.ini (system-wide, rarely changed):
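[stats]
enabled = true

[optimization]
lookback_runs = 10
min_samples = 3
time_of_day_weighting = false
cache_duration_hours = 24

The keys and values shown here mirror the defaults listed under EWMA Tuning below.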
The optimizer needs historical data, so stats must be enabled. After min_samples runs, the optimizer has enough data to produce reliable estimates.
Manual Optimization with rushti stats analyze¶
If you prefer to control task ordering yourself, use the analyze command to generate an optimized task file:
# Analyze the last 20 runs and generate an optimized task file
rushti stats analyze \
--workflow daily-etl \
--tasks daily-etl.json \
--output daily-etl-optimized.json \
--runs 20
# Validate and use the optimized file
rushti tasks validate --tasks daily-etl-optimized.json --skip-tm1-check
rushti run --tasks daily-etl-optimized.json
The optimized task file reorders tasks so that long-running ones appear first while preserving all dependency constraints.
Stages for Resource Control¶
Stages provide a second level of concurrency control beyond max_workers. Use stage_workers to limit parallelism during resource-intensive phases.
Pattern: Aggressive Extract, Conservative Load¶
{
"settings": {
"max_workers": 20,
"stage_order": ["extract", "transform", "load"],
"stage_workers": {
"extract": 20,
"transform": 12,
"load": 4
}
}
}
- Extract (read-only): Use all 20 workers. Reads rarely cause contention.
- Transform (mixed): Moderate concurrency. Some transforms involve temporary cube writes.
- Load (write-heavy): Only 4 concurrent loads to avoid write lock contention and TM1 server memory pressure.
Pattern: Staged Throttling¶
Use a critical stage with low concurrency for tasks that modify shared dimensions or control cubes, and a normal stage for everything else.
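A sketch of that layout (stage names and worker counts are illustrative):

{
  "settings": {
    "max_workers": 16,
    "stage_order": ["critical", "normal"],
    "stage_workers": {
      "critical": 2,
      "normal": 16
    }
  }
}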
Global max_workers Cap¶
The max_workers setting defines the TM1 session connection pool size, which is set at connection time and cannot be resized at runtime. This means max_workers always acts as an absolute ceiling for all stages. If a stage_workers value exceeds the global max_workers, the global limit still applies and a warning is emitted.
To take full advantage of stage_workers, ensure max_workers is at least as high as the largest per-stage limit.
Timeout Strategies¶
Timeouts prevent runaway TI processes from blocking the entire pipeline.
Setting Per-Task Timeouts¶
Analyze historical execution times from rushti stats list tasks and set timeouts to 2--3x the average duration:
# Check average task durations
rushti stats list tasks --workflow daily-etl
# If extract-gl averages 45s, set timeout to 120s
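The corresponding task entry might look like this, assuming the task-level key is timeout (in seconds) and using a placeholder process name for the extract-gl task:

{ "id": "1", "process": "Extract.GL", "timeout": 120 }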
When to Use cancel_at_timeout¶
- Non-critical tasks: Reports, exports, and notifications where partial data is acceptable.
- Retry-safe tasks: Tasks marked `safe_retry: true` that can be re-executed later.
- Never for data loads: Cancelling a write operation mid-execution can corrupt cube data.
EWMA Tuning¶
The EWMA (Exponentially Weighted Moving Average) optimizer has several tuning parameters.
Key Parameters¶
| Parameter | Default | Effect |
|---|---|---|
| `lookback_runs` | 10 | How many recent runs to consider. Higher = more stable estimates, slower to adapt. |
| `min_samples` | 3 | Minimum data points before optimization activates for a task. Lower = earlier activation but less reliable estimates. |
| `time_of_day_weighting` | false | Weight runs at similar times of day more heavily. Enable for workloads with time-dependent performance. |
| `cache_duration_hours` | 24 | How long to cache estimates. Shorter = more up-to-date but more I/O at startup. |
Tuning with rushti stats analyze¶
The --ewma-alpha flag on rushti stats analyze controls how heavily recent runs are weighted:
| Alpha | Behavior |
|---|---|
| 0.1 | Very smooth -- emphasizes long-term average. Use for stable workloads. |
| 0.3 | Balanced (default). Good for most workloads. |
| 0.5 | Responsive -- adapts quickly to changes. Use for volatile workloads. |
| 0.8 | Aggressive -- almost entirely based on the most recent run. Use with caution. |
# Compare different alpha values
rushti stats analyze --workflow daily-etl --ewma-alpha 0.2 --report report_02.json
rushti stats analyze --workflow daily-etl --ewma-alpha 0.5 --report report_05.json
Contention-Aware Analysis¶
When runtime-based scheduling is not enough, RushTI's contention-aware optimizer (rushti stats optimize) provides deeper analysis. See Self-Optimization: Contention-Aware for the full algorithm description.
When Contention Analysis Helps¶
| Symptom | What Contention Analysis Does |
|---|---|
| Adding workers does not reduce total runtime | Detects concurrency ceiling and recommends the optimal worker count |
| A few heavy tasks slow down everything when running together | Identifies heavy outlier groups and chains them sequentially |
| Reducing workers actually improved performance | Confirms the ceiling with multi-run comparison data |
| Unclear which parameter drives the performance difference | Identifies the contention driver (e.g., pRegion, pDimension) |
Tuning Sensitivity¶
The --sensitivity parameter controls how aggressively outliers are detected:
| Sensitivity | Behavior |
|---|---|
| 5.0 | Aggressive -- flags more groups as heavy. Use when you know contention is a problem. |
| 10.0 | Balanced (default). Good for most workloads. |
| 20.0 | Conservative -- only flags extreme outliers. Use for workflows with naturally varied durations. |
# Compare sensitivity levels
rushti stats optimize --workflow daily-etl --sensitivity 5.0
rushti stats optimize --workflow daily-etl --sensitivity 20.0
Concurrency Ceiling vs Scale-Up¶
The optimizer detects two complementary signals from multi-run data:
- Concurrency ceiling: Runs with fewer workers were faster. The server was overwhelmed. The optimizer recommends reducing `max_workers`.
- Scale-up opportunity: Runs with more workers were faster, but the most recent run used fewer workers. The optimizer recommends increasing `max_workers` back to the efficient sweet spot.
The sweet spot algorithm avoids overreacting: it finds the fewest workers within 10% of the best observed wall clock time. For example, if 10 workers achieved 581s and 50 workers achieved 547s (only 6% faster), the optimizer recommends 10 workers — nearly the same speed with 5x fewer resources.
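A small sketch of that selection rule (not the optimizer's actual code): pick the fewest workers whose observed wall clock time is within 10% of the best run.

def sweet_spot(runs, tolerance=0.10):
    # runs: worker count -> observed wall clock seconds
    best = min(runs.values())
    threshold = best * (1 + tolerance)      # anything within 10% of the best counts
    return min(w for w, t in runs.items() if t <= threshold)

# Example from the text: 581s at 10 workers is within 10% of 547s at 50 workers
print(sweet_spot({10: 581, 50: 547}))       # -> 10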
Combining Contention Analysis with Stages¶
For complex workflows, combine contention-aware optimization with stage-based throttling:
# 1. Run contention analysis to understand the bottleneck
rushti stats optimize --workflow complex-etl --tasks complex-etl.json --output optimized.json
# 2. Review the HTML report for heavy groups and recommended workers
# 3. Add stage_workers constraints for resource-intensive phases
# 4. Run with the optimized file
rushti run --tasks optimized.json
The optimized task file embeds the recommended max_workers value. You can further refine it by adding stage-level worker limits.
Monitoring and Identifying Bottlenecks¶
Dashboard Visualization¶
The rushti stats visualize command generates an interactive HTML dashboard with:
- Gantt chart: Shows when each task started and ended. Look for gaps (idle workers) and long sequential chains.
- Success rate trends: Identifies flaky tasks that fail intermittently.
- Duration trends: Spots tasks that are getting slower over time.
DAG Visualization¶
The rushti tasks visualize command shows the dependency graph. Look for:
- Over-specified dependencies: Tasks that depend on more predecessors than necessary, creating artificial bottlenecks.
- Long critical paths: The longest chain of sequential dependencies determines the minimum possible runtime.
- Fan-in bottlenecks: A single task that depends on many predecessors and cannot start until the slowest one finishes.
What to Look For¶
| Symptom | Likely Cause | Fix |
|---|---|---|
| Gantt chart shows many gaps | Over-specified dependencies | Reduce predecessor lists to only true dependencies |
| One task takes 80% of total time | Single bottleneck task | Split the TI process into smaller units that can run in parallel |
| Tasks wait at stage boundaries | Stage concurrency too low | Increase stage_workers for the bottleneck stage |
| Workers idle at the end | Short tail tasks after a long critical path | Reorder tasks so long ones start first (enable optimization) |
| Tasks fail intermittently | TM1 server overload | Reduce max_workers or stagger with stages |
Common Execution Patterns¶
Fan-Out / Fan-In¶
Extract data from multiple regions in parallel, then consolidate into a single result:
{
"tasks": [
{ "id": "1", "process": "Extract.Regional", "parameters": { "pRegion": "NA" } },
{ "id": "2", "process": "Extract.Regional", "parameters": { "pRegion": "EU" } },
{ "id": "3", "process": "Extract.Regional", "parameters": { "pRegion": "APAC" } },
{
"id": "4",
"process": "Consolidate.Global",
"predecessors": ["1", "2", "3"],
"require_predecessor_success": true
}
]
}
Tuning tip: The fan-in task (task 4) cannot start until the slowest extract finishes. Focus optimization on making the slowest extract faster.
Pipeline (ETL Stages)¶
Sequential stages where each stage must complete before the next begins:
{
"settings": {
"stage_order": ["extract", "transform", "load", "validate"]
},
"tasks": [
{ "id": "1", "stage": "extract", "process": "Extract.Sales" },
{ "id": "2", "stage": "extract", "process": "Extract.Costs" },
{ "id": "3", "stage": "transform", "process": "Transform.Data" },
{ "id": "4", "stage": "load", "process": "Load.Cube" },
{ "id": "5", "stage": "validate", "process": "Validate.Results" }
]
}
Tuning tip: If the extract stage takes 2 minutes and the transform stage takes 30 seconds, the pipeline is extract-bound. Focus on parallelizing more extract tasks.
Diamond Dependencies¶
Two tasks that share common predecessors and a common successor:
{
"tasks": [
{ "id": "1", "process": "Setup.Environment" },
{ "id": "2", "process": "Process.PathA", "predecessors": ["1"] },
{ "id": "3", "process": "Process.PathB", "predecessors": ["1"] },
{ "id": "4", "process": "Merge.Results", "predecessors": ["2", "3"] }
]
}
Tuning tip: Tasks 2 and 3 run in parallel. The merge (task 4) waits for both. If one path is much slower, consider splitting it further.
Performance Checklist¶
Use this checklist when reviewing a workflow for performance:
- Worker count: Set to match TM1 server capacity, not arbitrarily high.
- Dependencies: Every predecessor relationship is truly necessary (no redundant edges).
- Optimization enabled: `[stats] enabled = true` in settings.ini, then `--optimize <algorithm>` on the CLI or `optimization_algorithm` in the JSON taskfile.
- Algorithm chosen: Use `shortest_first` for shared-resource TM1 workloads, `longest_first` for independent tasks with varied durations.
- Timeouts set: All tasks have appropriate timeouts to prevent hangs.
- Stages used: Resource-intensive phases have limited `stage_workers`.
- Dashboard reviewed: Gantt chart shows minimal idle time between tasks.
- Retry count appropriate: Non-zero retries for transient failures, zero for logic errors.
- Expandable parameters: Dynamic member lists instead of hardcoded task duplication.
- Contention analyzed: Run `rushti stats optimize` after collecting runs at 2--3 worker levels. Review the HTML report for bottleneck insights.
Next Steps¶
- Self-Optimization -- Runtime scheduling and contention-aware analysis
- Settings Reference -- `[optimization]` and `[stats]` configuration
- Advanced Task Files -- Stages, timeouts, and expandable parameters
- CLI Reference -- `rushti stats analyze`, `rushti stats optimize`, and `rushti stats visualize`