
ACE — Adaptive Compute Efficiency Engine

The Adaptive Compute Efficiency Engine (ACE) answers one question: of all the GPU compute your cluster allocated, how much of it did useful work? ACE ingests job accounting data from major HPC schedulers — Slurm, PBS Pro, LSF, and XDMoD-aggregated multi-scheduler environments — alongside hardware telemetry from NVIDIA DCGM and Kubernetes pod metrics where available. It scores each job against a documented efficiency methodology, identifies the scripts and workload types driving waste, blocks right-sizing recommendations it cannot defend (memory-bound jobs, healthy repeat scripts, multi-GPU topology constraints), and produces a structured export consumed by GRADE for assessment scoring. The result is a GPU-hours-weighted efficiency rate that reflects physical resource consumption rather than masking large wasteful jobs behind a count of small efficient ones.

gpu_efficiency_rate: GPU-hours-weighted efficiency. Of all the GPU-hours allocated across all jobs in the assessment period, what fraction did useful compute work?

gpu_efficiency_rate = gpu_hours_used / gpu_hours_requested
                    = Σ(used_gpus_i × duration_i) / Σ(requested_gpus_i × duration_i)

Large, long jobs count proportionally more than small, short jobs. A cluster running 1,000 small efficient jobs and 10 large wasteful jobs (100× the GPU-hours) cannot hide behind the small ones. This is the metric GRADE uses in the composite score.

gpu_efficiency_score: per-job mean utilization, with equal weight per job regardless of size or duration. ATLAS uses it for human-readable display (“current utilization: 25.7%”), and PACE uses it to compute the job calibration rate.

MIT Supercloud (73,367 Slurm jobs, HPCA22 public dataset):

| Metric | Value | What it says |
| --- | --- | --- |
| gpu_efficiency_score | 0.257 | Most jobs underutilize GPUs |
| gpu_efficiency_rate | 0.339 | Large jobs are relatively better utilized |

GRADE uses gpu_efficiency_rate. Both are reported in ACE findings.

| Job | GPUs requested | Duration (hrs) | Utilization |
| --- | --- | --- | --- |
| 1041 | 8 | 10.0 | 90% |
| 1042 | 4 | 0.5 | 20% |
| 1043 | 8 | 8.0 | 80% |

gpu_hours_requested  = (8×10) + (4×0.5) + (8×8) = 146
gpu_hours_used       = (7.2×10) + (0.8×0.5) + (6.4×8) = 123.6
gpu_efficiency_rate  = 123.6 / 146 = 0.847 ← GRADE primary
gpu_efficiency_score = (0.9 + 0.2 + 0.8) / 3 = 0.633 ← secondary

Job 1042 is small and short — it barely moves gpu_efficiency_rate but is flagged for right-sizing by the per-job threshold.
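
The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch, not ACE's implementation: the `Job` dataclass and both function names are hypothetical, and used GPUs are assumed to equal `requested_gpus × utilization`, as in the worked example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    requested_gpus: int
    duration_hrs: float
    utilization: float  # fraction in [0, 1]


def gpu_efficiency_rate(jobs):
    """GPU-hours-weighted efficiency: used GPU-hours / requested GPU-hours."""
    requested = sum(j.requested_gpus * j.duration_hrs for j in jobs)
    used = sum(j.requested_gpus * j.utilization * j.duration_hrs for j in jobs)
    return used / requested if requested else 0.0


def gpu_efficiency_score(jobs):
    """Unweighted mean of per-job utilization."""
    return sum(j.utilization for j in jobs) / len(jobs) if jobs else 0.0


jobs = [
    Job(1041, 8, 10.0, 0.90),
    Job(1042, 4, 0.5, 0.20),
    Job(1043, 8, 8.0, 0.80),
]

print(round(gpu_efficiency_rate(jobs), 3))   # 0.847
print(round(gpu_efficiency_score(jobs), 3))  # 0.633
```

Removing job 1042 from the list changes the weighted rate only in the third decimal place, while the unweighted score jumps from 0.633 to 0.85.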

ACE selects the highest-fidelity source available based on PROFILE’s routing manifest:

| Source | Command | What it measures |
| --- | --- | --- |
| sacct | ace ingest --source slurm-sacct | Slurm scheduler accounting: job-level GPU requests, elapsed, state |
| pbs-accounting | ace ingest --source pbs-accounting | PBS Pro / OpenPBS / TORQUE accounting log |
| xdmod | ace ingest --source xdmod | Open XDMoD CSV export; normalizes Slurm, PBS, SGE through one interface |
| lsf-bacct | ace ingest --source lsf-bacct | IBM LSF bacct -csv output |
| dcgm | ace analyze input --input-path dcgm | NVIDIA DCGM hardware telemetry (highest confidence) |
| k8s_metrics | ace analyze input --input-path k8s_metrics | Kubernetes pod GPU metrics |
| claw | CLAW agent | Live collection agent; routes to sacct or canonical path |

All four scheduler paths produce an identical canonical CSV format so all downstream analysis and GRADE export paths work without modification.

ACE counts all submitted GPU jobs regardless of completion state. The previous behaviour, silently dropping failed jobs, understated waste.

| Finding | Definition |
| --- | --- |
| job_failure_rate | (FAILED + OOM + BOOT_FAIL) / submitted GPU jobs |
| node_fail_rate | NODE_FAIL / submitted (hardware-induced failures) |
| job_cancel_rate | (CANCELLED + PREEMPTED) / submitted |
| job_timeout_rate | Jobs that hit their walltime limit / submitted |
| early_fail_rate | Jobs that failed within 5 minutes of start / submitted |
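
As a rough illustration of how these rates share one denominator, the sketch below computes them from a list of job records. It rests on several assumptions that are not taken from ACE: the record keys `state` and `elapsed_s` are hypothetical, the state labels are the ones used in the table plus an assumed `TIMEOUT` label, and an early failure is taken to mean a job in one of the application-failure states with under 5 minutes elapsed.

```python
FAILURE_STATES = {"FAILED", "OOM", "BOOT_FAIL"}
CANCEL_STATES = {"CANCELLED", "PREEMPTED"}
EARLY_FAIL_SECONDS = 5 * 60


def completion_rates(jobs):
    """Return completion-state rates over all submitted GPU jobs.

    `jobs` is a list of dicts with the hypothetical keys `state` and
    `elapsed_s`. Every rate divides by the number of submitted jobs.
    """
    submitted = len(jobs)
    if submitted == 0:
        return {}

    def rate(predicate):
        return sum(1 for j in jobs if predicate(j)) / submitted

    return {
        "job_failure_rate": rate(lambda j: j["state"] in FAILURE_STATES),
        "node_fail_rate": rate(lambda j: j["state"] == "NODE_FAIL"),
        "job_cancel_rate": rate(lambda j: j["state"] in CANCEL_STATES),
        "job_timeout_rate": rate(lambda j: j["state"] == "TIMEOUT"),
        "early_fail_rate": rate(
            lambda j: j["state"] in FAILURE_STATES
            and j["elapsed_s"] < EARLY_FAIL_SECONDS
        ),
    }


jobs = [
    {"state": "COMPLETED", "elapsed_s": 7200},
    {"state": "FAILED", "elapsed_s": 120},
    {"state": "OOM", "elapsed_s": 3600},
    {"state": "NODE_FAIL", "elapsed_s": 900},
    {"state": "CANCELLED", "elapsed_s": 60},
]
rates = completion_rates(jobs)
```

Because the denominator is always the submitted count, the completed job still counts toward every rate even though it matches none of the predicates.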

PBS and LSF walltime-exceeded kills are detected by exit code (default: 137, 271) and by a heuristic: if elapsed ≥ 95% of requested walltime and the job failed, it is classified as timeout.
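
The timeout rule can be sketched as a small predicate. This is an illustration only: the function name and parameters are hypothetical, and it assumes that either signal (a matching exit code, or a failed job that ran to at least 95% of its requested walltime) is sufficient on its own.

```python
TIMEOUT_EXIT_CODES = {137, 271}  # defaults quoted in the text above
NEAR_LIMIT_FRACTION = 0.95


def is_timeout(exit_code, elapsed_s, walltime_s, failed):
    """Classify a PBS or LSF job as walltime-exceeded.

    `walltime_s` may be None when the export has no requested walltime;
    in that case only the exit-code check can fire.
    """
    if exit_code in TIMEOUT_EXIT_CODES:
        return True
    if failed and walltime_s and elapsed_s >= NEAR_LIMIT_FRACTION * walltime_s:
        return True
    return False
```

For example, a failed job with exit code 1 that ran 3,500 s of a 3,600 s request is classified as a timeout, while the same job stopping at 1,800 s is not.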

When the scheduler export includes a requested walltime field (Timelimit in Slurm, Resource_List.walltime in PBS, requested_walltime in XDMoD, RUN_LIMIT in LSF), ACE also reports:

| Finding | Definition |
| --- | --- |
| walltime_efficiency_rate | mean(elapsed / timelimit) across completed jobs |
| walltime_overrequest_pct | Fraction of jobs that used < 50% of their requested walltime |

A cluster where 60% of jobs use less than half their requested walltime is holding queue slots open longer than necessary. This waste is distinct from GPU underutilization and is invisible without this signal.
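
A minimal sketch of the two walltime findings, assuming each completed job is available as an `(elapsed_s, timelimit_s)` pair. The function name and the input shape are hypothetical; jobs without a usable time limit are skipped.

```python
def walltime_findings(jobs, overrequest_cutoff=0.5):
    """Compute walltime findings from (elapsed_s, timelimit_s) pairs."""
    ratios = [elapsed / limit for elapsed, limit in jobs if limit and limit > 0]
    if not ratios:
        return None  # no requested-walltime data, so nothing to report
    return {
        # Mean fraction of the requested walltime that jobs actually used.
        "walltime_efficiency_rate": sum(ratios) / len(ratios),
        # Share of jobs that used less than half of what they requested.
        "walltime_overrequest_pct": sum(r < overrequest_cutoff for r in ratios)
        / len(ratios),
    }


completed = [(1800, 7200), (3600, 3600), (600, 3600), (5400, 7200)]
findings = walltime_findings(completed)
```

In this sample two of the four jobs used under half their request, so `walltime_overrequest_pct` is 0.5.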

When DCGM telemetry is available:

  • Memory bandwidth (DCGM_FI_DEV_MEM_COPY_UTIL): effective utilization per GPU = max(compute%, memory_bandwidth%). Memory-bandwidth-bound workloads like LLM inference run at low compute utilization by design — using max() prevents incorrect flagging.
  • Memory capacity (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL): when framebuffer utilization exceeds 80%, ACE blocks right-sizing recommendations for that job. Reducing GPU count when model weights fill available memory causes OOM.
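
Both DCGM-derived rules are simple enough to sketch directly. The function names are hypothetical; the inputs are assumed to be per-GPU averages already extracted from the DCGM fields named above, with utilization in percent and framebuffer values in the same unit as each other.

```python
def effective_utilization(compute_pct, mem_bandwidth_pct):
    """Effective utilization: the larger of compute and memory-bandwidth use."""
    return max(compute_pct, mem_bandwidth_pct)


def is_memory_bound(fb_used, fb_total, cutoff=0.80):
    """True when framebuffer occupancy exceeds the right-sizing cutoff."""
    return fb_total > 0 and fb_used / fb_total > cutoff


# An inference job: low compute, high memory bandwidth, nearly full framebuffer.
util = effective_utilization(compute_pct=35.0, mem_bandwidth_pct=78.0)
blocked = is_memory_bound(fb_used=70_000, fb_total=80_000)
```

Judged on compute alone, this job would look 35% utilized. Taking the maximum reports 78%, and the 87.5% framebuffer occupancy blocks any GPU count reduction.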

ACE applies four guards before recommending a GPU count reduction:

| Guard | Condition | Outcome |
| --- | --- | --- |
| Memory-bound | mem_capacity_pct > 0.80 | Keep: reducing GPUs causes OOM |
| Group-healthy | Job is in a group where p90 util ≥ threshold | Keep: this run is an outlier; investigate, don’t right-size |
| High utilization | Util ≥ 80% | Keep: already efficient |
| Topology-aware | --topology-aware flag | Round recommendation down to nearest valid GPU count (1, 2, 4, 8, 16, 32, 64) |
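
The guards can be read as a short decision function. The sketch below is one possible arrangement, not ACE's code: the function names, the order in which the guards are checked, and the `proposed` input (a GPU count computed elsewhere) are all assumptions made for illustration.

```python
VALID_GPU_COUNTS = (1, 2, 4, 8, 16, 32, 64)


def round_down_to_valid(gpus):
    """Largest valid GPU count that does not exceed `gpus` (minimum 1)."""
    candidates = [c for c in VALID_GPU_COUNTS if c <= gpus]
    return candidates[-1] if candidates else VALID_GPU_COUNTS[0]


def right_size(requested, proposed, util, mem_capacity_pct,
               group_healthy, topology_aware=False):
    """Return (gpu_count, reason) after applying the four guards."""
    if mem_capacity_pct is not None and mem_capacity_pct > 0.80:
        return requested, "keep: memory-bound"
    if group_healthy:
        return requested, "keep: group-healthy, investigate this run"
    if util >= 0.80:
        return requested, "keep: high utilization"
    if topology_aware:
        proposed = round_down_to_valid(proposed)
    # Never recommend more GPUs than the job asked for.
    return min(proposed, requested), "reduce"
```

With `topology_aware=True`, a proposal of 3 GPUs for an 8-GPU job becomes 2, since 3 is not in the list of valid counts.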

The flagging threshold adapts to workload type, read from PROFILE via --profile:

| Workload type | Threshold | Rationale |
| --- | --- | --- |
| inference_realtime | 0.30 | Memory-bandwidth-bound; 30-40% compute util is correct |
| inference_batch | 0.35 | |
| preprocessing | 0.45 | CPU-GPU pipeline bottlenecks are expected |
| mixed | 0.50 | |
| training_small | 0.55 | |
| training_large | 0.60 | Default; calibrated for large-scale distributed training |
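
The table maps directly onto a lookup. In this sketch the threshold values come from the table, while the function names and the choice to fall back to the `training_large` default for an unrecognized workload type are assumptions.

```python
THRESHOLDS = {
    "inference_realtime": 0.30,
    "inference_batch": 0.35,
    "preprocessing": 0.45,
    "mixed": 0.50,
    "training_small": 0.55,
    "training_large": 0.60,
}
DEFAULT_WORKLOAD = "training_large"


def flag_threshold(workload_type):
    """Utilization threshold for a workload type, defaulting to training_large."""
    return THRESHOLDS.get(workload_type, THRESHOLDS[DEFAULT_WORKLOAD])


def is_flagged(util, workload_type):
    """A job is flagged when its utilization is below its workload threshold."""
    return util < flag_threshold(workload_type)
```

A job at 35% utilization is flagged as `training_large` but passes as `inference_realtime`, which is why the workload type matters.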

ACE groups runs of the same job script (by jobname from the scheduler) and computes a utilization distribution across all runs in the period:

  • Group p90 ≥ threshold (group-healthy): individual low-efficiency runs are outliers. ACE overrides the right-sizing recommendation to “investigate this run” — the script is well-tuned; this particular execution had a problem.
  • Group p90 < threshold (chronic underutilization): every run is inefficient. Right-size the request.

This prevents a script that normally runs at 85% GPU utilization from accumulating a permanent right-sizing recommendation because of one bad run.
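
A sketch of the group classification, under stated assumptions: runs are given as `(jobname, utilization)` pairs, p90 is computed with the nearest-rank method (the percentile method ACE uses is not specified here), and groups need at least two runs. All names are hypothetical.

```python
import math
from collections import defaultdict


def p90(values):
    """Nearest-rank 90th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.9 * len(ordered))
    return ordered[rank - 1]


def classify_groups(runs, threshold, min_runs=2):
    """Label each repeat-script group as healthy or chronically underutilized."""
    groups = defaultdict(list)
    for jobname, util in runs:
        groups[jobname].append(util)

    labels = {}
    for jobname, utils in groups.items():
        if len(utils) < min_runs:
            continue  # a single run is not a group
        healthy = p90(utils) >= threshold
        labels[jobname] = "group-healthy" if healthy else "chronic-underutilization"
    return labels


runs = [
    ("train.sh", 0.85), ("train.sh", 0.88), ("train.sh", 0.20), ("train.sh", 0.86),
    ("etl.sh", 0.15), ("etl.sh", 0.20), ("etl.sh", 0.25),
    ("once.sh", 0.10),
]
labels = classify_groups(runs, threshold=0.60)
```

Here `train.sh` stays group-healthy despite its one 20% run, `etl.sh` is chronic, and `once.sh` is skipped because it ran only once.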

When no per-job GPU utilization data is available (sacct-only, no DCGM), ACE accepts --assume-util to assign a uniform utilization to all jobs. When this mode is active, the export includes:

  • An assumed_utilization_warning finding with confidence: low
  • A top-level "warnings" list stating the value applied and that gpu_efficiency_rate is a uniform assumption, not a measurement

An assessment produced with assume_util can look numerically identical to a direct measurement. The warning makes the distinction visible.
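
One way to see why the warning matters: when every job is assigned the same utilization u, the weighted rate reduces to u regardless of the job mix, because Σ(u × gpus × hours) / Σ(gpus × hours) = u. The sketch below demonstrates this with hypothetical names and illustrative job sizes.

```python
def rate_with_assumed_util(jobs, assume_util):
    """Weighted efficiency when every job gets the same assumed utilization.

    `jobs` is a list of (requested_gpus, duration_hrs) pairs.
    """
    requested = sum(gpus * hours for gpus, hours in jobs)
    used = sum(gpus * assume_util * hours for gpus, hours in jobs)
    return used / requested


small_mix = [(8, 10.0), (4, 0.5), (8, 8.0)]
large_mix = [(64, 100.0), (1, 0.25)]

rate_small = rate_with_assumed_util(small_mix, 0.5)
rate_large = rate_with_assumed_util(large_mix, 0.5)
```

Both mixes return the assumed value, so the resulting number says nothing about the cluster itself.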

| Metric | Description |
| --- | --- |
| gpu_efficiency_rate | GPU-hours-weighted efficiency (GRADE primary) |
| gpu_efficiency_score | Per-job mean utilization (secondary) |
| jobs_submitted | Total GPU jobs submitted (all states) |
| jobs_analyzed | GPU jobs used for utilization analysis (completed + timeout) |
| gpu_hours_requested | Σ(requested_gpus × elapsed_hours) |
| gpu_hours_used | Σ(used_gpus × elapsed_hours) |
| flagged_jobs_pct | Fraction of jobs below the workload-type threshold |
| flagged_jobs_count | Count of flagged jobs |
| near_zero_jobs_pct | Fraction of jobs below 5% utilization |
| short_jobs_pct | Fraction of jobs completing under 1 minute |
| over_request_ratio | avg_requested / avg_used |
| job_failure_rate | Application-error failures / submitted |
| node_fail_rate | Hardware failures / submitted |
| job_cancel_rate | Cancellations / submitted |
| job_timeout_rate | Walltime-exceeded / submitted |
| early_fail_rate | Crash-on-start (< 5 min) / submitted |
| walltime_efficiency_rate | mean(elapsed / timelimit), when timelimit is available |
| walltime_overrequest_pct | Fraction using < 50% of requested walltime |
| gpu_memory_capacity_pct | mean(FB_USED/FB_TOTAL), DCGM only |
| job_groups_total | Repeat job script groups with ≥ 2 runs |
| chronic_underutil_groups | Groups where p90 utilization < threshold |
| assumed_utilization_warning | Emitted when assume_util is active |
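
Three of the simpler metrics can be sketched as follows. The record keys are hypothetical, and `over_request_ratio` is assumed to compare the mean requested GPU count with the mean used GPU count across jobs.

```python
def summary_metrics(jobs):
    """Compute near-zero, short-job, and over-request metrics.

    `jobs` is a non-empty list of dicts with the hypothetical keys
    `requested_gpus`, `used_gpus`, `utilization`, and `elapsed_s`.
    """
    n = len(jobs)
    avg_requested = sum(j["requested_gpus"] for j in jobs) / n
    avg_used = sum(j["used_gpus"] for j in jobs) / n
    return {
        "near_zero_jobs_pct": sum(j["utilization"] < 0.05 for j in jobs) / n,
        "short_jobs_pct": sum(j["elapsed_s"] < 60 for j in jobs) / n,
        "over_request_ratio": avg_requested / avg_used,
    }


jobs = [
    {"requested_gpus": 8, "used_gpus": 4.0, "utilization": 0.5, "elapsed_s": 3600},
    {"requested_gpus": 4, "used_gpus": 0.125, "utilization": 0.03125, "elapsed_s": 30},
    {"requested_gpus": 2, "used_gpus": 1.875, "utilization": 0.9375, "elapsed_s": 600},
    {"requested_gpus": 2, "used_gpus": 2.0, "utilization": 1.0, "elapsed_s": 45},
]
metrics = summary_metrics(jobs)
```

An `over_request_ratio` of 2.0, as in this sample, means jobs requested on average twice the GPUs they used.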

```shell
# Slurm: ingest sacct export, then analyze
ace ingest --source slurm-sacct \
  --input sacct_export.csv \
  --output canonical_ace.csv

ace analyze input --input canonical_ace.csv \
  --profile profile_output.json \
  --output ace_report.txt \
  --json-out ace_report.json

# Export to ptl_output_v1.json for GRADE
ace export \
  --input ace_report.json \
  --organization "MIT Supercloud" \
  --period 2026 \
  --output ace_output.json

# PBS Pro / OpenPBS
ace ingest --source pbs-accounting \
  --input /var/spool/PBS/server_priv/accounting/20260601 \
  --output canonical_ace.csv

# XDMoD (multi-scheduler)
ace ingest --source xdmod \
  --input xdmod_jobs_export.csv \
  --output canonical_ace.csv

# LSF
ace ingest --source lsf-bacct \
  --input lsf_jobs.csv \
  --output canonical_ace.csv

# With DCGM telemetry (any scheduler)
ace ingest --source slurm-sacct \
  --input sacct_export.csv \
  --telemetry dcgm_telemetry.csv \
  --telemetry-util-col smutilization_pct_avg \
  --telemetry-mem-util-col memutilization_pct_avg \
  --output canonical_ace.csv

# Kubernetes / DCGM aggregate → ptl_output directly
ace analyze input \
  --input k8s_pod_metrics.json \
  --input-path k8s_metrics \
  --organization "My Cluster" \
  --period 2026 \
  --ptl-output ace_output.json
```