
ACE — Adaptive Compute Efficiency Engine

The Adaptive Compute Efficiency Engine (ACE) measures GPU utilization — the fraction of allocated GPU compute that jobs actually use. ACE is the highest-weighted engine in the PTL composite and the most direct measure of whether a cluster is doing the work it claims to do.

ACE produces a single primary output, gpu_efficiency_score — a number from 0.0 to 1.0, equal to the mean of per-job GPU utilization across all analyzed jobs.

The score is linear and direct: 25.7% average GPU utilization produces a score of 0.257. There is no curve and no adjustment for workload type. A cluster where jobs request eight GPUs and use two earns a 0.25.

Formally, over all jobs in the assessment window:

gpu_efficiency_score = (1/N) × Σ(used_gpus_i / requested_gpus_i)
where:
N = number of jobs analyzed
used_gpus_i = GPUs actually utilized by job i (fractional GPU-equivalents)
requested_gpus_i = GPUs requested by job i


Input (three jobs from a Slurm sacct export):

| Job ID | GPUs requested | GPUs used | Utilization |
|--------|----------------|-----------|-------------|
| 1041   | 8              | 7.2       | 0.900       |
| 1042   | 4              | 0.8       | 0.200       |
| 1043   | 8              | 6.4       | 0.800       |

Calculation:

gpu_efficiency_score = (0.900 + 0.200 + 0.800) / 3
= 1.900 / 3
= 0.633

Result: ACE score 0.633 — Capable range. Job 1042 is flagged for right-sizing: it requested 4 GPUs but used only 0.8, contributing 0.200 to the mean. ATLAS would trace the job to its submission script and recommend reducing the GPU request to 1.
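The formula and worked example above can be reproduced in a few lines. This is a minimal sketch: the function name and the job-record shape are illustrative, not ACE's actual internals.

```python
# Sketch of the gpu_efficiency_score calculation: the mean of
# per-job utilization (used_gpus / requested_gpus). Names are
# illustrative, not ACE's real code.

def gpu_efficiency_score(jobs):
    """Mean of used_gpus / requested_gpus across all jobs."""
    if not jobs:
        return 0.0
    return sum(j["used_gpus"] / j["requested_gpus"] for j in jobs) / len(jobs)

# The three jobs from the sacct example above:
jobs = [
    {"job_id": "1041", "requested_gpus": 8.0, "used_gpus": 7.2},
    {"job_id": "1042", "requested_gpus": 4.0, "used_gpus": 0.8},
    {"job_id": "1043", "requested_gpus": 8.0, "used_gpus": 6.4},
]

print(round(gpu_efficiency_score(jobs), 3))  # 0.633
```

Because the score is an unweighted mean, a single badly right-sized job (like 1042) drags the cluster score down regardless of how short or long it ran.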

Validated against: MIT Supercloud HPCA22 dataset — 73,367 production jobs, gpu_efficiency_score = 0.257.

ACE supports four input paths, selected based on PROFILE’s routing manifest:

  • claw — CLAW telemetry (highest fidelity, process-level GPU activity)
  • dcgm — NVIDIA DCGM metrics (DCGM_FI_DEV_GPU_UTIL)
  • k8s_metrics — Kubernetes pod GPU metrics
  • sacct — Slurm scheduler accounting (standard; used when nothing else is available)

PROFILE sets the priority order. If DCGM metrics are available alongside sacct, ACE uses DCGM.
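The routing behavior can be sketched as a priority lookup. The full priority order below is an assumption inferred from the fidelity ranking in the list (claw highest, sacct the fallback); only the DCGM-over-sacct preference is stated explicitly above.

```python
# Assumed priority order: claw > dcgm > k8s_metrics > sacct.
# Only "DCGM over sacct" is documented; the rest is inferred.
PATH_PRIORITY = ["claw", "dcgm", "k8s_metrics", "sacct"]

def select_input_path(available):
    """Return the highest-fidelity input path present in `available`."""
    for path in PATH_PRIORITY:
        if path in available:
            return path
    raise ValueError("no supported input path available")

print(select_input_path({"sacct", "dcgm"}))  # dcgm
```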

| Field | Description |
|-------|-------------|
| gpu_efficiency_score | Primary score (0.0–1.0) |
| jobs_analyzed | Number of jobs in the assessment window |
| avg_job_gpu_utilization | Mean GPU utilization across all jobs |
| gpu_hours_requested | Total GPU-hours requested by jobs |
| gpu_hours_used | Total GPU-hours actually utilized |
| flagged_jobs_count | Jobs below the 40% utilization threshold |
| near_zero_jobs_pct | Jobs below 5% utilization (likely walltime padding) |
| short_jobs_pct | Jobs under 5 minutes (scheduling overhead concern) |
| jobs_flagged_for_rightsizing | Jobs where right-sizing could improve score |
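Most of these output fields are simple aggregates over the per-job records. The sketch below shows one plausible derivation, using the thresholds stated in the field descriptions (40% flagging, 5% near-zero, 5-minute short jobs); the function name and record shape are assumptions, not ACE's implementation.

```python
# Illustrative derivation of ACE output fields from per-job records.
# Thresholds (0.40, 0.05, 5 min) come from the field descriptions above.

def summarize(jobs):
    utils = [j["used_gpus"] / j["requested_gpus"] for j in jobs]
    hours = [j["duration_minutes"] / 60.0 for j in jobs]
    return {
        "gpu_efficiency_score": sum(utils) / len(utils),
        "jobs_analyzed": len(jobs),
        "avg_job_gpu_utilization": sum(utils) / len(utils),
        "gpu_hours_requested": sum(j["requested_gpus"] * h for j, h in zip(jobs, hours)),
        "gpu_hours_used": sum(j["used_gpus"] * h for j, h in zip(jobs, hours)),
        "flagged_jobs_count": sum(u < 0.40 for u in utils),
        "near_zero_jobs_pct": 100.0 * sum(u < 0.05 for u in utils) / len(jobs),
        "short_jobs_pct": 100.0 * sum(j["duration_minutes"] < 5 for j in jobs) / len(jobs),
    }

# Example jobs with assumed runtimes (durations are hypothetical):
jobs = [
    {"requested_gpus": 8.0, "used_gpus": 7.2, "duration_minutes": 120.0},
    {"requested_gpus": 4.0, "used_gpus": 0.8, "duration_minutes": 60.0},
    {"requested_gpus": 8.0, "used_gpus": 6.4, "duration_minutes": 30.0},
]
result = summarize(jobs)
```

Note that gpu_hours_requested and gpu_hours_used weight jobs by runtime, while gpu_efficiency_score does not, so the two views of waste can disagree.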

ACE accepts a JSON or CSV input depending on the input path. For the sacct path:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| job_id | string | yes | Slurm job identifier |
| requested_gpus | float | yes | GPUs allocated by the scheduler |
| used_gpus | float | yes | GPUs actually utilized |
| duration_minutes | float | yes | Actual job runtime in minutes |
| state | string | yes | Job final state (COMPLETED, FAILED, etc.) |
| submit_time | ISO8601 | no | Job submission timestamp |
| user | string | no | Submitting user (for per-user findings) |
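A loader for the sacct path can be sketched against this schema. The assumption here is a CSV with one job per row and headers matching the field names above; ACE's actual parser may accept other layouts.

```python
# Sketch of sacct-path CSV loading and required-field validation.
# Assumes headers match the schema field names; illustrative only.
import csv
import io

REQUIRED = {"job_id", "requested_gpus", "used_gpus", "duration_minutes", "state"}

def load_sacct_csv(text):
    jobs = []
    for row in csv.DictReader(io.StringIO(text)):
        present = {k for k, v in row.items() if v not in (None, "")}
        missing = REQUIRED - present
        if missing:
            raise ValueError(f"job record missing required fields: {sorted(missing)}")
        jobs.append({
            "job_id": row["job_id"],
            "requested_gpus": float(row["requested_gpus"]),
            "used_gpus": float(row["used_gpus"]),
            "duration_minutes": float(row["duration_minutes"]),
            "state": row["state"],
        })
    return jobs

sample = """job_id,requested_gpus,used_gpus,duration_minutes,state
1042,4,0.8,60,COMPLETED
"""
jobs = load_sacct_csv(sample)
```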

For the DCGM path, ACE accepts DCGM_FI_DEV_GPU_UTIL metric streams. For the Kubernetes path, ACE accepts pod GPU metric exports from kubectl top or metrics-server.

An ACE score below 0.40 typically indicates one of three patterns: walltime padding (jobs request 24 hours, run for 4), GPU over-requesting (scripts request 8 GPUs for workloads that saturate 2), or idle job proliferation (misconfigured scripts that acquire GPUs and stall).
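The three patterns can be separated with simple per-job heuristics. The thresholds below are illustrative assumptions, not ACE's documented cutoffs, and detecting walltime padding requires a requested-walltime field (here `requested_minutes`) that is not part of the required sacct schema above.

```python
# Hedged heuristics for the three low-score patterns. Thresholds and
# the requested_minutes field are assumptions for illustration.

def classify(job):
    util = job["used_gpus"] / job["requested_gpus"]
    runtime_frac = job["duration_minutes"] / job["requested_minutes"]
    if util < 0.05:
        return "idle"              # acquires GPUs, then stalls near zero
    if runtime_frac < 0.25:
        return "walltime_padding"  # e.g. requests 24 hours, runs for 4
    if util < 0.40:
        return "over_requesting"   # saturates only a fraction of its GPUs
    return "healthy"
```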

ATLAS receives the ACE findings and generates specific remediation text. For MIT Supercloud (score 0.257), ATLAS recommended auditing the top 10 job scripts by GPU-hours wasted and enabling --gpu-bind=closest in Slurm defaults.
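The "top 10 job scripts by GPU-hours wasted" ranking can be sketched as follows. The waste formula — (requested minus used GPUs) times runtime in hours — is an assumption about how ATLAS ranks jobs, and the example durations are hypothetical.

```python
# Sketch of ranking jobs by GPU-hours wasted; the waste formula
# is an assumed definition, not ATLAS's documented metric.

def top_wasters(jobs, n=10):
    def wasted(j):
        return (j["requested_gpus"] - j["used_gpus"]) * j["duration_minutes"] / 60.0
    return sorted(jobs, key=wasted, reverse=True)[:n]

# The worked-example jobs, with hypothetical runtimes:
jobs = [
    {"job_id": "1041", "requested_gpus": 8.0, "used_gpus": 7.2, "duration_minutes": 120.0},
    {"job_id": "1042", "requested_gpus": 4.0, "used_gpus": 0.8, "duration_minutes": 60.0},
    {"job_id": "1043", "requested_gpus": 8.0, "used_gpus": 6.4, "duration_minutes": 30.0},
]
print([j["job_id"] for j in top_wasters(jobs)])  # ['1042', '1041', '1043']
```

Ranking by GPU-hours wasted rather than by utilization alone keeps the audit focused on the jobs with the largest absolute reclaimable capacity.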

```sh
# Analyze a Slurm sacct export
ace analyze --input jobs.csv --input-path sacct

# Analyze with DCGM telemetry
ace analyze --input dcgm_metrics.json --input-path dcgm

# Analyze a Kubernetes cluster
ace analyze --input k8s_pod_metrics.json --input-path k8s_metrics

# Analyze with CLAW agent output
ace analyze --input claw_package.json --input-path claw

# Output to JSON
ace analyze --input jobs.csv --output ace_result.json
```