ACE Engine
The Adaptive Compute Efficiency Engine (ACE) is the first engine in the Plain Theory Labs certification framework. It takes the raw output of an HPC scheduler — job accounting records, GPU telemetry, runtime data — and produces a graded efficiency analysis with traceable findings, disclosed assumptions, and concrete recommendations. It is built to turn HPC operational data into a defensible grade, not a dashboard.
What ACE does that nothing else does
The distinction matters. Monitoring tools show utilization. Reporting tools show trends. ACE grades. It applies a scoring methodology to each job in a scheduler export, evaluates whether the GPU resources requested were proportionate to the GPU resources used, applies a safety overlay that blocks recommendations it cannot defend with the available data, and produces a structured output that explains every finding in terms an operator can act on and a stakeholder can understand.
No tool like this existed. Researchers and engineers working in HPC sustainability have had access to telemetry data for years with no systematic way to convert it into efficiency grades that carry policy weight. ACE is the first implementation of that conversion — built on a methodology that is open, assumption-explicit, and designed from the start to be extended, audited, and improved.
What the engine produces
For each job in the input dataset, ACE produces an efficiency score, a right-sizing recommendation with a confidence label reflecting the quality of the underlying data, and an environmental impact estimate covering energy, carbon, and water. Every coefficient used in the environmental calculation is documented in the output and configurable — if your site has measured power usage effectiveness or grid carbon intensity, the engine uses it. If not, it uses documented defaults and says so.
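The shape of a coefficient-based estimate like this can be sketched in a few lines. The function name, parameter names, and default values below are illustrative assumptions for the sketch, not the engine's actual schema or documented defaults:

```python
# Sketch of a coefficient-based environmental impact estimate.
# All names and default coefficient values here are assumptions for
# illustration; the engine documents its own defaults in the report.

def estimate_impact(gpu_hours, avg_power_watts,
                    pue=1.5,                 # assumed power usage effectiveness
                    carbon_kg_per_kwh=0.4,   # assumed grid carbon intensity
                    water_l_per_kwh=1.8):    # assumed water usage effectiveness
    """Project energy, carbon, and water from documented coefficients."""
    energy_kwh = gpu_hours * avg_power_watts / 1000.0 * pue
    return {
        "energy_kwh": energy_kwh,
        "carbon_kg": energy_kwh * carbon_kg_per_kwh,
        "water_l": energy_kwh * water_l_per_kwh,
    }

# 100 GPU-hours at an average 300 W draw under the assumed coefficients:
impact = estimate_impact(gpu_hours=100, avg_power_watts=300)
```

Because every coefficient is a plain keyword argument, a site with measured PUE or grid intensity simply passes its own values in place of the defaults.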
The output is a JSON payload and a zero-dependency single-file HTML report — interactive tables, inline SVG charts, summary findings, and full methodology disclosure. Any operator can open it on any machine. Any stakeholder can read it without specialized tooling. A finding that requires infrastructure to view is a finding that will not travel.
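A zero-dependency single-file report of this kind can be produced with nothing beyond the standard library. The sketch below is an assumption about the general technique — inline SVG plus an embedded payload in one HTML string — not the engine's actual report layout or field names:

```python
# Minimal sketch of a zero-dependency, single-file HTML report:
# a JSON payload and inline SVG bars in one self-contained string.
# Field names ("job", "score") and layout are illustrative assumptions.
import json

def render_report(findings):
    payload = json.dumps(findings, indent=2)
    # One horizontal bar per finding, scaled to a 200px-wide score axis.
    bars = "".join(
        f'<rect x="0" y="{i * 14}" width="{f["score"] * 200:.0f}" height="10"/>'
        for i, f in enumerate(findings)
    )
    html = (
        "<!DOCTYPE html><html><body>"
        f"<svg width='200' height='{len(findings) * 14}'>{bars}</svg>"
        f"<pre>{payload}</pre></body></html>"
    )
    return payload, html
```

Nothing here needs a web server or a JavaScript bundle; the resulting file opens in any browser, which is the property the engine's report relies on.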
MIT Supercloud HPCA22 dataset — 35,745 production jobs
25.7% average GPU utilization
16% median GPU utilization
41% of jobs ran under one minute
33% of jobs showed near-zero GPU activity
These are not projections. This is what the engine found.
Signal to structure: noisy scheduler traces become scored, reviewable outputs.
How the pipeline works
ACE runs five deterministic stages:
Ingest takes a Slurm sacct export or canonical CSV, normalizes job identifiers, and joins GPU telemetry where available.
Score computes per-job utilization against the requested allocation, applies the over-request heuristic, and assigns a weighted efficiency score.
Safety overlay evaluates each recommendation against the confidence of the underlying data — if the data cannot support a downsizing recommendation, the recommendation is withheld, not softened.
Impact estimation projects energy, carbon, water, and cost from physical coefficients, all documented and configurable.
Report generates the JSON payload and the single-file HTML output.
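The score and safety-overlay stages can be sketched as two small functions. The field names, the 0.5 threshold, and the rule that proxy-derived utilization blocks downsizing are all assumptions for illustration; the engine's actual heuristics and labels may differ:

```python
# Sketch of the score and safety-overlay stages under assumed names,
# thresholds, and labels; the real engine's rules may differ.

def score_job(gpu_hours_used, gpu_hours_requested):
    """Efficiency as the fraction of the requested allocation actually used."""
    if gpu_hours_requested <= 0:
        return 0.0
    return min(gpu_hours_used / gpu_hours_requested, 1.0)

def recommend(score, telemetry_source):
    """Withhold, rather than soften, recommendations the data cannot support."""
    # Assumed rule: proxy-derived utilization is not confident enough
    # to defend a downsizing recommendation.
    if telemetry_source == "proxy":
        return {"action": "withheld", "reason": "insufficient telemetry confidence"}
    if score < 0.5:
        return {"action": "downsize", "confidence": "high"}
    return {"action": "keep", "confidence": "high"}
```

The point of the overlay is visible in the control flow: a low score alone never produces a downsizing recommendation unless the telemetry behind it can carry the weight.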
Every stage is deterministic. Run the same input twice and you get the same output. This is a design requirement, not an implementation detail — a certification framework that produces different findings on the same data is not a certification framework.
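Determinism of this kind is easy to check mechanically: serialize the output of two runs on identical input and compare digests. The pipeline stand-in below is a hypothetical pure function, used only to illustrate the check:

```python
# Determinism check sketch: two runs on the same input must hash the same.
# run_pipeline is a hypothetical stand-in for the five-stage pipeline,
# written as a pure function of its input with a stable sort order.
import hashlib
import json

def run_pipeline(jobs):
    return sorted(
        ({"job": j["job"], "score": j["used"] / j["req"]} for j in jobs),
        key=lambda r: r["job"],
    )

def output_digest(jobs):
    blob = json.dumps(run_pipeline(jobs), sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

jobs = [{"job": "b", "used": 4, "req": 16}, {"job": "a", "used": 1, "req": 8}]
assert output_digest(jobs) == output_digest(jobs)  # same input, same output
```

The stable sort and `sort_keys=True` serialization are what make the digest comparison meaningful: any hidden dependence on iteration order or timestamps would surface as a mismatch.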
The start of something larger
ACE is Phase 1 of a five-engine certification framework. Future engines will address workload scheduling efficiency, cooling system performance, real-time carbon intensity, and hardware procurement decisions. Each engine follows the same design principles: ingest real operational data, apply a transparent methodology, produce graded findings with disclosed assumptions.
The goal is a certification standard that institutions can use before external regulation defines one for them — built by engineers who have operated these systems, validated on real datasets, open to scrutiny. ACE is the proof that the approach works.
Assumptions and current limits
Phase 2B0 has no live telemetry join. GPU utilization is derived from sidecar files or proxy estimates where direct measurement is not available. Confidence labels in the output reflect this — a recommendation derived from a proxy estimate carries a different label than one derived from DCGM telemetry. The methodology is explicit about the difference.
Environmental impact numbers are coefficient-based estimates, not metered totals. Default coefficients are documented. If your site has measured data, use it. Phase 2B1 will add windowed telemetry joins from DCGM and Prometheus, JobID normalization improvements, and per-window provenance scoring so every recommendation is traceable to its source.
Status
Phase 2B0 is complete. 118 tests passing. Validated on the MIT Supercloud HPCA22 dataset and the Google Borg CPU proxy dataset. Phase 2B1 is in active development.
Talk to us about your cluster