Recoverable Value
GPU clusters are expensive to buy, expensive to power, and expensive to cool. Most of them are also substantially underutilized. PTL quantifies that gap — and this page shows what closing it is worth.
These numbers come from the same public datasets used to evaluate ACE. The math is shown. You can check it.
What the data shows
Section titled “What the data shows”Across three public GPU cluster datasets (508,885 jobs):
| Dataset | Jobs | gpu_efficiency_rate | GPU-hours wasted per 100 allocated |
|---|---|---|---|
| MIT Supercloud HPCA22 | 73,367 | 0.339 | 66.1 hours |
| Alibaba Helios 2020 | 361,498 | 0.214 | 78.6 hours |
| Microsoft Philly 2017 | 74,020 | 0.502 | 49.8 hours |
gpu_efficiency_rate is the GPU-hours weighted metric. It measures what fraction of
allocated GPU-hours did useful compute work. The rest is waste.
A cluster at 0.339 efficiency is consuming 2.95× the energy and capital required
to do its actual workload. The other 1.95× is overhead.
The calculation
Section titled “The calculation”For a 100-GPU cluster running continuously:
GPU-hours per year = 100 GPUs × 8,760 hours = 876,000 GPU-hours/yearAt MIT Supercloud’s efficiency rate (0.339):
Useful GPU-hours = 876,000 × 0.339 = 297,004Wasted GPU-hours = 876,000 × 0.661 = 578,996Cloud pricing reference (NVIDIA A100, on-demand):
| Provider | A100 80GB | Hourly rate |
|---|---|---|
| AWS (p4d.24xlarge / 8 GPU) | per GPU | ~$4.10 |
| Azure (ND96asr_v4 / 8 GPU) | per GPU | ~$3.84 |
| GCP (a2-highgpu-8g / 8 GPU) | per GPU | ~$3.67 |
Using a conservative $3.50/GPU-hour:
Annual waste = 578,996 GPU-hours × $3.50 = $2,026,486/yearOn a 100-GPU cluster at 33.9% efficiency, over $2 million per year is allocated but not doing useful work.
Recoverable value at different cluster sizes
Section titled “Recoverable value at different cluster sizes”These tables use the MIT Supercloud baseline (0.339 efficiency) and a conservative $3.50/GPU-hour. All figures assume continuous operation.
Moving from 0.339 → 0.55 efficiency (DEVELOPING → CAPABLE tier)
Section titled “Moving from 0.339 → 0.55 efficiency (DEVELOPING → CAPABLE tier)”| Cluster size | Annual waste (current) | Annual waste (target) | Recovered/year |
|---|---|---|---|
| 50 GPUs | $1.01M | $652K | $362K |
| 100 GPUs | $2.03M | $1.30M | $724K |
| 500 GPUs | $10.1M | $6.52M | $3.6M |
| 1,000 GPUs | $20.3M | $13.0M | $7.2M |
| 5,000 GPUs | $101M | $65.2M | $36M |
Moving from 0.339 → 0.70 efficiency (CAPABLE → OPTIMIZED tier)
Section titled “Moving from 0.339 → 0.70 efficiency (CAPABLE → OPTIMIZED tier)”| Cluster size | Annual waste (current) | Annual waste (target) | Recovered/year |
|---|---|---|---|
| 50 GPUs | $1.01M | $459K | $554K |
| 100 GPUs | $2.03M | $919K | $1.1M |
| 500 GPUs | $10.1M | $4.6M | $5.5M |
| 1,000 GPUs | $20.3M | $9.2M | $11.1M |
| 5,000 GPUs | $101M | $45.9M | $55M |
For owned hardware (capital cost perspective)
Section titled “For owned hardware (capital cost perspective)”Cloud pricing overstates the marginal cost of underutilization for owned hardware. The relevant number for on-premise clusters is capital utilization and energy.
Capital: An NVIDIA H100 SXM5 lists at approximately $30,000–$40,000 per GPU. A 100-GPU cluster represents $3M–$4M in capital. At 0.339 efficiency, the effective capital doing useful work is $1.0M–$1.4M. The other $2M–$2.6M is allocated but idle.
Energy: GPU clusters running at 30–35% efficiency consume roughly 3× the power required for the actual workload. At $0.10/kWh and 300W per GPU:
100 GPUs × 300W × 8,760 hrs × $0.10/kWh = $262,800/year (electricity alone)At 0.339 efficiency: $178K of that powers wasteCooling: PUE of 1.4 (typical data center) adds 40% on top of compute power. Wasted GPU power drives wasted cooling energy at the same ratio.
What PTL measures and what moves the number
Section titled “What PTL measures and what moves the number”PTL assessment identifies the specific levers. ACE produces a ranked list of right-sizing recommendations — the exact jobs, scripts, and workload types contributing most to waste, with estimated GPU-hour impact per fix.
The four highest-impact findings in order of typical contribution:
-
Chronic underutilization — jobs that consistently request 8 GPUs and use the equivalent of 1–2. These dominate
gpu_efficiency_ratebecause they are large and long. Right-sizing the GPU request directly recovers those hours. -
Near-zero utilization — jobs below 5% GPU utilization. Typically stuck on data loading, incorrect GPU affinity, or framework misconfiguration. Fixing one class of issue can move the cluster rate by several percentage points.
-
Walltime over-request — jobs that hold queue slots for 24 hours and finish in 2. Does not directly waste GPU-hours but reduces throughput and effective cluster capacity, increasing the per-job cost of everything else.
-
Failed job rate — jobs that crash within 5 minutes of start have consumed GPU allocation with zero useful output. At 10% early-fail rate on a 1,000-GPU cluster, that is 876,000 GPU-hours/year producing nothing.
What PTL does not claim
Section titled “What PTL does not claim”PTL does not claim that any specific organization can achieve any specific efficiency improvement. These are potential recoveries if efficiency improves — the underlying driver of improvement is operational change, not the assessment itself.
PTL measures the current state accurately, identifies what is driving waste, and tracks whether changes made between assessments moved the number. The recoverable value is the difference. What you do with the findings is up to you.
Source data
Section titled “Source data”All efficiency rates on this page come from public datasets. The converters and canonical CSVs are in the ptl-engines repository. GPU pricing figures are from public cloud pricing pages as of May 2026 and may change. Energy cost is a conservative US average from EIA.
See ACE methodology for the exact efficiency formula. See Scoring methodology for how ACE feeds the PTL Score.