THE ERA OF GIGAWATT‑SCALE AI DATA CENTERS (2026)
A Complete Research Report with Diagrams & Company Positioning
1. Executive Summary
By 2026, AI data centers have crossed into gigawatt‑scale industrial infrastructure. Clusters that once held 2,000–8,000 GPUs now exceed 100,000 GPUs per site, with only 5–7 such clusters operational globally.
This report explains:
- Why building an AI data center is far more complex than “buy GPUs”
- The rise of 100k+ GPU clusters
- The networking revolution (AEC, optics, CXL, PCIe 6)
- Cooling and energy constraints
- The global construction boom (831 sites, 23.1 GW)
- Where Celestica (CLS), Astera Labs, and Vertiv fit in the stack
2. Why Building an AI Data Center Is Hard
At first glance, it seems simple:
“Buy NVIDIA GPUs and plug them in.”
In reality, a hyperscale AI cluster requires:
- GPUs
- HBM memory
- PCIe/CXL connectivity
- AEC/DAC/optical cables
- Switches (800G → 1.6T)
- Racks, power distribution, cooling loops
- Substation‑scale electrical infrastructure
Here is the real architecture:
+------------------------------------------------------------+
|                    AI DATA CENTER STACK                    |
+------------------------------------------------------------+
| AI MODELS / TRAINING FRAMEWORKS (PyTorch, JAX, Megatron)   |
+------------------------------------------------------------+
| GPU SERVERS (H100, B200, MI450, Rubin)                     |
|   - HBM3E / HBM4 memory                                    |
+------------------------------------------------------------+
| IN-RACK CONNECTIVITY                                       |
|   - PCIe 5/6                                               |
|   - CXL 3.0 memory pooling                                 |
|   - AEC / DAC / Optics                                     |
+------------------------------------------------------------+
| FABRIC SWITCHING                                           |
|   - 800G → 1.6T Ethernet                                   |
|   - Leaf / Spine / Super-Spine                             |
+------------------------------------------------------------+
| POWER + COOLING                                            |
|   - Liquid cooling (DLC, immersion)                        |
|   - CDUs, pumps, heat exchangers                           |
|   - UPS, switchgear, substations                           |
+------------------------------------------------------------+
| CAMPUS INFRASTRUCTURE                                      |
|   - 100–1,500 MW power                                     |
|   - Water, fiber, buildings                                |
+------------------------------------------------------------+
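For readers who prefer the stack in machine-readable form, here is a minimal Python sketch that mirrors the diagram above. The layer names and components are taken straight from the diagram; this is an illustration, not a real deployment or configuration schema.

```python
# The AI data center stack from the diagram above, expressed as a simple
# ordered structure. Illustrative only; not a real configuration format.

AI_DC_STACK = [
    ("AI models / frameworks", ["PyTorch", "JAX", "Megatron"]),
    ("GPU servers",            ["H100 / B200 / MI450 / Rubin", "HBM3E / HBM4"]),
    ("In-rack connectivity",   ["PCIe 5/6", "CXL 3.0 pooling", "AEC / DAC / optics"]),
    ("Fabric switching",       ["800G -> 1.6T Ethernet", "leaf / spine / super-spine"]),
    ("Power + cooling",        ["DLC / immersion", "CDUs", "UPS / switchgear / substations"]),
    ("Campus infrastructure",  ["100-1,500 MW power", "water", "fiber", "buildings"]),
]

# Walk the stack top-down -- the order in which a training job depends on it.
for layer, components in AI_DC_STACK:
    print(f"{layer:<24} {', '.join(components)}")
```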
3. The World’s Largest AI Clusters (as of April 2026)
xAI “Colossus” — Memphis, TN
- 100,000 NVIDIA Hopper GPUs
- Fully liquid‑cooled (Supermicro)
- Massive AEC deployment (Credo)
Meta “Grand Teton”
- Two 100k clusters (US + EU)
- Target: 600,000 H100‑equivalent GPUs by end of 2026
Microsoft/OpenAI “Phase 4”
- Multiple 100k “Eagle” clusters
- Built for GPT‑5 and successors
Tesla Dojo + H100
- Dojo rebooted in 2026
- 50k H100 cluster → expanding to 100k+
4. The Next Wave Under Construction
Microsoft “Stargate”
- $100B project
- Three campuses: Abilene, West Virginia, Wisconsin
- Each: 1.2 GW
- Will house millions of GPUs
Nscale (Norway + West Virginia)
- 1.35 GW deal with Microsoft (April 2026)
- Deploying NVIDIA Vera Rubin
- Requires 1.6T AEC connectivity
Meta “Helios”
- $100B AMD MI450 deal
- Open‑standard racks (OCP)
5. Global Construction Pipeline
+-------------------------------+
|  GLOBAL AI DC PIPELINE 2026   |
+-------------------------------+
| Sites under construction: 831 |
| Total power: 23.1 GW          |
| US share: 15.9 GW             |
+-------------------------------+
| 23 GW = power for 17M homes   |
+-------------------------------+
This is the largest infrastructure buildout since the creation of the electrical grid.
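As a sanity check on the homes comparison, the arithmetic works out. The 23.1 GW and 831-site figures come from the table above; the ~1.2 kW average household draw is an assumed typical US value, not a number from this report.

```python
# Back-of-envelope check on the "23 GW = power for 17M homes" line.
# PIPELINE_GW and SITES come from the table above; AVG_HOME_KW is an
# assumed average US household load (~10,500 kWh/year), not a report figure.

PIPELINE_GW = 23.1
SITES = 831
AVG_HOME_KW = 1.2

homes_powered = PIPELINE_GW * 1e6 / AVG_HOME_KW   # GW -> kW, then divide per home
avg_site_mw = PIPELINE_GW * 1e3 / SITES           # average site size under construction

print(f"Implied homes served: {homes_powered / 1e6:.1f} M")      # ~19 M, near the 17 M quoted
print(f"Average site under construction: {avg_site_mw:.0f} MW")  # ~28 MW
```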
6. Memory: The Real Bottleneck
HBM is now the limiting reagent of AI compute.
- HBM3E: 5–8 TB/s
- HBM4 (2027): 10+ TB/s
- Supply fully sold out through 2026
- 40–60% of GPU BOM cost
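To put the per-GPU numbers in cluster terms, a rough calculation using the HBM3E range above (illustrative arithmetic only, assuming a 100k-GPU cluster as described in Section 3):

```python
# Aggregate HBM bandwidth for a 100,000-GPU cluster, using the per-GPU
# HBM3E range quoted above (5-8 TB/s). Illustrative arithmetic only.

GPUS = 100_000
HBM3E_TBPS = (5, 8)   # per-GPU bandwidth range from this section

low, high = (GPUS * b for b in HBM3E_TBPS)
print(f"Cluster-wide HBM bandwidth: {low/1e6:.1f}-{high/1e6:.1f} EB/s")  # 0.5-0.8 EB/s
```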
7. Networking & Interconnects
7.1 Why Copper Still Dominates Inside the Rack
| Technology | Material | Distance | Notes |
|---|---|---|---|
| Passive DAC | Copper | < 2 m | Too short at 800G and above |
| Optical (AOC/Fiber) | Glass + lasers | 10 m – 10 km | Expensive, power-hungry, runs hot |
| AEC (Active Copper) | Copper + DSP | 3 – 7 m | The sweet spot for rack-scale links |
Why AEC exploded: hyperscalers hit a wall at 800G and beyond.
- Passive copper: too short (< 2 m)
- Optics: too expensive and too hot
- AEC: the only option that covers 3–7 m rack-scale runs at acceptable cost and power (see the sketch below)
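That decision reduces to a reach-based rule of thumb. The sketch below uses the ballpark distances from the table above; the thresholds are illustrative, not vendor specifications.

```python
# Toy link-media chooser based on the reach figures in the table above
# (passive DAC < ~2 m, AEC ~3-7 m, optics beyond that). Thresholds are the
# report's ballpark numbers, not vendor specs.

def pick_link_medium(reach_m: float) -> str:
    """Pick the cheapest viable 800G link medium for a given reach."""
    if reach_m <= 2:
        return "passive DAC"           # cheapest, lowest power, but very short
    if reach_m <= 7:
        return "AEC (active copper)"   # DSP-retimed copper covers rack-scale runs
    return "optics (AOC/transceiver)"  # only option for row- and campus-scale links

for reach in (1, 3, 5, 10, 500):
    print(f"{reach:>4} m -> {pick_link_medium(reach)}")
```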
Digital vs Analog AEC
| Feature | Credo (Digital AEC) | Semtech (Analog ACC) |
|---|---|---|
| Signal processing | Full DSP + re‑clocking | Analog equalization |
| Max distance | ~7 m | ~3–4 m |
| Power consumption | Higher | Lower |
| Cable thickness | Thinner | Thicker |
8. Switches (800G → 1.6T)
Switches are the central nervous system of a 100k GPU cluster.
- Vendors: NVIDIA, Broadcom, Arista, Cisco
- Power per switch: 5–15 kW
- A 100k GPU cluster may require 10,000+ switches
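Those two figures compound quickly. Using the report's own ranges (10,000+ switches at 5–15 kW each), the fabric alone draws tens of megawatts:

```python
# Fabric power, using the figures above: ~10,000 switches at 5-15 kW each.
# Pure arithmetic on the report's ranges; real counts depend on radix and topology.

SWITCHES = 10_000
KW_PER_SWITCH = (5, 15)

low_mw, high_mw = (SWITCHES * kw / 1_000 for kw in KW_PER_SWITCH)
print(f"Switching power alone: {low_mw:.0f}-{high_mw:.0f} MW")  # 50-150 MW
```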
9. Cooling
GPUs now draw:
- H100: ~700W
- B200: ~1000W
- Rubin: ~1200W+
Air cooling is effectively dead at this density.
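A rack-level calculation shows why. The per-GPU wattages are the figures above; the 72-GPU rack and the ~20 kW air-cooling ceiling are assumptions for illustration, not figures from this report.

```python
# Why air cooling breaks down: rack-level heat load at the GPU wattages above.
# GPUS_PER_RACK and AIR_LIMIT_KW are assumed values for illustration only.

GPU_WATTS = {"H100": 700, "B200": 1000, "Rubin": 1200}
GPUS_PER_RACK = 72     # assumed dense rack configuration
AIR_LIMIT_KW = 20      # assumed practical ceiling for an air-cooled rack

for gpu, watts in GPU_WATTS.items():
    rack_kw = watts * GPUS_PER_RACK / 1_000
    print(f"{gpu:>6}: {rack_kw:>5.0f} kW per rack "
          f"({rack_kw / AIR_LIMIT_KW:.0f}x the assumed air-cooling limit)")
```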
Cooling Methods
| Method | Notes |
|---|---|
| Direct Liquid Cooling | Standard for 2025–2026 |
| Rear‑Door Heat Exchangers | Retrofit option |
| Immersion Cooling | Highest efficiency, more complex |
| Two‑Phase Cooling | Future standard, still early |
Cooling is now 30–40% of total data center cost.
10. Energy Consumption
A single 100k GPU cluster consumes:
- 150–300 MW continuous load
Gigawatt campuses:
- 1.2–1.5 GW each
PUE for AI data centers:
- 1.1–1.2 (thanks to liquid cooling)
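Combining the report's numbers, total facility draw follows directly from IT load × PUE:

```python
# Facility power implied by the figures above: a 100k-GPU cluster at
# 150-300 MW of IT load, with a PUE of 1.1-1.2 from liquid cooling.

IT_LOAD_MW = (150, 300)
PUE = (1.1, 1.2)

low = IT_LOAD_MW[0] * PUE[0]
high = IT_LOAD_MW[1] * PUE[1]
print(f"Total facility draw: {low:.0f}-{high:.0f} MW")                 # 165-360 MW
print(f"Overhead (cooling, power conversion): "
      f"{low - IT_LOAD_MW[0]:.0f}-{high - IT_LOAD_MW[1]:.0f} MW")      # 15-60 MW
```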
11. The Infrastructure Supercycle (2025–2030)
2025–2026 → TRAINING ERA
- 10 mega-clusters (100k GPUs each)
2027–2030 → INFERENCE ERA
- Thousands of smaller clusters
- Deployed in hundreds of cities
Total expected spend: $3 trillion by 2030.
12. Where CLS, Astera Labs, and Vertiv Fit
12.1 Position in the AI Data Center Stack
| Layer / Function | Celestica (CLS) | Astera Labs | Vertiv |
|---|---|---|---|
| GPUs / ASICs | | | |
| Memory / HBM | (x) | | |
| In-rack connectivity (PCIe/CXL/AEC) | (x) | X | |
| Switches / Fabric | X | (x) | |
| Racks / Servers / Integration | X | (x) | |
| Power & Cooling | | | X |
| Data center construction | | | X |
X = primary exposure, (x) = secondary / indirect exposure.
Interpretation:
- Astera Labs lives inside the rack (PCIe/CXL retimers, CXL memory pooling).
- Celestica (CLS) sits at the fabric and server layer (switches, servers, ODM integration).
- Vertiv anchors power, cooling, and prefabricated data center modules.
12.2 Leverage to the AI Supercycle
| Company | Segment Focus | Tie‑in to 100k GPU Clusters | Sensitivity to GPU Shipments | Capital Intensity | Moat Type |
|---|---|---|---|---|---|
| Celestica | Switches, servers, ODM integration | High (fabric + racks) | High | Medium–High | Scale, integration, relationships |
| Astera Labs | PCIe/CXL retimers, memory pooling | Very high (per‑link scaling) | High | Low–Medium | IP, DSP know‑how, design‑ins |
| Vertiv | Power, cooling, prefabs | Medium (MW‑driven) | Medium | High | Installed base, service network |
13. Final Summary
AI data centers have become power plants for computation and the backbone of a new global infrastructure buildout.
- Astera Labs wins inside the rack (PCIe/CXL, memory connectivity).
- Celestica wins in switches, servers, and integration for hyperscale fabrics.
- Vertiv wins in power, cooling, and gigawatt‑scale campuses.
Together, they form the hidden infrastructure layer powering the AI revolution.