The NVIDIA G1-G4 tiering formalization is the real signal here. Once the GPU vendor starts naming your memory tiers, you know the problem has graduated from "infrastructure team headache" to "industry-defining constraint."
What resonates from running production AI workloads on modest infrastructure: the recompute tax is the silent killer. You don't see it in your average latency dashboards because those 30% cache-miss requests get averaged out. But your p99 latency tells the real story — and that's what users actually experience.
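A toy sketch of that averaging effect (all numbers invented): 100 requests where 30% pay a hypothetical recompute penalty. The mean blends the misses away, while p99 shows the penalty in full.

```python
import statistics

# Hypothetical latencies: 70 cache hits, 30 misses that pay the
# "recompute tax" (KV cache evicted, prefill redone). Numbers invented.
HIT_MS, MISS_MS = 100.0, 800.0
latencies = [HIT_MS] * 70 + [MISS_MS] * 30

mean = statistics.mean(latencies)
# 99th-percentile sample of the sorted latencies
p99 = sorted(latencies)[int(len(latencies) * 0.99) - 1]

print(f"mean = {mean:.0f} ms")  # 310 ms: misses averaged out
print(f"p99  = {p99:.0f} ms")   # 800 ms: what the unlucky user sees
```

With a 30% miss rate the mean sits at 310 ms, but every p99 request is a full 800 ms miss, which is exactly why dashboards keyed to averages stay green while users complain.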
The "hold state or redo work" framing should be tattooed on every infrastructure engineer's forearm. It's the same tradeoff that shows up in connection pooling, session management, and now KV caches — just at GPU-memory prices instead of RAM prices.
Wow, this really nails the fundamental challenge we're seeing with GPU memory right now. The recompute tax stuff is hitting us hard in production -- we were blaming our batching strategy but it turns out our KV cache management was the real culprit. It's eye-opening how much those cache misses compound across multi-turn convos. Definitely saving this for the team.
Didn't expect this! What if memory becomes true compute?
I think you have a typo in an approximate cost, perhaps an extra zero. $10 per gig seems in line with your chart showing GPU memory costs.
Memory (HBM/DRAM): The “Penthouse.” Extremely fast, extremely small, volatile (data vanishes when power cuts), and aggressively expensive (~$100/GB).