Is Your Data Center AI-Ready? Immersion Cooling Could Be the Key to Tackle AI Demands
The AI boom is sparking an urge among companies to optimize data center energy use, and here’s why it matters. Data centers already consume roughly 2% of the world’s power and account for about 1% of global greenhouse gas emissions, and as artificial intelligence (AI) advances at breakneck speed, the demand for computational power has surged, creating new challenges for data centers all over the world.
Everyday activities such as streaming, storing information, and meeting online all contribute to the increased demand. But AI, with the immense computational load of machine learning and deep learning, is the biggest driver yet.
The International Energy Agency (IEA) projects that data center energy consumption could double by 2026, a trend with clear climate implications. So, what’s the solution? Beyond powering data centers with renewable energy sources, which tech giants claim to be working towards, another innovative approach is gaining traction: maximizing energy efficiency with immersion cooling.
The sheer volume of data processed and stored by AI systems, especially those driven by the latest GPUs, requires exceptional power and cooling solutions to maintain performance and reliability. Traditional air-cooling systems, while effective, are limited in their ability to handle the extreme heat generated by high-performance computing (HPC) setups. Enter immersion cooling, an innovative and space-efficient solution poised to redefine data center cooling in the AI age.
What is Immersion Cooling?
Immersion cooling is a method that involves submerging electronic components, such as GPUs, CPUs, and other critical data center hardware, directly into a thermally conductive yet electrically insulating (non-conductive) liquid.
This liquid effectively absorbs the heat generated by the components, transporting it away from the hardware and keeping temperatures within safe operational ranges. Unlike traditional air cooling, which relies on moving air through fans and vents, immersion cooling provides direct contact between the cooling medium and the heat-generating parts of the system. It significantly improves cooling efficiency, allowing servers to be packed closely together and paving the way for high-density computing. Where typical air-cooled racks are designed to run at up to 10-15 kW, immersion tanks can handle up to 150 kW with the same number of rack units (1 RU = 44.45 mm). Because it removes heat more efficiently, immersion cooling also improves (that is, lowers) the Power Usage Effectiveness (PUE) ratio while reducing maintenance requirements.
Besides higher efficiency, immersion cooling makes it easier to repurpose the heat from servers for house heating or industrial processes. This approach further reduces energy waste and contributes to sustainable operations.
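To make the density and efficiency figures above concrete, here is a minimal Python sketch. It uses the standard definition of PUE (total facility power divided by IT equipment power) and the per-rack capacities quoted above (15 kW for an air-cooled rack, 150 kW for an immersion tank). The 1.5 MW cluster size, the overhead figures, and the function names are illustrative assumptions, not measurements from any real facility.

```python
import math


def pue(it_power_kw: float, overhead_power_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT power.

    A value of 1.0 would mean every watt goes to IT equipment,
    so lower is better.
    """
    if it_power_kw <= 0:
        raise ValueError("IT power must be positive")
    return (it_power_kw + overhead_power_kw) / it_power_kw


def enclosures_needed(it_load_kw: float, capacity_kw: float) -> int:
    """Number of racks or tanks needed to host a given IT load."""
    if capacity_kw <= 0:
        raise ValueError("capacity must be positive")
    return math.ceil(it_load_kw / capacity_kw)


# Hypothetical 1.5 MW AI cluster (assumed figure for illustration).
IT_LOAD_KW = 1500.0

# Per-enclosure capacities taken from the text above.
AIR_RACK_KW = 15.0
IMMERSION_TANK_KW = 150.0

air_racks = enclosures_needed(IT_LOAD_KW, AIR_RACK_KW)
tanks = enclosures_needed(IT_LOAD_KW, IMMERSION_TANK_KW)

# Assumed cooling and power-distribution overheads, purely illustrative.
air_pue = pue(IT_LOAD_KW, overhead_power_kw=750.0)
immersion_pue = pue(IT_LOAD_KW, overhead_power_kw=75.0)

print(f"Air-cooled racks needed: {air_racks}")
print(f"Immersion tanks needed:  {tanks}")
print(f"Air-cooled PUE (assumed overhead): {air_pue:.2f}")
print(f"Immersion PUE (assumed overhead):  {immersion_pue:.2f}")
```

With these assumed inputs, the same IT load fits in a tenth of the enclosures, and the lower overhead shows up directly as a lower PUE. Real values depend heavily on climate, facility design, and workload.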
Immersion cooling systems generally come in two types:
- Single-Phase Immersion Cooling:
Hardware components are submerged in a tank of non-conductive liquid, which absorbs heat and is then circulated out of the tank to be cooled before returning. The liquid remains in a single state (liquid) throughout the process.
- Two-Phase Immersion Cooling:
Heat is transferred from the chip to a fluid with a low boiling point, causing the fluid to boil and turn into vapor as it absorbs heat from the components.
The vapor is then condensed back into liquid form in a cooling mechanism, and the cycle repeats continuously. The two "phases" are the liquid and vapor states of the coolant.
The most common liquids used in immersion cooling are specialized, non-conductive fluids such as 3M’s Novec or Fluorinert. These liquids are designed to withstand extreme temperatures while remaining non-reactive with electronics, making them safe and very efficient for heat management.
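The difference between the two types can be sketched with basic heat-balance arithmetic. In a single-phase tank, heat leaves as sensible heat (Q = m * c_p * dT), so the required coolant flow depends on the fluid's specific heat and the allowed temperature rise. In a two-phase tank, heat leaves as latent heat of vaporization (Q = m * h_fg). The fluid properties below are round placeholder values chosen for illustration, not datasheet figures for any specific product, and the function names are hypothetical.

```python
def single_phase_flow_kg_s(heat_w: float, cp_j_per_kg_k: float,
                           delta_t_k: float) -> float:
    """Coolant mass flow needed to carry heat_w as sensible heat.

    Rearranged from Q = m_dot * c_p * delta_T.
    """
    if cp_j_per_kg_k <= 0 or delta_t_k <= 0:
        raise ValueError("specific heat and temperature rise must be positive")
    return heat_w / (cp_j_per_kg_k * delta_t_k)


def two_phase_boil_rate_kg_s(heat_w: float,
                             latent_heat_j_per_kg: float) -> float:
    """Mass of fluid vaporized per second to carry heat_w.

    Rearranged from Q = m_dot * h_fg.
    """
    if latent_heat_j_per_kg <= 0:
        raise ValueError("latent heat must be positive")
    return heat_w / latent_heat_j_per_kg


# 150 kW tank load, the upper figure quoted earlier in the article.
TANK_HEAT_W = 150_000.0

# Placeholder fluid properties (assumptions, not vendor data).
CP_J_PER_KG_K = 2000.0          # specific heat of a generic dielectric liquid
DELTA_T_K = 10.0                # allowed coolant temperature rise across the tank
LATENT_HEAT_J_PER_KG = 100_000.0  # heat of vaporization of a generic two-phase fluid

flow = single_phase_flow_kg_s(TANK_HEAT_W, CP_J_PER_KG_K, DELTA_T_K)
boil_rate = two_phase_boil_rate_kg_s(TANK_HEAT_W, LATENT_HEAT_J_PER_KG)

print(f"Single-phase: circulate about {flow:.1f} kg/s of coolant")
print(f"Two-phase:    vaporize about {boil_rate:.1f} kg/s of coolant")
```

Under these assumed properties, the two-phase tank moves the same heat with far less fluid in motion, which is the basic appeal of boiling-based designs. Actual sizing would need the real fluid datasheet and the tank vendor's specifications.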
Pioneers and Notable Companies in Immersion Cooling
Several companies are leading the charge in developing and deploying immersion cooling technologies. Some of the most influential names in the field include:
- Green Revolution Cooling (GRC): Known for its turnkey immersion cooling solutions, GRC provides scalable options for various data center sizes.
- LiquidStack: Initially launched under Bitfury, LiquidStack has become a prominent name in immersion cooling and has introduced advanced solutions with a focus on sustainability and efficiency.
- Submer: Specializes in immersion cooling systems and is notable for its modular cooling tanks, which can be adapted for different data center requirements.
- 3M: While not directly a cooling system provider, 3M’s Novec fluids are widely used across the immersion cooling industry and are key to the technology’s performance and safety.
Is your Data Center ready for Liquid Cooling?
There are other liquid cooling methods besides immersion cooling. One of them is direct-chip cooling, also known as direct-to-chip liquid cooling, which delivers coolant directly to the surfaces of heat-generating chips. Small, liquid-filled cold plates (heatsinks) are mounted directly onto chips like CPUs and GPUs, where they absorb heat. The coolant then circulates through a system of pipes to an external cooling unit, where it is cooled and recirculated. This technique is more targeted than traditional air cooling, offering better temperature control and increased cooling efficiency at the level of the individual chip. Direct-chip cooling can be seen as a middle ground between traditional air cooling and all-in immersion cooling.
While both direct chip and immersion cooling aim to reduce power consumption and optimize space, they differ significantly in approach and application. The choice between direct chip and immersion cooling often depends on the specific needs and structure of the data center:
Direct Chip Cooling
Works well for facilities that already have cooling infrastructure but need a more precise, targeted solution for high-power chips. For AI workloads that require consistent temperature regulation on specific GPUs or CPUs, direct chip cooling can be a highly efficient option.
Immersion Cooling
Ideal for new data centers or those ready to undergo significant retrofitting, as it can handle cooling for entire clusters at once. It’s particularly effective in high-density environments, where space is at a premium, and uniform cooling is needed to manage the intense heat generated by large AI models.
Maintenance operations such as replacing failed hardware are more complicated with both approaches than with air cooling, and especially so with immersion cooling.
The Future of Immersion Cooling in AI-Driven Data Centers
As the AI boom intensifies, tech giants like NVIDIA are paying close attention to immersion cooling as a viable solution for their hardware. NVIDIA’s GPUs, particularly the H100/H200 and B100/B200 series designed for AI applications, generate considerable heat due to their processing power. Recognizing the limitations of air cooling, NVIDIA has been supportive of alternative cooling solutions. For now, NVIDIA works with direct-chip cooling. It has partnered with cooling companies to ensure that its GPUs are compatible with immersion cooling, but it does not yet offer a warranty on its products for this method.
Hypertec, for example, is a company that provides a customizable warranty program for its computing servers, workstations, and laptops, designed to meet varying customer needs. The warranty covers essential components such as CPUs, GPUs, RAM, and SSDs, with support options that include return-to-depot, advance exchange, and on-site services. This approach includes its immersion-born servers.
However, as the tech industry’s use of AI grows, energy demands intensify, which challenges sustainability goals as data centers consume more power. In response, tech companies are increasingly turning to nuclear energy. Recently, Google partnered with Kairos Power on small modular reactor (SMR) projects, and Microsoft signed a 20-year power purchase agreement with Constellation Energy to support restarting the Unit 1 nuclear reactor at Three Mile Island by 2028, adding over 800 MW of carbon-free energy to the U.S. power grid.
Furthermore, Amazon is investing in nuclear energy, joining Google and Microsoft in backing nuclear power, and plans to use SMRs to power its operations and support its 2040 net-zero carbon goal. SMRs are advanced reactors with a smaller footprint, allowing faster construction close to where the power is needed and thus minimizing transmission losses. Amazon has also reached agreements with Energy Northwest to build SMRs and with Dominion Energy to site a reactor near North Anna Nuclear Generating Station.
Closing Thoughts
As AI workloads drive up power demands, data centers are adopting innovative solutions to manage energy use and increase efficiency. Immersion and direct chip cooling offer complementary benefits, from space efficiency to targeted temperature control, making them ideal for handling the intense cooling needs of AI-driven systems.
More companies are building servers designed for immersion cooling from the get-go and backing them with warranties on key components, which reinforces reliability and supports sustainable operations.
Meanwhile, tech giants like Amazon, Google, and Microsoft are investing in small modular reactors (SMRs) to provide carbon-free power, addressing environmental goals as energy needs grow.
Through advanced cooling methods and cleaner energy sources, the industry seems to be moving towards a more sustainable data center model that can support AI’s future demands while reducing its environmental impact.
Are you looking to deploy and manage GPU clusters with Kubernetes, OpenStack, or NVIDIA Base Command Manager? Get in touch: the Cloudification team has extensive experience with GPU virtualization, NVIDIA MIG, and building GPU clusters for various use cases.