The AI Factory Heats Up: Liquid Cooling Options Explained

By Shahar Belkin, Chief Evangelist at ZutaCore

The compute power required by AI and HPC is skyrocketing and driving a global transition from 10-15 megawatt data centers to 50-100 megawatt and even gigawatt AI factories. With next-generation AI superchips operating at 2,800 watts and beyond, the amount of heat expected to be generated by a single data center is off the charts.

State of the Cooling Market – Air vs. Liquid, or Both?

A data center using only air cooling needs 1 watt of cooling for every watt of computing. That means 50 percent of its power goes to cooling! But with liquid cooling, every watt of cooling supports 10 watts of computing. And in terms of power usage effectiveness (PUE), while air-based cooling delivers a PUE of roughly 1.5, liquid cooling can cut that to 1.1 and even 1.04 or lower. A shift from 1.5 to 1.1 represents massive savings. Put another way, the same power consumption using direct on-chip liquid cooling will support 75 percent more compute.
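The arithmetic behind the first of these figures can be sketched in a few lines of Python. This is a minimal illustration, assuming PUE is defined as total facility power divided by IT power and that cooling is the only overhead in the watt-per-watt comparison. The 100 MW IT load and the function names are hypothetical, chosen only for illustration.

```python
def cooling_share(cooling_watts_per_compute_watt: float) -> float:
    """Fraction of total power spent on cooling, assuming cooling is the only overhead."""
    return cooling_watts_per_compute_watt / (1.0 + cooling_watts_per_compute_watt)


def facility_power_mw(it_load_mw: float, pue: float) -> float:
    """Total facility power, using PUE = total facility power / IT power."""
    return it_load_mw * pue


# Air: 1 W of cooling per 1 W of compute; liquid: 1 W of cooling per 10 W of compute.
air_share = cooling_share(1.0)
liquid_share = cooling_share(0.1)
print(f"Cooling share, air:    {air_share:.1%}")
print(f"Cooling share, liquid: {liquid_share:.1%}")

# Hypothetical 100 MW IT load (an assumption for illustration only).
it_load_mw = 100.0
saved_mw = facility_power_mw(it_load_mw, 1.5) - facility_power_mw(it_load_mw, 1.1)
print(f"Facility power saved moving from PUE 1.5 to 1.1: {saved_mw:.0f} MW")
```

Under these assumptions, the air-cooled case spends half of its power on cooling, while the liquid-cooled case spends roughly a tenth as much per watt of compute.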

That is why analysts estimate the liquid cooling market will grow from $5.65 billion in 2024 to $48.42 billion by 2034.
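For a sense of scale, the annual growth rate implied by that forecast can be computed directly. This is a small sketch that assumes smooth compound growth over the ten years, which the forecast itself does not state; the function name is hypothetical.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value, an end value, and a span in years."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Market size in billions of USD, taken from the forecast quoted above.
cagr = implied_cagr(5.65, 48.42, 2034 - 2024)
print(f"Implied compound annual growth rate: {cagr:.1%}")
```

Under that assumption, the forecast works out to growth of roughly 24 percent per year.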

Liquid Cooling 101: Direct-to-Chip vs. Immersion

There are several types of liquid cooling technologies, which fall under two categories: immersion and direct-to-chip.

Direct-to-chip is often called “cold plate” cooling because it uses cold plates that sit on top of the GPU or CPU, versus immersion cooling that submerges the servers, chips and other equipment into tanks of fluid.

With single-phase immersion, servers and other IT equipment are immersed in an oily fluid in a tank, and as the CPU or GPU heats up, the fluid absorbs the heat. This heated fluid rises to the top of the tank and is then pumped to a heat exchange unit that cools the fluid and sends it back to the tank.

The advantage is that it can take 100% of the heat off the server. However, it is limited to cooling lower power chips (500 watts and lower) because the oil is slow to rise to the top of the tank to be pumped for cooling. In addition, the oil is potentially flammable at high temperatures, and because it touches all the components, it can reduce the life of the equipment. And it requires heavy maintenance.

Two-phase immersion also submerges servers and IT equipment in tanks. Compared to single-phase, the difference is that it uses a low boiling temperature dielectric fluid instead of oil. As a component on the board heats up, it boils the fluid, which creates vapor that rises from the liquid to the top of the tank, where there is a network of tubes carrying cooled facility water. The vapor touching the cold tubes condenses and drips back into the tank.

Figure: In single-phase immersion, servers and IT equipment are submerged in fluid encased in large tanks.

The advantage is that the dielectric fluid will not short circuit the components and servers like water will. The downside is that it requires significant data center infrastructure investment because large and heavy tanks are required to house the equipment.

In addition, for equipment to be immersed in the tank, all components must be compatible with the dielectric liquid so that they are not damaged by the fluid itself. This requires specialized equipment or modifications to servers. Maintenance is also a challenge because two-phase immersion often involves long downtimes, with cranes used to lift the servers out of the tanks.

Like single-phase immersion, two-phase immersion can also remove 100% of the heat. However, this process involves boiling the dielectric fluid in the tanks that also house all the server equipment. As a result, material from the motherboard and other equipment is routinely “boiled off.” This can be detrimental to the life of the equipment, and as the material comes off, the fluid must be continually filtered, requiring large and expensive filters and regular maintenance. It can also be detrimental to the environment because when a tank is opened, dielectric fluid is released into the atmosphere.

Direct-to-Chip Liquid Cooling

Direct-to-chip cooling brings cooling liquid to a cold plate positioned directly on top of the high heat flux components, such as CPUs and GPUs. This liquid removes heat from the components; it is contained in the cold plate and does not come in contact with the chips or other server components.

There are two types of direct-to-chip liquid cooling: single-phase and two-phase. Both methods use cold plates, which do not change the server and rack design. The only change is replacing the air-based heat sink with a cold plate on top of the CPU or GPU.

Figure: In two-phase immersion, vapor rises from the liquid to the top of the tank.

Single-phase direct-to-chip cooling uses water or a water-glycol mix as the coolant in the cold plate. The water stays in a liquid state, and the ability to take away heat with this method depends on water flow. The higher the power of the chip that needs to be cooled, the more water flow is required. This requires investment in larger pipes, tubes and connectors as well as power-hungry pumps to continuously move the water through the system.

The challenge with this approach is the risk of water leakage and corrosion. With servers approaching the $300K range, a single leak can be catastrophic, not to mention the cost of a downed plant operation. In addition, over time, water is corrosive and can also lead to mold, residue, and other biological growth. The water must be continually filtered, maintained and tested to make sure it is balanced, adding to the maintenance expense.

A limitation of single-phase direct-to-chip liquid cooling is that the heat removed depends on water flow. The hotter the chips, the more water is needed. Using this approach for a 1,000-watt chip, a data center would need to flow 1.2-1.5 liters per minute. With the latest GPUs in the area of 1.5 kilowatts, that means water flow in every cold plate would need to be two liters per minute. When GPU power passes the 2,000-watt threshold, a flow of a gallon per minute will be needed in the cold plate. As we approach gigawatt data centers, the requirement for so much water flow makes this approach less effective and requires high pressure in the flexible tubes, which can lead to water leaks in the servers.
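The link between chip power and water flow follows from a simple heat balance, P = ṁ · c_p · ΔT. The sketch below is a rough estimate only: it assumes approximate properties for plain water and a hypothetical 10 K coolant temperature rise across the cold plate, neither of which is stated above, and the function name is illustrative.

```python
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0  # approximate value for liquid water
WATER_DENSITY_KG_PER_L = 1.0             # approximate value for liquid water
LITERS_PER_US_GALLON = 3.785


def required_flow_l_per_min(chip_power_w: float, coolant_rise_k: float = 10.0) -> float:
    """Water flow needed to carry away chip_power_w at a given coolant temperature rise.

    Uses the heat balance P = m_dot * c_p * delta_T. The 10 K default rise is an
    assumption made for illustration.
    """
    mass_flow_kg_per_s = chip_power_w / (WATER_SPECIFIC_HEAT_J_PER_KG_K * coolant_rise_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


for power_w in (1000, 1500, 2000, 2800):
    flow = required_flow_l_per_min(power_w)
    print(f"{power_w:>5} W -> {flow:.2f} L/min ({flow / LITERS_PER_US_GALLON:.2f} US gal/min)")
```

With these assumptions, the estimate for a 1,000-watt chip falls inside the 1.2-1.5 liters per minute range quoted above, and the required flow scales linearly with chip power. A smaller allowed temperature rise would push the required flow higher.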

Unlike single-phase direct-to-chip, two-phase direct-to-chip does not require the flow of liquid and in fact uses no water in the cold plate. Inside the server and cold plate is a heat transfer fluid that is 100% safe for IT equipment. The heat from GPUs and CPUs boils the heat transfer fluid at a low temperature, absorbing the heat; this efficient phase-change phenomenon keeps the chip at a constant temperature.

This is similar to the way boiling water keeps the bottom of a pot at 100°C, only in this case the heat transfer fluid boils at a lower temperature. Because the liquid inside the cold plate boils, it never exceeds its boiling temperature even when the heat increases by 3X (such as with higher power GPUs and CPUs). This makes the approach highly scalable for cooling the higher power chips of the future. To understand how this ‘pool boiling’ approach works, see this tutorial video.
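The pot analogy can be made concrete with water itself, since the properties of the heat transfer fluid are not given here. This rough sketch assumes approximate textbook values for water and a hypothetical 10 K temperature rise for the single-phase case. It compares how much heat one kilogram of coolant absorbs by warming up versus by boiling, and how the boil-off rate responds when the heat load triples.

```python
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0   # approximate, liquid water
WATER_LATENT_HEAT_J_PER_KG = 2_257_000.0  # approximate heat of vaporization at 100 °C


def sensible_heat_j_per_kg(temperature_rise_k: float) -> float:
    """Heat absorbed per kilogram when liquid water warms by temperature_rise_k."""
    return WATER_SPECIFIC_HEAT_J_PER_KG_K * temperature_rise_k


def boil_off_g_per_s(heat_w: float) -> float:
    """Mass of water vaporized per second while absorbing heat_w at its boiling point."""
    return heat_w / WATER_LATENT_HEAT_J_PER_KG * 1000.0


rise_k = 10.0  # hypothetical single-phase temperature rise
ratio = WATER_LATENT_HEAT_J_PER_KG / sensible_heat_j_per_kg(rise_k)
print(f"Boiling absorbs about {ratio:.0f}x more heat per kg than a {rise_k:.0f} K rise")

# Tripling the heat load triples the boil-off rate; the liquid stays at its boiling point.
for heat_w in (1000, 3000):
    print(f"{heat_w} W -> {boil_off_g_per_s(heat_w):.2f} g/s vaporized")
```

The numbers for a dielectric heat transfer fluid will differ from those for water. The sketch only illustrates the mechanism: extra heat goes into vaporizing more fluid, not into raising its temperature.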

Two-phase direct-to-chip liquid cooling requires little to no data center infrastructure changes, just a simple installation process. It is also fairly low maintenance because the dielectric fluid does not need to be filtered, balanced or replaced. And unlike immersion, it does not get released into the atmosphere during server and rack maintenance.

Hotter Chips Are Coming – Are You Ready?

While chips of over 2,500 watts are not expected until the end of 2025, data centers and AI factories are preparing for their arrival. Many hyperscalers are shying away from water because it poses too much risk. Even insurance companies are making their concerns known because insuring against a water leak could be a huge expense. Apart from this, there is also pressure to make the infrastructure scalable so that it can handle hotter chips as they become available, while also being sustainable, energy efficient, and cost-effective for the long term.

Knowing all this, is your data center ready?

Shahar Belkin is chief evangelist at ZutaCore, a direct-to-chip liquid cooling solutions company.
