Efficient Metric Collection in PyTorch: Avoiding the Performance Pitfalls of TorchMetrics

Metric collection is an essential part of every machine learning project, enabling us to track model performance and monitor training progress. Ideally, metrics should be collected and computed without introducing any additional overhead to the training process. However, just like other components of the training loop, inefficient metric computation can introduce unnecessary overhead, increase training-step times, and inflate training costs.

This post is the seventh in our series on performance profiling and optimization in PyTorch. The series has aimed to emphasize the critical role of performance analysis and optimization in machine learning development. Each post has focused on different stages of the training pipeline, demonstrating practical tools and techniques for analyzing and boosting resource utilization and runtime efficiency.

In this installment, we focus on metric collection. We will demonstrate how a naïve implementation of metric collection can negatively impact runtime performance and explore tools and techniques for its analysis and optimization.

To implement our metric collection, we will use TorchMetrics, a popular library designed to simplify and standardize metric computation in PyTorch. Our goals will be to:

Demonstrate the runtime overhead caused by a naïve implementation of metric collection.
Use PyTorch Profiler to pinpoint performance bottlenecks introduced by metric computation.
Demonstrate optimization techniques to reduce metric collection overhead.

To facilitate our discussion, we will define a toy PyTorch model and assess how metric collection can impact its runtime performance. We will run our experiments on an NVIDIA A40 GPU, with a PyTorch 2.5.1 docker image and TorchMetrics 1.6.1.

It's important to note that metric collection behavior can vary greatly depending on the hardware, runtime environment, and model architecture. The code snippets provided in this post are intended for demonstrative purposes only. Please do not interpret our mention of any tool or technique as an endorsement for its use.

Toy ResNet Model

In the code block below we define a simple image classification model with a ResNet-18 backbone.

import time
import torch
import torchvision

device = "cuda"

model = torchvision.models.resnet18().to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())

We define a synthetic dataset which we will use to train our toy model.

from torch.utils.data import Dataset, DataLoader

# A dataset with random images and labels
class FakeDataset(Dataset):
    def __len__(self):
        return 100000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(data=index % 1000, dtype=torch.int64)
        return rand_image, label

train_set = FakeDataset()

batch_size = 128
num_workers = 12

train_loader = DataLoader(
    dataset=train_set,
    batch_size=batch_size,
    num_workers=num_workers,
    pin_memory=True
)

We define a collection of standard metrics from TorchMetrics, along with a control flag to enable or disable metric calculation.

from torchmetrics import (
    MeanMetric,
    Accuracy,
    Precision,
    Recall,
    F1Score,
)

# toggle to enable/disable metric collection
capture_metrics = False

if capture_metrics:
    metrics = {
        "avg_loss": MeanMetric(),
        "accuracy": Accuracy(task="multiclass", num_classes=1000),
        "precision": Precision(task="multiclass", num_classes=1000),
        "recall": Recall(task="multiclass", num_classes=1000),
        "f1_score": F1Score(task="multiclass", num_classes=1000),
    }

    # Move all metrics to the device
    metrics = {name: metric.to(device) for name, metric in metrics.items()}

Next, we define a PyTorch Profiler instance, along with a control flag that allows us to enable or disable profiling. For a detailed tutorial on using PyTorch Profiler, please refer to the first post in this series.

from torch import profiler

# toggle to enable/disable profiling
enable_profiler = True

if enable_profiler:
    prof = profiler.profile(
        schedule=profiler.schedule(wait=10, warmup=2, active=3, repeat=1),
        on_trace_ready=profiler.tensorboard_trace_handler("./logs/"),
        profile_memory=True,
        with_stack=True
    )
    prof.start()
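
To make the schedule settings concrete, the short sketch below queries a schedule with the same settings step by step and prints the action the profiler would take. It is an illustration only and is not part of the training script; it assumes the default values for any schedule arguments not shown above.

```python
from torch.profiler import ProfilerAction, schedule

# Same settings as in the profiler definition above
sched = schedule(wait=10, warmup=2, active=3, repeat=1)

# Ask the schedule what the profiler does at each of the first 16 steps
actions = [sched(step) for step in range(16)]

for step, action in enumerate(actions):
    print(step, action.name)
```

With these settings, steps 0 through 9 are skipped, steps 10 and 11 are warmup, and steps 12 through 14 are recorded, with the trace saved at the end of step 14.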

Finally, we define a standard training step:

model.train()

t0 = time.perf_counter()
total_time = 0
count = 0

for idx, (data, target) in enumerate(train_loader):
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

    if capture_metrics:
        # update metrics
        metrics["avg_loss"].update(loss)
        for name, metric in metrics.items():
            if name != "avg_loss":
                metric.update(output, target)

        if (idx + 1) % 100 == 0:
            # compute metrics
            metric_results = {
                name: metric.compute().item()
                    for name, metric in metrics.items()
            }
            # print metrics
            print(f"Step {idx + 1}: {metric_results}")
            # reset metrics
            for metric in metrics.values():
                metric.reset()

    elif (idx + 1) % 100 == 0:
        # print last loss value
        print(f"Step {idx + 1}: Loss = {loss.item():.4f}")

    batch_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    if idx > 10:  # skip first steps
        total_time += batch_time
        count += 1

    if enable_profiler:
        prof.step()

    if idx > 200:
        break

if enable_profiler:
    prof.stop()

avg_time = total_time/count
print(f'Average step time: {avg_time}')
print(f'Throughput: {batch_size/avg_time:.2f} images/sec')

Metric Collection Overhead

To measure the impact of metric collection on training step time, we ran our training script both with and without metric calculation. The results are summarized in the following table.

Our naïve metric collection resulted in a nearly 10% drop in runtime performance! While metric collection is essential for machine learning development, it usually involves relatively simple mathematical operations and hardly warrants such a significant overhead. What is going on?
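
When reading numbers like these, it helps to keep in mind how step time and throughput relate. The sketch below uses made-up step times (they are not the measurements from our experiment) to show that a 10% increase in step time corresponds to a throughput drop of roughly 9%. The helper names are our own.

```python
def throughput(batch_size, step_time):
    """Images processed per second for a given step time in seconds."""
    return batch_size / step_time


def relative_change(baseline, measured):
    """Fractional change of `measured` relative to `baseline`."""
    return (measured - baseline) / baseline


batch_size = 128

# Hypothetical step times in seconds, chosen for illustration only
baseline_step = 0.100
metrics_step = 0.110

step_increase = relative_change(baseline_step, metrics_step)
throughput_drop = -relative_change(
    throughput(batch_size, baseline_step),
    throughput(batch_size, metrics_step),
)

print(f"step time increase: {step_increase:.1%}")
print(f"throughput drop: {throughput_drop:.1%}")
```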

Identifying Performance Issues with PyTorch Profiler

To better understand the source of the performance degradation, we reran the training script with the PyTorch Profiler enabled. The resultant trace is shown below:

Trace of Metric Collection Experiment (by Author)

The trace reveals recurring “cudaStreamSynchronize” operations that coincide with noticeable drops in GPU utilization. These types of “CPU-GPU sync” events were discussed in detail in part two of our series. In a typical training step, the CPU and GPU work in parallel: the CPU manages tasks like data transfers to the GPU and kernel loading, and the GPU executes the model on the input data and updates its weights. Ideally, we would like to minimize the points of synchronization between the CPU and GPU in order to maximize performance. Here, however, we can see that the metric collection has triggered a sync event by performing a CPU to GPU data copy. This requires the CPU to suspend its processing until the GPU catches up which, in turn, causes the GPU to wait for the CPU to resume loading the subsequent kernel operations. The bottom line is that these synchronization points lead to inefficient utilization of both the CPU and GPU. Our metric collection implementation adds eight such synchronization events to each training step.
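
Besides reading traces, PyTorch offers a debugging switch, torch.cuda.set_sync_debug_mode, that reports synchronizing CUDA calls as Python warnings. The sketch below wraps it in a small helper of our own (collect_sync_warnings is not a PyTorch function). We did not use it in our experiments, and the set of operations it reports may vary between PyTorch versions. On a machine without a GPU it simply runs the function and returns an empty list.

```python
import warnings

import torch


def collect_sync_warnings(fn):
    """Run fn() and return the messages of warnings raised in sync debug mode."""
    if not torch.cuda.is_available():
        # Nothing to report without a GPU; just run the function
        fn()
        return []
    torch.cuda.set_sync_debug_mode("warn")
    try:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            fn()
    finally:
        # Always restore the default mode
        torch.cuda.set_sync_debug_mode("default")
    return [str(w.message) for w in caught]


device = "cuda" if torch.cuda.is_available() else "cpu"
loss = torch.tensor(0.5, device=device)

# .item() copies the value to the host, so the CPU must wait for the GPU
messages = collect_sync_warnings(lambda: loss.item())
print(len(messages), "synchronizing call(s) reported")
```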

A closer examination of the trace shows that the sync events are coming from the update call of the MeanMetric TorchMetric. For the experienced profiling expert, this may be sufficient to identify the root cause, but we will go a step further and use the torch.profiler.record_function utility to identify the exact offending line of code.

Profiling with record_function

To pinpoint the exact source of the sync event, we extended the MeanMetric class and overrode the update method using record_function context blocks. This approach allows us to profile individual operations within the method and identify performance bottlenecks.

class ProfileMeanMetric(MeanMetric):
    def update(self, value, weight = 1.0):
        # broadcast weight to value shape
        with profiler.record_function("process value"):
            if not isinstance(value, torch.Tensor):
                value = torch.as_tensor(value, dtype=self.dtype,
                                        device=self.device)
        with profiler.record_function("process weight"):
            if weight is not None and not isinstance(weight, torch.Tensor):
                weight = torch.as_tensor(weight, dtype=self.dtype,
                                         device=self.device)
        with profiler.record_function("broadcast weight"):
            weight = torch.broadcast_to(weight, value.shape)
        with profiler.record_function("cast_and_nan_check"):
            value, weight = self._cast_and_nan_check_input(value, weight)

        if value.numel() == 0:
            return

        with profiler.record_function("update value"):
            self.mean_value += (value * weight).sum()
        with profiler.record_function("update weight"):
            self.weight += weight.sum()
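
For readers who have not used record_function before, the mechanism can be tried in isolation. The minimal sketch below profiles two labeled operations on the CPU and lists the labels found in the profiler summary. The tensor and the label names are arbitrary choices of ours and have nothing to do with TorchMetrics.

```python
import torch
from torch import profiler

x = torch.randn(256, 256)

# Profile CPU activity only, so the example runs without a GPU
with profiler.profile(activities=[profiler.ProfilerActivity.CPU]) as prof:
    with profiler.record_function("scale"):
        y = x * 2.0
    with profiler.record_function("reduce"):
        total = y.sum()

# Each record_function label appears as its own entry in the summary
labels = {event.key for event in prof.key_averages()}
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

In a trace viewer, the same labels appear as named ranges, which is what lets us attribute a sync event to a specific block of code.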

We then updated our avg_loss metric to use the newly created ProfileMeanMetric and reran the training script.

Trace of Metric Collection with record_function (by Author)

The updated trace reveals that the sync event originates from the following line:

weight = torch.as_tensor(weight, dtype=self.dtype, device=self.device)

This operation converts the default scalar value weight=1.0 into a PyTorch tensor and places it on the GPU. The sync event occurs because this action triggers a CPU-to-GPU data copy, which requires the CPU to wait for the GPU to process the copied value.

Optimization 1: Specify Weight Value

Now that we have found the source of the issue, we can overcome it easily by specifying a weight value in our update call. This prevents the runtime from converting the default scalar weight=1.0 into a tensor on the GPU, avoiding the sync event:

# update metrics
if capture_metrics:
    metrics["avg_loss"].update(loss, weight=torch.ones_like(loss))
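
The difference between the two ways of producing a unit weight can be seen in isolation. In the sketch below, which uses a stand-in scalar in place of the real loss and falls back to the CPU when no GPU is present, the first tensor is built from a host-side Python float, while the second is created directly on the device that holds the loss.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for the training loss (a scalar tensor on the training device)
loss = torch.tensor(0.75, device=device)

# Path taken for the default weight=1.0: a host float is converted to a
# tensor and placed on the device
weight_from_host = torch.as_tensor(1.0, dtype=loss.dtype, device=device)

# Path used in the update call above: the tensor is created on the device
# of `loss`, with the same shape and dtype
weight_on_device = torch.ones_like(loss)

print(weight_from_host, weight_on_device)
```

Both tensors hold the same value; only the way they reach the device differs.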

Rerunning the script after applying this change reveals that we have succeeded in eliminating the initial sync event… only to have uncovered a new one, this time coming from the _cast_and_nan_check_input function:

Trace of Metric Collection following Optimization 1 (by Author)

Profiling with record_function: Part 2

To explore our new sync event, we extended our custom metric with additional profiling probes and reran our script.

class ProfileMeanMetric(MeanMetric):
    def update(self, value, weight = 1.0):
        # broadcast weight to value shape
        with profiler.record_function("process value"):
            if not isinstance(value, torch.Tensor):
                value = torch.as_tensor(value, dtype=self.dtype,
                                        device=self.device)
        with profiler.record_function("process weight"):
            if weight is not None and not isinstance(weight, torch.Tensor):
                weight = torch.as_tensor(weight, dtype=self.dtype,
                                         device=self.device)
        with profiler.record_function("broadcast weight"):
            weight = torch.broadcast_to(weight, value.shape)
        with profiler.record_function("cast_and_nan_check"):
            value, weight = self._cast_and_nan_check_input(value, weight)

        if value.numel() == 0:
            return

        with profiler.record_function("update value"):
            self.mean_value += (value * weight).sum()
        with profiler.record_function("update weight"):
            self.weight += weight.sum()

    def _cast_and_nan_check_input(self, x, weight = None):
        """Convert input ``x`` to a tensor and check for Nans."""
        with profiler.record_function("process x"):
            if not isinstance(x, torch.Tensor):
                x = torch.as_tensor(x, dtype=self.dtype,
                                    device=self.device)
        with profiler.record_function("process weight"):
            if weight is not None and not isinstance(weight, torch.Tensor):
                weight = torch.as_tensor(weight, dtype=self.dtype,
                                         device=self.device)
            nans = torch.isnan(x)
            if weight is not None:
                nans_weight = torch.isnan(weight)
            else:
                nans_weight = torch.zeros_like(nans).bool()
                weight = torch.ones_like(x)

        with profiler.record_function("any nans"):
            anynans = nans.any() or nans_weight.any()

        with profiler.record_function("process nans"):
            if anynans:
                if self.nan_strategy == "error":
                    raise RuntimeError("Encountered `nan` values in tensor")
                if self.nan_strategy in ("ignore", "warn"):
                    if self.nan_strategy == "warn":
                        print("Encountered `nan` values in tensor."
                              " Will be removed.")
                    x = x[~(nans | nans_weight)]
                    weight = weight[~(nans | nans_weight)]
                else:
                    if not isinstance(self.nan_strategy, float):
                        raise ValueError(f"`nan_strategy` shall be float"
                                         f" but you pass {self.nan_strategy}")
                    x[nans | nans_weight] = self.nan_strategy
                    weight[nans | nans_weight] = self.nan_strategy

        with profiler.record_function("return value"):
            retval = x.to(self.dtype), weight.to(self.dtype)
        return retval

The resultant trace is captured below:

Trace of Metric Collection with record_function: Part 2 (by Author)

The trace points directly to the offending line:

anynans = nans.any() or nans_weight.any()

This operation checks for NaN values in the input tensors, but it introduces a costly CPU-GPU synchronization event because the operation involves copying data from the GPU to the CPU.

Upon closer inspection of the TorchMetrics BaseAggregator class, we find several options for handling NaN value updates, all of which pass through the offending line of code. However, for our use case, calculating the average loss metric, this check is unnecessary and does not justify the runtime performance penalty.
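
The copy is needed because Python's or operator (like an if statement) has to turn the result of nans.any() into a Python boolean, and that value must first arrive on the host. For cases where NaN values do need to be ignored, one possible alternative, sketched below, masks the NaN entries with torch.where so that no decision is made on the host. Note that masked_weighted_sums is a hypothetical helper of ours, not part of TorchMetrics, and it covers only the strategy of ignoring NaN values.

```python
import torch


def masked_weighted_sums(value, weight):
    """Sum value*weight and weight over non-NaN entries, with no host-side branch."""
    valid = ~(torch.isnan(value) | torch.isnan(weight))
    zeros = torch.zeros_like(value)
    # NaN entries contribute zero to both sums
    weighted_sum = torch.where(valid, value * weight, zeros).sum()
    weight_sum = torch.where(valid, weight, zeros).sum()
    return weighted_sum, weight_sum


device = "cuda" if torch.cuda.is_available() else "cpu"

value = torch.tensor([1.0, float("nan"), 3.0], device=device)
weight = torch.ones_like(value)

weighted_sum, weight_sum = masked_weighted_sums(value, weight)
mean = weighted_sum / weight_sum  # stays on the device until explicitly read
```

In this example the NaN entry is dropped from both sums, so the mean is taken over the two remaining values.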

Optimization 2: Disable NaN Value Checks

To eliminate the overhead, we propose disabling the NaN value checks by overriding the _cast_and_nan_check_input function. Instead of a static override, we implemented a dynamic solution that can be applied flexibly to any descendants of the BaseAggregator class.

from torchmetrics.aggregation import BaseAggregator

def suppress_nan_check(MetricClass):
    assert issubclass(MetricClass, BaseAggregator), MetricClass
    class DisableNanCheck(MetricClass):
        def _cast_and_nan_check_input(self, x, weight=None):
            if not isinstance(x, torch.Tensor):
                x = torch.as_tensor(x, dtype=self.dtype,
                                    device=self.device)
            if weight is not None and not isinstance(weight, torch.Tensor):
                weight = torch.as_tensor(weight, dtype=self.dtype,
                                         device=self.device)
            if weight is None:
                weight = torch.ones_like(x)
            return x.to(self.dtype), weight.to(self.dtype)
    return DisableNanCheck

NoNanMeanMetric = suppress_nan_check(MeanMetric)

metrics["avg_loss"] = NoNanMeanMetric().to(device)

Post Optimization Results: Success

After implementing the two optimizations, specifying the weight value and disabling the NaN checks, we find the step time performance and the GPU utilization to match those of our baseline experiment. In addition, the resultant PyTorch Profiler trace shows that all of the added “cudaStreamSynchronize” events that were associated with the metric collection have been eliminated. With a few small changes, we have reduced the cost of training by ~10% without any changes to the behavior of the metric collection.

In the next section we will explore an additional metric collection optimization.

Example 2: Optimizing Metric Device Placement

In the previous section, the metric values resided on the GPU, making it logical to store and compute the metrics on the GPU. However, in scenarios where the values we wish to aggregate reside on the CPU, it might be preferable to store the metrics on the CPU to avoid unnecessary device transfers.

In the code block below, we modify our script to calculate the average step time using a MeanMetric on the CPU. This change has no impact on the runtime performance of our training step:

avg_time = NoNanMeanMetric()
t0 = time.perf_counter()

for idx, (data, target) in enumerate(train_loader):
    # move data to device
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)

    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

    if capture_metrics:
        metrics["avg_loss"].update(loss)
        for name, metric in metrics.items():
            if name != "avg_loss":
                metric.update(output, target)

        if (idx + 1) % 100 == 0:
            # compute metrics
            metric_results = {
                name: metric.compute().item()
                    for name, metric in metrics.items()
            }
            # print metrics
            print(f"Step {idx + 1}: {metric_results}")
            # reset metrics
            for metric in metrics.values():
                metric.reset()

    elif (idx + 1) % 100 == 0:
        # print last loss value
        print(f"Step {idx + 1}: Loss = {loss.item():.4f}")

    batch_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    if idx > 10:  # skip first steps
        avg_time.update(batch_time)

    if enable_profiler:
        prof.step()

    if idx > 200:
        break

if enable_profiler:
    prof.stop()

avg_time = avg_time.compute().item()
print(f'Average step time: {avg_time}')
print(f'Throughput: {batch_size/avg_time:.2f} images/sec')

The problem arises when we attempt to extend our script to support distributed training. To demonstrate the problem, we modified our model definition to use DistributedDataParallel (DDP):

# toggle to enable/disable ddp
use_ddp = True

if use_ddp:
    import os
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=0, world_size=1)
    torch.cuda.set_device(0)
    model = DDP(torchvision.models.resnet18().to(device))
else:
    model = torchvision.models.resnet18().to(device)

# insert training loop

# append to end of the script:
if use_ddp:
    # destroy the process group
    dist.destroy_process_group()

The DDP modification results in the following error:

RuntimeError: No backend type associated with device type cpu

By default, metrics in distributed training are programmed to synchronize across all devices in use. However, the synchronization backend used by DDP does not support metrics stored on the CPU.

One way to solve this is to disable the cross-device metric synchronization:

avg_time = NoNanMeanMetric(sync_on_compute=False)

In our case, where we are measuring the average time, this solution is acceptable. However, in some cases, the metric synchronization is essential, and we may have no choice but to move the metric onto the GPU:

avg_time = NoNanMeanMetric().to(device)

Unfortunately, this situation gives rise to a new CPU-GPU sync event coming from the update function.

Trace of avg_time Metric Collection (by Author)

This sync event should hardly come as a surprise. After all, we are updating a GPU metric with a value residing on the CPU, which should necessitate a memory copy. However, in the case of a scalar metric, this data transfer can be completely avoided with a simple optimization.

Optimization 3: Perform Metric Updates with Tensors instead of Scalars

The solution is straightforward: instead of updating the metric with a float value, we convert to a Tensor before calling update.

batch_time = torch.as_tensor(batch_time)
avg_time.update(batch_time, torch.ones_like(batch_time))

This minor change bypasses the problematic line of code, eliminates the sync event, and restores the step time to the baseline performance.

At first glance, this result may seem surprising: we would expect that updating a GPU metric with a CPU tensor should still require a memory copy. However, PyTorch optimizes operations on scalar tensors by using a dedicated kernel that performs the addition without an explicit data transfer. This avoids the expensive synchronization event that would otherwise occur.
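
The behavior can be reproduced outside of TorchMetrics. In the sketch below a zero-dimensional tensor stands in for the metric state (the variable names are ours). PyTorch accepts a zero-dimensional CPU tensor as an operand alongside a GPU tensor, so the in-place addition goes through without our moving the operand to the GPU by hand. Without a GPU the example falls back to the CPU, where it still runs but no longer exercises the cross-device case.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for the metric state, which lives on the training device
accumulator = torch.zeros((), device=device)

# A host-side measurement, e.g. a step time in seconds
batch_time = 0.125

# 0-dim tensor that stays on the CPU
cpu_scalar = torch.as_tensor(batch_time)

# Accepted even when `accumulator` is on the GPU: a 0-dim CPU tensor is
# treated like a scalar operand
accumulator += cpu_scalar

print(accumulator.device, cpu_scalar.device)
```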

Summary

In this post, we explored how a naïve approach to TorchMetrics can introduce CPU-GPU synchronization events and significantly degrade PyTorch training performance. Using PyTorch Profiler, we identified the lines of code responsible for these sync events and applied targeted optimizations to eliminate them:

Explicitly specify a weight tensor when calling the MeanMetric.update function instead of relying on the default value.
Disable NaN checks in the BaseAggregator class or replace them with a more efficient alternative.
Carefully manage the device placement of each metric to minimize unnecessary transfers.
Disable cross-device metric synchronization when not required.
When the metric resides on a GPU, convert floating-point scalars to tensors before passing them to the update function to avoid implicit synchronization.

We have created a dedicated pull request on the TorchMetrics GitHub page covering some of the optimizations discussed in this post. Please feel free to contribute your own improvements and optimizations!
