    Load-Testing LLMs Using LLMPerf | Towards Data Science

By FinanceStarGate | April 18, 2025 | 9 Mins Read


Deploying a Large Language Model (LLM) is not necessarily the final step in productionizing your Generative AI application. An often forgotten, yet crucial part of the MLOps lifecycle is properly load testing your LLM and ensuring it is ready to withstand your expected production traffic. Load testing, at a high level, is the practice of testing your application, or in this case your model, with the traffic it would expect in a production environment to ensure that it is performant.

In the past we've discussed load testing traditional ML models using open-source Python tools such as Locust. Locust helps capture general performance metrics such as requests per second (RPS) and latency percentiles on a per-request basis. While this is effective for more traditional APIs and ML models, it doesn't capture the full story for LLMs.

LLMs traditionally have a much lower RPS and higher latency than traditional ML models due to their size and larger compute requirements. The RPS metric often doesn't provide the most accurate picture either, as requests can vary greatly depending on the input to the LLM. For instance, you might have one query asking to summarize a large chunk of text and another query that only requires a one-word response.

This is why tokens are seen as a much more accurate representation of an LLM's performance. At a high level, a token is a chunk of text: whenever an LLM processes your input, it "tokenizes" that input. What exactly a token is depends on the specific LLM you are using, but you can think of it as a word, a sequence of words, or a set of characters.
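To make this concrete, here is a minimal sketch of counting tokens for two very different prompts. It assumes the tiktoken library and its cl100k_base encoding purely for illustration; Claude and other models use their own tokenizers, so exact counts will vary by model.

import tiktoken  # assumption: tiktoken is installed; used only to illustrate tokenization

enc = tiktoken.get_encoding("cl100k_base")

short_prompt = "Answer yes or no: is Paris in France?"
long_prompt = "Summarize the following report: " + "Quarterly revenue grew steadily across all regions. " * 100

# Two requests, very different token counts -- RPS alone would treat them identically.
print(len(enc.encode(short_prompt)))  # a handful of tokens
print(len(enc.encode(long_prompt)))   # roughly a thousand tokens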

Image by Author

In this article we'll explore how to generate token-based metrics so you can understand how your LLM is performing from a serving/deployment perspective. After this article you'll have an idea of how to set up a load-testing tool specifically to benchmark different LLMs, whether you are evaluating many models, different deployment configurations, or a combination of both.

Let's get hands on! If you are more of a video-based learner, feel free to follow my corresponding YouTube video below:

NOTE: This article assumes a basic understanding of Python, LLMs, and Amazon Bedrock/SageMaker. If you are new to Amazon Bedrock, please refer to my starter guide here. If you want to learn more about SageMaker JumpStart LLM deployments, refer to the video here.

DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.

Table of Contents

1. LLM-Specific Metrics
2. LLMPerf Intro
3. Applying LLMPerf to Amazon Bedrock
4. Additional Resources & Conclusion

LLM-Specific Metrics

As we briefly discussed in the introduction with regard to LLM hosting, token-based metrics generally provide a much better representation of how your LLM is responding to different payload sizes or types of queries (summarization vs QnA).

Traditionally we have always tracked RPS and latency, which we will still see here, but more so at a token level. Here are some of the metrics to be aware of before we get started with load testing:

1. Time to First Token: This is the time it takes for the first token to be generated. It is especially useful when streaming; for instance, when using ChatGPT, we start processing information as soon as the first piece of text (token) appears.
2. Total Output Tokens Per Second: This is the total number of tokens generated per second; you can think of it as a more granular alternative to the requests per second we traditionally track.

These are the major metrics that we'll focus on, and there are a few others, such as inter-token latency, that will also be displayed as part of the load tests. Keep in mind that the parameters influencing these metrics include the expected input and output token sizes. We specifically play with these parameters to get an accurate understanding of how our LLM performs in response to different generation tasks. The sketch below shows how these values can be derived from per-token timestamps.
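As a rough illustration of how these token-level metrics fall out of a streaming response, here is a minimal sketch that records a timestamp per generated token and derives time to first token, output tokens per second, and inter-token latency. The stream_tokens generator is a hypothetical stand-in for whatever streaming client you use; it is not part of LLMPerf.

import time

def stream_tokens():
    # Hypothetical stand-in for a streaming LLM client yielding one token at a time.
    for token in ["The", " quick", " brown", " fox", "."]:
        time.sleep(0.05)
        yield token

request_start = time.perf_counter()
token_times = []
for _ in stream_tokens():
    token_times.append(time.perf_counter())

ttft = token_times[0] - request_start            # time to first token
total = token_times[-1] - request_start          # end-to-end generation time
tokens_per_second = len(token_times) / total     # total output tokens per second
inter_token = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token latencies

print(f"TTFT: {ttft:.3f}s, tokens/s: {tokens_per_second:.1f}, "
      f"mean inter-token latency: {sum(inter_token) / len(inter_token):.3f}s")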

Now let's take a look at a tool that enables us to toggle these parameters and display the relevant metrics we need.

    LLMPerf Intro

LLMPerf is built on top of Ray, a popular distributed computing Python framework. LLMPerf specifically leverages Ray to create distributed load tests where we can simulate real-time, production-level traffic.

Note that any load-testing tool can only generate your expected amount of traffic if the client machine it runs on has enough compute power to match that load. For instance, as you scale the concurrency or throughput expected for your model, you would also want to scale the client machine(s) where you are running your load test.

Now, specifically within LLMPerf, there are a few exposed parameters that are tailored for LLM load testing, as we've discussed:

• Model: This is the model provider and the hosted model that you're working with. For our use case it will be Amazon Bedrock and Claude 3 Sonnet specifically.
• LLM API: This is the API format in which the payload should be structured. We use LiteLLM, which provides a standardized payload structure across different model providers, simplifying the setup process, especially if we want to test different models hosted on different platforms.
• Input Tokens: The mean input token length; you can also specify a standard deviation for this number.
• Output Tokens: The mean output token length; you can also specify a standard deviation for this number.
• Concurrent Requests: The number of concurrent requests for the load test to simulate.
• Test Duration: You can control the duration of the test; this parameter is specified in seconds.

LLMPerf specifically exposes all of these parameters through its token_benchmark_ray.py script, which we configure with our specific values. Let's take a look now at how we can configure this specifically for Amazon Bedrock.

Applying LLMPerf to Amazon Bedrock

    Setup

For this example we'll be working in a SageMaker Classic Notebook Instance with a conda_python3 kernel and an ml.g5.12xlarge instance. Note that you want to select an instance with enough compute to generate the traffic load that you want to simulate. Make sure you also have your AWS credentials configured so that LLMPerf can access the hosted model, whether it is on Bedrock or SageMaker.
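If LLMPerf is not already installed in your notebook environment, the cell below is a minimal sketch following the install steps described in the LLMPerf GitHub repository (clone and editable install); adjust it to your own environment as needed.

%%sh
# Minimal install sketch: clone LLMPerf and install it in editable mode.
# LiteLLM is installed explicitly here in case it is not pulled in as a dependency.
git clone https://github.com/ray-project/llmperf.git
cd llmperf
pip install -e .
pip install litellm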

    LiteLLM Configuration

We first configure our LLM API structure of choice, which is LiteLLM in this case. LiteLLM supports various model providers; here we configure the completion API to work with Amazon Bedrock:

import os
from litellm import completion

os.environ["AWS_ACCESS_KEY_ID"] = "Enter your access key ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "Enter your secret access key"
os.environ["AWS_REGION_NAME"] = "us-east-1"

# The "bedrock/" prefix routes the request through Amazon Bedrock via LiteLLM.
response = completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"content": "Who is Roger Federer?", "role": "user"}]
)
output = response.choices[0].message.content
print(output)

To work with Bedrock we configure the model ID to point towards Claude 3 Sonnet and pass in our prompt. The neat part with LiteLLM is that the messages key has a consistent format across model providers.
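For example, if you later host a model on a SageMaker endpoint instead, only the model string changes while the messages payload stays identical. This is a hedged sketch: my-llama-endpoint is a hypothetical endpoint name, and it assumes LiteLLM's sagemaker/ provider prefix.

from litellm import completion

# Same messages payload, different provider -- only the model string changes.
# "my-llama-endpoint" is a hypothetical SageMaker endpoint name.
sagemaker_response = completion(
    model="sagemaker/my-llama-endpoint",
    messages=[{"content": "Who is Roger Federer?", "role": "user"}]
)
print(sagemaker_response.choices[0].message.content)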

Once the Bedrock completion call above executes successfully, we can focus on configuring LLMPerf for Bedrock specifically.

    LLMPerf Bedrock Integration

To execute a load test with LLMPerf we can simply use the provided token_benchmark_ray.py script and pass in the following parameters that we discussed earlier:

• Input Tokens Mean & Standard Deviation
• Output Tokens Mean & Standard Deviation
• Max number of completed requests for the test
• Duration of the test
• Concurrent requests

In this case we also specify our API format to be LiteLLM, and we can execute the load test with a simple shell script like the following:

%%sh
python llmperf/token_benchmark_ray.py \
    --model bedrock/anthropic.claude-3-sonnet-20240229-v1:0 \
    --mean-input-tokens 1024 \
    --stddev-input-tokens 200 \
    --mean-output-tokens 1024 \
    --stddev-output-tokens 200 \
    --max-num-completed-requests 30 \
    --num-concurrent-requests 1 \
    --timeout 300 \
    --llm-api litellm \
    --results-dir bedrock-outputs

In this case we keep the concurrency low, but feel free to adjust this number depending on what you're expecting in production. Our test will run for 300 seconds, and after that duration you should see an output directory with two files: one with statistics for each individual inference and one with the mean metrics across all requests for the duration of the test.

We can make this look a little neater by parsing the summary file with pandas:

import json
from pathlib import Path
import pandas as pd

# Load JSON files
individual_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_individual_responses.json")
summary_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_summary.json")

with open(individual_path, "r") as f:
    individual_data = json.load(f)

with open(summary_path, "r") as f:
    summary_data = json.load(f)

# Build a per-request DataFrame and print the summary metrics
df = pd.DataFrame(individual_data)
summary_metrics = {
    "Model": summary_data.get("model"),
    "Mean Input Tokens": summary_data.get("mean_input_tokens"),
    "Stddev Input Tokens": summary_data.get("stddev_input_tokens"),
    "Mean Output Tokens": summary_data.get("mean_output_tokens"),
    "Stddev Output Tokens": summary_data.get("stddev_output_tokens"),
    "Mean TTFT (s)": summary_data.get("results_ttft_s_mean"),
    "Mean Inter-token Latency (s)": summary_data.get("results_inter_token_latency_s_mean"),
    "Mean Output Throughput (tokens/s)": summary_data.get("results_mean_output_throughput_token_per_s"),
    "Completed Requests": summary_data.get("results_num_completed_requests"),
    "Error Rate": summary_data.get("results_error_rate")
}
print("Claude 3 Sonnet - Performance Summary:\n")
for k, v in summary_metrics.items():
    print(f"{k}: {v}")
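Since the per-request results are already loaded into the DataFrame df above, you can also slice them for percentile views. This is a hedged sketch: it assumes the per-request records carry fields such as ttft_s and number_output_tokens, which is how LLMPerf's individual-responses output is commonly structured, so verify the keys against your own results file.

# Hedged sketch: the column names below are assumed from typical LLMPerf
# per-request output -- check the keys in your own individual_responses file.
if "ttft_s" in df.columns:
    print(df["ttft_s"].describe(percentiles=[0.5, 0.9, 0.99]))
if "number_output_tokens" in df.columns:
    print(df["number_output_tokens"].describe())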

The final load test results will look something like the following:

Screenshot by Author

As we can see, the output reflects the input parameters we configured, along with the corresponding results: time to first token (in seconds) and throughput measured as mean output tokens per second.

In a real-world use case you might use LLMPerf across many different model providers and run tests across those platforms. Used this way, the tool can help you holistically identify the right model and deployment stack for your use case at scale.
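If you do want to sweep several models or deployment configurations, one lightweight pattern is to loop over the same token_benchmark_ray.py invocation with different model strings and result directories. Below is a minimal sketch under the assumption that you run it from the same directory as the earlier shell cell; the model IDs in the list are examples only.

import subprocess

# Example model IDs only -- swap in whatever Bedrock models or other providers you are evaluating.
models = [
    "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    "bedrock/anthropic.claude-3-haiku-20240307-v1:0",
]

for model in models:
    out_dir = f"outputs-{model.split('/')[-1].replace(':', '-')}"
    subprocess.run(
        [
            "python", "llmperf/token_benchmark_ray.py",
            "--model", model,
            "--mean-input-tokens", "1024", "--stddev-input-tokens", "200",
            "--mean-output-tokens", "1024", "--stddev-output-tokens", "200",
            "--max-num-completed-requests", "30",
            "--num-concurrent-requests", "1",
            "--timeout", "300",
            "--llm-api", "litellm",
            "--results-dir", out_dir,
        ],
        check=True,
    )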

Additional Resources & Conclusion

The entire code for this sample can be found in the associated GitHub repository. If you would also like to work with SageMaker endpoints, you can find a Llama JumpStart deployment load-testing sample here.

All in all, load testing and evaluation are both crucial to ensuring that your LLM is performant against your expected traffic before pushing to production. In future articles we'll cover not just the evaluation portion, but how we can create a holistic test with both components.

As always, thank you for reading, and feel free to leave any feedback and connect with me on LinkedIn and X.


