
    Agentic AI 102: Guardrails and Agent Evaluation

    By FinanceStarGate, May 17, 2025. 13 min read.


    In the first post of this series (Agentic AI 101: Starting Your Journey Building AI Agents), we talked about the fundamentals of creating AI Agents and introduced concepts like reasoning, memory, and tools.

    Of course, that first post only scratched the surface of this new area of the data industry. There is so much more that can be done, and we are going to learn more along the way in this series.

    So, it’s time to take one step further.

    In this post, we will cover three topics:

    1. Guardrails: these are safety blocks that prevent a Large Language Model (LLM) from responding about certain topics.
    2. Agent Evaluation: Have you ever thought about how accurate the responses from LLMs are? I bet you did. So we will see the main ways to measure that.
    3. Monitoring: We will also learn about the built-in monitoring app in Agno’s framework.

    Let’s begin.

    Guardrails

    Our first topic is the simplest, in my opinion. Guardrails are rules that keep an AI agent from responding to a given topic or list of topics.

    I believe there is a good chance that you have asked ChatGPT or Gemini something and received a response like “I can’t talk about this topic”, or “Please consult a professional specialist”, something like that. Usually, that occurs with sensitive topics like health advice, psychological conditions, or financial advice.

    These blocks are safeguards to prevent people from hurting themselves, their health, or their wallet. As we know, LLMs are trained on massive amounts of text, ergo inheriting a lot of bad content with it, which could easily lead to bad advice in these areas. And I didn’t even mention hallucinations!

    Think about how many stories there are of people who lost money by following investment tips from online forums. Or how many people took the wrong medicine because they read about it on the internet.

    Well, I guess you got the point. We must prevent our agents from talking about certain topics or taking certain actions. For that, we will use guardrails.

    The best framework I found to impose these blocks is Guardrails AI [1]. There, you will see a hub full of predefined rules that a response must follow in order to pass and be displayed to the user.

    To get started quickly, first go to this link [2] and get an API key. Then, install the package. Next, type the guardrails setup command. It will ask you a couple of questions that you can answer n (for No), and it will ask you to enter the API key you generated.

    pip install guardrails-ai
    guardrails configure

    Once that’s completed, go to the Guardrails AI Hub [3] and choose the guardrail you need. Every guardrail has instructions on how to implement it. Basically, you install it via the command line and then use it like a module in Python.

    For this example, we’re choosing one called Restrict to Topic [4], which, as its name says, lets the user talk only about what’s in the list. So, go back to the terminal and install it using the command below.

    guardrails hub install hub://tryolabs/restricttotopic

    Next, let’s open our Python script and import some modules.

    # Imports
    from agno.agent import Agent
    from agno.models.google import Gemini
    import os
    
    # Import Guard and Validator
    from guardrails import Guard
    from guardrails.hub import RestrictToTopic
    

    Next, we create the guard. We’ll restrict our agent to talk only about sports or the weather, and we’ll explicitly block it from talking about stocks.

    # Setup Guard
    guard = Guard().use(
        RestrictToTopic(
            valid_topics=["sports", "weather"],
            invalid_topics=["stocks"],
            disable_classifier=True,
            disable_llm=False,
            on_fail="filter"
        )
    )

    Now we will run the agent and the guard.

    # Create agent
    agent = Agent(
        model=Gemini(id="gemini-1.5-flash",
                     api_key=os.environ.get("GEMINI_API_KEY")),
        description="An assistant agent",
        instructions=["Be succinct. Reply in maximum two sentences"],
        markdown=True
        )
    
    # Run the agent and keep only the text of the response
    response = agent.run("What's the ticker symbol for Apple?").content
    
    # Run agent with validation
    validation_step = guard.validate(response)
    
    # Print validated response
    if validation_step.validation_passed:
        print(response)
    else:
        print("Validation Failed", validation_step.validation_summaries[0].failure_reason)
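    The run-then-validate pattern above can be folded into one helper. The following is a minimal sketch, not part of Guardrails AI or Agno: guarded_reply, FALLBACK_MESSAGE, fake_agent, and fake_guard are hypothetical names, and the only assumption about the guard is that its result exposes a validation_passed attribute, as in the snippet above.

```python
from types import SimpleNamespace
from typing import Any, Callable

# Hypothetical wording for the message shown when the guard blocks an answer.
FALLBACK_MESSAGE = "Sorry, I can't talk about this topic."


def guarded_reply(
    prompt: str,
    run_agent: Callable[[str], str],
    validate: Callable[[str], Any],
    fallback: str = FALLBACK_MESSAGE,
) -> str:
    """Return the agent's answer only when the guard accepts it."""
    answer = run_agent(prompt)
    outcome = validate(answer)  # assumed to expose .validation_passed
    return answer if outcome.validation_passed else fallback


# Stand-ins so the sketch runs without API keys (not the real agent or guard).
def fake_agent(prompt: str) -> str:
    return f"Echo: {prompt}"


def fake_guard(text: str) -> SimpleNamespace:
    return SimpleNamespace(validation_passed="ticker" not in text.lower())


print(guarded_reply("Who is Michael Jordan?", fake_agent, fake_guard))
print(guarded_reply("What is the ticker symbol for Apple?", fake_agent, fake_guard))
```

    With the objects from this post, the call would presumably look like guarded_reply(question, lambda p: agent.run(p).content, guard.validate).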

    This is the response when we ask about a stock symbol.

    Validation Failed Invalid topics found: ['stocks']

    If I ask about a topic that is not on the valid_topics list, I will also see a block.

    "What's the number one soda drink?"
    Validation Failed No valid topic was found.

    Finally, let’s ask about sports.

    "Who is Michael Jordan?"
    Michael Jordan is a former professional basketball player widely considered one of 
    the greatest of all time.  He won six NBA championships with the Chicago Bulls.

    And we saw a response this time, as it is a valid topic.
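    The three outcomes above follow a simple decision rule, which can be sketched in plain Python. This is only an illustration of the rule as observed in the outputs, not the validator’s actual implementation: check_topics is a hypothetical function, and the detected topics are assumed to come from some upstream classifier or LLM.

```python
from typing import List, Tuple

VALID_TOPICS = ["sports", "weather"]
INVALID_TOPICS = ["stocks"]


def check_topics(
    detected: List[str],
    valid_topics: List[str],
    invalid_topics: List[str],
) -> Tuple[bool, str]:
    """Apply the block rules to topics already detected in a text."""
    # Rule 1: any explicitly forbidden topic blocks the response.
    found_invalid = [t for t in detected if t in invalid_topics]
    if found_invalid:
        return False, f"Invalid topics found: {found_invalid}"
    # Rule 2: at least one allowed topic must be present.
    if not any(t in valid_topics for t in detected):
        return False, "No valid topic was found."
    return True, ""


for topics in (["stocks"], ["beverages"], ["sports"]):
    print(topics, check_topics(topics, VALID_TOPICS, INVALID_TOPICS))
```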

    Let’s move on to the evaluation of agents now.

    Agent Evaluation

    Since I started studying LLMs and Agentic AI, one of my main questions has been about model evaluation. Unlike traditional Data Science modeling, where you have structured metrics that are adequate for each case, for AI Agents, this is more blurry.

    Fortunately, the developer community is pretty quick at finding solutions for almost everything, and so they created this nice package for LLM evaluation: deepeval.

    DeepEval [5] is a library created by Confident AI that gathers many methods to evaluate LLMs and AI Agents. In this section, let’s learn a couple of the main methods, just so we can build some intuition on the subject, and also because the library is quite extensive.

    The first evaluation is the most basic we can use, and it is called G-Eval. As AI tools like ChatGPT become more common in everyday tasks, we have to make sure they’re giving helpful and accurate responses. That’s where G-Eval from the DeepEval Python package comes in.

    G-Eval is like a smart reviewer that uses another AI model to evaluate how well a chatbot or AI assistant is performing. For example, my agent runs Gemini, and I’m using OpenAI to assess it. Instead of relying on a human reviewer, this method asks an AI to “grade” another AI’s answers based on things like relevance, correctness, and clarity.

    It’s a nice way to test and improve generative AI systems in a more scalable way. Let’s quickly code an example. We’ll import the modules, create a prompt and a simple chat agent, and ask it for a description of the weather for the month of May in NYC.

    # Imports
    from agno.agent import Agent
    from agno.models.google import Gemini
    import os
    # Evaluation Modules
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams
    from deepeval.metrics import GEval
    
    # Prompt
    prompt = "Describe the weather in NYC for May"
    
    # Create agent
    agent = Agent(
        model=Gemini(id="gemini-1.5-flash",
                     api_key=os.environ.get("GEMINI_API_KEY")),
        description="An assistant agent",
        instructions=["Be succinct"],
        markdown=True,
        monitoring=True
        )
    
    # Run agent
    response = agent.run(prompt)
    
    # Print response
    print(response.content)

    It responds: “Mild, with average highs in the 60s°F and lows in the 50s°F. Expect some rain”.

    Nice. Looks pretty good to me.

    But how can we put a number on it and show a potential manager or client how our agent is doing?

    Here is how:

    1. Create a test case passing the prompt and the response to the LLMTestCase class.
    2. Create a metric. We’ll use the GEval method and add a prompt for the model to test it for coherence, and then I give it the meaning of what coherence is to me.
    3. Give the output as evaluation_params.
    4. Run the measure method and get the score and reason from it.

    # Test Case (actual_output expects the response text, so we pass .content)
    test_case = LLMTestCase(input=prompt, actual_output=response.content)
    
    # Setup the Metric
    coherence_metric = GEval(
        name="Coherence",
        criteria="Coherence. The agent can answer the prompt and the response makes sense.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
    )
    
    # Run the metric
    coherence_metric.measure(test_case)
    print(coherence_metric.score)
    print(coherence_metric.reason)

    The output looks like this.

    0.9
    The response directly addresses the prompt about NYC weather in May, 
    maintains logical consistency, flows naturally, and uses clear language. 
    However, it could be slightly more detailed.

    0.9 seems pretty good, given that the default threshold is 0.5.
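    To make the role of the threshold concrete, here is a small sketch in plain Python. It assumes that a score equal to the threshold counts as a pass. The passes and pass_rate helpers and the extra scores are hypothetical and only illustrate the arithmetic; they are not DeepEval functions.

```python
from typing import Iterable

DEFAULT_THRESHOLD = 0.5  # the default threshold mentioned above


def passes(score: float, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """A test case passes when its score reaches the threshold (assumption)."""
    return score >= threshold


def pass_rate(scores: Iterable[float], threshold: float = DEFAULT_THRESHOLD) -> float:
    """Share of test cases whose score reaches the threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("at least one score is required")
    return sum(passes(s, threshold) for s in scores) / len(scores)


print(passes(0.9))                      # the coherence score from this post
print(pass_rate([0.9, 0.4, 0.7, 0.2]))  # hypothetical scores for four test cases
```

    Reporting a pass rate over many prompts, instead of a single score, is usually easier to explain to a manager or client.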

    If you want to check the logs, use this next snippet.

    # Check the logs
    print(coherence_metric.verbose_logs)

    Here’s the response.

    Criteria:
    Coherence. The agent can answer the prompt and the response makes sense.
    
    Evaluation Steps:
    [
        "Assess whether the response directly addresses the prompt; if it aligns,
     it scores higher on coherence.",
        "Evaluate the logical flow of the response; responses that present ideas
     in a clear, organized manner rank better in coherence.",
        "Consider the relevance of examples or evidence provided; responses that 
    include pertinent information enhance their coherence.",
        "Check for clarity and consistency in terminology; responses that maintain
     clear language without contradictions achieve a higher coherence rating."
    ]
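    The log shows the two ingredients a judge model works with: the criteria and a list of evaluation steps. As a rough sketch of the LLM-as-a-judge idea, and not DeepEval’s internal prompt or scoring, the code below assembles a grading prompt from those pieces and normalizes the judge’s reply. The function names and the 0 to 10 scale are assumptions made for illustration.

```python
import re
from typing import List


def build_judge_prompt(criteria: str, steps: List[str], actual_output: str) -> str:
    """Assemble a grading prompt from criteria, evaluation steps, and the answer."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        f"Criteria:\n{criteria}\n\n"
        f"Evaluation Steps:\n{numbered}\n\n"
        f"Response to grade:\n{actual_output}\n\n"
        "Reply with a single integer score from 0 to 10."
    )


def parse_score(judge_reply: str) -> float:
    """Extract the first standalone integer from 0 to 10 and scale it to 0-1."""
    match = re.search(r"\b(10|[0-9])\b", judge_reply)
    if match is None:
        raise ValueError("no score found in judge reply")
    return int(match.group(1)) / 10


judge_prompt = build_judge_prompt(
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    steps=["Assess whether the response directly addresses the prompt."],
    actual_output="Mild, with average highs in the 60s°F. Expect some rain.",
)
print(judge_prompt)
print(parse_score("Score: 9"))  # hypothetical judge reply
```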

    Very nice. Now let us learn about another interesting use case, which is the evaluation of task completion for AI Agents. In other words, how our agent is doing when it is asked to perform a task, and how much of that task the agent can deliver.

    First, we’re creating a simple agent that can access Wikipedia and summarize the topic of the query.

    # Imports
    from agno.agent import Agent
    from agno.models.google import Gemini
    from agno.tools.wikipedia import WikipediaTools
    import os
    from deepeval.test_case import LLMTestCase, ToolCall
    from deepeval.metrics import TaskCompletionMetric
    from deepeval import evaluate
    
    # Prompt
    prompt = "Search wikipedia for 'Time series analysis' and summarize the 3 main points"
    
    # Create agent
    agent = Agent(
        model=Gemini(id="gemini-2.0-flash",
                     api_key=os.environ.get("GEMINI_API_KEY")),
        description="You are a researcher specialized in searching the wikipedia.",
        tools=[WikipediaTools()],
        show_tool_calls=True,
        markdown=True,
        read_tool_call_history=True
        )
    
    # Run agent
    response = agent.run(prompt)
    
    # Print response
    print(response.content)

    The result looks very good. Let’s evaluate it using the TaskCompletionMetric class.

    # Create a Metric
    metric = TaskCompletionMetric(
        threshold=0.7,
        model="gpt-4o-mini",
        include_reason=True
    )
    
    # Test Case
    test_case = LLMTestCase(
        input=prompt,
        actual_output=response.content,
        tools_called=[ToolCall(name="wikipedia")]
        )
    
    # Evaluate
    evaluate(test_cases=[test_case], metrics=[metric])

    Output, including the agent’s response.

    ======================================================================
    
    Metrics Summary
    
      - ✅ Task Completion (score: 1.0, threshold: 0.7, strict: False, 
    evaluation model: gpt-4o-mini, 
    reason: The system successfully searched for 'Time series analysis' 
    on Wikipedia and provided a clear summary of the 3 main points, 
    fully aligning with the user's goal., error: None)
    
    For test case:
    
      - input: Search wikipedia for 'Time series analysis' and summarize the 3 main points
      - actual output: Here are the 3 main points about Time series analysis based on the
     Wikipedia search:
    
    1.  **Definition:** A time series is a sequence of data points indexed in time order,
     often taken at successive, equally spaced points in time.
    2.  **Applications:** Time series analysis is used in various fields like statistics,
     signal processing, econometrics, weather forecasting, and more, wherever temporal 
    measurements are involved.
    3.  **Purpose:** Time series analysis involves methods for extracting meaningful 
    statistics and characteristics from time series data, and time series forecasting 
    uses models to predict future values based on past observations.
    
      - expected output: None
      - context: None
      - retrieval context: None
    
    ======================================================================
    
    Overall Metric Pass Rates
    
    Task Completion: 100.00% pass rate
    
    ======================================================================
    
    ✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results
     on Confident AI.

    Our agent passed the test with honors: 100%!

    You can learn much more about the DeepEval library in this link [8].

    Finally, in the next section, we will learn the capabilities of Agno’s library for monitoring agents.

    Agent Monitoring

    Like I told you in my previous post [9], I chose Agno to learn more about Agentic AI. Just to be clear, this is not a sponsored post. It is just that I think this is the best option for those starting their journey learning about this topic.

    So, one of the cool things we can take advantage of when using Agno’s framework is the app they make available for model monitoring.

    Take this agent that can search the internet and write Instagram posts, for example.

    # Imports
    import os
    from agno.agent import Agent
    from agno.models.google import Gemini
    from agno.tools.file import FileTools
    from agno.tools.googlesearch import GoogleSearchTools
    
    
    # Topic
    topic = "Healthy Eating"
    
    # Create agent
    agent = Agent(
        model=Gemini(id="gemini-1.5-flash",
                     api_key=os.environ.get("GEMINI_API_KEY")),
        description=f"""You are a social media marketer specialized in creating engaging content.
        Search the internet for 'trending topics about {topic}' and use them to create a post.""",
        tools=[FileTools(save_files=True),
               GoogleSearchTools()],
        expected_output="""A short post for instagram and a prompt for a picture related to the content of the post.
        Don't use emojis or special characters in the post. If you find an error in the character encoding, remove the character before saving the file.
        Use the template:
        - Post
        - Prompt for the picture
        Save the post to a file named 'post.txt'.""",
        show_tool_calls=True,
        monitoring=True)
    
    # Writing and saving a file (f-string so that {topic} is actually filled in)
    agent.print_response(f"""Write a short post for instagram with tips and tricks that positions me as 
                         an authority in {topic}.""",
                         markdown=True)

    To monitor its performance, follow these steps:

    1. Go to https://app.agno.com/settings and get an API Key.
    2. Open a terminal and type ag setup.
    3. If it is the first time, it might ask for the API Key. Copy and paste it in the terminal prompt.
    4. You will see the Dashboard tab open in your browser.
    5. If you want to monitor your agent, add the argument monitoring=True.
    6. Run your agent.
    7. Go to the Dashboard on the web browser.
    8. Click on Sessions. As it is a single agent, you will see it under the tab Agents on the top portion of the page.

    [Figure: Agno Dashboard after running the agent. Image by the author.]

    The cool features we can see there are:

    • Info about the model
    • The response
    • Tools used
    • Tokens consumed

    [Figure: This is the resulting token consumption while saving the file. Image by the author.]

    Pretty neat, huh?

    This is useful for us to know where the agent is spending more or fewer tokens, and where it is taking more time to perform a task, for example.
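    The same kind of question can be answered offline once you have the numbers. The sketch below uses made-up records; the step names, token counts, and timings are hypothetical and are not exported from Agno’s dashboard. It only shows how to total tokens and time per step and find the most expensive one.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical run records: (step name, tokens used, seconds elapsed).
RUNS = [
    ("google_search", 850, 1.9),
    ("write_post", 420, 1.1),
    ("save_file", 1310, 2.4),
    ("write_post", 380, 0.9),
]


def summarize(runs: Iterable[Tuple[str, int, float]]) -> Dict[str, Tuple[int, float]]:
    """Total tokens and seconds per step."""
    totals = defaultdict(lambda: [0, 0.0])
    for step, tokens, seconds in runs:
        totals[step][0] += tokens
        totals[step][1] += seconds
    return {step: (tok, sec) for step, (tok, sec) in totals.items()}


summary = summarize(RUNS)
costliest = max(summary, key=lambda step: summary[step][0])  # most tokens
slowest = max(summary, key=lambda step: summary[step][1])    # most time

for step, (tokens, seconds) in summary.items():
    print(f"{step}: {tokens} tokens, {seconds:.1f}s")
print("Most tokens:", costliest)
print("Most time:", slowest)
```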

    Well, let’s wrap up then.

    Before You Go

    We have learned a lot in this second round. In this post, we covered:

    • Guardrails for AI are essential safety measures and ethical guidelines implemented to prevent unintended harmful outputs and ensure responsible AI behavior.
    • Model evaluation, exemplified by GEval for broad evaluation and TaskCompletion with DeepEval for the quality of an agent’s output, is crucial for understanding AI capabilities and limitations.
    • Model monitoring with Agno’s app, including tracking token usage and response time, is vital for managing costs, ensuring performance, and identifying potential issues in deployed AI systems.

    Contact & Follow Me

    If you liked this content, find more of my work on my website.

    https://gustavorsantos.me

    GitHub Repository

    https://github.com/gurezende/agno-ai-labs

    References

    [1. Guardrails AI] https://www.guardrailsai.com/docs/getting_started/guardrails_server

    [2. Guardrails AI Auth Key] https://hub.guardrailsai.com/keys

    [3. Guardrails AI Hub] https://hub.guardrailsai.com/

    [4. Guardrails Restrict to Topic] https://hub.guardrailsai.com/validator/tryolabs/restricttotopic

    [5. DeepEval] https://www.deepeval.com/docs/getting-started

    [6. DataCamp – DeepEval Tutorial] https://www.datacamp.com/tutorial/deepeval

    [7. DeepEval TaskCompletion] https://www.deepeval.com/docs/metrics-task-completion

    [8. LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide] https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation

    [9. Agentic AI 101: Starting Your Journey Building AI Agents] https://towardsdatascience.com/agentic-ai-101-starting-your-journey-building-ai-agents/


