Agentic RAG Applications: Company Knowledge Slack Agents

I assumed that most companies would have built or implemented their own RAG agents by now.

An AI knowledge agent can dig through internal documentation (websites, PDFs, random docs) and answer employees in Slack (or Teams/Discord) within a few seconds. So, these bots should significantly reduce the time employees spend sifting through information.

I’ve seen a few of these in bigger tech companies, like AskHR from IBM, but they aren’t all that mainstream yet.

If you’re keen to understand how they’re built and how many resources it takes to build a simple one, this is an article for you.

Parts this article will go through | Image by author

I’ll go through the tools, techniques, and architecture involved, while also looking at the economics of building something like this. I’ll also include a section on what you’ll end up focusing on the most.

Things you’ll spend time on | Image by author

There is also a demo at the end for what this will look like in Slack.

If you’re already familiar with RAG, feel free to skip the next section. It’s just a bit of repetitive stuff around agents and RAG.

What’s RAG and Agentic RAG?

Most of you who read this will know what Retrieval-Augmented Generation (RAG) is, but if you’re new to it, it’s a way to fetch information that gets fed into the large language model (LLM) before it answers the user’s question.

This allows us to provide relevant information from various documents to the bot in real time so it can answer the user correctly.

This retrieval system is doing more than simple keyword search, as it finds similar matches rather than just exact ones. For example, if someone asks about fonts, a similarity search might return documents on typography.
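To make the fonts/typography example concrete, here is a minimal sketch of similarity search using cosine similarity. The three-dimensional vectors and document titles are hand-made assumptions for illustration only; a real system would get its vectors from an embedding model.

```python
import math

# Hand-made 3-dimensional "embeddings" purely for illustration;
# a real embedding model would produce hundreds of dimensions.
DOCS = {
    "Typography guidelines": [0.9, 0.1, 0.0],
    "Payroll schedule": [0.0, 0.2, 0.9],
    "Brand colour palette": [0.6, 0.7, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, docs, top_k=2):
    """Return the titles of the top_k documents most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), title) for title, vec in docs.items()]
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_k]]


# Pretend this is the embedding of the question "which fonts should I use?"
query = [0.8, 0.2, 0.0]
results = search(query, DOCS)
print(results)
```

Note that the word "fonts" never appears in any title; the typography document ranks first only because its vector points in a similar direction to the query.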

Many would say that RAG is a fairly simple concept to understand, but how you store information, how you fetch it, and what kind of embedding models you use still matter a lot.

If you’re keen to learn more about embeddings and retrieval, I’ve written about this here.

Today, people have gone further and primarily work with agent systems.

In agent systems, the LLM can decide where and how it should fetch information, rather than just having content dumped into its context before generating a response.

Agent system with RAG tools: the yellow dot is the agent and the grey dots are the tools | Image by author

It’s important to remember that just because more advanced tools exist doesn’t mean you should always use them. You want to keep the system intuitive and also keep API calls to a minimum.

With agent systems the API calls will increase, since the agent needs to call at least one tool and then make another call to generate a response.

That said, I really like the user experience of the bot “going somewhere” (to a tool) to look something up. Seeing that flow in Slack helps the user understand what’s happening.

But going with an agent or using a full framework isn’t necessarily the better choice. I’ll elaborate on this as we continue.

Technical Stack

There’s a ton of options for agent frameworks, vector databases, and deployment, so I’ll go through some.

For the deployment option, since we’re working with Slack webhooks, we’re dealing with event-driven architecture where the code only runs when there’s a question in Slack.

To keep costs to a minimum, we can use serverless functions. The choice is either going with AWS Lambda or picking a new vendor.

Lambda vs Modal comparison, find the full table here | Image by author

Platforms like Modal are technically built to serve LLM models, but they work well for long-running ETL processes, and for LLM apps in general.

Modal hasn’t been battle-tested as much, and you’ll notice that in terms of latency, but it’s very easy to work with and offers super cheap CPU pricing.

I should note though that when setting this up with Modal on the free tier, I’ve had a few 500 errors, but that might be expected.

As for how to pick the agent framework, this is completely optional. I did a comparison piece a few weeks ago on open-source agentic frameworks that you can find here, and the one I left out was LlamaIndex.

So I decided to give it a try here.

The last thing you need to pick is a vector database, or a database that supports vector search. This is where we store the embeddings and other metadata, so we can perform similarity search when a user’s query comes in.

There are a lot of options out there, but I think the ones with the highest potential are Weaviate, Milvus, pgvector, Redis, and Qdrant.

Vector DBs comparison, find the full table here | Image by author

Both Qdrant and Milvus have pretty generous free tiers for their cloud options. Qdrant, I know, allows us to store both dense and sparse vectors. LlamaIndex, along with most agent frameworks, supports many different vector databases, so any can work.

I’ll try Milvus more in the future to compare performance and latency, but for now, Qdrant works well.

Redis is a solid pick too, or really any vector extension of your existing database.

Cost & time to build

In terms of time and cost, you have to account for engineering hours, cloud, embedding, and large language model (LLM) costs.

It doesn’t take that much time to boot up a framework to run something minimal. What takes time is connecting the content properly, prompting the system, parsing the outputs, and making sure it runs fast enough.

But if we turn to overhead costs, the cloud costs to run the agent system are minimal for just one bot for one company using serverless functions, as you saw in the table in the last section.

However, for the vector databases, it will get more expensive the more data you store.

Both Zilliz and Qdrant Cloud have a good amount of free tier for your first 1 to 5 GB of data, so unless you go beyond a few thousand chunks you may not pay for anything.

Vector DBs comparison for costs, find the full table here | Image by author

You’ll start paying though once you go beyond the thousands mark, with Weaviate being the most expensive of the vendors above.

As for the embeddings, these are generally very cheap.

You can see a table below on using OpenAI’s text-embedding-3-small with chunks of different sizes when you embed 1 to 10 million texts.

Embedding costs per chunk examples, find the full table here | Image by author

When people start optimizing for embeddings and storage, they’ve usually moved beyond embedding millions of texts.
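The arithmetic behind that kind of table is simple enough to sketch. The price of $0.02 per million tokens is an assumption based on OpenAI's published list price for text-embedding-3-small at one point in time, and the chunk sizes are illustrative, so check current pricing before relying on the numbers.

```python
def embedding_cost_usd(num_chunks, tokens_per_chunk, price_per_million_tokens=0.02):
    """One-off cost of embedding num_chunks chunks of a given token length.

    price_per_million_tokens is an assumed list price, not a guaranteed one.
    """
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens


# Illustrative grid: 1M and 10M chunks at three chunk sizes
for num_chunks in (1_000_000, 10_000_000):
    for tokens_per_chunk in (200, 500, 1000):
        cost = embedding_cost_usd(num_chunks, tokens_per_chunk)
        print(f"{num_chunks:>10,} chunks x {tokens_per_chunk:>4} tokens = ${cost:,.2f}")
```

Under these assumptions, even the largest case (10 million chunks of 1,000 tokens) is a one-off cost in the hundreds of dollars, which is why embeddings are rarely the first thing to optimize.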

The one thing that matters the most though is which large language model (LLM) you use. You need to think about API prices, since an agent system will typically call an LLM two to four times per run.

Example prices for LLMs in agent systems, full table here | Image by author

For this system, I’m using GPT-4o-mini or Gemini Flash 2.0, which are the cheapest options.

So let’s say a company is using the bot a few hundred times per day and each run costs us 2–4 API calls. We’d end up at less than a dollar per day and around $10–50 per month.
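A rough sketch of how an estimate like this can be worked out. Every number here is an assumption chosen for illustration: 300 runs per day, 3 LLM calls per run, 3,000 input and 300 output tokens per call, and prices of $0.15 and $0.60 per million input and output tokens (in the range of a small model's list price at the time of writing).

```python
def run_cost_usd(calls_per_run, input_tokens, output_tokens, input_price, output_price):
    """Cost of one agent run; prices are USD per million tokens."""
    per_call = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return calls_per_run * per_call


RUNS_PER_DAY = 300  # assumed: "a few hundred" questions per day

daily = RUNS_PER_DAY * run_cost_usd(
    calls_per_run=3,      # assumed midpoint of 2-4 calls
    input_tokens=3000,    # assumed prompt + retrieved chunks per call
    output_tokens=300,    # assumed answer length per call
    input_price=0.15,     # assumed USD per 1M input tokens
    output_price=0.60,    # assumed USD per 1M output tokens
)
monthly = daily * 30

print(f"~${daily:.2f} per day, ~${monthly:.2f} per month")
```

Swapping in the per-token prices of a larger model is enough to see how quickly the monthly figure scales.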

You can see that switching to a more expensive model would increase the monthly bill by 10x to 100x. Using ChatGPT is mostly subsidized for free users, but when you build your own applications you’ll be financing it.

There will be smarter and cheaper models in the future, so whatever you build now will likely improve over time. But start small, because costs add up and for simple systems like this you don’t need the models to be exceptional.

The next section will get into how to build this system.

The architecture (processing documents)

The system has two parts. The first is how we split up documents (what we call chunking) and embed them. This first part is important, as it will dictate how the agent answers later.

Splitting up documents into different chunks attached with metadata | Image by author

So, to make sure you’re preparing all the sources properly, you need to think carefully about how to chunk them.

If you look at the document above, you can see that we can miss context if we split the document based on headings but also on the number of characters, where the paragraphs attached to the first heading are split up for being too long.

Losing context in chunks | Image by author

You need to be smart about ensuring each chunk has enough context (but not too much). You also need to make sure the chunk is attached to metadata so it’s easy to trace back to where it was found.
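One way to keep context when a section is too long is to repeat the heading on every piece and attach metadata to each chunk. The sketch below is a simplified illustration, not the Docling-based pipeline used for this project; the `Chunk` class, the `chunk_sections` function, and the metadata field names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def _make_chunk(heading, text, source_url, part):
    # Prefix the heading so a split-off paragraph still carries its context
    return Chunk(
        text=f"{heading}\n\n{text}",
        metadata={"source_url": source_url, "heading": heading, "part": part},
    )


def chunk_sections(sections, source_url, max_chars=500):
    """Split (heading, body) pairs into chunks of roughly max_chars.

    Paragraphs are never cut in half; a single paragraph longer than
    max_chars is kept whole in its own chunk.
    """
    chunks = []
    for heading, body in sections:
        paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
        current, part = "", 0
        for paragraph in paragraphs:
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if current and len(candidate) > max_chars:
                chunks.append(_make_chunk(heading, current, source_url, part))
                part += 1
                current = paragraph
            else:
                current = candidate
        if current:
            chunks.append(_make_chunk(heading, current, source_url, part))
    return chunks


sections = [
    ("Brand fonts", "A" * 300 + "\n\n" + "B" * 300),
    ("Logo", "Use the SVG logo."),
]
chunks = chunk_sections(sections, "https://example.com/brand", max_chars=500)
for chunk in chunks:
    print(chunk.metadata, len(chunk.text))
```

Here the "Brand fonts" section is split in two, but both pieces start with the heading and point back to the same source URL.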

Setting metadata to the sources to trace back to where the chunks were found | Image by author

This is where you’ll spend the most time, and honestly, I think there should be better tools out there to do this intelligently.

I ended up using Docling for PDFs, building it out to attach elements based on headings and paragraph sizes. For web pages, I built a crawler that looked over page elements to decide whether to chunk based on anchor tags, headings, or general content.

Remember, if the bot is supposed to cite sources, each chunk needs to be attached to URLs, anchor tags, page numbers, block IDs, or permalinks so the system can correctly locate the information being used.

Since most of the content you’re working with is scattered and often low quality, I also decided to summarize texts using an LLM. These summaries were given different labels with higher authority, which meant they were prioritized during retrieval.

Summarizing docs with higher authority | Image by author

There is also the option to push the summaries into their own tools and keep deep-dive information separate, letting the agent decide which one to use. But it will look strange to users, as it’s not intuitive behavior.

Still, I have to stress that if the quality of the source information is poor, it’s hard to make the system work well.

For example, if a user asks how an API request should be made and there are four different web pages giving different answers, the bot won’t know which one is most relevant.

To demo this, I had to do some manual review. I also had AI do deeper research around the company to help fill in gaps, and then I embedded that too.

In the future, I think I’ll build something better for document ingestion, probably with the help of a language model.

The architecture (the agent)

For the second part, where we connect to this data, we need to build a system where an agent can connect to different tools that contain different amounts of data from our vector database.

We keep to one agent only to make it easy enough to control. This one agent can decide what information it needs based on the user’s question.

It’s good not to complicate things and build it out to use too many agents, or you’ll run into issues, especially with these smaller models.

Although this may go against my own recommendations, I did set up a first LLM function that decides if we need to run the agent at all.

First initial LLM call to decide on the larger agent | Image by author

This was mainly for the user experience, as it takes a few extra seconds to boot up the agent (even when starting it as a background task when the container starts).
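A gate like this can be sketched as a single cheap completion call that returns a one-word verdict. Everything below is an illustrative assumption rather than the code used in this project: the prompt wording, the `needs_agent` and `handle_message` names, and the stub that stands in for a real LLM call so the sketch runs offline.

```python
from typing import Callable

ROUTER_PROMPT = (
    "Reply with exactly one word. Reply AGENT if the message needs a lookup in "
    "company documentation, otherwise reply DIRECT.\n\nMessage: {message}"
)


def needs_agent(message: str, complete: Callable[[str], str]) -> bool:
    """Ask a small, fast model whether the full agent should run."""
    reply = complete(ROUTER_PROMPT.format(message=message))
    return reply.strip().upper().startswith("AGENT")


def handle_message(message, complete, run_agent, quick_reply):
    """Route to the agent only when the gate says a lookup is needed."""
    if needs_agent(message, complete):
        return run_agent(message)
    return quick_reply(message)


def fake_complete(prompt: str) -> str:
    # Stub standing in for a real LLM call; it just keys off one word
    return "AGENT" if "policy" in prompt.lower() else "DIRECT"


print(handle_message("What is the travel policy?", fake_complete,
                     run_agent=lambda m: "agent answer",
                     quick_reply=lambda m: "quick answer"))
```

Passing the completion function in as a parameter keeps the gate independent of any particular provider or framework.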

As for how to build the agent itself, this is easy, as LlamaIndex does most of the work for us. For this, you can use the FunctionAgent, passing in different tools when setting it up.

from llama_index.core.agent.workflow import FunctionAgent

# Only runs if the first LLM thinks it's necessary

# Each helper returns a tool backed by a different slice of the vector database
access_links_tool = get_access_links_tool()
public_docs_tool = get_public_docs_tool()
onboarding_tool = get_onboarding_information_tool()
general_info_tool = get_general_info_tool()

formatted_system_prompt = get_system_prompt(team_name)

agent = FunctionAgent(
  tools=[onboarding_tool, public_docs_tool, access_links_tool, general_info_tool],
  llm=global_llm,
  system_prompt=formatted_system_prompt
)

The tools have access to different data from the vector database, and they are wrappers around the CitationQueryEngine. This engine helps to cite the source nodes in the text. We can access the source nodes at the end of the agent run, which you can attach to the message and in the footer.

To make sure the user experience is good, you can tap into the event stream to send updates back to Slack.

from llama_index.core.agent.workflow import ToolCall, ToolCallResult

handler = agent.run(user_msg=full_msg, ctx=ctx, memory=memory)

# Relay tool activity to Slack while the agent is still working
async for event in handler.stream_events():
  if isinstance(event, ToolCall):
     display_tool_name = format_tool_name(event.tool_name)
     message = f"✅ Checking {display_tool_name}"
     post_thinking(message)
  if isinstance(event, ToolCallResult):
     post_thinking("✅ Done checking...")

final_output = await handler
# str() on the agent output gives the response text
final_text = str(final_output)
blocks = build_slack_blocks(final_text, mention)

post_to_slack(
  channel_id=channel_id,
  blocks=blocks,
  timestamp=initial_message_ts,
  client=client
)

Make sure to format the messages and Slack blocks well, and refine the system prompt for the agent so it formats the messages correctly based on the information that the tools will return.

The architecture should be easy enough to understand, but there are still some retrieval techniques we should dig into.

Techniques you can try

A lot of people will emphasize certain techniques when building RAG systems, and they’re partially right. You should use hybrid search along with some kind of re-ranking.

How the query tools work under the hood (a bit simplified) | Image by author

The first I’ll mention is hybrid search when we perform retrieval.

I mentioned that we use semantic similarity to fetch chunks of data in the various tools, but you also need to account for cases where exact keyword search is required.

Just imagine a user asking for a specific certificate name, like CAT-00568. In that case, the system needs to find exact matches just as much as fuzzy ones.

With hybrid search, supported by both Qdrant and LlamaIndex, we use both dense and sparse vectors.

from llama_index.vector_stores.qdrant import QdrantVectorStore

# when setting up the vector store (both for embedding and fetching)
vector_store = QdrantVectorStore(
   client=client,
   aclient=async_client,
   collection_name="knowledge_bases",
   enable_hybrid=True,
   fastembed_sparse_model="Qdrant/bm25"
)

Sparse is ideal for exact keywords but blind to synonyms, while dense is great for “fuzzy” matches (“benefits policy” matches “employee perks”) but can miss literal strings like CAT-00568.
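To show why combining the two helps, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a dense and a sparse ranking. This is an illustration of the idea only; it is not necessarily how Qdrant or LlamaIndex fuse results internally, and the document IDs and rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked highly by either retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Made-up results for the query "benefits for certificate CAT-00568"
dense_ranking = ["perks-overview", "benefits-policy", "cert-CAT-00568"]
sparse_ranking = ["cert-CAT-00568", "cert-list"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

The certificate page is only third in the dense ranking, but because the sparse ranking puts it first, it ends up at the top of the fused list.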

Once the results are fetched, it’s useful to apply deduplication and re-ranking to filter out irrelevant chunks before sending them to the LLM for citation and synthesis.

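from llama_index.core.postprocessor import LLMRerank, SimilarityPostprocessor
from llama_index.core.query_engine import CitationQueryEngine
from llama_index.core.schema import MetadataMode
from llama_index.llms.openai import OpenAI

reranker = LLMRerank(llm=OpenAI(model="gpt-3.5-turbo"), top_n=5)
# SimilarityPostprocessor drops nodes whose similarity score is below the cutoff
dedup = SimilarityPostprocessor(similarity_cutoff=0.9)

engine = CitationQueryEngine(
    retriever=retriever,
    node_postprocessors=[dedup, reranker],
    metadata_mode=MetadataMode.ALL,
)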

This part wouldn’t be necessary if your data were exceptionally clean, which is why it shouldn’t be your main focus. It adds overhead and another API call.

It’s also not necessary to use a large model for re-ranking, but you’ll need to do some research on your own to figure out your options.

These techniques are easy to understand and quick to set up, so they aren’t where you’ll spend most of your time.

What you’ll actually spend time on

Most of the things you’ll spend time on aren’t so glamorous. It’s prompting, reducing latency, and chunking documents correctly.

Before you start, you should look into different prompt templates from various frameworks to see how they prompt the models. You’ll spend quite a bit of time making sure the system prompt is well-crafted for the LLM you choose.

The second thing you’ll spend most of your time on is making it quick. I’ve looked into internal tools from tech companies building AI knowledge agents and found they usually respond in about 8 to 13 seconds.

So, you need something in that range.

Using a serverless provider can be a problem here because of cold starts. LLM providers also introduce their own latency, which is hard to control.

One or two lagging API calls drag down the entire system | Image by author

That said, you can look into spinning up resources before they’re used, switching to lower-latency models, skipping frameworks to reduce overhead, and generally decreasing the number of API calls per run.
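Before optimizing, it helps to know which step is actually slow. The sketch below times each step of a run with a small context manager; the step names are hypothetical and the `time.sleep` calls stand in for real LLM and retrieval calls.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(step):
    """Accumulate wall-clock time spent inside the block under a step name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = timings.get(step, 0.0) + time.perf_counter() - start


# Placeholders for the real calls in one run
with timed("router_llm_call"):
    time.sleep(0.01)
with timed("retrieval"):
    time.sleep(0.02)

slowest = max(timings, key=timings.get)
for step, seconds in timings.items():
    print(f"{step}: {seconds * 1000:.1f} ms")
print("slowest step:", slowest)
```

Logging these per-step timings alongside each Slack request makes it easier to tell a cold start from a slow model response.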

The last thing, which takes a huge amount of work and which I’ve mentioned before, is chunking documents.

If you had exceptionally clean data with clear headers and separations, this part would be easy. But more often, you’ll be dealing with poorly structured HTML, PDFs, raw text files, Notion boards, and Confluence notes, often scattered and formatted inconsistently.

The challenge is figuring out how to programmatically ingest these documents so the system gets the full information needed to answer a question.

Just working with PDFs, for example, you’ll need to extract tables and images properly, separate sections by page numbers or layout elements, and trace each source back to the correct page.

You want enough context, but not chunks that are too large, or it will be harder to retrieve the right information later.

This kind of stuff isn’t well generalized. You can’t just push it in and expect the system to understand it. You have to think it through before you build it.

How to build it out further

At this point, it works well for what it’s supposed to do, but there are a few pieces I should cover (or people will think I’m simplifying too much). You’ll want to implement caching, a way to update the data, and long-term memory.

Caching isn’t essential, but you can at least cache the query’s embedding in larger systems to speed up retrieval, and store recent source results for follow-up questions. I don’t think LlamaIndex helps much here, but you should be able to intercept the QueryTool on your own.
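As a minimal sketch of the first idea, the query embedding can be cached in memory with `functools.lru_cache`. The `embed_query_uncached` function is a hypothetical stand-in for a real embedding API call, and the normalization step (lowercasing and collapsing whitespace) is an assumption about what counts as "the same query".

```python
from functools import lru_cache

call_count = 0  # counts how often the (pretend) embedding API is hit


def embed_query_uncached(text: str) -> tuple:
    """Hypothetical stand-in for a real embedding API call."""
    global call_count
    call_count += 1
    return tuple(float(ord(char) % 7) for char in text[:8])


@lru_cache(maxsize=1024)
def embed_query(normalized_text: str) -> tuple:
    # Tuples are returned so cached values cannot be mutated by callers
    return embed_query_uncached(normalized_text)


def cached_embedding(text: str) -> tuple:
    """Normalize the query so trivial variations share one cache entry."""
    return embed_query(" ".join(text.lower().split()))


first = cached_embedding("What is the  travel policy?")
second = cached_embedding("what is the travel policy?")
print(call_count, embed_query.cache_info())
```

An in-process cache like this disappears when a serverless container is recycled, so a shared store would be needed for it to survive across cold starts.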

You’ll also want a way to continuously update information in the vector databases. This is the biggest headache: it’s hard to know when something has changed, so you need some kind of change-detection method along with an ID for each chunk.

You could just use periodic re-embedding strategies where you update a chunk with different meta tags altogether (this is my preferred approach because I’m lazy).
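One simple change-detection method is to store a content hash next to each chunk ID and compare it against a fresh crawl. The sketch below assumes a stable ID can be derived from the source URL and the chunk's position; the function names and data shapes are hypothetical.

```python
import hashlib


def chunk_id(source_url: str, position: int) -> str:
    """Stable ID derived from where the chunk lives, not what it says."""
    return hashlib.sha256(f"{source_url}#{position}".encode("utf-8")).hexdigest()[:16]


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_updates(stored_hashes: dict, fresh_chunks: dict) -> dict:
    """Compare stored hashes (id -> hash) with a fresh crawl (id -> text)."""
    fresh_hashes = {cid: content_hash(text) for cid, text in fresh_chunks.items()}
    return {
        "add": sorted(cid for cid in fresh_hashes if cid not in stored_hashes),
        "update": sorted(
            cid for cid, digest in fresh_hashes.items()
            if cid in stored_hashes and stored_hashes[cid] != digest
        ),
        "delete": sorted(cid for cid in stored_hashes if cid not in fresh_hashes),
    }


# Short IDs used here for readability; chunk_id() would produce the real ones
stored = {
    "a": content_hash("old intro"),
    "b": content_hash("pricing"),
    "c": content_hash("removed page"),
}
fresh = {"a": "new intro", "b": "pricing", "d": "brand new page"}

plan = plan_updates(stored, fresh)
print(plan)
```

Only the chunks listed under "add" and "update" need to be re-embedded, which keeps a periodic refresh cheap when most content is unchanged.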

The last thing I want to mention is long-term memory for the agent, so it can understand conversations you’ve had in the past. For that, I’ve implemented some state by fetching history from the Slack API. This lets the agent see around 3–6 previous messages when responding.

We don’t want to push in too much history, since the context window grows, which not only increases cost but also tends to confuse the agent.
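Capping the history can be sketched as a small helper that keeps only the most recent messages within a character budget. The message shape (dicts with "user" and "text" keys) and the default limits are assumptions for illustration; fetching the thread from Slack is left out.

```python
def recent_history(messages, max_messages=6, max_chars=4000):
    """Keep the newest messages, oldest first, within both limits.

    messages: oldest-first list of {"user": ..., "text": ...} dicts.
    The newest message is always kept, even if it alone exceeds max_chars.
    """
    selected = []
    used = 0
    for message in reversed(messages[-max_messages:]):
        used += len(message["text"])
        if selected and used > max_chars:
            break
        selected.append(message)
    return list(reversed(selected))


thread = [{"user": f"U{i}", "text": f"message {i}"} for i in range(10)]
history = recent_history(thread)
print([m["text"] for m in history])
```

Walking backwards from the newest message means that when the budget runs out, it is the oldest context that gets dropped.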

That said, there are better ways to handle long-term memory using external tools. I’m keen to write more on that in the future.

Learnings and so on

After doing this for a while, I have a few notes to share about working with frameworks and keeping it simple (which I personally don’t always follow).

You learn a lot from using a framework, especially how to prompt well and how to structure the code. But at some point, working around the framework adds overhead.

For instance, in this system, I’m bypassing the framework a bit by adding an initial API call that decides whether to move on to the agent and responds to the user quickly.

If I had built this without a framework, I think I could have handled that kind of logic better, where the first model decides what tool to call right away.

LLM API calls in the system | Image by author

I haven’t tried this but I’m assuming this would be cleaner.

Also, LlamaIndex optimizes the user query, which it should, before retrieval.

But sometimes it reduces the query too much, and I need to go in and fix it. The citation synthesizer doesn’t have access to the conversation history, so with that overly simplified query, it doesn’t always answer well.

The abstractions can sometimes cause the system to lose context | Image by author

With a framework, it’s also hard to trace where latency is coming from in the workflow since you can’t always see everything, even with observability tools.

Most developers recommend using frameworks for quick prototyping or bootstrapping, then rewriting the core logic with direct calls in production.

It’s not because the frameworks aren’t useful, but because at some point it’s better to write something you fully understand that only does what you need.

The general recommendation is to keep things as simple as possible and minimize LLM calls (which I’m not even fully doing myself here).

But if all you need is RAG and not an agent, stick with that.

You can create a simple LLM call that sets the right parameters in the vector DB. From the user’s perspective, it’ll still look like the system is “looking into the database” and returning relevant information.

If you’re going down the same path, I hope this was useful.

There is a bit more to it though. You’ll want to implement some kind of evaluation, guardrails, and monitoring (I’ve used Phoenix here).

Once finished though, the result will look like this:

Example of a company agent looking through PDFs, websites, and docs in Slack | Image by author

If you want to follow my writing, you can find me here, on my website, or on LinkedIn.

I’ll try to dive deeper into agentic memory, evals, and prompting over the summer.

❤
