    Tutorial: Semantic Clustering of User Messages with LLM Prompts

By FinanceStarGate · February 17, 2025 · 11 min read

As a Developer Advocate, it's hard to keep up with user forum messages and understand the big picture of what users are saying. There's plenty of valuable content — but how can you quickly spot the key conversations? In this tutorial, I'll show you an AI hack to perform semantic clustering simply by prompting LLMs!

TL;DR 🔄 this blog post is about how to go from (data science + code) → (AI prompts + LLMs) for the same results — just faster and with less effort! 🤖⚡. It's organized as follows:

• Inspiration and Data Sources
• Exploring the Data with Dashboards
• LLM Prompting to Produce KNN Clusters
• Experimenting with Custom Embeddings
• Clustering Across Multiple Discord Servers

Inspiration and Data Sources

First, I'll give props to the December 2024 paper Clio (Claude insights and observations), a privacy-preserving platform that uses AI assistants to analyze and surface aggregated usage patterns across millions of conversations. Reading this paper inspired me to try this.

Data. I used only publicly available Discord messages, specifically "forum threads", where users ask for tech support. In addition, I aggregated and anonymized content for this blog. Per thread, I formatted the data into conversation-turn format, with user roles identified as either "user" (asking the question) or "assistant" (anyone answering the user's initial question). I also added a simple, hard-coded binary sentiment score (0 for "not happy" and 1 for "happy") based on whether the user said thanks anywhere in their thread. For vectorDB vendors I used Zilliz/Milvus, Chroma, and Qdrant.

The first step was to convert the data into a pandas DataFrame. Below is an excerpt. You can see that for thread_id=2, a user asked only 1 question, but for thread_id=3, a user asked 4 different questions in the same thread (the other 2 questions are at later timestamps, not shown below).
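
Since the excerpt is shown as an image, here is a small hypothetical sketch of what the conversation-turn rows look like. The values are invented; the column names simply mirror the ones used in the scoring code below.

import pandas as pd

# Hypothetical excerpt of the conversation-turn data (invented values;
# column names mirror those used in calc_score below).
clean_data_df = pd.DataFrame([
    {"thread_id": 2, "role_name": "user",      "message_content": "How do I create a collection?",            "timestamp": "2025-01-02 10:01"},
    {"thread_id": 2, "role_name": "assistant", "message_content": "Use the create_collection() API.",          "timestamp": "2025-01-02 10:05"},
    {"thread_id": 3, "role_name": "user",      "message_content": "Why is my similarity search slow?",         "timestamp": "2025-01-03 09:12"},
    {"thread_id": 3, "role_name": "user",      "message_content": "Also, how do I filter by metadata? Thanks!", "timestamp": "2025-01-03 09:40"},
])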

I added a naive sentiment 0|1 scoring function.

def calc_score(df):
   # Define the target words that signal a "happy" thread
   target_words = ["thanks", "thank you", "thx", "🙂", "😉", "👍"]


   # Helper function to check if any target word is in the concatenated message content
   def contains_target_words(messages):
       concatenated_content = " ".join(messages).lower()
       return any(word in concatenated_content for word in target_words)


   # Group by 'thread_id' and calculate the score for each group
   thread_scores = (
       df[df['role_name'] == 'user']
       .groupby('thread_id')['message_content']
       .apply(lambda messages: int(contains_target_words(messages)))
   )
   # Map the calculated scores back to the original DataFrame
   df['score'] = df['thread_id'].map(thread_scores)
   return df


...


if __name__ == "__main__":

   # Load parameters from YAML file
   config_path = "config.yaml"
   params = load_params(config_path)
   input_data_folder = params['input_data_folder']
   processed_data_dir = params['processed_data_dir']
   threads_data_file = os.path.join(processed_data_dir, "thread_summary.csv")

   # Read data from Discord Forum JSON files into a pandas df.
   clean_data_df = process_json_files(
       input_data_folder,
       processed_data_dir)

   # Calculate score based on specific words in message content
   clean_data_df = calc_score(clean_data_df)


   # Generate reports and plots
   plot_all_metrics(processed_data_dir)


   # Concat thread messages & save as CSV for prompting.
   thread_summary_df, avg_message_len, avg_message_len_user = \
       concat_thread_messages_df(clean_data_df, threads_data_file)
   assert thread_summary_df.shape[0] == clean_data_df.thread_id.nunique()


Exploring the Data with Dashboards

From the processed data above, I built traditional dashboards:

• Message Volumes: One-off peaks for vendors like Qdrant and Milvus (presumably due to marketing events).
• User Engagement: Top-user bar charts and scatterplots of response time vs. number of user turns show that, in general, more user turns mean higher satisfaction. However, satisfaction does NOT look correlated with response time; the scatterplot's dark dots appear random with respect to the y-axis (response time). Maybe these users are not in production, so their questions are not very urgent? Outliers exist, such as Qdrant and Chroma, which may have bot-driven anomalies.
• Satisfaction Trends: Around 70% of users appear happy to have any interaction at all. Data note: make sure to check emojis per vendor; sometimes users reply using emojis instead of words! Examples: Qdrant and Chroma.
Image by author of aggregated, anonymized data. Top-left charts: Chroma has the highest message volume, followed by Qdrant, then Milvus. Top-right charts: top messaging users; Qdrant and Chroma have possible bots (see the top bar in the top-messaging-users chart). Middle-right charts: scatterplots of response time vs. number of user turns show no correlation between the dark dots and the y-axis (response time). Satisfaction is generally higher with respect to the x-axis (user turns), except for Chroma. Bottom-left charts: bar charts of satisfaction levels; make sure to catch possible emoji-based feedback, see Qdrant and Chroma.

LLM Prompting to Produce KNN Clusters

For prompting, the next step was to aggregate the data by thread_id. For LLMs, you need the texts concatenated together. I separated user messages from whole-thread messages, to see if one or the other would produce better clusters. I ended up using just user messages.
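
For reference, a minimal pandas sketch of that aggregation step might look like the following. It assumes the column names from the earlier scoring code; the actual concat_thread_messages_df helper may differ in its details.

# Sketch: per-thread aggregation of message texts and sentiment score.
user_only = clean_data_df[clean_data_df["role_name"] == "user"]

thread_summary_df = (
    clean_data_df.groupby("thread_id")
    .agg(
        message_content=("message_content", " ".join),  # all turns concatenated
        score=("score", "max"),
    )
    .reset_index()
)
# User-only text, concatenated per thread, as a separate column
thread_summary_df["message_content_user"] = thread_summary_df["thread_id"].map(
    user_only.groupby("thread_id")["message_content"].apply(" ".join)
)

thread_summary_df.to_csv("thread_summary.csv", index=False)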

Example anonymized data for prompting. All message texts concatenated together.

With a CSV file for prompting, you're ready to get started using an LLM to do data science!

!pip install -q google-generativeai
import os
import google.generativeai as genai


# Get API key from local environment
api_key = os.environ.get("GOOGLE_API_KEY")


# Configure API key
genai.configure(api_key=api_key)


# List all the model names
for m in genai.list_models():
   if 'generateContent' in m.supported_generation_methods:
       print(m.name)


# Try different models and prompts
GEMINI_MODEL_FOR_SUMMARIES = "gemini-2.0-pro-exp-02-05"
model = genai.GenerativeModel(GEMINI_MODEL_FOR_SUMMARIES)
# Combine the prompt and CSV data.
full_input = prompt + "\n\nCSV Data:\n" + csv_data
# Inference call to Gemini LLM
response = model.generate_content(full_input)


# Save response.text as .json file...


# Check token counts and compare to the model limit: 2 million tokens
print(response.usage_metadata)
    
Image by author. Top: example LLM model names. Bottom: example inference call to Gemini LLM with token counts: prompt_token_count = input tokens; candidates_token_count = output tokens; total_token_count = total tokens used.

Unfortunately, the Gemini API kept cutting the response.text short. I had better luck using AI Studio directly.

Image by author: screenshot of example outputs from Google AI Studio.

My 5 prompts to Gemini Flash & Pro (temperature set to 0) are below.

Prompt #1: Get thread summaries:

Given this .csv file, per row, add 3 columns:
– thread_summary = 205 characters or less summary of the row's column 'message_content'
– user_thread_summary = 126 characters or less summary of the row's column 'message_content_user'
– thread_topic = 3–5 word super high-level category
Make sure the summaries capture the main content without losing too much detail. Make user thread summaries straight to the point, capturing the main content without losing too much detail; skip the intro text. If a shorter summary is good enough, prefer the shorter summary. Make sure the topic is general enough that there are fewer than 20 high-level topics for all the data. Prefer fewer topics. Output JSON columns: thread_id, thread_summary, user_thread_summary, thread_topic.

Prompt #2: Get cluster stats:

Given this CSV file of messages, use column='user_thread_summary' to perform semantic clustering of all the rows. Use method = Silhouette, with linkage method = ward, and distance_metric = Cosine Similarity. Just give me the stats for the Silhouette analysis method for now.

Prompt #3: Perform initial clustering:

Given this CSV file of messages, use column='user_thread_summary' to perform semantic clustering of all the rows into N=6 clusters using the Silhouette method. Use column="thread_topic" to summarize each cluster topic in 1–3 words. Output JSON with columns: thread_id, level0_cluster_id, level0_cluster_topic.

Silhouette Score measures how similar an object is to its own cluster (cohesion) versus other clusters (separation). Scores range from -1 to 1. A higher average silhouette score generally indicates better-defined clusters with good separation. For more details, check out the scikit-learn silhouette score documentation.
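
If you want to sanity-check the LLM's silhouette numbers yourself, a small scikit-learn sketch could look like the one below; it assumes `embeddings` is an (n_samples, n_features) array built from the summaries with any embedding model.

# Sketch: compute silhouette scores for a range of cluster counts.
# Note: scikit-learn's ward linkage only supports euclidean distance,
# so average linkage is used here together with the cosine metric.
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

for n_clusters in range(2, 11):
    labels = AgglomerativeClustering(
        n_clusters=n_clusters,
        metric="cosine",
        linkage="average",
    ).fit_predict(embeddings)
    print(n_clusters, silhouette_score(embeddings, labels, metric="cosine"))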

Applying it to the Chroma data. Below, I show results from Prompt #2 as a plot of silhouette scores. I chose N=6 clusters as a compromise between a high score and fewer clusters. Most LLMs these days that do data analysis take input as CSV and output JSON.

Image by author of aggregated, anonymized data. Left: I chose N=6 clusters as a compromise between a higher score and fewer clusters. Right: the actual clusters using N=6. Highest sentiment (highest scores) is for topics about Query. Lowest sentiment (lowest scores) is for topics about "Client Problems".

From the plot above, you can see we're finally getting into the meat of what users are saying!
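
As a rough sketch, merging an LLM's JSON output (e.g., from Prompt #3) back into the thread table and checking sentiment per topic could be done like this; the file names and the save step are assumptions about how you stored the response.

import json
import pandas as pd

# Sketch: join the clustering output (thread_id, level0_cluster_id,
# level0_cluster_topic) back onto the thread summaries; file names are assumed.
with open("clusters.json") as f:
    clusters_df = pd.DataFrame(json.load(f))

thread_summary_df = pd.read_csv("thread_summary.csv")
merged_df = thread_summary_df.merge(clusters_df, on="thread_id", how="left")

# Average sentiment score per cluster topic, lowest (unhappiest) first
print(merged_df.groupby("level0_cluster_topic")["score"].mean().sort_values())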

Prompt #4: Get hierarchical cluster stats:

Given this CSV file of messages, use the column='thread_summary_user' to perform semantic clustering of all the rows into Hierarchical Clustering (Agglomerative) with 2 levels. Use Silhouette score. What is the optimal number of Level0 and Level1 clusters? How many threads per Level1 cluster? Just give me the stats for now, we'll do the actual clustering later.

Prompt #5: Perform hierarchical clustering:

Accept this clustering with 2 levels. Add cluster topics that summarize the text column "thread_topic". Cluster topics should be as short as possible without losing too much detail of the cluster meaning.
– Level0 cluster topics ~1–3 words.
– Level1 cluster topics ~2–5 words.
Output JSON with columns: thread_id, level0_cluster_id, level0_cluster_topic, level1_cluster_id, level1_cluster_topic.

I also prompted the LLM to generate Streamlit code to visualize the clusters (since I'm not a JS expert 😄). Results for the same Chroma data are shown below.

Image by author of aggregated, anonymized data. Left image: each scatterplot dot is a thread with hover-info. Right image: hierarchical clustering with raw-data drill-down capabilities. API and Package Errors looks like Chroma's most urgent topic to fix, because sentiment is low and the message volume is high.
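
To give an idea of what that generated dashboard code can look like, here is a minimal, hedged Streamlit sketch. It assumes a merged_df.csv holding the Prompt #5 output columns joined with the thread summaries and scores, and uses Plotly for the scatterplot; the actual generated app was more elaborate.

import pandas as pd
import plotly.express as px
import streamlit as st

# Minimal cluster-dashboard sketch; merged_df.csv is assumed to hold the
# Prompt #5 columns (level0/level1 cluster ids and topics) plus summaries and scores.
df = pd.read_csv("merged_df.csv")

st.title("Discord thread clusters")
level0 = st.selectbox("Level0 cluster topic", sorted(df["level0_cluster_topic"].unique()))
subset = df[df["level0_cluster_topic"] == level0]

# Each dot is a thread; hover shows the thread id and its summary
fig = px.scatter(
    subset,
    x="level1_cluster_topic",
    y="score",
    hover_data=["thread_id", "user_thread_summary"],
)
st.plotly_chart(fig)
st.dataframe(subset)  # raw-data drill-down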

I found this very insightful. For Chroma, clustering revealed that while users were happy with topics like Query, Distance, and Performance, they were unhappy about areas such as Data, Client, and Deployment.

Experimenting with Custom Embeddings

I repeated the above clustering prompts, using just the numerical embedding ("user_embedding") in the CSV instead of the raw text summaries ("user_text"). I've explained embeddings in detail in earlier blogs, along with the risks of overfitted models on leaderboards. OpenAI has reliable embeddings that are extremely affordable by API call. Below is an example code snippet showing how to create embeddings.

import os
from openai import OpenAI


EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 512  # 512 or 1536 possible


# Initialize client with API key
openai_client = OpenAI(
   api_key=os.environ.get("OPENAI_API_KEY"),
)


# Function to create embeddings
def get_embedding(text, embedding_model=EMBEDDING_MODEL,
                 embedding_dim=EMBEDDING_DIM):
   response = openai_client.embeddings.create(
       input=text,
       model=embedding_model,
       dimensions=embedding_dim
   )
   return response.data[0].embedding


# Function to call per pandas df row in .apply()
def generate_row_embeddings(row):
   return {
       'user_embedding': get_embedding(row['user_thread_summary']),
   }


# Generate embeddings using pandas apply
embeddings_data = df.apply(generate_row_embeddings, axis=1)
# Add embeddings back into df as separate columns
df['user_embedding'] = embeddings_data.apply(lambda x: x['user_embedding'])
display(df.head())


# Save as CSV ...

    
Example data for prompting. Column "user_embedding" is an array of length 512 of floating-point numbers.

Interestingly, both Perplexity Pro and Gemini 2.0 Pro sometimes hallucinated cluster topics (e.g., misclassifying a question about slow queries as "Personal Matter").

Conclusion: When performing NLP with prompts, let the LLM generate its own embeddings — externally generated embeddings seem to confuse the model.

Image by author of aggregated, anonymized data. Both Perplexity Pro and Google's Gemini 1.5 Pro hallucinated cluster topics when given an externally generated embedding column. Conclusion — when performing NLP with prompts, just keep the raw text and let the LLM create its own embeddings behind the scenes. Feeding in externally generated embeddings seems to confuse the LLM!

Clustering Across Multiple Discord Servers

Finally, I broadened the analysis to include Discord messages from three different VectorDB vendors. The resulting visualization highlighted common issues — like both Milvus and Chroma facing authentication problems.

Image by author of aggregated, anonymized data: a multi-vendor VectorDB dashboard displays top issues across several companies. One thing that stands out is that both Milvus and Chroma are having trouble with Authentication.

Summary

Here's a summary of the steps I followed to perform semantic clustering using LLM prompts:

1. Extract Discord threads.
2. Format data into conversation turns with roles ("user", "assistant").
3. Score sentiment and save as CSV.
4. Prompt Google Gemini 2.0 Flash for thread summaries.
5. Prompt Perplexity Pro or Gemini 2.0 Pro for clustering based on thread summaries, using the same CSV.
6. Prompt Perplexity Pro or Gemini 2.0 Pro to write Streamlit code to visualize the clusters (because I'm not a JS expert 😆).

By following these steps, you can quickly transform raw forum data into actionable insights — what used to take days of coding can now be done in just one afternoon!

    References

1. Clio: Privacy-Preserving Insights into Real-World AI Use, https://arxiv.org/abs/2412.13678
2. Anthropic blog about Clio, https://www.anthropic.com/research/clio
3. Milvus Discord Server, last accessed Feb 7, 2025
   Chroma Discord Server, last accessed Feb 7, 2025
   Qdrant Discord Server, last accessed Feb 7, 2025
4. Gemini models, https://ai.google.dev/gemini-api/docs/models/gemini
5. Blog about Gemini 2.0 models, https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/
6. Scikit-learn Silhouette Score
7. OpenAI Matryoshka embeddings
8. Streamlit

