When a data analyst types something like, “Which cohorts should we retarget?”, they’re not issuing SQL; they’re expressing intent. That simple sentence could imply different operations: segment ranking, lookalike matching, cohort comparison. Translating such ambiguous prompts into structured, policy-compliant analytics requires more than language modeling. It requires semantic understanding grounded in business context and technical constraints.
This is where intent classification becomes powerful. It maps vague, messy human language into structured, predictable categories that downstream systems can act on. Whether the user is trying to compare segments, find similar audiences, or identify untapped opportunities, the system’s first task is to understand what the user means.
In this post, we introduce a hybrid natural language interface that transforms user intent into safe SQL queries. We apply these learnings to real-world use cases such as Data Clean Rooms (DCRs), where analysts and marketers collaborate across brands. These users have different skill sets, but SQL is usually not one of them; they know their brand’s business best, and in a global organization they may be collaborating across regions and expressing their intent in their native language.
With this blog, we expect our audience of marketers, analysts, and data scientists working within data clean rooms to see a system addressing:
- Prompts that may be multilingual (written in different languages)
- Queries that are explainable, traceable, and compliant
- Tooling that supports domain-specific language (e.g., “cohort overlap”, “lookalike segments”)
- A language-to-SQL pipeline that stays tightly grounded to approved schemas
Our hybrid system combines machine learning, search, and prompt engineering to create a natural language interface aligned with compliance, security, and analytical needs. We then evaluate it across interactions between brands in a Data Clean Room.
In this scenario (different brands collaborating in a DCR), users expect the system to translate their intent into action by generating the relevant SQL and executing it. On top of that, these are privacy-first environments where only approved models with custom estimate functions can be queried, since no customer-identifiable data is accessible. Here, generic text-to-SQL models fall short.
Modern search systems support semantic search, i.e., understanding the user’s intent based not on exact keyword matches but on the meaning of the query. Embeddings are generated for user queries and indexed documents, and the search interface then retrieves the most relevant results using fast approximate nearest-neighbor search across those embeddings.
In our architecture, we applied the same principle. Instead of documents, we have SQL prompt templates, and instead of search queries, we have user intents (e.g., “find top cohorts by average order value”). By embedding both the intents and the SQL prompt templates in the same vector space, we can use techniques such as FAISS (Facebook AI Similarity Search) to match a user query to the most appropriate SQL template efficiently.
Our system addresses this by blending:
- Embedding-based intent classification
- FAISS-based semantic retrieval
- Template-driven SQL generation
- Strict schema-guided LLM completion as a fallback
The result is a natural language system that is accurate, fast, and, critically, compliant, and that supports privacy-preserving data collaboration across business boundaries in Data Clean Rooms (DCRs). The generated queries automatically honor strict rules on what analysts can see and do, such as (a minimal enforcement sketch follows this list):
- No row-level access (e.g., no raw customer IDs)
- Only pre-approved transformations for estimation (e.g., HyperLogLog sketches, standardized aggregates), applied as needed
- Read-only SQL (SELECT-only queries)
- Schema and function constraints enforced by policy
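To make this concrete, here is a minimal policy-gate sketch. It is our own illustration, not production code: the allow-list mirrors the tables introduced later in this post, and a real gate would parse the SQL into an AST rather than rely on string checks.

import re

APPROVED_TABLES = {"hll_cohort_sketches_store_x1",
                   "hll_cohort_sketches_store_y1",
                   "cohort_similarity"}

def is_policy_compliant(sql: str) -> bool:
    """Reject anything that is not a read-only query over approved objects."""
    lowered = sql.strip().lower()
    if not lowered.startswith("select"):
        return False  # read-only: SELECT-only queries
    if re.search(r"count\s*\(\s*distinct", lowered):
        return False  # raw cardinality over identifiers is not allowed
    # Naive table check; a production gate would resolve table references properly
    return any(table in lowered for table in APPROVED_TABLES)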
The implementation of this approach comprises:
- An embedding engine for intent classification (such as OpenAI’s text-embedding-3-small) for converting text into high-dimensional vectors
- Similarity search: fast nearest-neighbor intent lookup (such as a FAISS index)
- A template-driven SQL generator, aware of store/brand and cohort metadata
- A Large Language Model (LLM) fallback when templates cannot resolve the query safely; this is extendable to any schema-aware LLM such as OpenAI’s GPT-x
The intent-to-SQL generation pipeline mirrors the semantic retrieval stage of many real-world search engines, just adapted to the world of SQL. This hybrid architecture ensures low-latency performance while preserving high expressivity and strict schema alignment.
Intent classification is the foundation of this approach. It maps unstructured natural language into structured categories that drive downstream logic such as SQL template selection.
Let’s start with the idea of intent. A single prompt, say, “Which cohorts overlap between ComfyWearCo and SportStyleShop?”, could lead to a wide array of logic. But in our system, we distill it down to the core intent: cohort_overlap between two brands, ComfyWearCo and SportStyleShop.
Classifying intent up front matters for three reasons:
- Disambiguation: it differentiates between comparison, lookup, ranking, and exploratory tasks.
- Safety: it enforces template-based SQL paths over arbitrary text-to-SQL generation.
- Speed: it enables fast routing of prompts without invoking large models, avoiding hallucinated outputs.
These intents were trained using 5–10+ prompt examples per category, embedded and indexed.
For our implementation, we use OpenAI’s text-embedding-3-small to embed a curated set of labeled prompt examples for each intent category. These are indexed using FAISS.
Example Prompt Set
Below is a subset of the examples used to train the index:
cohort_recommendation:
- Suggest untested high-value cohorts we can try in ComfyWearCo.
- What are some promising segments in SportStyleShop that haven’t been used yet?
- List top-value cohorts in ComfyWearCo that haven’t been targeted.
lookalike_request:
- Which SportStyleShop cohorts are similar to cohort 4 in ComfyWearCo?
- Find lookalike segments in SportStyleShop for ComfyWearCo’s cohort 2.
- Show top matches for cohort 4 in ComfyWearCo from other stores.
cohort_overlap:
- Which cohorts across brands have overlapping users?
- Do cohort 3 in ComfyWearCo and cohort 4 in SportStyleShop overlap?
cohort_comparison:
- Compare metrics for cohort 1 in ComfyWearCo vs cohort 2 in SportStyleShop.
- Show cohort 1 from each brand with KPIs.
nl_to_sql_query:
- List cohorts in ComfyWearCo with customer lifetime above 2.0.
- Show top cohort by total orders in ComfyWearCo.
import faiss
import numpy as np
import openai

intent_examples = [
    {"text": "Suggest untested high-value cohorts...", "intent": "cohort_recommendation", "language": "en"},
    {"text": "Show me similar cohorts", "intent": "lookalike_request", "language": "en"},
    {"text": "Compare cohorts from different stores", "intent": "cohort_comparison", "language": "en"},
    {"text": "Montre-moi des cohortes similaires", "intent": "lookalike_request", "language": "fr"},
    {"text": "Welche Kohorten überschneiden sich?", "intent": "cohort_overlap", "language": "de"},
    # ... (other examples as above)
]

# Embed all labeled examples in a single batch call
response = openai.embeddings.create(
    input=[e["text"] for e in intent_examples],
    model="text-embedding-3-small",
)
embeddings = np.array([np.array(r.embedding, dtype="float32") for r in response.data])

# Build a flat L2 index over the example embeddings and persist it
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "intent_index.faiss")
Metadata (intent labels and the original prompts) is saved in a separate .pkl file for runtime prediction.
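For illustration, a minimal sketch of the runtime path might look like the following. The file name intent_metadata.pkl and the helper name classify_intent are our own assumptions: we load the FAISS index and the pickled metadata, embed the incoming prompt, and return the intent of its nearest labeled neighbors.

import pickle

import faiss
import numpy as np
import openai

index = faiss.read_index("intent_index.faiss")
with open("intent_metadata.pkl", "rb") as f:  # hypothetical metadata file
    metadata = pickle.load(f)  # list of {"text", "intent", ...}, aligned with index rows

def classify_intent(prompt: str, k: int = 3):
    """Return the nearest intent label plus its neighbors for explainability."""
    resp = openai.embeddings.create(input=[prompt], model="text-embedding-3-small")
    vector = np.array(resp.data[0].embedding, dtype="float32").reshape(1, -1)
    distances, ids = index.search(vector, k)
    neighbors = [metadata[i] for i in ids[0]]
    return neighbors[0]["intent"], neighbors, distances[0]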
Why FAISS over a SentenceTransformer Classifier?
An alternative approach is to train a supervised classifier on top of SentenceTransformer embeddings. However, in our example we opted for FAISS-based nearest-neighbor classification for:
- Zero-shot generalization: easily extensible by adding more labeled examples, without retraining
- Explainability: each predicted intent is backed by its nearest neighbors, which can be inspected
- No model training or hosting: faster to deploy and easier to maintain
- Flexibility: enables open-world classification with support for fallback behaviors
While SentenceTransformers + classifier pipelines work well when class definitions are stable and exhaustive, the FAISS approach offers greater flexibility in evolving Data Clean Room scenarios.
Let’s assume we’re working in a Data Clean Room with two brands, ComfyWearCo and SportStyleShop. The ER diagram below illustrates the structure of our cohort-based data model for two or more brands:
- hll_cohort_sketches_store_x1 / hll_cohort_sketches_store_y1: tables storing cohort-level metrics per store. Each is keyed by (store_id, cohort_id) and contains privacy-preserving HLL sketches and standardized metrics.
- Cohort_Similarity: represents similarity scores between cohorts across stores. It references (store_a, cohort_a) and (store_b, cohort_b) from the sketch tables.
All query generation, whether via templates or the LLM, respects this entity-relationship model, using only accessible fields and safe operations for estimates such as hll_estimate or hll_merge_agg.
Because brand names in natural language (“ComfyWearCo”) differ from internal table names and store IDs (such as store ID comfy_wear_store_view and table name hll_cohort_sketches_store_x1), we use a brand_registry.json to capture the mapping:
{
  "ComfyWearCo": {"store_id": "comfy_wear_store_view",
                  "table": "hll_cohort_sketches_store_x1"},
  "SportStyleShop": {"store_id": "sporty_style_store_view",
                     "table": "hll_cohort_sketches_store_y1"}
}
Entity extraction uses regex and string matching to detect store and cohort mentions within the prompt; a minimal sketch follows. This also makes the system extensible to N stores without really changing the implementation.
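As a rough illustration (the function below is our own sketch, assuming the brand_registry.json shown above), brand names are matched against registry keys and cohort numbers are pulled out with a regex:

import json
import re

with open("brand_registry.json") as f:
    BRAND_REGISTRY = json.load(f)

def extract_entities(prompt: str):
    """Detect brand/store mentions and cohort numbers in a prompt."""
    stores = [
        {"brand": brand, **info}
        for brand, info in BRAND_REGISTRY.items()
        if brand.lower() in prompt.lower()
    ]
    # e.g. "cohort 4" -> 4; captures every cohort mention, in order
    cohorts = [int(m) for m in re.findall(r"cohort\s+(\d+)", prompt, re.IGNORECASE)]
    return stores, cohorts

Adding a new brand is then just a new entry in brand_registry.json; no code changes are required.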
Given the predicted intent and the extracted entities, we use intent-specific SQL templates.
Example for cohort comparison:
SELECT 'ComfyWearCo' AS store_name, cohort_id,
hll_estimate(cohort_hll_sketch) AS estimated_users,
avg_standardized_avg_order_value,
avg_standardized_total_orders,
avg_standardized_customer_lifetime,
avg_standardized_days_since_last_purchase
FROM hll_cohort_sketches_store_x1
WHERE cohort_id = 2 AND store_id = 'comfy_wear_store_view'
UNION ALL
SELECT 'SportStyleShop' AS store_name, cohort_id,
hll_estimate(cohort_hll_sketch) AS estimated_users,
avg_standardized_avg_order_value,
avg_standardized_total_orders,
avg_standardized_customer_lifetime,
avg_standardized_days_since_last_purchase
FROM hll_cohort_sketches_store_y1
WHERE cohort_id = 4 AND store_id = 'sporty_style_store_view';
This approach avoids hallucinated SQL by grounding generation entirely in the accessible schema.
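To show the mechanics, here is a minimal sketch (our own illustration, not the exact production code) of how the comparison template above might be rendered from the brand registry and the extracted entities:

COHORT_COMPARISON_TEMPLATE = """\
SELECT '{brand}' AS store_name, cohort_id,
       hll_estimate(cohort_hll_sketch) AS estimated_users,
       avg_standardized_avg_order_value,
       avg_standardized_total_orders,
       avg_standardized_customer_lifetime,
       avg_standardized_days_since_last_purchase
FROM {table}
WHERE cohort_id = {cohort_id} AND store_id = '{store_id}'"""

def build_comparison_sql(pairs):
    """Render one SELECT per (store_dict, cohort_id) pair and UNION them."""
    parts = [
        COHORT_COMPARISON_TEMPLATE.format(
            brand=store["brand"],
            table=store["table"],
            store_id=store["store_id"],
            cohort_id=cohort,  # an int from regex extraction, so injection-safe
        )
        for store, cohort in pairs
    ]
    return "\nUNION ALL\n".join(parts) + ";"

Because every identifier comes from the registry and every numeric value is parsed as an integer, the rendered SQL can only reference approved tables and columns.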
If no template matches, we fall back to an LLM (GPT-4 in our example) using this prompt:
You're a knowledge analyst working in a safe knowledge clear room. Solely use SELECT statements.
Use solely the schema and features offered. Don't invent any columns or features.
Use hll_estimate(cohort_hll_sketch) when deciding on a single cohort.
Don't use hll_merge_agg until aggregating throughout rows.SCHEMA:
USER QUESTION:
Which cohorts in ComfyWearCo have highest complete orders?
This schema-constrained prompting dramatically reduces errors in the generated SQL; a minimal fallback call is sketched below.
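Assuming the OpenAI Python SDK (v1+), the fallback call might look like this. The helper name llm_fallback_sql is our own, and schema_text stands in for the serialized approved schema injected after SCHEMA:.

import openai

GUARDRAIL_PROMPT = (
    "You are a data analyst working in a secure data clean room. Only use SELECT statements.\n"
    "Use only the schema and functions provided. Do not invent any columns or functions.\n"
    "Use hll_estimate(cohort_hll_sketch) when selecting a single cohort.\n"
    "Do not use hll_merge_agg unless aggregating across rows."
)

def llm_fallback_sql(question: str, schema_text: str) -> str:
    """Generate schema-constrained SQL when no template matches."""
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": GUARDRAIL_PROMPT},
            {"role": "user", "content": f"SCHEMA:\n{schema_text}\n\nUSER QUESTION:\n{question}"},
        ],
        temperature=0.2,  # keep generation conservative and repeatable
    )
    return response.choices[0].message.content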
“Which cohorts in SportStyleShop are most similar to cohort 5 in ComfyWearCo?”
SELECT cohort_b, similarity_score_proxy
FROM cohort_similarity
WHERE store_a = 'comfy_wear_store_view'
AND cohort_a = 5
AND store_b = 'sporty_style_store_view'
ORDER BY similarity_score_proxy DESC
LIMIT 5;
“Suggest untapped high-value cohorts in ComfyWearCo.”
SELECT cohort_id, avg_standardized_avg_order_value,
avg_standardized_total_orders,
avg_standardized_customer_lifetime,
avg_standardized_days_since_last_purchase
FROM hll_cohort_sketches_store_x1
WHERE store_id = 'comfy_wear_store_view' AND cohort_hll_sketch IS NULL
ORDER BY avg_standardized_avg_order_value DESC
LIMIT 5;
One key advantage of our approach is that it supports advanced privacy-preserving functions such as HyperLogLog (HLL) sketches, which are commonly used in Data Clean Rooms to approximate user counts without exposing raw identifiers.
Why HLL Is Essential in a Data Clean Room
- Row-level data is prohibited, so cardinality estimation (e.g., user counts) must be computed via sketches.
- Functions like hll_estimate and hll_merge_agg are allowed, while a direct COUNT(DISTINCT user_id) is not.
- Secure overlaps and lookalikes between cohorts across brands rely on the similarity of HLL sketches.
How Our System Supports It
- Templates are coded to use hll_estimate when selecting individual cohorts.
- The system guards against misuse: e.g., it avoids hll_merge_agg in non-aggregation queries (a minimal guard is sketched after this list).
- The LLM fallback is schema-primed with valid HLL function usage and constraints. We have set up the system to use the correct HLL function depending on the context: hll_estimate(...) is used when selecting or displaying a single cohort; hll_merge_agg(...) is used only when aggregating across rows (e.g., GROUP BY or multi-cohort aggregation).
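The guard itself can be as simple as a lint pass over the generated SQL. The function below is our own illustrative sketch; a production check would parse the query rather than match strings.

import re

def check_hll_usage(sql: str) -> list[str]:
    """Flag common HLL misuse patterns in generated SQL (illustrative only)."""
    issues = []
    lowered = sql.lower()
    if re.search(r"count\s*\(\s*distinct", lowered):
        issues.append("COUNT(DISTINCT ...) is not allowed; use HLL sketches instead.")
    if "hll_merge_agg" in lowered and "group by" not in lowered:
        issues.append("hll_merge_agg found outside an apparent aggregation query; verify context.")
    return issues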
Example: Estimating Users in Each Cohort
SELECT cohort_id, hll_estimate(cohort_hll_sketch) AS estimated_users
FROM hll_cohort_sketches_store_x1
WHERE store_id = 'comfy_wear_store_view';
Example: Side-by-Side Union Across Brands
SELECT 'ComfyWearCo' AS store_name, cohort_id,
hll_estimate(cohort_hll_sketch) AS estimated_users,
avg_standardized_avg_order_value,
avg_standardized_total_orders,
avg_standardized_customer_lifetime,
avg_standardized_days_since_last_purchase
FROM hll_cohort_sketches_store_x1
WHERE store_id = 'comfy_wear_store_view' AND cohort_id = 2
UNION ALL
SELECT 'SportStyleShop' AS store_name, cohort_id,
hll_estimate(cohort_hll_sketch) AS estimated_users,
avg_standardized_avg_order_value,
avg_standardized_total_orders,
avg_standardized_customer_lifetime,
avg_standardized_days_since_last_purchase
FROM hll_cohort_sketches_store_y1
WHERE store_id = 'sporty_style_store_view' AND cohort_id = 4;
Example: Aggregating HLL Sketches with hll_merge_agg
When a prompt asks for totals across cohorts (e.g., per-store user counts), our system recognizes the aggregation intent and rewrites it safely to use hll_merge_agg(...):
SELECT
store_id,
hll_estimate(hll_merge_agg(cohort_hll_sketch)) AS estimated_users
FROM hll_cohort_sketches_store_x1
GROUP BY store_id;
This transformation is handled automatically via SQL templates or LLM prompt logic. It ensures the correct use of sketch merging when aggregating across cohorts, preserving both compliance and accuracy.
Example: Cross-Brand Cohort Similarity
SELECT store_a, cohort_a, store_b, cohort_b, similarity_score_proxy
FROM cohort_similarity
WHERE store_a != store_b
ORDER BY similarity_score_proxy DESC;
Through prompt engineering and template constraints, the interface preserves sketch semantics, ensuring HLL functions are used correctly and safely in every context.
Despite the natural-language interface, this approach preserves mathematical fidelity and compliance by mapping user intent into safe, sketch-aware SQL.
To validate the approach, we ran a prompt test suite against 30+ natural language inputs, each mapped to one of our five supported intents. The results show high accuracy and semantic coverage across intent types. Below are a few representative examples:
✅ Prompt: Which cohorts from SportStyleShop are most similar to cohort 3 in ComfyWearCo
Predicted: lookalike_request | Expected: lookalike_request
SELECT cohort_b, similarity_score_proxy
FROM cohort_similarity
WHERE store_a = 'comfy_wear_store_view'
AND cohort_a = 3
AND store_b = 'sporty_style_store_view'
ORDER BY similarity_score_proxy DESC
LIMIT 5;
✅ Prompt: Compare cohort 4 in ComfyWearCo with cohort 1 in SportStyleShop
Predicted: cohort_comparison | Expected: cohort_comparison
SELECT 'ComfyWearCo' AS store_name, cohort_id,
hll_estimate(cohort_hll_sketch) AS estimated_users,
avg_standardized_avg_order_value,
avg_standardized_total_orders,
avg_standardized_customer_lifetime,
avg_standardized_days_since_last_purchase
FROM hll_cohort_sketches_store_x1
WHERE cohort_id = 4 AND store_id = 'comfy_wear_store_view'
UNION ALL
SELECT 'SportStyleShop' AS store_name, cohort_id,
hll_estimate(cohort_hll_sketch) AS estimated_users,
avg_standardized_avg_order_value,
avg_standardized_total_orders,
avg_standardized_customer_lifetime,
avg_standardized_days_since_last_purchase
FROM hll_cohort_sketches_store_y1
WHERE cohort_id = 1 AND store_id = 'sporty_style_store_view';
✅ Prompt: Suggest new high-value cohorts in ComfyWearCo we haven’t used yet
Predicted: cohort_recommendation | Expected: cohort_recommendation
SELECT cohort_id, avg_standardized_avg_order_value, avg_standardized_total_orders,
avg_standardized_customer_lifetime, avg_standardized_days_since_last_purchase
FROM hll_cohort_sketches_store_x1
WHERE store_id = 'comfy_wear_store_view' AND cohort_hll_sketch IS NULL
ORDER BY avg_standardized_avg_order_value DESC
LIMIT 5;
✅ Prompt: Which cohorts overlap?
Predicted: cohort_overlap | Expected: cohort_overlap
SELECT store_a, cohort_a, store_b, cohort_b, similarity_score_proxy
FROM cohort_similarity
WHERE store_a != store_b
ORDER BY similarity_score_proxy DESC;
✅ Prompt: Rank ComfyWearCo cohorts by recency of last purchase
Predicted: nl_to_sql_query | Expected: nl_to_sql_query
SELECT cohort_id, avg_standardized_days_since_last_purchase
FROM hll_cohort_sketches_store_x1
WHERE store_id = 'comfy_wear_store_view'
ORDER BY avg_standardized_days_since_last_purchase DESC;
Even edge cases where store or cohort references are incomplete degrade gracefully: for example, a missing target brand leads to fallback behavior with an appropriate message, as in the sketch below.
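As an illustration (our own sketch; the distance threshold and messages are assumptions, not measured values), a routing wrapper can refuse to guess when entities are missing or the nearest intent match is too distant. It reuses the classify_intent, extract_entities, and llm_fallback_sql helpers sketched earlier.

def route(prompt: str, schema_text: str):
    """Route a prompt to a template, the LLM fallback, or a clarifying message."""
    intent, neighbors, distances = classify_intent(prompt)
    stores, cohorts = extract_entities(prompt)

    if distances[0] > 1.0:  # assumed L2 threshold; tune on labeled prompts
        return {"status": "fallback", "sql": llm_fallback_sql(prompt, schema_text)}
    if intent == "lookalike_request" and len(stores) < 2:
        return {"status": "clarify",
                "message": "Which target brand should I search for lookalikes in?"}
    return {"status": "template", "intent": intent,
            "stores": stores, "cohorts": cohorts}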
This evaluation confirms that our FAISS-based classifier, combined with intent-specific templates, produced accurate, explainable, and safe SQL for a broad range of exploratory and diagnostic queries within Data Clean Rooms.
Although this implementation used OpenAI’s GPT-4 for fallback SQL generation, the architecture is designed to be LLM-agnostic. Swapping in a different model, such as Anthropic’s Claude, Google’s Gemini, or a local LLM such as LLaMA 3, is straightforward.
How to Swap LLMs
The FAISS-based intent classifier remains unchanged regardless of which LLM you use. However, if your chosen LLM (like Claude or open-source models) doesn’t offer an embedding API, you’ll need to generate embeddings locally.
Replacing OpenAI Embeddings with SentenceTransformers
You can switch to a local embedding model such as all-MiniLM-L6-v2 via the sentence-transformers library:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def get_embedding(text):
    # encode() returns a numpy array; shape it the way FAISS expects
    return model.encode([text])[0].astype("float32").reshape(1, -1)
This keeps your FAISS-based classification logic intact while decoupling from OpenAI’s embedding API, and it enables local inference in air-gapped or regulated environments. Note that the FAISS index must be rebuilt with the new model, since embedding dimensions differ (384 for all-MiniLM-L6-v2 vs. 1536 for text-embedding-3-small).
Claude Integration Example
If using Claude for the SQL generation fallback:
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,  # required by the Messages API
    temperature=0.2,
    messages=[
        {"role": "user", "content": prompt}
    ],
)
sql = response.content[0].text
With a consistent schema and SQL-grounded prompting, the fallback behavior is portable across vendors, enabling teams to remain flexible in their compute and privacy strategy.
Because the FAISS-based intent classifier is independent of the generation model:
- Intent classification stays fast, local, and deterministic
- You can swap LLMs without retraining or modifying the intent logic
- Embeddings from your chosen provider (OpenAI, Hugging Face, etc.) can still be indexed by FAISS
This separation of responsibilities lets you scale and experiment with different LLMs while preserving efficient classification and reducing model invocation costs.
We demonstrated a hybrid system that combines semantic search (via embeddings + FAISS) with safe SQL generation tailored for the data clean room context. By grounding SQL generation in schema and intent templates, and falling back to an LLM only when necessary, this architecture strikes a practical balance between expressiveness and safety. One natural extension is ranking FAISS results with confidence scores.
This is a blueprint for building intelligent interfaces in privacy-constrained analytics environments.