    Gaussian-Weighted Word Embeddings for Sentiment Analysis | by Sgsahoo | Jun, 2025



In the realm of natural language processing (NLP), sentence embeddings play a vital role in capturing the overall semantic meaning of a given text. Traditionally, averaging pre-trained word embeddings like Word2Vec or GloVe has served as a straightforward yet effective technique. But what happens when your data isn't simple? What if you're analyzing nuanced, lengthy movie reviews that contain both praise and criticism? That's when simple averaging falls short.

In this blog, we'll explore a novel technique: Gaussian-weighted word embeddings. This method weights each word vector based on its proximity to a centroid, reducing the influence of outliers and preserving semantic richness. We'll walk through the idea, the implementation, and how it performs in a full machine learning pipeline.

In n-dimensional space, we often assume that words with "positive" sentiment occupy regions far removed from those with "negative" sentiment. Under this assumption, simple averaging may yield vectors that fall close to a class centroid, making classification easy.

However, movie reviews are often lengthy and multi-faceted. A single review might highlight both the brilliant acting and a weak storyline. In such cases, averaging the word vectors can result in semantic dilution, where opposing sentiments cancel each other out, confusing the classifier.
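As a toy illustration (the 2-D "embeddings" here are made up purely for demonstration, not real vectors), two opposing sentiment vectors can average out to almost nothing:

import numpy as np

# Hypothetical 2-D vectors standing in for real embeddings
brilliant_acting = np.array([0.9, 0.8])   # strongly "positive" direction
weak_storyline = np.array([-0.9, -0.7])   # strongly "negative" direction

print(np.mean([brilliant_acting, weak_storyline], axis=0))  # [0.  0.05] -- nearly the zero vector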

To mitigate this, a Gaussian-weighted approach is proposed. The idea is to:

1. Compute the centroid G of the word vectors in a sentence.
2. Calculate the distance D from G to the farthest word vector.
3. Assign weights to each word vector using a Gaussian distribution centred at G, with its spread set by D/2 (see the formula just below).
4. Aggregate the sentence vector using these weights.
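Concretely, for a word vector v_i at distance d_i = ||v_i − G|| from the centroid, the weight used in the Step 3 implementation is:

w_i = exp(−(d_i / (D/2))²)

so a word sitting at the centroid gets weight 1, while the farthest word (d_i = D) gets weight e⁻⁴ ≈ 0.018.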

Why Does This Work?

The main idea behind this weighting is to reduce the influence of words that lie farther from the central interpretation. Since we are already exploring the example of movie review classification, we can assume that such reviews both laud and criticise the film. Even for a human being, classifying such a review is a thoughtful task: you may want to focus on the tone of the review and what it mostly talks about, the positive sentiment or the negative one. In such a case, the word vectors lie scattered across the space, and it becomes harder to cluster them together into a single input vector. By assigning Gaussian weights to the word vectors, we reduce the scattering: the central notion of the review is emphasised, and sentiments that lie closer to that central interpretation are valued more.

[Figure: Gaussian-weighted resultant vector]

Step 1: Preprocessing the Data

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Lowercase and strip everything except letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
    tokens = word_tokenize(text)
    # Drop stopwords and lemmatize what remains
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return tokens

df['clean_text'] = df['text'].apply(clean_text)
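The snippets here assume a pandas DataFrame df with a raw text column and a sentiment label column; a minimal loading sketch (the file name is hypothetical) might look like:

import pandas as pd

# Hypothetical CSV with 'text' and 'label' (e.g. 0 = negative, 1 = positive) columns
df = pd.read_csv('movie_reviews.csv')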

Step 2: Train a Word2Vec Model

from gensim.models import Word2Vec

model = Word2Vec(sentences=df['clean_text'], vector_size=100, window=5, min_count=2, workers=4)
word_vectors = model.wv
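As a quick sanity check (results will vary with your corpus, and 'good' must actually appear in the vocabulary), you can inspect the learned space:

print(len(word_vectors))                           # vocabulary size
print(word_vectors.most_similar('good', topn=5))   # nearest neighbours of 'good'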

    Step 3: Gaussian-Weighted Sentence Vector

import numpy as np

def get_weighted_sentence_vector(tokens, word_vectors):
    vectors = [word_vectors[word] for word in tokens if word in word_vectors]
    if not vectors:
        return np.zeros(word_vectors.vector_size)

    vectors = np.array(vectors)
    centroid = np.mean(vectors, axis=0)
    distances = np.linalg.norm(vectors - centroid, axis=1)
    max_dist = np.max(distances) or 1e-6  # guard against zero spread (single or identical words)

    # Gaussian weights: 1 at the centroid, exp(-4) for the farthest word
    weights = np.exp(-((distances / (max_dist / 2)) ** 2))
    weighted_sum = np.sum(weights[:, np.newaxis] * vectors, axis=0)
    # Average by word count (the weights themselves are not normalised)
    return weighted_sum / len(vectors)

X = np.array([get_weighted_sentence_vector(tokens, word_vectors) for tokens in df['clean_text']])
y = df['label'].values
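For comparison, the plain mean-pooling baseline that the Gaussian weighting is meant to improve on is a small variation of the same function; swapping it in makes it easy to measure the difference on your own data:

def get_mean_sentence_vector(tokens, word_vectors):
    vectors = [word_vectors[word] for word in tokens if word in word_vectors]
    if not vectors:
        return np.zeros(word_vectors.vector_size)
    # Unweighted average of all in-vocabulary word vectors
    return np.mean(np.array(vectors), axis=0)

X_baseline = np.array([get_mean_sentence_vector(tokens, word_vectors) for tokens in df['clean_text']])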

Step 4: Training Classifiers

    Now you can use any scikit-learn classifier:

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    clf = RandomForestClassifier()
clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(classification_report(y_test, y_pred))
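Since the sentence vectors are dense and low-dimensional, a linear model is also worth trying alongside the random forest; a brief sketch (not part of the original pipeline):

from sklearn.linear_model import LogisticRegression

lin_clf = LogisticRegression(max_iter=1000)
lin_clf.fit(X_train, y_train)
print(classification_report(y_test, lin_clf.predict(X_test)))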

By applying Gaussian-weighted sentence vectors, we improve the robustness of text representations for complex reviews. This technique not only reduces noise from contradictory sentiments but also allows models to better distinguish sentiment in nuanced texts.

This approach is especially useful in real-world sentiment analysis applications like movie reviews, where the text often contains mixed sentiments.


