Sure! Let’s dive deeper into practical implementation, with some tips and best practices for feature creation, as well as advanced feature engineering in practice with more complex data scenarios.
a. Time Series Feature Engineering
Time series data often presents unique challenges and opportunities for feature creation. It’s essential to understand the temporal aspect of the data. Let’s walk through a practical example of creating features for a sales prediction problem with daily sales data.
- Lag Features: Lag features are crucial for time series modeling. If we have daily sales data, we might create lag features to predict future sales based on previous days’ sales.
import pandas as pd

sales_data = pd.read_csv('sales_data.csv', parse_dates=['date'], index_col='date')
sales_data['sales_lag_1'] = sales_data['sales'].shift(1)
sales_data['sales_lag_2'] = sales_data['sales'].shift(2)
- In this example, we create lag features sales_lag_1 and sales_lag_2 for the previous 1 and 2 days.
- Rolling Statistics: Rolling features can help capture trends over time.
sales_data['rolling_mean_7'] = sales_data['sales'].rolling(window=7).mean()
sales_data['rolling_std_7'] = sales_data['sales'].rolling(window=7).std()
- Here, we add a 7-day rolling mean and a 7-day rolling standard deviation to capture short-term trends in sales.
- Date Features: We can extract components of the date feature such as day of the week, month, or a holiday indicator.
sales_data['day_of_week'] = sales_data.index.dayofweek
sales_data['is_weekend'] = (sales_data['day_of_week'] >= 5).astype(int)
sales_data['month'] = sales_data.index.month
- Here, we create features like day of the week and a weekend indicator. These can be useful for capturing patterns related to weekday vs. weekend sales.
- Seasonality Features: We can include features that capture seasonality or periodicity in sales, like holidays or seasonal trends.
sales_data['is_holiday'] = sales_data.index.isin(pd.to_datetime(['2024-12-25', '2024-01-01']))
- This creates a binary feature is_holiday, which indicates whether the date is a holiday.
b. Text Data Feature Engineering
When dealing with textual data, you often need to transform it into numerical features that capture the underlying structure of the text. Here’s an example of feature creation for a sentiment analysis task using a set of customer reviews.
- Text Preprocessing: First, you’ll want to clean and preprocess the text data (e.g., lowercasing, removing stopwords, tokenization).
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()
    words = text.split()
    return " ".join([word for word in words if word not in stop_words])

df['cleaned_reviews'] = df['reviews'].apply(preprocess)
- Bag-of-Words (BoW): Once the text is preprocessed, you can convert it into numerical features using Bag-of-Words or TF-IDF.
vectorizer = CountVectorizer(max_features=1000)  # Limit to the top 1000 words
X_bow = vectorizer.fit_transform(df['cleaned_reviews'])
- The result is a document-term matrix, where each row represents a review and each column corresponds to a word. This is the most basic feature creation technique for text.
- TF-IDF: For a more informative representation, use TF-IDF instead of raw word counts to highlight important words.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=1000)
X_tfidf = tfidf_vectorizer.fit_transform(df['cleaned_reviews'])
- Word Embeddings: You can also convert the text into dense vectors using word embeddings (e.g., Word2Vec or GloVe) to capture semantic meaning.
from gensim.models import Word2Vec
import numpy as np

# Train a Word2Vec model on the cleaned reviews
model = Word2Vec(df['cleaned_reviews'].apply(lambda x: x.split()), vector_size=100, window=5, min_count=1)

# Create one vector per review by averaging its word vectors
df['word2vec'] = df['cleaned_reviews'].apply(lambda x: np.mean(model.wv[x.split()], axis=0))
c. Categorical Data Feature Engineering
For categorical data, we often need to encode it in a way that machine learning algorithms can understand. Let’s say we have a dataset with a “City” feature.
- One-Hot Encoding: A simple way to handle categorical variables is One-Hot Encoding, which creates a new binary column for each unique category.
df_one_hot = pd.get_dummies(df['City'], prefix='City')
df = pd.concat([df, df_one_hot], axis=1)
- Target Encoding: Target encoding replaces each category with the mean of the target variable for that category. This can work well for high-cardinality features.
df['City_encoded'] = df.groupby('City')['Target'].transform('mean')
- Frequency Encoding: Frequency encoding replaces each category with its relative frequency in the dataset, which can be useful when the distribution of categories varies greatly.
freq_encoding = df['City'].value_counts(normalize=True)
df['City_freq'] = df['City'].map(freq_encoding)
- Count Encoding: Similar to frequency encoding, count encoding replaces categories with the number of times they appear.
count_encoding = df['City'].value_counts()
df['City_count'] = df['City'].map(count_encoding)
a. Keep Track of Feature Importance
When you are experimenting with different feature creation methods, it’s important to track which features are actually useful for your model. A few ways to do this:
- Correlation: For numerical features, check correlations to identify redundancy or multicollinearity (see the sketch after the code below).
- Feature Importance with Tree Models: Models like Random Forest, XGBoost, and LightGBM can be used to rank features by importance, helping to identify which features contribute most to model performance.
import xgboost as xgb

model = xgb.XGBClassifier()
model.fit(X_train, y_train)
feature_importances = model.feature_importances_
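The correlation check mentioned above can be run directly on the feature table. A minimal sketch, assuming X_train is a pandas DataFrame of numeric features (the name is carried over from the example above) and using an illustrative 0.9 threshold:

# Pairwise correlations between numeric features
corr_matrix = X_train.corr()

# List feature pairs whose absolute correlation exceeds the (illustrative) threshold
cols = corr_matrix.columns
redundant_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr_matrix.iloc[i, j]) > 0.9
]
print(redundant_pairs)

Highly correlated pairs are candidates for dropping one of the two features or combining them.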
b. Avoid Overfitting
Be cautious about creating too many features, especially if they are derived from a small number of rows, as this can lead to overfitting. Good practice is to (see the sketch after this list):
- Use cross-validation to test feature combinations.
- Regularize models that are prone to overfitting (e.g., Lasso or Ridge regression).
- Consider dimensionality reduction techniques (e.g., PCA, t-SNE, or UMAP) to reduce the feature space.
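A minimal sketch combining two of these ideas, cross-validation and a regularized (Lasso) model; the names X and y are assumptions standing in for your feature matrix and target:

from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Lasso shrinks the weights of uninformative features toward zero
model = Lasso(alpha=0.1)

# 5-fold cross-validation gives a more honest estimate of generalization
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(scores.mean(), scores.std())

If the cross-validated score drops when a new feature is added, that feature is likely adding noise rather than signal.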
c. Automate Feature Creation with Featuretools
If you have relational data (e.g., customer transactions, product details, etc.), you can use Featuretools, a Python library designed to automate feature engineering.
import featuretools as ft
es = ft.EntitySet(id="sales_data")
es = es.entity_from_dataframe(entity_id="sales", dataframe=df, index="transaction_id")
# Automatically generate features based on the relationships between tables
feature_matrix, feature_names = ft.dfs(entityset=es, target_entity="sales")
d. Feature Creation for Imbalanced Datasets
If you’re working with an imbalanced dataset, consider the following:
- Synthetic Sample Creation: Use techniques like SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic data points for the minority class (see the sketch after this list).
- Resampling: Balance the dataset using over-sampling or under-sampling techniques.
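A minimal sketch of SMOTE using the imbalanced-learn library; X and y are assumptions standing in for your feature matrix and class labels:

from imblearn.over_sampling import SMOTE

# Generate synthetic minority-class samples so the classes become balanced
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

Apply resampling only to the training split, not to the validation or test data, so evaluation still reflects the true class distribution.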