What is synthetic data?
Data generated by a computer intended to replicate or augment existing data.
Why is it useful?
We have all experienced the success of ChatGPT, Llama, and more recently, DeepSeek. These language models are being used ubiquitously across society and have triggered many claims that we are rapidly approaching Artificial General Intelligence, AI capable of replicating any human function.
Before getting too excited, or scared, depending on your perspective, we are also rapidly approaching a hurdle to the advancement of these language models. According to a paper published by a group from the research institute Epoch [1], we are running out of data. They estimate that by 2028 we will have reached the upper limit of possible data upon which to train language models.
What happens if we run out of data?
Well, if we run out of data then we aren't going to have anything new with which to train our language models. These models will then stop improving. If we want to pursue Artificial General Intelligence then we are going to have to come up with new ways of improving AI without simply increasing the volume of real-world training data.
One potential saviour is synthetic data, which can be generated to mimic existing data and has already been used to improve the performance of models like Gemini and DBRX.
Synthetic data beyond LLMs
Beyond overcoming data scarcity for large language models, synthetic data can be used in the following situations:
- Sensitive data: if we don't want to share or use sensitive attributes, synthetic data can be generated which mimics the properties of these features while maintaining anonymity.
- Expensive data: if collecting data is expensive, we can generate a large volume of synthetic data from a small amount of real-world data.
- Lack of data: datasets are biased when there is a disproportionately low number of data points from a particular group. Synthetic data can be used to balance a dataset.
Imbalanced datasets
Imbalanced datasets can (*but not always*) be problematic as they may not contain enough information to effectively train a predictive model. For example, if a dataset contains many more men than women, our model may be biased towards recognising men and misclassify future female samples as men.
In this article we show the imbalance in the popular UCI Adult dataset [2], and how we can use a variational autoencoder to generate synthetic data to improve classification on this example.
We first download the Adult dataset. This dataset contains features such as age, education and occupation which can be used to predict the target outcome 'income'.
# Imports used throughout this tutorial
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Download dataset into a dataframe
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain",
    "capital-loss", "hours-per-week", "native-country", "income"
]
data = pd.read_csv(url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
# Drop rows with missing values
data = data.dropna()
# Split into features and target
X = data.drop(columns=["income"])
y = data['income'].map({'>50K': 1, '<=50K': 0})
In the Adult dataset, income is a binary variable, representing individuals who earn above and below $50,000. We plot the distribution of income over the whole dataset below. We can see that the dataset is heavily imbalanced, with a far larger number of individuals who earn less than $50,000.
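To visualise this imbalance yourself, one minimal way (a sketch using the data dataframe loaded above) is to plot the class counts directly:
# Plot the number of individuals in each income class
income_counts = data["income"].value_counts()
plt.figure(figsize=(6, 4))
plt.bar(income_counts.index, income_counts.values)
plt.xlabel("Income")
plt.ylabel("Number of individuals")
plt.title("Income Class Distribution")
plt.show()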

Despite this imbalance we can still train a machine learning classifier on the Adult dataset, which we can use to determine whether unseen, or test, individuals should be classified as earning above or below 50k.
# Preprocessing: One-hot encode categorical features, scale numerical features
numerical_features = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
categorical_features = [
    "workclass", "education", "marital-status", "occupation", "relationship",
    "race", "sex", "native-country"
]
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),
        ("cat", OneHotEncoder(), categorical_features)
    ]
)
X_processed = preprocessor.fit_transform(X)
# Convert to numpy arrays for PyTorch compatibility
X_processed = X_processed.toarray().astype(np.float32)
y_processed = y.values.astype(np.float32)
# Split dataset into train and test sets
X_model_train, X_model_test, y_model_train, y_model_test = train_test_split(X_processed, y_processed, test_size=0.2, random_state=42)
# Train a random forest classifier on the imbalanced training data
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_model_train, y_model_train)
# Make predictions
y_pred = rf_classifier.predict(X_model_test)
# Compute and display the confusion matrix
cm = confusion_matrix(y_model_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Printing out the confusion matrix of our classifier shows that our model performs fairly well despite the imbalance. Our model has an overall error rate of 16%, but the error rate for the positive class (income > 50k) is 36%, whereas the error rate for the negative class (income < 50k) is much lower.
This discrepancy shows that the model is indeed biased towards the negative class. The model frequently misclassifies individuals who earn more than 50k as earning less than 50k.
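If you want to check these per-class error rates yourself, they can be computed directly from the confusion matrix above (scikit-learn orders the rows with the negative class first):
# Break the confusion matrix into its four cells and compute error rates
tn, fp, fn, tp = cm.ravel()
overall_error = (fp + fn) / (tn + fp + fn + tp)
negative_error = fp / (tn + fp)  # negative-class samples predicted as positive
positive_error = fn / (fn + tp)  # positive-class samples predicted as negative
print(f"Overall error rate: {overall_error:.2%}")
print(f"Negative class error rate: {negative_error:.2%}")
print(f"Positive class error rate: {positive_error:.2%}")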
Below we show how we can use a Variational Autoencoder to generate synthetic data for the positive class to balance this dataset. We then train the same model using the synthetically balanced dataset and reduce model errors on the test set.

How can we generate synthetic data?
There are plenty of different methods for generating synthetic data. These include more traditional methods such as SMOTE and Gaussian noise, which generate new data by modifying existing data. Alternatively, generative models such as Variational Autoencoders or Generative Adversarial Networks are well suited to generating new data, as their architectures learn the distribution of real data and use it to generate synthetic samples.
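For comparison, the traditional route is only a few lines. A sketch using the imbalanced-learn package (not used elsewhere in this tutorial, so treat it as an assumption) to oversample the minority class with SMOTE would look like this:
# Oversample the positive class with SMOTE (requires the imbalanced-learn package)
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_model_train, y_model_train.astype(int))
print("Class counts before:", np.bincount(y_model_train.astype(int)))
print("Class counts after: ", np.bincount(y_resampled))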
In this tutorial we use a variational autoencoder to generate synthetic data.
Variational Autoencoders
Variational Autoencoders (VAEs) are great for synthetic data generation because they use real data to learn a continuous latent space. We can view this latent space as a magic bucket from which we can sample synthetic data which closely resembles existing data. The continuity of this space is one of their big selling points, as it means the model generalises well and doesn't just memorise the latent representations of specific inputs.
A VAE consists of an encoder, which maps input data into a probability distribution (mean and variance), and a decoder, which reconstructs the data from the latent space.
For that continuous latent space, VAEs use a reparameterization trick, where a random noise vector is scaled and shifted using the learned mean and variance, ensuring smooth and continuous representations in the latent space.
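In symbols, with the encoder producing a mean and a log variance for each input, a latent sample is drawn as:
z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \quad \sigma = \exp(0.5 \cdot \log\sigma^2)
Because the randomness lives entirely in the noise term, gradients can still flow through the mean and variance during training.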
Below we construct a BasicVAE class which implements this process with a simple architecture.
- The encoder compresses the input into a smaller, hidden representation, producing both a mean and log variance that define a Gaussian distribution, i.e. creating our magic sampling bucket. Instead of sampling directly, the model applies the reparameterization trick to generate latent variables, which are then passed to the decoder.
- The decoder reconstructs the original data from these latent variables, ensuring the generated data maintains the characteristics of the original dataset.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class BasicVAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super(BasicVAE, self).__init__()
        # Encoder: single small layer
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 8),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(8, latent_dim)
        self.fc_logvar = nn.Linear(8, latent_dim)
        # Decoder: single small layer
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 8),
            nn.ReLU(),
            nn.Linear(8, input_dim),
            nn.Sigmoid()  # Outputs values in range [0, 1]
        )

    def encode(self, x):
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
Given our BasicVAE architecture, we construct our loss function and model training below.
def vae_loss(recon_x, x, mu, logvar, tau=0.5, c=1.0):
    # Reconstruction loss (MSE between input and reconstruction); tau and c are unused in this basic version
    recon_loss = nn.MSELoss()(recon_x, x)
    # KL divergence loss
    kld_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kld_loss / x.size(0)

def train_vae(model, data_loader, epochs, learning_rate):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    model.train()
    losses = []
    reconstruction_mse = []
    for epoch in range(epochs):
        total_loss = 0
        total_mse = 0
        for batch in data_loader:
            batch_data = batch[0]
            optimizer.zero_grad()
            reconstructed, mu, logvar = model(batch_data)
            loss = vae_loss(reconstructed, batch_data, mu, logvar)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            # Compute batch-wise MSE for comparison
            mse = nn.MSELoss()(reconstructed, batch_data).item()
            total_mse += mse
        losses.append(total_loss / len(data_loader))
        reconstruction_mse.append(total_mse / len(data_loader))
        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}, MSE: {total_mse:.4f}")
    return losses, reconstruction_mse
# Append the income column so the VAE learns features and label together
combined_data = np.concatenate([X_model_train.copy(), y_model_train.copy().reshape(-1, 1)], axis=1)
# Train-test split
X_train, X_test = train_test_split(combined_data, test_size=0.2, random_state=42)
batch_size = 128
# Create DataLoaders
train_loader = DataLoader(TensorDataset(torch.tensor(X_train)), batch_size=batch_size, shuffle=True)
test_loader = DataLoader(TensorDataset(torch.tensor(X_test)), batch_size=batch_size, shuffle=False)
basic_vae = BasicVAE(input_dim=X_train.shape[1], latent_dim=8)
basic_losses, basic_mse = train_vae(
    basic_vae, train_loader, epochs=50, learning_rate=0.001,
)
# Visualise results
plt.figure(figsize=(12, 6))
plt.plot(basic_mse, label="Basic VAE")
plt.ylabel("Reconstruction MSE")
plt.title("Training Reconstruction MSE")
plt.legend()
plt.show()
vae_loss consists of two components: reconstruction loss, which measures how well the generated data matches the original input using Mean Squared Error (MSE), and KL divergence loss, which ensures that the learned latent space follows a normal distribution.
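Written out, with N the batch size and the sum running over the latent dimensions of every sample in the batch, the quantity the code minimises is:
\mathcal{L} = \mathrm{MSE}(\hat{x}, x) - \frac{1}{2N} \sum_{j} \left(1 + \log\sigma_{j}^{2} - \mu_{j}^{2} - \sigma_{j}^{2}\right)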
train_vae optimises the VAE using the Adam optimizer over a number of epochs. During training, the model takes mini-batches of data, reconstructs them, and computes the loss using vae_loss. These errors are then corrected via backpropagation, where the model weights are updated. We train the model for 50 epochs and plot how the reconstruction mean squared error decreases over training.
We can see that our model quickly learns how to reconstruct our data, evidencing efficient learning.

Now that we've trained our BasicVAE to accurately reconstruct the Adult dataset, we can use it to generate synthetic data. We want to generate more samples of the positive class (individuals who earn over 50k) in order to balance out the classes and remove the bias from our model.
To do this we select all the samples from our VAE dataset where income is the positive class (earn more than 50k). We then encode these samples into the latent space. As we've only chosen samples of the positive class to encode, this latent space will reflect properties of the positive class, which we can sample from to create synthetic data.
We sample 15000 new points from this latent space and decode these latent vectors back into the input data space as our synthetic data points.
# sample_df holds the combined training features and income column as a DataFrame
# (assumed here to come from combined_data defined above)
sample_df = pd.DataFrame(combined_data)
# Create column names
col_number = sample_df.shape[1]
col_names = [str(i) for i in range(col_number)]
sample_df.columns = col_names
# Define the feature value to filter on
feature_value = 1.0  # Here we select the rows where income is 1 (over 50k)
# Select all samples of the positive class
selected_samples = sample_df[sample_df[col_names[-1]] == feature_value]
selected_samples = selected_samples.values
selected_samples_tensor = torch.tensor(selected_samples, dtype=torch.float32)
basic_vae.eval()  # Set model to evaluation mode
with torch.no_grad():
    mu, logvar = basic_vae.encode(selected_samples_tensor)
    latent_vectors = basic_vae.reparameterize(mu, logvar)
# Compute the mean latent vector for the positive class
mean_latent_vector = latent_vectors.mean(dim=0)
num_samples = 15000  # Number of new samples
latent_dim = 8
# Perturb the mean latent vector with small Gaussian noise to draw new latent samples
latent_samples = mean_latent_vector + 0.1 * torch.randn(num_samples, latent_dim)
with torch.no_grad():
    generated_samples = basic_vae.decode(latent_samples)
Now that we've generated synthetic data for the positive class, we can combine it with the original training data to generate a balanced synthetic dataset.
# Convert the generated tensor into a DataFrame
new_data = pd.DataFrame(generated_samples.numpy())
# Create column names
col_number = new_data.shape[1]
col_names = [str(i) for i in range(col_number)]
new_data.columns = col_names
# Separate the synthetic features from the income column; all synthetic samples are positive
X_synthetic = new_data.drop(col_names[-1], axis=1)
y_synthetic = np.asarray([1 for _ in range(0, X_synthetic.shape[0])])
# Combine the original training data with the synthetic positive samples
X_synthetic_train = np.concatenate([X_model_train, X_synthetic.values], axis=0)
y_synthetic_train = np.concatenate([y_model_train, y_synthetic], axis=0)
# Mapping back to the original income labels
mapping = {1: '>50K', 0: '<=50K'}

We can now use our balanced synthetic training dataset to retrain our random forest classifier. We can then evaluate this new model on the original test data to see how effective our synthetic data is at reducing the model bias.
# Retrain the classifier on the synthetically balanced training data
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_synthetic_train, y_synthetic_train)
# Make predictions on the original test set
y_pred = rf_classifier.predict(X_model_test)
cm = confusion_matrix(y_model_test, y_pred)
# Create heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Our new classifier, trained on the balanced synthetic dataset, makes fewer errors on the original test set than our original classifier trained on the imbalanced dataset, and our error rate is now reduced to 14%.

However, we have not been able to reduce the discrepancy in errors by a significant amount; our error rate for the positive class is still 36%. This could be due to the following reasons:
- We have discussed how one of the benefits of VAEs is the learning of a continuous latent space. However, if the majority class dominates, the latent space might skew towards the majority class.
- The model may not have properly learned a distinct representation for the minority class due to the lack of data, making it hard to sample from that region accurately.
In this tutorial we have introduced and built a BasicVAE architecture which can be used to generate synthetic data that improves the classification accuracy on an imbalanced dataset.
Follow for future articles where I'll show how we can build more sophisticated VAE architectures which address the above problems with imbalanced sampling and more.
[1] Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., & Hobbhahn, M. (2024). Will we run out of data? Limits of LLM scaling based on human-generated data. arXiv preprint arXiv:2211.04325.
[2] Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.