Abstract
This article presents a complete workflow for building and evaluating machine learning models to predict farmers' climate change adaptation strategies in Ethiopia. The original paper, which used a multinomial logit econometric model, is linked here. The workflow begins by importing essential libraries such as TensorFlow, scikit-learn, XGBoost, and others to create and train the various models. The dataset, loaded from a Stata file, is preprocessed by renaming columns, handling missing values, and splitting the data into training, validation, and test sets. GitHub link here.
Features are standardized, and the data is reshaped for the convolutional neural network (CNN) model. The CNN is defined and trained using a deep learning approach with two convolutional layers, max pooling, dropout layers for regularization, and a final dense layer for classification. Early stopping and learning rate reduction are applied to prevent overfitting and keep training efficient.
In addition to the CNN, established machine learning models such as XGBoost, Random Forest, Support Vector Machine (SVM), and K-Nearest Neighbors (KNN) are trained and evaluated. Each model's performance is assessed by accuracy.
The Random Forest model achieves the highest accuracy at 0.8577, followed by XGBoost at 0.8415. The CNN performs worst, with an accuracy of 0.4512, highlighting the challenges of applying deep learning models in this context.
The following sections give a detailed breakdown of every step in the analysis.
Importing the essential libraries for data handling, machine learning, deep learning, and visualization.
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv1D, MaxPooling1D, Flatten, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns
from tabulate import tabulate
Loading the dataset with pandas from a Stata .dta file. The dataset is assumed to be pre-cleaned.
df = pd.read_stata('/content/MLDeepLearningModel.dta')
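As a quick sanity check (an optional addition, not part of the original workflow), it helps to confirm the file loaded as expected before renaming anything:
# Optional sanity check: inspect dimensions and raw column names
print(df.shape)
print(df.columns.tolist())
print(df.head())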
Renaming the feature columns to more descriptive names for clarity and readability.
# Feature Mapping: Rename feature columns to more descriptive names
feature_mapping = {
    'edu': 'Years of education',
    'hhsize': 'Size of household',
    'gender': 'Gender of the head of household',
    'age': 'Age of the head of household',
    'inc': 'Farm income',
    'nfinc': 'Nonfarm income',
    'ownlv': 'Livestock ownership',
    'ext': 'Extension on crop and livestock',
    'extcl': 'Information on climate change',
    'ffext': 'Farmer-to-farmer extension',
    'cred': 'Credit',
    'rlgo': 'Number of relatives in the got',
    'kolla': 'Local agroecology kolla (lowlands)',
    'woinadega': 'Local agroecology weynadega (midlands)',
    'dega': 'Local agroecology dega (highlands)',
    'av_temp': 'Temperature',
    'av_rain': 'Precipitation'
}

# Apply the feature column renaming
df.rename(columns=feature_mapping, inplace=True)
Renaming the labels to represent the adaptation choices more clearly.
# Label Mapping: Rename adaptation choices for clarity
label_mapping = {
    'one': 'No adaptation',
    'two': 'Planting trees',
    'three': 'Soil conservation',
    'four': 'Different crop varieties',
    'five': 'Early and late planting',
    'six': 'Irrigation'
}

# Apply the label column renaming
df.rename(columns=label_mapping, inplace=True)
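Because a misspelled key in either dictionary would silently leave a column unrenamed, a small guard (an illustrative addition) can confirm every expected column now exists:
# Optional check: confirm all renamed columns are present in the dataframe
expected = list(feature_mapping.values()) + list(label_mapping.values())
missing = [col for col in expected if col not in df.columns]
print('Missing columns:', missing if missing else 'none')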
Extracting the features (X) and labels (y) for the machine learning models.
# Extract feature and target variables
features = list(feature_mapping.values())  # List of feature column names
labels = list(label_mapping.values())      # List of label column names

# Prepare data for modeling
X = df[features].values  # Feature matrix
y = df[labels].values    # Target labels (one-hot encoded)
Handling missing values by replacing NaN with zeros, a simple choice that ensures training does not fail on missing entries.
# Handle missing values by replacing NaN with zeros
X = np.nan_to_num(X, nan=0)
y = np.nan_to_num(y, nan=0)
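Since later steps recover the class index with `argmax(axis=1)`, each row of `y` should contain exactly one active label. A minimal check (an illustrative addition, run after the NaN handling above):
# Optional check: each household should have exactly one adaptation choice recorded
row_sums = y.sum(axis=1)
print('Rows with exactly one label:', int((row_sums == 1).sum()), 'of', len(y))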
Splitting the data into training, validation, and test sets: an 80–20 split separates out the test set, and a further 80–20 split of the remainder creates the validation set.
# Split the dataset into training, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
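The two successive splits leave roughly 64% of the data for training, 16% for validation, and 20% for testing, which a quick shape check (illustrative) confirms:
# Optional check: confirm the ~64% / 16% / 20% train/validation/test proportions
print('Train:', X_train.shape, 'Validation:', X_val.shape, 'Test:', X_test.shape)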
Standardizing the features so the models perform better, since most machine learning models expect features on a similar scale.
# Standardize features using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
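After fitting the scaler on the training set only (which avoids leaking test-set statistics), the training features should be roughly zero-mean with unit variance. A short check (an illustrative addition):
# Optional check: scaled training features should be ~zero-mean, unit-variance
print(X_train.mean(axis=0).round(3))
print(X_train.std(axis=0).round(3))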
Reshaping the feature data into the 3D format expected by the convolutional neural network (CNN), whose input takes the form (samples, timesteps, features).
# Reshape the data for CNN input (adding a third dimension)
X_train_3d = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_val_3d = X_val.reshape(X_val.shape[0], X_val.shape[1], 1)
X_test_3d = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
Defining the CNN architecture with convolutional layers, max pooling layers, dropout layers, and fully connected layers.
# Define the CNN model architecture
cnn_model = Sequential([
    Conv1D(64, 3, activation='relu', input_shape=(X_train_3d.shape[1], 1)),  # First convolution layer
    MaxPooling1D(pool_size=2),  # Max pooling layer
    Dropout(0.4),  # Dropout layer to prevent overfitting
    Conv1D(128, 3, activation='relu'),  # Second convolution layer
    MaxPooling1D(pool_size=2),  # Max pooling layer
    Dropout(0.4),  # Dropout layer
    Flatten(),  # Flatten the output for the fully connected layers
    Dense(128, activation='relu'),  # Fully connected layer
    Dropout(0.4),  # Dropout layer
    Dense(64, activation='relu'),  # Fully connected layer
    Dense(y_train.shape[1], activation='softmax')  # Output layer with softmax for multi-class classification
])
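Keras's `summary()` prints each layer's output shape and parameter count, a useful optional check that the convolutions and pooling fit the small input dimension:
# Optional: inspect layer output shapes and parameter counts
cnn_model.summary()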
Compiling the model with the Adam optimizer, the categorical cross-entropy loss (for multi-class classification), and accuracy as the evaluation metric.
# Compile the model
cnn_model.compile(optimizer=Adam(learning_rate=0.0005), loss='categorical_crossentropy', metrics=['accuracy'])
Defining callbacks for early stopping (if the validation loss stops improving) and learning rate reduction (if the validation loss plateaus).
# Define callbacks to stop training early if the model stops improving and to reduce the learning rate if needed
callbacks = [
EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True),
ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_lr=1e-6)
]
Training the CNN model on the training and validation datasets with the defined callbacks.
# Train the CNN model
cnn_history = cnn_model.fit(X_train_3d, y_train, epochs=100, batch_size=32, validation_data=(X_val_3d, y_val), callbacks=callbacks)
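The returned `cnn_history` object stores the per-epoch metrics, so plotting the training and validation loss curves (an optional addition using the already-imported matplotlib) makes over- or underfitting easy to spot:
# Optional: plot training vs. validation loss to diagnose overfitting
plt.figure(figsize=(8, 4))
plt.plot(cnn_history.history['loss'], label='Training loss')
plt.plot(cnn_history.history['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Categorical cross-entropy loss')
plt.legend()
plt.show()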
Training several traditional machine learning models (XGBoost, Random Forest, SVM, KNN, and multinomial logistic regression) to compare their performance.
# Train other machine learning models for comparison
xgb_model = xgb.XGBClassifier(objective='multi:softmax', num_class=len(labels), eval_metric='mlogloss')
xgb_model.fit(X_train, y_train.argmax(axis=1))

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train.argmax(axis=1))

svm_model = SVC(kernel='linear', decision_function_shape='ovr', probability=True)
svm_model.fit(X_train, y_train.argmax(axis=1))

knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train.argmax(axis=1))

mnl_model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=500)
mnl_model.fit(X_train, y_train.argmax(axis=1))
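Since a single train/test split can be sensitive to how the data happened to be divided, a cross-validated accuracy estimate (an illustrative addition, shown here for the Random Forest) offers a more stable comparison for the scikit-learn models:
from sklearn.model_selection import cross_val_score

# Optional: 5-fold cross-validated accuracy on the training data
cv_scores = cross_val_score(rf_model, X_train, y_train.argmax(axis=1), cv=5, scoring='accuracy')
print('Random Forest CV accuracy: %.4f (+/- %.4f)' % (cv_scores.mean(), cv_scores.std()))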
Evaluating each model's performance on the test data and storing the results.
# Store model performance for evaluation
model_results = {
    'CNN': accuracy_score(y_test.argmax(axis=1), cnn_model.predict(X_test_3d).argmax(axis=1))
}

# Evaluate all models and store the results
models = {
    'XGBoost': xgb_model,
    'RandomForest': rf_model,
    'SVM': svm_model,
    'KNN': knn_model,
    'Multinomial Logit': mnl_model
}
# Evaluate the models on the test set
for model_name, model in models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test.argmax(axis=1), y_pred)
    model_results[model_name] = accuracy
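Accuracy alone can mask per-class differences when the adaptation choices are unevenly represented, so a per-class breakdown (an illustrative addition using scikit-learn's `classification_report`) is worth a look for the strongest model:
from sklearn.metrics import classification_report

# Optional: per-class precision, recall, and F1 for the Random Forest
# (assumes all six adaptation choices appear in the test set)
y_pred_rf = rf_model.predict(X_test)
print(classification_report(y_test.argmax(axis=1), y_pred_rf, target_names=labels))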
Sorting the models by accuracy and displaying the results in a readable table.
# Sort models by accuracy and display the results
sorted_model_results = sorted(model_results.items(), key=lambda x: x[1], reverse=True)
print(tabulate(sorted_model_results, headers=['Model', 'Accuracy'], tablefmt='fancy_grid'))
Plotting a bar chart to compare the models' performance visually.
# Visualize the results in a bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=[result[0] for result in sorted_model_results], y=[result[1] for result in sorted_model_results])
plt.title('Comparison of Model Accuracies')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.show()