Providing Value through Machine Learning
Can we accurately predict California housing prices using key features like location, income, and housing characteristics? Understanding what drives house prices is crucial for homebuyers, investors, and policymakers. This project explores different machine learning models to determine which performs best and uncovers the key factors influencing housing costs.
- Dependent Variable: median_house_value (target variable)
- Independent Variables: all other columns apart from median_house_value
This blog highlights important snippets of code. Refer to the full code for a comprehensive analysis.
Handling Skewed Data: Log Transformation
💡 Why apply log transformation?
Log transformation helps:
✅ Normalize skewed distributions
✅ Reduce the influence of outliers
✅ Improve model interpretability
# Log-transform the selected skewed features
data['total_rooms'] = np.log(data['total_rooms'] + 1)
data['total_bedrooms'] = np.log(data['total_bedrooms'] + 1)
data['population'] = np.log(data['population'] + 1)
data['households'] = np.log(data['households'] + 1)
💡 Why add +1?
The primary reason for adding 1 before taking the logarithm is to handle zero values. The logarithm of zero is undefined (negative infinity), which can cause issues in calculations and model training. By adding 1, we ensure that all values are positive and avoid this undefined case.
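As a side note, NumPy provides a helper that folds the +1 in for you: a minimal sketch showing that np.log1p matches the np.log(x + 1) pattern used above.

```python
import numpy as np

# np.log1p(x) computes log(1 + x) directly; it is equivalent to
# np.log(x + 1) and more numerically accurate for x near zero
values = np.array([0.0, 9.0, 99.0])
assert np.allclose(np.log1p(values), np.log(values + 1))
print(np.log1p(values))
```

Either spelling works for the transformation above; log1p simply makes the intent explicit.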
✅ Pandas get_dummies — Converted the categorical feature (ocean_proximity) into numerical values
✅ Correlation Analysis — Identified which features influence house values the most
✅ Feature Combination — Combined similar features to avoid redundancy
✅ StratifiedShuffleSplit — Ensured balanced training and test data distribution
✅ StandardScaler — Scaled selected features for better ML performance
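The encoding, splitting, and scaling steps above can be sketched end to end. This is a minimal illustration on a tiny synthetic frame (the column values and the income buckets used for stratification are made up for the example), not the project's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler

# Tiny synthetic frame standing in for the housing data (hypothetical values)
data = pd.DataFrame({
    "median_income": [1.5, 3.0, 4.5, 6.0, 2.0, 5.0, 3.5, 4.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN",
                        "INLAND", "NEAR BAY", "NEAR OCEAN", "INLAND"],
    "median_house_value": [80_000, 250_000, 120_000, 400_000,
                           90_000, 320_000, 280_000, 150_000],
})

# One-hot encode the categorical column (creates an INLAND indicator, among others)
data = pd.get_dummies(data, columns=["ocean_proximity"])

# Stratify on an income bucket so train and test share the same income mix
income_cat = pd.cut(data["median_income"], bins=[0, 3, 6, np.inf], labels=[1, 2, 3])
split = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(split.split(data, income_cat))
train, test = data.iloc[train_idx], data.iloc[test_idx]

# Scale the numeric feature; fit on the training set only to avoid leakage
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train[["median_income"]])
print(train.shape, test.shape)
```

Fitting the scaler on the training split alone is the important detail: statistics from the test set must never leak into preprocessing.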
We tested several machine learning models and evaluated their performance:
💡 What's RMSE & MAE?
- Root Mean Squared Error (RMSE): Measures the average prediction error, penalizing large errors more heavily. A lower RMSE indicates better model performance.
- Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual values. The lower, the better.
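A quick worked example of both metrics on toy numbers (the values are illustrative, not from the housing data):

```python
import numpy as np

# Toy actual vs. predicted values
y_true = np.array([2.0, 3.5, 4.0, 5.0])
y_pred = np.array([2.5, 3.0, 4.5, 4.0])

# RMSE squares errors before averaging, so the 1.0 miss dominates
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
# MAE weights every error equally
mae = np.mean(np.abs(y_true - y_pred))
print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}")  # RMSE: 0.661, MAE: 0.625
```

Note that RMSE exceeds MAE here because of the single larger error; the gap between the two is a rough signal of how outlier-heavy the residuals are.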
📌 Rule of Thumb: Hyperparameter tuning is required for higher-accuracy models, especially for tree-based methods like XGBoost.
To optimize model performance, we tuned hyperparameters such as:
✅ Max depth — Prevents overfitting by controlling tree size
✅ Learning rate — Adjusts how much models learn per iteration
✅ Number of estimators — Controls the number of boosting rounds
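A grid search over those three hyperparameters can be sketched as follows. This uses synthetic data and scikit-learn's GradientBoostingRegressor as a stand-in (the same parameter names — max_depth, learning_rate, n_estimators — apply to xgboost.XGBRegressor); the grid values are illustrative, not the ones tuned in the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the housing features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Small illustrative grid over the three hyperparameters discussed above
param_grid = {
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50, 100],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

Cross-validated search like this is what keeps the tuned model honest: each candidate is scored on held-out folds rather than on the data it was fit to.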
- Most Important Feature: Location Matters!
Using feature importance analysis, we found that the most influential factor was proximity to inland areas (INLAND).
# Select the model with the highest score to identify which factor
# affects the house price the most
feature_importances = best_xgb_model.feature_importances_
feature_names = train_inputs.columns
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
print(importance_df)
# Calculate the correlation between the INLAND indicator and house value
correlation = data['INLAND'].corr(data['median_house_value'])
# Print the correlation
print(f"Correlation between INLAND and median_house_value: {correlation:.2f}")
📌 Insight: Houses located inland tend to have lower prices. The correlation between INLAND and median_house_value was -0.48, confirming an inverse relationship.
2. Which ML Model Predicts Best?
XGBoost outperformed all other models, achieving the highest R² (0.88) and lowest RMSE (0.46).
📊 Model Performance Comparison:
💡 Why XGBoost?
✅ Handles non-linearity better than traditional regression models
✅ Uses boosting to correct errors from previous models
✅ Reduces overfitting compared to a single decision tree
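The kind of head-to-head comparison described above can be sketched on synthetic non-linear data. This uses scikit-learn models, with GradientBoostingRegressor as a stand-in for XGBoost; the data and scores are illustrative and not the project's results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic target with a non-linear term (x1 squared) that a
# linear model cannot capture but tree ensembles can
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(400, 3))
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = r2_score(y_te, pred)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(f"{name}: R2={results[name]:.3f}, RMSE={rmse:.3f}")
```

On data like this, the boosted model beats the linear baseline precisely because of the non-linear term, which mirrors why XGBoost led on the housing data.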
✔ Location Drives Price Variability — Inland properties are significantly cheaper than coastal ones
✔ Income Level Comes Second as a Price Predictor — Higher median incomes lead to higher house prices
✔ XGBoost is the Best Model — It achieved the highest accuracy in price predictions