Pandas is a darling of the data science world: powerful, flexible, and great for small to medium-sized datasets.
However, once your data reaches a few gigabytes in size, Pandas is no longer so great. Since Pandas reads all the data into memory, larger datasets can slow down or even crash your system.
But don't worry. There are some clever ways you can handle millions of records efficiently without incinerating your RAM.
In this article, we'll discuss 6 beginner-friendly techniques to make Pandas memory-efficient and fast to use, even when your data isn't.
This is the final article in our data preprocessing series.
Before reading this article, I recommend checking out the previous ones to get a better understanding of data preprocessing!

Data Preprocessing – Data Science
By default, Pandas uses data types like float64 or int64, which is kind of like taking a full-sized SUV just to go grocery shopping. If your data doesn't need that level of precision or range, you can downsize to smaller types like float32 or float16, or int8, int16, and int32.
💡 Why bother?
Because smaller types = less memory usage = smoother performance.
This is especially helpful when you're dealing with millions of rows.
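A quick way to see why this matters is to check how many bytes each type needs per value. Here's a small illustrative snippet (not from the original article):

import numpy as np

# Bytes consumed per value by each dtype
for t in ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]:
    print(f"{t}: {np.dtype(t).itemsize} bytes per value")

Over a million rows, dropping a column from float64 to float32 alone halves its memory footprint.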
🛠 How to do it:
- While reading data using pd.read_csv() or pd.read_sql(), specify the dtype parameter, as shown in the sketch after this list.
- For already-loaded data, use .astype() to convert columns to more efficient types.
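As a quick illustration, here's a hedged sketch of passing dtype at load time; the file name and column names are placeholders, not from the original:

import pandas as pd

# Hypothetical CSV and columns, shown only to illustrate the dtype parameter
df = pd.read_csv(
    "sales.csv",
    dtype={"store_id": "int32", "units": "int16", "price": "float32"},
)
print(df.dtypes)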
import pandas as pd

# Define the size of the dataset
num_rows = 1000000  # 1 million rows

# Example DataFrame with inefficient datatypes
data = {'A': [1, 2, 3, 4],
        'B': [5.0, 6.0, 7.0, 8.0]}
df = pd.DataFrame(data)

# Replicate the DataFrame to create a larger dataset
df_large = …
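The original snippet is truncated at this point. A plausible completion, assuming the intent was to tile the small frame up to num_rows rows and then downcast with .astype(), might look like this (everything past df_large is my own sketch, not the author's code):

import pandas as pd

num_rows = 1000000  # 1 million rows
data = {'A': [1, 2, 3, 4],
        'B': [5.0, 6.0, 7.0, 8.0]}
df = pd.DataFrame(data)

# Assumed completion: repeat the 4-row frame until it has num_rows rows
df_large = pd.concat([df] * (num_rows // len(df)), ignore_index=True)
print(f"Before: {df_large.memory_usage(deep=True).sum() / 1024**2:.1f} MB")  # int64 + float64

# Downcast to smaller types that still fit the data
df_small = df_large.astype({'A': 'int8', 'B': 'float32'})
print(f"After:  {df_small.memory_usage(deep=True).sum() / 1024**2:.1f} MB")  # int8 + float32

On this synthetic frame the footprint drops from roughly 16 MB to about 5 MB, since int8 and float32 use 1 and 4 bytes per value instead of 8.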