You ran .show() on your first Spark DataFrame and saw the data. But what actually happened under the hood?
When we write something like:
df = spark.read.csv("path/to/data.csv", header=True)
df.show()
…many of us assume Spark “reads” the file and displays the data, just like Pandas.
But here’s the truth:
Nothing actually happens until .show() is executed.
Spark is a lazy execution engine. It builds up a plan behind the scenes and only acts when it truly has to.
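To see this laziness in action, here’s a minimal sketch (the file path, the app name, and the “country” column are placeholders I’ve assumed for illustration). The transformations return instantly because they only extend the plan; the final .show() is the action that actually triggers a Spark job:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# Assumed path and schema: a CSV with a header row and a "country" column.
df = spark.read.csv("path/to/data.csv", header=True)

# Transformations: each line returns immediately, no data is processed yet.
filtered = df.filter(F.col("country") == "US")
counts = filtered.groupBy("country").count()

# Action: only now does Spark read the file and run the computation.
counts.show()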
Let’s walk through the behind-the-scenes steps Spark performs:
When you write df = spark.read.csv(...), Spark creates a logical plan (you can even print it yourself, as shown below):
- It knows the source (CSV)
- It records the schema
- It tracks any transformations you apply later (filter, groupBy, and so on)
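You can inspect that plan before anything runs. Continuing from the sketch above, calling .explain(extended=True) prints the parsed, analyzed, and optimized logical plans along with the physical plan, all without reading a single row of data:

# Print the logical and physical plans Spark has built so far.
# No job is triggered; this only walks the plan tree.
counts.explain(extended=True)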