
    Anatomy of a Parquet File

    By FinanceStarGate | March 14, 2025 | 11 min read


    In recent years, Parquet has become a standard format for data storage in Big Data ecosystems. Its column-oriented format offers several advantages:

    • Faster query execution when only a subset of columns is being processed
    • Quick calculation of statistics across all data
    • Reduced storage volume thanks to efficient compression

    When combined with storage frameworks like Delta Lake or Apache Iceberg, it seamlessly integrates with query engines (e.g., Trino) and data warehouse compute clusters (e.g., Snowflake, BigQuery). In this article, the content of a Parquet file is dissected using mainly standard Python tools to better understand its structure and how it contributes to such performance.

    Writing Parquet file(s)

    To produce Parquet files, we use PyArrow, a Python binding for Apache Arrow that stores dataframes in memory in columnar format. PyArrow allows fine-grained parameter tuning when writing the file. This makes PyArrow ideal for Parquet manipulation (one can also simply use Pandas).

    # generator.py
    
    import pyarrow as pa
    import pyarrow.parquet as pq
    from faker import Faker
    
    fake = Faker()
    Faker.seed(12345)
    num_records = 100
    
    # Generate fake data
    names = [fake.name() for _ in range(num_records)]
    addresses = [fake.address().replace("\n", ", ") for _ in range(num_records)]
    birth_dates = [
        fake.date_of_birth(minimum_age=67, maximum_age=75) for _ in range(num_records)
    ]
    cities = [addr.split(", ")[1] for addr in addresses]
    birth_years = [date.year for date in birth_dates]
    
    # Cast the data to the Arrow format
    name_array = pa.array(names, type=pa.string())
    address_array = pa.array(addresses, type=pa.string())
    birth_date_array = pa.array(birth_dates, type=pa.date32())
    city_array = pa.array(cities, type=pa.string())
    birth_year_array = pa.array(birth_years, type=pa.int32())
    
    # Create schema with non-nullable fields
    schema = pa.schema(
        [
            pa.field("name", pa.string(), nullable=False),
            pa.field("address", pa.string(), nullable=False),
            pa.field("date_of_birth", pa.date32(), nullable=False),
            pa.field("city", pa.string(), nullable=False),
            pa.field("birth_year", pa.int32(), nullable=False),
        ]
    )
    
    table = pa.Table.from_arrays(
        [name_array, address_array, birth_date_array, city_array, birth_year_array],
        schema=schema,
    )
    
    print(table)
    pyarrow.Table
    name: string not null
    address: string not null
    date_of_birth: date32[day] not null
    city: string not null
    birth_year: int32 not null
    ----
    name: [["Adam Bryan","Jacob Lee","Candice Martinez","Justin Thompson","Heather Rubio"]]
    address: [["822 Jennifer Field Suite 507, Anthonyhaven, UT 98088","292 Garcia Mall, Lake Belindafurt, IN 69129","31738 Jonathan Mews Apt. 024, East Tammiestad, ND 45323","00716 Kristina Trail Suite 381, Howelltown, SC 64961","351 Christopher Expressway Suite 332, West Edward, CO 68607"]]
    date_of_birth: [[1955-06-03,1950-06-24,1955-01-29,1957-02-18,1956-09-04]]
    city: [["Anthonyhaven","Lake Belindafurt","East Tammiestad","Howelltown","West Edward"]]
    birth_year: [[1955,1950,1955,1957,1956]]

    The output clearly reflects column-oriented storage, unlike Pandas, which usually displays a traditional “row-wise” table.
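    To make the difference concrete, here is a minimal pure-Python sketch (no PyArrow involved) that pivots a few row-wise records into a column-wise layout. The `rows` sample and the `to_columns` helper are hypothetical names introduced for illustration only; Arrow's actual in-memory layout is far more compact than Python lists.

```python
# Hypothetical sample records, laid out row by row
rows = [
    {"name": "Adam Bryan", "city": "Anthonyhaven", "birth_year": 1955},
    {"name": "Jacob Lee", "city": "Lake Belindafurt", "birth_year": 1950},
    {"name": "Candice Martinez", "city": "East Tammiestad", "birth_year": 1955},
]


def to_columns(records):
    """Pivot a list of row dictionaries into one list of values per column."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# A query that only needs birth years touches a single contiguous list
birth_years = columns["birth_year"]
print(birth_years)
print(min(birth_years), max(birth_years))
```

    With this layout, computing a statistic on one column never requires reading the names or cities.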

    How is a Parquet file stored?

    Parquet files are generally stored in cheap object storage services like S3 (AWS) or GCS (GCP) to be easily accessible by data processing pipelines. These files are usually organized with a partitioning strategy by leveraging directory structures:

    # generator.py
    
    num_records = 100
    
    # ...
    
    # Writing the Parquet files to disk
    pq.write_to_dataset(
        table,
        root_path='dataset',
        partition_cols=['birth_year', 'city']
    )

    If the birth_year and city columns are defined as partitioning keys, PyArrow creates the following tree structure in the dataset directory:

    dataset/
    ├─ birth_year=1949/
    ├─ birth_year=1950/
    │ ├─ city=Aaronbury/
    │ │ ├─ 828d313a915a43559f3111ee8d8e6c1a-0.parquet
    │ │ ├─ 828d313a915a43559f3111ee8d8e6c1a-0.parquet
    │ │ ├─ …
    │ ├─ city=Alicialand/
    │ ├─ …
    ├─ birth_year=1951/
    ├─ ...
    

    This strategy enables partition pruning: when a query filters on these columns, the engine can use folder names to read only the necessary files. This is why the partitioning strategy is crucial for limiting latency, I/O, and compute resources when handling large volumes of data (as has been the case for decades with traditional relational databases).

    The pruning effect can be easily verified by counting the files opened by a Python script that filters on the birth year:

    # query.py
    import duckdb
    
    duckdb.sql(
        """
        SELECT * 
        FROM read_parquet('dataset/*/*/*.parquet', hive_partitioning = true)
        WHERE birth_year = 1949
        """
    ).show()

    > strace -e trace=open,openat,read -f python query.py 2>&1 | grep "dataset/.*.parquet"
    
    [pid    37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
    [pid    37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=New%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=New%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
    [pid    39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3

    Only files under birth_year=1949 are opened: the 23 openat calls above cover 11 distinct files, out of 100 in the dataset.
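    The pruning logic itself can be sketched in a few lines of standard Python. This is only an illustration of the idea, not how DuckDB implements it: the `partition_values` and `prune` helpers and the shortened file names are hypothetical. Note that partition values appear URL-encoded in folder names (`%20` for a space), hence the call to `unquote`.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote


def partition_values(path):
    """Extract the key=value pairs encoded in the directory names of a path."""
    values = {}
    for part in PurePosixPath(path).parts[:-1]:  # skip the file name itself
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = unquote(value)
    return values


def prune(paths, **filters):
    """Keep only the files whose partition values match every filter."""
    return [
        path
        for path in paths
        if all(partition_values(path).get(k) == str(v) for k, v in filters.items())
    ]


# Hypothetical file listing (file names shortened for readability)
paths = [
    "dataset/birth_year=1949/city=Clarkemouth/a-0.parquet",
    "dataset/birth_year=1949/city=East%20Morgan/b-0.parquet",
    "dataset/birth_year=1950/city=Aaronbury/c-0.parquet",
]

selected = prune(paths, birth_year=1949)
print(selected)
print(partition_values(paths[1]))
```

    The decision is taken from the paths alone: no file needs to be opened to discard the 1950 partition.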

    Reading a raw Parquet file

    Let’s decode a raw Parquet file without specialized libraries. For simplicity, the dataset is dumped into a single file without compression or encoding.

    # generator.py
    
    # ...
    
    pq.write_table(
        table,
        "dataset.parquet",
        use_dictionary=False,
        compression="NONE",
        write_statistics=True,
        column_encoding=None,
    )

    The first thing to know is that the binary file is framed by 4 bytes whose ASCII representation is “PAR1”. The file is corrupted if this is not the case.

    # reader.py
    
    with open("dataset.parquet", "rb") as file:
        parquet_data = file.read()
    
    assert parquet_data[:4] == b"PAR1", "Not a valid Parquet file"
    assert parquet_data[-4:] == b"PAR1", "File footer is corrupted"

    As indicated in the documentation, the file is divided into two parts: the “row groups” containing the actual data, and the footer containing metadata (schema below).

    The footer

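    The size of the footer is indicated in the 4 bytes preceding the end marker as an unsigned integer written in “little endian” format (noted "<I" in the unpack function).

    # reader.py
    
    import struct
    
    # ...
    
    # The footer length is stored in the 4 bytes just before the trailing "PAR1"
    footer_length = struct.unpack("<I", parquet_data[-8:-4])[0]
    # The footer itself sits right before those 4 bytes
    footer_data = parquet_data[-(footer_length + 8) : -8]
    
    print(f"Footer size in bytes: {footer_length}")
    Footer size in bytes: 1088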
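    The framing described so far (magic bytes, body, footer, footer length, magic bytes) can be reproduced on a toy buffer to check the offset arithmetic without needing a real file. This is a sketch under simplifying assumptions: the `frame` and `split_footer` helpers are hypothetical, and the payloads are placeholder bytes rather than real row groups or Thrift metadata.

```python
import struct

MAGIC = b"PAR1"


def frame(body, footer):
    """Assemble a toy file: magic, body, footer, footer length, magic."""
    return MAGIC + body + footer + struct.pack("<I", len(footer)) + MAGIC


def split_footer(raw):
    """Return (body, footer) by reading the layout from the end of the buffer."""
    if raw[:4] != MAGIC or raw[-4:] != MAGIC:
        raise ValueError("missing PAR1 marker")
    # Footer length: unsigned 32-bit little-endian integer before the end marker
    (footer_length,) = struct.unpack("<I", raw[-8:-4])
    footer_start = len(raw) - 8 - footer_length
    return raw[4:footer_start], raw[footer_start:-8]


# Placeholder payloads: a real footer would hold Thrift-encoded metadata
raw = frame(b"row group bytes", b"metadata bytes")
body, footer = split_footer(raw)
print(body, footer)
```

    Reading from the end is what lets a reader fetch the metadata first and then decide which parts of the body are worth loading.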

    The footer information is encoded in a cross-language serialization format called Apache Thrift. Using a human-readable but verbose format like JSON and then translating it into binary would be less efficient in terms of memory usage. With Thrift, one can declare data structures as follows:

    struct Customer {
    	1: required string name,
    	2: optional i16 birthYear,
    	3: optional list<string> interests
    }

    On the basis of this declaration, Thrift can generate Python code to decode byte strings with such a data structure (it also generates code to perform the encoding part). The Thrift file containing all the data structures implemented in a Parquet file (parquet.thrift) can be downloaded from the Apache parquet-format repository. After having installed the thrift binary, let’s run:

    thrift -r --gen py parquet.thrift

    The generated Python code is located in the “gen-py” folder. The footer’s data structure is represented by the FileMetaData class, a Python class automatically generated from the Thrift schema. Using Thrift’s Python utilities, binary data is parsed and populated into an instance of this FileMetaData class.

    # reader.py
    
    import sys
    
    # ...
    
    # Add the generated classes to the Python path
    sys.path.append("gen-py")
    from parquet.ttypes import FileMetaData, PageHeader
    from thrift.transport import TTransport
    from thrift.protocol import TCompactProtocol
    
    def read_thrift(data, thrift_instance):
        """
        Read a Thrift object from a binary buffer.
        Returns the Thrift object and the number of bytes read.
        """
        transport = TTransport.TMemoryBuffer(data)
        protocol = TCompactProtocol.TCompactProtocol(transport)
        thrift_instance.read(protocol)
        return thrift_instance, transport._buffer.tell()
    
    # The number of bytes read is not used for now
    file_metadata_thrift, _ = read_thrift(footer_data, FileMetaData())
    
    print(f"Number of rows in the whole file: {file_metadata_thrift.num_rows}")
    print(f"Number of row groups: {len(file_metadata_thrift.row_groups)}")
    
    Number of rows in the whole file: 100
    Number of row groups: 1
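    Part of what makes this binary format compact is how integers are written: Thrift's compact protocol stores them as zigzag-encoded variable-length integers, so small values take one or two bytes. The sketch below covers only that integer encoding, with hypothetical helper names; field headers, strings and nested structures are deliberately left out.

```python
def zigzag_encode(n):
    """Map a signed 64-bit integer to an unsigned one (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    return (n << 1) ^ (n >> 63)


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def write_varint(value):
    """Encode an unsigned integer 7 bits at a time, least significant group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(data, offset=0):
    """Decode an unsigned varint; return (value, next offset)."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7


encoded = write_varint(zigzag_encode(100))  # e.g. a row count of 100
decoded, _ = read_varint(encoded)
print(encoded.hex(), zigzag_decode(decoded))
```

    A row count of 100 fits in two bytes here, where a fixed-width 64-bit integer would always take eight.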

    The footer contains extensive information about the file’s structure and content. For instance, it accurately tracks the number of rows in the generated dataframe. These rows are all contained within a single “row group.” But what is a “row group?”

    Row groups

    Unlike purely column-oriented formats, Parquet employs a hybrid approach. Before writing column blocks, the dataframe is first partitioned horizontally (by rows) into row groups (the Parquet file we generated is too small to be split into multiple row groups).

    This hybrid structure offers several advantages:

    • Parquet calculates statistics (such as min/max values) for each column within each row group. These statistics are crucial for query optimization, allowing query engines to skip entire row groups that don’t match filtering criteria. For example, if a query filters for birth_year > 1955 and a row group’s maximum birth year is 1954, the engine can efficiently skip that entire data section. This optimisation is called “predicate pushdown”. Parquet also stores other useful statistics like distinct value counts and null counts.

    # reader.py
    # ...
    
    first_row_group = file_metadata_thrift.row_groups[0]
    birth_year_column = first_row_group.columns[4]
    
    min_stat_bytes = birth_year_column.meta_data.statistics.min
    max_stat_bytes = birth_year_column.meta_data.statistics.max
    
    # int32 statistics are stored as 4-byte little-endian signed integers
    min_year = struct.unpack("<i", min_stat_bytes)[0]
    max_year = struct.unpack("<i", max_stat_bytes)[0]
    print(f"The birth year range is between {min_year} and {max_year}")
    The birth year range is between 1949 and 1958

    • Row groups enable parallel processing of data (particularly valuable for frameworks like Apache Spark). The size of these row groups can be configured based on the computing resources available (using the row_group_size parameter of the write_table function when using PyArrow).

    # generator.py
    
    # ...
    
    pq.write_table(
        table,
        "dataset.parquet",
        row_group_size=100,
    )
    
    # /!\ Keep the default value of "row_group_size" for the next parts

    • Even if this is not the primary objective of a column format, Parquet’s hybrid structure maintains reasonable performance when reconstructing full rows. Without row groups, rebuilding an entire row might require scanning the entirety of each column, which would be extremely inefficient for large files.
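    The way min/max statistics let an engine skip row groups can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: the `make_row_group` and `scan_greater_than` helpers and the sample years are hypothetical, and a real engine reads the statistics from the footer rather than from Python dictionaries.

```python
def make_row_group(values):
    """Store the values together with statistics computed once, at write time."""
    return {"values": values, "min": min(values), "max": max(values)}


def scan_greater_than(row_groups, threshold):
    """Return matching values and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        if group["max"] <= threshold:
            continue  # no value can satisfy the filter: skip without reading
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read


# Hypothetical birth_year column split into three row groups
row_groups = [
    make_row_group([1949, 1951, 1954]),
    make_row_group([1950, 1956, 1958]),
    make_row_group([1952, 1953, 1955]),
]

matches, groups_read = scan_greater_than(row_groups, 1955)
print(matches, groups_read)
```

    Only the statistics are inspected for the skipped groups, which is why well-clustered data benefits most from this optimisation.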

    Data Pages

    The smallest substructure of a Parquet file is the page. It contains a sequence of values from the same column and, therefore, of the same type. The choice of page size is the result of a trade-off:

    • Larger pages mean less metadata to store and read, which is optimal for queries with minimal filtering.
    • Smaller pages reduce the amount of unnecessary data read, which is better when queries target small, scattered data ranges.

    Now let’s decode the contents of the first page of the address column. Its location can be found in the footer, given by the data_page_offset attribute of the right ColumnMetaData. Each page is preceded by a Thrift PageHeader object containing some metadata, so the offset actually points to the Thrift binary representation of that header rather than to the values themselves. The PageHeader class can also be found in the gen-py directory.

    💡 Between the PageHeader and the actual values contained within the page, there may be a few bytes dedicated to implementing the Dremel format, which allows encoding nested data structures. Since our data has a regular tabular format and the values are not nullable, these bytes are skipped when writing the file (https://parquet.apache.org/docs/file-format/data-pages/).

    # reader.py
    # ...
    
    address_column = first_row_group.columns[1]
    column_start = address_column.meta_data.data_page_offset
    column_end = column_start + address_column.meta_data.total_compressed_size
    column_content = parquet_data[column_start:column_end]
    
    page_thrift, page_header_size = read_thrift(column_content, PageHeader())
    page_content = column_content[
        page_header_size : (page_header_size + page_thrift.compressed_page_size)
    ]
    print(page_content[:100])
    b'6\x00\x00\x00481 Mata Squares Suite 260, Lake Rachelville, KY 874642\x00\x00\x00671 Barker Crossing Suite 390, Mooreto'

    The generated values finally appear, in plain text and not encoded (as specified when writing the Parquet file). However, to optimize the columnar format, it is recommended to use one of the following encoding algorithms: dictionary encoding, run length encoding (RLE), or delta encoding (the latter being reserved for int32 and int64 types), followed by compression using gzip or snappy (available codecs are listed in the Parquet documentation). Since encoded pages contain similar values (all addresses, all decimal numbers, etc.), compression ratios can be particularly advantageous.
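    As a rough illustration of why these encodings pay off on repetitive columns, here is a simplified pure-Python sketch of dictionary encoding followed by run length encoding. It is not Parquet's actual on-disk representation, which combines RLE with bit-packing; the helper names and the sample column are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value by its index in a dictionary of distinct values."""
    dictionary, indices, positions = [], [], {}
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into [value, run length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def run_length_decode(runs):
    """Expand [value, run length] pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical column with few distinct, often repeated values
cities = ["Aaronbury", "Aaronbury", "Aaronbury", "Alicialand", "Alicialand", "Aaronbury"]

dictionary, indices = dictionary_encode(cities)
runs = run_length_encode(indices)
print(dictionary, indices, runs)

# Decoding reverses both steps
restored = [dictionary[i] for i in run_length_decode(runs)]
print(restored == cities)
```

    Each distinct string is stored once, and the long runs of small integers that remain are exactly what a general-purpose compressor handles well.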

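    As documented in the specification, when character strings (BYTE_ARRAY) are not encoded, each value is preceded by its size represented as a 4-byte integer. This can be observed in the previous output: the first address is preceded by four bytes whose little-endian value is 54 (the character “6” has ASCII code 0x36 = 54), which is exactly the length of that address.

    To read all the values (for example, the first 10), the loop is rather simple:

    idx = 0
    for _ in range(10):
        # Each value starts with its length as a 4-byte little-endian unsigned integer
        str_size = struct.unpack("<I", page_content[idx : idx + 4])[0]
        print(page_content[idx + 4 : idx + 4 + str_size].decode())
        idx += 4 + str_size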
    481 Mata Squares Suite 260, Lake Rachelville, KY 87464
    671 Barker Crossing Suite 390, Mooretown, MI 21488
    62459 Jordan Knoll Apt. 970, Emilyfort, DC 80068
    948 Victor Sq. Apt. 753, Braybury, RI 67113
    365 Edward Place Apt. 162, Calebborough, AL 13037
    894 Reed Lock, New Davidmouth, NV 84612
    24082 Allison Squares Suite 345, North Sharonberg, WY 97642
    00266 Johnson Drives, South Lori, MI 98513
    15255 Kelly Plains, Richardmouth, GA 33438
    260 Thomas Glens, Port Gabriela, OH 96758
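    The same length-prefixed layout can be exercised in both directions without a Parquet file. The sketch below writes two of the addresses above in that layout and reads them back; the `encode_plain_byte_array` and `decode_plain_byte_array` helpers are hypothetical names, and the buffer holds only the values, with no page header.

```python
import struct


def encode_plain_byte_array(strings):
    """Prefix each UTF-8 encoded string with its length as a 4-byte little-endian integer."""
    out = bytearray()
    for text in strings:
        payload = text.encode("utf-8")
        out += struct.pack("<I", len(payload)) + payload
    return bytes(out)


def decode_plain_byte_array(buffer, count):
    """Read `count` length-prefixed strings from the start of the buffer."""
    values, idx = [], 0
    for _ in range(count):
        (size,) = struct.unpack_from("<I", buffer, idx)
        idx += 4
        values.append(buffer[idx : idx + size].decode("utf-8"))
        idx += size
    return values


addresses = [
    "481 Mata Squares Suite 260, Lake Rachelville, KY 87464",
    "671 Barker Crossing Suite 390, Mooretown, MI 21488",
]
page = encode_plain_byte_array(addresses)
print(page[:8])
print(decode_plain_byte_array(page, 2))
```

    Because each value carries its own length, a reader cannot jump to the n-th string directly: it has to walk the prefixes one by one, as the loop does.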

    And there we have it! We have successfully recreated, in a very simple way, how a specialized library would read a Parquet file. By understanding its building blocks, including headers, footers, row groups, and data pages, we can better appreciate how features like predicate pushdown and partition pruning deliver such impressive performance benefits in data-intensive environments. I’m convinced that knowing how Parquet works under the hood helps in making better decisions about storage strategies, compression choices, and performance optimization.

    All the code used in this article is available on my GitHub repository at https://github.com/kili-mandjaro/anatomy-parquet, where you can explore more examples and experiment with different Parquet file configurations.

    Whether you are building data pipelines, optimizing query performance, or simply curious about data storage formats, I hope this deep dive into Parquet’s inner structures has provided valuable insights for your Data Engineering journey.

    All images are by the author.


