Week 2: From Text to Tensors – LLM Input Pipeline Engineering | by Luke Jang | May, 2025

    Positional Embeddings

Transformers don’t inherently know token order. To deal with this, I added positional embeddings, which assign each position in a sequence its own vector. The textbook explains:

“We can now add these directly to the token embeddings… resulting in input embeddings that can now be processed by the main LLM modules” (Raschka 47).

Implementing this helped me appreciate how structure is added to what would otherwise be a bag-of-words representation.
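Before the full script in positional_embedding.py below, here is a minimal sketch of that sum, using toy sizes (a vocabulary of 10, a context of 4 positions, 8-dimensional embeddings) rather than the real GPT-2 numbers, just to show the shapes involved:

import torch

# Toy sizes for illustration only; the real pipeline below uses vocab_size=50257, output_dim=256
vocab_size, context_len, emb_dim = 10, 4, 8

tok_emb = torch.nn.Embedding(vocab_size, emb_dim)    # one vector per token ID
pos_emb = torch.nn.Embedding(context_len, emb_dim)   # one vector per position 0..3

batch = torch.tensor([[1, 5, 3, 7], [2, 2, 9, 0]])   # [batch=2, context=4] token IDs

# [2, 4, 8] token vectors + [4, 8] positional vectors (broadcast across the batch)
x = tok_emb(batch) + pos_emb(torch.arange(context_len))
print(x.shape)   # torch.Size([2, 4, 8])

The [4, 8] positional matrix is broadcast across the batch dimension, so every sequence in the batch receives the same position vectors.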

GitHub Directory:

    • tokenizer_v1.py
  Regex-based tokenizer with .encode() and .decode(). Works only for known vocabulary.
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the listed punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

    • tokenizer_v2.py
  Adds handling for unknown words outside the vocabulary list, plus the <|endoftext|> document boundary token.
import re

class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the listed punctuation marks
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

import tiktoken

# Instantiate the byte pair encoding (BPE) tokenizer used in GPT-2.
# This tokenizer breaks text into subword units and assigns each a token ID.
# The 'gpt2' encoder includes a predefined vocabulary of 50,257 tokens.
tokenizer = tiktoken.get_encoding("gpt2")

# Sample text to be tokenized. The <|endoftext|> is a special token used by GPT models
# to indicate the end of a document or to separate different text segments.
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    "of someunknownPlace."
)

# Encode the text into a list of token IDs using the BPE tokenizer.
# 'allowed_special' ensures that special tokens like <|endoftext|> are preserved as-is.
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

# Print the resulting list of token IDs.
# Each ID corresponds to a subword or character from the input text.
print(integers)

# Decode the list of token IDs back into a human-readable string.
# This step verifies that encoding and decoding are consistent.
strings = tokenizer.decode(integers)

# Print the reconstructed string, which should match the original input text
# (apart from the formatting of special tokens and the handling of unknown words via subword splits).
print(strings)

    • tokens_to_token_id.py
  Constructs a sorted vocabulary and maps it to integers, forming the basis for ID translation.
import urllib.request

# Get the file from the textbook repository
url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

import re

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)  # tokenizing the raw text
preprocessed = [item.strip() for item in preprocessed if item.strip()]

### Now converting tokens to token IDs
### This creates the set of vocabulary for the LLM to use.

all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

vocab = {token: integer for integer, token in enumerate(all_words)}

    • data_sampling.py
      Demonstrates windowed next-token sampling with context shifting.
### Sliding window approach to sampling datasets for training a GPT-style LLM.

from data_preparation_and_sampling.byte_pair_encoding import tokenizer

# Load the full short story "The Verdict" as raw text
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Encode the full text into token IDs using the BPE tokenizer
enc_text = tokenizer.encode(raw_text)

# Discard the first 50 tokens for a more interesting sample context
# (e.g., to skip introductions and focus on narrative-rich parts)
enc_sample = enc_text[50:]

# Define the context size (i.e., how many tokens the LLM can "see")
context_size = 4

# Extract an input sequence of size 4
x = enc_sample[:context_size]

# Extract the corresponding target sequence by shifting x by one token
# The model will try to predict y[i] from x[i]
y = enc_sample[1:context_size+1]

# Display the raw token IDs for both input and target
print(f"x: {x}")
print(f"y: {y}")

# Visualize input-target token alignment using a sliding window
# This mimics how LLMs learn next-token prediction
for i in range(1, context_size+1):
    context = enc_sample[:i]   # Input context up to i tokens
    desired = enc_sample[i]    # The next token to predict

    # Show the actual token IDs and the corresponding decoded text
    print(context, "---->", desired)
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

    • dataset.py
  Implements GPTDatasetV1 and create_dataloader_v1. Enables batching, shuffling, and overlap control.
import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader

# Custom PyTorch Dataset for producing input-target token ID pairs for LLM training
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []   # Holds all input sequences
        self.target_ids = []  # Holds corresponding target sequences (shifted by one)

        # Encode the raw text into token IDs using the GPT-2 tokenizer
        # Note: <|endoftext|> is treated as a special token
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Ensure that the text has enough tokens to generate at least one full sequence
        assert len(token_ids) > max_length, "Number of tokenized inputs must at least be equal to max_length+1"

        # Use a sliding window to generate overlapping sequences
        # Each window creates one input-target pair
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]            # Input sequence
            target_chunk = token_ids[i + 1: i + max_length + 1]  # Target sequence (shifted right)
            self.input_ids.append(torch.tensor(input_chunk))     # Convert list to PyTorch tensor
            self.target_ids.append(torch.tensor(target_chunk))   # Each will have shape [max_length]

    # Return the total number of samples in the dataset
    def __len__(self):
        return len(self.input_ids)

    # Return a single input-target pair by index
    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

# Factory function to create a PyTorch DataLoader from raw text
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the byte pair encoding tokenizer used in GPT-2 and GPT-3
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create a dataset with overlapping input-target token pairs
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Wrap the dataset in a DataLoader to enable batching and parallel loading
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,    # Number of input-target pairs per batch
        shuffle=shuffle,          # Randomize the order of samples (important for training)
        drop_last=drop_last,      # Drop the last batch if it has fewer samples than batch_size
        num_workers=num_workers   # Parallelism for data loading
    )

    return dataloader

### Test section

# Load raw text from file
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Test the DataLoader with batch_size=1, max_length=4, stride=1
# This demonstrates token-by-token sliding (maximum overlap)
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)   # First input-target pair
print(first_batch)

second_batch = next(data_iter)  # Second pair, shifted by 1
print(second_batch)

# Test the DataLoader with batch_size=8, max_length=4, stride=4
# This creates non-overlapping sequences
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)      # Shape: [8, 4] (8 sequences, 4 tokens each)
print("\nTargets:\n", targets)  # Each row is the next-token sequence for the corresponding input row

    • positional_embedding.py
  Builds token and positional embeddings and combines them into input tensors for the transformer.
import torch
from data_preparation_and_sampling.dataset import create_dataloader_v1

# Define the vocabulary size (from the GPT tokenizer) and the desired embedding dimension
vocab_size = 50257   # Token count from GPT-2's BPE tokenizer
output_dim = 256     # Dimensionality of embedding vectors

# Create the token embedding layer (learns token ID -> vector mappings)
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# Load training text ("The Verdict") from file
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Define the model's context size (i.e., number of tokens per training sample)
max_length = 4

# Create a DataLoader that returns tokenized training sequences in batches
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
    stride=max_length, shuffle=False
)

# Fetch the first batch of input and target sequences
data_iter = iter(dataloader)
inputs, targets = next(data_iter)   # Each has shape [8, 4]

# Print the raw token IDs for visualization
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

# Convert token IDs into dense vector representations
# Output shape: [batch_size, context_length, embedding_dim] → [8, 4, 256]
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

# Define the positional embedding layer:
# this assigns a unique vector to each position (0 through max_length - 1)
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# Generate the position indices: tensor([0, 1, 2, 3])
# Each index is mapped to its corresponding positional embedding
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)   # Shape: [4, 256] (one for each position)

# Add token and positional embeddings:
# PyTorch broadcasts the [4, 256] positional embeddings across the batch dimension (8)
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)   # Final shape: [8, 4, 256]

# Now, input_embeddings can be passed into the transformer model's attention blocks

• Handling unknowns: Regex tokenization fails on rare words. BPE resolves this with graceful degradation (see the sketch after this list).
• Vocab synchronization: It took care to keep token IDs, vocabulary, and decoding in sync.
• Tensor broadcasting: Adding positional vectors across batch dimensions required shape alignment.
• Sampling mechanics: The interaction between stride, sequence length, and overlap was tricky at first.
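
To make the first point concrete, here is a small sketch using a hypothetical toy vocabulary (the words and IDs are illustrative, not taken from the project files). It contrasts the strict lookup in SimpleTokenizerV1, the <|unk|> fallback in SimpleTokenizerV2, and BPE's subword fallback:

import re
import tiktoken

# Hypothetical toy vocabulary for illustration
vocab = {",": 0, "Hello": 1, "do": 2, "like": 3, "tea": 4, "you": 5, "?": 6, "<|unk|>": 7}
text = "Hello, do you like matcha?"   # "matcha" is not in the toy vocab

tokens = [t.strip() for t in re.split(r'([,.:;?_!"()\']|--|\s)', text) if t.strip()]

# V1-style lookup raises KeyError on the unknown word...
try:
    ids_v1 = [vocab[t] for t in tokens]
except KeyError as e:
    print("SimpleTokenizerV1 fails on:", e)

# ...V2 maps it to <|unk|> (the word itself is lost)...
ids_v2 = [vocab.get(t, vocab["<|unk|>"]) for t in tokens]
print("V2 IDs:", ids_v2)

# ...while BPE degrades gracefully by splitting the unknown word into known subwords
bpe = tiktoken.get_encoding("gpt2")
ids_bpe = bpe.encode("matcha")
print("BPE subword IDs:", ids_bpe, "->", [bpe.decode([i]) for i in ids_bpe])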

In Week 3, I’ll move into the second and third blocks of Stage 1: implementing self-attention and building the Transformer decoder. This marks the transition from input engineering to model logic.


