    Introduction to Sequence Modeling with Transformers | by Joni Kamarainen | Feb, 2025



The Seq2SeqTransformer can learn all of the above sequences without trouble. The final step is to teach all eight sequence-to-sequence translations to a single transformer model:

    • 0,0,0,0 → 0,0,0,0
    • 1,1,1,1 → 1,1,1,1
    • 1,1,1 → 0
    • 0,0,0 → 1
    • 0 → 1,1,1
    • 1 → 0,0,0
    • 0,1,0,1 → 0,1,0,1
    • 1,0,1,0 → 1,0,1,0

The one requirement the current model cannot handle is that the sequences have different lengths. There are two options: each sequence can be trained separately, which is inefficient, or dummy PAD tokens are appended to the sequences that are shorter than the maximum length. If all sequences are roughly the same length, the latter is the more efficient solution.
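As a side note, PyTorch also ships a helper for exactly this kind of padding. A minimal, self-contained sketch using torch.nn.utils.rnn.pad_sequence (purely illustrative; the code further below pads manually instead):

import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 4  # assumed PAD token id, matching the value used later in this post
seqs = [torch.tensor([2, 0, 3]), torch.tensor([2, 1, 1, 1, 3])]  # two sequences of different lengths
padded = pad_sequence(seqs, batch_first=False, padding_value=PAD_IDX)
print(padded.shape)  # torch.Size([5, 2]): (max_len, batch); the shorter column ends with PAD tokens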

Padding masks

An additional complication with padding is that, similarly to the masking of future tokens, the PAD tokens must be masked during training. There are three torch.nn.Transformer.forward() parameters through which the mask tensors must be provided:

    • Input masking (src_key_padding_mask)
    • Output (target) masking (tgt_key_padding_mask)
    • Decoder memory masking (memory_key_padding_mask)

In most cases the decoder memory mask is the same as the input mask, i.e. it prevents the decoder from seeing PAD tokens in its 'memory'.
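As a concrete illustration of the three parameters, here is a minimal sketch with toy shapes and random inputs (assumed sizes, not the model built in this post) that passes the masks to torch.nn.Transformer.forward():

import torch
import torch.nn as nn

PAD_IDX = 4
d_model, batch, src_len, tgt_len = 8, 2, 6, 5
transformer = nn.Transformer(d_model=d_model, nhead=1, num_encoder_layers=1,
                             num_decoder_layers=1, dim_feedforward=8)

src = torch.rand(src_len, batch, d_model)  # (seq_len, batch, feat_dim)
tgt = torch.rand(tgt_len, batch, d_model)

# Pretend the last position of the first batch element is padding
src_tokens = torch.zeros(src_len, batch, dtype=torch.long); src_tokens[-1, 0] = PAD_IDX
tgt_tokens = torch.zeros(tgt_len, batch, dtype=torch.long); tgt_tokens[-1, 0] = PAD_IDX
src_key_padding_mask = (src_tokens == PAD_IDX).transpose(0, 1)  # (batch, src_len), True marks PAD
tgt_key_padding_mask = (tgt_tokens == PAD_IDX).transpose(0, 1)  # (batch, tgt_len)

out = transformer(src, tgt,
                  src_key_padding_mask=src_key_padding_mask,
                  tgt_key_padding_mask=tgt_key_padding_mask,
                  memory_key_padding_mask=src_key_padding_mask)  # memory mask = input mask
print(out.shape)  # torch.Size([5, 2, 8])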

The padding also affects the embedding, since an extra token needs to be added:

# Token embedding layer - this takes care of converting integers to vectors
self.embedding = nn.Embedding(num_tokens+1, d_model, padding_idx=self.padding_idx)
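Setting padding_idx tells the embedding layer to keep the PAD row of its weight table at zero and to leave it out of gradient updates. A quick, self-contained check (illustrative values only):

import torch.nn as nn

emb = nn.Embedding(num_embeddings=5, embedding_dim=8, padding_idx=4)
print(emb.weight[4])  # all zeros: the PAD token (index 4) embeds to the zero vector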

Another consideration is the loss function, since it should ignore gradients with respect to the PAD tokens:

    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
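A small, self-contained sanity check of what ignore_index does (toy logits, illustrative only): positions whose target equals PAD_IDX contribute neither to the loss nor to the gradients.

import torch

PAD_IDX = 4
logits = torch.randn(3, 5, requires_grad=True)  # 3 positions, 5 token classes
target = torch.tensor([1, 0, PAD_IDX])          # the last position is padding
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
loss = loss_fn(logits, target)
loss.backward()
print(logits.grad[2])  # all zeros: the padded position is ignored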

Let's put it all together.

Final run

Generate the data:

import random
import numpy as np

def generate_data5(n):
    SOS_token = np.array([2])
    EOS_token = np.array([3])

    data = []
    seq_len = []

    # 0,0,0,0 -> 0,0,0,0
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [0, 0, 0, 0], EOS_token))
        y = np.concatenate((SOS_token, [0, 0, 0, 0], EOS_token))
        data.append([X, y])
        seq_len.append([4+2, 4+2])

    # 1,1,1,1 -> 1,1,1,1
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [1, 1, 1, 1], EOS_token))
        y = np.concatenate((SOS_token, [1, 1, 1, 1], EOS_token))
        data.append([X, y])
        seq_len.append([4+2, 4+2])

    # 0,0,0 -> 1
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [0, 0, 0], EOS_token))
        y = np.concatenate((SOS_token, [1], EOS_token))
        data.append([X, y])
        seq_len.append([3+2, 1+2])

    # 1,1,1 -> 0
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [1, 1, 1], EOS_token))
        y = np.concatenate((SOS_token, [0], EOS_token))
        data.append([X, y])
        seq_len.append([3+2, 1+2])

    # 1 -> 0,0,0
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [1], EOS_token))
        y = np.concatenate((SOS_token, [0, 0, 0], EOS_token))
        data.append([X, y])
        seq_len.append([1+2, 3+2])

    # 0 -> 1,1,1
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [0], EOS_token))
        y = np.concatenate((SOS_token, [1, 1, 1], EOS_token))
        data.append([X, y])
        seq_len.append([1+2, 3+2])

    # 0,1,0,1 -> 0,1,0,1
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [0, 1, 0, 1], EOS_token))
        y = np.concatenate((SOS_token, [0, 1, 0, 1], EOS_token))
        data.append([X, y])
        seq_len.append([4+2, 4+2])

    # 1,0,1,0 -> 1,0,1,0
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [1, 0, 1, 0], EOS_token))
        y = np.concatenate((SOS_token, [1, 0, 1, 0], EOS_token))
        data.append([X, y])
        seq_len.append([4+2, 4+2])

    temp = list(zip(data, seq_len))  # Pair the elements
    random.shuffle(temp)             # Shuffle the pairs
    data, seq_len = zip(*temp)       # Unzip into separate lists

    return data, seq_len

Construct the training data and add PAD tokens to the sequences that are shorter than the maximum length:

# Generate data and the length of each sequence
tr_data, tr_seq_len = generate_data5(200)

# Add the PAD tokens
PAD_IDX = 4
max_len_X = max([foo[0] for foo in tr_seq_len])
max_len_Y = max([foo[1] for foo in tr_seq_len])
print(max_len_X)
print(max_len_Y)

X_tr = PAD_IDX*torch.ones((max_len_X, len(tr_data)))
Y_tr = PAD_IDX*torch.ones((max_len_Y, len(tr_data)))
for ids, s in enumerate(tr_data):
    X_tr[:tr_seq_len[ids][0], ids] = torch.from_numpy(s[0])
    Y_tr[:tr_seq_len[ids][1], ids] = torch.from_numpy(s[1])

# Construct logical padding masks (True means PAD)
src_padding_mask = (X_tr == PAD_IDX).transpose(0, 1)
tgt_padding_mask = (Y_tr == PAD_IDX).transpose(0, 1)

Re-define the Seq2SeqTransformer, this time with padding support (extra parameters are added to the transformer call in the forward() function):

import math
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    # Constructor
    def __init__(
        self,
        num_tokens,
        d_model,
        nhead,
        num_encoder_layers,
        num_decoder_layers,
        dim_feedforward,
        dropout_p,
        layer_norm_eps,
        padding_idx=None
    ):
        super().__init__()

        self.d_model = d_model
        self.padding_idx = padding_idx

        if padding_idx is not None:
            # Token embedding layer - this takes care of converting integers to vectors
            self.embedding = nn.Embedding(num_tokens+1, d_model, padding_idx=self.padding_idx)
        else:
            # Token embedding layer - this takes care of converting integers to vectors
            self.embedding = nn.Embedding(num_tokens, d_model)

        # Token "unembedding" to a one-hot token vector
        self.unembedding = nn.Linear(d_model, num_tokens)

        # Positional encoding
        self.positional_encoder = PositionalEncoding(d_model=d_model, dropout=dropout_p)

        # nn.Transformer that does the magic
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout_p,
            layer_norm_eps=layer_norm_eps,
            norm_first=True
        )

    def forward(
        self,
        src,
        tgt,
        tgt_mask=None,
        src_key_padding_mask=None,
        tgt_key_padding_mask=None
    ):
        # Note: the default src & tgt size is (seq_length, batch_num, feat_dim)

        # Token embedding
        src = self.embedding(src) * math.sqrt(self.d_model)
        tgt = self.embedding(tgt) * math.sqrt(self.d_model)

        # Positional encoding - this is sensitive: the data _must_ be seq_len x batch_num x feat_dim
        # Inference often misses the batch dimension
        if src.dim() == 2:  # seq_len x feat_dim
            src = torch.unsqueeze(src, 1)
        src = self.positional_encoder(src)
        if tgt.dim() == 2:  # seq_len x feat_dim
            tgt = torch.unsqueeze(tgt, 1)
        tgt = self.positional_encoder(tgt)

        # Transformer output
        out = self.transformer(src, tgt, tgt_mask=tgt_mask,
                               src_key_padding_mask=src_key_padding_mask,
                               tgt_key_padding_mask=tgt_key_padding_mask,
                               memory_key_padding_mask=src_key_padding_mask)
        out = self.unembedding(out)

        return out

Construct the model and train it:

from torch.optim import lr_scheduler

model = Seq2SeqTransformer(num_tokens=4, d_model=8, nhead=1, num_encoder_layers=1,
                           num_decoder_layers=1, dim_feedforward=8, dropout_p=0.1,
                           layer_norm_eps=1e-05, padding_idx=PAD_IDX)

num_of_epochs = 2000
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[1000], gamma=0.1)
model.train()
for n in range(num_of_epochs):
    running_loss = 0.0
    X_in = X_tr.long()
    Y_in = Y_tr[:-1, :].long()
    Y_out = Y_tr[1:, :].long()
    tgt_padding_mask_in = tgt_padding_mask[:, :-1]

    # Get the mask that masks out the future words
    sequence_length = Y_in.size(0)
    tgt_mask = nn.Transformer.generate_square_subsequent_mask(sequence_length)

    Y_pred = model(X_in, Y_in, tgt_mask=tgt_mask, src_key_padding_mask=src_padding_mask,
                   tgt_key_padding_mask=tgt_padding_mask_in)

    # seq len x num samples => num samples x seq len
    Y_out = Y_out.permute(1, 0)
    # seq len x num samples x token one-hot => num samples x token one-hot x seq len
    Y_pred = Y_pred.permute(1, 2, 0)
    loss = loss_fn(Y_pred, Y_out)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    running_loss += loss.item()
    scheduler.step()
    if n % 100 == 0:
        print(f' Epoch {n} training loss {running_loss} (lr={optimizer.param_groups[0]["lr"]})')
print(f'Final: Epoch {n} training loss {running_loss} (lr={optimizer.param_groups[0]["lr"]})')

Epoch 0 training loss 1.3525161743164062 (lr=0.01)
Epoch 100 training loss 0.5397025942802429 (lr=0.01)
Epoch 200 training loss 0.3458441495895386 (lr=0.01)
Epoch 300 training loss 0.2927950918674469 (lr=0.01)
Epoch 400 training loss 0.24350842833518982 (lr=0.01)
Epoch 500 training loss 0.2193882018327713 (lr=0.01)
Epoch 600 training loss 0.1860966682434082 (lr=0.01)
Epoch 700 training loss 0.15263524651527405 (lr=0.01)
Epoch 800 training loss 0.1544901579618454 (lr=0.01)
Epoch 900 training loss 0.16662688553333282 (lr=0.01)
Epoch 1000 training loss 0.13945193588733673 (lr=0.001)
Epoch 1100 training loss 0.1130961999297142 (lr=0.001)
Epoch 1200 training loss 0.12732738256454468 (lr=0.001)
Epoch 1300 training loss 0.12633047997951508 (lr=0.001)
Epoch 1400 training loss 0.12585079669952393 (lr=0.001)
Epoch 1500 training loss 0.13260918855667114 (lr=0.001)
Epoch 1600 training loss 0.09995909780263901 (lr=0.001)
Epoch 1700 training loss 0.09377395361661911 (lr=0.001)
Epoch 1800 training loss 0.12214040011167526 (lr=0.001)
Epoch 1900 training loss 0.09379428625106812 (lr=0.001)
Final: Epoch 1999 training loss 0.11169883608818054 (lr=0.001)
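The test loop below calls a predict() helper that was defined in an earlier part of this series. For completeness, here is a minimal greedy-decoding sketch that is consistent with how it is used here; it is an assumption rather than the author's exact code (SOS token 2, EOS token 3 and a maximum output length of 6 are assumed):

import torch
import torch.nn as nn

def predict(model, input_sequence, max_length=6, SOS_token=2, EOS_token=3):
    model.eval()
    y_input = torch.tensor([SOS_token], dtype=torch.long)
    with torch.no_grad():
        for _ in range(max_length):
            # Causal mask for the tokens generated so far
            tgt_mask = nn.Transformer.generate_square_subsequent_mask(y_input.size(0))
            pred = model(input_sequence, y_input, tgt_mask=tgt_mask)
            next_token = pred[-1, 0, :].argmax().view(1)  # greedy choice for the last position
            y_input = torch.cat((y_input, next_token), dim=0)
            if next_token.item() == EOS_token:
                break
    return y_input.tolist()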

Test the model with all the sequences:

# Here we test some examples to see how the model predicts
examples = [
    torch.tensor([2, 0, 0, 0, 0, 3], dtype=torch.long),
    torch.tensor([2, 1, 1, 1, 1, 3], dtype=torch.long),
    torch.tensor([2, 1, 1, 1, 3], dtype=torch.long),
    torch.tensor([2, 0, 0, 0, 3], dtype=torch.long),
    torch.tensor([2, 0, 3], dtype=torch.long),
    torch.tensor([2, 1, 3], dtype=torch.long),
    torch.tensor([2, 0, 1, 0, 1, 3], dtype=torch.long),
    torch.tensor([2, 1, 0, 1, 0, 3], dtype=torch.long),
]

for idx, example in enumerate(examples):
    result = predict(model, example)
    print(f"Example {idx}")
    print(f"Input sequence: {example.view(-1).tolist()[1:-1]}")
    print(f"Output (predicted) sequence: {result[1:-1]}")
    print()

Example 0
Input sequence: [0, 0, 0, 0]
Output (predicted) sequence: [0, 0, 0, 0]

Example 1
Input sequence: [1, 1, 1, 1]
Output (predicted) sequence: [1, 1, 1, 1]

Example 2
Input sequence: [1, 1, 1]
Output (predicted) sequence: [0]

Example 3
Input sequence: [0, 0, 0]
Output (predicted) sequence: [1]

Example 4
Input sequence: [0]
Output (predicted) sequence: [1, 1, 1]

Example 5
Input sequence: [1]
Output (predicted) sequence: [0, 0, 0]

Example 6
Input sequence: [0, 1, 0, 1]
Output (predicted) sequence: [0, 1, 0, 1]

Example 7
Input sequence: [1, 0, 1, 0]
Output (predicted) sequence: [1, 0, 1, 0]

It works as we expected. The complete model can now be used for any Seq2Seq problem.


