The Seq2SeqTransformer can learn all of the above sequences without trouble. The final step is to teach all eight sequence-to-sequence translations to a single transformer model:
- 0,0,0,0 → 0,0,0,0
- 1,1,1,1 → 1,1,1,1
- 1,1,1 → 0
- 0,0,0 → 1
- 0 → 1,1,1
- 1 → 0,0,0
- 0,1,0,1 → 0,1,0,1
- 1,0,1,0 → 1,0,1,0
The one requirement the current model cannot handle is that the sequences are of different lengths. There are two options: either each sequence is trained separately, which is inefficient, or dummy PAD tokens are appended to sequences that are shorter than the maximum length. If all sequences are roughly the same length, the latter is the more efficient solution.
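As a minimal sketch of the second option (not from the original code; PAD_IDX = 4 matches the value used later), a short sequence is simply extended with PAD tokens up to the length of the longest sequence in the batch:
import numpy as np

PAD_IDX = 4  # token reserved for padding (0/1 are data, 2 = SOS, 3 = EOS)
seqs = [np.array([2, 0, 3]), np.array([2, 0, 1, 0, 1, 3])]
max_len = max(len(s) for s in seqs)
padded = [np.concatenate((s, PAD_IDX * np.ones(max_len - len(s), dtype=int))) for s in seqs]
# padded[0] -> array([2, 0, 3, 4, 4, 4]); padded[1] is unchanged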
Padding masks
A further complication with padding is that, similarly to the masking of future tokens, the PAD tokens must be masked during training. There are three torch.nn.Transformer.forward() parameters through which the mask tensors must be provided:
- Input masking (src_key_padding_mask)
- Output (target) masking (tgt_key_padding_mask)
- Decoder memory masking (memory_key_padding_mask)
In most cases the decoder memory mask is the same as the input mask, i.e. it prevents the decoder from seeing PAD tokens in its 'memory', as sketched below.
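As a rough illustration of the expected shapes (this snippet is not from the original article; PAD_IDX = 4 and the (seq_len, batch) layout match the code below), the key padding masks are boolean tensors of shape (batch, seq_len) where True marks a PAD position, and the source mask is typically reused as the memory mask:
import torch

PAD_IDX = 4
# Two already-padded source sequences in (seq_len, batch) layout; 4 is the PAD token
src = torch.tensor([[2, 2],
                    [0, 1],
                    [3, 3],
                    [PAD_IDX, PAD_IDX]])
# Key padding masks are boolean, shape (batch, seq_len); True marks positions to ignore
src_key_padding_mask = (src == PAD_IDX).transpose(0, 1)
print(src_key_padding_mask)
# tensor([[False, False, False,  True],
#         [False, False, False,  True]])
# In forward(), the same tensor can also be passed as memory_key_padding_mask so that
# the decoder never attends to PAD positions of the encoder output.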
Padding also affects the embedding, since one extra token (the PAD token) must be added to the vocabulary:
# Token embedding layer - this takes care of converting integers to vectors
self.embedding = nn.Embedding(num_tokens+1, d_model, padding_idx = self.padding_idx)
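A quick check of what padding_idx does (a sketch, not part of the original article): the PAD row of the embedding is the zero vector and receives no gradient updates, so PAD tokens never acquire a learned meaning:
import torch
import torch.nn as nn

PAD_IDX = 4
emb = nn.Embedding(4 + 1, 8, padding_idx=PAD_IDX)  # 4 real tokens + 1 PAD slot
print(emb(torch.tensor([PAD_IDX])))  # all zeros - PAD maps to the zero vector
# Gradients with respect to this row are also zero, so it stays fixed during training.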
Another consideration is the loss function, as it should ignore gradients with respect to the PAD tokens.
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
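The effect of ignore_index can be verified with a small sketch (not from the original article): positions whose target equals PAD_IDX are simply excluded from the mean loss:
import torch

PAD_IDX = 4
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

logits = torch.randn(3, 5)                 # 3 positions, 5 classes (tokens 0..4)
targets = torch.tensor([1, PAD_IDX, 0])    # the middle position is padding

loss_with_pad = loss_fn(logits, targets)
loss_without_pad = torch.nn.CrossEntropyLoss()(logits[[0, 2]], targets[[0, 2]])
print(torch.isclose(loss_with_pad, loss_without_pad))  # tensor(True)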
Let’s put it all together.
Final run
Generate data:
def generate_data5(n):
    SOS_token = np.array([2])
    EOS_token = np.array([3])

    data = []
    seq_len = []

    # 0,0,0,0 -> 0,0,0,0
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [0, 0, 0, 0], EOS_token))
        y = np.concatenate((SOS_token, [0, 0, 0, 0], EOS_token))
        data.append([X, y])
        seq_len.append([4+2, 4+2])

    # 1,1,1,1 -> 1,1,1,1
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [1, 1, 1, 1], EOS_token))
        y = np.concatenate((SOS_token, [1, 1, 1, 1], EOS_token))
        data.append([X, y])
        seq_len.append([4+2, 4+2])

    # 0,0,0 -> 1
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [0, 0, 0], EOS_token))
        y = np.concatenate((SOS_token, [1], EOS_token))
        data.append([X, y])
        seq_len.append([3+2, 1+2])

    # 1,1,1 -> 0
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [1, 1, 1], EOS_token))
        y = np.concatenate((SOS_token, [0], EOS_token))
        data.append([X, y])
        seq_len.append([3+2, 1+2])

    # 1 -> 0,0,0
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [1], EOS_token))
        y = np.concatenate((SOS_token, [0, 0, 0], EOS_token))
        data.append([X, y])
        seq_len.append([1+2, 3+2])

    # 0 -> 1,1,1
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [0], EOS_token))
        y = np.concatenate((SOS_token, [1, 1, 1], EOS_token))
        data.append([X, y])
        seq_len.append([1+2, 3+2])

    # 0,1,0,1 -> 0,1,0,1
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [0, 1, 0, 1], EOS_token))
        y = np.concatenate((SOS_token, [0, 1, 0, 1], EOS_token))
        data.append([X, y])
        seq_len.append([4+2, 4+2])

    # 1,0,1,0 -> 1,0,1,0
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [1, 0, 1, 0], EOS_token))
        y = np.concatenate((SOS_token, [1, 0, 1, 0], EOS_token))
        data.append([X, y])
        seq_len.append([4+2, 4+2])

    temp = list(zip(data, seq_len))  # Pair the elements
    random.shuffle(temp)             # Shuffle the pairs
    data, seq_len = zip(*temp)       # Unzip into separate lists

    return data, seq_len
Construct the training data and add PAD tokens to sequences shorter than the maximum length:
# Generate data and the length of each sequence
tr_data, tr_seq_len = generate_data5(200)

# Add the PAD tokens
PAD_IDX = 4
max_len_X = max([foo[0] for foo in tr_seq_len])
max_len_Y = max([foo[1] for foo in tr_seq_len])
print(max_len_X)
print(max_len_Y)

X_tr = PAD_IDX*torch.ones((max_len_X, len(tr_data)))
Y_tr = PAD_IDX*torch.ones((max_len_Y, len(tr_data)))
for ids, s in enumerate(tr_data):
    X_tr[:tr_seq_len[ids][0], ids] = torch.from_numpy(s[0])
    Y_tr[:tr_seq_len[ids][1], ids] = torch.from_numpy(s[1])

# Construct logical pad masks (True is PAD)
src_padding_mask = (X_tr == PAD_IDX).transpose(0, 1)
tgt_padding_mask = (Y_tr == PAD_IDX).transpose(0, 1)
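A quick (hypothetical) sanity check of the resulting shapes: nn.Transformer expects the data as (seq_len, batch) and the key padding masks as (batch, seq_len):
print(X_tr.shape)              # torch.Size([6, 200]) - (max_len_X, num_samples)
print(src_padding_mask.shape)  # torch.Size([200, 6]) - (num_samples, max_len_X)
print(X_tr[:, 0])              # one padded column, e.g. tensor([2., 1., 3., 4., 4., 4.])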
Re-define the Seq2SeqTransformer, this time with padding support (extra parameters added to the transformer call in the forward() function):
class Seq2SeqTransformer(nn.Module):
    # Constructor
    def __init__(
            self,
            num_tokens,
            d_model,
            nhead,
            num_encoder_layers,
            num_decoder_layers,
            dim_feedforward,
            dropout_p,
            layer_norm_eps,
            padding_idx = None
    ):
        super().__init__()

        self.d_model = d_model
        self.padding_idx = padding_idx

        if padding_idx is not None:
            # Token embedding layer - this takes care of converting integers to vectors
            # (one extra row is reserved for the PAD token)
            self.embedding = nn.Embedding(num_tokens+1, d_model, padding_idx = self.padding_idx)
        else:
            # Token embedding layer - this takes care of converting integers to vectors
            self.embedding = nn.Embedding(num_tokens, d_model)

        # Token "unembedding" to a one-hot token vector
        self.unembedding = nn.Linear(d_model, num_tokens)

        # Positional encoding
        self.positional_encoder = PositionalEncoding(d_model=d_model, dropout=dropout_p)

        # nn.Transformer that does the magic
        self.transformer = nn.Transformer(
            d_model = d_model,
            nhead = nhead,
            num_encoder_layers = num_encoder_layers,
            num_decoder_layers = num_decoder_layers,
            dim_feedforward = dim_feedforward,
            dropout = dropout_p,
            layer_norm_eps = layer_norm_eps,
            norm_first = True
        )

    def forward(
            self,
            src,
            tgt,
            tgt_mask = None,
            src_key_padding_mask = None,
            tgt_key_padding_mask = None
    ):
        # Note: src & tgt default size is (seq_length, batch_num, feat_dim)

        # Token embedding
        src = self.embedding(src) * math.sqrt(self.d_model)
        tgt = self.embedding(tgt) * math.sqrt(self.d_model)

        # Positional encoding - this is sensitive in that the data _must_ be seq len x batch num x feat dim
        # Inference typically misses the batch num
        if src.dim() == 2:  # seq len x feat dim
            src = torch.unsqueeze(src, 1)
        src = self.positional_encoder(src)
        if tgt.dim() == 2:  # seq len x feat dim
            tgt = torch.unsqueeze(tgt, 1)
        tgt = self.positional_encoder(tgt)

        # Transformer output
        out = self.transformer(src, tgt, tgt_mask=tgt_mask,
                               src_key_padding_mask=src_key_padding_mask,
                               tgt_key_padding_mask=tgt_key_padding_mask,
                               memory_key_padding_mask=src_key_padding_mask)
        out = self.unembedding(out)

        return out
Construct the model and train it:
model = Seq2SeqTransformer(num_tokens = 4, d_model = 8, nhead = 1, num_encoder_layers = 1,
                           num_decoder_layers = 1, dim_feedforward = 8, dropout_p = 0.1,
                           layer_norm_eps = 1e-05, padding_idx = PAD_IDX)

num_of_epochs = 2000
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[1000], gamma=0.1)

model.train()
for n in range(num_of_epochs):
    running_loss = 0.0

    X_in = X_tr.long()
    Y_in = Y_tr[:-1, :].long()
    Y_out = Y_tr[1:, :].long()
    tgt_padding_mask_in = tgt_padding_mask[:, :-1]

    # Get mask to mask out the next words
    sequence_length = Y_in.size(0)
    tgt_mask = nn.Transformer.generate_square_subsequent_mask(sequence_length)

    Y_pred = model(X_in, Y_in, tgt_mask = tgt_mask, src_key_padding_mask = src_padding_mask,
                   tgt_key_padding_mask = tgt_padding_mask_in)

    # seq len x num samples => num samples x seq len
    Y_out = Y_out.permute(1, 0)
    # seq len x num samples x token one hot => num samples x token one hot x seq len
    Y_pred = Y_pred.permute(1, 2, 0)

    loss = loss_fn(Y_pred, Y_out)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    running_loss += loss.item()
    scheduler.step()

    if n % 100 == 0:
        print(f' Epoch {n} training loss {running_loss} (lr={optimizer.param_groups[0]["lr"]})')
print(f'Final: Epoch {n} training loss {running_loss} (lr={optimizer.param_groups[0]["lr"]})')
Epoch 0 training loss 1.3525161743164062 (lr=0.01)
Epoch 100 training loss 0.5397025942802429 (lr=0.01)
Epoch 200 training loss 0.3458441495895386 (lr=0.01)
Epoch 300 training loss 0.2927950918674469 (lr=0.01)
Epoch 400 training loss 0.24350842833518982 (lr=0.01)
Epoch 500 training loss 0.2193882018327713 (lr=0.01)
Epoch 600 training loss 0.1860966682434082 (lr=0.01)
Epoch 700 training loss 0.15263524651527405 (lr=0.01)
Epoch 800 training loss 0.1544901579618454 (lr=0.01)
Epoch 900 training loss 0.16662688553333282 (lr=0.01)
Epoch 1000 training loss 0.13945193588733673 (lr=0.001)
Epoch 1100 training loss 0.1130961999297142 (lr=0.001)
Epoch 1200 training loss 0.12732738256454468 (lr=0.001)
Epoch 1300 training loss 0.12633047997951508 (lr=0.001)
Epoch 1400 training loss 0.12585079669952393 (lr=0.001)
Epoch 1500 training loss 0.13260918855667114 (lr=0.001)
Epoch 1600 training loss 0.09995909780263901 (lr=0.001)
Epoch 1700 training loss 0.09377395361661911 (lr=0.001)
Epoch 1800 training loss 0.12214040011167526 (lr=0.001)
Epoch 1900 training loss 0.09379428625106812 (lr=0.001)
Final: Epoch 1999 training loss 0.11169883608818054 (lr=0.001)
Test the model with all sequences:
# Here we test some examples to observe how the model predicts
examples = [
    torch.tensor([2, 0, 0, 0, 0, 3], dtype=torch.long),
    torch.tensor([2, 1, 1, 1, 1, 3], dtype=torch.long),
    torch.tensor([2, 1, 1, 1, 3], dtype=torch.long),
    torch.tensor([2, 0, 0, 0, 3], dtype=torch.long),
    torch.tensor([2, 0, 3], dtype=torch.long),
    torch.tensor([2, 1, 3], dtype=torch.long),
    torch.tensor([2, 0, 1, 0, 1, 3], dtype=torch.long),
    torch.tensor([2, 1, 0, 1, 0, 3], dtype=torch.long),
]

for idx, example in enumerate(examples):
    result = predict(model, example)
    print(f"Example {idx}")
    print(f"Input sequence: {example.view(-1).tolist()[1:-1]}")
    print(f"Output (predicted) sequence: {result[1:-1]}")
    print()
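The predict() helper was defined earlier in the article; for reference, a minimal greedy-decoding sketch compatible with the model above (assuming SOS = 2, EOS = 3, and a small maximum output length) could look like this:
def predict(model, input_sequence, max_length=8, SOS_token=2, EOS_token=3):
    model.eval()
    # Start the output with SOS and append one greedily chosen token per step
    y_input = torch.tensor([SOS_token], dtype=torch.long)
    with torch.no_grad():
        for _ in range(max_length):
            tgt_mask = nn.Transformer.generate_square_subsequent_mask(y_input.size(0))
            pred = model(input_sequence, y_input, tgt_mask=tgt_mask)
            next_token = pred[-1, 0, :].argmax().item()  # greedy pick for the last position
            y_input = torch.cat((y_input, torch.tensor([next_token], dtype=torch.long)))
            if next_token == EOS_token:
                break
    # Returns [SOS, ..., EOS] so the caller can strip the first and last tokens
    return y_input.tolist()
With the trained model, the test loop prints: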
Example 0
Input sequence: [0, 0, 0, 0]
Output (predicted) sequence: [0, 0, 0, 0]
Example 1
Input sequence: [1, 1, 1, 1]
Output (predicted) sequence: [1, 1, 1, 1]
Example 2
Input sequence: [1, 1, 1]
Output (predicted) sequence: [0]
Example 3
Input sequence: [0, 0, 0]
Output (predicted) sequence: [1]
Example 4
Input sequence: [0]
Output (predicted) sequence: [1, 1, 1]
Example 5
Input sequence: [1]
Output (predicted) sequence: [0, 0, 0]
Example 6
Input sequence: [0, 1, 0, 1]
Output (predicted) sequence: [0, 1, 0, 1]
Example 7
Input sequence: [1, 0, 1, 0]
Output (predicted) sequence: [1, 0, 1, 0]
It works as we expected. The full model can now be used for any Seq2Seq problem.