Introduction to Sequence Modeling with Transformers
by Joni Kamarainen | February 28, 2025


The Seq2SeqTransformer can learn all of the above sequences without trouble. The final step is to teach all eight sequence-to-sequence translations to a single transformer model:

    • 0,0,0,0 → 0,0,0,0
    • 1,1,1,1 → 1,1,1,1
    • 1,1,1 → 0
    • 0,0,0 → 1
    • 0 → 1,1,1
    • 1 → 0,0,0
    • 0,1,0,1 → 0,1,0,1
    • 1,0,1,0 → 1,0,1,0

The requirement the current model cannot handle is that the sequences are of varying lengths. There are two options: each sequence can be trained individually, which is inefficient, or dummy PAD tokens are added to the end of the sequences that are shorter than the maximum length. If all sequences are roughly the same length, then the latter is the more efficient solution.
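As a quick illustration (my own toy example, not part of the article's model code), padding simply fills the shorter sequences with a dummy PAD token up to the longest length in the batch:

PAD = 4  # an assumed token id reserved for padding
batch = [[2, 0, 3], [2, 1, 1, 1, 3]]  # two sequences of different lengths (SOS=2, EOS=3)
max_len = max(len(s) for s in batch)
padded = [s + [PAD] * (max_len - len(s)) for s in batch]
print(padded)  # [[2, 0, 3, 4, 4], [2, 1, 1, 1, 3]]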

Padding masks

A further complication with padding is that, similar to the masking of future tokens, the PAD tokens must be masked during training. There are three torch.nn.Transformer.forward() parameters through which the mask tensors must be provided:

    • Input masking (src_key_padding_mask)
    • Output (target) masking (tgt_key_padding_mask)
    • Decoder memory masking (memory_key_padding_mask)

In most cases the decoder memory mask is the same as the input mask, i.e. it prevents the decoder from seeing PAD tokens in its 'memory'.
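To make the convention concrete, here is a small sketch (mine, not from the article): each *_key_padding_mask is a boolean tensor of shape (batch, seq_len) with True at the PAD positions, and the attention layers then ignore those positions.

import torch

PAD_IDX = 4  # the padding token id used later in this article
# Two padded source sequences in the default (seq_len, batch) layout of nn.Transformer
src = torch.tensor([[2, 2],
                    [0, 1],
                    [3, 1],
                    [4, 1],   # PAD
                    [4, 3]])  # PAD
src_key_padding_mask = (src == PAD_IDX).transpose(0, 1)  # shape (batch, seq_len)
print(src_key_padding_mask)
# tensor([[False, False, False,  True,  True],
#         [False, False, False, False, False]])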

The padding masks also affect the embedding, as an extra token must be added:

# Token embedding layer - this takes care of converting integers to vectors
self.embedding = nn.Embedding(num_tokens+1, d_model, padding_idx=self.padding_idx)
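A side note (my own observation, not from the article): giving padding_idx to nn.Embedding also pins the PAD embedding to all zeros and excludes that row from gradient updates, so the padding token never learns a representation of its own:

import torch
import torch.nn as nn

emb = nn.Embedding(5, 8, padding_idx=4)  # 4 tokens + 1 PAD, d_model = 8 as used below
print(emb.weight[4])                     # the PAD row is initialized to zeros
loss = emb(torch.tensor([0, 4])).sum()
loss.backward()
print(emb.weight.grad[4])                # zeros: the PAD row receives no gradient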

Another consideration is the loss function, as it should ignore gradients with respect to the PAD tokens.

    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
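A minimal check (my own sketch) of what ignore_index does: positions whose target equals PAD_IDX contribute nothing to the loss, so the mean is taken only over the real tokens:

import torch

PAD_IDX = 4
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
logits = torch.randn(3, 5)               # 3 positions, 5 classes (4 tokens + PAD)
targets = torch.tensor([1, 0, PAD_IDX])  # the last position is padding
with_pad = loss_fn(logits, targets)
without_pad = torch.nn.CrossEntropyLoss()(logits[:2], targets[:2])
print(torch.allclose(with_pad, without_pad))  # True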

Let's put it all together.

Final run

Generate the data:

def generate_data5(n):
    SOS_token = np.array([2])
    EOS_token = np.array([3])

    data = []
    seq_len = []

    # 0,0,0,0 -> 0,0,0,0
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [0, 0, 0, 0], EOS_token))
        y = np.concatenate((SOS_token, [0, 0, 0, 0], EOS_token))
        data.append([X, y])
        seq_len.append([4+2, 4+2])

    # 1,1,1,1 -> 1,1,1,1
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [1, 1, 1, 1], EOS_token))
        y = np.concatenate((SOS_token, [1, 1, 1, 1], EOS_token))
        data.append([X, y])
        seq_len.append([4+2, 4+2])

    # 0,0,0 -> 1
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [0, 0, 0], EOS_token))
        y = np.concatenate((SOS_token, [1], EOS_token))
        data.append([X, y])
        seq_len.append([3+2, 1+2])

    # 1,1,1 -> 0
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [1, 1, 1], EOS_token))
        y = np.concatenate((SOS_token, [0], EOS_token))
        data.append([X, y])
        seq_len.append([3+2, 1+2])

    # 1 -> 0,0,0
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [1], EOS_token))
        y = np.concatenate((SOS_token, [0, 0, 0], EOS_token))
        data.append([X, y])
        seq_len.append([1+2, 3+2])

    # 0 -> 1,1,1
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [0], EOS_token))
        y = np.concatenate((SOS_token, [1, 1, 1], EOS_token))
        data.append([X, y])
        seq_len.append([1+2, 3+2])

    # 0,1,0,1 -> 0,1,0,1
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [0, 1, 0, 1], EOS_token))
        y = np.concatenate((SOS_token, [0, 1, 0, 1], EOS_token))
        data.append([X, y])
        seq_len.append([4+2, 4+2])

    # 1,0,1,0 -> 1,0,1,0
    for i in range(n // 8):
        X = np.concatenate((SOS_token, [1, 0, 1, 0], EOS_token))
        y = np.concatenate((SOS_token, [1, 0, 1, 0], EOS_token))
        data.append([X, y])
        seq_len.append([4+2, 4+2])

    temp = list(zip(data, seq_len))  # Pair the elements
    random.shuffle(temp)             # Shuffle the pairs
    data, seq_len = zip(*temp)       # Unzip into separate lists

    return data, seq_len
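For reference, each entry returned by generate_data5 is an [X, y] pair of SOS/EOS-wrapped NumPy arrays, together with its [len(X), len(y)] record (the exact first pair varies because the pairs are shuffled):

data, seq_len = generate_data5(8)
print(data[0])     # e.g. [array([2, 1, 3]), array([2, 0, 0, 0, 3])]
print(seq_len[0])  # e.g. [3, 5]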

Construct the training data and add PAD tokens to the sequences shorter than the maximum length:

# Generate data and the length of each sequence
tr_data, tr_seq_len = generate_data5(200)

# Add the pad tokens
PAD_IDX = 4
max_len_X = max([foo[0] for foo in tr_seq_len])
max_len_Y = max([foo[1] for foo in tr_seq_len])
print(max_len_X)
print(max_len_Y)

X_tr = PAD_IDX*torch.ones((max_len_X, len(tr_data)))
Y_tr = PAD_IDX*torch.ones((max_len_Y, len(tr_data)))
for ids, s in enumerate(tr_data):
    X_tr[:tr_seq_len[ids][0], ids] = torch.from_numpy(s[0])
    Y_tr[:tr_seq_len[ids][1], ids] = torch.from_numpy(s[1])

# Construct logical pad masks (True is PAD)
src_padding_mask = (X_tr == PAD_IDX).transpose(0, 1)
tgt_padding_mask = (Y_tr == PAD_IDX).transpose(0, 1)

Re-define Seq2SeqTransformer, this time with padding support (extra parameters are added to the transformer call in the forward() function):

class Seq2SeqTransformer(nn.Module):
    # Constructor
    def __init__(
            self,
            num_tokens,
            d_model,
            nhead,
            num_encoder_layers,
            num_decoder_layers,
            dim_feedforward,
            dropout_p,
            layer_norm_eps,
            padding_idx=None
    ):
        super().__init__()

        self.d_model = d_model
        self.padding_idx = padding_idx

        if padding_idx is not None:
            # Token embedding layer - this takes care of converting integers to vectors
            self.embedding = nn.Embedding(num_tokens+1, d_model, padding_idx=self.padding_idx)
        else:
            # Token embedding layer - this takes care of converting integers to vectors
            self.embedding = nn.Embedding(num_tokens, d_model)

        # Token "unembedding" to one-hot token vector
        self.unembedding = nn.Linear(d_model, num_tokens)

        # Positional encoding
        self.positional_encoder = PositionalEncoding(d_model=d_model, dropout=dropout_p)

        # nn.Transformer that does the magic
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            dropout=dropout_p,
            layer_norm_eps=layer_norm_eps,
            norm_first=True
        )

    def forward(
            self,
            src,
            tgt,
            tgt_mask=None,
            src_key_padding_mask=None,
            tgt_key_padding_mask=None
    ):
        # Note: src & tgt default size is (seq_length, batch_num, feat_dim)

        # Token embedding
        src = self.embedding(src) * math.sqrt(self.d_model)
        tgt = self.embedding(tgt) * math.sqrt(self.d_model)

        # Positional encoding - this is sensitive: data _must_ be seq len x batch num x feat dim
        # Inference often misses the batch num dimension
        if src.dim() == 2:  # seq len x feat dim
            src = torch.unsqueeze(src, 1)
        src = self.positional_encoder(src)
        if tgt.dim() == 2:  # seq len x feat dim
            tgt = torch.unsqueeze(tgt, 1)
        tgt = self.positional_encoder(tgt)

        # Transformer output
        out = self.transformer(src, tgt, tgt_mask=tgt_mask,
                               src_key_padding_mask=src_key_padding_mask,
                               tgt_key_padding_mask=tgt_key_padding_mask,
                               memory_key_padding_mask=src_key_padding_mask)
        out = self.unembedding(out)

        return out

Construct the model and train it:

model = Seq2SeqTransformer(num_tokens=4, d_model=8, nhead=1, num_encoder_layers=1,
                           num_decoder_layers=1, dim_feedforward=8, dropout_p=0.1,
                           layer_norm_eps=1e-05, padding_idx=PAD_IDX)

num_of_epochs = 2000
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[1000], gamma=0.1)
model.train()
for n in range(num_of_epochs):
    running_loss = 0.0
    X_in = X_tr.long()
    Y_in = Y_tr[:-1, :].long()
    Y_out = Y_tr[1:, :].long()
    tgt_padding_mask_in = tgt_padding_mask[:, :-1]

    # Get mask to mask out the next tokens
    sequence_length = Y_in.size(0)
    tgt_mask = nn.Transformer.generate_square_subsequent_mask(sequence_length)

    Y_pred = model(X_in, Y_in, tgt_mask=tgt_mask, src_key_padding_mask=src_padding_mask,
                   tgt_key_padding_mask=tgt_padding_mask_in)

    # seq len x num samples => num samples x seq len
    Y_out = Y_out.permute(1, 0)
    # seq len x num samples x token one-hot => num samples x token one-hot x seq len
    Y_pred = Y_pred.permute(1, 2, 0)
    #print(Y_pred.shape)
    loss = loss_fn(Y_pred, Y_out)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    running_loss += loss.item()
    scheduler.step()
    if n % 100 == 0:
        print(f' Epoch {n} training loss {running_loss} (lr={optimizer.param_groups[0]["lr"]})')
print(f'Final: Epoch {n} training loss {running_loss} (lr={optimizer.param_groups[0]["lr"]})')

Epoch 0 training loss 1.3525161743164062 (lr=0.01)
Epoch 100 training loss 0.5397025942802429 (lr=0.01)
Epoch 200 training loss 0.3458441495895386 (lr=0.01)
Epoch 300 training loss 0.2927950918674469 (lr=0.01)
Epoch 400 training loss 0.24350842833518982 (lr=0.01)
Epoch 500 training loss 0.2193882018327713 (lr=0.01)
Epoch 600 training loss 0.1860966682434082 (lr=0.01)
Epoch 700 training loss 0.15263524651527405 (lr=0.01)
Epoch 800 training loss 0.1544901579618454 (lr=0.01)
Epoch 900 training loss 0.16662688553333282 (lr=0.01)
Epoch 1000 training loss 0.13945193588733673 (lr=0.001)
Epoch 1100 training loss 0.1130961999297142 (lr=0.001)
Epoch 1200 training loss 0.12732738256454468 (lr=0.001)
Epoch 1300 training loss 0.12633047997951508 (lr=0.001)
Epoch 1400 training loss 0.12585079669952393 (lr=0.001)
Epoch 1500 training loss 0.13260918855667114 (lr=0.001)
Epoch 1600 training loss 0.09995909780263901 (lr=0.001)
Epoch 1700 training loss 0.09377395361661911 (lr=0.001)
Epoch 1800 training loss 0.12214040011167526 (lr=0.001)
Epoch 1900 training loss 0.09379428625106812 (lr=0.001)
Final: Epoch 1999 training loss 0.11169883608818054 (lr=0.001)
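The test loop below uses the predict() helper defined in the earlier part of this article; for completeness, here is a minimal greedy-decoding sketch consistent with the model above (my reconstruction, not necessarily the original code):

def predict(model, input_sequence, max_length=8, SOS_token=2, EOS_token=3):
    model.eval()
    y_input = torch.tensor([SOS_token], dtype=torch.long)
    with torch.no_grad():
        for _ in range(max_length):
            # Causal mask for the tokens generated so far
            tgt_mask = nn.Transformer.generate_square_subsequent_mask(y_input.size(0))
            pred = model(input_sequence, y_input, tgt_mask=tgt_mask)
            next_token = pred[-1, 0, :].argmax().item()  # greedy pick at the last position
            y_input = torch.cat((y_input, torch.tensor([next_token], dtype=torch.long)))
            if next_token == EOS_token:
                break
    return y_input.tolist()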

Test the model with all sequences:

# Here we test some examples to see how the model predicts
examples = [
    torch.tensor([2, 0, 0, 0, 0, 3], dtype=torch.long),
    torch.tensor([2, 1, 1, 1, 1, 3], dtype=torch.long),
    torch.tensor([2, 1, 1, 1, 3], dtype=torch.long),
    torch.tensor([2, 0, 0, 0, 3], dtype=torch.long),
    torch.tensor([2, 0, 3], dtype=torch.long),
    torch.tensor([2, 1, 3], dtype=torch.long),
    torch.tensor([2, 0, 1, 0, 1, 3], dtype=torch.long),
    torch.tensor([2, 1, 0, 1, 0, 3], dtype=torch.long),
]

for idx, example in enumerate(examples):
    result = predict(model, example)
    print(f"Example {idx}")
    print(f"Input sequence: {example.view(-1).tolist()[1:-1]}")
    print(f"Output (predicted) sequence: {result[1:-1]}")
    print()

Example 0
Input sequence: [0, 0, 0, 0]
Output (predicted) sequence: [0, 0, 0, 0]

Example 1
Input sequence: [1, 1, 1, 1]
Output (predicted) sequence: [1, 1, 1, 1]

Example 2
Input sequence: [1, 1, 1]
Output (predicted) sequence: [0]

Example 3
Input sequence: [0, 0, 0]
Output (predicted) sequence: [1]

Example 4
Input sequence: [0]
Output (predicted) sequence: [1, 1, 1]

Example 5
Input sequence: [1]
Output (predicted) sequence: [0, 0, 0]

Example 6
Input sequence: [0, 1, 0, 1]
Output (predicted) sequence: [0, 1, 0, 1]

Example 7
Input sequence: [1, 0, 1, 0]
Output (predicted) sequence: [1, 0, 1, 0]

It works as we expected. The complete model can now be used for any Seq2Seq problem.


