Introduction
Natural Language Processing and Computer Vision used to be two completely different fields. Well, at least back when I started learning machine learning and deep learning, I felt like there were several paths to follow, and each of them, including NLP and Computer Vision, led me to a completely different world. Over time, we can now observe that AI has become increasingly advanced, with the intersection between multiple fields of study, including the two I just mentioned, becoming more and more common.

Today, many language models are able to generate images based on a given prompt. That is one example of the bridge between NLP and Computer Vision. But I guess I will save that topic for an upcoming article, as it is a bit more complex. Instead, in this article I am going to discuss a simpler one: image captioning. As the name suggests, this is essentially a technique where a model accepts an image and returns a text that describes the input image.

One of the earliest papers in this area is the one titled "Show and Tell: A Neural Image Caption Generator," written by Vinyals et al. back in 2015 [1]. In this article, I will focus on implementing the deep learning model proposed in the paper using PyTorch. Note that I won't actually demonstrate the training process here, as that is a topic on its own. Let me know in the comments if you would like a separate tutorial on that.
Image Captioning Framework

Generally speaking, image captioning can be done by combining two types of models: one specialized in processing images and another capable of processing sequences. I believe you already know which kinds of models work best for these two tasks: a CNN and an RNN, respectively. The idea here is that the CNN is used to encode the input image (hence this part is called the encoder), while the RNN generates a sequence of words based on the features encoded by the CNN (hence the RNN part is called the decoder).

The paper says that the authors did so using GoogLeNet (a.k.a. Inception V1) for the encoder and an LSTM for the decoder. In fact, the use of GoogLeNet is not explicitly mentioned, yet based on the illustration provided in the paper it seems that the encoder architecture is adopted from the original GoogLeNet paper [2]. The figure below shows what the proposed architecture looks like.
![Figure 1. The image captioning model proposed in [1], where the encoder part (the leftmost block) implements the GoogLeNet model [2].](https://towardsdatascience.com/wp-content/uploads/2025/02/1kKTqOvW7PgvE7vVZKDjAJg.png)
Speaking more specifically about the connection between the encoder and the decoder, there are several methods available for connecting the two, namely init-inject, pre-inject, par-inject, and merge, as mentioned in [3]. In the case of the Show and Tell paper, the authors used pre-inject, a method where the features extracted by the encoder are treated as the 0th word of the caption. Later, during inference, we expect the decoder to generate a caption based solely on these image features.
![Figure 2. The four methods possible to be used to connect the encoder and the decoder part of an image captioning model [3]. In our case we are going to use the pre-inject method (b).](https://towardsdatascience.com/wp-content/uploads/2025/02/1lIqALUziG9p9abVaosyyVA.png)
Now that we understand the theory behind the image captioning model, we can jump into the code!

I'll break the implementation into three parts: the encoder, the decoder, and the combination of the two. Before we actually get into them, we need to import the modules and initialize the required parameters upfront. Look at Codeblock 1 below to see the modules I use.
# Codeblock 1
import torch    #(1)
import torch.nn as nn    #(2)
import torchvision.models as models    #(3)
from torchvision.models import GoogLeNet_Weights    #(4)
Let's break down these imports quickly: the line marked with `#(1)` is used for basic tensor operations, line `#(2)` is for initializing neural network layers, line `#(3)` is for loading various deep learning models, and `#(4)` is the pretrained weights for the GoogLeNet model.
Regarding the parameter configuration, `EMBED_DIM` and `LSTM_HIDDEN_DIM` are the only two parameters mentioned in the paper, both of which are set to 512, as shown at lines `#(1)` and `#(2)` in Codeblock 2 below. The `EMBED_DIM` variable essentially denotes the size of the feature vector representing a single token in the caption. In this case, we can simply think of a single token as an individual word. Meanwhile, `LSTM_HIDDEN_DIM` represents the size of the hidden state inside the LSTM cell. The paper does not mention how many times this RNN-based layer is stacked, but based on the diagram in Figure 1 it appears that only a single LSTM cell is used. Thus, at line `#(3)` I set the `NUM_LSTM_LAYERS` variable to 1.
# Codeblock 2
EMBED_DIM = 512 #(1)
LSTM_HIDDEN_DIM = 512 #(2)
NUM_LSTM_LAYERS = 1 #(3)
IMAGE_SIZE = 224 #(4)
IN_CHANNELS = 3 #(5)
SEQ_LENGTH = 30 #(6)
VOCAB_SIZE = 10000 #(7)
BATCH_SIZE = 1
The next two parameters are related to the input image, namely `IMAGE_SIZE` (`#(4)`) and `IN_CHANNELS` (`#(5)`). Since we are going to use GoogLeNet for the encoder, we need to match its original input shape (3×224×224). Not only for the image, we also need to configure the parameters for the caption. Here we assume that a caption is no longer than 30 words (`#(6)`) and that the number of unique words in the dictionary is 10,000 (`#(7)`). Finally, the `BATCH_SIZE` parameter is there because PyTorch processes tensors in batches by default. Just to keep things simple, the number of image-caption pairs within a single batch is set to 1.
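To make the caption-related parameters a bit more concrete, below is a small, purely illustrative sketch (not part of the original setup) of how a raw caption could be converted into a fixed-length tensor of 30 token IDs. The `word2idx` mapping and the special `<pad>`, `<end>`, and `<unk>` entries are my own assumptions; I only place `<end>` at index 1 so that it matches the stop-token assumption used later in the `generate()` method.

# Illustrative only: a toy word-to-index mapping. In practice the 10,000-entry
# vocabulary would be built from the training captions during preprocessing.
word2idx = {"<pad>": 0, "<end>": 1, "<unk>": 2, "a": 3, "dog": 4, "on": 5, "the": 6, "beach": 7}

def encode_caption(caption, seq_length=SEQ_LENGTH):
    ids = [word2idx.get(w, word2idx["<unk>"]) for w in caption.lower().split()]
    ids = ids[:seq_length - 1] + [word2idx["<end>"]]        # append the end-of-caption token
    ids += [word2idx["<pad>"]] * (seq_length - len(ids))    # pad up to SEQ_LENGTH
    return torch.tensor(ids).unsqueeze(0)                   # shape: (1, SEQ_LENGTH)

print(encode_caption("a dog on the beach").shape)  # torch.Size([1, 30])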
GoogLeNet Encoder
It is actually possible to use any kind of CNN-based model for the encoder. I found on the internet that [4] uses DenseNet, [5] uses Inception V3, and [6] uses ResNet for similar tasks. However, since my goal is to reproduce the model proposed in the paper as closely as possible, I am using the pretrained GoogLeNet model instead. Before we get into the encoder implementation, let's see what the GoogLeNet architecture looks like using the following code.
# Codeblock 3
models.googlenet()
The resulting output is very long, since it lists literally every layer inside the architecture. Here I truncate the output because I only want you to focus on the last layer (the `fc` layer marked with `#(1)` in the Codeblock 3 output below). You can see that this linear layer maps a feature vector of size 1024 to 1000. Normally, in a standard image classification task, each of these 1000 neurons corresponds to a specific class. So, for example, if you wanted to perform a 5-class classification task, you would need to modify this layer so that it projects the outputs to 5 neurons only. In our case, we need this layer to produce a feature vector of length 512 (`EMBED_DIM`). With this, the input image will later be represented as a 512-dimensional vector after being processed by the GoogLeNet model. This feature vector size exactly matches the token embedding dimension, allowing it to be treated as part of our word sequence.
# Codeblock 3 Output
GoogLeNet(
  (conv1): BasicConv2d(
    (conv): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (maxpool1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=True)
  (conv2): BasicConv2d(
    (conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  .
  .
  .
  .
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=1024, out_features=1000, bias=True)    #(1)
)
Now let's actually load and modify the GoogLeNet model, which I do in the `InceptionEncoder` class below.
# Codeblock 4a
class InceptionEncoder(nn.Module):
    def __init__(self, fine_tune):    #(1)
        super().__init__()
        self.googlenet = models.googlenet(weights=GoogLeNet_Weights.IMAGENET1K_V1)    #(2)
        self.googlenet.fc = nn.Linear(in_features=self.googlenet.fc.in_features,    #(3)
                                      out_features=EMBED_DIM)    #(4)

        if fine_tune == True:    #(5)
            for param in self.googlenet.parameters():
                param.requires_grad = True
        else:
            for param in self.googlenet.parameters():
                param.requires_grad = False
            for param in self.googlenet.fc.parameters():
                param.requires_grad = True
The first thing we do in the above code is load the model using `models.googlenet()`. It is mentioned in the paper that the model is pretrained on the ImageNet dataset, so we need to pass `GoogLeNet_Weights.IMAGENET1K_V1` to the `weights` parameter, as shown at line `#(2)` in Codeblock 4a. Next, at line `#(3)` we access the classification head through the `fc` attribute, where we replace the existing linear layer with a new one having an output dimension of 512 (`EMBED_DIM`) (`#(4)`). Since this GoogLeNet model is already trained, we don't need to train it from scratch. Instead, we can either perform fine-tuning or transfer learning in order to adapt it to the image captioning task.

In case you're not yet familiar with the two terms, fine-tuning is a method where we update the weights of the entire model. On the other hand, transfer learning is a technique where we only update the weights of the layers we replaced (in this case the last fully-connected layer), while keeping the weights of the existing layers frozen. To do so, I implement a flag named `fine_tune` at line `#(1)`, which lets the model perform fine-tuning whenever it is set to `True` (`#(5)`).
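If you want to see the practical effect of this flag, one quick sanity check (my own addition, not from the original code) is to count how many parameters end up trainable in each mode:

# Count trainable parameters under both settings of the fine_tune flag.
def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

full_finetune = InceptionEncoder(fine_tune=True)
head_only = InceptionEncoder(fine_tune=False)

print(count_trainable(full_finetune))  # every GoogLeNet weight is trainable
print(count_trainable(head_only))      # only the new fc layer: 1024*512 + 512 = 524,800

With `fine_tune=False`, only the replaced linear layer contributes trainable parameters, which is exactly the transfer learning behavior described above.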
The `forward()` method is pretty straightforward, since all we do here is pass the input image through the modified GoogLeNet model. See Codeblock 4b below for the details. Additionally, I also print out the tensor dimensions before and after processing so that you can better understand how the `InceptionEncoder` model works.
# Codeblock 4b
    def forward(self, images):
        print(f'original\t\t: {images.size()}')
        features = self.googlenet(images)
        print(f'after googlenet\t: {features.size()}')
        return features
To check whether our encoder works properly, we can pass a dummy tensor of size 1×3×224×224 through the network, as demonstrated in Codeblock 5. This tensor simulates a single RGB image of size 224×224. You can see in the resulting output that our image has now become a single-dimensional feature vector of length 512.
# Codeblock 5
inception_encoder = InceptionEncoder(fine_tune=True)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
features = inception_encoder(images)
# Codeblock 5 Output
original        : torch.Size([1, 3, 224, 224])
after googlenet : torch.Size([1, 512])
LSTM Decoder
Now that we have successfully implemented the encoder, it is time to create the LSTM decoder, which I demonstrate in Codeblocks 6a and 6b. The first thing we need to do is initialize the required layers, namely an embedding layer (`#(1)`), the LSTM layer itself (`#(2)`), and a standard linear layer (`#(3)`). The first one (`nn.Embedding`) is responsible for mapping every single token into a 512 (`EMBED_DIM`)-dimensional vector. Meanwhile, the LSTM layer produces a sequence of hidden states, each of which is then mapped into a 10000 (`VOCAB_SIZE`)-dimensional vector by the linear layer. Later on, the values contained in this vector will represent the likelihood of each word in the dictionary being chosen.
# Codeblock 6a
class LSTMDecoder(nn.Module):
    def __init__(self):
        super().__init__()

        #(1)
        self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,
                                      embedding_dim=EMBED_DIM)

        #(2)
        self.lstm = nn.LSTM(input_size=EMBED_DIM,
                            hidden_size=LSTM_HIDDEN_DIM,
                            num_layers=NUM_LSTM_LAYERS,
                            batch_first=True)

        #(3)
        self.linear = nn.Linear(in_features=LSTM_HIDDEN_DIM,
                                out_features=VOCAB_SIZE)
Next, let's define the flow of the network using the following code.
# Codeblock 6b
    def forward(self, features, captions):    #(1)
        print(f'features original\t: {features.size()}')
        features = features.unsqueeze(1)    #(2)
        print(f"after unsqueeze\t\t: {features.shape}")

        print(f'captions original\t: {captions.size()}')
        captions = self.embedding(captions)    #(3)
        print(f"after embedding\t\t: {captions.shape}")

        captions = torch.cat([features, captions], dim=1)    #(4)
        print(f"after concat\t\t: {captions.shape}")

        captions, _ = self.lstm(captions)    #(5)
        print(f"after lstm\t\t: {captions.shape}")

        captions = self.linear(captions)    #(6)
        print(f"after linear\t\t: {captions.shape}")

        return captions
You can see in the above code that the `forward()` method of the `LSTMDecoder` class accepts two inputs: `features` and `captions`, where the former is the image that has been processed by the `InceptionEncoder`, while the latter is the caption of the corresponding image serving as the ground truth (`#(1)`). The idea here is that we are going to perform the pre-inject operation by prepending the `features` tensor to `captions` using the code at line `#(4)`. However, keep in mind that we need to adjust the shapes of both tensors beforehand. To do so, we insert a single dimension at the 1st axis of the image features (`#(2)`). Meanwhile, the shape of the `captions` tensor will already match our requirement right after being processed by the embedding layer (`#(3)`). Once `features` and `captions` have been concatenated, we pass this tensor through the LSTM layer (`#(5)`) before it is eventually processed by the linear layer (`#(6)`). Look at the testing code below to better understand the flow of the two tensors.
# Codeblock 7
lstm_decoder = LSTMDecoder()

features = torch.randn(BATCH_SIZE, EMBED_DIM)    #(1)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))    #(2)

captions = lstm_decoder(features, captions)
In Codeblock 7, I assume that `features` is a dummy tensor representing the output of the `InceptionEncoder` model (`#(1)`). Meanwhile, `captions` is the tensor representing a sequence of tokenized words, which in this case I initialize as random numbers ranging between 0 and 10000 (`VOCAB_SIZE`) with a length of 30 (`SEQ_LENGTH`) (`#(2)`).

We can see in the output below that the `features` tensor initially has a size of 1×512 (`#(1)`). This shape changes to 1×1×512 after the `unsqueeze()` operation (`#(2)`). The additional dimension in the middle (1) allows the tensor to be treated as a feature vector corresponding to a single timestep, which is necessary for compatibility with the LSTM layer. As for the `captions` tensor, its shape changes from 1×30 (`#(3)`) to 1×30×512 (`#(4)`), indicating that every single word is now represented as a 512-dimensional vector.
# Codeblock 7 Output
features original : torch.Size([1, 512])        #(1)
after unsqueeze   : torch.Size([1, 1, 512])     #(2)
captions original : torch.Size([1, 30])         #(3)
after embedding   : torch.Size([1, 30, 512])    #(4)
after concat      : torch.Size([1, 31, 512])    #(5)
after lstm        : torch.Size([1, 31, 512])    #(6)
after linear      : torch.Size([1, 31, 10000])  #(7)
After the pre-inject operation is performed, our tensor now has a dimension of 1×31×512, where the `features` tensor becomes the token at the 0th timestep in the sequence (`#(5)`). See the following figure to better illustrate this idea.
![Figure 3. What the resulting tensor looks like after the pre-injection operation. [3].](https://towardsdatascience.com/wp-content/uploads/2025/02/1RhGAyYwE16KBpEtLGCz5_A.png)
Next, we pass the tensor through the LSTM layer, which in this particular case produces an output of the same size. However, it is important to note that the tensor shapes at lines `#(5)` and `#(6)` in the above output are actually determined by different parameters. The dimensions appear to match here only because `EMBED_DIM` and `LSTM_HIDDEN_DIM` were both set to 512. In general, if we use a different value for `LSTM_HIDDEN_DIM`, then the output dimension will be different as well. Finally, we project each of the 31 token embeddings to a vector of size 10000, which will later contain the likelihood of every possible token being predicted (`#(7)`).
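To make that last step more tangible, here is a small follow-up (my own addition) that turns the decoder output from Codeblock 7 into per-timestep probabilities and greedy token predictions. Since the model is untrained, the predicted IDs are meaningless; this is purely about shapes and mechanics.

# `captions` is the decoder output from Codeblock 7, with shape (1, 31, 10000).
probs = torch.softmax(captions, dim=-1)    # probabilities over the 10,000-word vocabulary
predicted_ids = probs.argmax(dim=-1)       # greedy pick per timestep

print(probs.shape)          # torch.Size([1, 31, 10000])
print(predicted_ids.shape)  # torch.Size([1, 31])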
GoogLeNet Encoder + LSTM Decoder
At this point, we have successfully created both the encoder and the decoder parts of the image captioning model. What I am going to do next is combine them in the `ShowAndTell` class below.
# Codeblock 8a
class ShowAndTell(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = InceptionEncoder(fine_tune=True)    #(1)
        self.decoder = LSTMDecoder()    #(2)

    def forward(self, images, captions):
        features = self.encoder(images)    #(3)
        print(f"after encoder\t: {features.shape}")

        captions = self.decoder(features, captions)    #(4)
        print(f"after decoder\t: {captions.shape}")

        return captions
I think the above code is pretty straightforward. In the `__init__()` method, we only need to initialize the `InceptionEncoder` as well as the `LSTMDecoder` models (`#(1)` and `#(2)`). Here I assume that we are going to perform fine-tuning rather than transfer learning, so I set the `fine_tune` parameter to `True`. Theoretically speaking, fine-tuning works better than transfer learning when you have a relatively large dataset, since it re-adjusts the weights of the entire model. However, if your dataset is rather small, you should go with transfer learning instead. That said, this is just the theory; it is definitely a good idea to experiment with both options and see which works best in your case.

Still in the above codeblock, we configure the `forward()` method to accept image-caption pairs as input. With this configuration, we basically design this method to be used for training purposes only. Here we initially process the raw image with the GoogLeNet inside the encoder block (`#(3)`). Afterwards, we pass the extracted features as well as the tokenized captions into the decoder block and let it produce another token sequence (`#(4)`). In the actual training, this caption output will then be compared with the ground truth to compute the error. This error value is then used to compute gradients through backpropagation, which determines how the weights in the network are updated.
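To make that training idea a bit more concrete, here is a minimal sketch of a single training step. The model is fed dummy inputs here, and both the loss alignment (the output at timestep t predicts caption token t, with the final timestep dropped) and the optimizer settings are my own assumptions rather than something prescribed by the paper.

# A minimal sketch of one training step with teacher forcing (assumed setup).
import torch.optim as optim

model = ShowAndTell()
criterion = nn.CrossEntropyLoss()   # in practice you would likely ignore the <pad> index
optimizer = optim.Adam(model.parameters(), lr=1e-4)

dummy_images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
dummy_captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))

outputs = model(dummy_images, dummy_captions)                 # (1, 31, 10000)
loss = criterion(outputs[:, :-1, :].reshape(-1, VOCAB_SIZE),  # (30, 10000)
                 dummy_captions.reshape(-1))                  # (30,)

optimizer.zero_grad()
loss.backward()
optimizer.step()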
It is important to know that we cannot use the `forward()` method for inference, so we need a separate one for that. In this case, I am going to implement the inference code in the `generate()` method below.
# Codeblock 8b
    def generate(self, images):    #(1)
        features = self.encoder(images)    #(2)
        print(f"after encoder\t\t: {features.shape}\n")

        words = []    #(3)

        for i in range(SEQ_LENGTH):    #(4)
            print(f"iteration #{i}")
            features = features.unsqueeze(1)
            print(f"after unsqueeze\t\t: {features.shape}")

            features, _ = self.decoder.lstm(features)
            print(f"after lstm\t\t: {features.shape}")

            features = features.squeeze(1)    #(5)
            print(f"after squeeze\t\t: {features.shape}")

            probs = self.decoder.linear(features)    #(6)
            print(f"after linear\t\t: {probs.shape}")

            _, word = probs.max(dim=1)    #(7)
            print(f"after max\t\t: {word.shape}")

            words.append(word.item())    #(8)

            if word == 1:    #(9)
                break

            features = self.decoder.embedding(word)    #(10)
            print(f"after embedding\t\t: {features.shape}\n")

        return words    #(11)
Instead of taking two inputs like the previous one, the `generate()` method takes a raw image as its only input (`#(1)`). Since we want the features extracted from the image to be the initial input token, we first need to process the raw input image with the encoder block before actually generating the subsequent tokens (`#(2)`). Next, we allocate an empty list for storing the token sequence to be produced later (`#(3)`). The tokens themselves are generated one by one, so we wrap the entire process inside a `for` loop, which stops iterating once it reaches at most 30 (`SEQ_LENGTH`) words (`#(4)`).

The steps performed inside the loop are algorithmically similar to the ones we discussed earlier. However, since the LSTM cell here generates a single token at a time, the process requires the tensor to be treated a bit differently from the one passed through the `forward()` method of the `LSTMDecoder` class back in Codeblock 6b. The first difference you might notice is the `squeeze()` operation (`#(5)`), which is basically just a technical step performed so that the subsequent layer does the linear projection correctly (`#(6)`). Then, we take the index of the feature vector having the highest value, which corresponds to the token most likely to come next (`#(7)`), and append it to the list we allocated earlier (`#(8)`). The loop breaks whenever the predicted index is a stop token, which in this case I assume sits at index 1 of the `probs` vector (`#(9)`). Otherwise, if the model does not find the stop token, it converts the last predicted word into its 512 (`EMBED_DIM`)-dimensional vector (`#(10)`), allowing it to be used as the input features for the next iteration. Finally, the generated word sequence is returned once the loop is completed (`#(11)`).
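One detail worth pointing out is that the loop above calls `self.decoder.lstm` on a single timestep at a time without passing the hidden and cell states back in, so every call starts from a fresh zero state. If you want each generated token to stay conditioned on the image features and on all previously generated words, a common variant is to carry the `(h, c)` state across iterations. Below is a minimal sketch of that idea; everything apart from the decoder's own layers is my own naming, not something taken from the paper or from the code above.

# A minimal sketch of greedy decoding that carries the LSTM state across steps (assumed variant).
def generate_with_state(encoder, decoder, images, max_len=SEQ_LENGTH, stop_idx=1):
    features = encoder(images)        # (1, 512)
    inputs = features.unsqueeze(1)    # (1, 1, 512)
    states = None                     # defaults to zeros on the first step
    words = []
    for _ in range(max_len):
        out, states = decoder.lstm(inputs, states)     # keep (h, c) for the next step
        logits = decoder.linear(out.squeeze(1))        # (1, 10000)
        word = logits.argmax(dim=1)                    # greedy choice
        words.append(word.item())
        if word.item() == stop_idx:                    # assumed end-of-caption index
            break
        inputs = decoder.embedding(word).unsqueeze(1)  # (1, 1, 512)
    return words

With this version, the image features influence every subsequent step through the carried LSTM state rather than only the first one.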
We are going to simulate the forward pass for the training phase using Codeblock 9 below. Here I pass two tensors through the `show_and_tell` model (`#(1)`), one representing a raw image of size 3×224×224 (`#(2)`) and the other a sequence of tokenized words (`#(3)`). Based on the resulting output, we can see that our model works properly, as the two input tensors successfully passed through the `InceptionEncoder` and the `LSTMDecoder` parts of the network.
# Codeblock 9
show_and_tell = ShowAndTell()    #(1)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)    #(2)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))    #(3)

captions = show_and_tell(images, captions)
# Codeblock 9 Output
after encoder : torch.Size([1, 512])
after decoder : torch.Size([1, 31, 10000])
Now, let's assume that our `show_and_tell` model has already been trained on an image captioning dataset and is thus ready to be used for inference. Look at Codeblock 10 below to see how I do it. Here we set the model to `eval()` mode (`#(1)`), initialize the input image (`#(2)`), and pass it through the model using the `generate()` method (`#(3)`).
# Codeblock 10
show_and_tell.eval()    #(1)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)    #(2)

with torch.no_grad():
    generated_tokens = show_and_tell.generate(images)    #(3)
The flow of the tensors can be seen in the output below. Here I truncate the result because it just repeats the same token generation process up to 30 times.
# Codeblock 10 Output
after encoder    : torch.Size([1, 512])

iteration #0
after unsqueeze  : torch.Size([1, 1, 512])
after lstm       : torch.Size([1, 1, 512])
after squeeze    : torch.Size([1, 512])
after linear     : torch.Size([1, 10000])
after max        : torch.Size([1])
after embedding  : torch.Size([1, 512])

iteration #1
after unsqueeze  : torch.Size([1, 1, 512])
after lstm       : torch.Size([1, 1, 512])
after squeeze    : torch.Size([1, 512])
after linear     : torch.Size([1, 10000])
after max        : torch.Size([1])
after embedding  : torch.Size([1, 512])
.
.
.
.
To see what the resulting caption looks like, we can just print out the `generated_tokens` list as shown below. Keep in mind that this sequence is still in the form of tokenized words. Later, in the post-processing stage, we will need to convert them back into the words corresponding to these numbers.
# Codeblock 11
generated_tokens
# Codeblock 11 Output
[5627,
3906,
2370,
2299,
4952,
9933,
402,
7775,
602,
4414,
8667,
6774,
9345,
8750,
3680,
4458,
1677,
5998,
8572,
9556,
7347,
6780,
9672,
2596,
9218,
1880,
4396,
6168,
7999,
454]
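As a quick illustration of that post-processing step, here is a minimal sketch assuming a hypothetical `idx2word` dictionary that maps each token ID back to its word. In a real project this mapping would come from the same vocabulary used to tokenize the training captions, so with the toy dictionary below every ID simply falls back to `<unk>`.

# Illustrative only: idx2word would normally be built alongside word2idx during preprocessing.
idx2word = {0: "<pad>", 1: "<end>", 2: "<unk>"}   # ...plus the remaining vocabulary entries

def decode_tokens(token_ids):
    words = [idx2word.get(t, "<unk>") for t in token_ids]
    words = [w for w in words if w not in ("<pad>", "<end>")]  # drop special tokens
    return " ".join(words)

print(decode_tokens(generated_tokens))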
Ending
With the above output, we have reached the end of our discussion on image captioning. Over time, many other researchers have tried to make improvements to this task, so I think in an upcoming article I will discuss the state-of-the-art methods in this field.

Thanks for reading, I hope you learned something new today!

_By the way, you can also find the code used in this article here._
References
[1] Oriol Vinyals et al. Show and Tell: A Neural Image Caption Generator. Arxiv. https://arxiv.org/pdf/1411.4555 [Accessed November 13, 2024].

[2] Christian Szegedy et al. Going Deeper with Convolutions. Arxiv. https://arxiv.org/pdf/1409.4842 [Accessed November 13, 2024].

[3] Marc Tanti et al. Where to put the Image in an Image Caption Generator. Arxiv. https://arxiv.org/pdf/1703.09137 [Accessed November 13, 2024].

[4] Stepan Ulyanin. Captioning Images with CNN and RNN, using PyTorch. Medium. https://medium.com/@stepanulyanin/captioning-images-with-pytorch-bc592e5fd1a3 [Accessed November 16, 2024].

[5] Saketh Kotamraju. Build an Image-Captioning Model in Pytorch. Towards Data Science. https://towardsdatascience.com/how-to-build-an-image-captioning-model-in-pytorch-29b9d8fe2f8c [Accessed November 16, 2024].

[6] Code with Aarohi. Image Captioning using CNN and RNN | Image Captioning using Deep Learning. YouTube. https://www.youtube.com/watch?v=htNmFL2BG34 [Accessed November 16, 2024].