
00:00:00 Hello everyone, this is Ashwin here. In this video we are going to see how to create an image caption generator using the Flickr dataset. This is a deep learning project; I am going to use Keras with TensorFlow as the tech stack. The objective is to build an image captioning generator: we will have a random image, and our model will look at the image and give us a caption. Let's say we have an image of a boy walking; the model will extract the features from the image and

00:00:35 try to generate a caption like "a boy is walking" or "a boy is running", something similar to that. So that is the objective of this project. We will use a CNN for image feature extraction and an LSTM for natural language processing, so we'll be doing a lot of preprocessing steps for both the image and the text data. Now, without further ado, let's start the project. First, let's import the modules. This session is enabled with a GPU; we need a GPU for faster processing of the image data. Now let's import the basic modules.

00:01:21 We don't need pandas here because I'm going to get the data from Kaggle, so we just need the images and the text, which keeps things easy. Apart from that, we will import all the other basic modules: import os, import pickle. os is for handling the files; pickle is for storing some NumPy features. Say we are extracting image features; we have to store them somewhere, or else we would have to redo the entire process again. And numpy, our basic module for every project. Apart from that we have tqdm:
00:02:05 from tqdm.notebook import tqdm — okay, it's tqdm, not dqdm. This gives us a progress UI showing how much data has been processed so far, which is very helpful for estimating the overall progress. Now we are going to import a lot of TensorFlow and Keras modules. I'm going to use from tensorflow.keras.applications.vgg16 import VGG16 and preprocess_input. VGG16 is for extracting the features from the image data, and preprocess_input is for preprocessing the image data for the VGG model,

00:03:08 since we have to do the preprocessing corresponding to the model we are going to use. I'm using tensorflow.keras because it doesn't show any errors; sometimes the recent versions of standalone Keras show errors when importing modules, so currently this is the more reliable way of doing it, which is why I'm doing it like this. Next we will import some more preprocessing: from tensorflow.keras.preprocessing.image import load_img and img_to_array. Apart from that, we have to

00:04:00 preprocess the text. For that: from tensorflow.keras.preprocessing.text import Tokenizer. Then we copy the whole line again; instead of text we will use sequence: pad_sequences. pad_sequences evens out the whole text representation of the features. Say some sentences have 10 words and some have five or six words; we have to even them out, so we use pad_sequences to fill out the remaining length with zeros.
00:04:50 After this, paste the import line again; instead of preprocessing, I'll say models: from tensorflow.keras.models import Model. Other than that, we need to import some utilities: from tensorflow.keras.utils import to_categorical and plot_model. plot_model gives us a clear representation of the whole model as an image, so it is easy to see the architecture of our model. After this we only need to import one more thing, the layers: from tensorflow.keras.layers import Input, Dense, LSTM, Embedding,

00:06:05 Dropout, and add. These are the modules I'm going to import. Okay, I just made a mistake here; fix it and run this again. Now we have imported all the modules.
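Collected in one place, the import cell described above might look like this (a sketch, assuming the TensorFlow 2.x module layout the video uses):

```python
import os
import pickle
import numpy as np
from tqdm.notebook import tqdm

from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
```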
Next we have to set the base directory and the working directory. For that, BASE_DIR = '/kaggle/input/flickr8k'; this is the Kaggle input directory we are currently in, and later we will join its subfolders to this path in order to get the data. Apart from that we will have a working

00:07:07 directory: WORKING_DIR = '/kaggle/working'. That's it; it will be easy to access our files using these variables. Now I'm
going to extract the image features. First, load the VGG16 model: model = VGG16(). Then we have to restructure the model: model = Model(inputs=model.inputs, outputs=model.layers[-2].output). model.inputs is the image input we are talking about, and for the outputs we take the output of the second-to-last layer. What we are doing here is dropping the fully connected prediction layer of this

00:08:33 VGG16 model; we only need the layers before it in order to extract the feature vectors. That's why I drop the last layer, take the output of the layer before it, and assign it to outputs. This will be our restructured model. After that we summarize: print(model.summary()). Now it's downloading the weights, and here you can see we are leaving out one layer: looking at the end of the summary, it's this 'predictions' layer we drop. We don't need it; we just need

00:09:28 the feature extraction part alone, so I'm just leaving that layer out. Now run this. So this is the model we are going to use in order to extract features from the images.
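A sketch of the directory setup and the restructuring step just described (the exact Kaggle dataset path is an assumption):

```python
BASE_DIR = '/kaggle/input/flickr8k'   # assumed mount point of the Flickr8k dataset
WORKING_DIR = '/kaggle/working'

# load VGG16 and drop the final 'predictions' layer,
# keeping the 4096-dimensional output of the layer before it
model = VGG16()
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
print(model.summary())
```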
Now that's done. Next, extract features from each image. Here I'm going to create a dictionary so we have key-value pairs: the key will be the image ID and the value will be the features. So features = {}. Now we have to get the directory for the images: directory = os.path.join(BASE_DIR,

00:10:25 'Images'). This is the directory containing all the images. Now we iterate through them: for img_name in tqdm(os.listdir(directory)): this gives us each image name. Now, load the image from file: the path is directory + '/' + img_name. You could also use os.path.join; both give the same result. I'm just concatenating the image name to get the whole file path. You could call the variable file_path,

00:11:36 but it's more appropriate to call it img_path. Now let's load the image: image = load_img(img_path, target_size=(224, 224)); those are the width and height dimensions I'm setting, so the image is resized as it is loaded. Next, convert the image pixels to a NumPy array: image = img_to_array(image). Now we have to reshape the data for the

00:12:33 model in order to extract the features. Reshape data for the model: image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2])). It's an RGB image, which is why there are three dimensions, and the leading 1 makes it a single-sample batch. Now let's prepare the image for the VGG model — preprocess image for VGG: image = preprocess_input(image). That's done; let's extract the features.

00:13:45 Extract features: feature = model.predict(image, verbose=0), so it won't display status text and the output stays clean. After extracting the features, we get the image ID: image_id = img_name.split('.')[0]. I'm splitting by the dot because, if I show you the complete image name here, splitting on the dot removes the .jpg; we just need the zeroth index, so I take index

00:14:52 zero to get the image ID; these numbers are the image ID. When everything is done, store the feature: features[image_id] = feature. I think that's pretty much it; this will extract features for every image present in our dataset, which will take some time. Okay, an error: 'tuple object is not callable' — image.shape is a tuple and has to be indexed with square brackets rather than called; fix that and run again. Okay, now it has started processing. You can see we have around 8k images; that is the Flickr8k dataset,
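The whole extraction loop, as sketched from the steps above:

```python
# extract features from each image into an {image_id: feature} dictionary
features = {}
directory = os.path.join(BASE_DIR, 'Images')

for img_name in tqdm(os.listdir(directory)):
    # load the image from file at VGG16's expected input size
    img_path = directory + '/' + img_name
    image = load_img(img_path, target_size=(224, 224))
    # convert image pixels to a numpy array
    image = img_to_array(image)
    # reshape into a single-sample batch for the model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # preprocess the image for VGG
    image = preprocess_input(image)
    # extract features without printing progress text
    feature = model.predict(image, verbose=0)
    # image ID is the file name without the .jpg extension
    image_id = img_name.split('.')[0]
    features[image_id] = feature
```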

00:15:57 and it has to extract features for all of the images. We are getting around 15 iterations per second; with CPU only it would be one or two images per second, which is why you definitely want the GPU, as you can see here. This will take a while, so in the meantime we'll complete the remaining steps. Once you have extracted the features, you don't want to lose them, or else you would have to re-extract everything, which takes a long time. So we store them in a file: store features in pickle.

00:16:39 I'm going to dump this features dictionary: pickle.dump(features, open(os.path.join(WORKING_DIR, 'features.pkl'), 'wb')). Here I specify the file name, features.pkl, and pass the mode argument 'wb', that is, write binary. This dumps the dictionary into a pickle file, and you can reload it again with another snippet, which I'll show here: load features from pickle. This will definitely save your time; make sure you do this in all of your deep learning projects, because

00:17:47 it will save a ton of time. Now: with open(os.path.join(WORKING_DIR, 'features.pkl'), 'rb') as f: — the same os.path.join, with 'rb', read binary, the mode we read in — and then features = pickle.load(f). This reloads all the features from the pickle file. So that is the syntax for storing data in a pickle and reloading it from the file. That's done; we will use it once the extraction completes.
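A sketch of the save/reload pair (the file name features.pkl is reconstructed from the audio):

```python
# store features in pickle so they can be reused without re-extracting
pickle.dump(features, open(os.path.join(WORKING_DIR, 'features.pkl'), 'wb'))

# load features from pickle
with open(os.path.join(WORKING_DIR, 'features.pkl'), 'rb') as f:
    features = pickle.load(f)
```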
Next we have to load the captions data. Again, with open(...) as f:

00:18:58 the same os.path.join, but here instead of features.pkl we say captions.txt, and instead of the working directory we use the base directory, with the read option 'r'. Then next(f); I'll tell you why I'm using next. Let me close this list and open the captions file: in this text file we have a header line, 'image,caption'. We don't need this first line; that's why I'm using next here, so it skips that line and we can proceed further. Now I will say captions_doc

00:19:57 = f.read(), so it just reads the whole text data. That's done. Meanwhile we just have to wait until the feature extraction completes; it's still about halfway. Now, similar to what we did for the features, where each image ID has its corresponding feature, we'll do the same for the captions: each image ID will have its corresponding captions. So: create mapping of image to captions. For this I'm going to create a dictionary called mapping: mapping = {}.

00:20:48 Now we process the lines: for line in tqdm(captions_doc.split('\n')): — okay, I hadn't loaded captions_doc yet. I'm taking the text line by line; splitting on the newline '\n' gives one line at a time. Each line contains the image and the caption, so I split the line by comma: tokens = line.split(','). tokens[0] will be the image ID and tokens[1] will be the caption.
00:21:56 Now: if len(line) < 2: continue. If some lines contain only a single character or something like that, it may be an error; this statement avoids it. If your captions file is clean you don't need this; it's optional. Now we unpack it: image_id, caption = tokens[0], tokens[1] — tokens[0] is the image ID and tokens[1] is the caption. Next we have to remove the extension from the image ID, which still carries the .jpg.

00:22:54 So: remove extension from image ID: image_id = image_id.split('.')[0]; we already did this before. Also, if splitting by comma produced multiple tokens (the caption itself can contain commas), the caption part is a list of strings; so we specify tokens[1:], which takes everything from index one onward, and afterwards we convert this list back into a string: convert caption list to string. I think the feature extraction is done now; store the pickle and let's check.

00:24:08 See, the features.pkl file is there; you can download it now and reuse it later, which is easier. If you want, you can run the reload snippet; I'll just leave it as it is. Now run the captions cell so we have captions_doc. Let's double-check by printing captions_doc: this is the text data we have, and we split it by '\n'. Okay, the output is big; I'll clear it. So splitting by '\n' gives us the lines one by one. Back to where we left off: convert the caption list to a string, which will be

00:24:53 caption = ' '.join(caption). Now that's done. Next, an image can have multiple captions; I'll show you: here these rows all carry the same image name but different captions. We'll store all the captions of an image in a list, which makes the mapping and preprocessing easy. Create the list if needed: if image_id not in mapping: mapping[image_id] = list(). If a list already exists for the corresponding image ID, we just append to it:

00:25:55 mapping[image_id].append(caption) — store the caption. That's done; let's run this. Okay, it loaded very quickly. Now let's see how many images we have: len(mapping). We have all the 8k images, so that's done.
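A sketch of the caption loading and mapping steps just walked through (captions.txt per the Kaggle Flickr8k layout):

```python
# load the captions data, skipping the 'image,caption' header line
with open(os.path.join(BASE_DIR, 'captions.txt'), 'r') as f:
    next(f)
    captions_doc = f.read()

# create mapping of image to captions
mapping = {}
for line in tqdm(captions_doc.split('\n')):
    tokens = line.split(',')
    if len(line) < 2:
        continue                             # skip malformed or empty lines
    image_id, caption = tokens[0], tokens[1:]
    image_id = image_id.split('.')[0]        # remove the .jpg extension
    caption = " ".join(caption)              # caption may itself contain commas
    if image_id not in mapping:
        mapping[image_id] = []
    mapping[image_id].append(caption)

len(mapping)   # around 8k images
```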
Now we have to preprocess the captions. For that I'm going to create a function called clean: def clean(mapping):, then for key, captions in mapping.items():. We pass the mapping as an argument, and I iterate over

00:27:13 each key, that is, the image, and its corresponding captions. Then: for i in range(len(captions)): caption = captions[i] — take one caption at a time. Now the preprocessing steps; I'll label this 'preprocess text data'. First, caption = caption.lower(). After that, caption = caption.replace(...), where I pass a regex that excludes everything except

00:28:32 A-Z — you don't strictly have to mention the capital range if you've already lowercased, but this is the standard way; without lower() you would need it — and a-z. So everything except alphabetic characters is removed: I'm converting everything to lowercase and deleting all the digits and special characters in one preprocessing line. After that, caption = caption.replace('\s+', ' '): if there are multiple consecutive spaces, we

00:29:24 replace them with a single space. I'll also add the comments: convert to lowercase; delete digits, special characters, etc. (we're not replacing them with anything, just deleting); and delete additional spaces. Now that's done, and the last preprocessing step is for our model: as we usually do for sequence-to-sequence models, add start and end tags to the caption. So caption = the start tag, which marks where the caption begins,

00:30:45 then a space, then our caption, then another space and the end tag. The start and end tags are very helpful for telling the model when to start and when to stop: once we get the end tag, we simply stop the prediction. That's how it works. Next, in the caption I'm also going to remove words with very few characters, stray one-letter fragments; I'm going to remove single

00:31:32 characters from the caption. Say we have 'a little girl is walking': stray single-letter tokens aren't needed; we don't want those tiny words. So instead of keeping the caption as is, we will have ' '.join([word for word in caption.split() if len(word) > 1]): I split by spaces, check the condition len(word) > 1, keep each word that satisfies it, and join the result back into one string; the start and end tags are added around this string as well.

00:32:22 After doing this we assign captions[i] = caption, replacing the caption at the corresponding index. So this is the function that does the preprocessing; let's run it.
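A sketch of clean(). Two notes: the transcript describes a regex replacement, and since Python's str.replace does not interpret regular expressions, re.sub is used here to get the described behavior; and the plain-word tags startseq/endseq that the video settles on later (around 01:27:36) are used from the start:

```python
import re

def clean(mapping):
    for key, captions in mapping.items():
        for i in range(len(captions)):
            # take one caption at a time
            caption = captions[i]
            # convert to lowercase
            caption = caption.lower()
            # delete digits, special characters, etc. (keep letters and spaces)
            caption = re.sub(r'[^a-z ]', '', caption)
            # delete additional spaces
            caption = re.sub(r'\s+', ' ', caption)
            # drop single-character words and add start/end tags
            caption = 'startseq ' + ' '.join(w for w in caption.split() if len(w) > 1) + ' endseq'
            captions[i] = caption
```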
Now, as an example, we'll take one mapping before preprocessing. I'll label it 'before preprocess of text' and print mapping[image_id] for one image ID; I'll go to the captions, take the first image ID, and paste it here. Run this: these are the captions we have for that image.

00:33:19 Now preprocess the text: I call the clean function with the mapping. That's done; now let's look at 'after preprocess of text': copy and paste the same cell and fix the label (it's 'preprocess', not 'pro process'; spelling mistakes happen). Okay, now you can see the preprocessed text: there are no special characters, no single letters (the stray 'a's and such are removed), and we have the start tag and the end tag, which makes the further processing easier. So this is the

00:34:19 preprocessing of the text data, and that's done. After this we are going to create a tokenizer and get the vocab size. We'll use the tokenizer to get the index of each word, which is what the model works with. First I'm going to store all the captions in a single list, all_captions: for key in mapping: for caption in mapping[key]: (you can write mapping or mapping.keys(); both are the same) all_captions.append(caption).
00:35:14 This is the full list of every caption available. Run it; it's quick. I'll check the length of the list just to get an idea of what we're dealing with: we have around 40,000 captions, roughly five captions per image, so that's good. Now let's check the first 10: all_captions[:10]. These are the first 10 captions, just as an example. Now let's start the processing: tokenize the text. Here I'm going to initialize the

00:36:08 tokenizer: tokenizer = Tokenizer(), then tokenizer.fit_on_texts(all_captions). From that we get the vocabulary size: vocab_size = len(tokenizer.word_index) + 1, the total number of unique words plus one. Run this and check vocab_size: we have around 8,483 words; that's good. Next we have to get the maximum caption length: get maximum length of the captions available. This will be useful for padding the

00:37:26 sequences and for building the model, so it will be helpful for us. I'll call it max_length: max_length = max(len(caption.split()) for caption in all_captions). Let's display it: the maximum caption length is 35. We'll use this variable for the pad sequences and in the model as well. Now that's done.
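Sketched out:

```python
# gather every caption into one flat list
all_captions = []
for key in mapping:
    for caption in mapping[key]:
        all_captions.append(caption)

# tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1            # around 8,483 in the video

# longest caption, used for padding and for the model's input shape
max_length = max(len(caption.split()) for caption in all_captions)   # 35 in the video
```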
Let's create a markdown cell for the train-test split. We have around 8,000 images; you can usually process all of them at once only if you have a big amount of RAM. On Kaggle we get around 13 GB of RAM, and when I ran this without any data generator and without batching the data, the session crashed. So what I'm going to do is create a separate function that feeds inputs batch by batch, so it won't crash. This will be very helpful for you; if you do have a large amount of memory, you don't have to do this and things stay simpler.

00:39:10 And, yeah, let's do that. First we split the train and test data. I get the image IDs: image_ids = list(mapping.keys()). After that we need a split point: split = int(len(image_ids) * 0.90); I'm taking 90% as training data. Let's look at the split: 7,281 images will be used for training and the rest for testing. Now train = image_ids[:split] and test = image_ids[split:]. That is the train-test split; run this.
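As a sketch:

```python
image_ids = list(mapping.keys())
split = int(len(image_ids) * 0.90)   # 90% of the images for training
train = image_ids[:split]
test = image_ids[split:]
```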

00:40:25 Okay, that's done. Now we're going to create a data generator that fetches images and captions in batches, so they're easy to load into the model for training; otherwise everything is loaded into memory at once and the session will most likely crash. This helps users who have less than 16 GB of memory; if you have around 32 GB you don't have to do this. I'll note it as: create data generator to get data in batches (avoids session crash).

00:41:23 Okay. def data_generator(...): we pass multiple parameters here. First the data_keys, which will be the image IDs, then mapping, features, tokenizer, max_length, vocab_size, and batch_size. Using these we'll do a lot of the processing here, including some of the preprocessing such as padding the sequences. First, loop over the images: I create the lists X1, X2, y = list(), list(), list(); everything starts empty. Then n = 0; this is useful for

00:42:36 determining whether we've reached the batch size or not. Now while 1: — an infinite loop — and for key in data_keys: (data_keys will be train, the image IDs of the training split). Each time we take a new item I increment n += 1, and then get the captions: captions = mapping[key]. Now for caption in captions: I take each caption — process each caption — and encode the sequence: seq = tokenizer.

00:43:51 texts_to_sequences — is the editor suggesting it or not? — texts_to_sequences([caption])[0]: I pass the single caption inside a list and take index 0. This encoding assigns each word its index, so we get the caption as a list of integer indices. After that I'll split the sequence into X, y pairs, where X is the input and y is the output. Let me explain how this step works with an example string. Okay, this one is short; I'll just

00:44:55 copy and paste it here. So this is the sequence we have. Our objective is to derive inputs and outputs: for each position in the sequence, the prefix is X and the following word is y. Initially the input holds only the start token; based on the start token alone, the next word, 'girl', is the output we want to predict, so y will be 'girl'. After that we take the start token plus 'girl' as the input and the next word as the output, that is,

00:45:44 'going'. After that again: start, 'girl', 'going', then 'into', and so on; that's how the sequence flow works. Initially we have only the start token; using this text sequence together with the corresponding image features, we predict the next word. Then we concatenate the input and the predicted word and use that as the next input sequence, and again predict the next word, 'going'; then concatenate everything, 'girl going', and

00:46:31 predict 'into'. This process goes on, and finally, when the prediction is the end token, we stop. We end up with 'girl going into wooden building'; that is our answer. That's how the whole sequence splitting works. Now let's code it: for i in range(1, len(seq)): — split into input and output pairs — in_seq, out_seq = seq[:i], seq[i]. First we take only one word,

00:47:41 the start token, and then seq[i], the next word; say i is 1, then 'girl' is seq[i], and that is the output. The process keeps going until the end of the sequence. So those are the in and out sequences. Next we pad the input sequence, because we need a common length — pad input sequence: in_seq = pad_sequences([in_seq], maxlen=max_length)

00:48:32 [0]. Here too we take the zeroth index: the function returns extra structure, a batch of results, and we just need our single padded sequence. If you want, you can print the intermediate values in between to see what the outputs look like. Then we encode the output sequence — this is one-hot encoding: out_seq = to_categorical([out_seq],

00:49:17 num_classes=vocab_size)[0]. Here also we take index zero. Why zero? Because we pass the input as a list, the function returns a list of padded sequences; if you pass multiple sequences, say in_seq one and in_seq two, you get a list of padded sequences back. We are passing just one, so we take just one; that's why all these [0]s are here. As I said before, you can

00:49:59 just print it and see. to_categorical converts the output sequence into a one-hot encoding: the position of the target word is set to 1. We have a lot of classes, around 8,000, so we get about 8,000 columns. After this we store the sequences: X1.append(features[key][0]), the image features, and X2.append(in_seq), the input

00:51:03 sequence. So we have two inputs, the image features and the text features, and the output will be out_seq: y.append(out_seq). That's the whole step. Now: if n == batch_size: once we have processed that many samples, we convert to arrays, X1 = np.array(X1), because the model can't consume a plain Python list. We convert the others as well; we can do it in a single line, which is easier: X1, X2, y =

00:51:55 np.array(X1), np.array(X2), np.array(y). That's done. After that, yield: this hands the collected samples to the caller so the model can consume them. I yield [X1, X2], y — the two inputs we have, plus the output. After that we have to re-initialize, because we don't need the old samples (or we would exhaust the memory): the same initialization again, and set n back to 0. That's done; this is the whole data generator.
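The whole generator, sketched from the walkthrough above:

```python
def data_generator(data_keys, mapping, features, tokenizer, max_length, vocab_size, batch_size):
    # collected inputs (image features, text prefixes) and targets
    X1, X2, y = list(), list(), list()
    n = 0
    while 1:                                   # loop over the images indefinitely
        for key in data_keys:
            n += 1
            captions = mapping[key]
            for caption in captions:           # process each caption
                # encode the caption as a list of word indices
                seq = tokenizer.texts_to_sequences([caption])[0]
                # split one sequence into several (prefix -> next word) pairs
                for i in range(1, len(seq)):
                    in_seq, out_seq = seq[:i], seq[i]
                    # pad the input prefix to the common length
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                    # one-hot encode the target word
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    X1.append(features[key][0])    # image features
                    X2.append(in_seq)              # text prefix
                    y.append(out_seq)              # next word
            if n == batch_size:
                X1, X2, y = np.array(X1), np.array(X2), np.array(y)
                yield [X1, X2], y
                # re-initialize so the old samples don't exhaust memory
                X1, X2, y = list(), list(), list()
                n = 0
```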

00:52:56 Because of the pad_sequences call, we do the padding inside the data generator, which keeps things clean, and I guess that's pretty much it. You can trace through each input if you want to understand it very clearly. We're not displaying anything at this point, since everything lives inside the function; the results go straight to the model to consume. Speaking of the model, we have to create one. Okay, let's run this. Next: model creation. Now we'll be concatenating the two feature streams to create the model, and we will

00:53:48 plot the model so we can clearly see what it looks like. inputs1 = Input(shape=(4096,)): this is the shape of the features we get; if you check the VGG model output you'll see it is 4096, which is what we consume here, so that is the first input. Then fe1 = Dropout(0.4)(inputs1), and fe2 = Dense(256, activation='relu')(fe1);

00:55:04 so inputs1 is what we pass in and fe1 is the output we get; we pass that along and fe2 is its output. This is the image feature part of the encoder. After that I'll add the sequence feature layers ('layers' is the better word, yes). inputs2 = Input(shape=(max_length,)); I already told you max_length would be helpful. Then se1 = Embedding(vocab_size, 256,

00:56:06 mask_zero=True) — the editor isn't auto-suggesting, but it's mask_zero=True, because we are padding the sequences with zeros, so we have to set mask_zero. The embedding dimensions are based on the vocab size, and we pass inputs2 into it. Then se2 = Dropout(0.4)(se1); you could also set it between 0.4 and 0.5, which helps avoid overfitting. That's done, and we need an LSTM layer: se3 = LSTM(256)(se2). These are the sequence layers, and all of this together can be considered the
00:57:10 encoder part; after this comes the decoder. Here we concatenate the image and text features: decoder1 = add([fe2, se3]), merging the two feature vectors. Then decoder2 = Dense(256, activation='relu')(decoder1), passing decoder1 as the input. Finally, the outputs: outputs = Dense(vocab_size, ...) — remember we converted the targets into a categorical vocabulary encoding — with the activation

00:58:20 sorry, activation='softmax', which is what we use for categorical outputs, applied to decoder2, the input to this output layer. That's done. The last step is model = Model(inputs=[inputs1, inputs2], outputs=outputs): this model takes two inputs and produces one output. After that we set the loss and optimizer: model.compile(loss='categorical_crossentropy', optimizer='adam'). That's done; finally, plot the model.
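A sketch of the model definition:

```python
# encoder: image feature branch
inputs1 = Input(shape=(4096,))                 # VGG16 feature size
fe1 = Dropout(0.4)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# encoder: text sequence branch
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.4)(se1)
se3 = LSTM(256)(se2)

# decoder: merge both branches and predict the next word
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')

plot_model(model, show_shapes=True)
```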

00:59:42 plot_model(model, show_shapes=True) gives us an image; you can also save it to a file if you like. Let's run it with show_shapes on. Okay, now we have the model diagram. This branch is for the image: as you can see, an input layer, a dropout layer, and a dense layer. This branch is for the text: an input layer, embedding, and dropout feeding the LSTM. We then merge both of these into a single dense layer and process it,
01:00:39 and this 8,483-wide output corresponds to the categorical one-hot encoding we did; whichever word gets the highest probability is the word we take. That's how it processes. Note that we already extracted the image features up front, which is why there is no CNN inside this model; otherwise we would have put some CNN in the graph to extract the image features instead of running VGG separately, giving a combined CNN-LSTM model. That is also possible. That's done; let's train the model.

01:01:22 If you don't use the data generator, training runs quickly but takes up a lot of RAM; with a data generator you don't need as much RAM, but the training process takes more time. I'll set epochs = 15; you can raise the score by increasing the epochs, since this model needs a lot of training time. I'll set the batch size to 64, a reasonable amount. Then steps = len(train) // batch_size; after each step the model

01:02:11 does its backpropagation and fetches the next batch, which is exactly what we use the generator for. Now, for i in range(epochs): create the data generator: generator = data_generator(...), the function we created before, passing train, mapping, features, tokenizer, max_length, vocab_size, and finally the batch size. We have the generator; now we fit for one epoch: model.fit(generator, ...); from the generator we get the inputs, X1 and X2, and the target y. I set epochs=1,

01:03:21 steps_per_epoch equal to the steps we computed above, and finally verbose=1. This won't show any validation accuracy, because we don't have a validation generator; we'll do the testing and validation after training the model. After this you can also save the model if you want.
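The training loop and the save, sketched:

```python
epochs = 15
batch_size = 64
steps = len(train) // batch_size

for i in range(epochs):
    # create a fresh data generator for each epoch
    generator = data_generator(train, mapping, features, tokenizer,
                               max_length, vocab_size, batch_size)
    # fit for one epoch at a time
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)

# save the model so it can be reloaded later
model.save(WORKING_DIR + '/best_model.h5')
```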
Currently I'm going to run this, and it will take quite a while: as you can see, each epoch takes around one minute, so we have to wait through 15 epochs. If you want to save the model afterwards,

01:04:20 you can say model.save('best_model.h5'); that saves the model. I'll comment out the save for now. That's done; we'll come back once the training completes. We still need to create a few more functions for predicting captions for the test images, and we'll also plot each image with its actual captions and the predicted caption, which will be helpful. Welcome back, everyone; we have completed our training with the 15 epochs. As you

01:05:13 can see, the loss has decreased significantly; you can train further if you want to improve your score, and we can check the current quality through predictions. If you want to save the model, as I said, just run this cell and your model will be saved. Okay, I think you have to specify the working directory as well; let's add it: WORKING_DIR + '/' before the file name. That's done; yes, you can see we have saved our best model here. If you want to reload your model, you can use this file

01:06:07 to reload it. Next: generate captions for an image. For this we have to create a few functions. First we have to convert an ID back into a word, that is, index to word: def idx_to_word(integer, tokenizer). Every word we get out of the model is an index, and we have to transform that index into a word. So: for word, index in tokenizer.word_index.items(): if index == integer:

01:07:16 return word. I'm just iterating through all the words available in the tokenizer; if an index equals the integer we predicted from the model, we return that particular word, otherwise the function returns None. Run this; that's our helper function. Now, generate caption for an image: this is the function we'll use to generate the caption. def predict_caption(model, image, tokenizer, max_length): these are the attributes we need.

01:08:08 We have to begin with the starting tag: in_text = 'startseq' — add the start tag for the generation process, because every prediction has to begin from the start tag; that's our input text. We then iterate up to the maximum sequence length; we have a max_length of 35, so we iterate over that: for i in range(max_length):. First we convert the current text sequence into integers,

01:09:08 so: encode the input sequence — we've covered this already: sequence = tokenizer.texts_to_sequences([in_text])[0] (the editor is again not auto-suggesting; I don't know what's happening). That's done. Pad the sequence: sequence = pad_sequences([sequence], max_length). Then, predict the next word: yhat = model.predict(...); we need two inputs,

01:10:30 the image features and the sequence: yhat = model.predict([image, sequence], verbose=0); we don't want it printing anything. This gives us a probability distribution over the ~8,000 vocabulary columns, and we want the index with the highest probability — get index with high probability: yhat = np.argmax(yhat). np.argmax gives the index of the maximum value; I pass yhat in and get the winning index back in yhat,

01:11:32 and we convert the index to a word by calling the function we created before: word = idx_to_word(yhat, tokenizer), passing the integer and the tokenizer. That's done; we have the word. Then, stop if the word isn't found: if word is None: break. Otherwise append it as input for generating the next word: in_text += ' ' + word — the predicted word becomes part of the next input, so the sequence keeps

01:12:46 getting updated. After that, also stop if we reach the end tag: if word == 'endseq': break. If the whole process completes, we return in_text; that is our caption. Run this; it doesn't throw any error.
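Both helper functions, sketched; note the [0] that triggers the 'data cardinality is ambiguous' error later in the video (around 01:18:32) is already omitted from the pad_sequences call here:

```python
def idx_to_word(integer, tokenizer):
    # map a predicted index back to its word
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def predict_caption(model, image, tokenizer, max_length):
    in_text = 'startseq'                 # every generation starts from the start tag
    for i in range(max_length):
        # encode and pad the text generated so far
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], max_length)
        # predict the next word from image features plus the text so far
        yhat = model.predict([image, sequence], verbose=0)
        yhat = np.argmax(yhat)           # index with the highest probability
        word = idx_to_word(yhat, tokenizer)
        if word is None:                 # stop if the index isn't in the vocabulary
            break
        in_text += ' ' + word            # append word as input for the next step
        if word == 'endseq':             # stop once we reach the end tag
            break
    return in_text
```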
Now let's run validation on the test data; we also need to compute a BLEU score for this. I'll label it 'validate with test data'. For this I'm going to create actual and predicted as two separate lists: actual holds the real

01:13:51 captions, and predicted holds the captions generated by the model. Now, for key in tqdm(test): — the test data we split off before — get the actual captions: captions = mapping[key]. Then predict the caption for the image: y_pred = predict_caption(...), passing the trained model, features[key] (the image features), the tokenizer, and the max_length. Okay, that's done.

01:15:07 So y_pred holds a generated caption as text. Now we add these captions and the prediction to the actual and predicted lists. Before that, we split each actual caption into its words: actual_captions = [caption.split() for caption in captions]. This produces, for each reference caption, an ordered list of its words. After this we append it: actual.append(actual_captions).

01:15:58 That's done. Then predicted.append(y_pred.split()), because we need the individual words in order to do the comparison; that's why I split it right here (you could also split it on a separate line, like we did above with 'split into words'). So I pass y_pred.split() and append it to the list. That's pretty much it; this loops over all of the test data. Finally, calculate the BLEU score; this is the metric to consider whenever we

01:17:14 deal with text data, and it gives us n-gram scores. First print BLEU-1: print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0))) — I think we still have to import the module. Those weights are for 1-grams; similarly, for 2-grams the weights will be (0.5, 0.5, 0, 0). That's done; it will print the results. Before that we have to import it: we could go to the top, or import it right here.

01:18:32 from nltk.translate.bleu_score import corpus_bleu. Okay, that's done; let's run this. It's throwing an error: 'data cardinality is ambiguous; make sure all arrays contain the same number of samples'. Here, on the padded sequence inside predict_caption, we have to remove the [0] index; we don't need it there. Run that, then run this again. Okay, it started running, but I guess it will take a long time to process the test images. While we wait, we'll create functions for visualizing the image and the captions.
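The evaluation loop, sketched:

```python
from nltk.translate.bleu_score import corpus_bleu

# validate with test data
actual, predicted = list(), list()

for key in tqdm(test):
    # actual captions and the model's prediction for this image
    captions = mapping[key]
    y_pred = predict_caption(model, features[key], tokenizer, max_length)
    # split everything into words for BLEU
    actual.append([caption.split() for caption in captions])
    predicted.append(y_pred.split())

# corpus-level BLEU-1 and BLEU-2
print("BLEU-1: %f" % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
print("BLEU-2: %f" % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
```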

01:19:44 So here: visualize the results. For this I'm going to import a few modules: from PIL import Image — this is for loading the image — and import matplotlib.pyplot as plt. That's done. Now we pass in an image: I'll say image_name = '' for now, and then image = Image.open(os.path.join(...)); or we can build the path on a separate line so it's cleaner: image_path = os.path.join of the base
01:20:58 directory, BASE_DIR, then the Images folder, then the image name. That's done, and I pass image_path into Image.open. Let's also pick an image name for this example; I'll take it from the captions file. I think this one is good; just take a random image (if you have your own image you can also pass it here). I'll copy it, including the .jpg. Now we've loaded the image; I'll add the comment: load the image. After this I'll get the mappings, so I will say

01:22:02 captions = mapping[...] — we have to split the name first: image_id = image_name.split('.')[0]; I take just the ID and pass it here: mapping[image_id]. That's done. After this I print a header, 'Actual', then for caption in captions: print(caption). After that we predict: y_pred = predict_caption(model, features[image_id], tokenizer (not auto-suggesting again), max_length), and we will

01:23:38 predict the caption. After this we display the prediction: again I print a 'Predicted' header — okay, that's done — then print y_pred. Finally we plot the image: plt.imshow() of the loaded image. We can also turn all of this into a function; generate_caption seems a fitting name: def generate_caption(image_name): I pass only the image name, since this is just a sample image, and everything else inside remains the same. After this I call the function.
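A sketch of the visualization helper (the printed headers are illustrative):

```python
from PIL import Image
import matplotlib.pyplot as plt

def generate_caption(image_name):
    # load the image
    image_id = image_name.split('.')[0]
    image_path = os.path.join(BASE_DIR, 'Images', image_name)
    image = Image.open(image_path)
    # show the actual captions
    print('Actual:')
    for caption in mapping[image_id]:
        print(caption)
    # predict and show the generated caption
    y_pred = predict_caption(model, features[image_id], tokenizer, max_length)
    print('Predicted:')
    print(y_pred)
    plt.imshow(image)
```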
01:24:50 with the image name; this is the image name I picked. Okay, now we have to wait for the evaluation to complete; it will take some time to get the score. Currently, on the test validation, we are getting a BLEU score of about 0.15 for 1-grams (the higher the score, the better) and 0.08 for 2-grams. I think we need to increase the number of epochs in order to get a good BLEU score. This is the score to use as a metric; otherwise we would have to rely on the visualizations

01:25:40 as the metric, but BLEU is the best one. Let's try increasing the epochs, run it for a while, and compare the results to see whether it improves or stays the same. You want to increase this BLEU score: it ranges from 0 to 1, higher is better, and around 0.30 or 0.40 would be reasonably good compared to this. Let's also visualize a result. For this caption, okay, here you can see it: the actual says 'black dog and spotted dog are fighting', the generated caption says 'two dogs play with each other', and

01:26:33 it just keeps appending 'end' after that, over and over. I think this is why we have the BLEU score issue; the model may not be handling the tag properly. Let's try a different one to check whether that's the case: I'll take this caption's ID, or the whole image name, paste it here, and run. Yes, it has the same issue: 'the girl is in the rainbow dress', then the end tag, repeating the same thing again and again. Let's tweak things a little in the caption generation.

01:27:36 Here, with the start tag and the end tag, I'm facing an issue: even running it a second time, the cleaning step still strips the special characters out of the tags. So what I'm going to do is remove the special characters and make the start tag one entire word, and do the same for the end tag; these become our starting tag and ending tag, the plain words 'startseq' and 'endseq'. Let's run this, then run the mapping: this is the actual text, and after cleaning it comes out like this.

01:28:13 Now we run the rest as it is. Okay, that's done. Here, again, we simply drop the special characters from the start tag so it's a single word, and this is the example output. We run this as is, then create the model again and train it, this time for 20 epochs; this will take some time. In generate captions the function remains the same; only the start tag loses its special characters, and likewise the end tag is now just the word

01:29:05 'endseq', checked after each prediction step. This is our predict_caption function, and after this we can calculate the BLEU score. The visualization remains the same; with generate_caption we'll see how it looks. We'll wait a while until the training and the BLEU score calculation finish. Okay, now we have completed training for 20 epochs and I have saved the model, and this is the BLEU score we got: for 1-grams it's 0.54, which is actually a very good score,

01:30:02 because anything greater than 0.40 is considered a very good score in NLP: we can't realistically predict the exact words, and it's difficult to capture the full context of the sentence being predicted. For 2-grams we are getting around 0.31, also a good score compared to before; earlier we didn't have proper start and end tags, and after rectifying those issues we are getting good results. Now let's run this and generate a caption. You can see this is the actual, and

01:30:43 this is the predicted: 'two dogs are playing tug of war with the blue ball'. That's the caption we predicted, and it's no longer generating extra random tokens after the end, so it's actually giving good results. Now let's try this one: we get 'little girl in green dress is laying in the grass with rainbow balloons in the background'. It's capturing several of the details, like the rainbow in the background and laying in the grass. If you train it more, say more than 50

01:31:23 epochs, you may get quite good results compared to what we have now. Apart from this, let's generate one more; it will be good for comparison. From the captions, okay, this one seems good. Instead of picking entries from the captions file, you can also pick an image ID at random and pass it with the same syntax. Let's pass this one. Here, this is the actual, and the predicted one is 'man in [unclear] backpack is displaying pictures on skis'; it's actually picking up the

01:32:10 skis and everything. Let's run another one. Okay, it's giving similar results, but they actually read well compared to the actual references, and it is from this actual-versus-predicted comparison that we get the BLEU score, so that's good. What you can do is run the training for more than 50 epochs and check whether you get a higher BLEU score than we got here; if you get higher scores, please leave them in the comments, as it will be very helpful for others.

01:32:47 And that's pretty much it. We have done all the steps here: extracting the image features, concatenating them with the text features, and predicting some good results, as you can see from the actual-versus-predicted comparisons. Other than that, if you liked this video, hit the like button, and don't forget to subscribe to the channel for more videos. Stay tuned for the next one.
