As we type “what is the weather” into a search box, we already receive some predictions: certain next words are suggested to follow “the weather”. Next word prediction that adapts to a particular user’s texting or typing can be very useful, because it saves time by learning the user’s patterns. A virtual assistant could also use it to complete certain sentences. Overall, predictive search and next word prediction are fun concepts, and next word prediction is what we will be implementing.
Introduction:
This section covers what the next word prediction model will do. The model takes the last word of a sentence and predicts the most likely next word. We will be using methods from natural language processing, language modeling, and deep learning. We will start by analyzing the data, followed by pre-processing it. We will then tokenize the data and finally build the deep learning model, which is based on LSTMs. The entire code will be provided at the end of the article with a link to the GitHub repository.
Approach:
Datasets of text are easy to find. One source is Project Gutenberg, a volunteer effort to digitize and archive cultural works, to “encourage the creation and distribution of eBooks”. From there we can get many stories, documents, and other text data suitable for our problem statement. The dataset links can be obtained from here. We will use the text from the book Metamorphosis by Franz Kafka. You can download the dataset from here. However, if you have the time to collect your own e-mails and texting data, I would highly recommend doing so. This will be very helpful for a virtual assistant project, because the model will then make predictions that resemble your own style of texting or of composing e-mails.
Pre-processing the Dataset:
The first step is to remove all the unnecessary data from the Metamorphosis dataset. We will delete the text at the start and at the end of the file that is not part of the story, since it is irrelevant to us. The starting line should be as follows:
One morning, when Gregor Samsa woke from troubled dreams, he found
The ending line for the dataset should be:
first to get up and stretch out her young body.
Once this step is done, save the file as metamorphosis_clean.txt. We will open metamorphosis_clean.txt with utf-8 encoding. The next step of the cleaning process replaces the unnecessary extra new lines, the carriage returns, and the Unicode byte-order mark (\ufeff). Punctuation is mapped to spaces. Finally, we will make sure we have only unique words: each word is kept once and any later repetitions are removed. The idea is that this helps the model train without extra confusion caused by repeated words. Below is the complete code for the pre-processing of the text data.
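import string
# Read the cleaned text file line by line.
with open("metamorphosis_clean.txt", "r", encoding="utf8") as file:
    lines = file.readlines()
# Join the lines and strip newlines, carriage returns, and the byte-order mark.
data = ' '.join(lines)
data = data.replace('\n', '').replace('\r', '').replace('\ufeff', '')
# Map every punctuation character to a space.
translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
# The original assigned this result to new_data and never used it;
# here the punctuation-free text is what gets de-duplicated.
data = data.translate(translator)
# Keep only the first occurrence of each word, preserving order.
z = []
for i in data.split():
    if i not in z:
        z.append(i)
data = ' '.join(z)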
Tokenization:
Tokenization refers to splitting larger bodies of text, such as essays or whole corpora, into smaller segments. These smaller segments can be smaller documents or lines of text. They can also be individual words collected into a dictionary.
The Keras Tokenizer allows us to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf. To learn more about the Tokenizer class and text data pre-processing using Keras visit here.
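To make the idea of "each integer being the index of a token in a dictionary" concrete, here is a minimal pure-Python sketch. It is an illustration only, not the Keras implementation: the function names `build_word_index` and `text_to_sequence` and the simple regular expression are assumptions made for this example. It mirrors the general behaviour of assigning lower indices to more frequent words and starting the count at 1.

```python
import re
from collections import Counter

def build_word_index(text):
    """Map each word to an integer, most frequent word first, starting at 1."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    # Stable sort: words with equal counts keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def text_to_sequence(text, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word_index[w] for w in words if w in word_index]

# Toy sentence used only for illustration.
text = "the cat sat on the mat the end"
word_index = build_word_index(text)
sequence = text_to_sequence(text, word_index)
print(word_index)  # 'the' is the most frequent word, so it gets index 1
print(sequence)
```

Index 0 is never assigned to a word here, which matches the convention of reserving 0 for padding.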
We will then convert the texts to sequences. This turns the text data into numbers that a model can work with. We will then create the training dataset. The ‘X’ will contain the training inputs, one word index per sample. The ‘y’ will contain the outputs for the training data, so ‘y’ holds the next word for each input word in ‘X’.
We will calculate the vocab_size by taking the length of tokenizer.word_index and adding 1 to it. We add 1 because 0 is reserved for padding and the word indices start from 1. Finally, we will convert our predictions data ‘y’ to categorical data of the vocab size using to_categorical. This function converts a class vector (integers) to a binary class matrix. This is needed for our loss, which will be categorical_crossentropy. The rest of the code for the tokenization of data, creating the dataset, and converting the prediction set into categorical data is as follows:
Note: Improvements can be made in the pre-processing. You can try different methods to improve the pre-processing step, which may help in achieving a better loss and accuracy in fewer epochs.
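import pickle
import numpy as np
# TensorFlow 2.x import paths, matching the callbacks imports used later.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
# Save the tokenizer for the predict function.
with open('tokenizer1.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)
sequence_data = tokenizer.texts_to_sequences([data])[0]
vocab_size = len(tokenizer.word_index) + 1
# Build (current word, next word) pairs from consecutive word indices.
sequences = []
for i in range(1, len(sequence_data)):
    words = sequence_data[i-1:i+1]
    sequences.append(words)
sequences = np.array(sequences)
# X holds the current word, y the word that follows it.
X = []
y = []
for i in sequences:
    X.append(i[0])
    y.append(i[1])
X = np.array(X)
y = np.array(y)
# One-hot encode the targets for categorical_crossentropy.
y = to_categorical(y, num_classes=vocab_size)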
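As a small worked example of the dataset construction described above, the following pure-Python sketch builds the (current word, next word) pairs and one-hot encodes the targets by hand. The toy sequence and the helper names `make_pairs` and `one_hot` are assumptions for illustration; `one_hot` only imitates what to_categorical produces for a single integer.

```python
def make_pairs(sequence):
    """Return (current word, next word) index pairs from one sequence."""
    return [(sequence[i - 1], sequence[i]) for i in range(1, len(sequence))]

def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    row = [0] * num_classes
    row[index] = 1
    return row

# Toy sequence of word indices (hypothetical, not from the book).
sequence = [1, 2, 3, 4, 1, 5, 1, 6]
vocab_size = max(sequence) + 1  # +1 because index 0 is reserved for padding

pairs = make_pairs(sequence)
X = [current for current, _ in pairs]
y = [one_hot(nxt, vocab_size) for _, nxt in pairs]

print(X)     # the input word of every pair
print(y[0])  # one-hot row for the word that follows X[0]
```

Each row of y has exactly one entry set to 1, at the position of the word that follows the corresponding input word.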
Creating the Model:
We will be building a sequential model. We first add an embedding layer and specify its input and output dimensions. It is important to specify the input length as 1, since each prediction is made on exactly one word and we receive a response for that particular word. We then add an LSTM layer to our architecture. We give it 1000 units and set return_sequences to True, so that its output can be passed through another LSTM layer. The next LSTM layer also has 1000 units, but here we do not need to specify return_sequences because it is False by default. We pass the result through a hidden layer with 1000 units, using a Dense layer with relu as the activation. Finally, we pass it through an output layer of size vocab_size with a softmax activation. The softmax activation ensures that we receive one probability per word in the vocabulary, and that these probabilities sum to 1. The entire code for our model structure is shown below. After we look at the model code, we will also look at the model summary and the model plot.
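from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
model = Sequential()
# One input word per sample, embedded into 10 dimensions.
model.add(Embedding(vocab_size, 10, input_length=1))
# First LSTM returns sequences so it can feed the second LSTM.
model.add(LSTM(1000, return_sequences=True))
model.add(LSTM(1000))
model.add(Dense(1000, activation="relu"))
# One probability per word in the vocabulary.
model.add(Dense(vocab_size, activation="softmax"))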
Model Summary:
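The summary output is not reproduced here, but the number of trainable parameters per layer can be estimated by hand. The sketch below uses the standard parameter-count formulas for embedding, LSTM, and dense layers with biases. The vocabulary size of 3000 is a hypothetical placeholder, not the actual value for this dataset, and the helper names are assumptions for illustration.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

def count_params(vocab_size, embed_dim=10, units=1000):
    """Per-layer parameter counts for the architecture described above."""
    layers = {
        "embedding": vocab_size * embed_dim,
        "lstm_1": lstm_params(embed_dim, units),
        "lstm_2": lstm_params(units, units),
        "dense_hidden": dense_params(units, units),
        "dense_output": dense_params(units, vocab_size),
    }
    layers["total"] = sum(layers.values())
    return layers

# Hypothetical vocabulary size, chosen only to make the arithmetic concrete.
for name, n in count_params(vocab_size=3000).items():
    print(f"{name:>13}: {n:,}")
```

Most of the parameters sit in the two LSTM layers and in the output layer, whose size grows linearly with the vocabulary.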
Callbacks:
The callbacks we will be using for the next word prediction model are shown in the code block below:
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.callbacks import TensorBoard
checkpoint = ModelCheckpoint("nextword1.h5", monitor='loss', verbose=1,
    save_best_only=True, mode='auto')
reduce = ReduceLROnPlateau(monitor='loss', factor=0.2, patience=3, min_lr=0.0001, verbose=1)
logdir = 'logsnextword1'
tensorboard_Visualization = TensorBoard(log_dir=logdir)
We will be importing the 3 required callbacks for training our model. The 3 important callbacks are ModelCheckpoint, ReduceLROnPlateau, and TensorBoard. Let us look at what task each of these individual callbacks performs.
- ModelCheckpoint: This callback saves our model to a file during training. We keep only the best version of the model by specifying save_best_only=True. We monitor our training using the loss metric.
- ReduceLROnPlateau: This callback reduces the learning rate of the optimizer when the monitored metric stops improving. Here, we have specified the patience as 3. If the loss does not improve for 3 epochs, the learning rate is multiplied by a factor of 0.2, down to a minimum of 0.0001.
- TensorBoard: This callback logs data for visualizing graphs, namely the plots for accuracy and loss. Here, we will only be looking at the loss graph of the next word prediction.
We will be saving the best model, based on the loss, to the file nextword1.h5. This file will be crucial when we use the predict function to predict the next word. We will wait for 3 epochs for the loss to improve. If it does not improve, we will reduce the learning rate. Finally, we will be using the TensorBoard callback for visualizing the graphs and histograms if needed.
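The learning-rate logic described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the Keras implementation: it ignores options such as a minimum improvement threshold and cooldown, and the function name `lr_schedule` and the list of loss values are made up for the example.

```python
def lr_schedule(losses, lr=0.001, factor=0.2, patience=3, min_lr=0.0001):
    """Return the learning rate in effect after each epoch's loss is seen."""
    best = float("inf")
    wait = 0  # epochs since the loss last improved
    history = []
    for loss in losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Shrink the learning rate, but never below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical losses: improvement for two epochs, then a plateau.
losses = [1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95]
for epoch, rate in enumerate(lr_schedule(losses), start=1):
    print(epoch, rate)
```

With a starting rate of 0.001 and a factor of 0.2, the first reduction gives roughly 0.0002, and the second would give 0.00004 but is clipped to the minimum of 0.0001.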
Compile and Fit:
Below is the code block for compiling and fitting the model.
from tensorflow.keras.optimizers import Adam
# learning_rate replaces the older lr argument name.
model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=0.001))
model.fit(X, y, epochs=150, batch_size=64, callbacks=[checkpoint, reduce, tensorboard_Visualization])
We compile and fit our model in the final step. Here, we train the model and save the best version to nextword1.h5, so that we do not have to re-train the model repeatedly and can load the saved model when required. I have trained only on the training data. However, you can choose to train with both training and validation data. The loss we have used is categorical_crossentropy, which computes the cross-entropy loss between the labels and the predictions. The optimizer is Adam with a learning rate of 0.001, and we track only the loss, with no additional metrics. Our result is as shown below:
For the prediction notebook, we will load the tokenizer file which we have stored in the pickle format. We will then load the next word model which we have saved in our directory. We will use this same tokenizer to tokenize each input sentence for which we want a prediction. After this step, we can make predictions on the input sentence by using the saved model.
We wrap the prediction step in try and except statements, so that an error while handling an input sentence does not make the program exit the loop. The script keeps running until the user manually chooses to stop it.
Let us have a brief look at the predictions made by the model. This is done as follows:
Enter your line: at the dull
weather
Enter your line: collection of textile
samples
Enter your line: what a strenuous
career
Enter your line: stop the script
Ending The Program.....
This can be tested with the predictions script, which is available in the GitHub repository linked in the next section of the article. As we can see, the model predicts a plausible next word for most lines. Entering the line “stop the script” terminates the entire program. For all other sentences, a prediction is made from the last word of the entered line: we take the very last word of each line and return the next word which has the highest probability.
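The last-word lookup described above can be sketched as follows. This is not the predictions script from the repository: `predict_next_word`, `toy_probs`, and the tiny vocabulary are hypothetical stand-ins, and `toy_probs` plays the role that the trained model would play by returning one probability per vocabulary index.

```python
def predict_next_word(line, word_index, predict_probs):
    """Return the most probable next word for the last word of `line`.

    Returns None when the line is empty or its last word is unknown.
    """
    words = line.lower().split()
    if not words or words[-1] not in word_index:
        return None
    probs = predict_probs(word_index[words[-1]])
    # Index of the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    # Reverse lookup from index back to word; index 0 (padding) has no word.
    index_word = {i: w for w, i in word_index.items()}
    return index_word.get(best)

# Hypothetical vocabulary and a stand-in for the trained model.
word_index = {"dull": 1, "weather": 2, "textile": 3, "samples": 4}

def toy_probs(index):
    table = {
        1: [0.0, 0.1, 0.7, 0.1, 0.1],  # after "dull", "weather" is most likely
        3: [0.0, 0.1, 0.1, 0.1, 0.7],  # after "textile", "samples" is most likely
    }
    return table.get(index, [1.0, 0.0, 0.0, 0.0, 0.0])

print(predict_next_word("at the dull", word_index, toy_probs))
print(predict_next_word("collection of textile", word_index, toy_probs))
```

Returning None for unknown or empty input is one way to keep an interactive loop running without raising an error; a loop around this function would simply check for the “stop the script” line before calling it.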
Note: There are certain cases where the program might not return the expected result. This is to be expected, because each word is considered only once and each prediction is based on a single preceding word. For particular sentences this means you will not receive the desired output. To improve the accuracy of the model, you can consider trying out bi-grams or tri-grams as input. We have only used uni-grams in this approach. A few more additional steps can also be done in the pre-processing. Overall, there is a lot of scope for improvement.
Observation:
We are able to develop a working next word prediction model for the Metamorphosis dataset. We are able to reduce the loss significantly in about 150 epochs. The model which we have developed is fairly accurate on the provided dataset, and the overall quality of the predictions is good. However, certain pre-processing steps and certain changes in the model can be made to improve the predictions.
With this, we have reached the end of the article. The entire code can be accessed through this link. The next word prediction model is now complete and it performs decently well on the dataset. I would recommend building your own next word prediction model using your e-mails or texting data, as this will be better for your virtual assistant project. Feel free to refer to the GitHub repository for the entire code. I would also highly recommend the Machine Learning Mastery website, which is an amazing place to learn more. It was of great help for this project and you can check out the website here. Thank you so much for reading the article and I hope all of you have a wonderful day!