Email: [email protected]
This project focuses on implementing an LLM encoder using parallel distributed computation in the cloud, built on Hadoop MapReduce and deployed on AWS Elastic MapReduce (EMR).
-
This project consists of two parts:
a. The first is a simple word-counter MapReduce program.
b. The second is a MapReduce implementation of an LLM encoder, which is trained over a large corpus of input data and generates vector embeddings of words.
-
Dataset used for training the model: Wikipedia text.
-
For the first part, we run a simple word-count MapReduce job (TokenizationJob) that tokenizes each word and counts its occurrences; the output contains each word, its unique token, and its count. A minimal sketch of such a mapper and reducer is shown below.
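A minimal sketch of a word-count mapper and reducer, assuming the standard Hadoop MapReduce API; class names are illustrative, and the real job also emits the JTokkit token for each word, which is omitted here:

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.jdk.CollectionConverters._

// Emits (word, 1) for every word in an input line.
class WordCountMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.toLowerCase.split("\\W+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
}

// Sums the counts emitted for each word.
class WordCountReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}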
-
For the encoder part, I'm dividing the data into 256MB chunks, where each shard is handled by a mapper and then passed on to a reducer.
a. Each Mapper splits its shard into sentences and encodes each sentence using JTokkit to generate Byte-Pair Encoding tokens (see the sketch after this list).
b. We then create features and output labels from the tokenized sentence by sliding the window one token, so each token becomes the prediction target for the tokens that precede it.
c. E.g.: "We are from UIC" => converted into tokens => [123, 23, 45, 12, 49, 11] => features = [123, 23, 45, 12, 49], labels = [23, 45, 12, 49, 11]
d. The model is trained using these features and labels.
e. Once the model is trained for each shard, we map the input words to the respective vectors obtained from the model.
f. I'm serializing the vector embedding, which is an INDArray, to a TextWritable.
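A minimal sketch of steps a and b, assuming the JTokkit 0.x API in which Encoding.encode returns a java.util.List[Integer]; the encoding type and object name below are illustrative, not the project's actual code:

import com.knuddels.jtokkit.Encodings
import com.knuddels.jtokkit.api.{Encoding, EncodingType}
import scala.jdk.CollectionConverters._

object EncoderMapperSketch {
  // Shared BPE encoder from JTokkit (CL100K_BASE is an assumption; use whichever
  // encoding the project actually configures).
  private val registry = Encodings.newDefaultEncodingRegistry()
  private val encoder: Encoding = registry.getEncoding(EncodingType.CL100K_BASE)

  // Step a: encode one sentence into Byte-Pair Encoding token ids.
  def tokenize(sentence: String): Array[Int] =
    encoder.encode(sentence).asScala.map(_.intValue).toArray

  // Step b: slide the token sequence by one position, so every token becomes the
  // label predicted from the tokens that precede it, e.g.
  // [123, 23, 45, 12, 49, 11] -> features [123, 23, 45, 12, 49], labels [23, 45, 12, 49, 11]
  def featuresAndLabels(tokens: Array[Int]): (Array[Int], Array[Int]) =
    (tokens.dropRight(1), tokens.drop(1))
}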
-
The Reducer receives the sorted data as input and does the following processing:
a. The Reducer first deserializes the input, i.e. the Iterator[TextWritable] it received, into an Iterator[Array].
b. After this conversion, we average all the embeddings of a word into one vector; this reduces noise and gives better contextual understanding (a sketch follows this list).
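A minimal sketch of the reducer-side averaging, assuming (purely for illustration) that each embedding was serialized into the Text value as comma-separated numbers; the real format is whatever the mapper wrote:

import org.apache.hadoop.io.Text
import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.factory.Nd4j

object EncoderReducerSketch {
  // Deserialize one Text value back into an INDArray (comma-separated doubles
  // is an assumed format for this sketch).
  def deserialize(value: Text): INDArray =
    Nd4j.create(value.toString.split(",").map(_.trim.toDouble))

  // Average all embeddings received for the same word into a single vector,
  // which reduces noise across shards.
  def averageEmbeddings(values: Seq[Text]): INDArray = {
    val vectors = values.map(deserialize)
    val sum = vectors.reduce(_ add _)        // element-wise sum of all embeddings
    sum.div(Integer.valueOf(vectors.size))   // element-wise division by the count
  }
}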
** At a high level, this is the view of the second application **

OS: Mac
Running the tests: test files can be found under the directory src/test
sbt clean compile test
- Clone this repository
git clone [email protected]:nithish-kumar-t/building-a-Large-Language-Model-LLM-from-scratch.git
- cd to the Project
cd building-a-Large-Language-Model-LLM-from-scratch
- update the jars
sbt clean update
- Create a fat JAR using assembly
sbt assembly
# This will create a fat Jar
- We can then run unit tests and functional tests using the command below
sbt test
- An SBT application can contain multiple main classes; this project has two. To list them and pick the correct main:
➜building-a-Large-Language-Model-LLM-from-scratch git:(feature) ✗ sbt
[info] started sbt server
sbt:LLM-hw1-jar>
sbt:LLM-hw1-jar> show discoveredMainClasses
* com.trainingLLM.LLMEncoder
* com.trainingLLM.TokenizationJob
- Run the fat JAR (created by assembly) locally using hadoop
hadoop jar <JarName> <Method to run> <environment>
eg: hadoop jar target/scala-2.13/llm-hw1.jar com.trainingLLM.TokenizationJob env=local
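The encoder job can be run locally in the same way; the class name comes from discoveredMainClasses above, and env=local mirrors the example for the tokenization job:
hadoop jar target/scala-2.13/llm-hw1.jar com.trainingLLM.LLMEncoder env=local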
- Create a folder structure like the one below in S3

-
Start a new cluster in AWS EMR using the default configuration.
-
After the cluster is created, open it and add the MR jobs as steps. It will look like the screenshot below; select the JAR from your S3 bucket and provide the env values.

- We can also chain the steps; once you add a step it will start running, in the order the steps were added.

- Once the job completes, the output will be available in S3, as configured in step 1.
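As an alternative to the console, a step can also be submitted with the AWS CLI. The command below is only a sketch: the cluster id, bucket and env value are placeholders, and the jar path assumes the fat JAR was uploaded to S3.
aws emr add-steps --cluster-id <your-cluster-id> \
  --steps Type=CUSTOM_JAR,Name=LLMEncoder,ActionOnFailure=CONTINUE,Jar=s3://<your-bucket>/llm-hw1.jar,MainClass=com.trainingLLM.LLMEncoder,Args=["env=<env>"]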

-
Hadoop: Set up Hadoop on your local machine or cluster.
-
AWS Account: Create an AWS account and familiarize yourself with AWS EMR.
-
Deeplearning4j Library: Ensure that the Java Deeplearning4j library is integrated into your project; it provides the model.
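For illustration, the Deeplearning4j dependencies might look like the following in build.sbt; the version shown is an assumption, so use whatever the project's build.sbt actually pins:
libraryDependencies ++= Seq(
  "org.deeplearning4j" % "deeplearning4j-core"  % "1.0.0-M2.1",  // model definition and training
  "org.nd4j"           % "nd4j-native-platform" % "1.0.0-M2.1"   // INDArray CPU backend
)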
-
Scala, Java and Hadoop: Make sure Scala (2.13, per the build), Java 11 and Hadoop 3.3.6 are installed and configured correctly.
-
Git and GitHub: Use Git for version control and host your project repository on GitHub.
-
IDE: Use an Integrated Development Environment (IDE) for coding and development.
Follow these steps to execute the project:
-
Data Gathering: Ensure you have selected a dataset and that it is ready to be processed by your MR jobs.
-
Configuration: Set up the necessary configuration parameters and input file paths.
-
MapReduce Execution:
a. Run the TokenizationJob to generate tokens for each word and its count.
b. Run the LLMEncoder job to create vector embeddings; it writes the embeddings to a file.
-
Results: Examine the results obtained from the MapReduce jobs.
a. The word counter should output each word, its unique token, and its count.
b. The LLM encoder should output the vector embeddings: each word and its embedding.
-
Deployment on AWS EMR: If required, deploy the project on AWS EMR to train on a larger dataset.
Code coverage report
