Building a Large-Language-Model (LLM) from scratch

Author: Nithish Kumar Thathaiahkalva

Description

This project focuses on implementing an LLM encoder using parallel distributed computations in the cloud, leveraging Hadoop MapReduce and deploying the application on AWS Elastic MapReduce (EMR).

YouTube Video Link

Project structure

  1. This project consists of 2 parts:

    a. The first is a simple word-count MapReduce program.

    b. The second is a MapReduce implementation of an LLM encoder, which is trained over a large corpus of input data and generates vector embeddings of words.

  2. Dataset used for training the model: Wikipedia text.

  3. For the first part, the tokenization job tokenizes the input and emits each word along with its unique token and its count.

  4. For the encoder part, I'm dividing the data into 256 MB chunks, where each shard is handled by a mapper and then passed on to a reducer (a sketch of the mapper logic follows this list).

    a. In each Mapper, the data is split into sentences, and each sentence is encoded with Jtokkit to generate Byte-Pair Encodings of the words.

    b. Features and output labels are then created from the tokenized sentence with a sliding window: each token is used as the prediction target for the tokens that precede it.

    c. Eg: "We are from UIC" => converting into tokens => [123, 23, 45, 12, 49, 11] => features = [123, 23, 45, 12, 49], labels = [23, 45, 12, 49, 11]


    d. The model is trained using these features and labels.

    e. Once the model is trained for a shard, the input words are mapped to the respective vectors obtained from the model.

    f. The vector embedding, which is an INDArray, is serialized to a TextWritable.

  5. The Reducer receives the sorted data as its input and processes it (a sketch of the averaging follows below).

    a. The Reducer first deserializes the input it receives, i.e. the Iterator[TextWritable], into an Iterator[Array].

    b. After this conversion, all the embeddings of a word are averaged into one; this reduces noise and gives better contextual understanding.
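To make steps 4(a)–4(f) concrete, below is a minimal Scala sketch of the mapper-side logic. It assumes a recent JTokkit version (where encode returns an IntArrayList) with the default CL100K_BASE encoding, and a simple comma-separated serialization of the embedding vector; the object and method names (EncoderMapperSketch, sentences, tokenize, featuresAndLabels, serialize) are illustrative, not the repository's actual code.

import com.knuddels.jtokkit.Encodings
import com.knuddels.jtokkit.api.EncodingType
import org.apache.hadoop.io.Text
import org.nd4j.linalg.api.ndarray.INDArray

// Illustrative mapper-side helpers (not the repository's exact code).
object EncoderMapperSketch {
  private val encoding =
    Encodings.newDefaultEncodingRegistry().getEncoding(EncodingType.CL100K_BASE)

  // Split a shard's text into sentences (naive split on sentence-ending punctuation).
  def sentences(text: String): Seq[String] =
    text.split("(?<=[.!?])\\s+").toSeq.filter(_.trim.nonEmpty)

  // BPE-encode a sentence with JTokkit, yielding its token ids.
  def tokenize(sentence: String): Array[Int] =
    encoding.encode(sentence).toArray // IntArrayList.toArray in recent JTokkit versions

  // Sliding window: each token predicts the next one, so
  // features drop the last token and labels drop the first.
  // [123, 23, 45, 12, 49, 11] -> ([123, 23, 45, 12, 49], [23, 45, 12, 49, 11])
  def featuresAndLabels(tokens: Array[Int]): (Array[Int], Array[Int]) =
    (tokens.dropRight(1), tokens.drop(1))

  // One possible way to serialize an INDArray embedding into a Text value (step 4f).
  def serialize(vector: INDArray): Text =
    new Text(vector.toDoubleVector.mkString(","))
}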

**On a high level, this is the view of the second application:**

[High-level architecture diagram]
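A matching sketch of the reducer-side averaging described in step 5, assuming the same comma-separated serialization used in the mapper sketch above (again, the names are illustrative):

import org.apache.hadoop.io.Text

// Illustrative reducer-side helpers (not the repository's exact code).
object EncoderReducerSketch {

  // Deserialize one Text value back into a numeric vector.
  def deserialize(value: Text): Array[Double] =
    value.toString.split(",").map(_.toDouble)

  // Average all embeddings emitted for a word into a single vector;
  // averaging reduces noise and gives a more stable contextual representation.
  def average(values: Iterable[Text]): Array[Double] = {
    val vectors = values.map(deserialize).toSeq
    val sums = vectors.foldLeft(Array.fill(vectors.head.length)(0.0)) { (acc, v) =>
      acc.indices.foreach(i => acc(i) += v(i)); acc
    }
    sums.map(_ / vectors.size)
  }
}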

Environment

OS: Mac

IDE: IntelliJ IDEA 2022.2.3 (Ultimate Edition)

SCALA Version: 2.13.12

Hadoop Version: 3.3.4

Running the tests

Test files can be found under the directory src/test:

sbt clean compile test

Running the project locally

  1. Clone this repository
git clone git@github.com:nithish-kumar-t/building-a-Large-Language-Model-LLM-from-scratch.git
  2. cd into the project
cd building-a-Large-Language-Model-LLM-from-scratch
  3. Update the dependencies
sbt clean update
  4. Create a fat jar using assembly
sbt assembly
# This will create a fat Jar
  5. We can then run the unit and functional tests using the command below
sbt test
  6. An SBT application can contain multiple main classes; this project has 2, so list them to pick the correct one
➜building-a-Large-Language-Model-LLM-from-scratch git:(feature) ✗ sbt
[info] started sbt server
sbt:LLM-hw1-jar>
sbt:LLM-hw1-jar> show discoveredMainClasses
* com.trainingLLM.LLMEncoder
* com.trainingLLM.TokenizationJob
  7. Run a job from the fat jar with Hadoop
hadoop jar <JarName> <Method to run> <environment>
eg: hadoop jar target/scala-2.13/llm-hw1.jar com.trainingLLM.TokenizationJob env=local
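Assuming the same fat jar, the encoder job can be launched the same way by swapping in the other discovered main class, e.g.:
hadoop jar target/scala-2.13/llm-hw1.jar com.trainingLLM.LLMEncoder env=local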

Running the Project in AWS

  1. Create a library structure like the one below in S3
[S3 folder structure screenshot]
  2. Start a new cluster in AWS EMR using the default configuration.

  3. After the cluster is created, open it; we can then add our MR jobs as steps. It will look like the screenshot below: select the jar from your S3 bucket and provide the env values.

[EMR add-step screenshot]
  4. We can also chain the steps; once you add a step it will start running, based on the order.
  5. Once the job completes, the output will be available in S3, as per the configuration in step 1.

Prerequisites

  1. Hadoop: Set up Hadoop on your local machine or cluster.

  2. AWS Account: Create an AWS account and familiarize yourself with AWS EMR.

  3. Deeplearning4j Library: Ensure that you have the Java Deeplearning4j library integrated into your project for building the model.

  4. Scala, Java and Hadoop: Make sure Scala (2.13), Java 11 and Hadoop 3.3.6 are installed and configured correctly.

  5. Git and GitHub: Use Git for version control and host your project repository on GitHub.

  6. IDE: Use an Integrated Development Environment (IDE) for coding and development.

Usage

Follow these steps to execute the project:

  1. Data Gathering: Ensure you have selected a dataset and that it is ready to be processed by your MR jobs.

  2. Configuration: Set up the necessary configuration parameters and input file paths.

  3. MapReduce Execution:

    a. Run the tokenization MapReduce job (TokenizationJob) to generate the token for each word and its count.

    b. Run the LLMEncoder job to create the vector embeddings and write them to a file.

  4. Results: Examine the results obtained from the MapReduce jobs.

    a. The word counter should output each word, its unique token and its count.

    b. The LLM encoder should output each word and its vector embedding.

  5. Deployment on AWS EMR: If required, deploy the project on AWS EMR to train on more data.

Unit / Regression Testing

Code coverage report

[Code coverage report screenshot]

