How to generate HTML content with AI using Llama-node and Express
This tutorial is meant to be a minimal generative AI demo built on an open-source stack, without any third-party AI service. I hope it will inspire you to create your own useful apps with more advanced integration and design.
TL;DR: Final result available here: https://github.com/jbilcke-hf/template-node-llama-express
The stack
For this project we are going to use:
- NodeJS and TypeScript
- llama-node to use a local LLM in Node
- Express to create the web server
- Airoboros-13b for the language model
First step: project setup
Some prerequisites:
- Git
- Node.js and npm (the commands below use nvm to select the Node version)
- Basic knowledge of TypeScript
Init the Node project:
To save some time I’ve created a project template which you can clone here:
git clone https://github.com/jbilcke-hf/template-node-express.git tutorial
cd tutorial
nvm use
npm i
npm run start
You should now be able to see a test message at http://localhost:7860
You can then install the dependencies and create a folder to store the LLM:
npm i llama-node @llama-node/llama-cpp
mkdir models
Download the LLM
For this tutorial I recommend using Airoboros-13b, which you can download like this:
curl -L https://huggingface.co/TheBloke/airoboros-13b-gpt4-GGML/resolve/main/airoboros-13b-gpt4.ggmlv3.q4_0.bin > ./models/airoboros-13b-gpt4.ggmlv3.q4_0.bin
👉 Here are some tips if you later want to use another model:
- Make sure it is compatible with llama-node (see the compatibility table)
- Pick a model capable of coding tasks, so that it can generate web content (HTML, JS..)
Calling the LLM
llama-node is a library to asynchronously generate a stream of text from an LLM. The basic usage looks like this (note: this is a non-working example):
// initialize the LLM backend engine
// (this is needed because llama-node supports multiple engines)
const llama = new LLM(LLamaCpp)
// load the language model from a path
await llama.load({
modelPath: "path/to/model",
... // mandatory import options for the LLM (more on this later)
})
// start generating some text
await llama.createCompletion({
prompt: "My favorite food is",
... // mandatory LLM generation options (more on this later)
}, (response) => {
process.stdout.write(response.token) // do something with the partial result
})
To make the code work we need to pass configuration settings, and of course a real model.
For a working example, edit src/index.mts and write:
import { LLM } from "llama-node"
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js"
const llama = new LLM(LLamaCpp)
await llama.load({
modelPath: "models/airoboros-13b-gpt4.ggmlv3.q4_0.bin",
enableLogging: false,
nCtx: 1024,
seed: 0,
f16Kv: false,
logitsAll: false,
vocabOnly: false,
useMlock: false,
embedding: false,
useMmap: true,
nGpuLayers: 0
})
await llama.createCompletion({
prompt: "My favorite movie is",
nThreads: 4,
nTokPredict: 2048,
topK: 40,
topP: 0.1,
temp: 0.3,
repeatPenalty: 1,
}, (response) => {
process.stdout.write(response.token)
})
The completion output is streamed token by token, so we use process.stdout.write rather than console.log (otherwise we would get a line break after every token).
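Note that this only changes how we consume the stream: if you would rather collect the full text before doing anything with it, you can accumulate the tokens in the callback instead. A minimal sketch, where options stands for the same completion options used in the example above:
// minimal sketch: accumulate the streamed tokens into a single string
// instead of writing them to stdout
// ("options" stands for the same completion options as in the example above)
let output = ""
await llama.createCompletion(options, (response) => {
  output += response.token
})
console.log(output) // the full completion, printed once at the end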
Then open the terminal and type:
npm run start
After a few moments (this may take quite some time, possibly more than 20 seconds) you should see the model begin to stream its output.
Generating content with an LLM is very resource-intensive, so I strongly suggest closing your other programs to speed up the generation.
Your code editor should be able to give you information about each parameter on mouse hover. Feel free to play with the completion settings: they will influence the output speed and quality (but they may also break things).
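For example, raising temp and topP tends to produce more varied but less predictable output. The values below are just a starting point to experiment with, not a recommendation:
// example of more "creative" completion settings
// (these values are only a starting point for experimentation)
await llama.createCompletion({
  prompt: "My favorite movie is",
  nThreads: 4,
  nTokPredict: 256,   // shorter output, for faster experiments
  topK: 40,
  topP: 0.9,          // sample from a wider probability mass
  temp: 0.8,          // more randomness than the 0.3 used above
  repeatPenalty: 1.1, // slightly discourage repeated tokens
}, (response) => {
  process.stdout.write(response.token)
})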
Putting things together
Now that we have our basic text generation working, we can use a simple web server library to stream the output, such as Express.
Here is the simplified application structure you can use:
import express from "express"
import { LLM } from "llama-node"
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js"
const llama = new LLM(...)
await llama.load(...)
// initialize the web server
const app = express()
// define a new HTTP endpoint ("http://localhost:7860/")
app.get("/", async (req, res) => {
// on each new request, we start a completion..
await llama.createCompletion(..., (response) => {
// on each new token, we send it out through the HTTP stream..
res.write(...)
})
// finally, when the completion stream ends we close the output stream
res.end()
})
// begin accepting requests
app.listen(7860, () => { console.log(`Open http://localhost:7860`) })
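One small suggestion of mine (not something the template requires): declare the Content-Type up front, so browsers render the streamed HTML progressively while Express keeps the connection open with chunked transfer encoding:
// optional: send the content type before streaming, so browsers start
// rendering the HTML as the tokens arrive
app.get("/", async (req, res) => {
  res.setHeader("Content-Type", "text/html; charset=utf-8")
  // ...then the completion and res.write calls, as shown above
})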
Now it becomes a question of what to prompt the model with and how you want to generate your content. One possibility is to directly ask for HTML output, like this:
import express from "express"
import { LLM } from "llama-node"
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js"
const llama = new LLM(LLamaCpp)
await llama.load({
modelPath: "./models/airoboros-13b-gpt4.ggmlv3.q4_0.bin",
enableLogging: false,
nCtx: 1024,
seed: 0,
f16Kv: false,
logitsAll: false,
vocabOnly: false,
useMlock: false,
embedding: false,
useMmap: true,
nGpuLayers: 0
})
const app = express()
const port = 7860
const timeoutInSec = 5 * 60 // 5 min before killing a request
app.get("/", async (req, res) => {
// we give this bit of HTML to help the LLM with completion,
// but it won't repeat it back, so we need to do it ourselves
res.write("<html><head>")
const options = {
prompt: `# Instructions
Generate an HTML webpage about: ${req.query.prompt}
Use English, not Latin!
To create section titles, please use <h2>, <h3> etc
# HTML output
<html><head>`,
nThreads: 2,
nTokPredict: 1024,
topK: 40,
topP: 0.1,
temp: 0.3,
repeatPenalty: 1,
}
await llama.createCompletion(options, (response) => {
res.write(response.token)
})
res.end()
})
app.listen(port, () => { console.log(`Open http://localhost:${port}/?prompt=a%20webpage%20recipe%20for%20making%20chocolate%20chip%20cookies`) })
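Once the server is running, you can check that the streaming works end to end from another Node script. A minimal sketch, assuming Node 18+ (where fetch is built in); the prompt value is only an example:
// minimal sketch of a streaming client (assumes Node 18+ with built-in fetch)
const response = await fetch("http://localhost:7860/?prompt=a%20webpage%20about%20cats")
if (!response.body) {
  throw new Error("no response body")
}
const reader = response.body.getReader()
const decoder = new TextDecoder()
while (true) {
  // read the response chunk by chunk, as the server streams it
  const { done, value } = await reader.read()
  if (done) break
  process.stdout.write(decoder.decode(value, { stream: true }))
}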
One last tip: aborting the stream
To save resources, you should kill the stream whenever a user disconnects, and I also strongly suggest not letting it run forever.
You can end the completion stream from anywhere in your code by passing an abort controller as a third parameter:
app.get("/", async (req, res) => {
const abortController = new AbortController()
await llama.createCompletion(..., (response) => {
...
}, abortController.signal)
// this will kill the stream when a user disconnects
req.on("close", function() {
abortController.abort()
})
// this will also kill the stream after 5 minutes
setTimeout(() => {
abortController.abort()
}, 5 * 60 * 1000)
...
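Putting the pieces together, here is a minimal sketch of the full endpoint with both the disconnect handler and the timeout, reusing the timeoutInSec constant declared earlier. Whether an aborted completion rejects its promise may depend on your llama-node version, hence the defensive try/catch:
app.get("/", async (req, res) => {
  const abortController = new AbortController()

  // kill the stream if the user disconnects
  req.on("close", () => abortController.abort())

  // also kill the stream if it runs for too long
  const timer = setTimeout(() => abortController.abort(), timeoutInSec * 1000)

  res.write("<html><head>")

  const options = {
    prompt: `# Instructions
Generate an HTML webpage about: ${req.query.prompt}
Use English, not Latin!
To create section titles, please use <h2>, <h3> etc
# HTML output
<html><head>`,
    nThreads: 2,
    nTokPredict: 1024,
    topK: 40,
    topP: 0.1,
    temp: 0.3,
    repeatPenalty: 1,
  }

  try {
    // the abort signal is passed as the third parameter
    await llama.createCompletion(options, (response) => {
      res.write(response.token)
    }, abortController.signal)
  } catch (err) {
    // an aborted completion may reject, in which case there is nothing left to send
  } finally {
    clearTimeout(timer)
    res.end()
  }
})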
Demo project
I have set up a working demo repository, which you can find here: https://github.com/jbilcke-hf/template-node-llama-express
It is ready for deployment using Docker, but I recommend running it first on your machine (see the README.md).
Be warned, however: depending on the configuration of the server you deploy to, you may have to wait a couple of minutes before you see any content. This can be the case, for instance, if you use an entry-level machine such as the free tier of Hugging Face Spaces.
To go further
Turning this proof-of-concept code into a project suitable for production will require more work and customization on your side, depending on your use cases.
I suggest you take the time to find the best solution for your project (and budget!). Ultimately, scaling such an app requires managing multiple GPU instances, implementing rate limits, quotas, and so on.
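As a very small first step in that direction, you could at least avoid running several completions at once on a single machine, since one generation already saturates it. A minimal sketch of a "one request at a time" guard, written as an Express middleware; the busy flag and the 503 response are my own choice, not something provided by llama-node:
// minimal "one completion at a time" guard for the Express app defined above
// (register it before the app.get route so it runs first)
let busy = false

app.use((req, res, next) => {
  if (busy) {
    res.status(503).send("The model is busy, please retry in a moment")
    return
  }
  busy = true
  // release the lock when the response finishes or the client disconnects
  res.on("close", () => { busy = false })
  next()
})
For anything beyond a demo, you would replace this with a proper queue, per-user quotas, and so on.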
Let me know in the comments if you have tips or solutions to share regarding scaling Node and AI apps in production!