+The new framework of LightLLM is shown in the figure below. We have retained the previous three-process architecture design and designed a request object that can be accessed across processes via ctypes. We store the metadata of the requests in shared memory, ensuring that only a minimal amount of necessary data is communicated between processes. Additionally, we have implemented the folding of scheduling and model inference, and nearly eliminated the communication overhead between the scheduler and modelrpc in the previous version of the router. We implemented the CacheTensorManager class, which takes over the allocation and release of Torch tensors within the framework. This maximizes the cross-layer sharing of tensors at runtime, as well as memory sharing between different CUDA graphs. On an 8x80GB H100 machine, with the DeepSeek-v2 model, LightLLM can run 200 CUDA graphs concurrently without running out of memory (OOM). We will subsequently publish a series of blog posts introducing the architecture of LightLLM.
0 commit comments