Handling Multiple Conversations and Prompts in Custom Main Loop #14520
MirkoDeVita98 asked this question in Q&A · Unanswered · 1 comment, 4 replies
-
Hello! I'm new to llama.cpp and I'm trying to understand how to write my own main loop to process multiple conversations from a file.
This is my current setup: I loop over each conversation (outer loop), and then iterate through the prompts within each conversation (inner loop). While transitioning between conversations works fine, I'm seeing strange results when processing multiple prompts within the same conversation.
I based parts of this code on the save-load-state.cpp example.
Could you please take a look and let me know if my logic makes sense? Also, I'm unsure whether I should instantiate a new context and sampler for every conversation, or if there's a more efficient way to handle that. Any advice on best practices would be appreciated!
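Re-creating the context and sampler for every conversation works, but it is usually unnecessary: the llama.cpp examples typically keep one context alive for the whole run and wipe the shared state when a conversation ends. Below is a minimal sketch of that pattern, assuming a recent build of the C API; note that the cache call has been renamed more than once across versions (llama_kv_cache_clear, later llama_kv_self_clear / llama_memory_clear), so match the name to your llama.h.

```cpp
#include "llama.h"

// Reset shared state between two conversations instead of re-creating the
// context and sampler. The API names here are assumptions taken from a
// recent llama.cpp; check them against your version of llama.h.
static void start_new_conversation(llama_context * ctx, llama_sampler * smpl) {
    llama_kv_cache_clear(ctx);  // drop every cached token, all sequences
    llama_sampler_reset(smpl);  // clear accumulated sampler state (e.g. repetition penalties)
}
```

If several conversations share the cache concurrently, clear only the finished conversation's sequence (see the reply below) rather than the whole cache.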
-
Reply:
I only skimmed through your code, but generally speaking, if you want to process multiple conversations in the same KV cache, you need to assign each one a different sequence id. That's the …
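To make the sequence-id advice concrete, here is a hedged sketch of a main loop in that style, assuming a recent llama.cpp C API (llama_tokenize taking a llama_vocab, llama_batch_init, llama_kv_cache_seq_rm — the last has been renamed to llama_memory_seq_rm in newer builds, so adjust to your headers). `conversations`, `ctx`, and `vocab` (from llama_model_get_vocab) are assumed to exist; sampling and error handling are elided.

```cpp
#include "llama.h"
#include <string>
#include <vector>

// Feed one prompt into the cache under a fixed sequence id, continuing
// from that conversation's current position n_past.
static void decode_prompt(llama_context * ctx, const llama_vocab * vocab,
                          llama_seq_id seq, const std::string & prompt, int & n_past) {
    // tokenize; buffer sized generously for a sketch, error handling elided
    std::vector<llama_token> toks(prompt.size() + 8);
    const int n = llama_tokenize(vocab, prompt.c_str(), (int) prompt.size(),
                                 toks.data(), (int) toks.size(),
                                 /*add_special=*/n_past == 0, /*parse_special=*/true);
    toks.resize(n);

    llama_batch batch = llama_batch_init((int) toks.size(), 0, 1);
    for (int i = 0; i < (int) toks.size(); ++i) {
        batch.token   [i]    = toks[i];
        batch.pos     [i]    = n_past + i;       // per-sequence position
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = seq;              // this conversation's sequence id
        batch.logits  [i]    = (i == (int) toks.size() - 1); // logits for last token only
    }
    batch.n_tokens = (int) toks.size();

    llama_decode(ctx, batch);
    n_past += batch.n_tokens;
    llama_batch_free(batch);
}

// Outer loop: conversations. Inner loop: the prompts of one conversation.
static void run(llama_context * ctx, const llama_vocab * vocab,
                const std::vector<std::vector<std::string>> & conversations) {
    for (size_t c = 0; c < conversations.size(); ++c) {
        const llama_seq_id seq = (llama_seq_id) c; // distinct id per conversation
        int n_past = 0;                            // this sequence's position counter
        for (const std::string & prompt : conversations[c]) {
            decode_prompt(ctx, vocab, seq, prompt, n_past);
            // ... sample the model's reply here, decoding each sampled token
            //     with the same seq and pos = n_past++ ...
        }
        // Conversation finished: free its KV cells (older API name; newer
        // builds renamed this to llama_memory_seq_rm).
        llama_kv_cache_seq_rm(ctx, seq, -1, -1);
    }
}
```

Two things to check against your setup: the context should be created with n_seq_max (in llama_context_params) at least as large as the number of sequences alive at once, and each conversation's positions must continue from its own n_past rather than a global counter — reusing positions or sequence ids without clearing them first is a common source of the "strange results" described in the question. Since this loop finishes one conversation before starting the next, a single sequence id cleared between conversations would also work; distinct ids become essential once conversations are interleaved in the same batch.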