Kernel Memory is broken with latest nugets #305
Hi, which version of kernel-memory did you install?
The new version of Semantic Kernel introduced some breaking changes, and the latest package has been updated in PR #306.
I also notice that there is currently no way to add a grammar, so AskAsync can't force well-formed output such as JSON. Can this be added? Same with MainGPU and the other settings on the context. There are also a few options missing for the executor (e.g. TOP_K) that you can't reach through Microsoft's With(new TextGenerationOptions). Perhaps an extended options class for LLamaSharp that exposes the rest of the available settings? Also, now that I have it hacked to work(ish), I notice that memory.AskAsync works once, but if you call it again for a second question against the same data, I just get a ton of \n characters and no valid response. I'm not sure whether that's kernel-memory doing that or whether this library is causing the issue. (If it helps, I'm in GPU-only mode with dual NVIDIA 8 GB cards.)
Yes, the current version of Kernel Memory does not provide more customizable options.
It seems to be related to the state management of the model. @AsakusaRinne any idea?
On the first issue: I was able to take your code and manually add a grammar to it. Is there a way we could just expose Grammar in the LLamaSharpConfig for now, so it gets passed through? Same with MainGPU and TOP_K?
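To illustrate, something roughly like this would do it (just a sketch; these property names are invented here, not the current LLamaSharpConfig API):

```csharp
// Hypothetical sketch only: an extended config whose extra settings would be
// passed straight through to the underlying LLamaSharp model/inference params.
public class ExtendedLLamaSharpConfig
{
    public string ModelPath { get; set; } = "";
    public uint ContextSize { get; set; } = 4096;

    // The settings discussed above (names are placeholders):
    public string? GrammarPath { get; set; }   // GBNF grammar to constrain output, e.g. force valid JSON
    public int MainGpu { get; set; } = 0;      // which GPU hosts the main computation
    public int TopK { get; set; } = 40;        // sampling option not reachable via TextGenerationOptions
}
```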
👍 Good idea, I think I have a solution for issue #289. 😃 Thank you for your suggestion.
On the issue with getting the endless \n on the second AskAsync: it also happens with embeddings. If you import two documents in a row, the second one will fail. It appears to fail in the text generation in the same way, and it seems to be something in the context; however, regenerating the model and the executor every time doesn't fix the problem. I also noticed that EmbeddingMode is not being set on the embedder's ModelParams, which makes it much slower.
Update: It isn't the context, and the native handle shows that it's still valid. I've tried replacing all of the shared versions of the context and model with new versions spun up per request, and there was no difference.
It needs a deep dive. Did you use WithLLamaSharpDefaults?
It's definitely unexpected behaviour! Could you please share your code with us so we can reproduce it? (A minimal example is better.)
Yes, I used WithLLamaSharpDefaults. I'm using a console app.
The second AskAsync will return a ton of \n\n no matter the model, at least when using CUDA. As for the second part about EmbeddingMode: in the setup of WithLLamaSharpDefaults the parameters aren't split out, so text generation and embedding both use the same params instead of using params optimized for embedding.
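For reference, this is the kind of split I mean: a sketch against the 0.8-era ModelParams API, with placeholder paths and values:

```csharp
using LLama.Common;

// Sketch: give the embedder its own ModelParams with EmbeddingMode enabled,
// instead of reusing the text-generation params for both roles.
var textParams = new ModelParams("path/to/model.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 20
};

var embeddingParams = new ModelParams("path/to/model.gguf")
{
    ContextSize = 2048,       // embeddings don't need the full generation context
    GpuLayerCount = 20,
    EmbeddingMode = true      // the missing flag mentioned above; without it embedding is much slower
};
```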
@xbotter I think we should look at the prompt that is fed to our model on the second run, with the context size reduced to 4096. Could you please take a look? I'm on duty this weekend.
This can only be resolved from the Kernel Memory side. I have already submitted an issue, microsoft/kernel-memory#164, and am waiting for further updates.
If it matters, I'm not changing the context length and it still happens.
KernelMemory author here, let me know if there's something I can do to make the integration better, more powerful, easier, etc. :-)
Not related to this, but since you offered: the biggest single issue with Kernel Memory (and LLMs in general) is context length. In Kernel Memory it's worse because you're analysing documents that can be very long, and if you Ask across many documents the context runs out and causes issues. I'd love to see KernelMemory port or otherwise adopt some of the context-windowing techniques of MemGPT natively and expose methods for private LLMs to implement incremental analysis, i.e. keeping the system message and query in context, then taking documents part by part in a loop to pull facts out of and accumulate answers.
Thanks for the feedback. We merged a PR today that allows configuring and/or replacing the search logic, e.g. defining token limits, and this PR microsoft/kernel-memory#189 allows customizing token settings and tokenization logic. We would appreciate it if someone could take a look and let us know if it helps. This snippet shows how we could add LLama to KernelMemory:

```csharp
public class LLamaConfig
{
    public string ModelPath { get; set; } = "";
    public int MaxTokenTotal { get; set; } = 4096;
}

public class LLamaTextGenerator : ITextGenerator, IDisposable
{
    private readonly string _modelPath;

    public LLamaTextGenerator(LLamaConfig config)
    {
        this._modelPath = config.ModelPath;
        this.MaxTokenTotal = config.MaxTokenTotal;
    }

    /// <inheritdoc/>
    public int MaxTokenTotal { get; }

    /// <inheritdoc/>
    public int CountTokens(string text)
    {
        // ... count tokens using LLama tokenizer ...
        // ... which can be injected via ctor as usual ...
    }

    /// <inheritdoc/>
    public IAsyncEnumerable<string> GenerateTextAsync(
        string prompt,
        TextGenerationOptions options,
        CancellationToken cancellationToken = default)
    {
        // ... use LLama backend to generate text ...
    }
}
```
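For what it's worth, one possible body for GenerateTextAsync using LLamaSharp's stateless executor might look like the sketch below; the fields (_weights, _modelParams) and the mapping from TextGenerationOptions are assumptions, not the actual integration code, and the property names may differ between versions:

```csharp
// Sketch only: map KM's TextGenerationOptions onto LLamaSharp inference params and
// stream the result back. Assumes the generator also holds LLamaWeights (_weights)
// and ModelParams (_modelParams) created from config.ModelPath in the constructor.
public IAsyncEnumerable<string> GenerateTextAsync(
    string prompt,
    TextGenerationOptions options,
    CancellationToken cancellationToken = default)
{
    // A stateless executor spins up a fresh context per call, keeping this method re-entrant.
    var executor = new StatelessExecutor(this._weights, this._modelParams);

    var inferenceParams = new InferenceParams
    {
        Temperature = (float)options.Temperature,
        AntiPrompts = options.StopSequences?.ToList() ?? new List<string>()
    };

    return executor.InferAsync(prompt, inferenceParams, cancellationToken);
}
```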
This looks good from what I can see! Is there a roadmap for MemGPT-style functionality? I really miss this from Python.
You mean a chat UI? Things on the roadmap include supporting chat logs and some special memory views. I've always wanted to create a simple UI for demos, maybe as a side project, though there's no timeline.
About LLamaSharp, could you point me to how to count the tokens for a given string? Is there an example somewhere?
If you have a context:

```csharp
LLamaContext context;
int[] tokens = context.Tokenize("this is a string");
int count = tokens.Length;
```

If you only have a model and no context (i.e. LLamaWeights):

```csharp
LLamaWeights weights;
int[] tokens = weights.NativeHandle.Tokenize("this is also a string");
int count = tokens.Length;
```
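Applying that to the ITextGenerator sketch above, CountTokens could be as simple as the following (a sketch that assumes the generator keeps a LLamaWeights instance; depending on the LLamaSharp version, Tokenize may take extra arguments such as a BOS flag and an encoding):

```csharp
// Sketch: count tokens with the model's own tokenizer so KM's token budgeting
// matches what the generator will actually consume. _weights is assumed to be
// a LLamaWeights instance loaded in the constructor.
public int CountTokens(string text)
{
    var tokens = this._weights.NativeHandle.Tokenize(text);
    return tokens.Length;
}
```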
I started a draft to integrate LlamaSharp into KernelMemory here: microsoft/kernel-memory#192. I'm using
Also, the prompt in the test seems to generate some hallucinations; do you see anything that could be causing that? It's the kind of hallucination I used to see in the old GPT-3 back in 2022. Temp 0, max tokens 20, prompt:
result:
(getting the same result with
dluc: Generally I see this when:
Generally what you want is for the context_length to match the model's trained context length, and for max_tokens either to be short but long enough for the answer you're expecting (LLaMA has a bad habit of repeating itself), or to satisfy: system message + all user messages + assistant responses (including max_tokens) <= context_length. I use TikSharp to calculate the number of tokens for all prompts, add 10 or so just to be safe, subtract that from context_length, and use the result as the max token length. I then set AntiPrompts = ["\n\n\n\n", "\t\t\t\t"], which gets rid of two of the cases of repeating instead of ending (especially when generating JSON with a grammar file). This technique also works with ChatGPT 3.5+, which hard-refuses (and still costs you money) if you ask it to produce more than the context_length, so you have to do this math or risk it blowing up and running up your bill.
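As a rough sketch of that budget math (CountTokens and fullPrompt here are stand-ins for whatever tokenizer and prompt assembly you use):

```csharp
using System.Collections.Generic;
using LLama.Common;

// Sketch: keep prompt + completion within the model's context window, with a
// small safety margin, and stop on degenerate repetition instead of running on.
int contextLength = 4096;                     // the model's actual context size
int promptTokens = CountTokens(fullPrompt);   // system message + all user/assistant messages
int safetyMargin = 10;

var inferenceParams = new InferenceParams
{
    MaxTokens = contextLength - promptTokens - safetyMargin,
    AntiPrompts = new List<string> { "\n\n\n\n", "\t\t\t\t" }
};
```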
PS: Kernel-Memory is broken again with the latest update to Kernel-Memory.
@dluc, as you are a developer of Kernel Memory, could you provide a sample of MemoryServerless based on LLamaSharp? I am trying to make it work (by taking the text generator code from microsoft/kernel-memory#192), but without much luck: AskAsync hangs indefinitely. Here is my code (abridged):

```fsharp
open System

type Generator(config: LLamaSharpConfig) =
    interface ITextEmbeddingGenerator with
        // ...

type TextGenerator(config: LLamaSharpConfig) =
    interface ITextGenerator with
        // ...

let mb = memoryBuilder
let kernelMemory = mb.Build()

task {
    // ...
    Console.ReadLine() |> ignore
}
```

test.txt itself is very simple. I am using TheBloke's model for OpenChat 3.5.
Update: LLamaSharp 0.8.1 is now integrated into KernelMemory; here's an example: https://github.com/microsoft/kernel-memory/blob/main/examples/105-dotnet-serverless-llamasharp/Program.cs There's probably some work left for users, e.g. customizing prompts for LLama and identifying which model works best; KM should be sufficiently configurable to allow that.
@dluc Thank you a lot for this great work! Since kernel-memory now supports the LLamaSharp integration internally, I think we'll give up maintaining LLamaSharp.kernel-memory.
The example is missing A LOT, and it doesn't appear to handle things properly either: it uses Azure OpenAI for tokenization/embeddings instead of LLama, unlike LLamaSharp.kernel-memory, and there isn't any code showing how to make that work. Even then, I'm not sure the sample would work, because OpenAI uses a different tokenization than LLama, so the results will be vastly different if it works at all. I'd suggest that before LLamaSharp.kernel-memory is retired, the sample be updated to use LLamaSharp for the entire round trip.
@AsakusaRinne I'd like to take the opportunity to thank you all for LlamaSharp, which makes it so straightforward to integrate Llama into SK and KM. Before removing KM support from LlamaSharp, I'd just highlight that we didn't add LLamaEmbeddings in KM because text comparison tests were failing: cosine similarity for similar strings is off. For example, comparing the similarity of these strings using embeddings:

```csharp
string e1 = "It's January 12th, sunny but quite cold outside";
string e2 = "E' il 12 gennaio, c'e' il sole ma fa freddo fuori"; // Italian translation of e1
string e3 = "the cat is white";
```
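For context, this is the shape of the check that was failing, as a sketch that assumes each string has been embedded into a float[] by the generator under test:

```csharp
// Sketch: compare embedding vectors with cosine similarity. The expectation is
// that similarity(e1, e2) is clearly higher than similarity(e1, e3), because
// e2 is the Italian translation of e1 while e3 is unrelated.
static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}
```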
This is kinda required. And AskAsync is REALLY slow.
We had some trouble with embeddings before; I think the last time we investigated it we found we got the same values from llama.cpp, so there was nothing to fix on the C# side. I'd suggest trying those same tests with llama.cpp to see if you get the same values there, or whether it's maybe an issue on our end this time.
[Maybe we can move this conversation to the KM repo if it's out of scope here.] @JohnGalt1717 could you help us understand if
From my tests:
It appears there may be a bit of confusion regarding the distinction between text indexing and RAG text generation. Text indexing operates independently and does not need to use the same model used for text generation; it's perfectly reasonable to use one model for embedding and indexing while employing a different model for text generation. As for the test in the repo, we have run it on numerous occasions with both text and files, using a blend of OpenAI/Azure OpenAI for embeddings and LLama for text generation, and it has consistently produced the expected outcomes. If you can provide steps to reproduce the errors you're experiencing, we'd certainly appreciate the opportunity to investigate further. Could you kindly share some additional details about your scenario?
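To make that split concrete, a KernelMemory builder configured that way might look roughly like the sketch below; the builder method names and the embedding generator are assumptions and may differ between KM versions:

```csharp
// Sketch: one backend for embeddings/indexing, a different one for writing the RAG answer.
// openAIEmbeddingGenerator and LLamaTextGenerator are placeholders for whichever
// implementations you actually plug in.
var memory = new KernelMemoryBuilder()
    .WithCustomEmbeddingGenerator(openAIEmbeddingGenerator)       // used for indexing and search
    .WithCustomTextGenerator(new LLamaTextGenerator(llamaConfig)) // used to generate the answer
    .Build<MemoryServerless>();
```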
May I ask for a way to reproduce it? I'm not sure if it's because the model does not support the language of the second sentence.
I have a machine that I've run with and without a GPU (2070 Super). If I use the Kai 7B 5_S_M model from Hugging Face, it takes 3:30 and 1:40 respectively when run directly against the text of the partitions that a search returns. If I use AskAsync, it has been running for over 45 minutes and still hasn't returned.
Thanks for all of the hard work integrating KernelMemory with LLamaSharp; I have been able to get a working RAG prototype going in Godot with these two libraries.
I noticed that, for some reason, the IKernelMemoryBuilder doesn't actually load any layers onto the GPU, no matter how many layers I have specified. This happens on my desktop with an RTX 3080 12GB and on my laptop with an eGPU (RTX 3060 12GB). I can confirm that I only have the Cuda12 backend added, and when I use just LLamaSharp to ChatAsync it loads the model onto the GPU just fine (for both setups). Perhaps I'm doing something wrong, but if other users could look at their Task Manager to see whether the IKernelMemoryBuilder actually utilizes the GPU, that could explain the slow speed. It runs my 5800X3D at ~66% for all file embeddings and queries.
I am also trying to figure out whether there is a way to interact with the loaded model in a standard chat fashion, but I do not see any methods in IKernelMemory that interact with the model without searching the database. From what I can tell, an IKernelMemoryBuilder uses the standard LLamaSharp weights/context/executor/embedder but does not make them available after the IKernelMemory is created. It would be great to have access to the standard LLamaSharp session.ChatAsync method when using a KernelMemory, if possible.
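One way to narrow that down is to load the same model with plain LLamaSharp and the same layer count and watch VRAM usage, e.g. a sketch like this:

```csharp
using LLama;
using LLama.Common;

// Sketch: baseline check outside KernelMemory. If this offloads layers to the GPU
// but the KM builder path does not, the issue is in how the builder constructs its
// ModelParams rather than in the CUDA backend itself.
var parameters = new ModelParams("path/to/model.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 35   // same layer count passed to the KM builder
};

using var weights = LLamaWeights.LoadFromFile(parameters);
// Watch Task Manager / nvidia-smi here: VRAM should jump if offloading works.
```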
Using the 0.8 release of LlamaSharp and Kernel-Memory with the samples, there is an error because LlamaSharpTextEmbeddingGeneration doesn't implement the Attributes property.
I took the source, created my own version, and added this so it wouldn't error:

```csharp
public IReadOnlyDictionary<string, string> Attributes => new Dictionary<string, string>();
```

But no matter what model I use I get "INFO NOT FOUND." (I've tried kai-7b-instruct.Q5_K_M.gguf, llama-2-7b-32k-instruct.Q6_K.gguf, llama-2-7b-chat.Q6_K.gguf, and a few others.)
I've tried loading just text, an HTML file, and a web page, to no avail.