This paper, which is refreshingly brutal and honest, summarises why I have been having problems getting good results from RAG.
Basically, in its current incarnation, RAG simply doesn’t work, period. Yes, I know there are good examples showing impressive results with RAG. However, these examples don’t always mention how much tweaking was required. Parameters like chunk size, overlap and top_k can be the difference between a good result and a horrible one. The prompts also need tweaking, sometimes on a per-query basis. Even things like the choice of embedding algorithm or vector database can sometimes make a difference.
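To make that tuning surface concrete, here is a minimal sketch of the retrieval step in plain Python. The `embed` argument is a stand-in for whatever embedding model you use (an assumption, not a real API), and the parameter values are illustrative, not recommendations:

```python
import math

CHUNK_SIZE = 512   # characters per chunk: too small loses context, too large dilutes relevance
OVERLAP = 64       # characters shared between neighbouring chunks
TOP_K = 4          # how many chunks get handed to the LLM

def chunk(text: str) -> list[str]:
    """Split text into overlapping chunks of CHUNK_SIZE characters."""
    step = CHUNK_SIZE - OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), step)]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve(query: str, chunks: list[str], embed) -> list[str]:
    """Return the TOP_K chunks most similar to the query.

    `embed` is a placeholder: any function mapping a string to a vector.
    """
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:TOP_K]
```

Every constant in that sketch is a knob, and in my experience none of them transfer cleanly from one corpus or query style to another.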
To quote from the paper:
The problem here isn’t that large language models hallucinate, lie, or misrepresent the world in some way. It’s that they are not designed to represent the world at all; instead, they are designed to convey convincing lines of text. So when they are provided with a database of some sort, they use this, in one way or another, to make their responses more convincing. But they are not in any real way attempting to convey or transmit the information in the database. As Chirag Shah and Emily Bender put it: “Nothing in the design of language models (whose training task is to predict words given context) is actually designed to handle arithmetic, temporal reasoning, etc. To the extent that they sometimes get the right answer to such questions is only because they happened to synthesize relevant strings out of what was in their training data. No reasoning is involved […] Similarly, language models are prone to making stuff up […] because they are not designed to express some underlying set of information in natural language; they are only manipulating the form of language” (Shah & Bender, 2022). These models aren’t designed to transmit information, so we shouldn’t be too surprised when their assertions turn out to be false.
To summarise:
Investors, policymakers, and members of the general public make decisions on how to treat these machines and how to react to them based not on a deep technical understanding of how they work, but on the often metaphorical way in which their abilities and function are communicated. Calling their mistakes “hallucinations” isn’t harmless: it lends itself to the confusion that the machines are in some way misperceiving but are nonetheless trying to convey something that they believe or have perceived. This, as we’ve argued, is the wrong metaphor. The machines are not trying to communicate something they believe or perceive. Their inaccuracy is not due to misperception or hallucination. As we have pointed out, they are not trying to convey information at all. They are bullshitting.
As I’ve discovered, LLMs have huge coherence issues with long context lengths. It’s no secret that many techniques for RAG and summarisation rely on breaking the context into smaller chunks: chain of density, MapReduce and so on (see the sketch below). By the time an LLM has finished summarising one paragraph, it may have forgotten it a few paragraphs later. How can we hope that an LLM will translate a DN sutta while maintaining context and coherence, even if we apply chunking techniques? Some say this is only a problem with current-generation LLMs, and that new models are coming out with 32K-128K context lengths. However, my limited testing suggests this is simply a marketing number: models with large context sizes don’t seem to be significantly better at maintaining coherence and context.
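For reference, MapReduce-style summarisation in its simplest form looks something like this. The `llm` argument is a stand-in for a call to whatever model or API you use (an assumption, not a real library call); the structure itself shows where coherence is lost, since each chunk is summarised with no knowledge of the others:

```python
def split(text: str, size: int = 2000) -> list[str]:
    """Naive fixed-size chunking, no overlap."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarise(text: str, llm) -> str:
    """MapReduce summarisation: summarise chunks, then summarise the summaries.

    `llm` is a placeholder: any function mapping a prompt string to a completion.
    """
    # Map: each chunk is summarised independently, so anything that ties
    # one chunk to another (pronouns, recurring terms, narrative flow) is lost here.
    partials = [llm(f"Summarise:\n\n{c}") for c in split(text)]
    # Reduce: combine the partial summaries, compounding whatever was lost above.
    return llm("Combine these partial summaries into one coherent summary:\n\n"
               + "\n\n".join(partials))
```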
Given the above, I think there are real issues with using LLMs for anything related to Buddhism. These models have no regard for the truth; they do not care about realisation, extinguishment, or the cessation of suffering. They have a huge potential for creating more suffering, and for delaying the achievement of the soteriological goal.