Simple RAG Application on Sutta Question Answering

I found this thread because I am experimenting in this domain.

A natural-language search engine, a la Perplexity, that generates simple lists of links plus only 'connective' text would be a good example of a relatively safe use of a Sutta-based Retrieval-Augmented Generation (RAG) application.
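
To make the "links plus connective text" idea concrete, here is a minimal sketch of the retrieval half of such an application, under my own assumptions: embed() is a toy stand-in for a real sentence-embedding model, and the two corpus entries are only illustrative.

import numpy as np

# Toy bag-of-characters embedding so the sketch runs end to end; a real
# system would use a proper embedding model and the full sutta corpus.
def embed(text: str) -> np.ndarray:
    v = np.zeros(128)
    for ch in text.lower():
        v[ord(ch) % 128] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Illustrative entries: citation, link, and passage text (abridged here).
corpus = [
    {"ref": "SN 56.11", "url": "https://suttacentral.net/sn56.11", "text": "..."},
    {"ref": "MN 10", "url": "https://suttacentral.net/mn10", "text": "..."},
]

def search(query: str, top_k: int = 3) -> str:
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: float(q @ embed(d["text"])), reverse=True)
    # Only connective text around plain links; no generated interpretation.
    return "\n".join([f"Passages related to '{query}':"]
                     + [f"- {d['ref']}: {d['url']}" for d in ranked[:top_k]])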

For a dedicated LLM, whether for chat or other uses, it is interesting to consider aligning it with [the values of the] Dhamma using Constitutional AI, the technique developed by Anthropic (the public benefit corporation that develops Claude). As with any technology in this field, only rigorous development and testing can ensure safe use.
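
For anyone unfamiliar with the technique, the core loop of Constitutional AI is critique-and-revise against a written set of principles. The sketch below is only my paraphrase of that loop: llm() is a stand-in for whatever model and API one uses, and the sample principles are purely illustrative, not a proposed codification of the Dhamma.

# Illustrative principles only; drafting a real "constitution" would itself
# be a careful, community-driven task.
PRINCIPLES = [
    "Do not present speculation or personal interpretation as the word of the suttas.",
    "When uncertain, point the reader to the source text rather than asserting an answer.",
]

def llm(prompt: str) -> str:
    # Stand-in for a real model call (any chat/completions API).
    raise NotImplementedError

def constitutional_revision(question: str) -> str:
    draft = llm(question)
    for principle in PRINCIPLES:
        critique = llm(
            f"Principle: {principle}\nResponse: {draft}\n"
            "Briefly explain any way the response violates the principle."
        )
        draft = llm(
            f"Rewrite the response so it follows the principle.\n"
            f"Principle: {principle}\nCritique: {critique}\nResponse: {draft}"
        )
    return draft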

In my case, I am working with the Pali texts directly, building a semantic knowledge graph database, leaving translation and interpretation to others. Such a graph DB could be used in many applications, with or without the Generative AI component of LLMs.
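
As an illustration of what one small slice of such a graph might hold, here is a sketch using networkx; a production system would more likely sit on a graph database (Neo4j, an RDF store, etc.), and the node and edge schema here is just my assumption for the example.

import networkx as nx

g = nx.MultiDiGraph()

# Lexical nodes: a compound and its parts.
g.add_node("cakkhuviññāṇa", kind="compound")   # "eye-consciousness"
g.add_node("cakkhu", kind="stem")              # "eye"
g.add_node("viññāṇa", kind="stem")             # "consciousness"

# Typed edges let similarity queries walk from a compound to its parts.
g.add_edge("cakkhuviññāṇa", "cakkhu", relation="compound_part")
g.add_edge("cakkhuviññāṇa", "viññāṇa", relation="compound_part")

# Occurrence edges connect terms to the text segments that contain them.
g.add_node("MN 148", kind="text")
g.add_edge("cakkhuviññāṇa", "MN 148", relation="occurs_in")

# Example query: everything reachable from the compound, grouped by relation.
for _, target, data in g.out_edges("cakkhuviññāṇa", data=True):
    print(data["relation"], "->", target)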

The first step in text processing for a knowledge graph, and for many other NLP applications, is the creation of lexical tokens, i.e. tokenization. A fundamental purpose of tokenization is to expose word parts such as roots, compound members, and subwords, so that NLP systems can model 'similarity' relationships between them. Here's an example, with a toy code sketch after it:

Unbelievably, the reforestation project’s semi-annual report had been overemphasized.

Sentence
├─ Adv
│  └─ Unbelievably
│     ├─ Pfx: un
│     ├─ Root: believe
│     ├─ Sfx: able
│     └─ Sfx: ly
├─ NP
│  ├─ Det
│  │  └─ the
│  ├─ CN
│  │  ├─ reforestation
│  │  │  ├─ Pfx: re
│  │  │  ├─ Root: forest
│  │  │  ├─ Sfx: ate
│  │  │  └─ Sfx: ion
│  │  └─ Poss
│  │     └─ project's
│  │        ├─ project
│  │        │  ├─ Pfx: pro
│  │        │  └─ Root: ject
│  │        └─ Poss: 's
│  ├─ CA
│  │  └─ semi-annual
│  │     ├─ Pfx: semi
│  │     ├─ Root: ann
│  │     ├─ Sfx: ual
│  │     └─ Punct: -
│  └─ N
│     └─ report
│        ├─ Pfx: re
│        └─ Root: port
└─ VP
   ├─ V + Aux
   │  ├─ had
   │  └─ been
   └─ CV
      └─ overemphasized
         ├─ Pfx: over
         ├─ Root: emphasis
         ├─ Sfx: ize
         └─ Sfx: d
  • Each leaf node represents a lexical token.
  • Punctuation (Punct) is treated as a separate token where applicable (e.g., the hyphen in "semi-annual").
  • The possessive "'s" is treated as a separate token attached to the noun "project".
  • Compound words are further tokenized into their constituent prefixes, roots, and suffixes.
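
To show the same idea in code, here is a toy affix splitter over hand-written English prefix and suffix lists (my own lists, purely for illustration). Real subword tokenizers such as WordPiece, BPE, or a unigram LM learn their splits from a corpus rather than from hand-made rules, but the effect is similar: related forms end up sharing a sub-token.

# Toy illustration only: peel one known prefix and one known suffix off a
# word so that related forms share a sub-token. Not a real morphological
# analyzer.
PREFIXES = ["un", "re", "over", "semi", "pro"]
SUFFIXES = ["ably", "ized", "ation", "able", "ize", "ion", "ual", "ly"]

def split_affixes(word: str) -> list[str]:
    w = word.lower()
    pre, suf = [], []
    for p in PREFIXES:
        if w.startswith(p) and len(w) - len(p) > 3:
            pre, w = [p], w[len(p):]
            break
    for s in SUFFIXES:
        if w.endswith(s) and len(w) - len(s) > 3:
            suf, w = [s], w[:-len(s)]
            break
    return pre + [w] + suf

print(split_affixes("unbelievably"))    # ['un', 'believ', 'ably']
print(split_affixes("overemphasized"))  # ['over', 'emphas', 'ized']
print(split_affixes("reforestation"))   # ['re', 'forest', 'ation']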

Currently, to my knowledge, no tokenizer exists for the Pali language. Tokenizers are typically written in Python using word splitters, regular expressions, and other string operations; see, for example, the tokenizers shipped with NLTK, spaCy, Hugging Face Transformers, and StanfordNLP.
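
For orientation, this is roughly what those packages give you out of the box: a split on word boundaries with no awareness of stems, endings, or compounds. The example uses NLTK's RegexpTokenizer because it needs no model download; the sample sentence is a standard sutta opening formula.

from nltk.tokenize import RegexpTokenizer

# A plain word-boundary tokenizer: it separates words from punctuation but
# knows nothing about Pali stems, case endings, or compounds.
tok = RegexpTokenizer(r"\w+")
print(tok.tokenize("Evaṃ me sutaṃ: ekaṃ samayaṃ bhagavā sāvatthiyaṃ viharati."))
# ['Evaṃ', 'me', 'sutaṃ', 'ekaṃ', 'samayaṃ', 'bhagavā', 'sāvatthiyaṃ', 'viharati']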

Part of the guts of such a tokenizer might look like:

import re

class AdvancedPaliTokenizer:
    def __init__(self):
        # Stem followed by a common Pali case/number ending (illustrative list).
        self.compound_word_pattern = re.compile(r'\b(\w+?)(aṃ|o|ā|ī|ū|e|ena|ehi|esu|assa|ānaṃ|āya|ato|asmā|amhā|āhi|ābhi|anā|ati|āti|āyaṃ)\b')
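
And here, as a sketch only, is how such a pattern might be applied; the word list and the fullmatch wrapping are my own assumptions, and real Pali morphology (sandhi, verb conjugation, long compounds) needs far more than one regular expression.

tokenizer = AdvancedPaliTokenizer()

# Hypothetical usage: peel a case ending off each inflected form so that,
# e.g., "dhammaṃ" and "dhammena" share the stem "dhamm".
for word in ["dhammaṃ", "dhammena", "bhikkhave"]:
    m = tokenizer.compound_word_pattern.fullmatch(word)
    if m:
        stem, ending = m.groups()
        print(word, "->", stem, "+", ending)
    else:
        print(word, "-> (no known ending)")
# dhammaṃ   -> dhamm + aṃ
# dhammena  -> dhamm + ena
# bhikkhave -> bhikkhav + e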

I'm here seeking collaborators for this project of writing a Pali-language tokenizer, most likely built on the widely used Hugging Face Transformers library.
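
In that direction, one plausible starting point is the Hugging Face tokenizers library (the package underneath Transformers), which can train a data-driven subword tokenizer from a plain-text corpus. The sketch below follows the library's standard BPE training recipe; the corpus file name and vocabulary size are placeholders of mine, and how well learned subwords line up with genuine Pali morphemes is, I think, exactly where collaboration is needed.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Train a byte-pair-encoding vocabulary on a (hypothetical) romanized Pali corpus.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=8000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["pali_corpus.txt"], trainer=trainer)
tokenizer.save("pali-bpe.json")

# The saved file can then be wrapped for use with Transformers models, e.g.
# transformers.PreTrainedTokenizerFast(tokenizer_file="pali-bpe.json").
print(tokenizer.encode("dhammacakkappavattanasuttaṃ").tokens)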