I found this thread because I am experimenting in this domain.
A natural-language search engine, à la Perplexity, which generates simple lists of links joined by only "connective" text, would be a good example of a relatively safe use of a Sutta-based Retrieval-Augmented Generation (RAG) application.
In the case of building a dedicated LLM, for chat or other uses, it is interesting to consider aligning it with [the values of the] Dhamma using the Constitutional AI technique developed by Anthropic (the public benefit corporation, or PBC, that develops Claude). As with any technology in this field, only rigorous development and testing can ensure safety of use.
In my case, I am working with the Pali texts directly, building a semantic knowledge graph database, leaving translation and interpretation to others. Such a graph DB could be used in many applications, with or without the Generative AI component of LLMs.
The first step in text processing for a KG, and for many other NLP applications, is the creation of lexical tokens, a.k.a. tokenization. A fundamental purpose of tokenization is to handle word parts (roots, affixes, compound members, and other subwords) so that NLP systems can model "similarity" relationships between them. Here's an example:
Unbelievably, the reforestation projectβs semi-annual report had been overemphasized.
```
Sentence
├── Adv
│   └── Unbelievably
│       ├── Pfx: un
│       ├── Root: believe
│       ├── Sfx: able
│       └── Sfx: ly
├── NP
│   ├── Det
│   │   └── the
│   ├── CN
│   │   └── reforestation
│   │       ├── Pfx: re
│   │       ├── Root: forest
│   │       ├── Sfx: ate
│   │       └── Sfx: ion
│   ├── Poss
│   │   └── project's
│   │       ├── project
│   │       │   ├── Pfx: pro
│   │       │   └── Root: ject
│   │       └── Poss: 's
│   ├── CA
│   │   └── semi-annual
│   │       ├── Pfx: semi
│   │       ├── Root: ann
│   │       ├── Sfx: ual
│   │       └── Punct: -
│   └── N
│       └── report
│           ├── Pfx: re
│           └── Root: port
└── VP
    ├── V + Aux
    │   ├── had
    │   └── been
    └── CV
        └── overemphasized
            ├── Pfx: over
            ├── Root: emphasis
            ├── Sfx: ize
            └── Sfx: d
```
- Each leaf node represents a lexical token.
- Punctuation (Punct) is treated as a separate token where applicable (e.g., the hyphen in "semi-annual").
- The possessive "'s" is treated as a separate token attached to the noun "project".
- Compound words are further tokenized into their constituent prefixes, roots, and suffixes.
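To make the idea concrete, here is a minimal rule-based sketch in Python. The affix lists and the function name are my own illustrative choices, and greedy string stripping of this kind cannot handle spelling changes (e.g. it yields "believab" rather than "believe" + "able" for "unbelievably"); a production tokenizer would use a learned subword vocabulary or a morphological lexicon instead.

```python
# Toy affix inventories; illustrative only, not a real morphological lexicon.
PREFIXES = ["over", "semi", "un", "re", "pro"]
SUFFIXES = ["ation", "able", "ual", "ion", "ate", "ize", "ly", "d"]

def split_affixes(word):
    """Greedily strip known prefixes and suffixes; return (label, text) tokens."""
    tokens, w = [], word.lower()
    stripping = True
    while stripping:  # peel prefixes from the left
        stripping = False
        for p in PREFIXES:
            if w.startswith(p) and len(w) > len(p) + 2:
                tokens.append(("Pfx", p))
                w = w[len(p):]
                stripping = True
                break
    suffixes = []
    stripping = True
    while stripping:  # peel suffixes from the right, longest match first
        stripping = False
        for s in sorted(SUFFIXES, key=len, reverse=True):
            if w.endswith(s) and len(w) > len(s) + 2:
                suffixes.append(("Sfx", s))
                w = w[: -len(s)]
                stripping = True
                break
    return tokens + [("Root", w)] + suffixes[::-1]

print(split_affixes("reforestation"))
# [('Pfx', 're'), ('Root', 'forest'), ('Sfx', 'ation')]
print(split_affixes("overemphasized"))
# [('Pfx', 'over'), ('Root', 'emphas'), ('Sfx', 'ize'), ('Sfx', 'd')]
```

Note that the granularity differs from the tree above (e.g. "ation" is stripped as one piece rather than "ate" + "ion"); choosing the right granularity is exactly the kind of design decision a tokenizer project has to settle.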
Currently, no tokenizer exists for the Pali language. Tokenizers are frequently written in Python using word splitters, regular expressions, and other string operations; see, for example, the tokenizers in the NLTK, spaCy, Hugging Face Transformers, and StanfordNLP packages.
Part of the guts of such a tokenizer might look like:
```python
import re

class AdvancedPaliTokenizer:
    def __init__(self):
        # Matches a stem followed by a common Pali nominal case ending.
        self.compound_word_pattern = re.compile(
            r'\b(\w+?)(aṃ|o|ā|ī|ū|e|ena|ehi|esu|assa|ānaṃ|āya|ato'
            r'|asmā|amhā|āhi|ābhi|anā|ati|āti|āyaṃ)\b'
        )
```
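Fleshed out into a runnable sketch, the core splitting step might look like the following. The class and method names are my own, and the ending list is only a sample, not a complete inventory of Pali case endings; longer endings are listed first so that, e.g., "esu" is preferred over bare "e".

```python
import re

# Sample of Pali nominal case endings, longest first. Illustrative, not exhaustive.
PALI_ENDINGS = "ānaṃ|assa|asmā|amhā|ābhi|āhi|āyaṃ|ena|ehi|esu|āya|ato|aṃ|ā|ī|ū|e|o"

class PaliTokenizerSketch:
    """Hypothetical core of a rule-based Pali tokenizer: split an inflected
    noun form into a stem token and a case-ending token."""

    def __init__(self):
        # Non-greedy stem, then one known ending, anchored to the whole word.
        self.ending_pattern = re.compile(r"^(\w+?)(" + PALI_ENDINGS + r")$")

    def split(self, word):
        m = self.ending_pattern.match(word)
        if m:
            return [m.group(1), m.group(2)]
        return [word]  # no known ending: emit the word unsplit

print(PaliTokenizerSketch().split("dhammesu"))  # ['dhamm', 'esu']
print(PaliTokenizerSketch().split("buddho"))    # ['buddh', 'o']
```

A purely rule-based splitter like this is only a starting point: it knows nothing about sandhi or compounds, and it will happily "split" an uninflected word that merely ends in one of the listed strings.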
I'm here seeking collaborators for this project of writing a Pali-language tokenizer, most likely built on the widely used Hugging Face Transformers library.
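For anyone considering joining in: a data-driven alternative to hand-written rules is to train a subword vocabulary directly on the canon with the Hugging Face `tokenizers` library. The snippet below is a minimal sketch, and the three corpus lines are a tiny stand-in for a real training corpus (e.g. the full Pali canon from SuttaCentral).

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Tiny stand-in corpus; a real project would stream the whole canon here.
corpus = [
    "evaṃ me sutaṃ ekaṃ samayaṃ bhagavā",
    "sabbe sattā averā hontu",
    "sabbe sattā abyāpajjā hontu",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=200, min_frequency=1, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("sabbe sattā hontu")
print(encoding.tokens)
```

The trade-off versus the rule-based approach: a trained BPE/Unigram vocabulary picks up frequent stems and endings automatically from data, but its splits are statistical, not guaranteed to align with true Pali morpheme boundaries the way a curated ending list would.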