Wildcard word search of entire Pali canon

Somewhat off-topic, but I don’t know where else to look or ask…

How can I do a word search (ideally with “wildcard” character capability) of the entire Pali Canon (or any book or collection therein)? I’m thinking of Pali in romanized script.

(Perhaps not so OFF-TOPIC, as this would be useful in compiling a dictionary approaching or surpassing the scope of the PTS one – of 100+ years ago.)

It has occurred to me that it’s possible to copy entire texts (the contents of each opened sub-window) out of CST 4.0 into MSWord files; all the formatting, diacritics, etc. do come through. The files can then be put in a tree of directories following the structure in CST 4.0. I’ve done this with some volumes.

Upside: searchable for text (but only exact text); can otherwise be manipulated for personal use, e.g. adding comments, color emphasis, etc.; also usable on Apple OS as well as Windows OS.
Downside: lacks the menu buttons to call up commentary and sub-commentary texts.

Further, I could use a GREP-type (Unix) function to search multiple files and directories at once. (I have a WGREP app running on Windows, but it’s from the 1990s and doesn’t deal nicely with file or directory names longer than 8 characters. It could work with a cloned directory-file structure of the CST, as described above, using 8-character, uppercase (coded) names.)

If this (CST -> MSWord files) would duplicate some already given capability, please let me know – to save time & effort.

You can, of course, do a search across the Pali canon on SuttaCentral. Use * for wildcards. But it seems like you want to do this for a set of Word documents? There you’ve lost me, I’m afraid.

I keep copies of the Pali texts locally as HTML/XML files. To do more specific kinds of search, for example with regexes, I usually just use my favorite text editor, Sublime Text. It allows you to search across multiple nested folders. It is cross-platform, and you can install it for free.

For very powerful and flexible searches, nothing beats the command line, but I find that I rarely have to use it for this.

Thanks for replying, Bhante.

To clarify: I’m looking to search for Pali words in the Canon, with options to search all,
or, e.g., just the Nikāyas,
or, e.g., just the Dīgha Nikāya,

and capture the results to a file (for study, and for manipulation – culling, extracting, etc.).
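For what it’s worth, the "search all, or just one collection, and capture the results to a file" part can be sketched in a few lines of Python. This is only a rough sketch under assumptions: the texts are UTF-8 plain-text files ending in `.txt`, each collection sits in its own sub-directory, and the function name `search_tree` and the directory names in the usage note are made up for illustration.

```python
import re
import unicodedata
from pathlib import Path


def search_tree(root, pattern, out_path, suffix=".txt"):
    """Search every `suffix` file under `root`; write matching lines to `out_path`.

    Returns the number of matching lines. Keep `out_path` outside `root`
    (or give it a different suffix) so the results file is not searched too.
    """
    # Normalize to NFC so "ā" typed as one character also matches text stored
    # as "a" + combining macron, and so \w treats it as a single letter.
    regex = re.compile(unicodedata.normalize("NFC", pattern))
    hits = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(root).rglob("*" + suffix)):
            text = path.read_text(encoding="utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                line = unicodedata.normalize("NFC", line)
                if regex.search(line):
                    # One result per line: file, line number, matching text.
                    out.write(f"{path}:{lineno}: {line.strip()}\n")
                    hits += 1
    return hits
```

Scoping the search is then just a matter of which folder is passed in: something like `search_tree("pali_canon", r"\bnibbān\w*", "hits.txt")` for everything, or `search_tree("pali_canon/dn", ...)` for one collection, assuming a folder layout along those lines.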

(The super-search would be, as above, AND with wildcards (pattern matching, as in the ultimate string-manipulating programming language, SNOBOL), AND, given a word root/stem, find all inflected forms, or some specifiable subset… One can always dream…)
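A very crude approximation of the "given a stem, find the forms" idea is to list every distinct word in the files that begins with the stem, together with how often it occurs. The sketch below does only that. It is plain prefix matching, not morphological analysis: it will miss forms altered by sandhi, prefixed or compounded forms, and stems that change shape, and it will pick up unrelated words that happen to share the prefix. The function name `forms_with_stem` and the assumption of UTF-8 `.txt` files are mine.

```python
import re
import unicodedata
from collections import Counter
from pathlib import Path

WORD = re.compile(r"\w+")  # runs of letters/digits; Unicode-aware for str input


def forms_with_stem(root, stem, suffix=".txt"):
    """Count each distinct word under `root` that starts with `stem`."""
    stem = unicodedata.normalize("NFC", stem).lower()
    counts = Counter()
    for path in Path(root).rglob("*" + suffix):
        text = path.read_text(encoding="utf-8", errors="replace")
        # NFC keeps letters with diacritics as single word characters.
        text = unicodedata.normalize("NFC", text).lower()
        for word in WORD.findall(text):
            if word.startswith(stem):
                counts[word] += 1
    return counts
```

Calling `forms_with_stem(folder, "dhamm").most_common()` would give a frequency-sorted list that could then be culled by hand, which may be a reasonable starting point for dictionary work.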

Using MSWord documents (could as well be OpenOffice files) is just because that’s the one way I’ve found to capture and search Pali from CST 4.0, and can be used on Apple OS as well as Windows. (And I prefer to work off-line, or at least have that capability.)

So, next is to check out SuttaCentral searching…

By CST 4.0 I assume you’re referring to the app released by VRI, yes? Doesn’t that have search?

On SC we currently don’t have the ability to filter search by collection, but it is on our to-do list.

If it’s of any help, here’s the VRI text in plain text files. Many text editors—I mentioned Sublime Text above—will support search across multiple files and folders.

pali_canon.7z (3.5 MB)

Yes, Chaṭṭha Saṅgāyana Tipitaka Version 4.0 (CST4) at http://www.tipitaka.org/cst4, which runs only on Windows platforms. Its search works only by selecting one or more whole collections from a list (nikāya, abhidhamma, etc.). I’ve tried it a couple of times, and gave up on it.

Thanks. I was able to download “pali_canon.7z (3.5 MB)” and unzip it on Apple. It looks like fewer collections than in CST 4.0, perhaps without the commentaries, subcommentaries, or Vism? I will give it a whirl (take a closer look).

I also downloaded Sublime Text, to see whether I can still handle what looks like a geek-level editor (I did 30+ years of software, but not for the past 16 years, and that world has changed beyond recognition since: anicca in the face).

I left out the later texts, let me know if you want them.

Sublime Text is so good, it actually lives up to its name.

Wildcards are just a small variant of regular expressions. GNU Grep should be able to handle all the string matching, and of course it can do recursive searches too, using the “-r” flag. Just make sure you are using the “-E” flag to enable POSIX Extended Regular Expressions.
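To illustrate the point that wildcards are a small subset of regular expressions, here is a minimal Python sketch that turns a wildcard pattern into a regex. The conventions are my own assumptions, not those of grep or of any particular search site: `*` stands for any run of word characters, `?` for exactly one, and the pattern has to match a whole word.

```python
import re


def wildcard_to_regex(pattern):
    """Compile a whole-word wildcard pattern ('*', '?') into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")   # any run of word characters, possibly empty
        elif ch == "?":
            parts.append(r"\w")    # exactly one word character
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    # \b on both sides anchors the match to word boundaries.
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)
```

For example, `wildcard_to_regex("nibb*").findall(text)` returns every whole word in `text` that starts with "nibb". Letters with diacritics count as word characters here as long as the text uses precomposed (NFC) characters.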

If you are using Windows rather than Linux or BSD, the command line environment will not be as rich. However, you can download Cygwin to get a Linux-like command line environment that includes all the standard utilities including GNU Grep.
