SLTP print-on-demand project

I’m working on a clean, print-oriented Pali edition based on the SLTP Tipiṭaka source files. The immediate goal is readable print-on-demand formatting, and I’m building a reproducible cleanup/parser pipeline behind it. The pipeline takes the source files, such as those found on Access to Insight, and converts them into a cleaner format based on rules.

I’m not using SuttaCentral texts as my source, and this is not intended as a replacement for SuttaCentral.

At this stage I’m looking for informal feedback: whether headings and line breaks look structurally sensible, or whether headings are being handled incorrectly. I also want to create a numbering system for the lines of the suttas, so input on what is sensible there would be helpful, such as suggestions about what should and shouldn’t constitute a line.

I’m not asking for edited text, translation work, or formal contributions. If anyone is willing to glance at a sample PDF and point out obvious issues, I’d be grateful. I figured SC might be the best place to reach out for assistance since many people here are familiar with Pali, and I am not.

majjhima-nikaya-02-majjhimapannasa.pdf (8.1 MB)

Then why in tarnation are you doing this project?

Sorry, what edition is this?

Ooh, excellent use of tarnation!

But yeah, if you’re trying to segment Pali texts, you need to know Pali.

I believe that this is one of the early attempts to publish a Romanized edition of the Buddha Jayanti Tripitaka.

I hate to be the bearer of bad news, but that edition is so full of typos that it is near useless, if memory serves me.

Oh yeah, the same text is on GRETIL. Yes, it’s useless as a mainline text, but can occasionally be consulted for variants. People have recorded corrections to it, but AFAIK none have been published.

I should probably clarify the scope of what I’m attempting.

I’m not trying to produce a new scholarly or critical edition of the Pāli Canon, nor am I attempting doctrinal or linguistic revision of the text itself.

The project is mostly infrastructural and print-oriented. My interest is in:

  • reconstructing cleaner formatting from public-domain digital source material,

  • expanding existing abbreviations for readability,

  • standardizing layout and structural presentation,

  • correcting obvious transcription/OCR artifacts where they can be identified mechanically,

  • and producing a cleaner print-format option.
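As a rough illustration of what "identified mechanically" could mean here, the following is a minimal sketch of an ordered rule table that logs which rules fired on each line. The rule names, the `[page N]` marker format, and the `clean_line` function are all hypothetical and chosen for illustration; they are not taken from the actual SLTP files or from any existing tooling.

```python
import re
import unicodedata

# Hypothetical rule table: (name, pattern, replacement).
# The "[page N]" marker format is an assumption for illustration only.
RULES = [
    ("collapse_spaces", re.compile(r"[ \t]+"), " "),
    ("strip_page_marker", re.compile(r"\s*\[page \d+\]"), ""),
]

def clean_line(line):
    """Apply each rule in order; return the cleaned text and the names of rules that changed it."""
    # Normalize to NFC so that letter + combining diacritic becomes one code point.
    text = unicodedata.normalize("NFC", line)
    fired = []
    for name, pattern, repl in RULES:
        new_text = pattern.sub(repl, text)
        if new_text != text:
            fired.append(name)
        text = new_text
    return text.strip(), fired
```

Returning the list of fired rules alongside the text keeps every change traceable to a named rule, which is what would make such a pipeline reviewable rather than opaque.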

I completely understand that authoritative textual criticism and philological editing require substantial Pāli expertise. That is outside the scope of what I’m trying to do here.

If there are better public-domain Romanized source texts than the older SLTP corpus, I would prefer to use those instead. My main concern is using source material with clear reuse rights while keeping editorial changes to a minimum.

As I understand it, from an accuracy standpoint almost any other digital edition would be better.

And as far as I know, any errors would not be due to OCR, because the original project was first a transliteration into Roman characters from the Sinhala print edition. So the errors would be human in nature. In those days I doubt there would have been OCR for Sinhala. I don’t know if that matters, but I wanted to give the history that I know of.

That is helpful context, thank you.

I am genuinely curious what specific issues make the corpus so unreliable as a mainline text. I’m trying to understand whether the problems are fundamentally unfixable, or something that could be cleaned up through specific rules.

It’s just riddled with typos. It was a noble project, as it was one of the very first attempts to make the Pali texts available in digital form. However, it wasn’t proofread and corrected.

I think Bhante Anandajoti might know more about it. You can reach out to him through his website. (If you do, please report back to us (with his permission)).

When I was learning Pali I was pulling texts from the SLTP but the problems were enough that I stopped using it. As I recall there could be at least one typo in each paragraph.

Which source files are you using exactly? Maybe someone has undertaken to fix the originals. It’s a monumental task.

I do think it is a worthwhile project to make a good PoD edition of the Pali, especially for monasteries where people don’t have access to computers. I just wouldn’t pick (the original) SLTP data.

The version of SLTP is the error-filled one from GRETIL.

If it is just an issue of typos, might that be straightforward to automate?

What is Bhante Anandajoti’s website?

It really isn’t. It’s so bad, the title is sometimes misspelled (“Mijjhimanikaya”).

I’m sorry, but you still haven’t really answered my question, @math3matica. If you don’t read Pāli, then who are you doing this project for? And why not use the SuttaCentral segmented data? The only restriction Bhante @sujato has placed on it is “no AI”, if I recall correctly. A print-on-demand edition of our Pāḷi text should be fair game as a derivative work (no, Bhante :folded_hands:?)

This really depends on who your audience is. There is no universally accepted line numbering system as there is in the Christian Bible. Heck, for the SN and AN the sutta numbering system isn’t universal.

If you are starting your own Pali university, then I can imagine that you might invent your own line numbering system if you were forcing all the students to buy the same text.

Otherwise a new numbering system is kind of pointless. It’s only needed when someone needs to refer someone else to a specific spot. It would make much more sense to just keep the PTS numbers already in the manuscript:


At least that system is well known and present on SuttaCentral.

Also, I question your luxurious use of white-space. There is no need to have the sutta citation (e.g. MN 1) as part of every single line number. Better to have running headers that indicate the sutta. And in general there is just so much white space and so little text on each page. Is the idea that students would want to do their own annotations?

With my librarian hat on, I just think about how much shelf space would be needed to hold even just the first four nikayas.

So I come back to the question… who is your audience? We can’t give the best feedback if we don’t know that.

@sujato I understand that you have strong reservations regarding AI in this area, but the type of project I am describing is specifically the kind of editorial workflow where modern AI techniques can assist when constrained by deterministic rules and transparent pipelines. I am not proposing an opaque system which silently rewrites the canon. The original SLTP text would remain preserved unchanged, while corrections and normalization layers would be non-destructive and reviewable. My point is mainly that some categories of editorial work which would historically require enormous human labor may now be tractable at much lower cost.
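To make "non-destructive and reviewable" more concrete, here is a minimal sketch under stated assumptions: corrections are stored as separate records and applied to a copy, so the source lines are never modified. The `Correction` record and `apply_corrections` function are hypothetical names for illustration, not existing tooling; the example typo is the misspelled title mentioned earlier in this thread.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Correction:
    line_no: int   # 0-based index into the source lines
    old: str       # text expected to be present in the source line
    new: str       # proposed replacement
    reason: str    # human-readable justification for reviewers

def apply_corrections(source_lines, corrections):
    """Return (corrected copy, skipped corrections); the source list is left untouched."""
    result = list(source_lines)
    skipped = []
    for c in corrections:
        if c.old in result[c.line_no]:
            # Replace only the first occurrence on that line.
            result[c.line_no] = result[c.line_no].replace(c.old, c.new, 1)
        else:
            # The expected text was not found; flag it for human review.
            skipped.append(c)
    return result, skipped

source = ["Mijjhimanikaya"]
fixes = [Correction(0, "Mijjhimanikaya", "Majjhimanikaya", "misspelled title")]
corrected, skipped = apply_corrections(source, fixes)
```

A sketch like this only shows how a correction could be recorded and audited. Deciding whether a given correction is right still requires someone checking it against the printed BJT.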

Also, from what I understand, the SLTP was not some minor side effort. It was a substantial digitization project reflecting the BJT canon. While it clearly contains many errors, my interest is precisely in whether those issues can be systematically normalized in a transparent way while still preserving the original source separately.

Regarding the size concerns: yes, a fully expanded edition would become physically enormous. But one advantage of a parser-based approach is that both abbreviated and unabbreviated editions could be generated from the same structured source depending on the audience and use case. Modern POD and digital distribution change the economics of this significantly. A fully expanded edition can exist without requiring someone to warehouse and print an entire run of it.

The line IDs are also not intended to become a replacement for PTS references or a new universal scholarly citation system. They are mainly intended as stable internal anchors for this specific corpus and related tooling, in much the same way that SC segment IDs function within SC. I have another project in mind where the text would be associated with classification, organization, and cross-linking with other Buddhist media and teachings, and stable internal IDs are important for that.
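As a rough sketch of what such internal anchors might look like, the snippet below pairs each non-empty line with an ID built from a sutta reference and a running counter. The `assign_line_ids` name and the `mn51:3` ID format are assumptions for illustration only; they are not the scheme used in the sample PDF.

```python
def assign_line_ids(sutta_ref, lines):
    """Pair each non-empty line with an ID such as 'mn51:3' (the format is an assumption)."""
    ids = []
    n = 0
    for line in lines:
        if not line.strip():
            continue  # blank lines get no ID
        n += 1
        ids.append((f"{sutta_ref}:{n}", line.strip()))
    return ids
```

Counter-based IDs like these stay stable only if they are assigned once and then stored with the text. Recomputing them after the line breaks change would shift every later ID.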

And finally, one of my strongest motivations here is that I want the resulting work to remain entirely public domain. My intent is to host:

  • the original SLTP,

  • the parser/tooling,

  • and the normalized corpus

without ownership, attribution requirements, or downstream restrictions.

It was not a minor effort. It was simply a failed major effort. This is very common in the real world with such a massive project.

The SLTP is highly flawed. Nothing you do to “normalize” it will make it correct according to the BJT, which is what it purports to be. Even if you keep it as two layers, all you will have is:

Layer 1: A failed attempt to transcribe the BJT
Layer 2: Imagined corrections to the failed transcription

I can’t imagine anyone would want to work with this text.

I’m going to keep assuming you have good intentions, but you don’t seem to understand how flawed your source material is in the SLTP text, and you don’t seem open to the advice you are getting. No amount of AI magic can fix that. If you really cared about a digitized Roman edition of the BJT, you would be much better off using modern OCR techniques to generate a fresh text.

Maybe I missed something, but what is a “fully expanded” edition? What do you mean by “abbreviated and unabbreviated”?

Yeah, this is a wrong-headed project. You can’t “fix” Pali texts with AI, nor should you even want to. My recommendation is to drop this project. AI has no place in Buddhist studies.

I respectfully disagree with the implication that AI-assisted work is somehow opposed to the preservation or transmission of the Dhamma.

Throughout history, Buddhist texts have continually moved through new mediums and technologies: oral recitation, palm-leaf manuscripts, woodblock printing, modern print editions, digital texts, and online databases. Each transition increased accessibility and changed how people interacted with the teachings. I see AI as part of that same continuum.

To be clear, I am not claiming AI replaces knowledge of Pāli, monastics, scholars, or careful human judgment. It obviously does not. But I do think these tools can meaningfully assist in making the teachings more accessible to ordinary people by helping with formatting, comparison of parallel passages, structural analysis, indexing, searchability, error detection, readability, and large-scale organization of texts.

My intention here is not to produce a new authoritative canon or replace projects like SuttaCentral. I think SuttaCentral being human based is a good choice, but I also believe there is a place for AI translation and utilization in other projects. I am trying to create another freely available iteration of the canon based on the SLTP source files, while building reproducible tooling around the cleanup and formatting process so the transformations are transparent and inspectable.

I also do not think increased accessibility to the teachings should be viewed negatively. If AI tools help more people encounter, search, read, organize, or engage with the Dhamma who otherwise never would have, I think that is something worth exploring responsibly rather than dismissing outright.

At the end of the day, isn’t the goal to make the teachings accessible to as many people as possible?