One of my uni end-of-year break projects is to try to create a website hosting the World Tipitaka Edition of the Pāli Tipiṭaka. The World Tipitaka edition is the same edition used by Suttacentral.
As I understand it, the website used to exist but has been long defunct. Fortunately, yuttadhammo on Github saved a version of it before it went away.
I have decided to resurrect this site, converting it from a bespoke XML format to a modern website using Bootstrap (yes, I know Bootstrap is no longer regarded as fashionable, but I wanted a quick and dirty solution - the current version represents about 2 days of work by me).
Why resurrect this site? I wanted a clean, minimalist website hosting the Tipitaka with no dictionaries, translations, commentary etc.
I also plan to convert the source of the texts from HTML to Markdown (the universal document format). Why Markdown? So that I can copy and paste the texts into my second brain:
The website is completely in the public domain - I welcome volunteers to improve it (let me know your Github username, and I can give you access).
More importantly, particularly since Suttacentral has implemented a No AI policy, this website can be used for AI because of its CC0 licence. This means it can be used by anyone doing AI research. Of course, if you really want to, you can use it for nefarious purposes such as creating fake suttas, fake AI translations, etc… That’s up to you - I do not wish to comment on such usage.
Thanks for this. Do you plan on sharing the github repo?
I realize it would be a big project, but it would be wonderful if the links were semantic as they are on SuttaCentral. That might make it easier to integrate into other projects. So instead of
It’s currently just a bunch of HTML files; there isn’t even a static site generator. Once I convert to Markdown, I’ll need to implement an SSG. Not sure which yet - I want something super lightweight.
In terms of semantic links rather than the weird numbers, please bear in mind this is a straight conversion of the original website, using a Python script. I converted the XML back into HTML and then cleaned out all the javascript junk, but preserved the file names.
Interestingly, the site does have a semantic naming system (eg. /tipitaka/1V/…) but this naming system, although persisting in the code (as HTML name attributes in the links), is not actually implemented in the filesystem. I tried writing a script to detect the proper hierarchy of the files from the HTML name attributes, but wasn’t sure my code worked, and abandoned it.
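For anyone curious what such a script might look like, here is a minimal sketch of recovering semantic paths from the HTML name attributes. The file names, attribute values, and the regex itself are illustrative assumptions, not taken from the actual repository; the optional "data/" prefix handles the corrupted attributes mentioned below.

```python
import re
from pathlib import Path

# Backlink pattern (assumed form); the optional "data/" prefix strips
# the corruption seen in some attributes.
NAME_RE = re.compile(r'name="(?:data/)?/?(tipitaka/[^"]+)"')

def infer_paths(html_dir: str) -> dict[str, str]:
    """Map numeric file stems (e.g. '263054') to the semantic path
    named in a backlink, where one exists."""
    paths = {}
    for f in Path(html_dir).glob("*.html"):
        m = NAME_RE.search(f.read_text(encoding="utf-8"))
        if m:
            paths[f.stem] = "/" + m.group(1)  # normalise the leading slash
    return paths
```

Files without backlinks simply don't appear in the result, which is exactly the gap that makes the full problem hard.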
Like I said, I welcome volunteers to clean up the site, and move files to proper names etc. If someone wants my half written script to rename the files semantically, they are more than welcome to.
The other option is to access the Suttacentral json files (in bilara-data) and convert them into HTML (this is trivial to implement using a few lines of code, merging the root and html cognates). I also implemented this, and it is working, but unfortunately I will then be burdened by Suttacentral’s no AI policy, which I wanted to avoid. This option will allow me to retain Suttacentral segment numbering. If @sujato is willing to allow me to transform in this manner and republish as CC0, then I am happy to spend more work.
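The merge really is only a few lines. Here is a hedged sketch, assuming the usual bilara-data layout where the root file maps segment ids to Pali text and the html cognate maps the same ids to markup templates containing a "{}" placeholder (file paths here are illustrative; the real repo nests them under root/ and html/ subdirectories):

```python
import json

def merge_bilara(root_path: str, html_path: str) -> str:
    """Substitute each segment's root text into its markup template,
    preserving segment order from the html file."""
    with open(root_path, encoding="utf-8") as f:
        root = json.load(f)
    with open(html_path, encoding="utf-8") as f:
        html = json.load(f)
    return "".join(tpl.format(root.get(seg, "")) for seg, tpl in html.items())
```

This also preserves the segment ids, so Suttacentral's segment numbering would carry over for free.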
OK, I’ve done a bit more investigation into whether it’s possible to “recover” a semantic hierarchy of files based on the HTML name attributes, and unfortunately there are too many exceptions.
For example, file no. “263054” can be inferred as “/tipitaka/37P1/12/12.1/12.1.2/12.1.2.2/Suddha” because there is a backlink with that HTML name attribute associated with that file.
However, not all files have backlinks, which means I need to find another way of inferring the potential location of these files in the hierarchy. Looking at the file contents, it is possible in some cases to infer the hierarchy, but it is not easy (requires mangling of unrelated contents in the file, so not something I can create a regex pattern to search for).
Also, in some cases, the name attribute appears to be corrupted, i.e. it has “data/” prepended.
So, the short answer is it may be doable, but not easy and requires manual intervention. Given the amount of effort, I am going to leave it as an “exercise for the industrious potential volunteer”.
Well, after sleeping, my subconscious (which is far smarter than me) found a way around the exceptions.
Basically, what I did was write a Python script that inferred the semantic structure as far as possible using various sneaky methods, and then generated a list of exceptions to be fixed manually.
The main exceptions are files with non-ASCII characters in them, which I needed to convert to ASCII by manual editing.
After this, the script was able to read the manually fixed exceptions and generate a complete semantic tree.
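The merge step at the end is the simple part. A minimal sketch of that two-pass workflow, with an assumed JSON format and file name for the manual exceptions list:

```python
import json
from pathlib import Path

def build_tree(inferred: dict[str, str],
               exceptions_file: str = "exceptions.json") -> dict[str, str]:
    """Combine automatically inferred paths with manually fixed ones;
    manual fixes win over automatic guesses."""
    fixed = {}
    p = Path(exceptions_file)
    if p.exists():
        fixed = json.loads(p.read_text(encoding="utf-8"))
    return {**inferred, **fixed}
```

On the first run the exceptions file doesn't exist yet, so the script just reports what it couldn't infer; after manual editing, a second run produces the complete tree.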
So, v2 of the site uses a semantic file structure for the contents. To access DN1, for example, the link is now
This may not seem like much of an improvement, but as you progress down the tree, you will realise all files follow a semantic hierarchy, which is my best guess as to the original semantic structure of the database.
Please let me know if you notice any broken links - there shouldn’t be any, but I often make mistakes and may not have detected an edge case (or my manual exceptions list has a mistake).
Previously, I asked for your guidance regarding the Buddhist scripture data that I wanted to use on my website.
Now that my site is more or less complete, I would like to show it to you.
Here is my site:
I am building a system where users can vote on translations so that better translations remain.
I have a question about this data.
When I look at GitHub, I see a tree structure like 10M and 11M, but I can’t tell which directories belong to which sutta.
Like your website, I’d like to have data based on the Tipitaka structure.
I would appreciate it if you could tell me how to reconstruct data based on the Tipitaka structure from the GitHub data.
Wow, your site looks interesting. Best of luck for the project!
In terms of the semantic structure on tipitaka2500, it kind of works like this
1V to 5V = Vinaya
6D - 8D = DN
9M-11M = MN
etc.
The number is just a sequential number for the ordering of collections, the letter signifies the collection.
For example, if you look in 6D, it has 13 parts corresponding to the first 13 DN suttas (the Sīlakkhandhavagga); 7D corresponds to the Mahāvagga, etc.
So the new semantic structure is quite easy to understand and corresponds directly to the arrangement of texts in the Tipiṭaka.
Hope this makes it clear. Suttacentral has exactly the same arrangement, so if in doubt compare the website structure to the structure in Suttacentral.
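To make the naming rule concrete, here is a tiny lookup table in code. The numeric ranges and letters come from the explanation above; the full-name labels are my own shorthand, and the table only covers the collections mentioned here:

```python
# Trailing letter names the collection; the leading number is just a
# sequential ordering of volumes (e.g. 6D, 7D, 8D are all DN).
COLLECTIONS = {
    "V": "Vinaya",           # 1V - 5V
    "D": "Dīgha Nikāya",     # 6D - 8D
    "M": "Majjhima Nikāya",  # 9M - 11M
}

def collection_of(dirname: str) -> str:
    """'6D' -> 'Dīgha Nikāya', '10M' -> 'Majjhima Nikāya'."""
    return COLLECTIONS[dirname.lstrip("0123456789")]
```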
Thank you! Your explanation is very clear.
I will refer to the information you provided and try to reconstruct the GitHub data structure while comparing it with Suttacentral.
If there’s anything else, please let me know!
I wish you the best in using the information in your project. Good luck!
Before you start using the information, please wait for me to convert the texts to Markdown (which I will do in a few days’ time, and I will announce when I have finished). I will do this in a separate repository, which preserves the structure of this website, but all the content will be in pure Markdown.
Because Markdown is easily ingestible into AI LLMs without further processing, this will greatly facilitate the use of the texts for AI Research. For example, applying textual analysis or cluster analysis to group the texts by linguistic similarity etc.
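As a toy illustration of the kind of analysis plain-text sources make easy, here is pairwise lexical similarity using a bag-of-words cosine measure. This is a from-scratch sketch with no external libraries; real clustering work would use something like scikit-learn, and the tokenisation here (lowercase whitespace splitting) is a deliberate simplification:

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity between the word-frequency vectors of two texts:
    1.0 for identical word distributions, 0.0 for no shared words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Computing this over every pair of texts gives a similarity matrix you could feed into a clustering algorithm.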
OK I have made some changes to the website. I finally learnt BeautifulSoup yesterday, so now I am armed and dangerous. Previously I was doing refactoring using regex replacement (because I am old school - that is all I know). BeautifulSoup just makes it so much easier to make complex transformations.
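As an example of the kind of transformation BeautifulSoup simplifies, here is a sketch of adding Bootstrap's table classes to every table in a page - the sort of thing that is fiddly with regex but trivial with a parsed tree. The class names are real Bootstrap utilities; the function and input HTML are made up for illustration:

```python
from bs4 import BeautifulSoup

def bootstrapify_tables(html: str) -> str:
    """Append Bootstrap table classes to every <table>, keeping any
    classes the element already has."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        # bs4 exposes class as a list (it is a multi-valued attribute)
        table["class"] = table.get("class", []) + ["table", "table-striped"]
    return str(soup)
```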
I have completely refactored the HTML from bespoke CSS to vanilla Bootstrap. As a result, I think the website now looks much nicer, less daggy.
I have changed the colour scheme to pink and purple, which are my favourite colours. I make no apologies for this. The colour scheme is actually consistent with all my other websites, and is called Rosely (https://rosely.hellotham.com).
I have also enabled breadcrumbs so that you can visually see where you are in the semantic hierarchy on every page. Curiously, this feature was implemented on the original website, but disabled (probably because the website author couldn’t get the semantic naming working properly). Now that I have a semantic structure, the code can be re-enabled.
That’s it for now. Time to move on to generating Markdown. I am quite pleased with how the website has turned out; I think I will keep it for posterity as it is a very useful reference to the World Tipitaka Edition, preserving their semantic hierarchy. Now that I am used to the hierarchy, I prefer it to Suttacentral’s segment numbering system, but that is of course personal preference.
Just a little note of something you might already know about, but the code you’re using implements a light-mode/dark-mode/auto control that appears in the lower right corner. It currently seems to have no effect.
I have created an initial version of the Tipitaka 2500 content as Markdown. I have left the current site as is, and have created a new repository hosting just the Markdown files.
If you are interested, you can browse through the repository - it is organised in the same structure as the website, so if you find anything interesting you would like to investigate further, you can find the equivalent Markdown in the corresponding location in the file structure.
I have also fixed a number of minor formatting issues on the website - I found these issues whilst creating the Markdown files, so overall the website has improved as well. It seems the authors of the content used a variety of non-standard markup, which I have tried to accommodate in both HTML and Markdown.
I was going to use a static site generator to build the Markdown files into a working website, when I realised I don’t need to - Github will render Markdown files, so all the files can be reviewed on Github itself with no additional work required.
So I have removed the front matter to make the files more readable, and converted all links to relative links so you can clone the repository and the files are navigable (using any Markdown editor that can traverse links such as Obsidian) no matter where you put them on your computer.
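Rewriting the links is mechanical. A minimal sketch, assuming the links are standard Markdown links with site-absolute targets (the link text, paths, and regex are illustrative):

```python
import posixpath
import re

# Matches the target of a site-absolute Markdown link: ](/path/file.md)
LINK_RE = re.compile(r"\]\((/[^)]+)\)")

def relativize_links(markdown: str, current_file: str) -> str:
    """Rewrite [text](/abs/path.md) links relative to current_file's
    directory, so the tree works wherever the repo is cloned."""
    here = posixpath.dirname(current_file)
    return LINK_RE.sub(
        lambda m: "](" + posixpath.relpath(m.group(1), here or "/") + ")",
        markdown,
    )
```

Run over every file with its own path as `current_file`, sibling links collapse to bare filenames and cross-directory links gain the appropriate `../` prefixes.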
If you visit the repository now, I have linked all the content in the README so you can start browsing immediately.