I would like to contribute to this project, starting with a general edit and correction of my old translations from the Madhyama Agama. Please let me know how to submit the revisions I’d like to make. Mainly, I’d like to correct simple typos and any accuracy issues I can find 15 years later. Also, if there are any Chinese texts in particular that the project would like translated, I can pick one up and work on it in my spare time.
Hello and a warm welcome to the forum! This is wonderful! Thanks so much!
Regarding the revisions, can I check what you tend to work with? If you happen to use GitHub, you should be able to find your translations within these folders. If you like, you could make the edits to these files and open a pull request (I should be alerted automatically, but to be sure you could also mention my GitHub username, Aminah-SC, and I will merge the change).
If GitHub is totally foreign, no problem at all, you can just send the updated translations to me and I’ll get them re-coded and replaced.
With respect to new translations, indeed, perhaps Bhante Sujato has some thoughts on that.
Thanks for the welcome. I’m happy to help out. I’ve been revisiting my old translations and updating them, and I had an interest in attempting just this type of project back in the early 2000s. It’s wonderful to see old daydreams take actual shape.
I’m familiar with GitHub, as I’ve learned to code in the past couple of years, so I’ll take a look at the link you shared and see what I can do next week.
Wow, so amazing to see you here! Thanks so much for the pioneering work you did on Agama translations, and I am so happy to hear you would like to get back into the game.
As I’m sure you know, the field is evolving, especially with the new translations of the Dirghagama, Madhyamagama, and Samyuktagama either completed or in progress. I confess that in the last few years, with my focus on the Pali, I have taken my eye off the ball a little when it comes to the Chinese texts, so I am not 100% up to date with the state of play with all these.
My primary interest going forward will be to create segmented translations of the Chinese texts, in the same way as our Pali texts are segmented. You can see an example of our modern approach to Pali texts here:
I’d dearly love to segment our Chinese texts and use them as the basis of translations. But it is no trivial task. The key is to divide the texts up semantically, rather than line-by-line as per the Taisho edition. For the Pali texts, we did this by relying primarily on the punctuation, corrected by hand. But the Taisho texts have very erratic punctuation so this will not work.
I discussed it briefly with Ven Analayo some time ago, and there is, or was at that time, no obvious solution to do this, other than simply adding the segments one by one.
However, recently Marcus Bingenheimer drew my attention to this site:
This provides a means to automatically punctuate Chinese Buddhist texts. I have been in touch with the developers, but remain a little unclear exactly how it works. However, assuming that it can work well, this could be a basis for creating a segmented Chinese text as a basis for translation.
If this is something you’d be interested in, we’d love to work together on it.
CBETA has punctuated most of the Taisho texts by hand, but it’s very haphazard, and the editors tended to create run-on sentences. It looks like you’re segmenting the Pali by sentence-final punctuation, so I can see why the CBETA punctuation would be an issue. The Taisho is unusable for this as well, I would agree. The trouble is that Chinese is a highly implicit language. Originally, characters served the purpose of indicating hard stops. Just determining where words begin and end is ambiguous at times, since they aren’t spaced as in other languages. I’ve attempted to write scripts that automate drafting Chinese, but it’s difficult when you have overlapping terms.
May I ask what the ultimate goal of this segmentation is? Are you attempting to prepare input for a translation script? It’s certainly possible to repunctuate CBETA’s source texts more carefully.
The segmentation is incredibly useful and powerful in all sorts of ways that we are only just discovering.
Originally we did it to prepare them for use on Pootle, our translation software. This allows for things like reusing repeated phrases.
Later, we realized we can use the segments to separate the text from the translation, and indeed, to separate out the markup, notes, variant readings, and everything else. If you look at the repo I posted above, you can see how incredibly clean everything is.
We can mix and match the different entities, all based on the segment ID. Currently we can:
show original text next to the translation,
reuse translated segments
and other things. In future, we will be able to:
show a note on the English translation for readers of the Italian translation,
play the audio of the Pali text for the specific phrase of the Spanish translation
match multiple editions of the same source text and automagically diff the variants
do a search for text and translation simultaneously (eg. search for cases where “dhamma” has been translated as “thing”)
match segments across original text (eg. find all cases of the formula for the four jhanas in Chinese and Pali)
and so on.
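As a rough illustration of the mix-and-match idea, here is a minimal Python sketch. It assumes, hypothetically, that each layer (source text, translation) is stored as a plain mapping from segment ID to string. The IDs, the sample strings, and the function names are invented for illustration and are not the project's actual data or API.

```python
# Hypothetical layers keyed by segment ID; IDs and strings are invented samples.
root = {
    "ex1:1.1": "manopubbaṅgamā dhammā",
    "ex1:1.2": "manoseṭṭhā manomayā",
}
translation = {
    "ex1:1.1": "Mind is the forerunner of things.",
    "ex1:1.2": "Mind is their chief; they are mind-made.",
}


def side_by_side(root, translation):
    """Pair each source segment with its translation (None if untranslated)."""
    return [(seg_id, text, translation.get(seg_id)) for seg_id, text in root.items()]


def search_both(root, translation, root_term, translated_term):
    """Return IDs of segments where root_term occurs in the source text
    and translated_term occurs in the matching translation segment."""
    hits = []
    for seg_id, text in root.items():
        rendered = translation.get(seg_id, "")
        if root_term in text and translated_term.lower() in rendered.lower():
            hits.append(seg_id)
    return hits


# Find segments where a form of "dhamma" was rendered with "thing".
print(search_both(root, translation, "dhamm", "thing"))
```

Because every layer shares the same keys, adding notes, audio offsets, or a second translation would just mean adding another mapping with the same segment IDs.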
It’s an incredibly powerful system, and it’s based on the idea that if you break the text up semantically—as opposed to line by line as in the Taisho—the translations can (more or less) be matched with the text, in any language. Of course it’s not 100%, it’s still natural language and there are always exceptions. But most of the time we can match up text and translation pretty well.
Going forward, our efforts will be based on this system, and all other texts are treated as “legacy”. If you see the incredible work being done by our team on Ven Brahmali’s new Vinaya translation, you can see it is possible to adapt a legacy translation to use the new system: but it takes a heck of a lot of work. It is much, much easier to make new translations directly based on the segmented original text.
In Voice.suttacentral.net, we use text segments to provide visually impaired users with a semantic segment-by-segment spoken sutta experience for Pali/English. We also use text segments for semantic search within Voice and it provides very efficient exhaustive searches for segmented content. Although I am Chinese, I do not know Chinese, but I would hope that we could make use of segmented Chinese text as well.
CBETA has converted the Taisho into XML files that then can be transformed into a paragraph or line-by-line format when you read the texts in their reader. They also, as I said, added modern punctuation to most of the Taisho texts, though not all of them. The Agamas have been punctuated.
So, the Chinese Agamas can be segmented in this way, but it would require some manual prep-work to punctuate them better. Then I would think that a script could be used to transform them into a segmented file of whatever format is needed.
My main question is whether there is a set of rules to be used when breaking the text up into segments. It looks like you’ve broken the Pali texts down into sentences, but occasionally I see a sentence divided into clauses.
I can try doing this with MA.1 as I edit it and we can use it as a test case. I would just need a little more detail on the formatting and the rules for segmentation. I’ll be re-punctuating the source text to match my translation in any event.
Have you considered hosting Ven Yinshun’s corrections of the Taishō Canon? I don’t know if they are under copyright. That might help with some of the punctuation and other editorial issues in the Taishō.
Well, more guidelines than rules. The basic principles:
Break segments on major punctuation (.;:?!—)
Where the same passage is punctuated differently, reconcile to whichever seems best (but don’t change the punctuation).
However! Prefer to keep stock passages in the same segment, especially doctrinal and like formulas. (One reason for this is that they are frequently abbreviated. So for example, comparing AN 3.58 and AN 3.59, the section on recollecting past lives takes up a single segment in both cases, even though there is much more text in the unabbreviated version at AN 3.58.)
Ellipsis involves two quite distinct kinds of case. It’s not possible to distinguish these programmatically, so they must be dealt with by hand.
Sometimes an abbreviation is within a segment:
Mind is the forerunner … things.
Sometimes it marks the break between segments:
Mind is the forerunner …
Mind is their chief …
When segmenting Chinese texts, we can add the additional guideline:
Keep it similar to the Pali so far as is reasonable.
It may well end up being the case that simply translating and segmenting as you go is the best approach.
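For a first pass before hand-correction, the "break on major punctuation" guideline could be sketched roughly as below. This is only a sketch under stated assumptions: the full-width CJK punctuation set, the set of closing quote marks, the `ma1` ID prefix, and the sample sentence are all illustrative choices, not anything the project has settled on. Ellipses are deliberately not treated as breaks, since those have to be decided by hand.

```python
import re

# Major punctuation from the guideline, plus full-width CJK forms.
# The CJK additions and the closing-quote set are assumptions.
MAJOR = ".;:?!—。；：？！"
CLOSERS = "」』”’）)"

_major = re.escape(MAJOR)
_closers = re.escape(CLOSERS)
# A segment is text up to a run of major punctuation (plus any closing
# quotes), or whatever unpunctuated text is left at the end.
SEGMENT_RE = re.compile(rf"[^{_major}]*[{_major}]+[{_closers}]*|[^{_major}]+")


def split_segments(text):
    """Split after each run of major punctuation, keeping closing quote
    marks with the segment they end. Ellipses (…) are left untouched."""
    return [m.strip() for m in SEGMENT_RE.findall(text) if m.strip()]


def with_ids(segments, prefix):
    """Attach hypothetical sequential IDs such as 'ma1:1', 'ma1:2'."""
    return {f"{prefix}:{n}": seg for n, seg in enumerate(segments, start=1)}


# Illustrative sample only, not a checked quotation of any edition.
sample = "我聞如是：一時，佛遊舍衛國。爾時，世尊告諸比丘：「若比丘成就七法者，便得歡喜。」"
for seg_id, seg in with_ids(split_segments(sample), "ma1").items():
    print(seg_id, seg)
```

The output of such a script would still need review against the other guidelines, for example merging the pieces of a stock formula back into one segment and aligning with the Pali parallels where reasonable.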
We have, and in fact we have discussed this at some length previously.
There are a few things to bear in mind. One is that there are a lot of Chinese texts, so using multiple sources can get complicated.
Another detail, and this is an issue in any case, is that the good folks at CBETA are in the process of updating their files to the latest Unicode standard. Since they made their files, thousands more Chinese characters have been added to Unicode, many of them at the suggestion of the SAT people, who run the other digital version of the Taisho in Japan. The problem at the moment is that the SAT version is more up to date as far as Unicode is concerned, but their files are much less well structured than CBETA’s. So last time I checked, a couple of years ago, I was preparing to wait until CBETA had updated their files. We can check with them to see how their timeline is going.
Regardless, the Yin Shun version will not have the latest Unicode, and probably never will. Probably best to simply refer to it when doing our work.
I’ve got a revision of the translation for MA.1 ready, and I submitted it as a pull request to the sc-data repo, but I realized that I’m not sure where I should put the segmented files I’ll create next. Should I post them here?
Okay, that sounds good. I’ll start creating the segmented files for future use when we know where to place them in the Sutta Central project. Regarding my name in the system, feel free to fix that. I see that happen in other data systems like mailing lists as well, probably just some artifact from importing data. Thanks!