Wishlist for Virtaal

Jhanarato · May 4, 2015, 2:24am

Hmm, interesting stuff. I’m glad we went ahead with the dev category.

J.R.

blake · May 4, 2015, 8:41am

Okay I’ve pushed an update which fixes the empty html problem.

alexcc0 · May 8, 2015, 12:09am

These raw translation tools, formats, and scripts look and sound amazing…

Can Virtaal enforce, suggest, or default to a normalized version of the selected text, with non-alphabetic characters removed? For example, msgid “Seyyathāpi” instead of ““Seyyathāpi,”?

I could imagine importing an extant English translation and then reverse-engineering the Pali translation “segment control” without generating an alternate version of the Pali. Is this straightforward enough?

How does Virtaal handle ellipses, such as when the Pali repeats a theme in a multi-level structure and the translator decides to summarize? For the web, I would much rather see [+] collapse/expand buttons than “… and so on”.

From the few PO examples I’ve seen, you have what looks a bit like XPATH down to the paragraph id ((a id)). Would it be possible/reasonable to identify the sentence (an implied ((span id)) )?

mnx7.html+html.body.div.section.article.p.div:18

such as (paragraph 18, sentence 5):

mnx7.html+html.body.div.section.article.p.div:p18,span:s5

Also, and probably more importantly, is it possible to preserve the section numbers from the Pali?

mnx7.html+html.body.div.section(@s3).article.p.div(@p18).span(@s5)

**(Markdown didn’t like the brackets, hashes, colons, and m-n-7’s.)
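For illustration, the last location string above could be taken apart roughly like this. It is only a sketch of the proposed notation, not anything Virtaal or the PO files actually emit; the function name `parse_location` and the `(@id)` step syntax are assumptions drawn from the example itself.

```python
import re

# One path step: an element name, optionally followed by "(@some-id)".
STEP = re.compile(r"^([A-Za-z][\w-]*)(?:\(@([\w-]+)\))?$")


def parse_location(location):
    """Split 'file+step.step(@id)...' into (filename, [(element, id_or_None), ...])."""
    # The file name may contain dots, so split on "+" before splitting the path.
    filename, _, path = location.partition("+")
    steps = []
    for part in path.split("."):
        match = STEP.match(part)
        if match is None:
            raise ValueError(f"unrecognised step: {part!r}")
        steps.append((match.group(1), match.group(2)))
    return filename, steps


filename, steps = parse_location(
    "mnx7.html+html.body.div.section(@s3).article.p.div(@p18).span(@s5)"
)
# Keep only the steps that carry an ID: section, paragraph, sentence.
print(filename, [step for step in steps if step[1] is not None])
```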

sujato · May 8, 2015, 12:18am

Not out of the box. But it would be easy enough to script in Python. I’m not sure why you’d want to, though. Is this for NLP purposes?
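As a rough idea of what such a script might look like, here is a minimal sketch that keeps letters and combining marks and drops quotes, commas, and other punctuation. The function name `normalize_segment` is made up for this example, and the choice of which Unicode categories to keep is an assumption rather than anything Virtaal does.

```python
import unicodedata


def normalize_segment(text):
    """Keep letters, combining marks, and single spaces; drop everything else."""
    # NFC keeps precomposed letters such as "ā" as a single code point.
    text = unicodedata.normalize("NFC", text)
    kept = [
        ch for ch in text
        if unicodedata.category(ch)[0] in ("L", "M") or ch.isspace()
    ]
    # Collapse any whitespace runs left behind by removed punctuation.
    return " ".join("".join(kept).split())


print(normalize_segment("“Seyyathāpi,”"))
```

Filtering by Unicode category rather than by an ASCII character class means the Pali diacritics survive the cleanup.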

Virtaal doesn’t handle these things; it just gives you what’s there. I agree, some flexibility in these things would be ideal; Ven Anandajoti has some nice implementations on his site. But for now, I am going to pretty much just translate what the Pali text has. Fancy things with repetitions can be done as a later enhancement. I will, however, try to avoid abbreviations that require seeing another text.

Good luck. I don’t see how this would be possible automatically, as the sentence breaks in the Pali and English won’t always be the same, not to speak of paragraphs and so on. You could probably approximate it and then fix it manually. But it would be messy, and again, I’m not sure of the benefit.

All the sentence and paragraph identification will be there. We keep the Pali markup as comments in the po file.

Markdown loves these things: built by geeks for geeks! Use ticks for inline code, and indent with four spaces for blocks.

alexcc0 · May 8, 2015, 12:50am

My agenda is to expose these fancy translation features to the reader/practitioner. When I am reading an English translation, I nearly always want to know what the various Pali words were (was ‘origination’ Sambhava, Samudaya, or Samuppaada; misleading translations of Sankhara; etc). I’d like to click on a sentence, see the Pali, alternate English translations, and other suttas where the exact same Pali exists. I assume all of this would be possible with a little scripting and retaining the raw translation data (PO, XLIFF). The quotes, commas, and even spaces may get in the way. However, I’m guessing you’re converting the PO directly to HTML and need all of the non-alphas, so what I’m suggesting would likely be an intermediate step.

Could you link to Ven Anandajoti’s site? It sounds like he’s into some interesting things.

The benefit again is the linking. I greatly appreciate translation, but I don’t trust it, and no one should. Perhaps I need to invent a new file format. Maybe it would be called RDF or HTML with linked IDs.

sujato · May 8, 2015, 12:54am

This should be possible, including fuzzy matches.
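One way fuzzy matching between Pali segments might be scripted is with Python's standard `difflib`. This is a sketch only: the function name `fuzzy_matches`, the 0.8 threshold, the segment IDs, and the second spelling variant are all invented for illustration, and a real corpus would probably want normalized text and a faster index.

```python
from difflib import SequenceMatcher


def fuzzy_matches(query, segments, threshold=0.8):
    """Return (ratio, segment_id) pairs for segments similar to query, best first."""
    hits = []
    for seg_id, text in segments.items():
        ratio = SequenceMatcher(None, query, text).ratio()
        if ratio >= threshold:
            hits.append((ratio, seg_id))
    return sorted(hits, reverse=True)


# Hypothetical segment IDs; the second entry is a made-up spelling variant.
segments = {
    "example:1": "Cattārome, bhikkhave, puggalā",
    "example:2": "Cattāro me bhikkhave puggalā",
    "example:3": "Seyyathāpi",
}
for ratio, seg_id in fuzzy_matches("Cattārome, bhikkhave, puggalā", segments):
    print(f"{seg_id}: {ratio:.2f}")
```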

http://www.ancient-buddhist-texts.net/

I think all you want to do should be possible with PO.

The problem is just reverse-engineering an existing translation against the Pali. It could be partly automated, but whatever system you use, you’ll need to spend a lot of time fixing things up by hand. That would take time, but hey, why not? If you were able to do this, it would be great to integrate existing translations with our system here on SC.

alexcc0 · May 8, 2015, 1:10am

As long as any new change doesn’t break all prior effort, then manual editing is acceptable. I’m imagining a file format that has the original texts (such as the Pali) and all translation is annotation. In fact, RDF already makes this possible and decentralized as long as HTML texts are littered with IDs (ideally RDFa 1.1).

This is an example of (tedious if not educational) manual work. Note that the sections are linked to the Pali [1] as well as alternate English translations, section by section. But I’d rather get the annotation out of the way. The user should hover over or click on the text (like your sutta links, for example MN19). The user would see the Pali in the popup, along with links to the alternatives, sentence by sentence or phrase by phrase.

alexcc0 · May 8, 2015, 1:26am

#: </h1></div><p><a class="sc" id="1"></a>
msgid "“Cattārome, bhikkhave, puggalā..."
msgstr "\"Monks, these four kinds of people..."

Is an4.92.po an example of HTML-to-PO? This is excellent. As for annotation, I would think only the opening tag would be necessary. And why not just:

#: <p class="sc" id="1">
msgid "etc"

At this point, I expect you want to stick with the paragraph ID numbers, but I’d recommend that in general the ID indicates its type, so that id="p1" and id="sec1" and id="sent1" and id="phrase1" can be distinguished.
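A small sketch of why typed IDs are convenient: a script can tell the kind of unit from the ID alone. The prefixes are the ones suggested above; the function name `parse_typed_id` is an assumption for this example.

```python
import re

# Prefixes taken from the suggestion above: p, sec, sent, phrase.
ID_PATTERN = re.compile(r"^(sec|sent|phrase|p)(\d+)$")


def parse_typed_id(value):
    """Split an id such as 'sent5' into ('sent', 5); return None if it is untyped."""
    match = ID_PATTERN.match(value)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


for value in ("p1", "sec1", "sent1", "phrase1", "1"):
    print(value, parse_typed_id(value))
```

A bare `id="1"`, as in the current files, comes back as `None`, so the script cannot tell whether it names a paragraph or a sentence.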

sujato · May 8, 2015, 6:44am

Yes, I think that’s what we are thinking of. Not so much the alternative translations, for the aforementioned reason of time, but the basic idea. Since the Pali text is the “real thing”, it makes sense to key everything off that. So as long as everything co-ordinates with IDs or whatever that are in the Pali, all this can be done.

PO doesn’t natively support more than two things (text and translation), so far as I know, so we would have to use IDs in the Pali text that are common between each PO file.
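A sketch of how shared IDs could tie several PO files back to one Pali source, assuming each file has already been loaded into a plain dictionary keyed by ID. The function name `merge_by_id`, the dictionary layout, and the translation labels are all hypothetical; the second translation is a placeholder, not real data.

```python
def merge_by_id(pali, translations):
    """Group each Pali segment with every translation that shares its ID."""
    merged = {}
    for seg_id, source in pali.items():
        merged[seg_id] = {
            "pali": source,
            "translations": {
                name: entries[seg_id]
                for name, entries in translations.items()
                if seg_id in entries
            },
        }
    return merged


pali = {"1": "“Cattārome, bhikkhave, puggalā..."}
translations = {
    "english": {"1": "\"Monks, these four kinds of people..."},
    "other": {"1": "<placeholder for a second translation>"},
}
print(merge_by_id(pali, translations)["1"]["translations"].keys())
```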

Can you explain how you would implement RDF in this context?

sujato · May 8, 2015, 6:48am

Not entirely sure what you mean. But the reason the HTML is there is because this is how we preserve it in the PO files, as comments. Normally PO works with just plain text. So far it’s just a first run, but this is the basic idea.
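To make the idea concrete, here is a minimal sketch that reads one such entry and separates the HTML kept in the `#:` comment from the text and its translation. It is not a full PO parser (no multi-line strings, plurals, or contexts) and a proper PO library would be the better tool in practice; the function name `parse_entry` and the dictionary layout are assumptions for this example.

```python
import re

# The an4.92.po entry quoted earlier in the thread, one field per line.
ENTRY = r'''#: </h1></div><p><a class="sc" id="1"></a>
msgid "“Cattārome, bhikkhave, puggalā..."
msgstr "\"Monks, these four kinds of people..."
'''

STRING_LINE = re.compile(r'^(msgid|msgstr) "(.*)"$')


def parse_entry(block):
    """Parse a single-line-per-field PO entry into markup, msgid, and msgstr."""
    entry = {"markup": [], "msgid": "", "msgstr": ""}
    for line in block.splitlines():
        if line.startswith("#:"):
            # The HTML from the source file travels along as a comment.
            entry["markup"].append(line[2:].strip())
            continue
        match = STRING_LINE.match(line)
        if match:
            # Undo the PO escapes for quotes and backslashes only.
            entry[match.group(1)] = re.sub(r'\\(["\\])', r"\1", match.group(2))
    return entry


entry = parse_entry(ENTRY)
print(entry["markup"])
print(entry["msgid"], "=>", entry["msgstr"])
```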

In fact @blake has suggested we do something like this, but we have not implemented it.

Yes, this is quite right. We haven’t yet discussed how to implement this.

alexcc0 · May 9, 2015, 1:46am

Oh, I see the tight one-to-one relationship between the original HTML and the PO file. I’m not sure how far you’ll be able to extend that hack with your wish-list, however. My sympathies and admiration go out to @blake.