Copying Pali & English verses

sujato · November 24, 2020, 10:43pm

That’s exactly what we do. I was working off your proposal.

Here is the current HTML spec as found on staging.

<header>
	<ul>
		<li class="division">
			<span class="segment" id="sn1.1:0.1">
				<span class="reference">
					<a class="sc" id="0.1" href="#0.1" title="SuttaCentral segment number">0.1</a>
				</span>
				<span class="translation" lang="en">
					<span class="text">Linked Discourses 1  </span>
				</span>
				<span class="root" lang="pli">
					<span class="text">Saṃyutta Nikāya 1  </span>
				</span>
			</span>
		</li>
		<li>
			<span class="segment" id="sn1.1:0.2">
				<span class="reference">
					<a class="sc" id="0.2" href="#0.2" title="SuttaCentral segment number">0.2</a>
				</span>
				<span class="translation" lang="en">
					<span class="text">1. A Reed  </span>
				</span>
				<span class="root" lang="pli">
					<span class="text">1. Naḷavagga  </span>
				</span>
			</span>
		</li>
	</ul>
	<h1 class="sutta-title">
		<span class="segment" id="sn1.1:0.3">
			<span class="reference">
				<a class="sc" id="0.3" href="#0.3" title="SuttaCentral segment number">0.3</a>
			</span>
			<span class="translation" lang="en">
				<span class="text">1. Crossing the Flood  </span>
			</span>
			<span class="root" lang="pli">
				<span class="text">1. Oghataraṇasutta  </span>
			</span>
		</span>
	</h1>
</header>

I mean, who am I to judge your lifestyle choices? Whatever floats your boat Karl!

But it’s HTML, not XML. I have never found any need to use any XML parser, nor has the question ever arisen among developers in ten years of processing texts on SC. We have always striven to maintain valid, modern code, and my regrets have always come from not enforcing that strictly enough rather than the reverse.
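As a rough illustration of reading the spec above with an ordinary HTML parser rather than an XML one, here is a minimal sketch using Python's standard `html.parser`. The class name `SegmentParser` and the shape of the result are assumptions made for this example, not part of SuttaCentral's tooling; only the tag and class names come from the markup quoted above.

```python
from html.parser import HTMLParser


class SegmentParser(HTMLParser):
    """Collect translation and root text for each segment span."""

    def __init__(self):
        super().__init__()
        self.segments = {}       # segment id -> {"translation": ..., "root": ...}
        self._stack = []         # class of each currently open <span>
        self._segment_id = None  # id of the segment we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        cls = attrs.get("class")
        self._stack.append(cls)
        if cls == "segment":
            self._segment_id = attrs.get("id")
            self.segments[self._segment_id] = {}

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            if self._stack.pop() == "segment":
                self._segment_id = None

    def handle_data(self, data):
        # Text sits in <span class="text"> nested in a "translation" or "root" span.
        if self._segment_id is None or len(self._stack) < 2:
            return
        if self._stack[-1] != "text":
            return
        kind = self._stack[-2]
        if kind in ("translation", "root"):
            self.segments[self._segment_id][kind] = data.strip()


sample = """
<h1 class="sutta-title">
  <span class="segment" id="sn1.1:0.3">
    <span class="reference">
      <a class="sc" id="0.3" href="#0.3" title="SuttaCentral segment number">0.3</a>
    </span>
    <span class="translation" lang="en">
      <span class="text">1. Crossing the Flood  </span>
    </span>
    <span class="root" lang="pli">
      <span class="text">1. Oghataraṇasutta  </span>
    </span>
  </span>
</h1>
"""

parser = SegmentParser()
parser.feed(sample)
print(parser.segments)
```

The parser only tracks `span` elements, so the surrounding `header`, `ul`, `li`, and `h1` wrappers can change without affecting the extraction.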

karl_lew · November 25, 2020, 1:33pm

@snowbird, I’ve updated bilara-verse with:

a README
all the Thig and Thag HTML for Bhante’s translations from bilara-data

Let me know of any HTML changes you need and I can re-run the build. For example, Bhante’s HTML is richly nuanced and it’s straightforward to generate that HTML or whatever you need.

Snowbird · November 25, 2020, 2:09pm

Thanks!

Seems to be a problem, though. In the Thag the verses are missing on many of them.

Book of ones seems to be all except the first. Other times just the last verse is missing.

Let me know if you need me to make a better report.

Gillian · November 26, 2020, 4:03am

I’m a little hazy about the level of activity that this thread is discussing. Is it still well-housed in the Watercooler? @karl_lew @sujato would it be good to relocate the thread to a different category?

karl_lew · November 26, 2020, 2:20pm

…perhaps…Translations?

karl_lew · November 26, 2020, 6:12pm

@snowbird, I have updated all bilara-verse html files. Apparently the script omitted the last verse in a classic “off by 1” error. Thank you for finding this serious issue.
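For readers wondering how a last verse can go missing, here is one common way it happens, sketched in Python. This is not the actual bilara-verse script, whose code is not shown in this thread; the function name `group_verses` and the `demo` segment IDs are made up for illustration. When lines are grouped into verses and a verse is only emitted once the next one begins, the final verse needs an explicit flush after the loop.

```python
def group_verses(segment_ids):
    """Group IDs such as 'demo1:2.1' by verse number (the part between ':' and '.')."""
    verses = []
    current_key, current = None, []
    for seg_id in segment_ids:
        key = seg_id.split(":")[1].split(".")[0]
        if key != current_key and current:
            # A new verse has started, so the previous one is complete.
            verses.append(current)
            current = []
        current_key = key
        current.append(seg_id)
    if current:
        # Without this final flush the last verse is silently dropped.
        verses.append(current)
    return verses


ids = ["demo1:1.1", "demo1:1.2", "demo1:2.1", "demo1:2.2"]
print(group_verses(ids))
```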

sujato · November 26, 2020, 8:54pm

Ahh, the two hard problems in computer science: cache invalidation, naming things, and off-by-one errors.

Sorry, I couldn’t help myself!

Snowbird · November 27, 2020, 1:22pm

@karl_lew,

I’ve been working on turning the files you so kindly created into epubs.

It looks like the Pali verses summarizing things at the end of chapters have gotten mixed up into the verses. In the repository, are they coded differently so that we can manipulate them as a group? I realize that they don’t have an English counterpart, but they clearly aren’t part of the Pali verses that get translated. Any way to corral them off into their own div?

I also notice the lack of the “Book of the x is finished”.

And that the Pali for the name of the Thera is merged with the last Pali verse.

[Screenshot: TheragathaPali-EnglishBhante-Sujato--002.epub open in Sigil]

Is it possible to get that separated or marked up in a way I can hide it?

Let me know if that makes sense. If it’s too difficult, I can certainly just do it manually.

karl_lew · November 27, 2020, 3:57pm

It makes sense, but there is no JSON indication of this distinction, so the script has nothing to trigger it. At best I could put the Pali verse after the English verse so that the last thing read would be the Pali conclusion. SuttaCentral places Pali to the right or below the translation:

Should I change the Pali/English order?

Snowbird · November 28, 2020, 3:06am

No, no. I will just remove that manually.

Bhante @sujato, is it true that there is no way to differentiate the summary verses?

sujato · November 28, 2020, 9:34am

Again, only via the HTML. There they have a special markup.

Snowbird · November 28, 2020, 11:16am

OK, I think the penny just dropped for me on the whole JSON vs XML thing. The Pali and the English correspond with each other through the IDs, but each line is without any meaning, other than what might be guessed from the pattern in the IDs? The HTML is what actually tells what is a heading, what is an ending line, etc? Wouldn’t the data be more usable for other purposes if there was a corresponding JSON file that labeled each line? I’m really just curious about the whole system. I’m not implying that you should do this. (I’m so grateful for all you have done; please don’t read this as a complaint!)

So, @karl_lew, were you building the file purely going by the IDs for each line? That would make sense, now that I think about it.

I’ve gone ahead and created the epubs that I wanted to make, so for my purposes there is no need to re-do things. I really appreciate your help!
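To make the ID correspondence described above concrete, here is a small sketch in Python. The texts and IDs are taken from the HTML spec quoted earlier in the thread; the function name `pair_segments` and the plain-dict layout are assumptions for illustration, not the actual bilara-data file format or the script used here.

```python
# Root (Pali) and translation (English) keyed by the same segment IDs.
root = {
    "sn1.1:0.1": "Saṃyutta Nikāya 1",
    "sn1.1:0.2": "1. Naḷavagga",
    "sn1.1:0.3": "1. Oghataraṇasutta",
}
translation = {
    "sn1.1:0.1": "Linked Discourses 1",
    "sn1.1:0.2": "1. A Reed",
    "sn1.1:0.3": "1. Crossing the Flood",
}


def pair_segments(root, translation):
    """Return (id, pali, english) tuples in root order.

    english is None when a root segment has no translated counterpart.
    """
    return [(seg_id, pali, translation.get(seg_id)) for seg_id, pali in root.items()]


for seg_id, pali, english in pair_segments(root, translation):
    print(seg_id, "|", pali, "|", english)
```

Nothing in this pairing says whether a line is a heading, a verse line, or a summary; that information would have to come from elsewhere.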

sujato · November 28, 2020, 12:51pm

Exactly.

Yes. Remember that all such information is strictly extraneous to the text itself; neither in manuscript nor in recitation is there a concept of a “heading” or a “paragraph”.

Maybe. I’ve been thinking about it a little. Not sure how it would work, though, or what the aims would be.

The thing is, you can associate text with any amount of metadata depending on what it is that you are trying to achieve.

For example, you’re interested in structural information: when does a verse start and stop?

Someone else might be interested in the speaker: whose voice is the text in?

Someone else might be interested in the theme or the literary style.

Now, the beauty of the system is that anyone can make a set of corresponding files and use them together with the original data, mix and match as you like. You just have to, you know, do the work.

But I’m not sure how much easier it would make your job. Currently the logic is “look in the html folder and when you come across uddana ignore everything after that in the file”. If we defined everything in json you’d still have to reference a separate set of files, and the logic couldn’t be much simpler.
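The "ignore everything after uddana" rule could be sketched roughly as follows. This assumes the markup for a text is available as an ordered mapping from segment ID to a markup string, and that the summary verses are identifiable by the word `uddana` in that markup; both the data layout and the `demo` IDs are hypothetical, so check them against the real files before relying on this.

```python
def ids_before_uddana(markup_by_id, marker="uddana"):
    """Return segment IDs in order, stopping at the first segment whose markup contains marker."""
    kept = []
    for seg_id, markup in markup_by_id.items():
        if marker in markup:
            break  # the summary starts here; skip it and everything after
        kept.append(seg_id)
    return kept


# Hypothetical markup for illustration only.
markup = {
    "demo1.1:1.1": "<p class='verse'>{}",
    "demo1.1:1.2": "{}</p>",
    "demo1.1:2.1": "<p class='uddana'>{}",
    "demo1.1:2.2": "{}</p>",
}

print(ids_before_uddana(markup))
```

The returned IDs can then be used to filter the root and translation texts, since all of them share the same keys.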

karl_lew · November 28, 2020, 1:26pm

Yes. The SC HTML is also available, with more effort, and is likewise tied to the text through matching IDs. However, the HTML you’ve chosen bears little resemblance to SC HTML, so it proved expedient to simply generate what you need directly from the JSON of the text.

Snowbird · November 28, 2020, 1:32pm

That makes sense. Thanks so much!