I mean, who am I to judge your lifestyle choices? Whatever floats your boat, Karl!
But it’s HTML, not XML. I have never found any need to use any XML parser, nor has the question ever arisen among developers in ten years of processing texts on SC. We have always striven to maintain valid, modern code, and my regrets have always come from not enforcing that strictly enough rather than the reverse.
all the Thig and Thag HTML for Bhante’s translations from bilara-data
Let me know of any HTML changes you need and I can re-run the build. For example, Bhante’s HTML is richly nuanced and it’s straightforward to generate that HTML or whatever you need.
I’m a little hazy about the level of activity that this thread is discussing. Is it still well-housed in the Watercooler? @karl_lew @sujato would it be good to relocate the thread to a different category?
@snowbird, I have updated all bilara-verse html files. Apparently the script omitted the last verse in a classic “off by 1” error. Thank you for finding this serious issue.
I’ve been working on turning the files you so kindly created into epubs.
It looks like the Pali verses summarizing things at the end of chapters have gotten mixed up into the verses. In the repository, are they coded differently so that we can manipulate them as a group? I realize that they don’t have an English counterpart, but they clearly aren’t part of the Pali verses that get translated. Any way to corral them off into their own div?
It makes sense, but there is no JSON indication of this distinction, so the script has nothing to trigger it. At best I could put the Pali verse after the English verse so that the last thing read would be the Pali conclusion. SuttaCentral places Pali to the right or below the translation:
OK, I think the penny just dropped for me on the whole JSON vs XML thing. The Pali and the English correspond with each other through the IDs, but each line carries no meaning of its own, other than what might be guessed from the pattern in the IDs? The HTML is what actually tells what is a heading, what is an ending line, etc? Wouldn’t the data be more usable for other purposes if there was a corresponding JSON file that labeled each line? I’m really just curious about the whole system. I’m not implying that you should do this. (I’m so grateful for all you have done; please don’t read this as a complaint!)
So, @karl_lew, were you building the file purely going by the IDs for each line? That would make sense, now that I think about it.
I’ve gone ahead and created the epubs that I wanted to make, so for my purposes there is no need to re-do things. I really appreciate your help!
Yes. Remember that all such information is strictly extraneous to the text itself; neither in manuscript nor in recitation is there a concept of a “heading” or a “paragraph”.
Maybe. I’ve been thinking about it a little. Not sure how it would work, though, or what the aims would be.
The thing is, you can associate text with any amount of metadata depending on what it is that you are trying to achieve.
For example, you’re interested in structural information: when does a verse start and stop?
Someone else might be interested in the speaker: whose voice is the text in?
Someone else might be interested in the theme or the literary style.
Now, the beauty of the system is that anyone can make a set of corresponding files and use them together with the original data, mix and match as you like. You just have to, you know, do the work.
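To make the "mix and match" idea concrete, here is a minimal sketch in Python. It assumes each layer is a flat JSON object keyed by segment ID; the IDs, the sample strings, the `speaker` layer, and the `combine_layers` helper are all made up for illustration and are not taken from bilara-data.

```python
import json


def combine_layers(**layers):
    """Merge several {segment_id: value} mappings into one record per ID."""
    combined = {}
    for layer_name, mapping in layers.items():
        for segment_id, value in mapping.items():
            # Create the record on first sight of an ID, then add this layer.
            combined.setdefault(segment_id, {})[layer_name] = value
    return combined


# Hypothetical stand-ins for three separate files that share segment IDs.
root = json.loads('{"x1:1.1": "root line a", "x1:1.2": "root line b"}')
translation = json.loads('{"x1:1.1": "translated line a"}')
speaker = json.loads('{"x1:1.1": "narrator", "x1:1.2": "narrator"}')

records = combine_layers(root=root, translation=translation, speaker=speaker)
for segment_id, record in records.items():
    print(segment_id, record)
```

A layer that has nothing to say about a segment simply leaves that key out of the record, so an extra file of labels can be added or dropped without touching the text files.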
But I’m not sure how much easier it would make your job. Currently the logic is “look in the html folder and when you come across uddana ignore everything after that in the file”. If we defined everything in json you’d still have to reference a separate set of files, and the logic couldn’t be much simpler.
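For what it's worth, the "ignore everything after uddana" rule could be sketched roughly like this in Python. This assumes the markup file is a JSON object mapping segment IDs to HTML fragments, in file order, and that the summary verses are flagged by the word `uddana` somewhere in the fragment; the exact markup, the IDs, and the `segments_before_uddana` helper are assumptions for illustration only.

```python
def segments_before_uddana(markup, marker="uddana"):
    """Return the segment IDs that come before the first fragment containing marker."""
    kept = []
    for segment_id, fragment in markup.items():
        if marker in fragment:
            break  # drop this segment and everything after it
        kept.append(segment_id)
    return kept


# Hypothetical markup and text layers sharing the same segment IDs.
# Python dicts keep insertion order, so file order is preserved after json.load.
markup = {
    "x1:1.1": "<p class='verse'>{}",
    "x1:1.2": "{}</p>",
    "x1:2.1": "<p class='uddana'>{}",
    "x1:2.2": "{}</p>",
}
text = {
    "x1:1.1": "verse line a",
    "x1:1.2": "verse line b",
    "x1:2.1": "summary line a",
    "x1:2.2": "summary line b",
}

kept = segments_before_uddana(markup)
body = [text[segment_id] for segment_id in kept if segment_id in text]
print(body)
```

If the markup has no such marker, every segment is kept, so the same function would work on files without a summary.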
Yes. The SC HTML is also available, with more effort, and it is likewise associated with matching IDs. However, the HTML you’ve chosen bears little resemblance to SC HTML, so it proved expedient to simply generate what you need directly from the JSON of the text.