Handling volpage in unsegmented Chinese root texts

G’day,

I’ve been working on our process for importing unsegmented texts, both legacy translations and root texts. The main reason for this is to reduce CPU load and generally improve the code quality. This part of the code base hasn’t been revisited for several years and is affecting the reliability of the website.

I’ve come across an inconsistency in our handling of volume-and-page markers for our texts and would like some feedback before I try to fix it. So, have a look at these two API responses:

https://suttacentral.net/api/suttaplex/ma1

https://suttacentral.net/api/suttaplex/ma43

(Sorry about the JSON, I’m not familiar with the front end code)

The first has been segmented by @cdpatton while the second hasn’t.

Each has a “volpages” attribute, “T i 421a12” and “T i 485b19” respectively.

Each then has a list of translations, with the root text first in the list. (OK, the root text is not actually a translation; that’s just how we do things.) Each entry in turn has its own “volpage” attribute. For the two root texts these are null and “T 0485b21” respectively. The remaining entries, which are actual translations, have null for their volpage attributes.

I can see why this is happening. When we load unsegmented texts and the language code is ‘lzh’, the volpage is extracted from the HTML. If we just set the volpage to null for every item in the translations list, we can delete the volpage extraction code and reduce the CPU load.
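For anyone who would rather not read the raw JSON, here is a rough Python sketch that pulls out just the volpage fields from one suttaplex object. The key name "translations" and the assumption that the response is a plain dict are guesses on my part based on the description above, not something checked against the API code.

```python
from typing import Any, Optional


def summarise_volpages(suttaplex: dict[str, Any]) -> dict[str, Optional[str]]:
    """Collect the card-level "volpages" value and each entry's "volpage"."""
    summary: dict[str, Optional[str]] = {"card": suttaplex.get("volpages")}
    # "translations" is an assumed key name for the list of texts on the card.
    for entry in suttaplex.get("translations", []):
        summary[entry["id"]] = entry.get("volpage")
    return summary


# Trimmed example using only values quoted in this thread.
ma43 = {
    "volpages": "T i 485b19",
    "translations": [
        {"id": "lzh_ma43_taisho", "lang": "lzh", "is_root": True,
         "volpage": "T 0485b21"},
    ],
}
print(summarise_volpages(ma43))
```

Running this on both cards side by side should make it obvious where the card-level reference and the root text's own reference disagree.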

Sorry for the technical nature, I hope our users understand what I’m talking about.

If that’s all too complicated, I can just make the optimisation. Should be a pretty small blast radius if I’ve got it wrong. Just keep an eye on Volume and Page details when I’m done. I’ll let youse know.

I’d wait till you hear something from @cdpatton. Other than Bhante and @HongDa, he may be the only one who has a sense of what’s going on with this.

I know nothing about this area. But of course I greatly appreciate all your work!

I’m not sure what effect it would have on anything; I know nothing about the programming of SC. While poking around in my attempt to update the structure tree for the Dirgha Agama, I did notice that unsegmented texts have a data file in the structure folder, called “text_extra_info.json”, that holds data like the volpage. I’m guessing that would probably be the source to use for the volpage data instead of the actual root text files … But it is just a guess. I didn’t program the website.

Hey @cdpatton, thanks for getting involved.

Let us start with the suttaplex card for MA43:

The volume & page reference is T i 485b19.

As you discovered, this reference is supplied by the text_extra_info.json file:

Do you know if this reference is shown elsewhere? I couldn’t see it on the page with the root text:

When you view the suttaplex card for an unsegmented root text, it also serves up a second volume and page reference:

{
  "lang": "lzh",
  "lang_name": "Chinese",
  "is_root": true,
  "author": "Taishō Tripiṭaka",
  "author_short": "Taishō",
  "author_uid": "taisho",
  "publication_date": null,
  "id": "lzh_ma43_taisho",
  "segmented": false,
  "volpage": "T 0485b21",
  "has_comment": false
},

Note that this time the volume/page is T 0485b21, not T i 485b19.
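To see exactly how the two references differ, here is a small Python sketch that parses the trailing part of each one. I am assuming the usual Taishō reading of page number, column letter (a to c), then line number; the helper names are made up for illustration and are not from the SC code base.

```python
import re
from typing import NamedTuple


class TaishoRef(NamedTuple):
    page: int
    column: str
    line: int


# Trailing part of a reference such as "485b19" or "0485b21":
# page digits, column letter, line digits.
_REF = re.compile(r"(\d+)([a-c])(\d+)$")


def parse_ref(volpage: str) -> TaishoRef:
    """Parse the page/column/line part, ignoring the "T" and volume prefix."""
    match = _REF.search(volpage.strip())
    if match is None:
        raise ValueError(f"unrecognised reference: {volpage!r}")
    page, column, line = match.groups()
    return TaishoRef(int(page), column, int(line))


card = parse_ref("T i 485b19")   # from the suttaplex card
html = parse_ref("T 0485b21")    # from the root text entry
print(card, html)
print(card[:2] == html[:2], html.line - card.line)
```

Read this way, the two references agree on page and column and differ only in the line number (19 versus 21), and the second one carries no volume number at all.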

The reason for this is that we extract T 0485b21 from the HTML file itself:

In particular, this tag:

<a class='ref t' id='t0485b21' href='#t0485b21'>
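As a rough illustration of what that extraction involves, here is a sketch using only the Python standard library. I have not copied this from our loader; taking the first matching anchor and formatting it as "T " plus the id are my assumptions about what the existing code does, and the function name is hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import Optional

# Assumed id shape: "t" + page digits + column letter + line digits.
_TAISHO_ID = re.compile(r"t(\d+[a-c]\d+)")


class _FirstTaishoRef(HTMLParser):
    """Remember the first <a class='ref t' id='t...'> seen in the document."""

    def __init__(self) -> None:
        super().__init__()
        self.volpage: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if self.volpage is not None or tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        match = _TAISHO_ID.fullmatch(attributes.get("id") or "")
        if "ref" in classes and "t" in classes and match:
            self.volpage = f"T {match.group(1)}"


def first_taisho_volpage(html: str) -> Optional[str]:
    """Return e.g. "T 0485b21" for the first Taishō ref anchor, else None."""
    parser = _FirstTaishoRef()
    parser.feed(html)
    return parser.volpage


sample = "<a class='ref t' id='t0485b21' href='#t0485b21'></a>"
print(first_taisho_volpage(sample))
```

Note that even this simple version has to walk the whole HTML file to produce a single string.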

I’m not sure, but getting this second volpage reference may be taxing the server quite a bit. I’ll be sure to check.

Bhante, please give me some time. The legacy text processing already existed before I took over the development of SC, so I need to look at the code to check some of the details.

Sure, no rush at all. Thanks for having a look.
