@sabbamitta, @aminah. As I look at this, I see that indeed this is AI in pain. Those segments are HUGE and AWS is definitely complaining. The code has two ways of shortening the text sent to AWS so that it stays within the AWS text-length limit:
- it reduces the size of the XML without loss of information.
- it eliminates the XML that forces AWS to treat every word separately, which is what prevents Hindi phrasing. Without that XML, Aditi tries to apply Hindi phrasing to Pali, and intelligibility suffers.
The second measure is a draconian last resort that is rarely needed, and it reduces intelligibility. That is the sad result we hear today in MN12 segment 31. I hope that AWS may raise its limit in the future, which would allow us to avoid using measure #2. For this reason I have moved this bug out of v0.9.1 and into the SCV Backlog. This acknowledges the bug, although we have no practical solution at this time. Long segments will be less intelligible until this is fixed. If this issue keeps arising and becomes disruptive, we may have to research a costly solution without waiting for AWS to act. That would be a PM decision.
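For anyone following along, here is a rough sketch of the two-step fallback described above. It is not the actual SCV code: the function names, the `<w>` placeholder tag, the idea that measure #1 works by dropping whitespace between tags, and the `limit` parameter are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape


def verbose_markup(words):
    # Hypothetical per-word markup, pretty-printed with one word per line.
    # "<w>" is only a placeholder tag, not the project's real markup.
    body = "\n".join(f"  <w>{escape(w)}</w>" for w in words)
    return f"<speak>\n{body}\n</speak>"


def compact_markup(words):
    # Measure #1 (assumed form): same per-word markup, minus the
    # whitespace between tags, so no information is lost.
    return "<speak>" + "".join(f"<w>{escape(w)}</w>" for w in words) + "</speak>"


def plain_text(words):
    # Measure #2: drop the per-word markup entirely. The voice is then
    # free to apply its own phrasing, which is what hurts intelligibility.
    return " ".join(words)


def fit_to_limit(words, limit):
    """Return (text, measure), where measure is 0, 1 or 2.

    0 = full markup fits, 1 = compact markup was needed,
    2 = markup had to be dropped. `limit` is a hypothetical
    maximum character count for a single request.
    """
    candidates = (verbose_markup, compact_markup, plain_text)
    for measure, build in enumerate(candidates):
        text = build(words)
        if len(text) <= limit:
            return text, measure
    raise ValueError("segment exceeds the limit even as plain text")
```

A segment such as MN12 segment 31 would come back with measure 2 in this sketch, which corresponds to the degraded audio we hear today.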