SuttaCentral

Smart word analysis for Thai


#1

I just came across this:

Basically, it is a preprocessor for Thai text that can handle LaTeX or HTML. Accurately determining word breaks in Thai is not trivial, since there are no spaces between words, so browsers usually do it quite primitively. SWATH does it better, inserting <wbr> tags where needed.
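
To make the difficulty concrete, here is a minimal sketch of why word breaking without spaces is ambiguous. It is not SWATH's algorithm; it simply lists every way a string can be split into words from a dictionary. The function name `segmentations` and the four-word `TOY_WORDS` dictionary are assumptions made up for illustration.

```python
def segmentations(text, words):
    """Return every way to split `text` into entries of `words`."""
    if not text:
        return [[]]  # the empty string has exactly one split: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in words:
            # Keep this prefix and split the remainder recursively.
            for rest in segmentations(text[end:], words):
                results.append([prefix] + rest)
    return results


# Hypothetical toy dictionary, for illustration only.
TOY_WORDS = {"ตา", "กลม", "ตาก", "ลม"}

for split in segmentations("ตากลม", TOY_WORDS):
    print("-".join(split))
```

With this toy dictionary the same five letters split two ways, ตา-กลม and ตาก-ลม, so a real segmenter needs more than a word list to choose between them.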

I’m wondering whether we should run our Thai texts through this. @blake, @Dheerayupa, any thoughts?


#2

Dear Bhante @Sujato,

I’m not sure I understand how it works as my IT knowledge is really limited, but yes, word boundaries are an issue in the Thai language: the same letters can be split as ตา-กลม or ตาก-ลม

Mac was hopeless when it came to word boundaries several years ago; I don’t know if they have improved. However, even Microsoft Word can sometimes be ignorant.

If you think said program is good, I’m happy to help with the database or whatever requires a native Thai speaker. I should be available to do this after August. :slight_smile:

With great respect,

Dheerayupa


#3

Since the original post, I have successfully installed Swath and used it for LaTeX. It was fiddly to install, but worked great. I installed from the repo, and had to fuss around getting the dependencies right. However, apparently Blake installed it straight from the Ubuntu repositories, which was easier.

Regardless, it is a good thing and should be used as a preprocessor. In LaTeX, it inserts {\wbr} tags where needed, and for HTML it will do the same with <wbr>.
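
As a rough illustration of what the output looks like, the sketch below joins an already segmented list of words with the break marker for each format. It does not call Swath; the word list is assumed to come from a segmenter, and the names `BREAK_MARKERS` and `insert_breaks` are hypothetical.

```python
# Break markers as described above: <wbr> for HTML, {\wbr} for LaTeX.
BREAK_MARKERS = {"html": "<wbr>", "latex": r"{\wbr}"}


def insert_breaks(words, fmt="html"):
    """Join pre-segmented words with the break marker for `fmt`."""
    return BREAK_MARKERS[fmt].join(words)


# Assumed segmenter output for a short Thai string.
words = ["ตา", "กลม"]
print(insert_breaks(words, "html"))
print(insert_breaks(words, "latex"))
```

The visible text is unchanged in either format; the markers only tell the renderer where a line break is allowed.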

Note that this is a good thing not just for the texts, but for all the Thai in the interface, including the GUI. Thus I propose that we automate a preprocessor for Thai on export from Pootle, as well as covering existing texts.

No doubt there will be other languages/scripts that have similar problems, so this may be extended if there are appropriate preprocessors.


#4

@sujato I’m sure you must have 27 hour days :slightly_smiling_face: