Is there anyone around who wants to make a decent PDF out of the scanned Jayawickrama articles?

sujato · October 25, 2022, 8:54am

Jayawickrama’s series of articles for the Pali Buddhist Review is an excellent resource for study of the Suttanipata. But it is only available, so far as I know, in a scanned PDF of demonic intent. I have tried to fix it by extracting the images and eliminating the literally thousands of useless bits of overlay (?), which gives you white-on-black images. Then I tried convert -negate, but it won’t batch process them. So there I am stuck.

What I’d like to do is:

1. pull out the proper images
2. convert to black-on-white
3. make reasonable and consistent margins
4. split files so there is one image per page
5. recombine to a new PDF
6. run OCR so they are searchable
7. make a ToC (? not sure how this is done)
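
A rough sketch of how the black-on-white, recombine and OCR steps above might be scripted, not a tested pipeline. ImageMagick's convert writes a single output per call, so the sketch loops over the files and calls it once per image. It assumes the extracted images are PNGs with zero-padded names (so that sorting gives page order), and that the convert, img2pdf and ocrmypdf command-line tools are installed. The function names, directory layout and output file names are made up for illustration. Margins, page splitting and the ToC are not covered.

```python
from pathlib import Path
import subprocess


def plan_commands(src_dir, dst_dir, pdf_name="searchable.pdf", pattern="*.png"):
    """Build the list of shell commands without running anything.

    Assumes zero-padded file names (page-001.png, page-002.png, ...),
    since plain sorting would put page-10 before page-2.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    images = sorted(src.glob(pattern))
    fixed = [dst / img.name for img in images]

    # One `convert IN -negate OUT` call per image: white-on-black -> black-on-white.
    cmds = [
        ["convert", str(i), "-negate", str(o)]
        for i, o in zip(images, fixed)
    ]

    # Recombine the inverted images into one PDF, one image per page.
    raw_pdf = dst / "raw.pdf"  # hypothetical intermediate file name
    cmds.append(["img2pdf", *[str(p) for p in fixed], "-o", str(raw_pdf)])

    # Add a searchable text layer on top of the page images.
    cmds.append(["ocrmypdf", str(raw_pdf), str(dst / pdf_name)])
    return cmds


def run_commands(cmds):
    """Run each command in order, stopping at the first failure."""
    for cmd in cmds:
        subprocess.run(cmd, check=True)
```

Calling plan_commands("extracted", "fixed") and printing the result shows what would run; passing that list to run_commands actually executes it. On newer ImageMagick installs the command may be magick rather than convert, and mogrify -negate is another option for in-place batch inversion.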

There are about 84 pages.

Anyway, I have too many things on right now! Is there anyone who might be interested?

Alternatively, of course, if someone has access to the original editions, it’d be easier to just re-scan them.

Khemarato.bhikkhu · October 25, 2022, 12:52pm

Yeah, that’s a pretty accurate description of the Frankensteinian monster I patched together from the scans on their website.

Here’s the link to it if someone wants to make a go of it: A Critical Analysis of the Sutta Nipāta (1947) - N A Jayawickrama.pdf - Google Drive

If someone does manage to clean it up, please let me know so I can replace the file on Drive.

Nibbida · October 25, 2022, 7:26pm

I could probably do this relatively easily with an image-to-text tool that I regularly use and convert the whole thing to live text, then a PDF if desired. The only issue is that I’m sure there would be a lot of missed letters plus zero formatting, so it would take some work to go back through and edit.

sujato · October 25, 2022, 7:47pm

You can try, but the thing is, academic papers like this have a lot of special formatting and the like, so you’ll probably end up with a lot of work needing to be done. Still, really that’s the best way if someone has the time: make a proper HTML document or something. Just a lot of work is all.

Suvira · October 25, 2022, 9:10pm

I can help too if needed.

sujato · October 25, 2022, 9:30pm

Go for it if you want.

olastor · October 26, 2022, 8:24am

I tried uploading the pdf to a service (mathpix) that is very good at converting images of formulas to latex code. I uploaded the results in latex, markdown and html into the folder below. It seems to be decent, but probably much needs to be cleaned up, still. I might be able to do some cleaning in the coming weeks, but I think 84 pages would be a bit too much for one person. If someone wants to take this as a starting point, please feel free.

https://drive.google.com/drive/folders/1idSA4Jn5esLIu3xe2l2x9gWfefdldlcj?usp=sharing

Suvira · October 26, 2022, 9:09am

I am happy to clean the whole thing on the basis of the scans and convert to html (assuming that Bhante doesn’t need it within a particular timeframe?). The files look useful, thanks.

Snowbird · October 26, 2022, 9:32am

This looks like the kind of thing that Bhante Ānandajoti would republish in HTML on his website.

sujato · October 26, 2022, 10:39pm

Hmm, well that’s pretty good actually, definitely a nice starting point.

Check olastor’s files, that’s a good place to start. One thing to bear in mind: if we’re converting to HTML, don’t try to mimic every detail of the original typesetting. The main thing is to have a nice clean, readable file. A bit of regex will go a long way.

One thing that has to be done is fixing all the places the converter treated as maths. In the HTML file, for example, we have:

<span class="math-inline "><mathml style="display: none"><math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>P</mi>
  <mi>j</mi>
</math></mathml><mathmlword style="display: none"><math xmlns="http://www.w3.org/1998/Math/MathML">
  <mi>P</mi>
  <mi>j</mi>
</math></mathmlword><asciimath style="display: none;">Pj</asciimath><latex style="display: none">P j</latex><mjx-container class="MathJax" jax="SVG" role="presentation" style="position: relative"><svg style="vertical-align: -0.462ex" xmlns="http://www.w3.org/2000/svg" width="2.631ex" height="2.007ex" role="img" focusable="false" viewBox="0 -683 1163 887" aria-hidden="true"><g stroke="currentColor" fill="currentColor" stroke-width="0" transform="matrix(1 0 0 -1 0 0)"><g data-mml-node="math"><g data-mml-node="mi"><path data-c="50" d="M287 628Q287 635 230 637Q206 637 199 638T192 648Q192 649 194 659Q200 679 203 681T397 683Q587 682 600 680Q664 669 707 631T751 530Q751 453 685 389Q616 321 507 303Q500 302 402 301H307L277 182Q247 66 247 59Q247 55 248 54T255 50T272 48T305 46H336Q342 37 342 35Q342 19 335 5Q330 0 319 0Q316 0 282 1T182 2Q120 2 87 2T51 1Q33 1 33 11Q33 13 36 25Q40 41 44 43T67 46Q94 46 127 49Q141 52 146 61Q149 65 218 339T287 628ZM645 554Q645 567 643 575T634 597T609 619T560 635Q553 636 480 637Q463 637 445 637T416 636T404 636Q391 635 386 627Q384 621 367 550T332 412T314 344Q314 342 395 342H407H430Q542 342 590 392Q617 419 631 471T645 554Z"></path></g><g data-mml-node="mi" transform="translate(751, 0)"><path data-c="6A" d="M297 596Q297 627 318 644T361 661Q378 661 389 651T403 623Q403 595 384 576T340 557Q322 557 310 567T297 596ZM288 376Q288 405 262 405Q240 405 220 393T185 362T161 325T144 293L137 279Q135 278 121 278H107Q101 284 101 286T105 299Q126 348 164 391T252 441Q253 441 260 441T272 442Q296 441 316 432Q341 418 354 401T367 348V332L318 133Q267 -67 264 -75Q246 -125 194 -164T75 -204Q25 -204 7 -183T-12 -137Q-12 -110 7 -91T53 -71Q70 -71 82 -81T95 -112Q95 -148 63 -167Q69 -168 77 -168Q111 -168 139 -140T182 -74L193 -32Q204 11 219 72T251 197T278 308T289 365Q289 372 288 376Z"></path></g></g></g></svg><mjx-assistive-mml role="presentation" unselectable="on" display="inline"><math 
xmlns="http://www.w3.org/1998/Math/MathML"><mi>P</mi><mi>j</mi></math></mjx-assistive-mml></mjx-container></span>

All of which is literally just “Pj”. You can see which of the three sources is easiest to work with. If it were me, I’d probably start with the markdown.
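
For the HTML route, the kind of regex clean-up meant here might look like the following sketch. It assumes every math span in the converter's output has the shape shown above: a span with class "math-inline " containing an asciimath element that holds the plain text, and no nested span inside it. The function name is made up for illustration, and real output should be spot-checked against the scans, since anything the converter got wrong in asciimath is carried over as-is.

```python
import re

# Matches one converter-generated math span and captures the text inside
# its <asciimath> element. Assumes no <span> is nested inside the math span.
MATH_SPAN = re.compile(
    r'<span class="math-inline\s*">'
    r'.*?<asciimath[^>]*>(.*?)</asciimath>'
    r'.*?</span>',
    re.DOTALL,  # the spans run over several lines
)


def strip_math_spans(html):
    """Replace each math span with the plain text from its asciimath element."""
    return MATH_SPAN.sub(lambda m: m.group(1), html)


if __name__ == "__main__":
    sample = (
        'see <span class="math-inline "><mathml style="display: none">'
        '<math><mi>P</mi><mi>j</mi></math></mathml>'
        '<asciimath style="display: none;">Pj</asciimath>'
        '<latex style="display: none">P j</latex>'
        '<mjx-container class="MathJax"><svg></svg></mjx-container></span> II.'
    )
    print(strip_math_spans(sample))
```

Taking the asciimath text rather than the latex text avoids the extra space in "P j", though either would need checking where the original really does contain symbols.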

Maybe. Anyway, an HTML file can go anywhere.

Thanks for all this! I’ve been referring to this file for ages, and it’s so poorly done, it’s just a hassle to find anything. So this was really just to make my life easier. But hopefully will benefit others too.

Snowbird · October 27, 2022, 4:20am

Since folks in this thread seem to be interested in manipulating PDFs, I’ll share this bit of software for cropping pdfs. It’s especially useful for taking a two-up page and making it one-up:

olastor · October 28, 2022, 1:26pm

I noticed the archive.org version of this PDF also links a plain text version (scanned with "ABBYY FineReader 8.0"). This might be useful, too…

Suvira · October 29, 2022, 3:58am

Thanks so much! I’m already 10-20% done (just battling the seasonal hayfever brainfog), and the archive.org version seems to be about on a par with the scan you provided already.

Nimal · June 16, 2024, 8:59pm

If this PDF is ready, would you mind sharing it with others who might be interested in reading it?
With Metta
Nimal Fernando

Suvira · June 19, 2024, 9:33am

My laptop crashed and, to date, I haven’t recovered the files, sorry.

If somebody else wants to take this forward, please don’t wait for me. Another person can take care of this project.

Nimal · June 19, 2024, 3:54pm

Oh, that is too bad. I am not proficient enough to undertake a task of this nature. I hope someone will volunteer.
With Metta