Jayawickrama’s series of articles for the Pali Buddhist Review is an excellent resource for the study of the Suttanipata. But it is only available, so far as I know, in a scanned PDF of demonic intent. I have tried to fix it by extracting the images and eliminating the literally thousands of useless bits of overlay (?), which leaves you with white-on-black images. Then I tried convert -negate, but it won’t batch process them. So there I am, stuck.
What I’d like to do is:
pull out the proper images
convert to black-on-white
make reasonable and consistent margins
split files so there is one image per page
recombine to a new PDF
run OCR so they are searchable
make a ToC (? not sure how this is done)
There are about 84 pages.
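For the batch-negate step that got stuck, a minimal sketch follows. It assumes the extracted pages are PNG files with zero-padded names (so a plain sort keeps page order), and that ImageMagick's `mogrify`, plus `img2pdf` and `ocrmypdf`, are installed. The function names, the `out` directory and the output file names are made up for illustration. `mogrify` is the batch counterpart of `convert`, and `-path` writes the results to another directory instead of overwriting the originals.

```python
import subprocess
from pathlib import Path


def build_commands(images, out_dir, pdf_name="combined.pdf"):
    """Return the command lines for: negate all pages, recombine, OCR.

    `images` is any iterable of image paths; `out_dir` and `pdf_name`
    are illustrative names, not anything the tools require.
    """
    pages = sorted(Path(p) for p in images)
    if not pages:
        raise ValueError("no images given")
    out = Path(out_dir)
    negated = [out / p.name for p in pages]      # where mogrify writes its output
    combined = out / pdf_name
    searchable = out / ("ocr-" + pdf_name)
    return [
        # Invert white-on-black to black-on-white for every page in one call.
        ["mogrify", "-path", str(out), "-negate", *[str(p) for p in pages]],
        # Put the negated pages back into a single PDF, one image per page.
        ["img2pdf", *[str(p) for p in negated], "-o", str(combined)],
        # Add a text layer so the PDF is searchable.
        ["ocrmypdf", str(combined), str(searchable)],
    ]


def run(commands, out_dir):
    """Create the output directory, then run each command in order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in commands:
        subprocess.run(cmd, check=True)  # stop at the first failure


if __name__ == "__main__":
    cmds = build_commands(Path("extracted").glob("*.png"), "out")
    for cmd in cmds:
        print(" ".join(cmd))  # inspect first; call run(cmds, "out") when happy
```

This leaves out the margin and page-splitting steps, which depend on how the scan is laid out, and the ToC.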
Anyway, I have too many things on right now! Is there anyone who might be interested?
Alternatively, of course, if someone has access to the original editions, it’d be easier to just re-scan them.
I could probably do this relatively easily with an image-to-text tool that I regularly use, converting the whole thing to live text and then to a PDF if desired. The only issue is that I’m sure there would be a lot of missed letters plus zero formatting. So it would take some work to go back through and edit.
You can try, but the thing is, academic papers like this have a lot of special formatting and the like, so you’ll probably end up with a lot of work needing to be done. Still, that really is the best way: if someone has the time, make a proper HTML document or something. It’s just a lot of work, is all.
I tried uploading the PDF to a service (Mathpix) that is very good at converting images of formulas to LaTeX code. I uploaded the results in LaTeX, Markdown and HTML into the folder below. It seems to be decent, but a lot probably still needs to be cleaned up. I might be able to do some cleaning in the coming weeks, but I think 84 pages would be a bit too much for one person. If someone wants to take this as a starting point, please feel free.
I am happy to clean the whole thing on the basis of the scans and convert to HTML (assuming that Bhante doesn’t need it within a particular timeframe?). The files look useful, thanks.
Hmm, well that’s pretty good actually, definitely a nice starting point.
Check Olaster’s files, that’s a good place to start. One thing to bear in mind: if we’re converting to HTML, don’t try to mimic every detail of the original typesetting. The main thing is to have a nice clean, readable file. A bit of regex will go a long way.
One thing that has to be done is fixing all the places the converter treated as maths. In the HTML file, for example, we have:
All of which is literally just “Pj”. You can see which of the three sources is easiest to work with. If it were me, I’d probably start with the markdown.
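As a rough illustration of the "bit of regex" idea, here is a sketch for undoing spans that were wrongly treated as maths. The exact markup the converter produced is not reproduced here, so the patterns are assumptions: inline maths delimited by `\( ... \)` or `$ ... $`, possibly wrapped in `\mathrm{...}` or similar. The function name `unmath` is made up. A span is only unwrapped when nothing but letters remains, so real formulas are left alone.

```python
import re

# Assumed inline-maths delimiters: \( ... \) or $ ... $
INLINE_MATH = re.compile(r"\\\((.+?)\\\)|\$(.+?)\$")
# Assumed font wrappers such as \mathrm{Pj}
WRAPPER = re.compile(r"\\(?:mathrm|mathbf|text|operatorname)\s*\{([^{}]*)\}")


def unmath(text: str) -> str:
    """Replace letters-only maths spans with their plain text."""

    def fix(match: re.Match) -> str:
        body = match.group(1) if match.group(1) is not None else match.group(2)
        # Strip font wrappers repeatedly, in case they are nested.
        previous = None
        while previous != body:
            previous = body
            body = WRAPPER.sub(r"\1", body)
        plain = re.sub(r"\s+", "", body)  # "P j" -> "Pj"
        # Only letters left: almost certainly an abbreviation, not maths.
        return plain if plain.isalpha() else match.group(0)

    return INLINE_MATH.sub(fix, text)


if __name__ == "__main__":
    print(unmath(r"see \( \mathrm{Pj} \) and $P j$"))
```

Note that this also joins genuinely separate letters, so `$a b$` would become `ab`; it is worth diffing the result against the scan rather than trusting it blindly.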
Maybe. Anyway, an HTML file can go anywhere.
Thanks for all this! I’ve been referring to this file for ages, and it’s so poorly done that it’s a hassle to find anything. So this was really just to make my life easier. But hopefully it will benefit others too.
Since folks in this thread seem to be interested in manipulating PDFs, I’ll share this bit of software for cropping pdfs. It’s especially useful for taking a two-up page and making it one-up:
Thanks so much! I’m already 10-20% done (just battling the seasonal hayfever brainfog), and the archive.org version seems to be about on a par with the scan you provided already.