The Possibility of Developing OAI-PMH

Hello Developers of SuttaCentral,

I hope this message finds you well. My name is Lianghao Lu, and I am a metadata editor at Atla, formerly the American Theological Library Association. I am posting here at the recommendation of Bhante Sujato to ask about the possibility of setting up OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting, for SuttaCentral.

Atla is currently developing a new project called Eureka: https://www.atla.com/eureka/. Eureka aims to enhance the accessibility and discoverability of valuable content in religion and theology. It is intended to become a centralized hub for unique and often hard-to-find resources sought by researchers, practitioners, instructors, students, and others. It will serve as an information search and retrieval platform designed specifically for theology and religious studies. Therefore, it would be wonderful if Eureka could incorporate SuttaCentral’s resources and promote them to the wider academic community.

I noticed that SuttaCentral has already developed a comprehensive API, and we hope that SuttaCentral might consider building an additional OAI-PMH layer on top of its existing API. Specifically, we are interested in harvesting metadata for SuttaCentral’s published English translation texts, rather than full text.

The OAI-PMH implementation could support metadataPrefix=oai_dc, with records limited to published English translations. Each record could use an identifier pattern such as oai:suttacentral.net:{uid}.en.{author_uid} and include title, translator name and UID, SuttaCentral UID, canonical hierarchy, root language, translation language, publication title, first-published date, license, and the public SuttaCentral URL. Including subject headings or topical keywords, where available, would be especially helpful for improving discovery in theological and religious studies research platforms such as Eureka.
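To make the request a little more concrete, here is a minimal sketch of what one such record could look like when serialized as `oai_dc`. The two namespace URIs are the standard OAI-PMH and Dublin Core ones; everything else (the `record` dict, its keys, and the choice of which field maps to which `dc:` element) is an assumption for illustration, not SuttaCentral's actual data model.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for the oai_dc metadata format.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def oai_identifier(uid: str, lang: str, author_uid: str) -> str:
    """Identifier pattern proposed above: oai:suttacentral.net:{uid}.{lang}.{author_uid}."""
    return f"oai:suttacentral.net:{uid}.{lang}.{author_uid}"


def build_oai_dc(record: dict) -> ET.Element:
    """Map a (hypothetical) translation record onto simple Dublin Core elements."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")

    def add(tag: str, value) -> None:
        # Skip fields that are missing or empty.
        if value:
            ET.SubElement(root, f"{{{DC_NS}}}{tag}").text = value

    add("title", record.get("title"))
    add("creator", record.get("translator"))
    add("language", record.get("lang"))
    add("date", record.get("first_published"))
    add("rights", record.get("license"))
    add("identifier", record.get("url"))  # public SuttaCentral page
    for subject in record.get("subjects", []):
        add("subject", subject)
    return root


# Placeholder values only; not real SuttaCentral metadata.
sample = {
    "title": "Example title",
    "translator": "Example Translator",
    "lang": "en",
    "url": "https://suttacentral.net/mn47/en/sujato",
    "subjects": ["example subject"],
}
print(oai_identifier("mn47", "en", "sujato"))
print(ET.tostring(build_oai_dc(sample), encoding="unicode"))
```

In a full response this element would sit inside the `<metadata>` part of an OAI-PMH `<record>`, with the `oai:` identifier going in the record header.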

Useful OAI sets might include lang:en, type:translation, pitaka:vinaya, pitaka:sutta, collection sets such as collection:dn, collection:mn, and collection:sn, and creator sets such as creator:sujato or creator:brahmali.
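As a rough illustration of how those sets might be derived per record, the sketch below computes `setSpec` values from a translation's metadata. The dict keys (`lang`, `type`, `pitaka`, `collection`, `author_uid`) are hypothetical names chosen for the example, not fields confirmed to exist in the SuttaCentral API.

```python
from typing import Dict, List


def set_specs(record: Dict[str, str]) -> List[str]:
    """Return the OAI setSpec values a record would belong to.

    The keys read from `record` are assumed names for illustration.
    """
    specs = [f"lang:{record['lang']}", f"type:{record['type']}"]
    # Optional groupings: only emit a set when the field is present.
    for prefix, key in (
        ("pitaka", "pitaka"),
        ("collection", "collection"),
        ("creator", "author_uid"),
    ):
        value = record.get(key)
        if value:
            specs.append(f"{prefix}:{value}")
    return specs


example = {
    "lang": "en",
    "type": "translation",
    "pitaka": "sutta",
    "collection": "mn",
    "author_uid": "sujato",
}
print(set_specs(example))
```

One detail worth keeping in mind: in OAI-PMH a colon inside a `setSpec` expresses hierarchy, so `collection:mn` would be read as a set `mn` nested under a parent set `collection`. That seems to fit the grouping proposed here.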

For each record, the primary public URL should be the SuttaCentral website URL, for example, https://suttacentral.net/pli-tv-bu-vb-pc1/en/brahmali , rather than a GitHub source URL. GitHub or API paths could remain internal source data, but the harvested dc:identifier should direct users to the public SuttaCentral page. Full text does not need to be exposed through OAI-PMH. We only need lightweight metadata records with stable links to the corresponding public SuttaCentral pages.

Thank you very much for considering this possibility.


Thanks Lu. Just tagging a few of our volunteer developers here to see if anyone’s interested. Unfortunately at the moment I’m full time working with our main developer on a large scale project I don’t want to interrupt. In fact I have a meeting now!

@Jhanarato @Vimala @Pasanna @Snowbird

Anyone else?


Yeah, make sure to include @agilgur5. He’s currently looking at API testing.

Our current API is documented in Swagger. However, the REST API documentation is generated with Flasgger (last updated in 2023), which derives the Swagger docs from comments in the Python source, and these have not been kept in sync with the code. These days FastAPI is popular. It uses Pydantic and ensures that what is served matches the documentation.

@agilgur5 has plans to rectify the situation and can tell you more.


:waving_hand: Yes, specifically automated property tests against the OpenAPI spec with schemathesis (cf. Review API Tester · Issue #3573 · suttacentral/suttacentral · GitHub). I should be getting to that pretty soon.

Running that testing as part of CI will ensure that the API responses match the spec. We’ll see then how bad the current divergence is. For non-compliant cases, we’ll have to decide whether to change the spec (and update versions accordingly for consumers) or to change the API response and the corresponding client-side code.
Given that I haven’t seen much feedback from API consumers, it might just make sense to update the spec and not worry too much about spec breakage, but TBD.

A typed validation layer will make it easier to see these kinds of diffs at API design time. That may require migrating off Flask (which we should do anyway, in favor of something asyncio-native), so it might take a while to get there.
Property tests with schemathesis will be very useful in the interim and after can still be useful for finding edge case spec non-compliance.

Regarding OAI-PMH specifically, it’s maybe the second time I’ve heard of this protocol, so I’m not too knowledgeable about it.
From some quick searches, I couldn’t find any Flask or ArangoDB plugins that could be easily configured, so it seems like we’d have to either implement our own /oai route or configure pyoai and forward requests to it.
Since it’s for metadata only, I was hoping it could use statically generated XML similar to a sitemap.xml, but it appears to have various optional querying properties. So it may require dynamically constructing AQL queries against the DB and transforming the responses to XML to fit the OAI-PMH specifications. Possibly a first version could leave some of the optional pieces out. Possibly we could also pre-generate some of the XML responses, but I think there may be too many permutations to make that sustainable in a static fashion.
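To give a feel for the query surface, here is a minimal sketch of the argument checking that a hand-rolled /oai route would need before it ever touches the DB. The six verbs, their arguments, and the `badVerb` / `badArgument` error codes come from the OAI-PMH 2.0 spec; the function and table names are made up for this example, and it deliberately ignores date formats, set existence, and resumption-token contents.

```python
from typing import Dict, Optional

# verb -> (required arguments, optional arguments), per OAI-PMH 2.0.
VERB_ARGS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
}
# Verbs whose list responses may be paged with a resumptionToken.
RESUMABLE = {"ListSets", "ListIdentifiers", "ListRecords"}


def validate_request(params: Dict[str, str]) -> Optional[str]:
    """Return an OAI-PMH error code, or None if the arguments are acceptable."""
    verb = params.get("verb")
    if verb not in VERB_ARGS:
        return "badVerb"
    args = set(params) - {"verb"}
    if "resumptionToken" in args:
        # resumptionToken is an exclusive argument: nothing else may accompany it.
        if verb in RESUMABLE and args == {"resumptionToken"}:
            return None
        return "badArgument"
    required, optional = VERB_ARGS[verb]
    if not required <= args:
        return "badArgument"  # a required argument is missing
    if args - required - optional:
        return "badArgument"  # an unknown argument was supplied
    return None


print(validate_request({"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": "lang:en"}))
print(validate_request({"verb": "ListRecords"}))
```

The selective-harvesting arguments (`from`, `until`, `set`) on `ListIdentifiers` and `ListRecords` are where most of the permutations come from, which is why a fully static approach looks hard to sustain.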

Given that, it seems like a sizable feature to implement, unless there were a specific subset of queries Atla were most interested in that are less complicated.
Given Ven. Jhanarato and I are in the weeds with lots of maintenance and refactoring, I’m not sure either of us would have time to tackle that soon.
If someone else wants to contribute though, some of us volunteers might be able to spare some time to help show how the API and DB work. From there one could start crafting XML transformations against them.

As an alternative implementation, an SC OAI-PMH server might be possible to build purely as a consumer of the API. It could then be its own project that depends on the SC API, without requiring an internal implementation.
I’m not sure exactly what the OAI-PMH query space looks like compared to what is available in the SC API to know if that option would cover all the bases neatly (compared to an internal implementation against the DB).


Also note that there are API routes that aren’t specified at all:


Thank you so much for your reply! It seems that an OAI-PMH endpoint might be something SuttaCentral would develop in the long term, but not in the near future. In that case, there is an alternative way for Atla Eureka to ingest metadata: accessing it through the SuttaCentral API.

Eureka currently has a CSV template with the following fields:
id, model, parents, identifier, title, URL, contributing_institution, alternative_title, contributor, creator, date, description, extent, format_digital, format_original, language, place, publisher, rights, rights_holder, subject, time_period, types, remote_files, delete

The first field, id, is generated by Eureka. The model field takes one of two values: collection or work. Usually a group of works belongs to one collection. The identifier field should be the SC uid. Many other fields, such as title, contributor, and language, can be found through the SuttaCentral API. But I am not sure about the URL field, because Atla would want Eureka users to be able to link back to the actual content. So I wonder if it is possible for SuttaCentral to add this information to its API?

We can leave the other fields blank for now, and SuttaCentral can update them later. Eureka would periodically update its ingestions.
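As a sketch of what such an ingest file could look like, the snippet below fills the template above for one translation and leaves everything else blank. The column names are copied from the field list; the input dict keys, the choice to put the translator in contributor, and the URL pattern (taken from the shape of the example link earlier in this thread) are assumptions for illustration only.

```python
import csv
import io

# Column order copied from the Eureka template above.
EUREKA_FIELDS = [
    "id", "model", "parents", "identifier", "title", "URL",
    "contributing_institution", "alternative_title", "contributor",
    "creator", "date", "description", "extent", "format_digital",
    "format_original", "language", "place", "publisher", "rights",
    "rights_holder", "subject", "time_period", "types", "remote_files",
    "delete",
]


def eureka_row(t: dict) -> dict:
    """Build one template row; unknown fields stay blank for later updates."""
    row = dict.fromkeys(EUREKA_FIELDS, "")
    row.update(
        model="work",
        identifier=t["uid"],  # the SC uid
        title=t["title"],
        URL=f"https://suttacentral.net/{t['uid']}/{t['lang']}/{t['author_uid']}",
        contributor=t["translator"],  # assumption: translator goes here
        language=t["lang"],
    )
    return row


def to_csv(translations) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=EUREKA_FIELDS, lineterminator="\n")
    writer.writeheader()
    for t in translations:
        writer.writerow(eureka_row(t))
    return buf.getvalue()


# Placeholder title and translator name; not real metadata.
print(to_csv([{
    "uid": "mn47", "lang": "en", "author_uid": "sujato",
    "title": "Example title", "translator": "Example Translator",
}]))
```

The id column is left empty on purpose, since Eureka generates it.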

I believe URLs to the text itself can be built from the data in the API. It’s always just

https://suttacentral.net/<uid>/<language id>/<author id>

So, for example…

https://suttacentral.net/mn47/en/sujato
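In code, that pattern is a one-liner. The helper name below is made up for the example, and the percent-encoding is only a defensive assumption in case an id ever contains characters that are unsafe in a URL path.

```python
from urllib.parse import quote

BASE_URL = "https://suttacentral.net"


def text_url(uid: str, lang: str, author_uid: str) -> str:
    """Build the public page URL as <base>/<uid>/<language id>/<author id>."""
    parts = (uid, lang, author_uid)
    return "/".join([BASE_URL, *(quote(p, safe="") for p in parts)])


print(text_url("mn47", "en", "sujato"))
print(text_url("pli-tv-bu-vb-pc1", "en", "brahmali"))
```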
