Complete Archive and History of the Website and Discussions

sirinath · April 13, 2019, 8:33am

It would be great if there were a web spider that could archive the website, on each update, to:

Any attachments and external links to a certain depth should also be archived.

Some of the files might be PDF/Doc files from which the URLs can be extracted so outside links are preserved.

The service could be extended to archive other Buddhist sites that are also worth preserving. E.g.:

The following shows how often the site is archived:

Wayback Machine: as of the time of writing in 2019, there has been only one archival.

This is so that a copy is maintained in case SC infrastructure fails or any discussion links go stale.

sujato · April 14, 2019, 10:35pm

Hey thanks. I agree, this would be a great thing. I’ve looked in the past at getting archives for the main site, but not for D&D. Some of these services look interesting, it certainly can’t hurt to back things up.

I’ll discuss this with the team, but meanwhile, if any developers out there are interested to have a look at this, it would be most appreciated! It might be as simple as submitting our URL.

sirinath · April 15, 2019, 4:35am

Submitting a URL is simple, but the whole job might be a bit more complicated.

This can be done on two fronts:

Ping based - submit when the site updates
Crawler based - crawl the site and submit - this would be the only way for external sites

In the 1st case submitting needs to happen when the site is updated.
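
A minimal sketch of the ping-based approach, assuming the Wayback Machine's "Save Page Now" feature accepts a request to `https://web.archive.org/save/` followed by the page URL. The endpoint shape, the `User-Agent` string, and the helper names (`save_url`, `submit`, `ping_on_update`) are assumptions for illustration, so check the archive's current documentation and rate limits before relying on it.

```python
from urllib.request import Request, urlopen

# Assumed "Save Page Now" endpoint; verify against the archive's documentation.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_url(page_url):
    """Build the request URL that asks the archive to capture page_url."""
    return SAVE_ENDPOINT + page_url


def submit(page_url, timeout=30):
    """Send one capture request and return the HTTP status code."""
    request = Request(save_url(page_url),
                      headers={"User-Agent": "archive-ping-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return response.status


def ping_on_update(changed_urls, submit_fn=submit):
    """Submit every changed URL; record the status or the error per URL."""
    results = {}
    for url in changed_urls:
        try:
            results[url] = submit_fn(url)
        except OSError as exc:  # urllib's URLError and HTTPError derive from OSError
            results[url] = exc
    return results
```

The site would call something like `ping_on_update` from whatever hook fires when a page is published or edited. Passing `submit_fn` as a parameter keeps the network call replaceable, for example with a queue that respects the archive's rate limits.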

Another special case is attachments. You might need another library to get the URLs out of them: for PDFs, e.g. pdfx (GitHub - metachris/pdfx: extract text, metadata and references (pdf, url, doi, arxiv) from PDF, optionally downloading all referenced PDFs), and for MS files, oletools (Python tools to analyze OLE and MS Office files, from Decalage) or some other utility.
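
As a rough illustration of the URL-extraction step only: the sketch below assumes the attachment's text has already been pulled out by a tool such as pdfx or oletools, and then finds links in that text with a regular expression. The pattern and the `extract_urls` helper are hypothetical simplifications, not the method either library uses, and they will miss links stored as PDF annotations rather than as visible text.

```python
import re

# Simplified pattern: a scheme followed by characters that commonly appear in URLs.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")
# Punctuation that usually belongs to the sentence, not to the URL.
TRAILING_PUNCTUATION = ".,;:!?"


def extract_urls(text):
    """Return the unique URLs found in text, in order of first appearance."""
    found = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(TRAILING_PUNCTUATION)
        if url and url not in found:
            found.append(url)
    return found
```

Each URL returned this way could then be handed to the same submission step used for ordinary pages.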

Also the pinging part might need to be handled within the site.

Also external links might need to be archived at least 1 level deep for it to be useful.
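
A depth-limited crawl of this kind could look roughly like the sketch below. It is an assumption-laden outline: `fetch` is a hypothetical callable that returns a page's HTML (or `None` on failure), and politeness concerns such as robots.txt and request delays are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in html, with fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.links:
        absolute = urldefrag(urljoin(base_url, href)).url
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_depth=1):
    """Breadth-first crawl; returns {url: depth} for every URL reached."""
    seen = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue  # record the page, but do not follow its links
        html = fetch(url)
        if html is None:
            continue
        for link in extract_links(url, html):
            if link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return seen
```

With `max_depth=1` the result contains the starting page plus everything it links to directly, which matches the "1 level deep" idea; every URL in the result would then be submitted for archiving.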

SC definitely has the best infrastructure. Hopefully it could back up some other important Buddhist sites also.

Also the archived links can be displayed for reference purposes, so that if the URL scheme changes in the future, sites can link to the snapshot-in-time links.

Here are some lists of other sites that do archiving:
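
For the Wayback Machine, a snapshot-in-time link has the form `https://web.archive.org/web/<timestamp>/<original URL>`, with the timestamp written as `YYYYMMDDhhmmss`. A small sketch of building such a link follows; the `snapshot_url` helper name and the capture time in the usage note are hypothetical.

```python
from datetime import datetime


def snapshot_url(page_url, captured_at):
    """Build a Wayback Machine link to page_url as captured at captured_at."""
    stamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{page_url}"
```

For example, `snapshot_url("https://suttacentral.net/", datetime(2019, 4, 13, 8, 33, 0))` gives `https://web.archive.org/web/20190413083300/https://suttacentral.net/`. Whether a capture actually exists near that time depends on the archive.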

musiko · April 15, 2019, 5:57am

This might not be as easy as it seems since Discourse is not a static html web site, it is a pure JavaScript one.

There have been reports of Wayback Machine having problems archiving Discourse forums in the following topic, and similar issues could affect any or all archiving sites:

From the last update in the above topic there still seem to be unresolved issues.

Edit: some additional resources here:

musiko · April 15, 2019, 6:06am

“Archival” might be misleading, since the archiving crawler only took a snapshot of the first page; all links to categories actually return "Oops! That page doesn’t exist or is private."

Please see my previous post for the possible cause.

sirinath · April 15, 2019, 6:13am

The lowest-hanging fruit may be the main site first, then, with an archive on every update.

musiko · April 15, 2019, 6:25am

I’m not sure I understand this, content on the forum updates continually? Or did you have SuttaCentral main site in mind?

Maybe a more effective approach would be to self host a static html archive (e.g. archive.discourse.suttacentral.net with say, monthly snapshots) generated according to suggestions in this topic Improving Discourse static HTML archive - feature - Discourse Meta and then feed this static site to numerous archiving sites.

Aminah · April 15, 2019, 7:00am

That was what I understood by the proposition. I think there’s real value in archiving SuttaCentral. Conversely, I’d be very unsure about the merits of archiving D&D.

sirinath · April 15, 2019, 9:01am

Start with the main site, as it is easy, then the forum. Other Buddhist sites can follow later on. Any external links and attachments in the forum and the sites should be archived to about 1 level deep, either at regular intervals or, if there is a mechanism to know that something was updated, whenever an update happens.

As for why external links need to be archived to some depth: for example, in the post "Online sources for study of Pali manuscripts", the value is in having the manuscripts themselves archived in case something goes wrong.

sirinath · April 15, 2019, 9:07am

I don’t think any site can have a 100% uptime guarantee. It can go down due to infrastructure or other issues. If something goes wrong, this ensures that there is minimal loss. Also, if a link goes bad or there are edits, the history is saved for reference and citation purposes.

D&D infrastructure can fail, losing part of the content. External links outside SC control can change. URLs from external sites to SC and D&D go stale if URLs change. In this case the archive copy can serve as a snapshot-in-time permalink.

Even sites like StackExchange / StackOverflow, which can have great infrastructure due to being a commercial company, do a backup like this. Wikipedia does it too, including all external links.

musiko · April 15, 2019, 9:45am

This is what backups are for.

This part is governed by change management (where changes are retained for audit or versioning purposes) and link redirections (in case of changing the underlying infrastructure).

Discourse has this capability already built-in (every edit is available in post edit history and all admin actions are also logged in the back end).

As for the main site, I believe everything is based on Git and every change is also retained on GitHub. I believe this data is also backed-up as part of infrastructure maintenance, perhaps Aminah can shed more light on that.

The purpose of an archive is to have access to a fixed (usually aggregated, meaning without edit trails) and read-only snapshot of a point in time. With the content of the (rendered) html pages we are talking MB of data; inclusion of attachments easily leads into GB or TB territory (there are literally tens of gigabytes of attachments on this forum alone; I know for sure, being responsible for a large portion of them), and I very much doubt any internet archiving machine is actually saving them all with every snapshot.

This holds for every serious piece of software or service, but an archive is not a replacement for a backup or vice versa; they each fulfill a very distinct purpose.

And they are both vital for the longevity and availability of data.

musiko · April 15, 2019, 9:50am

There are some valuable assets here, for example Bhante’s essays on translation process and terminology, and many other useful resources.

Aminah · April 15, 2019, 11:44am

Indeed! When I commented, I also did pause to think of the brilliant work you’ve done to share precious Dhamma talks (incidentally, thanks so much!). With respect to these kinds of things, no question it would be great to have them archived, too. But that’s only one aspect of the forum. Maybe it is just because I’m a mod vet that I tend not to think about D&D so much as a resource like that. As maybe a slightly less mischievous point, I guess I’m quite warm towards the “right to be forgotten” principle.

sirinath · April 15, 2019, 12:50pm

The archiving systems above save them if you submit them through their API. All this is saved outside SC infrastructure, so there is no need to worry about size and storage.

Backups and GitHub are not good at preserving site snapshots and citations. If the site does go down, the content will still be inaccessible until it is brought online again, even from a backup. What if the site maintenance and outage is aggravated by internal issues like those in "Ajahn Brahm Has Resigned From the BSWA"? One reason SO / SE backs up a snapshot to 3rd party archive sites is as a hedge against organisational issues. What if, at the time of an outage, SC has lost its volunteers to bring the site back up again? What if a 3rd party can’t put in the needed effort to bring it up again from GitHub? What if you come to a point where there are not enough donations to keep the infrastructure running? There are a lot of ifs here, but these are all possibilities. No one in the 1980s would have expected the USSR to collapse. There are a lot of active sites and projects which suddenly go dormant. An average Internet user will not be able to access the backups or use GitHub to bring the site up again, but will readily be able to browse the Wayback Machine and Archive.today snapshots.

Though general discussion may not merit archiving, resources (attachments, URLs, etc) do merit being archived.

Also it might be a good idea to mirror the GitHub repo on GitLab as well. It wasn’t long ago that BerliOS and Codehaus were the most active code hosting sites.

So while the project is at its peak, invest in preservation as well, as a contingency in case anything unforeseeable happens. This may not pay dividends in this generation, but it would be useful in generations to come, provided the archive sites also survive.

sujato · April 15, 2019, 10:34pm

Thanks so much to both of you for your contributions, I appreciate it very much.

To summarize, there are two separate issues for the two sites: in fact three, because we should consider audio as well.

We obviously need a long-term strategy to handle all this.

A few notes from my perspective.

The most important thing is the data, especially the sutta texts, translations, and parallels. This is in the sc-data repo. Web interfaces and the like change all the time, it is no big deal.

So for serious long-term archiving that should be the focus. Here I would consider specialized tools such as tape backups, in addition to options discussed above.

Keeping things on Github for the medium term will be, I think, fine. Github isn’t going anywhere. But one of the great things about Git is how easy it is to fork and maintain up to date backups. Perhaps we could start a program to get people to create multiple backups of sc-data on their local computers.
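
For anyone wanting to keep such a local backup, a minimal sketch using standard Git commands is below: `git clone --mirror` for the first copy, `git remote update --prune` to refresh it, and `git push --mirror` to copy it to a second host. The helper names, the repository URL, and the paths in the usage note are placeholders, not the project's real addresses.

```python
import subprocess
from pathlib import Path


def mirror_commands(repo_url, dest):
    """Return the git command(s) to create or refresh a bare mirror at dest."""
    dest = Path(dest)
    if dest.exists():
        # Mirror already cloned: fetch all refs and drop ones deleted upstream.
        return [["git", "-C", str(dest), "remote", "update", "--prune"]]
    return [["git", "clone", "--mirror", repo_url, str(dest)]]


def push_command(dest, mirror_url):
    """Return the git command that copies every ref to a second host."""
    return ["git", "-C", str(dest), "push", "--mirror", mirror_url]


def run(commands):
    """Execute each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

A scheduled job could call `run(mirror_commands("https://example.org/sc-data.git", "sc-data.git"))` to keep the copy current. Note that `push --mirror` overwrites the target's refs to match the source, so it should only point at a repository meant to be a pure mirror.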

For archiving D&D, it seems that we need to look into the methods for preserving Discourse sites in general.

sirinath · April 16, 2019, 2:34am

Great ideas Ven. Sir.

As per my initial idea, is it possible to have browsable archived copies, triggered on updates, in case there is an outage, and to include some other selected important Buddhist websites as well?

Aminah · April 16, 2019, 7:24am

And then blockchain?

https://www.ccn.com/uk-national-archives-to-test-blockchain-tech-for-official-record-keeping

sujato · April 17, 2019, 9:34pm

If we are to use blockchain, then something like the IPFS seems like a good fit. It’s designed to preserve content and be resilient against any single point of failure. But while technically interesting, the real resilience arises from use. Unless it’s widely adopted the benefits will be limited. But it’s definitely worth keeping an eye on.
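
The core idea behind IPFS is content addressing: a piece of content is looked up by a hash of its bytes rather than by its location. The toy sketch below illustrates only that idea with a plain SHA-256 digest held in memory. It is not the IPFS API or its real identifier format, and `ContentStore` is a hypothetical name.

```python
import hashlib


class ContentStore:
    """Toy in-memory store where the key is derived from the content itself."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        """Store bytes and return their address (the SHA-256 hex digest)."""
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key):
        """Return the stored bytes, verifying they still match their address."""
        data = self._blobs[key]
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored content does not match its address")
        return data
```

Because the address depends only on the bytes, any holder of a copy can serve it and any reader can verify it, which is why resilience grows with the number of participants keeping copies.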

Media · April 19, 2019, 4:44pm

Just chiming in from the side here.

In the UK, “culturally significant” sites are archived (hardware, software and content) and maintained by the British Library, which is the national library of the UK. A Buddhist example is apparently the website of The Buddhist Society in London, which has a snapshot taken at regular intervals, all managed and paid for by the state. I presume other countries’ national archives do likewise, and it might be an idea to contact the relevant authority.

sirinath · May 5, 2019, 4:00pm

It might be an idea to have the git repos mirrored: Attackers Wiping GitHub and GitLab Repos, Leave Ransom Notes