I'm closing a ten-year-old internet forum. Are there any best practices for preserving web forums?

by snake_a_leg

I hope this doesn't violate the rules, but I'm not sure who to ask about this. It is a question directed at historians.

I'm planning to close a web forum I made ten years ago. Am I supposed to preserve it somehow as part of internet history? I'm not saying I consider it super important to historians, but I wanted to check before deleting 300k posts about an obscure hobby.

vinylemulator

You should not just delete it.

There is a good chance that the Internet Archive's Wayback Machine has already crawled your site, but you can also request a manual crawl before you shut it down.

Step 1: Create an account at https://web.archive.org/

Step 2: Visit https://web.archive.org/save

Step 3: Enter the index URL of your site and make sure "Save outlinks" is checked
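If you would rather script this for a list of thread URLs than paste them in one at a time, here is a minimal sketch. It assumes the Save Page Now endpoint accepts a plain GET at `https://web.archive.org/save/<url>`; the helper names (`save_url`, `request_capture`, `capture_all`) and the user agent string are made up for illustration. It only requests the page itself, not its outlinks, and the service may throttle you, so keep the delay generous.

```python
import time
import urllib.parse
import urllib.request

# Assumption: Save Page Now accepts a GET on this prefix followed by the page URL.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_url(page_url: str) -> str:
    """Build the Save Page Now URL for one page (no network access)."""
    # Keep URL punctuation intact; only escape characters such as spaces.
    return SAVE_ENDPOINT + urllib.parse.quote(page_url, safe=":/?&=%")


def request_capture(page_url: str, timeout: float = 120.0) -> int:
    """Ask the Wayback Machine to capture one page and return the HTTP status."""
    req = urllib.request.Request(
        save_url(page_url),
        headers={"User-Agent": "forum-archiver-sketch"},  # hypothetical name
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def capture_all(page_urls, delay_seconds: float = 15.0) -> dict:
    """Capture pages one by one, pausing between requests."""
    results = {}
    for url in page_urls:
        try:
            results[url] = request_capture(url)
        except OSError as exc:  # URLError, HTTPError and timeouts are OSErrors
            results[url] = str(exc)
        time.sleep(delay_seconds)
    return results
```

Calling `capture_all(["https://example.org/forum/viewtopic.php?t=42"])` (a placeholder URL) returns a dict mapping each URL to either a status code or an error message, so you can retry the failures later.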

There is also an IRC bot which does this and allows you to monitor progress: https://wiki.archiveteam.org/index.php/ArchiveBot

If I were you, I would do both before shutting down.
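Before you actually pull the plug, it may be worth confirming that captures exist for the pages you care about. A rough sketch, assuming the Wayback Machine availability API at `https://archive.org/wayback/available` and the JSON shape I remember it returning (`archived_snapshots` containing a `closest` entry); the function names are my own.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the availability API lives here and takes a `url` query parameter.
AVAILABILITY_API = "https://archive.org/wayback/available"


def closest_snapshot(api_response: dict):
    """Return (timestamp, snapshot_url) for the closest capture, or None."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def check_page(page_url: str, timeout: float = 30.0):
    """Query the API for one page and return its closest capture, if any."""
    query = urllib.parse.urlencode({"url": page_url})
    with urllib.request.urlopen(f"{AVAILABILITY_API}?{query}", timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Any page for which `check_page` returns `None` is a candidate for another manual save.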

orders1-65

Not a historian but I'll try to answer as a technology professional with a lot of data archival experience.

The Wayback Machine is great, but for forums it loses some important functionality, such as searching and linking across threads. The best option is to keep hosting the site in some form. You could look into what it would take to migrate to a cloud environment like AWS or GCP. If you set the site to read-only, it likely would not cost more than about $20 a month, and raising that amount through donations may be simple enough.

If that fails, the best archive is of course a direct database dump, so that anyone else could spin up phpBB or whatever software you use and run their own copy. However, a dump isn't an easily searchable medium.
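As a rough sketch of what taking such a dump could look like: this assumes the forum runs on MySQL or MariaDB, that `mysqldump` is on the PATH, and that credentials come from your client option file. The database name, output file name and function names are placeholders. The checksum helper is there so that anyone who receives the file can confirm it arrived intact.

```python
import hashlib
import subprocess


def dump_command(database: str, out_file: str) -> list:
    """Build a mysqldump command line (assumes a MySQL/MariaDB backend)."""
    return [
        "mysqldump",
        "--single-transaction",             # consistent snapshot for InnoDB tables
        "--default-character-set=utf8mb4",  # avoid mangling non-ASCII posts
        f"--result-file={out_file}",
        database,
    ]


def run_dump(database: str, out_file: str) -> None:
    """Run the dump; raises CalledProcessError if mysqldump fails."""
    subprocess.run(dump_command(database, out_file), check=True)


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large dumps do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `run_dump("forum", "forum.sql")` followed by `sha256_of("forum.sql")` gives you a dump plus a checksum to publish alongside it.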

If you can't host a pared-down version of the live site, the Wayback archive will at least be a decent way to keep as much content accessible as possible. Perhaps you could have the domain name redirect to the Wayback portal, or compile a list of the top 10/20/50/100 threads, link each one to its Wayback URL, and make that list the new home page of the site.
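The "list of top threads" idea is easy to generate. A minimal sketch, assuming Wayback URLs of the form `https://web.archive.org/web/<timestamp>/<original URL>`; the thread titles, URLs and timestamp are placeholders, and the function names are my own.

```python
import html

WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_link(original_url: str, timestamp: str) -> str:
    """Build a Wayback URL for a capture near the given 14-digit timestamp."""
    return f"{WAYBACK_PREFIX}{timestamp}/{original_url}"


def render_index(threads, timestamp: str) -> str:
    """Render (title, url) pairs as an HTML list of Wayback links."""
    items = []
    for title, url in threads:
        href = html.escape(wayback_link(url, timestamp), quote=True)
        items.append(f'  <li><a href="{href}">{html.escape(title)}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

You would feed it your own ranking, for example the most-viewed threads from the forum database, and paste the resulting list into a static home page.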

Schrankwand83

Historians and other historical researchers need data (the information, or simply put, the actual content) and metadata (who posted what and when, in relation to what else: the data needed to put other data into context). So you basically want to preserve everything, because you don't know what future historians will be looking for. Are they interested in the technical details of discussing conspiracy theories involving matchbox cars and fly fishing (to give the most obscure hobby I can make up right now)? Are they interested in a particular VIP user? Are they researching the language used from a linguistic point of view? Are they investigating the techniques web admins used to preserve web forums for the future? Are they analyzing emoji circulation?

You don't know, so it's best to keep everything and avoid any possible data loss. You want to make sure you preserve the frontend, the data and the metadata. From a technical point of view it may also be interesting to preserve the backend, which means that you shouldn't touch the web forum, the database, the backend code or the server at all (if you host it on your own server running Apache or nginx). You could offer everything as a VM image, which is currently the best way to conserve everything: it makes altering the preserved state very hard, though not impossible.

However, hacking incidents against archived web forums could also become a subject of historical research some day, so it also makes sense to change nothing at all. Think of hacked sites as internet ruins.

I think it's best to use different ways of archiving for different use cases. A VM image is not very user-friendly for most historians, because most of them are not tech-savvy and a VM is sandboxed. Anyone worried about the data being corrupted by influences that did not exist at the time of preservation would want to use a sandboxed VM. But if you think of web-based, AI-driven OSINT research, a simple archive of the file tree (code + other files + database) might be better, since data handling and scraping are so much easier.
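For the simple file-tree archive, a sketch of one way to do it: pack the tree into a tarball and record a SHA-256 checksum per file, so that later changes to a copy can be detected. The directory layout and function names are assumptions for illustration, and the archive should be written outside the tree being archived.

```python
import hashlib
import tarfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash one file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def archive_tree(root: Path, archive_path: Path) -> dict:
    """Write root into a gzip-compressed tarball and return its manifest.

    archive_path must lie outside root, or the tarball would try to include itself.
    """
    manifest = build_manifest(root)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(root, arcname=root.name)
    return manifest


def changed_files(root: Path, manifest: dict) -> list:
    """List files that were added, removed or modified since the manifest."""
    current = build_manifest(root)
    names = set(manifest) | set(current)
    return sorted(n for n in names if manifest.get(n) != current.get(n))
```

Publishing the manifest next to the tarball lets a researcher run `changed_files` on their extracted copy and see whether anything differs from the state at the time of preservation.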

Given the low life expectancy of electronic storage media, a server with RAID level 4 or higher is the best bet. Cloud services offer this if you just want to upload and forget. Tape libraries are the way to go if you really want to go down the longevity rabbit hole without carving everything in stone.