I was asked on the DailyDiapers forum how I managed to move all these posts over, since they made a backup of the story subforum. I’ll also post it here.
So I coded everything myself, using Node.js and a MySQL database.
For the first stage I created a table for the raw HTML (plus some basic fields like thread IDs), requested the index, and looped through all subforums and all their pages to collect the links to the threads. Then I looped through all the threads and their pagination. I used proxy rotation to avoid running into Cloudflare. Now I had around 9 GB of raw HTML.
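To give a rough idea of how that crawler worked, here’s a simplified Node.js sketch. It’s not the actual code: the `raw_pages` table, the proxy endpoints and the URL are placeholders, and the real version looped over every subforum, page and thread instead of fetching a single URL.

```js
// Simplified stage-1 crawler: fetch a page through a rotating proxy and store the raw HTML.
const axios = require('axios');
const mysql = require('mysql2/promise');

// Placeholder proxy endpoints; the real list was a pool of rotating proxies.
const PROXIES = [
  { host: '127.0.0.1', port: 8001 },
  { host: '127.0.0.1', port: 8002 },
];
let proxyIndex = 0;

// Pick the next proxy in round-robin fashion to spread requests and avoid Cloudflare blocks.
function nextProxy() {
  proxyIndex = (proxyIndex + 1) % PROXIES.length;
  return PROXIES[proxyIndex];
}

async function fetchPage(url) {
  const res = await axios.get(url, {
    proxy: nextProxy(),
    headers: { 'User-Agent': 'Mozilla/5.0' },
    timeout: 30000,
  });
  return res.data; // raw HTML string
}

async function main() {
  const db = await mysql.createConnection({ host: 'localhost', user: 'root', database: 'forum_archive' });
  await db.query(`CREATE TABLE IF NOT EXISTS raw_pages (
    id INT AUTO_INCREMENT PRIMARY KEY,
    thread_id INT NULL,
    url VARCHAR(512) NOT NULL,
    html LONGTEXT NOT NULL
  )`);

  // In the real run this was a loop over all subforums, their pages, and every thread's pagination.
  const url = 'https://example.org/forums/some-subforum/';
  const html = await fetchPage(url);
  await db.execute('INSERT INTO raw_pages (url, html) VALUES (?, ?)', [url, html]);
  await db.end();
}

main().catch(console.error);
```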
For the next stage I extracted and prepared the data. I created tables for threads, posts, attachments, users, likes, etc. Then it was basically days of building cheerio selectors to find and extract all the data I needed. The problem is not just finding the data, but handling all the weird edge cases that come up when you process a forum with a million posts that has existed for 20 years (users that got deleted and don’t have an ID anymore, so you have to make one up; threads that are randomly missing posts; post IDs that are out of order or have gaps; etc.).
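Here’s a simplified example of what that extraction looked like with cheerio. The selectors and the negative-ID trick for deleted users are illustrative, not the exact ones used against the real XenForo markup.

```js
// Simplified stage-2 extraction: pull posts out of a raw thread page with cheerio.
const cheerio = require('cheerio');

let nextSyntheticUserId = -1; // deleted users get made-up negative IDs

function extractPosts(html) {
  const $ = cheerio.load(html);
  const posts = [];

  $('.message').each((i, el) => {
    const $el = $(el);

    // Post ID: tolerate missing or odd values instead of assuming a clean sequence.
    const rawId = $el.attr('data-post-id');
    const postId = rawId ? parseInt(rawId, 10) : null;

    // Author: deleted accounts have no user ID anymore, so invent one.
    const authorLink = $el.find('.message-user a').first();
    let userId = parseInt(authorLink.attr('data-user-id') || '', 10);
    if (Number.isNaN(userId)) {
      userId = nextSyntheticUserId--;
    }

    posts.push({
      postId,
      userId,
      username: authorLink.text().trim() || 'Deleted member',
      bodyHtml: $el.find('.message-body').html() || '',
    });
  });

  return posts;
}

module.exports = { extractPosts };
```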
Then I had two side missions where I extracted attachments from posts and avatars of users, and wrote a downloader that collected all of them. Another side mission was collecting all the likes. They mention either a single user, 2 users, 3 users, or 3 users + “x others”. So I wrote another crawler that creates all these relations between users and posts and sends requests for the list of “others”, which was not included in the HTML.
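A rough sketch of the likes part, just to show the idea. The selectors and the URL of the “others” list are made-up placeholders; the real XenForo markup and endpoints look different.

```js
// Sketch of the likes crawler: read the visible likers, then fetch the "x others" list separately.
const axios = require('axios');
const cheerio = require('cheerio');

// Parse the visible "A, B, C and x others" line under a post.
function parseVisibleLikers($, postEl) {
  const likers = [];
  $(postEl).find('.reactionsBar a[data-user-id]').each((i, a) => {
    likers.push(parseInt($(a).attr('data-user-id'), 10));
  });
  // If an "and x others" link exists, the rest has to be fetched with an extra request.
  const othersUrl = $(postEl).find('.reactionsBar a.othersLink').attr('href') || null;
  return { likers, othersUrl };
}

// Fetch the full reaction list for posts that only show "x others" in the HTML.
async function fetchOtherLikers(othersUrl) {
  const res = await axios.get(othersUrl);
  const $ = cheerio.load(res.data);
  const ids = [];
  $('a[data-user-id]').each((i, a) => {
    ids.push(parseInt($(a).attr('data-user-id'), 10));
  });
  return ids;
}

module.exports = { parseVisibleLikers, fetchOtherLikers };
```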
The third stage was to get the posts into a format that would make sense for the Discourse forum. The primary problem here was the post content. XenForo uses BBCode and Discourse uses Markdown. Converting most of the basic stuff from BBCode to Markdown was kinda easy, but the problems began when I got to the quotes. Discourse requires a special notation with its own user ID and post ID in order to reference the quoted post correctly. So I had to prepare the quotes and then re-edit them after they were imported and all the Discourse user IDs and post IDs were assigned.
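To illustrate the two-pass idea, here’s a simplified sketch: the BBCode → Markdown conversion keeps the old XenForo post ID inside a placeholder, and a second pass after the import swaps it for the IDs Discourse assigned. The regexes and the placeholder format are examples, not the actual converter.

```js
// Sketch of the stage-3 conversion. Only a few basic BBCode tags are shown.
function bbcodeToMarkdown(bbcode) {
  return bbcode
    .replace(/\[b\](.*?)\[\/b\]/gis, '**$1**')
    .replace(/\[i\](.*?)\[\/i\]/gis, '*$1*')
    .replace(/\[url=(.*?)\](.*?)\[\/url\]/gis, '[$2]($1)')
    // Quotes keep the original XenForo post ID in a placeholder for now,
    // because the Discourse post/topic IDs don't exist yet at this point.
    .replace(/\[quote="([^,"]+), post: (\d+)[^\]]*\]/gi, '[quote="$1, post:XF-$2, topic:XF-TOPIC"]')
    .replace(/\[\/quote\]/gi, '[/quote]');
}

// Second pass, run after the Discourse import: swap the placeholders for the
// real Discourse post numbers and topic IDs using a mapping built during import.
// idMap is assumed to map old XenForo post IDs to { postNumber, topicId }.
function fixQuotes(markdown, idMap) {
  return markdown.replace(/post:XF-(\d+), topic:XF-TOPIC/g, (match, xfPostId) => {
    const mapped = idMap.get(parseInt(xfPostId, 10));
    if (!mapped) return match; // quoted post is missing (e.g. deleted), leave the placeholder
    return `post:${mapped.postNumber}, topic:${mapped.topicId}`;
  });
}

module.exports = { bbcodeToMarkdown, fixQuotes };
```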
Then I set up a local Discourse dev instance and wrote an import script in Ruby (used some AI here, as I had never used Ruby before) that imported the prepared data into Discourse. The import process alone took around 22 hours. After finishing that I rented a server, got the domain, set up Cloudflare, bought an SMTP plan for mails, installed Discourse, and then made a backup of my local Discourse instance, uploaded it to my server and imported it there. Made a logo, wrote the intro post, went through all the settings, and here we are.
Lines-of-code wise, I guess it’s around 10k, the biggest part being the HTML extraction. I started working on it one day after the shutdown was announced (Sep 3) and worked mostly whole days and nights until Sep 19th. I managed to finish everything that required access to adisc.org about 2 hours before it shut down, but I needed some time to set everything up, so the new forum went online 3 days later.