How did I create ADISC.CC? (Technical)

I was asked on the DailyDiapers forum how I managed to move all these posts over, since they made a backup of the story subforum. I’ll also post it here :)

So I coded everything myself, using Node.js and a MySQL database.

For the first stage I created a table for the raw HTML (plus some basic fields like thread IDs) and requested the index. Then I looped through all subforums and all their pages and collected the links to the threads. After that I looped through all the threads and their pagination. I used proxy rotation to avoid running into Cloudflare. At that point I had around 9GB of raw HTML.
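A minimal sketch of what the proxy rotation and pagination planning could look like. This is not the actual crawler code: the helper names, the round-robin strategy and the `page-N` URL pattern are all assumptions made for illustration.

```javascript
// Hypothetical helpers; the original crawler code is not shown in this post.

// Returns a function that hands out proxies in round-robin order.
function createProxyRotator(proxies) {
  if (proxies.length === 0) throw new Error("need at least one proxy");
  let index = 0;
  return function nextProxy() {
    const proxy = proxies[index];
    index = (index + 1) % proxies.length;
    return proxy;
  };
}

// Builds the list of page URLs for one subforum or thread.
// The "page-N" suffix is an assumed URL pattern.
function buildPageUrls(baseUrl, pageCount) {
  const urls = [];
  for (let page = 1; page <= pageCount; page++) {
    urls.push(page === 1 ? baseUrl : `${baseUrl}page-${page}`);
  }
  return urls;
}

// Pairs every page URL with the proxy that should be used to request it.
function planRequests(baseUrl, pageCount, nextProxy) {
  return buildPageUrls(baseUrl, pageCount).map((url) => ({
    url,
    proxy: nextProxy(),
  }));
}
```

Each planned request would then be fetched through its proxy and the response body stored as a row in the raw HTML table, so the extraction stage can run offline as often as needed.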

For the next stage I extracted and prepared the data. I created tables for threads, posts, attachments, users, likes, etc. Then it was basically days of building cheerio selectors to find and extract all the data I needed. The problem is not just finding the data, but handling all the weird edge cases that come up when you process a forum with a million posts that has existed for 20 years. (Users that got deleted and don’t have an ID anymore, so you have to make one up; threads that are randomly missing posts; post IDs that are out of order or have gaps in between; etc.)
Then I had two side missions: extracting the attachments from posts and the avatars of users, and writing a downloader that collected all of them.
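To illustrate the deleted-user edge case, here is one possible way to make up IDs. It is only a sketch: the function names are hypothetical, and using negative numbers for invented IDs is an assumed convention, not necessarily what was done here.

```javascript
// Assigns stable IDs to users, inventing one when the source has none.
// Synthetic IDs are negative so they cannot collide with real, positive IDs
// (an assumed convention for this sketch).
function createUserIdResolver() {
  const syntheticIds = new Map(); // normalized username -> made-up ID
  let nextSyntheticId = -1;

  return function resolveUserId(rawId, username) {
    const parsed = Number.parseInt(rawId, 10);
    if (Number.isInteger(parsed) && parsed > 0) return parsed;

    // Deleted user: reuse the ID already invented for this name, if any.
    const key = (username || "").trim().toLowerCase();
    if (!syntheticIds.has(key)) {
      syntheticIds.set(key, nextSyntheticId);
      nextSyntheticId -= 1;
    }
    return syntheticIds.get(key);
  };
}
```

Keying on the normalized username means every post by the same deleted user ends up under the same made-up ID, which keeps their posts grouped after the import.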

Another side mission was collecting all the likes. A post shows either one user, two users, three users, or three users plus “x others”. So I wrote another crawler that creates all these relations between users and posts and sends requests for the list of “others”, which was not included in the HTML.
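A rough sketch of how such a like summary could be split into names and an “others” count. The exact wording of the summary (“A, B, C and 5 others”) is an assumption, and `parseLikeSummary` is a hypothetical name; a real crawler would more likely read the individual user links with cheerio than split plain text.

```javascript
// Parses a like summary such as "Alice, Bob, Carol and 5 others".
// The text format is assumed; usernames that contain commas or " and "
// would be split incorrectly by this simple approach.
function parseLikeSummary(summary) {
  const text = summary.trim();
  const othersMatch = text.match(/^(.*?),?\s+and\s+(\d+)\s+others?$/i);
  const namesPart = othersMatch ? othersMatch[1] : text;
  const othersCount = othersMatch ? Number(othersMatch[2]) : 0;

  const names = namesPart
    .split(/\s*,\s*|\s+and\s+/)
    .map((name) => name.trim())
    .filter((name) => name.length > 0);

  // needsFullList tells the crawler to request the complete list of likers.
  return { names, othersCount, needsFullList: othersCount > 0 };
}
```

Only posts where `needsFullList` is true require an extra request, which keeps the number of additional page loads down.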

The third stage was to get the posts into a format that would make sense for the Discourse forum. The primary problem here was the post content: XenForo uses BBCode and Discourse uses Markdown. So I converted most of the basic stuff from BBCode to Markdown. That was kinda easy, but the problems began when I got to the quotes. Discourse requires a special notation with its own user ID and post ID in order to reference the quoted post correctly. So I had to prepare the quotes and then re-edit them after the import, once all the Discourse user IDs and post IDs had been assigned.
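The two-pass quote handling could be sketched roughly like this. Everything here is illustrative: the function names and the intermediate `xfpost:` marker are made up, only a few tags are covered, and the final `post:N, topic:M` form is my understanding of Discourse's quote markup rather than something taken from the actual import script.

```javascript
// Pass 1a: convert a few simple BBCode tags to Markdown (no nesting support).
function convertBasicBbcode(text) {
  return text
    .replace(/\[b\]([\s\S]*?)\[\/b\]/gi, "**$1**")
    .replace(/\[i\]([\s\S]*?)\[\/i\]/gi, "*$1*")
    .replace(/\[url=["']?([^\]"']+)["']?\]([\s\S]*?)\[\/url\]/gi, "[$2]($1)");
}

// Pass 1b: keep the XenForo post ID in a placeholder (hypothetical "xfpost:"
// marker), because the Discourse IDs do not exist before the import.
function markQuotes(text) {
  return text
    .replace(
      /\[quote="([^,"\]]+), post: (\d+)(?:, member: \d+)?"\]/gi,
      '[quote="$1, xfpost:$2"]'
    )
    .replace(/\[\/quote\]/gi, "[/quote]");
}

// Pass 2 (after the import): swap placeholders for Discourse references.
// postMap maps a XenForo post ID to { postNumber, topicId } in Discourse.
function resolveQuotes(text, postMap) {
  return text.replace(
    /\[quote="([^,"]+), xfpost:(\d+)"\]/g,
    (whole, name, xfId) => {
      const target = postMap.get(Number(xfId));
      // Quoted post was never imported: fall back to a name-only quote.
      if (!target) return `[quote="${name}"]`;
      return `[quote="${name}, post:${target.postNumber}, topic:${target.topicId}"]`;
    }
  );
}
```

The point of the placeholder is that the first pass can run on the prepared data before Discourse exists, while the second pass only needs a lookup table built during the import.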

Then I set up a local Discourse dev instance and wrote an import script in Ruby (I used some AI here, as I had never used Ruby before) that imported the prepared data into Discourse. The import process alone took around 22 hours. After finishing that I rented a server, got the domain, set up Cloudflare, bought an SMTP plan for mails and installed Discourse. Then I made a backup of my local Discourse instance, uploaded it to my server and imported it there. Made a logo, wrote the intro post, went through all the settings and here we are :D

In terms of lines of code I guess it’s around 10k, the biggest part being the HTML extraction. I started working on it one day after the shutdown was announced (Sep 3) and worked on it most days and nights until Sep 19th. I managed to finish everything that required access to adisc.org about 2 hours before it shut down, but I needed some time to set everything up, so it went online 3 days later.

The forum threads are still missing all of their images, right? None of my old threads have images, even though I know they did on ADISC proper (I checked before it got shut down).

So except for the stories category, most threads and posts should have their images. The images from the stories category got lost because I didn’t have access to the stories forum on adisc.org when I saved the data.
The stories were collected by a user on DailyDiapers and I added them yesterday, but his backup didn’t include images.

If you’re talking about posts outside of stories, I can look into it if you remember which specific post is missing an image.

I don’t really care about saving these particular images, but I know that all of my old diaper reviews from like 2015-ish no longer have their photos. I don’t really mind, though; they’re pretty antiquated, and better reviewing methods exist nowadays.

I’ll investigate, because it might also affect other posts. The only possibility I can think of is that a bunch of images were hosted on a subdomain of adisc.org that was offline when I downloaded the images. But if you say they were still there before the shutdown, it might be a different problem.

Actually, I’m just realizing that I hadn’t checked those in a while. Might have been an ADISC.org issue, then. I know the .onion archive shows that there were images there, but they’re still working on restoring all the images to their proper threads.

Hi,

Is there any way to contact you (admin) privately? I have emailed 2 adisc.cc accounts and been ghosted for weeks now.

Thanks.