Get the URLs of all 4chan.org textboards (dis.) and scrape and archive them. Use archive.org.
Name: Anonymous 2014-05-22 2:01
I'm going to find ways to compress the /prog/ db. I'll start by compacting the tags. Substituting the spoiler tags should give good results. After that, a representation for repeated posts will help compress the spam. If I can get it below 500MB then heliohost can host it, which is the only cool free webhost. The deadline is eventually.
Name: Anonymous 2014-05-22 22:06
>>41 The archives aren't that big, and archive.org is fine. Why do you want to compress them?
Name: Anonymous 2014-05-23 11:14
>>42 I want to host a readable, writable old world4ch, but am too cheap to pay for hosting that provides more than 500MB of storage.
Replacing all the spoiler tags on old /prog/ with <span class="spoiler">...</span> saves 817 MB. That's more than half the size of the uncompressed db.
Compressing the markup reduced the 1.5 GB prog.db to around 390 MB. I could host the old prog on heliohost now, but I want to fit all of world4ch. Any recommendations for using data compression in a database are welcome. Right now I'm thinking of serializing each thread into a flat file and then gzipping them.
Name: Anonymous 2014-07-02 8:18
Serializing all threads to flat text and DEFLATEing them by thread gave good results. All of world4ch fits in ~200 MB like this and can be randomly accessed efficiently enough. With an uncompressed caching layer for frequently accessed threads the overhead shouldn't be too bad.
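A rough sketch of the per-thread DEFLATE idea, in Python rather than whatever the site actually runs. The ThreadStore class, the record separator, and the cache size are all made up for illustration. zlib.compress wraps the DEFLATE stream in a small zlib header, and functools.lru_cache stands in for the uncompressed caching layer.

```python
import zlib
from functools import lru_cache

SEP = "\x1e"  # hypothetical post separator inside the flat text


class ThreadStore:
    """Toy store: each thread is flattened to text and compressed on its own."""

    def __init__(self, cache_size=128):
        self._blobs = {}  # thread id -> compressed bytes
        # Uncompressed cache for frequently read threads.
        self.get = lru_cache(maxsize=cache_size)(self._load)

    def put(self, thread_id, posts):
        text = SEP.join(posts)
        self._blobs[thread_id] = zlib.compress(text.encode("utf-8"), 9)
        self.get.cache_clear()  # crude invalidation, good enough for a sketch

    def _load(self, thread_id):
        # Random access: only the requested thread is decompressed.
        raw = zlib.decompress(self._blobs[thread_id])
        return tuple(raw.decode("utf-8").split(SEP))

    def compressed_size(self):
        return sum(len(blob) for blob in self._blobs.values())
```

Because every thread is its own blob, reading one thread never touches the others, at the cost of not sharing a dictionary across threads.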
>>47 Here is an idea: make your own format for prog that uses a weird form of BBCode where every tag is 1 letter, and when you close the tag you write [/]
Name: Anonymous 2014-07-02 13:29
>>55 If every tag is one letter, why bother keeping the square bracket syntax? The only reason the tag names need delimiters is so the parser (and the user) can instantly tell where they end. So you might as well switch to \b \i \o \u or something.
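For what it's worth, the generic [/] close from >>55 could look roughly like this. It is only a sketch: it assumes every tag is a single lowercase letter and that all tags are balanced and properly nested. The function names shorten and restore are invented here.

```python
import re

# Matches an opening one-letter tag like [b], or the generic close [/].
TOKEN = re.compile(r"\[(?:([a-z])|/)\]")


def shorten(text):
    """Replace every named close such as [/b] with the generic [/]."""
    return re.sub(r"\[/[a-z]\]", "[/]", text)


def restore(text):
    """Rebuild named closes from [/] by tracking open tags on a stack."""
    stack = []

    def repl(match):
        name = match.group(1)
        if name is not None:  # opening tag: remember it, leave it unchanged
            stack.append(name)
            return match.group(0)
        if not stack:  # generic close with nothing open
            raise ValueError("[/] with no open tag")
        return "[/" + stack.pop() + "]"

    return TOKEN.sub(repl, text)
```

The round trip only works on well-nested input. Something like [b][i]x[/b][/i] comes back as [b][i]x[/i][/b], and a stray close raises an error.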
>>55-57 I tried substituting tags with shorter representations in >>53. <sub> became something like <s and </sub> could have become <S. I didn't want to do an encoding that depended on balanced tags because of the malformed HTML. The scheme gracefully handled malformed tags. Decoding was easy. The parser seeked to the next < and used the lookahead to determine the substitution. There were savings but they didn't compare to >>54. I left the spoiler tags in their original form and the spoiler spam is so low in entropy it isn't a problem.
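A minimal sketch of that kind of substitution, not the actual code from >>53. The table is hypothetical and uses digits rather than <s / <S, so that tags left verbatim (like <span class="spoiler">) can't be mistaken for a code. It also assumes a literal < in post text is stored as &lt;, so every < starts either a code or a verbatim tag.

```python
import re

# Hypothetical table: each tag becomes "<" plus one digit.
TAGS = ["<sub>", "</sub>", "<sup>", "</sup>", "<b>", "</b>",
        "<i>", "</i>", "<u>", "</u>"]
ENCODE = {tag: "<" + str(n) for n, tag in enumerate(TAGS)}
DECODE = {code[1]: tag for tag, code in ENCODE.items()}
TAG_RE = re.compile("|".join(re.escape(tag) for tag in TAGS))


def encode(html):
    """Swap each known tag for its two-character code; leave the rest alone."""
    return TAG_RE.sub(lambda m: ENCODE[m.group(0)], html)


def decode(packed):
    """Seek to each "<" and use one character of lookahead to pick the tag."""
    out = []
    pos = 0
    while True:
        lt = packed.find("<", pos)
        if lt == -1:
            out.append(packed[pos:])
            break
        out.append(packed[pos:lt])
        nxt = packed[lt + 1:lt + 2]  # empty string if "<" is the last char
        if nxt in DECODE:
            out.append(DECODE[nxt])
            pos = lt + 2
        else:  # not a code: a verbatim tag, copy the "<" through
            out.append("<")
            pos = lt + 1
    return "".join(out)
```

Each tag is substituted on its own, so unclosed or misnested tags survive the round trip unchanged.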
When submitting a new thread in shiichan, the thread id is part of the post request. So the thread id is the timestamp of when you loaded the page to submit the thread, not when the thread is submitted. And if you generate the post body yourself, you can put in whatever thread id you want. This explains the threads -2147483648, 1, 3, 4, 1337, 7357, and 2147483648.
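A toy model of that behaviour, not shiichan's actual code. The field names "id" and "body" and both function names are invented for illustration. The only point is that the server stores the thread under whatever id comes back in the request.

```python
import time

threads = {}  # thread id -> list of post bodies


def render_new_thread_form(now=None):
    """Return the form fields; the id is fixed when the page is loaded."""
    now = int(time.time()) if now is None else now
    return {"id": str(now), "body": ""}


def handle_post(form):
    """Create the thread under whatever id the client sent back."""
    thread_id = int(form["id"])  # trusted as-is, no server-side timestamp
    threads.setdefault(thread_id, []).append(form["body"])
    return thread_id
```

A normal browser sends back the page-load timestamp, but a hand-built request such as {"id": "1337", "body": "..."} lands in thread 1337.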
Name: Anonymous 2014-07-13 6:20
Posting works now. The post creation page is still a debug page, so after posting, just hit back and refresh. BBCode doesn't work yet. The heliohost server w5ch is on has been down all day, so I've created a hidden service for the site.
The >'s become <span class="quote">, the newlines after the quote become </span>, the [o ] becomes <span class="o">, and [/o ] becomes </span>. The result is