Get the URLs of all 4chan.org textboards (dis.), then scrape and archive them. Use archive.org.
Name:
Anonymous2014-04-19 21:44
I have a fair amount of data that was collected last August. We can split up the boards and collect them using a well-tested script. Preferably using the JSON API to reduce the burden on their servers. Also, do it slowly.
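A minimal sketch of what "do it slowly" could look like, assuming nothing about the real script: `fetch_slowly`, `http_get`, and the 5 second delay are hypothetical names and values, and the URLs you pass in are whatever the board's JSON API actually exposes.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable


def http_get(url: str) -> bytes:
    """Fetch one URL with the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_slowly(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = http_get,
    delay_seconds: float = 5.0,  # assumed value; pick whatever feels polite
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Fetch each URL once, pausing between consecutive requests."""
    results: Dict[str, bytes] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results[url] = fetch(url)
    return results
```

Passing `fetch` and `sleep` in as parameters keeps the pacing logic testable without touching the network.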
I have copies of threads from newpol, newnew, lang, lounge, and sjis in html
>>2 Why are you concerned about straining joot's servers? A single popular /b/ thread probably exceeds the bandwidth of all of w4c a hundred times over.
Name:
Anonymous2014-04-20 3:25
>>4 Downloading all of /prog/ raw is 1.5 GB, so it's kind of a lot. I just want to be polite.
Name:
Anonymous2014-04-20 7:26
/lounge/ has been archived. Scraping went much more smoothly. There was only one instance of malformed html. /prog/ was infested with it. Either someone pwned world4ch and was inserting raw data into the database, or someone with access to it was messing around. There are poster dates from 1969. Stuff like that is hard to explain otherwise.
By the way, if you guys find any really good old threads, it might be a fun idea to replay them in /lounge/ (with proper introductions, so that people don't get confused). These are like scripts of old plays, and it would be nice to put on a performance once in a while.
>>27 From what I've gathered, an imageredditor got obsessed with some random slut and hacked 4chan for that reason. That doesn't seem like something a true /frog/anus would do.
>>27 If I had access to 4chan I would have done something more interesting than brag and leak ip addresses. Still, it was fun to see moot likes cocks from the account he uses to admin post on 4chan attention whore from a mass of moot worshippers. Inserting obfuscated code to world4ch to continue working but appear frozen from ips used by staff would have been nice. But again, this just continues reliance on a site operated by people who have no respect for us.
>>31 The one who gained access (or one pretending to be him) has stated that his access was limited to a few actions, and that he did as much as he figured out how to do. He didn't even have access to board-creation.
Name:
Anonymous2014-04-26 0:22
>>32 He should have been more patient. If you get stuck you don't just stop and blow everything. You wait until you observe more helpful information.
Name:
Anonymous2014-05-04 21:20
all of world4ch has been uploaded to archive.org. Here's a comprehensive list of all uploaded boards:
I might write a script for automatically downloading all of these, extracting them, and merging the databases together for a single world4ch db. It made me sad to see a lot of the other boards still had recent activity in them.
>>35 No problem. I'm glad it's over. It took a long time. I stopped posting updates after the subject.txt page went down temporarily. I was afraid a fagshit working for moot for free might have found this thread and was fucking with us, so I thought it was better to provide the illusion that the project was abandoned until it was completed. Now if world4ch is taken down in spite we still have the data, so it doesn't matter. We (someone) can even host a replacement and enable posting. Are there any cool free web hosts that allow a db that's circa 2-3 GB?
Name:
Anonymous2014-05-04 23:05
>>36 Sadly, Heliohost doesn't, and any paid SOLUTION will probably cost you more than $10/mo.
Is anyone willing to continue world3ch on their VPS or seedbox? Keep in mind you're hosting fresh spammer bait.
>>37 heliohost could work for each board with multiple fake accounts. Though I would feel guilty for exploiting them like that.
Name:
Anonymous2014-05-04 23:09
>>38 It's pretty hard to create accounts on Heliohost. They seem to have seen previous attempts at this and limited the maximum number of new accounts per day to 2x10^-5.
>>39 Just create a new account right after midnight pacific time a couple days in a row.
Name:
Anonymous2014-05-22 2:01
I'm going to find ways to compress the /prog/ db. I'll start by compacting the tags. Substituting spoiler should give good results. After that, a representation for repeated posts will help compress the spam. If I can get it below 500MB then heliohost can host it, which is the only cool free webhost. The deadline is eventually.
Name:
Anonymous2014-05-22 22:06
>>41 The archives aren't that big, and archive.org is fine. Why do you want to compress them?
Name:
Anonymous2014-05-23 11:14
>>42 I want to host a readable writable old world4ch, but am too cheap to pay for hosting that provides more than 500MB of storage.
replacing all the spoiler tags on old /prog/ with <span class="spoiler">...</span> saves 817 MB. That's more than half the size of the uncompressed db.
compressing the markup reduced the 1.5 GB prog.db to around 390 MB. I could host the old prog on heliohost now, but I want to fit all of world4ch. Any recommendations for using data compression in a database are welcome. Right now I'm thinking of serializing each thread into a flat file and then gzipping them.
Name:
Anonymous2014-07-02 8:18
serializing all threads to flat text and DEFLATEing them by thread gave good results. All of world4ch fits in ~200 MB like this and can be randomly accessed efficiently enough. With an uncompressed caching layer for frequently accessed threads the overhead shouldn't be too bad.
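Roughly what per-thread compression with a small uncompressed cache might look like. This is a sketch under assumptions, not the actual w5ch code: `ThreadStore` and its methods are made-up names, the blobs live in a dict instead of a database, and `zlib.compress` produces zlib-wrapped DEFLATE rather than a raw DEFLATE stream.

```python
import zlib
from collections import OrderedDict
from typing import Dict


class ThreadStore:
    """Keeps each thread as its own compressed blob, plus an LRU cache of plain text."""

    def __init__(self, cache_size: int = 128) -> None:
        self._blobs: Dict[int, bytes] = {}  # stand-in for a database table
        self._cache: "OrderedDict[int, str]" = OrderedDict()
        self._cache_size = cache_size

    def put(self, thread_id: int, text: str) -> None:
        # Compress one thread at a time so any thread can be read on its own.
        self._blobs[thread_id] = zlib.compress(text.encode("utf-8"), 9)
        self._cache.pop(thread_id, None)  # drop any stale cached copy

    def get(self, thread_id: int) -> str:
        if thread_id in self._cache:
            self._cache.move_to_end(thread_id)  # mark as recently used
            return self._cache[thread_id]
        text = zlib.decompress(self._blobs[thread_id]).decode("utf-8")
        self._cache[thread_id] = text
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used thread
        return text

    def compressed_size(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())
```

Compressing by thread instead of compressing the whole dump trades some ratio for random access: reading one thread only costs one small decompression, and hot threads skip even that.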
>>47 here is an idea, make your own format for prog that uses a weird form of bbcode where every tag is 1 letter and when you close the tag you write [/]
Name:
Anonymous2014-07-02 13:29
>>55 If every tag is one letter, why bother keeping the square bracket syntax? The only reason the tag names need delimiters is so the parser (and the user) can instantly tell where they end. So you might as well switch to \b \i \o \u or something.
>>55-57 I tried substituting tags with shorter representations in >>53. <sub> became something like <s and </sub> could have become <S. I didn't want to do an encoding that depended on balanced tags because of the malformed html. The scheme gracefully handled malformed tags. Decoding was easy. The parser seeked to the next < and used the look ahead to determine the substitution. There were savings but they didn't compare to >>54. I left the spoiler tags in their original form and the spoiler spam is so low in entropy it isn't a problem.
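Here is one way a scheme like that could work, as a sketch only. The code table and the `<<` escape for any other `<` are assumptions added so the round trip is exact even on malformed html; the post doesn't say how stray `<` characters were handled.

```python
import re

# Hypothetical code table: each known tag becomes "<" plus one letter.
TAG_CODES = {
    "<sub>": "s", "</sub>": "S",
    "<sup>": "p", "</sup>": "P",
    "<b>": "b", "</b>": "B",
    "<i>": "i", "</i>": "I",
}
CODE_TAGS = {code: tag for tag, code in TAG_CODES.items()}

# Known tags are tried first; a bare "<" is the fallback alternative.
_ENCODE_RE = re.compile(
    "|".join(re.escape(tag) for tag in sorted(TAG_CODES, key=len, reverse=True)) + "|<"
)


def encode(html_text: str) -> str:
    """Shorten known tags; escape every other '<' as '<<'. No balancing needed."""
    def shorten(match):
        return "<" + TAG_CODES.get(match.group(0), "<")
    return _ENCODE_RE.sub(shorten, html_text)


def decode(data: str) -> str:
    """Seek to each '<' and use one character of look-ahead to expand it."""
    out = []
    pos = 0
    while True:
        lt = data.find("<", pos)
        if lt == -1:
            out.append(data[pos:])
            break
        out.append(data[pos:lt])
        nxt = data[lt + 1:lt + 2]  # the look-ahead character, or "" at end of input
        if nxt == "<":
            out.append("<")
        elif nxt in CODE_TAGS:
            out.append(CODE_TAGS[nxt])
        else:
            out.append("<" + nxt)  # never produced by encode(); pass through
        pos = lt + 2
    return "".join(out)
```

Because every tag is handled on its own, unclosed or mismatched tags survive the round trip untouched.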
When submitting a new thread in shiichan, the thread id is part of the post request. So the thread id is the timestamp of when you loaded the page to submit the thread, not when the thread is submitted. And if you generate the post body yourself, you can put in whatever thread id you want. This explains the threads -2147483648, 1, 3, 4, 1337, 7357, and 2147483648.
Name:
Anonymous2014-07-13 6:20
Posting works now. The post creation page is still a debug page, so after posting, just hit back and refresh. BBCode doesn't work yet. The heliohost server w5ch is on has been down all day, so I've created a hidden service for the site.
The >'s become <span class="quote">, the newlines after the quote become </span>, the [o ] becomes <span class="o">, and [/o ] becomes </span>. The result is
>>81 I'm so fucking confused. What does [br] do? Just inline another <br/>? I wish I had a list of bbcode inputs and html outputs to refer to. This project became a nightmare as soon as I entered the actual frontend of shiichan. Oh and I've implemented the over 1000 thread thing. If you look at the source of the threads with over 1000 posts, the post form is still there and you can unhide it by editing the html with firebug. I've seen over 1000 necros from as late as 2013, so there was some way to bypass the limit in shiichan. My current implementation hides the post form after 1000 posts and actually stops accepting posts at 1111 posts so you can get quints with a script if you want to. Maybe I'll introduce some randomness to get the same mysteriousness as shiichan.
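The two limits described above could be sketched like this. The function names and the exact boundary handling are assumptions: "after 1000 posts" is read as "the thread already holds 1000 posts", and "stops accepting posts at 1111" as "post number 1111 is the last one that gets in".

```python
HIDE_FORM_AFTER = 1000    # the form disappears once the thread holds this many posts
STOP_ACCEPTING_AT = 1111  # the server refuses posts once the thread holds this many


def show_post_form(post_count: int) -> bool:
    """Whether the rendered thread page should include the post form."""
    return post_count < HIDE_FORM_AFTER


def accepts_post(post_count: int) -> bool:
    """Whether the server still accepts a new post, form or no form."""
    return post_count < STOP_ACCEPTING_AT
```

Keeping the two checks separate is what leaves the gap between 1000 and 1111 open for anyone posting with a script.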
it used to insert line BReak. That was really retarded.
Name:
Anonymous2014-07-15 20:42
>>82 I don't follow the thread, I just look at the latest replies, so I hope I'm not off-topic here. I think you are discussing old /prog/'s bbcode, right? In which case, to multiquote, the technique used was this: > foo[br]bar
which would appear as
foo bar
By the way, to escape bbcode, there's [#][/#] (which I just used twice to show to you once).
I need help /prog/. How can I write a regular expression in python that will find all matches for N regular expressions and perform a customized replacement for each given expression? I can't iterate the substitutions one after the other because the already replaced strings would get matched again by the later expressions.
tldr; I need to lex in python. How can I do that using the standard library?
Name:
Anonymous2014-07-17 22:46
>>91 why do python programmers think everything about the world changes when using python?
I can never tell if the posts about "<extremely common thing> in python" are trying to appeal to people who think anything written in python is somehow better or if they're just proud of themselves for getting anything done in that shitheap of a language.
tl;dr - lexing with regex = two problems, like everywhere else (except perl 6) - stop being a baby and lex how you would lex in any other language
Name:
Anonymous2014-07-17 23:45
[math] was a tag that just put the text in a <div>. If jsMath hadn't been totally broken because of an undefined variable (if I recall correctly, it loaded before the page body, thus had nothing to work with) it would have done interesting things. It worked fine in my unreleased experimental SuperW4ch userscript, where I loaded the imagereddit jsmath script instead.
Name:
Anonymous2014-07-18 0:23
>>92 If I was writing in C I would lex using a state machine that was machine generated with lex or flex. But I'm using python instead of C. I could write the state machine myself in python but it would be slow as fuck to do that kind of processing at the script level. The only thing I can think of would be to use multiple regular expressions and perform multiple passes and craft the expressions so they somehow don't substitute within each other, but it's difficult and opens doors for html injection if things go wrong. It's frustrating to have a library that can do regex but can't do this conveniently.
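For what it's worth, the standard `re` module can do this in one pass: join the N expressions into a single alternation of named groups and give `re.sub` a function that dispatches on `match.lastgroup`. A sketch, assuming three made-up rules (links, >>N references, and html special characters); the real markup rules would go in the same table.

```python
import html
import re


def _link(match):
    url = html.escape(match.group(0), quote=True)
    return '<a href="{0}">{0}</a>'.format(url)


def _post_ref(match):
    number = match.group(0)[2:]  # strip the leading ">>"
    return '<a href="#{0}">&gt;&gt;{0}</a>'.format(number)


def _escape(match):
    return html.escape(match.group(0), quote=True)


# Order matters: at each position the first alternative that matches wins.
RULES = [
    ("link", r"https?://[^\s<>\"]+", _link),
    ("post_ref", r">>\d+", _post_ref),
    ("special", r"[&<>\"']", _escape),  # catch-all so raw html never gets through
]

LEXER = re.compile("|".join("(?P<%s>%s)" % (name, pattern) for name, pattern, _ in RULES))
HANDLERS = {name: handler for name, _, handler in RULES}


def render(text: str) -> str:
    """Single pass over the input; replacement output is never rescanned."""
    return LEXER.sub(lambda match: HANDLERS[match.lastgroup](match), text)
```

Since `re.sub` only scans the original string, the html emitted by one rule can't be matched by another, and the catch-all rule escapes anything the other rules didn't claim.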
>>96 I took the easy route and decided to only detect hyperlinks beginning with a protocol. The range expressions are functioning as closely as I could emulate. I'm going to take a break from development for a while. If you have any changes you'd like to put in, post a diff here or somewhere and I'll merge it in when I see it. You are also welcome to pull the code off the site and host your own. I believe the site is feature complete except for the subject page and json feed. I have no plans to implement a moderation interface and no IP addresses or identities are recorded in the database. IP addresses still appear in logs though, so be aware of that. I'm not looking but someone else might.
fyi heliohost took away the sqlite3 module so w5ch.heliohost.org is broken until that dependency is removed. The admin user name is ``w5ch''. You can type that in to reactivate the account if it has been suspended from me not logging in once a month and looking at heliohost's advertisements in the admin interface.
There was a thread on /vip/ I was going to bump today. OP said he would check back in 3 years to see if his thread was on top. Now he's just going to see that the board is gone ;_;