
Re: Posts from the old board

Posted: Mon Mar 16, 2020 4:23 pm
by zompist
Brad has the data, and apparently can read it fine!

Re: Posts from the old board

Posted: Mon Mar 16, 2020 5:16 pm
by bradrn
zompist wrote: Mon Mar 16, 2020 4:23 pm Brad has the data, and apparently can read it fine!
I can confirm this: I’ve loaded the data into my MySQL installation (albeit with a slightly different method than Ser recommended) and connected it to a local phpBB installation, and it seems to be working fine. zompist told me in a PM that it’s fine to run those queries we agreed on and share the results, so I’ll do that when I get some time.

(One thing I’m not entirely sure about is how I’ll make the resulting data available — I don’t have a website or anything I might use to host it. I do have GitHub though, so I might post the data there, but this doesn’t feel ideal; does anyone have any better ideas?)

Re: Posts from the old board

Posted: Mon Mar 16, 2020 5:46 pm
by zompist
Once we have a solution I can host it.

I applaud you guys for getting this far, but I don't think we're done. Most people won't be able to use a raw database. I think people will need some sort of filtering:

* a list of forums
* for a forum: list of threads
* all the posts for a thread

Since the data will be static, I don't think you need phpbb for this. Each page is basically a representation of a single SQL query, and links can include the query for the pages below it. (I.e. in the list of forums, for each forum you grab the name, and make a link that includes the forum ID as a parameter.)

If you can just chug out the above data, it can be prettified later-- e.g. paginating the results or interpreting the BBCode. But in an era when merely loading the front page of CNN downloads megabytes of data, I don't think presenting hundreds of text posts on one page will be terrible.

(Also, I'd be happy to work on the prettifying bit once the database wrangling is in place.)
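The three page types described above map onto three queries. Here is a minimal sketch in Python with sqlite3 — the table and column names follow phpBB's conventions (phpbb_forums, phpbb_topics, phpbb_posts), but the cut-down schema and sample rows are invented for illustration:

```python
import sqlite3

# In-memory stand-in for the stripped board database; the real phpBB
# schema has many more columns than shown here.
db = sqlite3.connect(':memory:')
db.executescript("""
CREATE TABLE phpbb_forums (forum_id INTEGER, forum_name TEXT);
CREATE TABLE phpbb_topics (topic_id INTEGER, forum_id INTEGER, topic_title TEXT);
CREATE TABLE phpbb_posts  (post_id INTEGER, topic_id INTEGER, post_text TEXT);
INSERT INTO phpbb_forums VALUES (1, 'Conworlds');
INSERT INTO phpbb_topics VALUES (10, 1, 'Polysynthesis for Novices');
INSERT INTO phpbb_posts  VALUES (100, 10, 'First post!');
""")

# Page 1: list of forums; each forum name links to a thread-list page.
forums = db.execute('SELECT forum_id, forum_name FROM phpbb_forums').fetchall()

# Page 2: list of threads for one forum; forum_id comes from the link.
topics = db.execute(
    'SELECT topic_id, topic_title FROM phpbb_topics WHERE forum_id = ?',
    (1,)).fetchall()

# Page 3: all posts for one thread; topic_id comes from the link.
posts = db.execute(
    'SELECT post_id, post_text FROM phpbb_posts WHERE topic_id = ?',
    (10,)).fetchall()

print(forums, topics, posts)
```

Each generated page is then just the rows of one query rendered as links (or, for posts, as text).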

Re: Posts from the old board

Posted: Mon Mar 16, 2020 6:28 pm
by bradrn
zompist wrote: Mon Mar 16, 2020 5:46 pm Once we have a solution I can host it.

I applaud you guys for getting this far, but I don't think we're done. Most people won't be able to use a raw database. I think people will need some sort of filtering:

* a list of forums
* for a forum: list of threads
* all the posts for a thread

Since the data will be static, I don't think you need phpbb for this. Each page is basically a representation of a single SQL query, and links can include the query for the pages below it. (I.e. in the list of forums, for each forum you grab the name, and make a link that includes the forum ID as a parameter.)

If you can just chug out the above data, it can be prettified later-- e.g. paginating the results or interpreting the BBCode. But in an era when merely loading the front page of CNN downloads megabytes of data, I don't think presenting hundreds of text posts on one page will be terrible.

(Also, I'd be happy to work on the prettifying bit once the database wrangling is in place.)
Totally agree with this. In fact, the raw database is pretty unusable even for me, since all the BBCode is stored in a pretty weird format which will need to be decoded (e.g. [b][u]Table of Contents[/u][/b] gets stored as [b:3v569udf][u:3v569udf]Table of Contents[/u:3v569udf][/b:3v569udf]). I’m thinking that we could convert the database to a set of HTML files (possibly one per topic, plus one per forum and one for the top-level), which would give a read-only static website. My university has just brought the mid-semester break forward (they’ve announced it literally as I was writing this) due to the ongoing pandemic situation, so I’ll have plenty of time to help (although there are probably people who are much better at HTML and CSS than I am).
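For illustration, stripping those suffixes is mechanical once the post's bbcode_uid is known (it is stored alongside the text in the phpbb_posts table). A minimal, deliberately naive Python sketch:

```python
def strip_bbcode_uid(text, uid):
    # phpBB writes tags as [b:<uid>] ... [/b:<uid>]; deleting every
    # ':<uid>' fragment restores plain BBCode. (Naive: it would also
    # touch a literal ':<uid>' inside the post body, which is unlikely
    # but possible.)
    return text.replace(':' + uid, '')

sample = ('[b:3v569udf][u:3v569udf]Table of Contents'
          '[/u:3v569udf][/b:3v569udf]')
print(strip_bbcode_uid(sample, '3v569udf'))
# → [b][u]Table of Contents[/u][/b]
```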

Re: Posts from the old board

Posted: Mon Mar 16, 2020 6:35 pm
by zompist
bradrn wrote: Mon Mar 16, 2020 6:28 pm all the BBCode is stored in a pretty weird format which will need to be decoded (e.g. [b][u]Table of Contents[/u][/b] gets stored as [b:3v569udf][u:3v569udf]Table of Contents[/u:3v569udf][/b:3v569udf]).
I hope no one has used 3v569udf as a word in their conlang... :?

Re: Posts from the old board

Posted: Mon Mar 16, 2020 6:49 pm
by bradrn
zompist wrote: Mon Mar 16, 2020 6:35 pm
bradrn wrote: Mon Mar 16, 2020 6:28 pm all the BBCode is stored in a pretty weird format which will need to be decoded (e.g. [b][u]Table of Contents[/u][/b] gets stored as [b:3v569udf][u:3v569udf]Table of Contents[/u:3v569udf][/b:3v569udf]).
I hope no one has used 3v569udf as a word in their conlang... :?
It looks like this encoding changes depending on the post. But however this works, I’m sure that phpBB encodes it in a way such that it doesn’t get in the way of posts. (Proof: the post you quoted includes those strings, and it’s working fine.)

Re: Posts from the old board

Posted: Mon Mar 16, 2020 7:07 pm
by bradrn
Using Ser’s instructions, I’ve successfully managed to make a stripped database dump of the old board. Unfortunately it’s too large for GitHub to handle (~136 MiB), so I’m not entirely sure how to make it available… does anyone have any ideas?

Re: Posts from the old board

Posted: Mon Mar 16, 2020 8:18 pm
by Richard W
Does compression (e.g. by plain old zip for maximum portability) help?
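For a sense of how much it could help: an SQL dump is mostly near-identical INSERT statements, which compress very well. A toy Python demonstration (the sample line is invented):

```python
import gzip

# Simulate the repetitive bulk of a dump file: many similar INSERTs.
dump = b"INSERT INTO phpbb_posts VALUES (1, 'hello world');\n" * 1000
packed = gzip.compress(dump)
print(len(packed) / len(dump))  # a small fraction of the original size
```

A real dump is less uniform than this, so the ratio will be worse, but text-heavy dumps routinely shrink by a large factor.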

Re: Posts from the old board

Posted: Mon Mar 16, 2020 8:36 pm
by Kuchigakatai
bradrn wrote: Mon Mar 16, 2020 6:49 pm
zompist wrote: Mon Mar 16, 2020 6:35 pm
bradrn wrote: Mon Mar 16, 2020 6:28 pm all the BBCode is stored in a pretty weird format which will need to be decoded (e.g. [b][u]Table of Contents[/u][/b] gets stored as [b:3v569udf][u:3v569udf]Table of Contents[/u:3v569udf][/b:3v569udf])
I hope no one has used 3v569udf as a word in their conlang... :?
It looks like this encoding changes depending on the post. But however this works, I’m sure that phpBB encodes it in a way such that it doesn’t get in the way of posts. (Proof: the post you quoted includes those strings, and it’s working fine.)
Actually, I just read that page you linked to about parsing phpBB text with the phpBB3 library, and it does imply that you need some other information from the database about bbcodes, possibly the phpbb_bbcodes table (and perhaps other things as well). You might need to add that table to the second stripped database. The bbcode_bitfield and bbcode_uid fields in phpbb_posts are actually optimizations to make printing posts into HTML faster at the expense of disk space. I think that if I try to use the library on your sample data, maybe I'll get an error that will tell me what is needed. Let me try...

EDIT: Also, I just noticed that phpbb databases have a table called phpbb_poll_options and another one called phpbb_poll_votes. I guess this means our stripped database, as it currently is, may have lost all poll options and votes. Normally I wouldn't care, but I wonder if the post parsing function of the library might complain about the absence of these two tables once it hits a poll...
bradrn wrote: Mon Mar 16, 2020 7:07 pm Using Ser’s instructions, I’ve successfully managed to make a stripped database dump of the old board. Unfortunately it’s too large for GitHub to handle (~136 MiB), so I’m not entirely sure how to make it available… does anyone have any ideas?
Ah, 136 MiB! Was the original dumpfile you got from zompist 350 MiB or 850 MiB? Good to hear my instructions actually worked with just a slight adaptation regardless.

I could put up a "secret" webpage on a website of mine for you to upload it, and then I'll just put a download link from there. Then zompist could get the file from my website and host it himself on zompist.com or verduria.org. I'll send you a PM in some minutes.

Re: Posts from the old board

Posted: Mon Mar 16, 2020 9:24 pm
by zompist
There are very few polls in the data, since it wasn't a generally accessible function.

(IIRC the 800M figure is what the admin page reported, and included attachments, which aren't in the dump file.)

Re: Posts from the old board

Posted: Mon Mar 16, 2020 9:38 pm
by bradrn
Richard W wrote: Mon Mar 16, 2020 8:18 pm Does compression (e.g. by plain old zip for maximum portability) help?
I can’t believe I didn’t think of that! Maybe it’s because I’m not used to committing zip files to git… But I’ll try that when I get a chance.
Ser wrote: Mon Mar 16, 2020 8:36 pm
bradrn wrote: Mon Mar 16, 2020 6:49 pm
zompist wrote: Mon Mar 16, 2020 6:35 pm I hope no one has used 3v569udf as a word in their conlang... :?
It looks like this encoding changes depending on the post. But however this works, I’m sure that phpBB encodes it in a way such that it doesn’t get in the way of posts. (Proof: the post you quoted includes those strings, and it’s working fine.)
Actually, I just read that page you linked to about parsing phpBB text with the phpBB3 library, and it does imply that you need some other information from the database about bbcodes, possibly the phpbb_bbcodes table (and perhaps other things as well). You might need to add that table to the second stripped database. The bbcode_bitfield and bbcode_uid fields in phpbb_posts are actually optimizations to make printing posts into HTML faster at the expense of disk space. I think that if I try to use the library on your sample data, maybe I'll get an error that will tell me what is needed. Let me try...
That’s what I said earlier! I had assumed that your queries included these columns based on what I had said about that. I’ll have to remake the stripped database then — hopefully it won’t take too long.

EDIT: Sorry, it looks like I misinterpreted you! It looks like you have indeed included the relevant columns, but there’s an extra table which needs to be included as well. In that case, I won’t need to remake the stripped database — I simply need to add that table to the dumpfile. I’ll do that now.
bradrn wrote: Mon Mar 16, 2020 7:07 pm Using Ser’s instructions, I’ve successfully managed to make a stripped database dump of the old board. Unfortunately it’s too large for GitHub to handle (~136 MiB), so I’m not entirely sure how to make it available… does anyone have any ideas?
Ah, 136 MiB! Was the original dumpfile you got from zompist 350 MiB or 850 MiB? Good to hear my instructions actually worked with just a slight adaptation regardless.
The original dumpfile was 339 MiB. So the stripping reduced the file size to 40% of its original size.
I could put up a "secret" webpage on a website of mine for you to upload it, and then I'll just put a download link from there. Then zompist could get the file from my website and host it himself on zompist.com or verduria.org. I'll send you a PM in some minutes.
I’m a bit confused by this. Which ‘it’ do you want me to upload?

Re: Posts from the old board

Posted: Tue Mar 17, 2020 1:22 am
by Kuchigakatai
bradrn wrote: Mon Mar 16, 2020 9:38 pm
I could put up a "secret" webpage on a website of mine for you to upload it, and then I'll just put a download link from there. Then zompist could get the file from my website and host it himself on zompist.com or verduria.org. I'll send you a PM in some minutes.
I’m a bit confused by this. Which ‘it’ do you want me to upload?
The script that mysqldump would dump from the reduced public database (after the last bit of the steps I posted).


also, oh gods of humanity why is server config so confusing I've been struggling with a damn upload page for two hours now this is just surreeeeal

Re: Posts from the old board

Posted: Tue Mar 17, 2020 2:05 am
by bradrn
Ser wrote: Tue Mar 17, 2020 1:22 am
bradrn wrote: Mon Mar 16, 2020 9:38 pm
I could put up a "secret" webpage on a website of mine for you to upload it, and then I'll just put a download link from there. Then zompist could get the file from my website and host it himself on zompist.com or verduria.org. I'll send you a PM in some minutes.
I’m a bit confused by this. Which ‘it’ do you want me to upload?
The script that mysqldump would dump from the reduced public database (after the last bit of the steps I posted).


also, oh gods of humanity why is server config so confusing I've been struggling with a damn upload page for two hours now this is just surreeeeal
Thanks for trying to do that for me! But you don’t need to: I’ve just released it in a GitHub repo. (If I’d known you were struggling so much I would have done this sooner, but I just saw your post now…)

EDIT: Actually, I’ve discovered that my Dropbox account has enough space, so I’ve stored it there instead: https://www.dropbox.com/s/jyvi030td7dje ... p.sql?dl=0. Dropbox is probably a more appropriate place for it than GitHub, given that this sort of file can’t really be version-controlled.

Re: Posts from the old board

Posted: Tue Mar 17, 2020 7:57 am
by bradrn
I managed to parse a post to HTML! I’m actually really excited about this, given that I had become so frustrated that I was becoming convinced it was impossible. Thankfully, I managed to find this post by ToonArmy, which finally gave me the correct incantations to persuade phpBB to give up its secrets. (This may sound melodramatic, but it actually feels pretty appropriate: the number of errors PHP and phpBB can produce is truly astounding!) Anyway, here’s the script, slightly edited and annotated:

Code: Select all

<?php

// mysterious definition to convince phpBB to load the files
define('IN_PHPBB', true);

// include necessary phpBB files
$phpbb_root_path = './';
$phpEx = 'php';
include('./common.php');
include('./includes/bbcode.php');

// start session for user, so phpBB will let us do stuff
$user->session_begin();
$auth->acl($user->data);
$user->setup();

/* query database - phpBB uses its own custom database class (so it can be polymorphic
   over the database), so don't assume you can just use documentation for MySQLi or
   anything like that! */
$post_id = 955526;
$sql = 'SELECT post_text, bbcode_uid, bbcode_bitfield, enable_bbcode, enable_smilies, enable_magic_url FROM phpbb_posts WHERE post_id = ' . $post_id;
$result = $db->sql_query_limit($sql, 1);
$row = $db->sql_fetchrow($result);
$db->sql_freeresult($result);

// Parse the BBCode, finally
$row['bbcode_options'] = (($row['enable_bbcode']) ? OPTION_FLAG_BBCODE : 0) +
    (($row['enable_smilies']) ? OPTION_FLAG_SMILIES : 0) + 
    (($row['enable_magic_url']) ? OPTION_FLAG_LINKS : 0);
$text = generate_text_for_display($row['post_text'], $row['bbcode_uid'], $row['bbcode_bitfield'], $row['bbcode_options']);

// echo the HTML result to stdout, or to the webpage if running as a webpage
echo $text;
Assuming you’ve saved this as parse-bbcode.php, this script should output the HTML corresponding to post number 955526 to stdout if you run it in the directory of your local phpBB installation (you do need to install phpBB locally in order to use its BBCode parsing facilities). Another nice way to play around with this is to use PHP’s built-in web server by running php -S localhost:8000, then navigating to http://localhost:8000/parse-bbcode.php to see the rendered output of post 955526. (Note: post 955526 is the first post of the Polysynthesis for Novices thread; I chose this as it’s a reasonably well-known post with a nice amount of BBCode. But you can replace this with any post ID as long as it exists; good ones are 1059218 for more BBCode, 893313 for an image, and 1142643 for lots of quotes.)

I’m thinking that now we could run this script over every post to get one HTML file per post (although we’d need to add a couple more things, like the poster’s username), then concatenate the posts to get one HTML file per topic, and then create one more file for each forum. If we then add a bit of CSS (required to make stuff like quotes display nicely), I think that would create a reasonably nice archive. What does everyone think about this plan?

EDIT: Oops, just realised that the script I gave only works for the unstripped database, which needs to be loaded into phpBB in order for the $db calls to work. It’s getting very late here now, so I’ll have another look at getting it working with the publicly available stripped version tomorrow. But it doesn’t look too difficult to get it working: simply replace the query using $db with the equivalent query using MySQLi. (Although admittedly, every time so far that I’ve said something ‘doesn’t look too difficult’, that’s exactly what it’s turned out to be… Hopefully this time will go better!)

Re: Posts from the old board

Posted: Tue Mar 17, 2020 8:12 am
by Richard W
Huge topics will need to be split into multiple pages.

Re: Posts from the old board

Posted: Tue Mar 17, 2020 10:04 am
by Pabappa
I appreciate all of the hard work you all have done. I had offered to help at the beginning, but it looks like I would have contributed nothing and may even have just gotten in the way. I am glad this project is going forward and near completion.

Ideally we could split, yes, but I wonder if there's a technical problem that would come about if links to posts are hardcoded to have a certain page. Are links to posts going to work at all? Even if not, it's still a tremendous help to us to have those threads up and accessible.

If we can't split long threads for whatever reason, perhaps the bare-bones UI is slim enough that the rare 70-page-long threads will load slowly but still much faster than they would on a normal phpBB board. I have a page on my website that's 80,000 words long, and it loads instantaneously even on my relatively old mobile phone, and even on a slow connection. As zomp says, there really is an incredible amount of bloat in modern web design... my 80,000-word page is only 787K, because it's entirely text. Yet a Twitter post with just a few hundred words almost always occupies more than 1 MB when viewed in the PC Twitter UI. (I don't have a way of checking for phones.) And that 787K page is my dream diary, which I manually copied from the dream thread before the blank-posts problem appeared. I estimate one of every five posts in the thread was from me, so if that's accurate, even the longest of all threads on the old ZBB will not be much more than 4 MB to display all at once.
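The arithmetic behind that estimate, assuming the dream diary really is about one post in five of its thread:

```python
# 80,000 words of plain text weigh 787 KB; if that's ~1/5 of the
# thread, the whole thread is roughly five times as large.
diary_kb = 787
estimated_thread_kb = diary_kb * 5
print(estimated_thread_kb)  # 3935 KB, i.e. a bit under 4 MB
```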

Re: Posts from the old board

Posted: Tue Mar 17, 2020 1:17 pm
by zompist
bradrn wrote: Tue Mar 17, 2020 7:57 am I’m thinking that now we could run this script over every post to get one HTML file per post (although we’d need to add a couple more things, like the poster’s username), then concatenate the posts to get one HTML file per topic, and then create one more file for each forum. If we then add a bit of CSS (required to make stuff like quotes display nicely), I think that would create a reasonably nice archive. What does everyone think about this plan?
That would create over 200,000 html files!

You should be able to write a single .php page that runs the query for a given post id, runs your parsing script, and displays it.

Re: Posts from the old board

Posted: Tue Mar 17, 2020 5:16 pm
by Nerulent
Wouldn't it just be easier at this point to dump the data into a new version of phpBB, rather than trying to re-engineer everything? Even just into this one as a read-only forum (assuming you can do that)? Or are there major structural changes to phpBB that make this unfeasible?

Re: Posts from the old board

Posted: Tue Mar 17, 2020 5:55 pm
by zompist
It can be done-- if I understand right, that's exactly what Brad has on his local machine.

Whether that's easier or not is another question! Getting phpbb installations right is, in my experience, easy except when it's not.

I think it's up to Brad what's easiest to do. If he wants to use phpbb he can just upload what he has to incatena.org, replacing the old board, and making it read-only. (Recall that part of the work was to remove all the login info, so it can't really be used as a live board.)

Re: Posts from the old board

Posted: Tue Mar 17, 2020 5:57 pm
by bradrn
Richard W wrote: Tue Mar 17, 2020 8:12 am Huge topics will need to be split into multiple pages.
Yes, I think I agree with this, actually. In fact, I don’t think pagination would be too tricky.
Pabappa wrote: Tue Mar 17, 2020 10:04 am Ideally we could split, yes, but I wonder if there's a technical problem that would come about if links to posts are hardcoded to have a certain page. Are links to posts going to work at all? Even if not, it's still a tremendous help to us to have those threads up and accessible.
We could potentially do something similar to phpBB: each page gets a link to the next and previous page, plus the first and last page. Links to posts can be achieved by giving each post a header with an ID attribute, so you can link to them like #post-number.
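A sketch of that scheme in Python — the page size and the markup are illustrative choices, not anything phpBB mandates:

```python
# Split a topic's posts into fixed-size pages, and wrap each post in an
# element whose id makes '#post-<id>' fragment links work.
def paginate(items, per_page):
    return [items[i:i + per_page] for i in range(0, len(items), per_page)]

def render_post(post_id, body_html):
    # the id attribute is the anchor target for links to this post
    return '<div class="post" id="post-%d">%s</div>' % (post_id, body_html)

pages = paginate(list(range(1, 61)), per_page=25)  # 60 dummy posts
print([len(p) for p in pages])  # → [25, 25, 10]
print(render_post(955526, '<p>…</p>'))
```

Each page would then also carry first/previous/next/last links, as on phpBB itself.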
zompist wrote: Tue Mar 17, 2020 1:17 pm
bradrn wrote: Tue Mar 17, 2020 7:57 am I’m thinking that now we could run this script over every post to get one HTML file per post (although we’d need to add a couple more things, like the poster’s username), then concatenate the posts to get one HTML file per topic, and then create one more file for each forum. If we then add a bit of CSS (required to make stuff like quotes display nicely), I think that would create a reasonably nice archive. What does everyone think about this plan?
That would create over 200,000 html files!

You should be able to write a single .php page that runs the query for a given post id, runs your parsing script, and displays it.
Actually, this is a good point. I had just sort of assumed that there would be a manageable number.

On the other hand, I thought that we were doing this because we wanted to avoid PHP. So if there is a way to run PHP, then why don’t we just make the board read-only and re-host it?
zompist wrote: Tue Mar 17, 2020 5:55 pm It can be done-- if I understand right, that's exactly what Brad has on his local machine.

Whether that's easier or not is another question! Getting phpbb installations right is, in my experience, easy except when it's not.

I think it's up to Brad what's easiest to do. If he wants to use phpbb he can just upload what he has to incatena.org, replacing the old board, and making it read-only. (Recall that part of the work was to remove all the login info, so it can't really be used as a live board.)
Wouldn’t this mean that I would have to upgrade the phpBB version in order to get it working?