Page 2 of 5

Re: Posts from the old board

Posted: Fri Jan 24, 2020 3:38 pm
by Nortaneous
bradrn wrote: Thu Jan 23, 2020 9:09 pm
Pabappa wrote: Thu Jan 23, 2020 9:06 pm I wouldnt think the passwords would be recoverable, since if they were, people who host phpBB boards could use them to hack their own users' accounts on other sites. I havent looked into it, but I do remember one phpBB board admin who hacked the administrator of a rival phpBB board, and to do it he had to set up a false registration process where the password was sent unencrypted. (This worked because the rival board owner just so happened to use the same password for both boards.) I wouldnt think it would be necessary for the perpetrator to do this if it were possible to just unhash the passwords in the database. But again, I havent looked into this.
Even if the passwords are unrecoverable, I still think that it would be better to obliterate them, just in case there is some method to recover them.
yes, I don't know how phpBB stores its passwords but if they're unsalted you can use rainbow tables. wow that sounds fake

Re: Posts from the old board

Posted: Sun Jan 26, 2020 8:09 pm
by Travis B.
Nortaneous wrote: Fri Jan 24, 2020 3:38 pm
bradrn wrote: Thu Jan 23, 2020 9:09 pm
Pabappa wrote: Thu Jan 23, 2020 9:06 pm I wouldnt think the passwords would be recoverable, since if they were, people who host phpBB boards could use them to hack their own users' accounts on other sites. I havent looked into it, but I do remember one phpBB board admin who hacked the administrator of a rival phpBB board, and to do it he had to set up a false registration process where the password was sent unencrypted. (This worked because the rival board owner just so happened to use the same password for both boards.) I wouldnt think it would be necessary for the perpetrator to do this if it were possible to just unhash the passwords in the database. But again, I havent looked into this.
Even if the passwords are unrecoverable, I still think that it would be better to obliterate them, just in case there is some method to recover them.
yes, I don't know how phpBB stores its passwords but if they're unsalted you can use rainbow tables. wow that sounds fake
If they're MD5-hashed you can directly break them with ease, that's how broken MD5 is.

Re: Posts from the old board

Posted: Wed Feb 05, 2020 10:11 pm
by Pabappa
only 602 bytes got me a search function in PHP that Im sure would work as well as anyone would need it to ... it might over-report results, but that's better than the opposite. Unless MySQL is encrypted/compressed and therefore low-level search functions like grep wont work?

or maybe speed is the problem ... as you said, searchin a 300 MB database might take a thousand times as long as searching 300K worth of short text files, and even if the traffic to the site was fairly small it could use up a lot of processor time.

Re: Posts from the old board

Posted: Wed Feb 05, 2020 11:12 pm
by Nortaneous
i do not think it is wise to try to search a sql database with grep

Re: Posts from the old board

Posted: Wed Feb 05, 2020 11:28 pm
by bradrn
How does one search a SQL database then?

Re: Posts from the old board

Posted: Thu Feb 06, 2020 12:11 am
by Nortaneous
with Structured Query Language

Re: Posts from the old board

Posted: Thu Feb 06, 2020 1:17 am
by bradrn
Nortaneous wrote: Thu Feb 06, 2020 12:11 am with Structured Query Language
Good point. I’m horribly unfamiliar with SQL, so I wasn’t sure whether it was possible to use it to do a text search across multiple records.

(Of course, I don’t actually know if phpBB stores posts in a database. It may well turn out that it stores it in a form which is amenable to a simple text search! But I presume that others here know more than me about this topic.)

Re: Posts from the old board

Posted: Thu Feb 06, 2020 7:38 am
by Nortaneous
bradrn wrote: Thu Feb 06, 2020 1:17 am
Nortaneous wrote: Thu Feb 06, 2020 12:11 am with Structured Query Language
Good point. I’m horribly unfamiliar with SQL, so I wasn’t sure whether it was possible to use it to do a text search across multiple records.

(Of course, I don’t actually know if phpBB stores posts in a database. It may well turn out that it stores it in a form which is amenable to a simple text search! But I presume that others here know more than me about this topic.)
phpBB uses a SQL database, yes. Some SQL databases provide the tools for relatively simple optimized text search -- see here. For others, you'd have to either use LIKE '%foo%' or build the term vectors yourself.

Re: Posts from the old board

Posted: Thu Feb 06, 2020 9:13 am
by Kuchigakatai
I would like to write this post with more detail, but in the little time I have, I'll say:

1. I'm in favour of publishing the database, with personal information pruned. Then multiple people could try to make working frontends.
2. By personal information I mean emails, passwords, and also very importantly, private messages.
3. Passwords from public database hashes are generally recoverable. A good algorithm (e.g. Bcrypt, which is what I think phpBB uses) with good salting (long salts, and a different salt for each password) simply make the recovery a lot slower, which is great because it buys you time. I don't know what configuration phpBB 3.0 uses, but a couple years ago someone who I trust told me a password protected with PHP's classic default Bcrypt setup (good salting with a cost parameter of 10) may take three good Amazon WebServices servers around six months or so to definitely decrypt one password (normally very impractical for a hacker who wants to attack many users, but practical if they're aiming at one specific person; and governments of course can do a lot better).
4. I was once fairly familiar with phpBB's database structure because of a custom modification I once made. I'm happy to say it's not that hard to figure out.
5. I wouldn't call phpBB's search data in the database laughable, precisely because the data provides wonderful savings in processing & time when searching text. It is useful data to keep co$$$ts down while providing fast searches, in ways that MySQL's or Postgres's search functions are not. The data keeps costs down because these days processing is more expensive than storage.

Re: Posts from the old board

Posted: Thu Feb 06, 2020 11:42 am
by Nortaneous
If it gets published, better to pick tables to publish (and leave out everything else by default) than to pick tables *not* to publish -- if the table doesn't absolutely *need* to be published, it shouldn't be.
Ser wrote: Thu Feb 06, 2020 9:13 am 5. I wouldn't call phpBB's search data in the database laughable, precisely because the data provides wonderful savings in processing & time when searching text. It is useful data to keep co$$$ts down while providing fast searches, in ways that MySQL's or Postgres's search functions are not. The data keeps costs down because these days processing is more expensive than storage.
Is Postgres's full-text search that bad?

Re: Posts from the old board

Posted: Thu Feb 06, 2020 2:57 pm
by Kuchigakatai
Nortaneous wrote: Thu Feb 06, 2020 11:42 amIf it gets published, better to pick tables to publish (and leave out everything else by default) than to pick tables *not* to publish -- if the table doesn't absolutely *need* to be published, it shouldn't be.
Yes, I think you're right about that. In fact, I just remembered that phpBB saves the last ~10 distinct IP addresses or so from which a user has logged in, and that's personal information that should not be published. The tables with users' settings are also very much unnecessary (most people never even change the default settings anyway).
Is Postgres's full-text search that bad?
Postgres's full-text search is a very decent thing, but it's weaker when the data is in multiple languages (because word normalization undergoes lots of false positives, unless carefully configured and applied on well-marked content, which is usually not the case on a phpBB public forum) or contains a lot of non-standard language (because that doesn't get normalized). There is always the option of doing almost no normalization at all (other than turning uppercase to lowercase and the like), but then you're trying to find posts with practically the same procedure that phpBB uses.

On the other hand, Postgres has that ranking thing out of the box which gives you the most important results depending on how many matches a post has, while phpBB just orders everything by the timestamp.

The extensive storage of word locations (for all attested words except the most common ones), which zompist finds a bit laughable, is referred to as a Generalized Inverted Index in Postgres's documentation, and is recommended for often-consulted columns. So both phpBB and Postgress use that, although it wouldn't surprise me if Postgres has a better implementation with less storage size with only a slight penalty in processing.

Re: Posts from the old board

Posted: Thu Feb 06, 2020 3:06 pm
by zompist
I mostly said it was laughable because it doesn't work well! That was a longstanding complaint about the old board, in fact. To actually find anything you mostly had to use Google...

Re: Posts from the old board

Posted: Thu Feb 06, 2020 4:51 pm
by Kuchigakatai
zompist wrote: Thu Feb 06, 2020 3:06 pm I mostly said it was laughable because it doesn't work well! That was a longstanding complaint about the old board, in fact. To actually find anything you mostly had to use Google...
Oh, yes, I just realized phpBB must decay in long-running forums, because pruning threads doesn't update the search index in the database. lol

So... those 200000 posts throughout the decade and a half all counted to mark words as "too common in the forum". That's terrible. Another thing: it doesn't help much when querying phrases anyway.

Alright guys, I was wrong, and that data is not very useful in a long-running forum...

Re: Posts from the old board

Posted: Fri Feb 28, 2020 10:44 pm
by bradrn
Has there been any progress on this yet? It’s been three weeks since the last post.

Re: Posts from the old board

Posted: Thu Mar 05, 2020 6:03 pm
by Kuchigakatai
I am definitely very interested in recovering my own posts from the old ZBB, so I'll be spending some time looking into what can be done with a database dump or a rebuilt local copy of such a database. The task is to identify which columns of which tables have content we want to make public. As mentioned before, it is not as straightforward as it sounds, since a lot of the metadata is done through relations between different tables (such as a user ID number for the person who posted a post, and a thread ID number and a forum ID number the post is found in). An ideal proposal would document that.

If anyone else is interested in finding out how, here is a page where you can download phpBB 3.0.12 (the version that the old ZBB was using when the database dump was made):
https://download.phpbb.com/pub/release/3.0/3.0.12/

Then install phpBB locally, make extra subforums as an admin, create various users and make some posts with them. Then examine the database. Again, don't propose grabbing whole tables, as a fair bit of personal information is recorded for users in particular (email, hashed password, last few IPs they connected from, likely a number of other things).

Re: Posts from the old board

Posted: Thu Mar 05, 2020 8:05 pm
by bradrn
Ser wrote: Thu Mar 05, 2020 6:03 pm I am definitely very interested in recovering my own posts from the old ZBB, so I'll be spending some time looking into what can be done with a database dump or a rebuilt local copy of such a database. The task is to identify which columns of which tables have content we want to make public. As mentioned before, it is not as straightforward as it sounds, since a lot of the metadata is done through relations between different tables (such as a user ID number for the person who posted a post, and a thread ID number and a forum ID number the post is found in). An ideal proposal would document that.

If anyone else is interested in finding out how, here is a page where you can download phpBB 3.0.12 (the version that the old ZBB was using when the database dump was made):
https://download.phpbb.com/pub/release/3.0/3.0.12/

Then install phpBB locally, make extra subforums as an admin, create various users and make some posts with them. Then examine the database. Again, don't propose grabbing whole tables, as a fair bit of personal information is recorded for users in particular (email, hashed password, last few IPs they connected from, likely a number of other things).
That sounds like something I’d be really interested in investigating as well! I don’t have too much free time right at this moment, but I certainly think I’ll be trying this at some time in the next couple of days.

Re: Posts from the old board

Posted: Fri Mar 06, 2020 4:31 am
by bradrn
Hmm, phpBB is proving much more difficult to install than expected — I can’t get it to find MySQL. But in the process I did find this list of phpBB database tables, which could prove useful.

EDIT: I finally managed to get phpBB working! Like I expected, it was horribly painful (made worse by the fact that I’m using the same outdated version the old ZBB uses); unlike I expected, I actually managed to get it working in one night. I think that now I’ll make a small forum with a couple of subforums, users and posts, and then make a dump and post it here for further analysis by anyone interested.

Re: Posts from the old board

Posted: Fri Mar 06, 2020 7:32 am
by bradrn
Alright, so I’ve managed to set up a small forum and make a backup of it using phpMyAdmin (with the default options)! Here’s a full description of the contents of the forum (collapsed for space):
More: show
Users: There are four users: admin_user is an admin, with password passwd; moderator is a moderator, with password password; normaluser is a user with standard permissions, with password 123456789; þe Únıcoðɛ ūseṟ is a user with standard permissions and a name to test support of non-ASCII characters, with password 123456789. (There’s also Admin1, with password admin; this user was created after I forgot the admin password and followed these instructions, but I couldn’t manage to delete this user afterwards.)

Forums: There is one top-level forum, called The Forum. This has two subforums, called Subforum 1 and Subforum 2.

Subforum 1 contains two threads:
  • One is titled S1 a [P1], and contains the following posts:
    1. The first post was created by admin_user, is titled S1 a [P1], and contains the text ‘This is Post 1 in Topic A of Subforum 1.’. It was subsequently edited by moderator, adding the text ‘EDIT: This is a moderator edit.’.
    2. The second post was created by normaluser, is titled Re: S1 a [P2], and quotes the first; it also contains the text ‘And this is Post 2 in Topic A of Subforum 1.’.
  • One is titled S1 b [P1], and contains the following posts:
    1. The first post was created by moderator, is titled S1 b [P1], and contains the text ‘This is Post 1 in Topic B of Subforum 1.’.
    2. The second post was created by moderator, is titled Re: S1 b [P2], and contains the text ‘And this is Post 2 in Topic A of Subforum 1.’.
    Subforum 2 contains two threads:
    • One is titled S2 a, and contains the following posts:
      1. The first post was created by admin_user, is titled S2 a, and contains the text ‘This is Post 1 in Topic A of Subforum 2.’.
      2. The second post was created by admin_user, is titled Re: S2 a, and contains the text ‘This is Post 2 in Topic A of Subforum 2.’.
      3. The third post was created by þe Únıcoðɛ ūseṟ, is titled Re: S2 a, and quotes both previous posts; it also contains the text ‘And this is Post 3 in Topic A of Subforum 2.’.
    • One is titled þe Únıcoðɛ ʌ BBCode θṟɛđ, created to give examples of more complex formatting and non-ASCII characters. It contains two posts, by þe Únıcoðɛ ūseṟ. The first post contains the following:

      Code: Select all

      Interesting characters: ⱥ ƃ ɔ ð ɛ ƒ ğ ħ ı ĵ ƙ λ ḿ ŋ õ ʘ ʌ̨
      
      [size=150]BBCode:[/size]
      
      [b]This is some bold text[/b]
      
      [i]This is some italic text[/i]
      
      [list=1]
      [*] Here
      [*] is
      [*] a list[/list]
      
      Which renders as follows:
      Interesting characters: ⱥ ƃ ɔ ð ɛ ƒ ğ ħ ı ĵ ƙ λ ḿ ŋ õ ʘ ʌ̨

      BBCode:

      This is some bold text

      This is some italic text
      1. Here
      2. is
      3. a list
      The second post contains the text ‘This is a post I will edit.’, followed by ‘EDIT: And here is the edit.’; this latter text has been edited in by þe Únıcoðɛ ūseṟ.
Hopefully the above forum includes a fairly comprehensive subset of the phpBB features used in the old ZBB; if it’s missing anything, please let me know, and I would be happy to add a couple more posts which use those features. The backup file is named test-backup.sql.tar (because phpBB won’t let me attach SQL files; just remove the .tar at the end and it should be fine), and should be attached to this post.

Re: Posts from the old board

Posted: Fri Mar 06, 2020 7:57 pm
by bradrn
Using the backup file I posted, I think I’ve managed to figure out how the phpBB backup is structured. Here’s a description of the most important parts (again collapsed for brevity):
More: show
Users are stored in the phpbb_users table. The most relevant fields of this table are:
  • user_id is the id of the user (primary key)
  • user_type is the type of the user: 0 for normal, 1 for inactive (hasn’t activated account yet), 2 for ignore (e.g. bots, guests), 3 for Founder. user_type = 2 users can be safely disregarded.
  • group_id is the id of the group of the user. The list of groups is defined in the phpbb_groups table; the only interesting fields in that are group_id and group_name.
  • username is the name of the user, presented verbatim; username_clean is the username as well, but lowercased for ease of comparison.
  • user_password is the user’s password, but hashed. Not sure what the hash function is, but the same password seems to get hashed to multiple values, although I can’t find any salt (e.g. 123456789 gets hashed to $H$9hiNma.5YfOd7blUScu7i1gPjMKWNO. for normaluser but $H$9oML3WttexCs8D/NaiImpHSFr88pft/ for þe únicoðɛ ūseṟ).
  • user_posts is the number of posts the user has.
  • user_rank is the rank of the user. The ranks are listed in the phpbb_ranks table; this table contains the columns rank_id, rank_title, rank_min, rank_special, and rank_image.
  • user_avatar stores the avatar of the user: ‘URI for remote avatar, file directory and name for gallery avatar, combination of user id and time stamp for uploaded avatar.’ (from https://wiki.phpbb.com/Table.phpbb_users) The width and height of the avatar are stored in user_avatar_width and user_avatar_height respectively.
  • user_sig stores the signature of the user. user_sig_bbcode_uid and user_sig_bbcode_bitfield are other fields describing the BBCode of the signature; they are used to parse the BBCode correctly.
I assume we wouldn’t be too interested in extracting the location, email, website, birthday etc. of users, but you can find descriptions of all the other fields of phpbb_users at https://wiki.phpbb.com/Table.phpbb_users.

Forums and subforums are stored in the phpbb_forums table. The most relevant fields of this table are:
  • forum_id is the id of the forum (primary key)
  • parent_id is the id of the parent. Since the old ZBB only had one top-level forum, I think we’ll probably end up filtering out the single forum with parent_id = 0.
  • forum_name is the name of the forum.
  • forum_description is the description of the forum.
  • forum_topics counts the number of topics within the forum.
Topics are stored in the phpbb_topics table. The most relevant fields of this table are:
  • topic_id is the id of the topic (primary key)
  • forum_id is the id of the forum the topic is in, referencing the forum_id field of the phpbb_forums table.
  • topic_title is the title of the topic
  • topic_poster is the id of the topic poster, referencing the user_id field of the phpbb_users table.
  • topic_time is the time the topic was posted, as a Unix timestamp
  • topic_views is the number of views of the topic
  • topic_replies is the number of replies the topic got. topic_replies_real is the number of replies, but also including posts waiting for approval; I don’t think the old ZBB had any of those.
  • topic_first_post_id and topic_last_post_id are the ids of the first and last post in the topic respectively; they reference the post_id field of the phpbb_posts table.
  • topic_first_poster_name and topic_last_poster_name are the usernames of the first and last poster in the topic respectively
  • topic_last_post_subject is the subject of the last post in the topic
  • topic_last_post_time is the time of the last post in the topic, as a Unix timestamp
Posts are stored in the phpbb_posts table. The most relevant fields of this table are:
  • post_id is the id of the post (primary key)
  • topic_id and forum_id are the ids of the topic and forum respectively which the post is in
  • poster_id is the id of the user who created the post
  • post_time is the time of posting, as a Unix timestamp
  • enable_bbcode, enable_smilies, enable_magic_url, bbcode_bitfield and bbcode_uid are settings used for BBCode parsing (See https://wiki.phpbb.com/Parsing_text for a description of how to do that)
  • enable_sig stores whether the post has a signature
  • post_subject is the subject of the post
  • post_text is the text of the post, including edits (although BBCode tags here seem to be altered a bit; perhaps it’s to make parsing easier?)
  • post_attachment stores whether the post has an attachment. If so, the attachment is stored in the phpbb_attachments, the most useful fields of which are: attach_id, the primary key; post_msg_id, the ID of the post of the attachment; is_orphan, storing whether the attachment was uploaded but never used in a post; physical_filename, the name of the attachment; real_filename, the name of the attachment before it was uploaded; download_count, the number of times the attachment was downloaded; attach_comment, any comment on the attachment; and extension and mimetype, which are self-explanatory. I must admit though that I am a bit confused as to whether the attachment is actually stored in the database or not.
So, for instance, to display the text and username of all posts in topic 2 (say), you could run:

SELECT post_text, username FROM phpbb_posts INNER JOIN phpbb_users ON user_id = poster_id WHERE topic_id = 2;
__________

On the other hand, I was talking to someone I know who works in IT, and he suggested an alternative approach. As I understand it, the problem with restoring the backup of the old ZBB to this board was that the old version was too out of date (inferred from https://www.verduria.org/viewtopic.php?f=5&t=3); he suggested that zompist could download several versions of phpBB from 3.0.12 up to the current version (e.g. 3.0.12, 3.1.0, 3.1.6, 3.2.0, 3.2.8), and incrementally restore the backup to each one in turn. Then, once the database has been made compatible with the latest version, he could make the board read-only again and upload it; since the board would now be the latest version, this would prevent problems with PHP like the one which stopped the old board from working. Would there be any problems with this approach?

Re: Posts from the old board

Posted: Sat Mar 07, 2020 11:34 am
by Kuchigakatai
bradrn wrote: Fri Mar 06, 2020 7:57 pmOn the other hand, I was talking to someone I know who works in IT, and he suggested an alternative approach. As I understand it, the problem with restoring the backup of the old ZBB to this board was that the old version was too out of date (inferred from https://www.verduria.org/viewtopic.php?f=5&t=3);
I mentioned in this thread that the problem is that phpBB 3.0.12 relies on a "modifier" of a PHP function argument that was removed in PHP 7 due to security vulnerabilities (modifier 'e' of preg_replace). (Did you see no errors in your local install while using PHP 7? It'd be amusing if you didn't.) Then, in the post below that, I said that successfully updating phpBB should work, with a tip about mysqli configuration.
he suggested that zompist could download several versions of phpBB from 3.0.12 up to the current version (e.g. 3.0.12, 3.1.0, 3.1.6, 3.2.0, 3.2.8), and incrementally restore the backup to each one in turn. Then, once the database has been made compatible with the latest version, he could make the board read-only again and upload it; since the board would now be the latest version, this would prevent problems with PHP like the one which stopped the old board from working. Would there be any problems with this approach?
I see none, but I don't know how difficult and/or error-prone updating phpBB forums to new versions is.

Thanks for your work so far on the database! I think that we could and should make the limited archive anyway even if incatena.org could be restored to a functional state.