wowdev / WoWDBDefs

Client database definitions for World of Warcraft

Format discussion

Marlamin opened this issue · comments

Discuss! Current proposal can be found in the README.

Something that still needs working out is how the version for each structure will be specified.

I'm against storing definitions for the same file in different directories (AKA 7.3.5.25807/Map.dbd), which would cause a lot of duplicated data and make column names much harder to maintain across versions. A solution to that would be to only have a single file per DBC with multiple definitions inside.

@MMOSimca suggested something like this for a single-file format in the past: LAYOUT 01234567, 89ABCDEF; which would go above the first column definition for that build.

This works by referring to the layout_hash in the DBC file (WDB5 and up), which saves us from redefining structures when there have only been minor/not noteworthy changes (hence multiple layout_hash support).

Pre-WDB5 files would just have a BUILD 1234; header instead, where 1234 would be the first build that structure was seen in. If you are loading DBCs for build 1245 and the structure is still the same as in 1234, the 1234 definition would be used. If the structure in 1245 has changed, a new definition for that build should be added, which future builds with the same structure can then fall back on.
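The fallback rule described above can be sketched in a few lines of Python (function and variable names are my own, purely illustrative): pick the definition whose BUILD header is the greatest one not exceeding the build being loaded.

```python
def pick_definition(definitions, build):
    """Return the definition whose BUILD header is the highest one <= build.

    `definitions` maps a BUILD header (int) to a structure definition;
    returns None when no known definition is old enough to apply.
    """
    candidates = [b for b in definitions if b <= build]
    if not candidates:
        return None
    return definitions[max(candidates)]
```

So loading build 1245 falls back to the 1234 definition unless a 1245 entry was added.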

Note that the above implementation would have an issue during periods where two branches of the game (such as 8.0 Beta and 7.3.x PTR) could have similar build numbers, causing the wrong definitions to be loaded for that build. This could be solved by something @bloerwald suggested (correct me if wrong) at one point, which is to have a comma-separated list of builds (like LAYOUTs) for which the structure was valid, but this list would get very long. This solution would only have issues in cases where two separate game branches have the same build number, which has happened in the past. In that case, we could always assume the build number is for the lowest-level client (Beta > PTR > Live).

Thoughts?

commented
BUILD 1000;
<uint32 a>
string[4] b
string c
string d

Let's say we have this as a base and imagine string c gets dropped in build 1001. That could result in the following, where we still have redundant data, but it's human-readable. Basically the same as splitting into directories, just in one file.

BUILD 1000;
<uint32 a>
string[4] b
string c
string d

BUILD 1001;
<uint32 a>
string[4] b
# string c got removed
string d

Another approach would be to have builds/layouthashes defined per column, but that will get way too confusing for human reading. Imagine:

<uint32 a> {BUILD 1000, 1001}
string[4] b {BUILD 1000, 1001}
string[6] b {BUILD 1002} # array resize
int32 e {BUILD 1002} # added
string c {BUILD 1000}
string d {BUILD 1000}

For that PTR/Live issue: we mostly have layouthashes since this stuff happened, afaik, so we should be fine using layouthashes for this. A probably better approach would be to allow listing ranges and comma-separated builds/hashes, e.g. BUILD 1000-1100, 1400, 1404-1500. Gonna be a bit funny to parse though.
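Parsing that range syntax is less funny than it sounds; here is a sketch in Python (helper names are hypothetical) that accepts comma-separated single builds and ranges:

```python
def parse_build_ranges(spec):
    """Parse a 'BUILD 1000-1100, 1400, 1404-1500' style list into (lo, hi) pairs."""
    ranges = []
    for part in spec.split(','):
        part = part.strip()
        if '-' in part:
            lo, hi = part.split('-', 1)
            ranges.append((int(lo), int(hi)))
        else:
            ranges.append((int(part), int(part)))  # single build: degenerate range
    return ranges

def build_matches(ranges, build):
    """True if `build` falls inside any of the parsed ranges."""
    return any(lo <= build <= hi for lo, hi in ranges)
```

Membership checks then become trivial: 1450 matches the example spec, 1402 does not.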

I think the middle one with duplicated data and no build/layout hash for every line is the best. Using # for comments sounds good to me.

commented

By the way - not sure if this happens anywhere, but just in case. When we have an array of foreign keys I'd suggest to format it this way:
uint16<AreaTable>[5] AreaTableID
so that we have this syntax in general for each field:
DataType<referenced Table>[ArraySize] FieldName
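That general `DataType<referenced Table>[ArraySize] FieldName` shape is easy to parse mechanically; a sketch in Python (the regex and function names are my own, not part of the proposal):

```python
import re

# base type, optional <ReferencedTable>, optional [ArraySize], then the field name
FIELD_RE = re.compile(
    r'^(?P<type>[a-z]+\d*)'
    r'(?:<(?P<ref>\w+)>)?'
    r'(?:\[(?P<size>\d+)\])?'
    r'\s+(?P<name>\w+)$'
)

def parse_field(line):
    """Split a field line into (type, referenced table or None, array size, name)."""
    m = FIELD_RE.match(line.strip())
    if not m:
        raise ValueError(f'unparseable field line: {line!r}')
    size = int(m['size']) if m['size'] else 1
    return m['type'], m['ref'], size, m['name']
```

Non-array fields simply get a size of 1.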

Sounds good!

Here's a bit of conversation regarding layout hashes and some adaptations to the proposal to deal with array resizes from #modcraft on QuakeNet.

17:32 <Simca>  everything except array resize
17:32 <Alram>  How often does that happen
17:33 <Simca>  there have 2 instances in the history of layouthashes that i'm aware of
17:33 <Simca>  have been*
17:33 <Alram>  How does one account for that
17:33 <Alram>  Go by build?
17:34 <Alram>  Or just yolo and fuck that version
17:35 <Simca>  i proposed a solution: LAYOUT hash1-recordsize1
17:36 <Simca>  that would have solved both instances
17:36 <Alram>  But is it worth the added complexity
17:36 <Simca>  in THEORY, it's not enough.
17:36 <Simca>  because if they resized two arrays of the same size at the same time
17:37 <Simca>  or if the resized array was so small that padding compensated for the change that would have occurred in record size
17:37 <Simca>  then even that solution would fail
17:37 <Simca>  but in practice record size would be far more than ever needed. the only case where a resize occurs (both cases so far) have been on Flags fields
17:39 <@schlumpf>  Can we kind of do the inverse and go from theoretical sql format to dbc format? Like, automate the column shrinking?
17:41 <Simca>  not really. we could imitate it poorly, but we'd be left with gaps and problems - we just don't have enough information, especially about newer fields, after they did all the sorting
17:42 <Simca>  the advantage of the format you purposed that has layout and build always side by side in the header is that it would solve array resizing
17:44 <Simca>  anyway, two different configs could then share a layouthash but they would never share builds at the same time
17:44 <Simca>  that assures we'd account for all cases
17:44 <Simca>  if a person requests a layouthash without providing a build, we just give them the version of the hash for the latest build
17:47 <Alram>  Hopefully this isn't too much complexity
17:47 <Alram>  Otherwise nobody will implement it and stick with their own definitions
17:47 <Simca>  no, it's actually perfectly simple. layouts, builds, config
17:48 <Alram>  Can help write converters/implementations in worst case

Other points (beyond those mentioned in the above comment) to discuss/agree upon:

  • Unknown column naming (Unk0, Unk1 etc or just empty and let parsers handle rest)
  • Additional specialized types other than locstring, do we handle these? (Vector3 being float[3], flags)
  • <> for inline types is kinda weird. Do parsers need to know about them? Can we comment them?
  • Do we have comments? Do they get put on new lines or not? Are they parsed (wiki?) or human only?
  • Do we mention map name in file? @miceiken argues filenames are metadata and not contents
  • (via @bloerwald) How do we make sure column names are the same across definitions and won't diverge?

Updated the format in README after some feedback from IRC. The addition of the locstring type is the biggest change. Most parsers already handle this in some way (some call it loc, others langstringref etc) so it shouldn't be a big bother to implement.

Other stuff in above comment is still up for discussion.

commented

Just talked to @Marlamin on Discord; we noticed that we definitely need ranges for builds.
Imagine:

BUILD 21846, 21863, 21874, 21911, 21916, 21935, 21952, 21953, 21963, 21973, 21989, 21992, 21996, 22017, 22018, 22019, 22045, 22053, 22077, 22083, 22101, 22124, 22133, 22143, 22158, 22171, 22201, 22210, 22231, 22248, 22260, 22280, 22289, 22293, 22306, 22324

Idea is to split ranges/single builds by line, e.g.

BUILD a-b
BUILD c-d
BUILD e
BUILD f-g
BUILD h-i
BUILD j-k
BUILD l
BUILD m-n
BUILD o-p
<attributes here>

Another approach is comma-separated, but that is a bit harder to read as a human:

BUILD a-b, c-d, e, f-g, h-i, j-k, l, m-n, o-p
<attributes here>

Some feedback from my side:

  • Instead of <uint32 id> for keys, I'd suggest doing key<uint32> id
  • Instead of uint32<ForeignDbName> fieldName do foreign<uint32, ForeignDbName, ForeignFieldName> fieldName

I'm not sure about being against a directory per build/version/layout, since those actually allow doing symlinks nicely. Also, diff v1/map v2/map. Yet I agree it isn't the best UX, but there are reasons for it. I want to throw in a third option: abuse git. One commit per build. Branches per major. git rebase -i && git push -f if you actually do a change. This would force you to resolve conflicts, meaning that if you add a name, you have to ensure it is matching all versions after that. It would not force you to verify to the earliest version, but at least for the past. → One file, blocks for now.

Layout hash imho is a bad idea. While it seems quite fine, it isn't a hash of the actual file content, as shown by it changing even though the file does not, and inversely by the contents changing (array length) while the hash does not. I don't like it, but see the appeal. We can use it to deduplicate structs during annotation though. → I wouldn't rely on it as first hand id.

Build IDs are not enough, we need versions due to branches. I would not do ranges implicitly but only verified versions. There are cases where we know that it has literally never changed structure ever since Vanilla. Wiki currently says ranges, but explicitly mentions verified ones (e.g. 0.5.3.3368-1.12.1.5875-3.0.2.8905-3.3.5.12340-6.0.1.18179 for https://wowdev.wiki/DB/CinematicSequences) → I want a distinction between "assumed matching" and "verified matching" versions. Versions are major.minor.patch.build, as Blizzard does.
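Full major.minor.patch.build versions compare naturally as tuples; a small sketch (Python, names hypothetical) of how a parser could order them:

```python
def parse_version(s):
    """Parse a 'major.minor.patch.build' version string into a comparable tuple."""
    parts = s.split('.')
    if len(parts) != 4:
        raise ValueError(f'expected major.minor.patch.build, got {s!r}')
    return tuple(int(p) for p in parts)
```

Tuple comparison also sidesteps the branch problem: even if two branches were to share a raw build number, their full versions differ in the major/minor/patch components.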

What's missing so far in the proposed formats is descriptions, comments. These are the first things that get out of sync between versions and the biggest reason for a uint32_t m_ID; <build 12340, 20505>. It contrasts with reorders, and the late-time bitpacking shit changing integer width. I do not have a solution for this. Maybe a hybrid approach where the possible columns are defined on top and then referenced? Some examples of a possible implementation follow:

  • CinematicSequences.dbd
COLUMNS
uint$key                               m_ID
uint$foreign_key$SoundEntries$m_ID     m_soundID
uint$foreign_key$CinematicCamera$m_ID  m_camera   While there is an array, only one is ever used in live data. If multiple are given, they are played in sequence, one after the other.

BUILD 0.5.3.3368,1.12.1.5875,3.0.2.8905,3.3.5.12340,6.0.1.18179
m_ID<32>
m_soundID<32>
m_camera<32>[8]

BUILD 7.0.1.probably
m_ID<32>
m_soundID<16>
m_camera<8>[8]
  • Cfg_Categories.dbd
COLUMNS
uint$key                m_ID
uint                    region<unverified>
locstring               m_name_lang
bitfield$LocaleMask     m_localeMask
bitfield$CharsetMask    m_create_charsetMask
bitfield$CharsetMask    m_existing_charsetMask
bitfield$F              m_flags

ENUM<F>
1 needs tournaments enabled on account

ENUM<CharsetMask> binary
00000 development
00001 eu_or_us
00100 russia
01010 korea
10001 taiwan_or_china        note how this is bogus since it overlaps with eu_or_us, these are obviously something else

BUILD 1.12.1.5875
m_ID<32>
region<32>
m_name_lang

BUILD 6.0.1.18179
m_ID<32>
m_localeMask<32>
m_create_charsetMask<32>
m_existing_charsetMask<32>
m_flags<32>
m_name_lang
  • LocaleMask.enum
ENUM<LocaleMask> binary
000000000001 enUS    also enGB
000000000010 koKR
000000000100 frFR
000000001000 deDE
000000010000 enCN    also zhCN
000000100000 enTW    also zhTW
000001000000 esES
000010000000 esMX
000100000000 ruRU
010000000000 ptPT    also ptBR
100000000000 itIT

→ I really want a hybrid approach to minimise duplication.

Also in the suggested code above, more types. I don't think we need C3Vector as intrinsic type, but I wouldn't be against C3Vector.type describing it to be reused. → Bitfields with typed values for example. Also, Foreign keys need to reference what they reference in the other table.

While I prefer unknown columns to just not be named, I guess we need to give them bogus names to track them over versions. → I suggest a marker for verified names, or inverse.

I have no opinion on how to mark out-of-line columns. Is that maybe just as much of a layout thing as in-row strings?

I don't understand what "mentioning map name" refers to.

I don't understand what "mentioning map name" refers to.

Typo, my bad. Should be "mentioning dbc name".

@justMaku I would prefer to leave the type ('uint32') at the front of each line if possible. For parsers, that bit is by far the most important thing, and moving it around (behind key< or foreign<) makes that more complex to grab.

@bloerwald Let's not waste space and everybody's time by prefixing every variable with 'm_' in the actual format. If you want to do that in the script for the wiki, that's fine. It is your wiki. But in the actual dbd files, let's not waste space to adhere to a C standard that even people who use C generally think is stupid. This is not C.

As for the rest of it, I'd personally prefer to leave enum definitions and bitfield definitions out of it. It's a valid point to make and a valid argument to have though. To me this is more about 'what is the structure of rows', whereas enums, bitfield definitions, and to a lesser extent foreign key references veer off into separate territory.

The problem with relying exclusively on BuildIDs is that later versions do not have that information. If you give me a db2 and a config that only lists BuildIDs, I have nothing. That means all parsers from now on will require user input to read files (build ID). Currently, they don't. This is a pretty massive issue. At the very least, for this format to be -useful- for parsers, it must include a list of layouthashes with each entry.

I would rather not introduce more types. Bitfield and enum may be acceptable, depending on how they're handled, but definitely not things like C3Vector. That could be a comment or description, but adding in additional types as a mandate to the format complicates things further.

We want the format to be as simple as possible. It's important that it convey a lot of information, but I would rather it conveyed less if it adding more wildly increased complexity.

I would prefer to leave the type ('uint32') at the front of each line if possible. For parsers, that bit is by far the most important thing, and moving it around (behind key< or foreign<) makes that more complex to grab.

@MMOSimca isn't key or foreign_key a type though and should be implemented as such by parser? Having it before the underlying type allows the parser to know about it earlier.

First and foremost we should be careful to not make the format too complex. People have to implement this (including relatively dumb people like me) for it to actually get traction and solve the issues we're trying to solve (different docs spread everywhere across the internet).

On that note, I haven't been able to reach @barncastle yet who will be pretty vital in getting (public) adoption going. I have also not received any feedback yet from @tomrus88 and @Warpten who also have DBx implementations that deal with definitions in some way or form.

@MMOSimca isn't key or foreign_key a type though and should be implemented as such by parser? Having it before the underlying type allows the parser to know about it earlier.

I don't think there's a difference between something being a foreign key and it not being a foreign key before WDC1. There's a difference now though but I'm unsure how that affects parsing.

I wouldn't rely on it as first hand id.

I think I'm with @MMOSimca on this; parsers generally don't know about builds when looking at files, so layouthash should always be listed if available. I am for listing more specific builds as well, though. This should solve any branch issues we're having. Schlumpf also mentioned (in private) that it might be an interesting idea to list tablehash as well.

If we wish to list Tablehash in the file, it should just be the first line of the file.

TABLEHASH XXXXXXXX

For cases where the file existed and was removed before db2s were introduced (or before the file was converted to a db2), we would call it TABLEHASH 00000000.

Do we know how TABLEHASH is generated? Could just generate correct hashes for older files.

Unfortunately, no. Given the fact that we're dealing with Blizzard here, it could be literally anything. Somebody could have taken the MD5 of a jpeg of a cat and added 12345678 to it, and every db2 was just based on a different cat picture.

The popular theory I'd heard was that it was probably just the name of the table, hashed. The problem then happened when 'item-sparse' was renamed to 'ItemSparse' and its TableHash did not change.

Unfortunately, no. Given the fact that we're dealing with Blizzard here, it could be literally anything. Somebody could have taken the MD5 of a jpeg of a cat and added 12345678 to it, and every db2 was just based on a different cat picture.

Seems like a solid plan, all in favour, vote with your favourite cat gif.

While keys don’t have to be in a type system, I advocate for a type system as strong as possible, giving as much information as possible. Having the information on what the key and foreign key columns are helps with every single version of the file format.

I would personally avoid a magic value for unknown table hashes but just not have it (optional<uint32_t>)

I wouldn’t mention the dbc name in the file. Then again, mentioning the table hash is just the same. It is probably fine having either.

The m_ prefix and _lang suffix comes from blizzard, not c or my drunk brain. I generally prefer having stuff as close as possible to their stuff, so I decided to keep it on wiki.

Build/version, filename, tablehash and layouthash, column count, all have one common property: they can merely act as a heuristic for the parsers. A viewer can try to infer a definition to use based on them, but in the end none of those are guaranteed to result in the right definition unless the user picks it manually. I’m all in for having all of the information available, so add tablehash and layouthash if available. I just wanted to state that I don’t think that one of them alone should be the primary identifier of a definition.

I agree on not adding complex types like c3vector, yet bitfields and enums have huge Information other than bitcount that integers would have. Feel free to use them as an alias for integers in your implementation, but I strongly suggest to not throw away that information, seeing that we have it.

I want to throw in a third option: abuse git. One commit per build. Branches per major. git rebase -i && git push -f if you actually do a change. This would force you to resolve conflicts, meaning that if you add a name, you have to ensure it is matching all versions after that. It would not force you to verify to the earliest version, but at least for the past.

This would make collaboration super complicated in case we want to go back: for example adding names to previously unknown columns. Rewriting history is never a good idea in a distributed system like git.

Because it popped up in private discussion: enum thing needs versioning too, which my current approach doesn't cover.

Private discussions. Yuck

From IRC:

  • We're seeing about adopting @bloerwald's above mentioned format (with some changes), with more examples soon.
  • We're doing build ranges, already shipped to sample definition. Build range should be per x.x.x patch (new BUILD line for new range) and should never go into the future or span too many builds in the present.
  • Verified builds will be added through a VERIFIED_BUILD (or something similar). Same structure as BUILD. Verified definitions are verified to be correct for that build, while the BUILD header is a more generic indication of support (likely correct, but not guaranteed).
  • Additional people that work on DBC related things have been poked. Adoption will be key and as such we are still looking for more feedback.

There are now some WIP sample definitions by @bloerwald in a separate folder. It's looking like a great start but we might have to make the column stuff at the beginning a bit more readable so we can maintain it a bit easier in the future. The changes proposed by his format cover deduplication, comments and make sure that column names are enforced to be the same. Feedback on those samples is welcome!

ENUM<WeaponFlags>
0 untouched
4 sheathe after animation
16 sheate after aniation
32 pull

This doesn't really look that human-readable to me. In my opinion, anyone reading this manually would probably be more accustomed to the enum style found in many programming languages:

enum WeaponFlags {
   untouched = 0
   sheathe after animation = 4
   pull = 32
}
uint$foreign_key$CinematicCamera$m_ID 

Same issue here, the format is very hard to read as a human. Most important information (to me as a reader) is at the very end of the line. I believe it would look much better as:

id key<uint32> // Name, Type, Backing Type
name string // Name, Type
map foreign<Map, uint32> // Name, Type, Location, Backing Type

This way the bits of information are ordered in the magnitude of relevance from left to right. This matters for us, mere human beings, and computers don't care.

enums

I like how you missed 16. The remainder is just format though. The two are equivalent. I still claim that parsers are irrelevant as long as they are the same complexity. Both, “enum“ Name\n(value name\n)+ and “enum“ Name „{„ (name „=„ value „,“?)+ „};“ are the same complexity and content and thus equivalent. Debating this is not relevant as of now; what is, is whether we want to have it at all and if/how to do versions.

I like how you missed 16.

Couldn't be arsed to copy-paste typos.

The remainder is just format though.

Yeah, this is format feedback, as asked by @Marlamin

I still claim that parsers are irrelevant as long as they are the same complexity. Both, “enum“ Name\n(value name\n)+ and “enum“ Name „{„ (name „=„ value „,“?)+ „};“ are the same complexity and content and thus equivalent.

I wholeheartedly agree and that's why I believe we should always be thinking about making it as human readable as possible in the first place (unless we decide that human-readability is not a major feature anymore).

  • not a typo
  • while I agree we should have something as human-readable as possible in the end, it doesn’t make a difference during discussion whether I write uint$key, key<uint>, key uint, or ’tis an unsigned integer being a key. They are the same complexity and information. That’s what should matter.

Just as we don’t give a fuck if it is version or build. It just doesn’t matter. The question is what information is there and which structure it has, how it is referenced.

Completely disregard the mess WDC1 introduces; it should be up to parsing implementations to deal with it. For bitpacked fields:

a) specify bit width in a comment somehow
b) uintN, floatN (I don't think floatN is a thing yet, but fuck)
c) don't include it. It's in the file anyway.

IMHO it is important to not cram the format too much. Keeping types, names, and FKs is plenty enough.


Ref human readability, just write a parser for the human readable stuff that generates it better suited for machines/code. Make everyone happy. We write stuff we can read and process it for a machine.


pod$fk_type is disgusting. I think spaces everywhere are simpler, with a defined line format that every single line respects. Not an array? I don't care, declare a size of 1.


Regarding versioning. A solution at least for machines is order maps. Fields are indexed 0-n and after they are declared, we have something like

(build0, build1, ...) = { 0, 2, 5, 1, 3, ...} // shared structure
(build2) = { you get it }
(build3) = (build0) // reference other structure

Sure, there is duplication when we are facing a dumb case where two columns exchange places between builds, but I think duplication is not a bad thing as it makes data more readable, and the overall size of a definition file is something no one gives a flying fuck about. This is IMHO not even too bad for a human format. Just make indexing explicit in the file; no one wants to count fields in ItemSparse.
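The order-map idea can be sketched like this (Python; the data shapes here are illustrative assumptions, not part of the proposal): fields live in one canonical column list, and each build maps to either a list of indices into it or a reference to another build's map.

```python
def resolve_layout(build, order_maps, columns):
    """Return the column layout for `build`.

    `order_maps` maps a build to either a list of indices into the canonical
    `columns` list, or to another build (the '(build3) = (build0)' case).
    """
    entry = order_maps[build]
    while not isinstance(entry, list):  # follow references to other builds
        entry = order_maps[entry]
    return [columns[i] for i in entry]
```

Dropped columns simply don't appear in a build's index list, and reorders are just permutations.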

I.E.:

0 1 int32 Id;
1 3 float; // unk name float[3]
2 1 int map id; # Map Id // foreign key to Map.Id

Delimiter for shit md parser

^(N:[0-9]+) +(cardinality:[0-9]+) +(type:[a-z]+)(optionalBitSize:[0-9]+) +(humanName:[a-z0-9_-]+); +(?:\# (fk_ref:.+))?$
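For what it's worth, that pattern translates into Python named groups roughly as below. I've made the bit size actually optional and widened the name class to both letter cases, since the example lines above otherwise wouldn't match their own regex; treat this as a sketch, not the spec:

```python
import re

LINE_RE = re.compile(
    r'^(?P<n>[0-9]+) +'             # explicit field index
    r'(?P<cardinality>[0-9]+) +'    # array size (1 for scalars)
    r'(?P<type>[a-z]+)'             # base type
    r'(?P<bits>[0-9]+)? +'          # optional bit size glued to the type
    r'(?P<name>[A-Za-z0-9_-]+); *'  # human-readable name
    r'(?:# (?P<fk_ref>.+))?$'       # optional foreign-key comment
)
```

So 'int32' splits into type 'int' plus bits '32', and the trailing comment, if present, carries the foreign-key reference.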

To fix the issue with build/tablehash, just map those as well. Tedious, but at least that can be automated.

Enums can be crammed in there too I'm sure, probably smth like int32!wep_flags. Looks more readable than $ to me. Humans are used to numbers around dollar sign.

Also for enums definition provide both bitset value and shift value.

0 0001 a
1 0010 b
2 0100 c
3 1000 d

Looks like a table and humans are better at reading tables than text.

c) don't include it. It's in the file anyway.

This is the correct way to handle WDC1 bitpacked fields. Bitpacked fields are an implementation detail. What we want is table layout the way the game holds it in memory, basically, not the way it is in the files (yes, I understand the irony of this considering that the entire point of this is for a guideline on how to read the files, but I stand by my comment and can explain further if required). That's part of the reason my original format did not care about localization - the game doesn't either. (And I never cared about pre-WoD files.)

I'm also in the anti-$ crowd, for whatever that's worth. It makes me think of PHP and I have PHP-PTSD.

FWIW, I've already changed the $ stuff in the sample definition file for Map.

Format, spaces, $, …: this doesn’t change the basics and can thus be interchanged freely; it only concerns the final spec.

Regarding “it is in the file anyway”: we just need to make sure it actually is possible to parse all versions of the file.

Regarding localization, while you can parse without that, you can also parse without column names. The entire thing is done to give additional information. Localized columns to me are a relevant semantic.

Also idk if it's been settled but I'm fine with locstring being a pseudo pod type

Just throwing in my two cents - ignore my ignorance if I'm off the mark!

I agree with the current structure however I have a couple of queries/remarks about the Columns block.

  1. I appreciate that some of the comments will be helpful when developing but they're not vital especially since the whole point of this is to be a generic definition for parsers. Would it not be better to store that information separately on the wiki (which also promotes wiki contributions) and just link the article in the meta? This also avoids formatting issues/additional space usage and fights over if a comment is too long/too short/valid/relevant etc.
  2. Do the column types default to the largest size? E.g. in Alpha if it was an int but in WoD its now a byte and treated as such (say max value check), it gets set as an int but has the bitwidth next to it?

I mostly agree with the points made in 1, should we drop comments?

As for 2, I'm unsure. Maybe @bloerwald has kept that in mind somehow?

I proposed comments to be part of this description since it is also something heavily duplicated on the wiki currently. I see that they are not vital for pure parsers. Just ignore them there. For the wiki, it would be hugely useful since it would take care of the issue of having different versions of the comments per version, as it is currently.

In my format suggestion, column types do not default. A version definition specifies the bits for the column if it is a dynamic-bit-possible (i.e. int) column. One can add the (sane) default to use int32 if nothing is specified, i.e. let

BUILD 0.5.3.3368,1.12.1.5875,3.0.2.8905,3.3.5.12340,6.0.1.18179
m_ID
m_soundID
m_camera[8]

be equivalent to

BUILD 0.5.3.3368,1.12.1.5875,3.0.2.8905,3.3.5.12340,6.0.1.18179
m_ID<32>
m_soundID<32>
m_camera<32>[8]

if that's really worth simplifying.
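The default would be trivial to implement on the parser side; a sketch (Python, names hypothetical) that treats an omitted bit width as 32:

```python
import re

# name, optional <bits>, optional [arity], e.g. 'm_camera<32>[8]'
ENTRY_RE = re.compile(r'^(?P<name>\w+)(?:<(?P<bits>\d+)>)?(?:\[(?P<arity>\d+)\])?$')

def parse_entry(line, default_bits=32):
    """Parse a version-block entry; omitted bit width falls back to default_bits."""
    m = ENTRY_RE.match(line.strip())
    if not m:
        raise ValueError(f'unparseable entry: {line!r}')
    bits = int(m['bits']) if m['bits'] else default_bits
    arity = int(m['arity']) if m['arity'] else 1
    return m['name'], bits, arity
```

With that, the short and long forms of the CinematicSequences example parse identically.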

Tablehashes for pre-tablehash dbs (via furl(?), simca): https://repl.it/repls/PlasticLankyTrogon / https://github.com/Blizzard/heroprotocol/blob/master/mpyq/mpyq.py _hash, with some rules like stripping characters and all upper I don’t remember

So I think we're pretty close to locking in on something we can actually start working with.

Can we now start working on the following?

  • Validator/reference parsers (I'll see if I can get some awful C# going)
  • Generators generating dbdefs from currently used definition formats (just for starting with/working off)

Also, we need to agree on how the files look (which symbols are used for what etc). I know it keeps being dismissed as "just being format" but it'd be nice to get this down.

To bounce back on @barncastle's point 2: I personally treat integers as either 8, 16, or 32 (or N bits more recently) bit values internally and just cast up when loading into the structure (at least for WDB5, as you could accurately guess the size of every field but the last one if there were mismatched element sizes). Doing that for virtually every integer-y field shouldn't be too difficult. But that's a debate point, since it also makes more sense to just assume everything is an int, and cast down when serializing... It also kind of sounds like an implementation detail, which has the nice side effect of dumbing down the process of creating a structure for the user: "when in doubt, int"

I've updated the sample code, sample definition and format proposal with the changes that I think were agreed upon. Stuff in the above few comments (default int size/comments) is still left open.

The first set of files based on 6.0.1 DB structures has been generated which closes the discussion for the initial format spec. We're not going to do default int sizes as of right now, but this can still be discussed as it's a pretty minor change. Comments are still in but should be kept somewhat small. Things that require more explanation should go on wiki and on wiki alone.

Currently in the process of adding proper foreign keys to things after which we'll start adding more versions. I'll also extend the validator at one point to check whether or not foreign keys go to valid DBs/columns.

Thanks for contributing up to this point, everyone! Next up: multiple definitions spanning multiple versions! Map is still the only one currently doing so.

I'll leave this open for a while in case anyone still has comments they want to make.

What is 'default int sizes'?

Do you mean you aren't doing the SoundID<8> thing at all or that you skip it in cases where it is 32?

The latter is fine, but the former is only fine if you want most of Warlords to be totally unparseable.

@MMOSimca Assuming 32 if no int size is specified. Not implemented in current spec.


Some things encountered/thought of when working on initial formats:

  • Build lists for DBs that have never changed since initial structure will get REALLY long if we stick to the "one range per minor version" rule depending on how long they've been around.
  • Foreign keys to FileData will have to have special handling in versions where it's no longer present but still a foreign key but now into root. @bloerwald mentioned the possibility of making a special type for these. Will need discussion.
  • How will WDC1 relationship data be handled? Will we add fake fields like wiki suggests? Special type seeing (right now) it is always uint32?
  • Current definitions are based on Blizzard column names from 6.0.1. It is of course preferable to stick with these, but in some cases I have changed stuff like uint id to uint ID so foreign keys work the same everywhere. Acceptable?

Edit: Used bullet points.

  • Yes, they will.

  • As for FKs into FileData, I think it's fine. Just continue to pretend that FileData is a real client-shipped table even after they stopped shipping it. Changing it to root is pointless. root is an implementation detail, too.

  • It's not even a fake field really, but yeah. Keep it uint32 for simplicity as well. Here's a quick example of how the game sees it (we'll be doing it the 'fields' way):
    fields:
    int32 creatureDisplayInfoID: offset=0 flags=0
    float probability: offset=4 flags=0
    float scale: offset=8 flags=0
    int32 creatureID: offset=12 flags=0

    fields in file:
    int32 creatureDisplayInfoID: flags=0
    float probability: flags=0
    float scale: flags=0

  • Yes. @bloerwald may hate the idea but the reality is we will need to change some of the official names slightly. The problem is that not even Blizzard is consistent with their names. As a recent example, one of the Creature db2s calls a FileDataID field 'FileID' instead of 'FileDataID'. Also, personally, I dislike the 'lowerCaseNamingScheme' and prefer 'ItToBeNamedLikeThis'. It's not a huge deal to me either way because I can just make my tool force capitalize the first letter. Still, we should decide on a method to be consistent. Blizzard is not consistent, by the way - they do both styles of capitalization, often even within the same DB.
    Also, going forward Blizzard always capitalizes 'ID'. I wish they'd do the same for 'UI' because I'm tired of seeing 'UiTextureKitID'. UI IS AN ACRONYM, NOT A WORD.

Fine with WDC1 stuff being just a field and leaving filedata FKs as is.

A standard for capitals, especially for when we deviate from Blizz names or have to come up with names ourselves is definitely something I'd want to do.

As a summary, here's a short list of field-name-related issues we have to decide on:

  1. What style of naming do we use?
    Example: DeathThudLookups has fields named like 'SoundEntryID' (uppercase first letter, we'll call this Style A).
    CurrencyTypes has fields named like 'maxEarnablePerWeek' (lowercase first letter, we'll call this Style B).
    Criteria has fields named like 'eligibility_world_state_ID' (all lowercase besides 'ID' with underscores between words, we'll call this Style C).
    We need to pick one, then match all names (created by us or taken from Blizzard) to that style. Most of the names that Blizzard invented use Style B, so I suspect that is what most people will go towards.

  2. Do we correct Blizzard shorthand in names?
    Example: DungeonEncounter has 'CreatureDisplayID' and 'spellIconFileID'. Those are shorthand for 'CreatureDisplayInfoID' and 'SpellIconFileDataID' respectively.
    This is ultimately not as important to fix as one would think, since we have an FK section where we state these things explicitly. We may want to prefer brevity in the names and then leave the detail for the FK information.

  3. Should we expand the list of protected capitalization words beyond 'ID'?
    For example, files use names like 'state1Wmo' and 'UiTextureKitID'. 'WMO' and 'UI' are acronyms, similar to 'ID'. However, Blizzard only treats 'ID' as something that should be always properly capitalized and ignores the other two (there may be others I'm not thinking of right now as well).

  1. I personally prefer style A, then B.
  2. Prefer fixing these when we come across them.
  3. Yes. WMO, M2, ID, UI etc should definitely be capitals.
  1. B > A > C, even though I prefer C irl. Note how some DBs like Map have A and B mixed.
  2. I always prefer staying as close as possible to theirs. There is no need to insert that Info since we already have it via the type being an FK.
  3. I'm fine with fixed upper-casing, but don't care too much.
  1. A > B
  2. Agree with @bloerwald , we should be as close as possible to theirs
  3. We should make a list of some "keywords" which should always be uppercased, yes.

Let's do a quick vote on 1 so we can decide on this: https://www.strawpoll.me/15161342

As for 2 and 3 I think we've come to a conclusion on those; don't bother fixing and we need a list of things that should always be uppercase.

Do we want some annotation for primary key columns that are not the first column (and may not be named id)?

This is the case for e.g. ObjectEffectStateName::Value, TextureFileData::fileDataID and ModelFileData::fileDataID.

Just had an issue where ItemRandomProperties.dbd has a conflict between two columns both named Name, because we filtered out m_ and _lang while dumping.

I've manually added _lang to that file for now, but we might have to consider bringing it back globally to make sure this can't happen again.

EDIT: Simca thinks we shouldn't have gotten rid of _lang in the first place, so we're bringing it back.

We've come across a case (the only one we've seen) where a type goes from uint to float between two versions. This throws a wrench in our whole "stick types on top" shtick (heh) we have going.

The solution we came up with (as this seems to be extremely rare) is to allow version definitions to override column types mentioned at the top of the file, like so:

COLUMNS
int ID
uint coolPercentageField

BUILD 1.2.3.1234
ID<32>
coolPercentageField<32>

BUILD 1.2.3.1235
ID<32>
#float#coolPercentageField

It is another thing to support when parsing, but the alternative was to get rid of types on top and move them all to the bottom by default. That sucks and allows for more errors in duplication to slip in when there are no actual differences.
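For illustration, a version-definition line with the proposed `#type#` override could be parsed with something like the sketch below. The function name and return shape are made up for this example; only the `#float#coolPercentageField` syntax comes from the proposal above.

```python
import re

def parse_version_column(line):
    """Parse a version-definition column line, honoring an optional
    '#type#' prefix that overrides the type declared in COLUMNS.

    Hypothetical parser sketch; returns (name, type_override, bit_size),
    where type_override/bit_size are None when absent.
    """
    match = re.fullmatch(r"(?:#(\w+)#)?(\w+)(?:<(\d+)>)?", line.strip())
    if not match:
        raise ValueError(f"unparseable line: {line}")
    type_override, name, size = match.groups()
    return name, type_override, int(size) if size else None

# Usage with the two example lines from above.
print(parse_version_column("coolPercentageField<32>"))
print(parse_version_column("#float#coolPercentageField"))
```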

Above post has been (sort of) documented in README (will need to make a better doc when we really have it set in stone). Anyone have any comments regarding changes to the format description in README that were done today? Now is your time to voice any last opinions/concerns!

relation this column is stored in the relationship table. Example: $relation$ColName

This is superfluous implementation information that I would like to see removed, ideally. If somebody really wants to note those columns, they can make it a comment, but let's not make it part of the spec.

In WDC1, you don't need the $relation$ tag to properly parse or utilize the format. More on that below. For DBCache.bin, the information is literally useless as the record contains the field like normal. The game client itself also automatically converts between the two formats seamlessly without the need for more information.

Here's how: there's two cases to consider when there is a 'relationship data' section present in the db2 (relationship_data_size field in file header > 0).

  1. In the first case, where the file and config report the same number of columns, the relationship data is purely a duplicate. There will be a column already in the db2's records that has all of the same data. It does not have to be at the end; it could be anywhere in the record.
  2. In the second case, where the file has 1 less column than the config, the relationship data section is required to be parsed. It always fills the final column of the structure.

Example of Case 1: Achievement::CriteriaTreeID
Example of Case 2: CriteriaTreeXEffect::CriteriaTreeID

In both cases, you didn't need the game client or anything special in the provided config to tell you how to parse the file - you did it yourself just by comparing the number of columns.
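The column-count comparison described above boils down to a few lines. This is a sketch under the assumptions stated in the two cases; the function name and the string labels are illustrative, not from any real reader.

```python
def classify_relationship_data(file_column_count, config_column_count,
                               relationship_data_size):
    """Decide how to treat a WDC1 relationship-data section, per the two
    cases above. Sketch only; parameter names are assumptions.
    """
    if relationship_data_size == 0:
        return "no relationship data"
    if file_column_count == config_column_count:
        # Case 1: the section duplicates a column already in the records.
        return "duplicate"
    if file_column_count == config_column_count - 1:
        # Case 2: the section supplies the final column of the structure.
        return "fills final column"
    raise ValueError("unexpected column count mismatch")

# Case 1 (Achievement::CriteriaTreeID-style) vs Case 2
# (CriteriaTreeXEffect::CriteriaTreeID-style):
print(classify_relationship_data(10, 10, 64))  # duplicate
print(classify_relationship_data(9, 10, 64))   # fills final column
```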

P.S. $id$ is useless for 99% of builds (needed for WDB3 and WDB4, which were both very short-lived). Unfortunately, that 1% means I can't complain about it. Well, I can complain about it, but I can't sound reasonable while doing it.

Some discussion from IRC. Would like to remind everyone that even though IRC is easier and such not everyone is able to read it. Let's try and keep it in here.

03:31 <@schlumpf>  Simca_: you do understand that dbd needs literally 0 bytes in order to parse a wdc1? The entire idea is adding semantics. You don’t need types, names, foreign keys, id columns (there are columns not called ID in 801 as well), relationship columns, but counts, locstring, enums, comments  in order to parse wdc1. We want that information to preserve it and to document everything that we know. Every single named property is optional in
03:31 <@schlumpf>  the format even. for semantics though it is good to know what is an id, a foreign key, a unique valued column, ….
03:32 <Simca_>  then it could be a good comment maybe.
03:33 <Simca_>  the actual spec should just give you the information required to load a file from any version. so it needs $id$, if only for WDB3 and WDB4.
03:33 <Simca_>  and it needs byte sizes, for several different versions

So the above discussion seemed to mostly be a disagreement based on different use cases. For @MMOSimca's use case $relation$ is entirely superfluous, but I do see @bloerwald's reasoning of keeping it for the sake of not losing information. It'd be nice if someone else could chime in on this, even though interest in the project currently seems to be at an all-time low. I'm going to go ahead with the current version of the spec as laid out in the README. If we come to a decision on whether to keep this or not we can easily change the definitions at that point.

I'm going to keep working on this for at the very least a few more weeks. If it doesn't get any traction by then I suggest we probably put this on hold indefinitely. If this is just going to end up feeding the wiki and nothing else it will have been a giant waste of time.

I don't think it's losing any information, since the information is there the same way it was before. It's not explicit, but the issue is that the information is not useful explicitly.

In any case, I'd like to question the 'unverified' thing @bloerwald has been doing. I don't think that's part of the spec (yet), but it's in some demo defs. Just worth mentioning: we need to be really careful with symbol usage. Already, we've got <> in use for bit sizes and FKs, then $ for IDs (and relations, unfortunately), and now # for type overrides. For the unverified thing, please use something other than another set of <>, as that is just confusing. Having to find and parse two sets of <> with one being optional makes a format that's already hard to parse even harder.

Also, I think that condensing down superfluous wording would help both human readability and machine readability. If you kill $relation$, that frees up the dollar sign to mean one thing: ID column. Then we can just remove the '$id' part of it, and leave it like:

uint32 ID

$ID

Using easy symbols in place of constantly repeated words and terms is a nice optimization for hand editing and for reading.

RE Simca's first comment: the schlumpf directory is old and shouldn't be looked at. Definitions in the definitions folder are authoritative. Unverified is a question mark in the top COLUMNS definitions there (I'll put this in README as it seems to be missing).

Also, there was some discussion only on IRC again: 👎

18:24 <@schlumpf>  Simca_: yes, $id$ID is non-inline; $id$ID<32> is not.
18:24 <Alram>  mfw schlumpf never learns
18:24 <Simca_>  horrible
18:24 <@schlumpf>  I was just answering a question?!
18:24 <@schlumpf>  Oh fuck it Simca_, do your own format.
18:25 <Alram>  maybe answer in the place he asked next time
18:25 <Alram>  and simca has his own format
18:25 <Alram>  he doesn't really benefit from this at all :D
18:25 <Simca_>  both inline and non-inline IDs are 32-bit.
18:26 <Simca_>  the distinction you're making is not only confusing as hell but completely pointless

And yeah, I do have my own format. I have a stake in this format though because I want it to be -good-, so that I can use it instead of mine.

@bloerwald I think you're trying too hard to make DBD like the wiki, when all of that stuff that you want should be in the conversion process from DBD to wiki definition. Having non-inline IDs in the structure definitions throws off DBCache.bin reading.

The principal concerns I think here are (in order of importance):

  1. Useful for DBC/DB2 reading/writing
  2. Useful for DBCache.bin/ADB reading/writing.
  3. Easy to read at a glance (for people who already know what they're doing)
  4. Easily hand-edited (minimal number of characters required and effort needed to add new versions)
  5. Pretty for wiki / useful for beginners

The reason for that order is because you can fill in the gaps yourself automatically when making the wiki's definitions from the DBD files. For example, if the structure doesn't have an $id$ column, you already know: the ID is non-inline. It doesn't need to be explicitly written out for use #1 or use #2. So when you convert to a wiki definition you can -extremely easily- just auto add those entries to the output.

I want to keep the base format as clean and simple as humanly possible. Literally, the fewer characters in the file, the better we're doing. Then the conversion to wiki can add the fluff to the definition, like making a '// non-inline ID goes here' thing.

You seem to be prioritizing the definitions being extremely noob-friendly, so that people who don't understand the format at all don't have to ask questions like "what's up with the relationship data?" or "where is the ID column?!". I actually do think that's a legitimate concern, and that these topics can be too hard to broach from an outside perspective. That's actually a huge part of why I've tried to maintain the DB2 wiki page as much as I have. But DBD files are going to be inherently technical. It's fine if the config files to a program don't explicitly explain every detail. In fact, they shouldn't. We'll have a README that says what all these decisions mean and why they were taken.

The wiki should be where beginners go to get super detailed definitions, and the DBD files should be just as detailed, but with the details implicit where possible and with extensive use of abbreviations, symbols, and other shortcuts. If we keep these files as complex as they're getting to be, we might as well move to XML instead because parsing them is going to be hell.

It took less than a day to write a parser for dbd and a pretty printer for wiki in a language and framework I don’t know with most of the time spent on trying to understand pythons basic language features. The EBNF has less than 20 rules with at least four rules being just aliases.

If that’s too complex for a format that is “inherently technical”, I don’t know.

I surely don’t see it as being too complex that we have to remove explicit knowledge to make everyone who uses the format needed to be an expert.

Why did I put 32 on every ID? Because I didn't even know that currently IDs have that requirement. Does it hurt having it and just skipping it? No, not at all. It even simplifies things, since every column reference has the exact same format rather than having to decide how to parse an entry based on the name or some shit. In fact, I would even make it MORE verbose and explicit: MARK non-inline columns in the language rather than in a comment.

This makes the format simpler, not more complex.

You're missing the point. You're adding in a field that doesn't exist in the structure (non-inline IDs).

Does it hurt having more information and having to teach parsers to skip it? Yes, yes it does. It is easier for one person to waste their time adding it in conversion to wiki form (you) than forcing every other person who uses this format to waste their time skipping nonsense fields in parsing.

I guess the question is: is that true? Ignoring what the wiki needs from the format (you can 'fix it in post' as they would say in Hollywood), what would general coders want (general users would go to wiki, this is intended for code-use only)? I think we're asking the wrong questions here.

There's two scenarios here:

  1. My way. When parsing, records are treated normally. Non-inline IDs are sorted in a separate array. There is an ID accessor provided along with the records that people use to pull IDs. If the flag for non-inline IDs is present, then the accessor returns the appropriately-indexed non-inline ID (the one matching the row index), otherwise it returns the field marked as the ID index by the file header (or WDB3/WDB4, the config file).
  2. Your way. Parsers have to be taught that column 0 can be fake, so they need to start reading their data into the structure from column 1 to the end. After that, you go back and insert data from the non-inline IDs array into column 0. You'll still probably need an ID accessor of some kind since the column can move, unless you're using reflection and using 'ID' as the name for it consistently.

Which is actually preferable to most people? I honestly don't know.
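For the record, scenario 1's "ID accessor" idea can be sketched in a few lines. Everything here (function name, tuple-shaped records) is illustrative; only the behavior — prefer the non-inline ID array when present, otherwise read the header-designated ID field — comes from the description above.

```python
def make_id_accessor(records, id_field_index, non_inline_ids=None):
    """Return a function mapping row index -> record ID.

    Sketch of scenario 1: if a non-inline ID array is present, index
    into it (it matches row order); otherwise read the field the file
    header (or, for WDB3/WDB4, the config) marks as the ID column.
    """
    if non_inline_ids is not None:
        return lambda row: non_inline_ids[row]
    return lambda row: records[row][id_field_index]

# Usage: same records, with and without a non-inline ID section.
records = [(5, 1.0), (6, 2.0)]
get_id = make_id_accessor(records, id_field_index=0)
print(get_id(1))  # 6 (inline ID, read from field 0)
get_id = make_id_accessor(records, id_field_index=0, non_inline_ids=[100, 200])
print(get_id(1))  # 200 (non-inline ID wins)
```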

And yes, you're correct about one thing. If we go with including a field that doesn't exist in structures, then the only way to salvage the format for file reading is to make the fake field extremely well marked. None of this "doesn't have a <32> on the end" stuff. How about $FAKENEWS$? :>

I'd personally prefer to keep non-inline IDs purely for human reasons/clarity/semantics. However, clearly marking them for parsers to easily ignore sounds like a good idea. Having an incorrect number of fields (as far as parsers are concerned, at least) would definitely throw a wrench in some things that I might or might not have already run into while attempting to implement this.

What about having $id$ for inline IDs and $noninlineid$ for non-inline IDs? Feels kinda bad still but would be better than not marking at all.

I've had a bit of a change of heart on this, tbh. I guess it was realizing that the format I was pushing for had virtually zero audience. If we want the audience to be 'db2 experts', then who is that - really? Like 4-6 people outside of Blizzard, probably.

It also helped to write out the methods of coding both scenarios in my last post because it made me realize that I even do it both ways myself and that another person I know actually uses the second method exclusively, requiring a fake ID column to be in the provided structure.

So, yes. That sounds good. I propose making that change, then calling the format 'good enough' for now.

There's a few areas to expand into for the future - verified vs unverified, bitfield and enum definitions, and there's one minor 'gotcha' case that existed for WDB6 with default values only being present in the client that we'll want to explore (in WDC1, they're in the db2s so the problem solves itself eventually - here's how I handled it fwiw - https://paste2.org/P4JG5BWb).

But those are proposed additions to the format, not changes, and they can wait for later.

Shipped the $noninline$ change in b875e10 and removed //relationship column comment in 8525b36.

Is everyone satisfied with the current state of the format enough so we can start actually defining more versions?

not quite format, but abusing the issue:

  • $x_internal.dbc: have those? not like they will ever be relevant.
  • is it gt$table or Gt$table?
    • are game tables worth documenting to begin with?
  • is item-sparse.dbx the same as item.dbx?

While I'm still functioning:

  • I agree, they're pretty irrelevant.
  • It'd be nice to document game tables just for the sake of completeness but I wouldn't prioritise it.
  • Personally I think the dbd's should reflect the filenames especially if at some point these names ever get re-purposed side-by-side. Also, I'm sure I've seen item.dbx and item-sparse.dbx both populated (but not necessarily read) in a beta cata build.
    • A followup question, is it worth documenting obsolete dbs still in the game files? (I say yes)
  • just ignore _internal
  • have them, I think I took gt, but no care-y there. case insensitive it should be, even though we aren't, yay.
  • multiple files, k
    • yes, not everyone is on latest version.

Right, we've been shipping a bunch of versions with the current format now and have noticed some things:

Signedness
The current version can define a uint or int in column definitions, but is unable to deal with it changing in a build. The most recent build (26788) introduced a lot of int -> uint changes and vice versa. The opinions on this mattering a lot or not differ and I want to have some more input on this.

Do we need to add something to the format that specifies signedness in version definitions? Something like ColumnName<u32> and having ints (signed or unsigned) always be int in column definitions? Will that make it too complex for not much of a benefit? Feedback, please.
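As a strawman for the `ColumnName<u32>` idea, a bit spec with an optional signedness marker would still be trivial to parse. This is a sketch of the proposal only — the function and return shape are hypothetical, not part of the spec.

```python
import re

def parse_bit_spec(token):
    """Parse a column reference like 'ColumnName<u32>' or 'ColumnName<32>'.

    Sketch of the proposed signedness extension: a 'u' prefix in the bit
    spec marks the column unsigned; a bare number stays signed.
    Returns (name, bits, is_unsigned).
    """
    match = re.fullmatch(r"(\w+)<(u?)(\d+)>", token)
    if not match:
        raise ValueError(f"unparseable bit spec: {token}")
    name, u, bits = match.groups()
    return name, int(bits), u == "u"

# Usage: signed by default, unsigned with the 'u' prefix.
print(parse_bit_spec("Flags<u32>"))   # ('Flags', 32, True)
print(parse_bit_spec("Damage<16>"))   # ('Damage', 16, False)
```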

Defaults
For a short time in WDB6 (7.2-7.3.2) DBMeta in exe had default values for only a handful of DBs. Do we want to support this (and add complexity)? Something like ColumnName<8> = 255, maybe? :(

Column reordering builds
Build 8.0.1.26788 has been kind of a bitch in reordering a ton of DBs causing a lot of manual work to correct definitions for this version. There's a plan to make an application that would automatically detect column order swapping based on type/flags/content to get rid of the hours of manual work that go into builds that do this. I know @barncastle has been working on an application that detects differences between versions so hopefully we'll be able to build on that?

Build ranges
These have been more annoying than initially thought to deal with properly. No further comment for now; most files will have growing lists of builds for the foreseeable future.

Adoption
Adoption is still low. I use it in my own tools, and @barncastle has a test branch for WDBXEditor that I've been using to map DBCs that the other readers I have don't support (primarily 6.0.1.18179), and it has been working great. Hopefully that'll see the light of day when we are properly caught up on definitions for older versions. We're still motivated to keep the format going and to add new versions; hopefully this will gradually pick up more steam.

Other than that, I'm personally pretty happy with what we have now. Work is progressing on a DBMeta dumper for multiple versions, slowly moving backwards and adding more WoW versions. We added a few missing definitions for DBCs that aren't present in exes recently as well.

Signedness

I'm fine with making COLUMNS only have int and adding the u prefix or suffix to the bit spec (still trivial to parse). I felt it was relevant from the beginning and it was only dropped due to the hassle.

Defaults

No opinion, since I'm not even sure where they are used. For hotfixes to leave out parts? For those compression things in the file?

Alternative without format change: Just place it in a comment.

Column reordering builds

While it is sad and annoying, I don't think we can get a better solution than such an automated thing. After all, we need it anyway given we dump unnamed builds all the time. Currently we only match those on layout hash, which is probably fine, but also matching on data would be way more awesome, especially with the signed/unsigned crap. I still wish we had our own auto layout, making the reordering non-existent based on layout=f(data+spec) where spec would be more version independent, but oh well.

For your sanity I suggest you don't manually map those currently and we hope for Barncastle or Warpten to finally come up with something.

Build ranges

Vote: keep, don't bother. At some point, write a tool that

  • groups versions by major.minor.patch onto different "lines"
  • sorts versions by build within the groups
  • uses manually verified list of known adjacent builds (note: given our lack of knowledge, this is equivalent to a list of all known public builds) for that group to merge the ranges per group

I tried writing that before but was not concentrated enough and still thought that cross-m.m.p ranges should also be merged (i.e.

BUILD 1.0.0.1, 1.0.0.3
BUILD 1.0.1.2, 1.0.1.4
known: 1.0.0.1, 1.0.1.2, 1.0.0.3, 1.0.1.4

→ 

BUILD 1.0.0.1-1.0.1.4

but this will be highly horrific for mid-ptr format change+reverts while live never changed and UGH. With only merging versions directly adjacent within a group, this chance ceases to exist, assuming a little bit of sanity left at Blizzard.
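The "merge only builds directly adjacent within a group" rule can be sketched roughly like this. Everything here is hypothetical (names, tuple representation); the rules — group by major.minor.patch, order by the manually verified list of known public builds, never merge across a gap or across groups — are the ones described above.

```python
from collections import defaultdict

def merge_ranges(defined, known):
    """Merge builds that share a definition into (start, end) ranges.

    Sketch: group by major.minor.patch, keep the manually verified
    ordering of known public builds, and only extend a range while the
    next known build in the group is also defined. Never merges across
    major.minor.patch boundaries. Builds are (major, minor, patch, build).
    """
    by_group = defaultdict(list)
    for b in known:                      # preserve the verified ordering
        by_group[b[:3]].append(b)
    defined = set(defined)
    ranges = []
    for group in by_group.values():
        start = prev = None
        for b in group:
            if b in defined:
                if start is None:
                    start = b
                prev = b
            else:                        # gap in the group: close range
                if start is not None:
                    ranges.append((start, prev))
                start = prev = None
        if start is not None:
            ranges.append((start, prev))
    return ranges

# The 1.0.0.x / 1.0.1.x example from above: ranges never cross groups.
known = [(1, 0, 0, 1), (1, 0, 1, 2), (1, 0, 0, 3), (1, 0, 1, 4)]
print(merge_ranges(known, known))
```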

A person at Blizzard told me that we should explicitly not assume that if x.y.z.1-retail and x.y.z.3-retail have the same layout, x.y.z.2-ptr will also have the same layout. Since we re-dump every build anyway and the list of adjacent versions is manually defined, we should not end up with an issue here. Such an awkward case is QuestV2CliTask, which had layout 3F026A14 in 8.0.1.26095, then changed to FE5B5478 for 8.0.1.26175-8.0.1.26231, changed back to 3F026A14 for build 8.0.1.26287, then carried on with layout 5A9EE4A6 and others from build 8.0.1.26297.

Adoption

Sad, but I never got to finish the wiki converter either (mostly since I was at the point of having to get information from wiki to dbd), and I understand people don't want to bother filling in dbd while other formats have layouts already. An amazing contribution would be someone converting existing layouts to dbd, at least for those where it is possible with automation (wdbx, arctium I guess).

I too see that not having a lot of builds besides 8.0.1 is our main downside. And I noticed that we don't even have all of those. Sad.

Fine with proposed changes, both for signedness and default values.

@bloerwald (and others):
As a bit of background on the default values: they're used as part of the process of handling common_data. They're mainly used for non-zero float columns, typically multipliers where 99% of the rows will have the value '1'. The default being '1' instead of 0 allows them to use less space, since they only have to store the value for the handful of rows that differ from the default.

In the WDB6 era, they existed only in the executable's DB meta information. However, when they moved to WDC1, they moved that information out of the executable (as far as I know, it's completely removed from the meta) and into the DB2. The good news is that the WDB6 era was not exceptionally long lived, and the system was not widely used. I count 7 different layouts across 6 different db2s using it in total (this could be wrong though because I added default value support to my tools very late into the life cycle of WDB6 and possibly missed retroactively updating some formats).
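The space saving described above amounts to rebuilding a full column from a default plus a sparse map of exceptions. A rough sketch (simplified: real common_data is keyed by record ID, not a dense row range, and the function name is made up):

```python
def materialize_column(row_count, default_value, sparse_values):
    """Expand a common_data-style sparse column into a full column.

    Every row gets the default (often 1.0 for multiplier floats) unless
    the sparse block stores an explicit value for that row. Simplified
    sketch: rows are indexed 0..row_count-1 rather than by record ID.
    """
    return [sparse_values.get(row, default_value)
            for row in range(row_count)]

# 99% of rows use the default multiplier of 1.0; only two rows differ,
# so only those two values need to be stored.
column = materialize_column(5, 1.0, {1: 0.5, 3: 2.0})
print(column)  # [1.0, 0.5, 1.0, 2.0, 1.0]
```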

Will be implementing the signedness change today. This will be a breaking change, so @barncastle and anyone else using the format as is will have to update to the newer version of the C# lib (or update their own implementation). I'll be changing the README imminently and will then convert all current definitions to use only int in column definitions, then the next step is going through all the currently defined builds and re-merging with proper signedness.

If you're considering logging default values, it's probably worth documenting DB flags as well. At the very least this will help with reading WDB3 - eliminating the need for hacky file length checks. That being said, WDB3 is quite literally a dead format and will not be used by anyone and post WDB3 has the flags in file...

Not format but abusing the issue: do we know how exactly members are binned into pallet/common blocks?

Common is probably not too hard to figure out but I'd love to find the threshold value. The idea is to produce db2s that match the original file as far as column compression is concerned, so that no one needs to edit client meta

With definitions for older versions (#27) coming soon, I suggest we drop the "build range per minor version" rule for expansions older than the current one (so 0.x-7.x can theoretically be a valid build range after 8.0 releases next week). Not sure this requires a real vote as it is a non-breaking change, but you can react 👍 or 👎 depending on what you think.

Also, seeing GitHub has officially decided this thread is too long/slow (it is collapsing some comments for me) I think this might be a good time to stamp the format with a giant 1.0 as soon as 8.0 releases next week and open up separate issues for remaining discussion points for future versions. 👍 and 👎 please (if 👎 please explain what is blocking 1.0), thanks.

closing this issue

oh yes please

calling it v1

yeah, call it whatever, is fine.

merging 0-7

Hmk. I guess yeah, as long as we don't just see that 1 and 6 are the same and then assume 1-6, but do some kind of verification, at least with heuristics/random samples? I mean, there are some that surely never changed where it is fine, but it seems risky. In fact, I wanted to do so at some point but didn't dare break anything.

With the last minor update (version ranges can be specified per expansion and are not to span multiple expansions) I'm gonna go ahead and close this. Work is ongoing on getting <6.0 versions merged in, with stuff in between following after that.

Further issues/feedback is still very much welcome, but please open a separate issue for that.

Thanks everyone for participating in the creation of 1.0!