dhruvbaldawa / multibase

Self identifying base encodings

multibase (WIP)

Multibase is a protocol for distinguishing base encodings and other simple string encodings, and for ensuring full compatibility with program interfaces. It answers the question:

Given data d encoded into string s, how can I tell what base d is encoded with?

Base encodings exist because transports have restrictions, use special in-band sequences, or must be human-friendly. When systems choose a base, it is not always clear which one to use, as the decision involves many tradeoffs. Multibase is here to save programs and programmers from worrying about which encoding is best. It solves the biggest problem: a program can use multibase to take input or produce output in whichever base is desired. The important part is that the value is self-describing, letting other programs elsewhere know what encoding it is using.

Format

The Format is:

<varint-base-encoding-code><base-encoded-data>

Where <varint-base-encoding-code> is the code assigned to the encoding in the multibase table. Note that multi-byte varints (codes above 127) are not yet supported, but are planned.
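As a minimal sketch of the format, the following Python splits a multibase string into its encoding name and payload. The names `MULTIBASE_CODES` and `split_multibase` are hypothetical, the lookup covers only the single-character codes from the table in this document, and multi-byte varint codes are not handled.

```python
# Hypothetical lookup built from the multibase table in this document:
# code character -> encoding name. Multi-byte varint codes are not handled.
MULTIBASE_CODES = {
    "\x00": "identity",
    "1": "base1",
    "0": "base2",
    "7": "base8",
    "9": "base10",
    "F": "base16", "f": "base16",
    "B": "base32", "b": "base32",
    "C": "base32pad", "c": "base32pad",
    "V": "base32hex", "v": "base32hex",
    "T": "base32hexpad", "t": "base32hexpad",
    "h": "base32z",
    "Z": "base58flickr",
    "z": "base58btc",
    "m": "base64",
    "M": "base64pad",
    "u": "base64url",
    "U": "base64urlpad",
}


def split_multibase(text):
    """Return (encoding_name, payload) for a multibase string."""
    if not text:
        raise ValueError("empty string has no multibase prefix")
    code, payload = text[0], text[1:]
    if code not in MULTIBASE_CODES:
        raise ValueError("unknown multibase code: %r" % code)
    return MULTIBASE_CODES[code], payload
```

For example, `split_multibase("F4D75")` yields `("base16", "4D75")`; the payload still has to be decoded with the named base.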

Multibase Table v1.0.0-RC (semver)

The current multibase table is here:

encoding      codes   name
identity      0x00    8-bit binary (encoder and decoder keep data unmodified)
base1         1       unary tends to be 11111
base2         0       binary has 1 and 0
base8         7       highest char in octal
base10        9       highest char in decimal
base16        F, f    highest char in hex
base32        B, b    rfc4648 - no padding - highest letter
base32pad     C, c    rfc4648 - with padding
base32hex     V, v    rfc4648 - no padding - highest char
base32hexpad  T, t    rfc4648 - with padding
base32z       h       z-base-32 - used by Tahoe-LAFS - highest letter
base58flickr  Z       highest char
base58btc     z       highest char
base64        m       rfc4648 - no padding
base64pad     M       rfc4648 - with padding - MIME encoding
base64url     u       rfc4648 - no padding
base64urlpad  U       rfc4648 - with padding

These encodings are being considered:

base128
base-emoji    😎      base emoji
base65536     ᔰ       base65536
utf8
utf16

Multibase By Example

Consider the following encodings of the same binary string:

4D756C74696261736520697320617765736F6D6521205C6F2F # base16 (hex)
JV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP           # base32
YAjKoNbau5KiqmHPmSxYCvn66dA1vLmwbt                 # base58
TXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==               # base64

And consider the same encodings with their multibase prefix, taken from the table above:

F4D756C74696261736520697320617765736F6D6521205C6F2F # base16 F
BJV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP           # base32 B
zYAjKoNbau5KiqmHPmSxYCvn66dA1vLmwbt                 # base58 z
MTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==               # base64pad M

The base prefixes used are: F, B, z, M.
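The example can be reproduced with Python's standard library for the bases it supports directly. This is a sketch, not a reference implementation: `multibase_encode` and `multibase_decode` are hypothetical names, the codes F, B and M are taken from the multibase table, and base58 is left out because the standard library has no base58 codec.

```python
import base64

# Codes from the multibase table; only bases backed by the stdlib are covered.
ENCODERS = {
    "F": lambda data: data.hex().upper(),                                 # base16, upper case
    "B": lambda data: base64.b32encode(data).decode("ascii").rstrip("="),  # base32, no padding
    "M": lambda data: base64.b64encode(data).decode("ascii"),             # base64, with padding
}

DECODERS = {
    "F": bytes.fromhex,
    # b32decode requires padding, so restore it before decoding.
    "B": lambda text: base64.b32decode(text + "=" * (-len(text) % 8)),
    "M": base64.b64decode,
}


def multibase_encode(code, data):
    """Encode bytes and prepend the multibase code character."""
    return code + ENCODERS[code](data)


def multibase_decode(text):
    """Read the code character, then decode the rest with that base."""
    return DECODERS[text[0]](text[1:])


message = b"Multibase is awesome! \\o/"
encoded = [multibase_encode(code, message) for code in "FBM"]
```

Each string in `encoded` starts with a different character, yet `multibase_decode` recovers the same bytes from all of them without being told which base was used.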

FAQ

Is this a real problem?

Yes. If I give you "1214314321432165", is that decimal? Hex? Something else?
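To make the ambiguity concrete, this small Python sketch tries the string from the question in several bases. The helper name `plausible_bases` is hypothetical.

```python
ambiguous = "1214314321432165"


def plausible_bases(text, bases=(8, 10, 16)):
    """Return {base: value} for every base in which the text parses."""
    values = {}
    for base in bases:
        try:
            values[base] = int(text, base)
        except ValueError:
            pass  # the text contains a digit that is invalid in this base
    return values


values = plausible_bases(ambiguous)
```

The string parses as octal, decimal and hex, and each reading is a different number. A multibase prefix such as 7, 9 or f from the table removes the guesswork.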

Why the strange selection of codes / characters?

The code values are selected such that they are included in the alphabets of the base they represent. For example, F is the base code for base16 (hex), because F is in hex's 16 character alphabet. Note that the alphabets here are ASCII or UTF-8 compliant. We have not found a case needing something else.
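The property can be checked mechanically. The sketch below pairs a few codes from the multibase table with the alphabets of their bases; the `ALPHABETS` mapping and `code_in_alphabet` are illustrative names, and only a subset of the table is covered.

```python
# (code, alphabet) pairs for a subset of the multibase table.
ALPHABETS = {
    "base2": ("0", "01"),
    "base8": ("7", "01234567"),
    "base10": ("9", "0123456789"),
    "base16": ("F", "0123456789ABCDEF"),
    "base32": ("B", "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"),
    "base58btc": ("z", "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"),
}


def code_in_alphabet(name):
    """True if the multibase code for this base is one of its own characters."""
    code, alphabet = ALPHABETS[name]
    return code in alphabet
```

One consequence is that a prefixed string uses only characters its base already uses, so it should pass through any transport that accepts the unprefixed encoding.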

Why varints?

So that there is no limit on the number of encodings the table can hold. Implementation note: you do not need to implement varints until the standard multibase table has more than 127 entries.

What kind of varints?

A most-significant-bit (MSB) unsigned varint, as defined by multiformats/unsigned-varint.
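For readers who have not met this kind of varint, here is a minimal Python sketch. It assumes the usual layout: seven payload bits per byte, least-significant group first, with the most significant bit of each byte set when more bytes follow. The function names are hypothetical, and the length limit that the unsigned-varint document imposes is not enforced here.

```python
def encode_uvarint(number):
    """Encode a non-negative integer as an MSB-continuation varint."""
    if number < 0:
        raise ValueError("unsigned varints cannot encode negative numbers")
    out = bytearray()
    while True:
        group = number & 0x7F          # low seven bits
        number >>= 7
        if number:
            out.append(group | 0x80)   # MSB set: more bytes follow
        else:
            out.append(group)          # MSB clear: last byte
            return bytes(out)


def decode_uvarint(buffer):
    """Return (value, bytes_consumed) for a varint at the start of buffer."""
    value = 0
    shift = 0
    for index, byte in enumerate(buffer):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, index + 1
        shift += 7
    raise ValueError("truncated varint")
```

Any code below 128 encodes to a single byte equal to the code itself, which is why an implementation can skip varint handling while the table stays small.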

Don't we have to agree on a table of base encodings?

Yes, but we already have to agree on base encodings, so this is not hard. The table even leaves some room for custom encodings.

Implementations:

Disclaimers

Warning: multibase prepends a prefix, so the first byte(s) of a value depend on the encoding. Do not expect a multibase string to be identical to the bare encoded value. Remove the multibase prefix before using the value.

Maintainers

Captain: @jbenet.

Contribute

Contributions welcome. Please check out the issues.

Check out our contributing document for more information on how we work, and about contributing in general. Please be aware that all interactions related to multiformats are subject to the IPFS Code of Conduct.

Small note: If editing the README, please conform to the standard-readme specification.

License

This repository is only for documents. All of these are licensed under the CC-BY-SA 3.0 license © 2016 Protocol Labs Inc. Any code is under the MIT license © 2016 Protocol Labs Inc.
