First attempt at SchemaLess FlatBuffers.

Change-Id: I86b9d002f3441ef9efdb70e059b8530ab2d74bb8
Tested: on Linux.
This commit is contained in:
Wouter van Oortmerssen
2016-02-01 18:00:30 -08:00
committed by Wouter van Oortmerssen
parent dabe030890
commit aac6be1153
8 changed files with 1705 additions and 2 deletions

View File

@@ -292,4 +292,148 @@ flexibility in which of the children of root object to write first (though in
this case there's only one string), and what order to write the fields in.
Different orders may also cause different alignments to happen.
# FlexBuffers
The [schema-less](@ref flexbuffers) version of FlatBuffers have their
own encoding, detailed here.
It shares many properties mentioned above, in that all data is accessed
over offsets, all scalars are aligned to their own size, and
all data is always stored in little endian format.
One difference is that FlexBuffers are built front to back, so children are
stored before parents, and the root of the data starts at the last byte.
Another difference is that scalar data is stored with a variable number of bits
(8/16/32/64). The current width is always determined by the *parent*, i.e. if
the scalar sits in a vector, the vector determines the bit width for all
elements at once. Selecting the minimum bit width for a particular vector is
something the encoder does automatically and thus is typically of no concern
to the user, though being aware of this feature (and not sticking a double in
the same vector as a bunch of byte sized elements) is helpful for efficiency.
Unlike FlatBuffers there is only one kind of offset, and that is an unsigned
integer indicating the number of bytes in a negative direction from the address
of itself (where the offset is stored).
### Vectors
The representation of the vector is at the core of how FlexBuffers works (since
maps are really just a combination of 2 vectors), so it is worth starting there.
As mentioned, a vector is governed by a single bit width (supplied by its
parent). This includes the size field. For example, a vector that stores the
integer values `1, 2, 3` is encoded as follows:
uint8_t 3, 1, 2, 3, 4, 4, 4
The first `3` is the size field, and is placed before the vector (an offset
from the parent to this vector points to the first element, not the size
field, so the size field is effectively at index -1).
Since this is an untyped vector `SL_VECTOR`, it is followed by 3 type
bytes (one per element of the vector), which are always following the vector,
and are always a uint8_t even if the vector is made up of bigger scalars.
### Types
A type byte is made up of 2 components (see flexbuffers.h for exact values):
* 2 lower bits representing the bit-width of the child (8, 16, 32, 64).
This is only used if the child is accessed over an offset, such as a child
vector. It is ignored for inline types.
* 6 bits representing the actual type (see flexbuffers.h).
Thus, in this example `4` means 8 bit child (value 0, unused, since the value is
in-line), type `SL_INT` (value 1).
### Typed Vectors
These are like the Vectors above, but omit the type bytes. The type is instead
determined by the vector type supplied by the parent. Typed vectors are only
available for a subset of types for which these savings can be significant,
namely inline signed/unsigned integers (`TYPE_VECTOR_INT` / `TYPE_VECTOR_UINT`),
floats (`TYPE_VECTOR_FLOAT`), and keys (`TYPE_VECTOR_KEY`, see below).
Additionally, for scalars, there are fixed length vectors of sizes 2 / 3 / 4
that don't store the size (`TYPE_VECTOR_INT2` etc.), for an additional savings
in space when storing common vector or color data.
### Scalars
FlexBuffers supports integers (`TYPE_INT` and `TYPE_UINT`) and floats
(`TYPE_FLOAT`), available in the bit-widths mentioned above. They can be stored
both inline and over an offset (`TYPE_INDIRECT_*`).
The offset version is useful to encode costly 64bit (or even 32bit) quantities
into vectors / maps of smaller sizes, and to share / repeat a value multiple
times.
### Blobs, Strings and Keys.
A blob (`TYPE_BLOB`) is encoded similar to a vector, with one difference: the
elements are always `uint8_t`. The parent bit width only determines the width of
the size field, allowing blobs to be large without the elements being large.
Strings (`TYPE_STRING`) are similar to blobs, except they have an additional 0
termination byte for convenience, and they MUST be UTF-8 encoded (since an
accessor in a language that does not support pointers to UTF-8 data may have to
convert them to a native string type).
A "Key" (`TYPE_KEY`) is similar to a string, but doesn't store the size
field. They're so named because they are used with maps, which don't care
for the size, and can thus be even more compact. Unlike strings, keys cannot
contain bytes of value 0 as part of their data (size can only be determined by
`strlen`), so while you can use them outside the context of maps if you so
desire, you're usually better off with strings.
### Maps
A map (`TYPE_MAP`) is like an (untyped) vector, but with 2 prefixes before the
size field:
| index | field |
| ----: | :----------------------------------------------------------- |
| -3 | An offset to the keys vector (may be shared between tables). |
| -2 | Byte width of the keys vector. |
| -1 | Size (from here on it is compatible with `TYPE_VECTOR`) |
| 0 | Elements. |
| Size | Types. |
Since a map is otherwise the same as a vector, it can be iterated like
a vector (which is probably faster than lookup by key).
The keys vector is a typed vector of keys. Both the keys and corresponding
values *have* to be stored in sorted order (as determined by `strcmp`), such
that lookups can be made using binary search.
The reason the key vector is a seperate structure from the value vector is
such that it can be shared between multiple value vectors, and also to
allow it to be treated as its own indivual vector in code.
An example map { foo: 13, bar: 14 } would be encoded as:
0 : uint8_t 'f', 'o', 'o', 0
4 : uint8_t 'b', 'a', 'r', 0
8 : uint8_t 2 // key vector of size 2
// key vector offset points here
9 : uint8_t 9, 6 // offsets to foo_key and bar_key
11: uint8_t 3, 1 // offset to key vector, and its byte width
13: uint8_t 2 // value vector of size
// value vector offset points here
14: uint8_t 13, 14 // values
16: uint8_t 4, 4 // types
### The root
As mentioned, the root starts at the end of the buffer.
The last uint8_t is the width in bytes of the root (normally the parent
determines the width, but the root has no parent). The uint8_t before this is
the type of the root, and the bytes before that are the root value (of the
number of bytes specified by the last byte).
So for example, the integer value `13` as root would be:
uint8_t 13, 4, 1 // Value, type, root byte width.
<br>