First attempt at SchemaLess FlatBuffers.

Change-Id: I86b9d002f3441ef9efdb70e059b8530ab2d74bb8
Tested: on Linux.
2026-06-07 22:03:40 +00:00 · 2016-02-01 18:00:30 -08:00
parent dabe030890
commit aac6be1153
8 changed files with 1705 additions and 2 deletions
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -30,6 +30,7 @@ set(FlatBuffers_Library_SRCS
  include/flatbuffers/util.h
  include/flatbuffers/reflection.h
  include/flatbuffers/reflection_generated.h
+  include/flatbuffers/flexbuffers.h
  src/code_generators.cpp
  src/idl_parser.cpp
  src/idl_gen_text.cpp
--- a/docs/source/FlatBuffers.md
+++ b/docs/source/FlatBuffers.md
@@ -78,6 +78,9 @@ inefficiency, but also forces you to write *more* code to access data
 In this context, it is only a better choice for systems that have very
 little to no information ahead of time about what data needs to be stored.

+If you do need to store data that doesn't fit a schema, FlatBuffers also
+offers a schema-less (self-describing) version!
+
 Read more about the "why" of FlatBuffers in the
 [white paper](@ref flatbuffers_white_paper).

@@ -138,6 +141,8 @@ sections provide a more in-depth usage guide.
    using FlatBuffers.
 -   A [white paper](@ref flatbuffers_white_paper) explaining the "why" of
    FlatBuffers.
+-   How to use the [schema-less](@ref flexbuffers) version of
+    FlatBuffers.
 -   A description of the [internals](@ref flatbuffers_internals) of FlatBuffers.
 -   A formal [grammar](@ref flatbuffers_grammar) of the schema language.

--- a/docs/source/FlexBuffers.md
+++ b/docs/source/FlexBuffers.md
@@ -0,0 +1,156 @@
+FlexBuffers    {#flexbuffers}
+==========
+
+FlatBuffers was designed around schemas, because when you want maximum
+performance and data consistency, strong typing is helpful.
+
+There are, however, times when you want to store data that doesn't fit a
+schema, because you can't know ahead of time everything that needs to be stored.
+
+For this, FlatBuffers has a dedicated format, called FlexBuffers.
+This is a binary format that can be used in conjunction
+with FlatBuffers (by storing a part of a buffer in FlexBuffers
+format), or also as its own independent serialization format.
+
+While it loses the strong typing, it retains the most distinctive advantage
+FlatBuffers has over other serialization formats (schema-based or not):
+FlexBuffers can also be accessed without parsing, copying, or object allocation.
+This is a big win for efficiency and memory-friendliness, and enables use
+cases such as mmap-ing large amounts of free-form data.
+
+The FlexBuffers design and implementation allow for a very compact encoding,
+combining automatic pooling of strings with automatic sizing of containers to
+their smallest possible representation (8/16/32/64 bits). Many values and
+offsets can be encoded in just 8 bits. While a schema-less representation is
+usually bulkier because of the need to be self-descriptive, in many cases
+FlexBuffers produces smaller buffers than regular FlatBuffers.
+
+FlexBuffers is still slower than regular FlatBuffers though, so we recommend
+using it only if you need it.
+
+
+# Usage
+
+This is for C++; other languages may follow.
+
+Include the header `flexbuffers.h`, which in turn depends on `flatbuffers.h`
+and `util.h`.
+
+To create a buffer:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+flexbuffers::Builder fbb;
+fbb.Int(13);
+fbb.Finish();
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You create any value, followed by `Finish`. Unlike FlatBuffers, which requires
+the root value to be a table, here any value can be the root, including a lone
+int value.
+
+You can now access the `std::vector<uint8_t>` that contains the encoded value
+as `fbb.GetBuffer()`. Write it, send it, or store it in a parent FlatBuffer. In
+this case, the buffer is just 3 bytes in size.
+
+To read this value back, you could just say:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+auto root = flexbuffers::GetRoot(my_buffer);
+int64_t i = root.AsInt64();
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+FlexBuffers stores ints only as big as needed, so it doesn't differentiate
+between different sizes of ints. You can ask for the 64 bit version,
+regardless of what you put in. In fact, since you demand to read the root
+as an int, if you supply a buffer that actually contains a float, or a
+string with numbers in it, it will convert it for you on the fly as well,
+or return 0 if it can't. If instead you actually want to know what is inside
+the buffer before you access it, you can call `root.GetType()` or `root.IsInt()`
+etc.
+
+Here's a slightly more complex value you could write instead of `fbb.Int` above:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+fbb.Map([&]() {
+  fbb.Vector("vec", [&]() {
+    fbb.Int(-100);
+    fbb.String("Fred");
+    fbb.IndirectFloat(4.0f);
+  });
+  fbb.UInt("foo", 100);
+});
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This stores the equivalent of the JSON value
+`{ vec: [ -100, "Fred", 4.0 ], foo: 100 }`. The root is a dictionary that has
+just two key-value pairs, with keys `vec` and `foo`. Unlike FlatBuffers, it
+actually has to store these keys in the buffer (which it does only once if
+you store multiple such objects, by pooling key values), but also unlike
+FlatBuffers it has no restriction on the keys (fields) that you use.
+
+The map constructor uses a C++11 Lambda to group its children, but you can
+also use more conventional start/end calls if you prefer.
+
+The first value in the map is a vector. You'll notice that unlike FlatBuffers,
+you can use mixed types. There is also a `TypedVector` variant that only
+allows a single type, and uses a bit less memory.
+
+`IndirectFloat` is an interesting feature that allows you to store values
+by offset rather than inline. Though that doesn't make any visible change
+to the user, the consequence is that large values (especially doubles or
+64 bit ints) that occur more than once can be shared. Another use case is
+inside of vectors, where the largest element makes up the size of all elements
+(e.g. a single double forces all elements to 64bit), so storing a lot of small
+integers together with a double is more efficient if the double is indirect.
+
+Accessing it:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+auto map = flexbuffers::GetRoot(my_buffer).AsMap();
+map.size();  // 2
+auto vec = map["vec"].AsVector();
+vec.size();  // 3
+vec[0].AsInt64();  // -100;
+vec[1].AsString().c_str();  // "Fred";
+vec[1].AsInt64();  // 0 (Number parsing failed).
+vec[2].AsDouble();  // 4.0
+vec[2].AsString().IsTheEmptyString();  // true (Wrong Type).
+vec[2].AsString().c_str();  // "" (This still works though).
+vec[2].ToString().c_str();  // "4" (Or have it converted).
+map["foo"].AsUInt8();  // 100
+map["unknown"].IsNull();  // true
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+
+# Binary encoding
+
+A description of how FlexBuffers are encoded is in the
+[internals](@ref flatbuffers_internals) document.
+
+
+# Efficiency tips
+
+* Vectors generally are a lot more efficient than maps, so prefer them over maps
+  when possible for small objects. Instead of a map with keys `x`, `y` and `z`,
+  use a vector. Better yet, use a typed vector. Or even better, use a fixed
+  size typed vector.
+* Maps are backwards compatible with vectors, and can be iterated as such.
+  You can iterate either just the values (`map.Values()`), or in parallel with
+  the keys vector (`map.Keys()`). If you intend
+  to access most or all elements, this is faster than looking up each element
+  by key, since that involves a binary search of the key vector.
+* When possible, don't mix values that require a big bit width (such as double)
+  in a large vector of smaller values, since all elements will take on this
+  width. Use `IndirectDouble` when this is a possibility. Note that
+  integers automatically use the smallest width possible, i.e. if you ask
+  to serialize an int64_t whose value is actually small, you will use fewer
+  bits. Doubles are represented as floats whenever this is lossless, but
+  this is only possible for few values.
+  Since nested vectors/maps are stored over offsets, they typically don't
+  affect the vector width.
+* To store large arrays of byte data, use a blob. If you'd use a typed
+  vector, the bit width of the size field may make it use more space than
+  expected, and may not be compatible with `memcpy`.
+  Similarly, large arrays of (u)int16_t may be better off stored as a
+  binary blob if their size could exceed 64k elements.
+  Construction and use are otherwise similar to strings.
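
The bit-width tip in the FlexBuffers.md hunk above (every inline element of a vector takes on the width of the widest element) can be illustrated with a minimal sketch. This is not code from `flexbuffers.h`: the helper names `MinByteWidth`, `VectorByteWidth` and `InlineBytes` are hypothetical, and the sketch only models signed integers, on the assumption that the encoder picks the smallest of 1/2/4/8 bytes that holds every element.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of flexbuffers.h): smallest byte width
// (1, 2, 4 or 8) that can hold a signed integer value.
inline int MinByteWidth(int64_t v) {
  if (v >= INT8_MIN && v <= INT8_MAX) return 1;
  if (v >= INT16_MIN && v <= INT16_MAX) return 2;
  if (v >= INT32_MIN && v <= INT32_MAX) return 4;
  return 8;
}

// Hypothetical helper: a vector is governed by a single width, so it
// needs the width of its widest inline element.
inline int VectorByteWidth(const std::vector<int64_t> &elems) {
  int width = 1;
  for (int64_t e : elems) {
    const int w = MinByteWidth(e);
    if (w > width) width = w;
  }
  return width;
}

// Bytes used by the inline elements alone (size field and type bytes
// are not counted in this sketch).
inline size_t InlineBytes(const std::vector<int64_t> &elems) {
  return elems.size() * static_cast<size_t>(VectorByteWidth(elems));
}
```

Under these assumptions, a vector of small integers costs one byte per element, while adding a single value that needs 64 bits makes every element cost eight bytes. That is the situation the tip suggests avoiding by storing the large value indirectly.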
--- a/docs/source/Internals.md
+++ b/docs/source/Internals.md
@@ -292,4 +292,148 @@ flexibility in which of the children of root object to write first (though in
 this case there's only one string), and what order to write the fields in.
 Different orders may also cause different alignments to happen.

+# FlexBuffers
+
+The [schema-less](@ref flexbuffers) version of FlatBuffers has its
+own encoding, detailed here.
+
+It shares many properties mentioned above, in that all data is accessed
+over offsets, all scalars are aligned to their own size, and
+all data is always stored in little endian format.
+
+One difference is that FlexBuffers are built front to back, so children are
+stored before parents, and the root of the data starts at the last byte.
+
+Another difference is that scalar data is stored with a variable number of bits
+(8/16/32/64). The current width is always determined by the *parent*, i.e. if
+the scalar sits in a vector, the vector determines the bit width for all
+elements at once. Selecting the minimum bit width for a particular vector is
+something the encoder does automatically and thus is typically of no concern
+to the user, though being aware of this feature (and not sticking a double in
+the same vector as a bunch of byte sized elements) is helpful for efficiency.
+
+Unlike FlatBuffers there is only one kind of offset, and that is an unsigned
+integer indicating the number of bytes in a negative direction from the address
+of itself (where the offset is stored).
+
+### Vectors
+
+The representation of the vector is at the core of how FlexBuffers works (since
+maps are really just a combination of 2 vectors), so it is worth starting there.
+
+As mentioned, a vector is governed by a single bit width (supplied by its
+parent). This includes the size field. For example, a vector that stores the
+integer values `1, 2, 3` is encoded as follows:
+
+    uint8_t 3, 1, 2, 3, 4, 4, 4
+
+The first `3` is the size field, and is placed before the vector (an offset
+from the parent to this vector points to the first element, not the size
+field, so the size field is effectively at index -1).
+Since this is an untyped vector (`TYPE_VECTOR`), it is followed by 3 type
+bytes (one per element of the vector), which always follow the vector
+and are always a uint8_t even if the vector is made up of bigger scalars.
+
+### Types
+
+A type byte is made up of 2 components (see flexbuffers.h for exact values):
+
+* 2 lower bits representing the bit-width of the child (8, 16, 32, 64).
+  This is only used if the child is accessed over an offset, such as a child
+  vector. It is ignored for inline types.
+* 6 bits representing the actual type (see flexbuffers.h).
+
+Thus, in this example `4` means 8 bit child (value 0, unused, since the value is
+in-line), type `TYPE_INT` (value 1).
+
+### Typed Vectors
+
+These are like the Vectors above, but omit the type bytes. The type is instead
+determined by the vector type supplied by the parent. Typed vectors are only
+available for a subset of types for which these savings can be significant,
+namely inline signed/unsigned integers (`TYPE_VECTOR_INT` / `TYPE_VECTOR_UINT`),
+floats (`TYPE_VECTOR_FLOAT`), and keys (`TYPE_VECTOR_KEY`, see below).
+
+Additionally, for scalars, there are fixed length vectors of sizes 2 / 3 / 4
+that don't store the size (`TYPE_VECTOR_INT2` etc.), for an additional savings
+in space when storing common vector or color data.
+
+### Scalars
+
+FlexBuffers supports integers (`TYPE_INT` and `TYPE_UINT`) and floats
+(`TYPE_FLOAT`), available in the bit-widths mentioned above. They can be stored
+both inline and over an offset (`TYPE_INDIRECT_*`).
+
+The offset version is useful to encode costly 64bit (or even 32bit) quantities
+into vectors / maps of smaller sizes, and to share / repeat a value multiple
+times.
+
+### Blobs, Strings and Keys
+
+A blob (`TYPE_BLOB`) is encoded similarly to a vector, with one difference: the
+elements are always `uint8_t`. The parent bit width only determines the width of
+the size field, allowing blobs to be large without the elements being large.
+
+Strings (`TYPE_STRING`) are similar to blobs, except they have an additional 0
+termination byte for convenience, and they MUST be UTF-8 encoded (since an
+accessor in a language that does not support pointers to UTF-8 data may have to
+convert them to a native string type).
+
+A "Key" (`TYPE_KEY`) is similar to a string, but doesn't store the size
+field. They're so named because they are used with maps, which don't care
+for the size, and can thus be even more compact. Unlike strings, keys cannot
+contain bytes of value 0 as part of their data (size can only be determined by
+`strlen`), so while you can use them outside the context of maps if you so
+desire, you're usually better off with strings.
+
+### Maps
+
+A map (`TYPE_MAP`) is like an (untyped) vector, but with 2 prefixes before the
+size field:
+
+| index | field                                                        |
+| ----: | :----------------------------------------------------------- |
+| -3    | An offset to the keys vector (may be shared between maps).   |
+| -2    | Byte width of the keys vector.                               |
+| -1    | Size (from here on it is compatible with `TYPE_VECTOR`)      |
+| 0     | Elements.                                                    |
+| Size  | Types.                                                       |
+
+Since a map is otherwise the same as a vector, it can be iterated like
+a vector (which is probably faster than lookup by key).
+
+The keys vector is a typed vector of keys. Both the keys and corresponding
+values *have* to be stored in sorted order (as determined by `strcmp`), such
+that lookups can be made using binary search.
+
+The reason the key vector is a separate structure from the value vector is
+such that it can be shared between multiple value vectors, and also to
+allow it to be treated as its own individual vector in code.
+
+An example map { foo: 13, bar: 14 } would be encoded as:
+
+    0 : uint8_t 'b', 'a', 'r', 0
+    4 : uint8_t 'f', 'o', 'o', 0
+    8 : uint8_t 2      // key vector of size 2
+    // key vector offset points here
+    9 : uint8_t 9, 6   // offsets to bar_key and foo_key
+    11: uint8_t 2, 1   // offset to key vector, and its byte width
+    13: uint8_t 2      // value vector of size 2
+    // value vector offset points here
+    14: uint8_t 14, 13 // values
+    16: uint8_t 4, 4   // types
+
+### The root
+
+As mentioned, the root starts at the end of the buffer.
+The last uint8_t is the width in bytes of the root (normally the parent
+determines the width, but the root has no parent). The uint8_t before this is
+the type of the root, and the bytes before that are the root value (of the
+number of bytes specified by the last byte).
+
+So for example, the integer value `13` as root would be:
+
+    uint8_t 13, 4, 1    // Value, type, root byte width.
+
+
 <br>
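
The type byte and root layout described in the Internals.md hunk above can be decoded with a short sketch. It is an illustration only, not the reader in `flexbuffers.h`: `PackedType`, `UnpackType` and `ReadRootInt` are hypothetical names, the mapping of the two low bits to widths (0, 1, 2, 3 meaning 1, 2, 4, 8 bytes) is an assumption extrapolated from the "value 0 means 8 bit" example, and the root reader assumes the root is an inline integer without checking the type.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical struct for illustration; exact type values are defined in
// flexbuffers.h and are not reproduced here.
struct PackedType {
  uint8_t type;        // upper 6 bits of the type byte
  uint8_t byte_width;  // child width in bytes, from the lower 2 bits
};

// Split a type byte into its two components. Assumes the lower 2 bits
// hold log2 of the byte width (0 -> 1 byte, 3 -> 8 bytes).
inline PackedType UnpackType(uint8_t packed) {
  PackedType t;
  t.type = static_cast<uint8_t>(packed >> 2);
  t.byte_width = static_cast<uint8_t>(1u << (packed & 3u));
  return t;
}

// Read an inline integer root laid out as: value bytes (little endian),
// type byte, root byte width. Returns 0 if the buffer is too small.
// The type byte is ignored here; a real reader would inspect it.
inline int64_t ReadRootInt(const uint8_t *buf, size_t size) {
  if (size < 3) return 0;
  const size_t width = buf[size - 1];
  if (width == 0 || width > 8 || size < width + 2) return 0;
  const uint8_t *value = buf + size - 2 - width;
  uint64_t u = 0;
  for (size_t i = 0; i < width; i++) {
    u |= static_cast<uint64_t>(value[i]) << (8 * i);
  }
  // Sign-extend from the stored width to 64 bits.
  if (width < 8 && (u & (1ull << (8 * width - 1)))) {
    u |= ~0ull << (8 * width);
  }
  return static_cast<int64_t>(u);
}
```

Applied to the example bytes `13, 4, 1`, the last byte gives a root width of 1, the type byte `4` unpacks to type 1 with an unused 1-byte child width, and the remaining byte is the value 13.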
--- a/docs/source/doxyfile
+++ b/docs/source/doxyfile
@@ -759,6 +759,7 @@ INPUT = "FlatBuffers.md" \
        "Support.md" \
        "Benchmarks.md" \
        "WhitePaper.md" \
+        "FlexBuffers.md" \
        "Internals.md" \
        "Grammar.md" \
        "../../CONTRIBUTING.md" \
--- a/docs/source/doxygen_layout.xml
+++ b/docs/source/doxygen_layout.xml
@@ -37,6 +37,8 @@
          title="Use in PHP"/>
      <tab type="user" url="@ref flatbuffers_guide_use_python"
          title="Use in Python"/>
+      <tab type="user" url="@ref flexbuffers"
+          title="Schema-less version"/>
    </tab>
    <tab type="user" url="@ref flatbuffers_support"
        title="Platform / Language / Feature support"/>
--- a/include/flatbuffers/flexbuffers.h
+++ b/include/flatbuffers/flexbuffers.h
--- a/tests/test.cpp
+++ b/tests/test.cpp
@@ -27,6 +27,8 @@
  #include <random>
 #endif

+#include "flatbuffers/flexbuffers.h"
+
 using namespace MyGame::Example;

 #ifdef __ANDROID__
@@ -491,8 +493,6 @@ void ReflectionTest(uint8_t *flatbuf, size_t length) {
  TEST_NOTNULL(pos_table_ptr);
  TEST_EQ_STR(pos_table_ptr->name()->c_str(), "MyGame.Example.Vec3");

-
-
  // Now use it to dynamically access a buffer.
  auto &root = *flatbuffers::GetAnyRoot(flatbuf);

@@ -1360,6 +1360,66 @@ void ConformTest() {
  test_conform("enum E:byte { B, A }", "values differ for enum");
 }

+void FlexBuffersTest() {
+  flexbuffers::Builder slb(512,
+                           flexbuffers::BUILDER_FLAG_SHARE_KEYS_AND_STRINGS);
+
+  // Write the equivalent of:
+  // { vec: [ -100, "Fred", 4.0 ], bar: [ 1, 2, 3 ], foo: 100 }
+  slb.Map([&]() {
+    slb.Vector("vec", [&]() {
+      slb += -100;  // Equivalent to slb.Add(-100) or slb.Int(-100);
+      slb += "Fred";
+      slb.IndirectFloat(4.0f);
+    });
+    std::vector<int> ints = { 1, 2, 3 };
+    slb.Add("bar", ints);
+    slb.FixedTypedVector("bar3", ints.data(), ints.size());  // Static size.
+    slb.Double("foo", 100);
+    slb.Map("mymap", [&]() {
+      slb.String("foo", "Fred");  // Testing key and string reuse.
+    });
+  });
+  slb.Finish();
+
+  for (size_t i = 0; i < slb.GetBuffer().size(); i++)
+    printf("%d ", slb.GetBuffer().data()[i]);
+  printf("\n");
+
+  auto map = flexbuffers::GetRoot(slb.GetBuffer()).AsMap();
+  TEST_EQ(map.size(), 5);
+  auto vec = map["vec"].AsVector();
+  TEST_EQ(vec.size(), 3);
+  TEST_EQ(vec[0].AsInt64(), -100);
+  TEST_EQ_STR(vec[1].AsString().c_str(), "Fred");
+  TEST_EQ(vec[1].AsInt64(), 0);  // Number parsing failed.
+  TEST_EQ(vec[2].AsDouble(), 4.0);
+  TEST_EQ(vec[2].AsString().IsTheEmptyString(), true);  // Wrong Type.
+  TEST_EQ_STR(vec[2].AsString().c_str(), "");  // This still works though.
+  TEST_EQ_STR(vec[2].ToString().c_str(), "4");  // Or have it converted.
+  auto tvec = map["bar"].AsTypedVector();
+  TEST_EQ(tvec.size(), 3);
+  TEST_EQ(tvec[2].AsInt8(), 3);
+  auto tvec3 = map["bar3"].AsFixedTypedVector();
+  TEST_EQ(tvec3.size(), 3);
+  TEST_EQ(tvec3[2].AsInt8(), 3);
+  TEST_EQ(map["foo"].AsUInt8(), 100);
+  TEST_EQ(map["unknown"].IsNull(), true);
+  auto mymap = map["mymap"].AsMap();
+  // These should be equal by pointer equality, since key and value are shared.
+  TEST_EQ(mymap.Keys()[0].AsKey(), map.Keys()[2].AsKey());
+  TEST_EQ(mymap.Values()[0].AsString().c_str(), vec[1].AsString().c_str());
+  // We can mutate values in the buffer.
+  TEST_EQ(vec[0].MutateInt(-99), true);
+  TEST_EQ(vec[0].AsInt64(), -99);
+  TEST_EQ(vec[1].MutateString("John"), true);  // Size must match.
+  TEST_EQ_STR(vec[1].AsString().c_str(), "John");
+  TEST_EQ(vec[1].MutateString("Alfred"), false);  // Too long.
+  TEST_EQ(vec[2].MutateFloat(2.0f), true);
+  TEST_EQ(vec[2].AsFloat(), 2.0f);
+  TEST_EQ(vec[2].MutateFloat(3.14159), false);  // Double does not fit in float.
+}
+
 int main(int /*argc*/, const char * /*argv*/[]) {
  // Run our various test suites:

@@ -1399,6 +1459,8 @@ int main(int /*argc*/, const char * /*argv*/[]) {
  ParseUnionTest();
  ConformTest();

+  FlexBuffersTest();
+
  if (!testing_fails) {
    TEST_OUTPUT_LINE("ALL TESTS PASSED");
    return 0;