wmc v0.1

2022 september 6
ian henderson <ian@ianhenderson.org>

introduction

wmc is an experiment in self-describing file format design.

  1. a wmc file begins with a list of webassembly functions to compile.
  2. these functions can read bytes from the file, produce output, and compile further functions.
  3. the resulting output is structured and hierarchical, with arrays, key-value associations, numbers, booleans, and strings (like json).
  4. wmc isn't turing complete. loops and direct function calls are disallowed, and each function must either read bytes or produce output. functions call each other by returning the index of the next function to call.
  5. producing an array starts a new decoding process, allowing arrays to be decoded in parallel (or array decoding to be restarted on demand, for things like seeking within a video).

demonstration

wmc.img is a simple image format built using wmc. try it out using the web-based decoder!

how to decode a wmc file

the initial decoding process begins with three preinstalled functions.

the function preinstalled at index 0 reads one byte (the length) to memory address 0, then continues in the function at index 1.

the function preinstalled at index 1 reads 1 + length bytes (a one-byte index plus a length-byte function) to memory address 1, then continues in the function at index 2.

the function preinstalled at index 2 compiles the function data starting at memory address 2 and installs it at the index stored at memory address 1, potentially replacing one of these three preinstalled functions. it then reads one byte (the next length) to memory address 0 and continues in the function at index 1.

execution starts in the function at index 0.

practically, this means a wmc file begins with a list of functions encoded as a one-byte length, a one-byte index, and length bytes of function data. each function is compiled and installed at the given index. after installing a function at index 1, the preinstalled function at index 2 will read one more byte, then continue in the just-installed function, ending the list.
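the list layout described above can be walked without running any webassembly. the following is a minimal python sketch, not the reference decoder: parse_function_list is a hypothetical name, and it assumes the file never replaces the preinstalled function at index 2 before installing a function at index 1.

```python
def parse_function_list(data: bytes):
    """Split the leading function list of a wmc file.

    Returns ([(index, function_data), ...], pos), where pos is the offset of
    the first input byte the newly installed function 1 will see.
    Assumption: the preinstalled function at index 2 is never replaced.
    """
    functions = []
    pos = 0
    while True:
        if pos + 2 > len(data):
            raise ValueError("file ended inside the function list")
        length = data[pos]        # one-byte length of the function data
        index = data[pos + 1]     # one-byte index to install the function at
        body = data[pos + 2 : pos + 2 + length]
        if len(body) != length:
            raise ValueError("function data is truncated")
        functions.append((index, body))
        pos += 2 + length
        if index == 1:
            # the preinstalled function 2 still reads one more byte (to memory
            # address 0) before handing over to the just-installed function 1
            return functions, min(pos + 1, len(data))
```

the returned position already accounts for the extra byte consumed after the last entry, so decoding of the remaining input would continue from there.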

function encoding

function data is encoded according to the webassembly specification—specifically, the func production in the binary encoding of the code section. that is, it's a vector of local variable declarations followed by a sequence of instructions ending with the byte 0x0B.

functions must take no parameters and return five i32 values:

  1. the index of the next function to execute,
  2. the number of bytes to read from the input before continuing in the next function,
  3. the webassembly memory address to write the input into,
  4. the number of output elements to produce, and
  5. the webassembly memory address to read the output elements from.

for example, returning the values 2, 10, 32, 0, and 0 will output nothing, copy 10 bytes from the input into bytes 32-41 of the webassembly memory, then continue in the function at index 2.

the length of each output element depends on its type. for example, in an array of u16 values, each output will be two bytes long; producing ten outputs will read twenty bytes from webassembly memory. see the output element encoding section for a table of types.

either the number of bytes to read from the input or the number of output elements to produce must be greater than zero. if both values are less than or equal to zero, decoding will stop. together with the restrictions on valid instructions, this guarantees decoding will always make progress. if both values are greater than zero, output is produced before input is read (so memory used for input and output can overlap).

as a special case, an output address of -1 will read elements directly from the input instead of from webassembly memory.

if a function attempts to read more bytes than are left in the file, only the bytes remaining in the file will be read (the rest of the memory will be left alone). if there are no bytes left in the file, and no output elements are produced, then decoding will stop.
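as a rough illustration of the rules above, this python sketch applies one set of five return values to a decoder's state. apply_result and its parameters are hypothetical names, element_size stands for the byte length of the current array's element type, and the -1 output address special case is deliberately left out.

```python
def apply_result(result, memory: bytearray, data: bytes, pos: int, element_size: int):
    """Apply the five i32 values returned by a function.

    Returns (next_index, new_pos, output_bytes, stopped).
    """
    next_index, read_count, read_addr, out_count, out_addr = result

    # neither reading nor producing output: decoding stops
    if read_count <= 0 and out_count <= 0:
        return next_index, pos, b"", True

    # output is produced first, so input may safely overwrite the same memory
    output = b""
    if out_count > 0:
        if out_addr == -1:
            raise NotImplementedError("output read directly from the input is not sketched")
        output = bytes(memory[out_addr : out_addr + out_count * element_size])

    # then input is read; a short read at the end of the file is allowed
    chunk = b""
    if read_count > 0:
        chunk = data[pos : pos + read_count]
        if read_addr < 0 or read_addr + len(chunk) > len(memory):
            raise IndexError("read would fall outside of memory")
        memory[read_addr : read_addr + len(chunk)] = chunk
        pos += len(chunk)

    # no bytes left in the file and no output produced: decoding stops
    stopped = out_count <= 0 and len(chunk) == 0
    return next_index, pos, output, stopped
```

with the values 2, 10, 32, 0, and 0 from the example above, this copies ten input bytes to addresses 32-41, produces no output, and reports 2 as the next function index.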

the compile function, invoked from webassembly as call 0, compiles and installs function data from webassembly memory. compile has no return value and takes three i32 parameters:

  1. a memory offset where function data can be found,
  2. the length of the function data, and
  3. the index at which to install the compiled function. this index may be between 0 and 65535, inclusive.

function data must contain no calls to anything other than the compile function, no call_indirect instructions, and no loop instructions. function data containing these instructions will fail to compile and stop the decoding process.

output element encoding

data is always read from the input as bytes, but output can be produced in a variety of types. values are encoded using up to 32 bytes, depending on their type:

tag  type    description                               encoding (byte offsets within the element)
0    s8      signed (two's-complement) 8-bit integer   n: byte 0
1    u8      unsigned 8-bit integer                    n: byte 0
2    s16     signed 16-bit (little-endian) integer     n: bytes 0-1
3    u16     unsigned 16-bit integer                   n: bytes 0-1
4    s32     signed 32-bit integer                     n: bytes 0-3
5    u32     unsigned 32-bit integer                   n: bytes 0-3
6    s64     signed 64-bit integer                     n: bytes 0-7
7    u64     unsigned 64-bit integer                   n: bytes 0-7
8    f32     32-bit float                              n: bytes 0-3
9    f64     64-bit float                              n: bytes 0-7
10   bool    1 or 0                                    b: byte 0
11   string  memory address of zero-terminated string  address: starting at byte 0
12   array   see [1]                                   number of elements: bytes 0-7; byte offset in input data: bytes 8-15; func: bytes 16-17; mlen: bytes 18-19; type tag: bytes 20-23
13   pair    name-value pair; see [2]                  type tag: bytes 0-3; name: bytes 4-7; value: bytes 8-31

[1] producing an array copies the first mlen bytes of webassembly memory into a new memory object, then creates a new decoding process with this memory and the same installed functions as the current process. the new decoding process jumps ahead in the input data by the number of bytes given in the "byte offset in input data" field and starts decoding in the function whose index is given by the func field. the resulting array is made from the elements produced by this decoding process, which all have the type indicated by the type tag. the decoding process stops as soon as it has produced as many elements as the "number of elements" field specifies.

[2] the name in a name-value pair is the address of a zero-terminated string in webassembly memory. the value is a value of any other type, indicated by the type tag, except for another name-value pair (they can't be nested). the encoding of the value begins at byte 8; bytes beyond the length of the value's encoding are ignored. an array of name-value pairs is similar to a json object.
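the two composite encodings described in [1] and [2] can be unpacked with python's struct module. this is a sketch under assumptions: the function names are hypothetical, the field widths (8, 8, 2, 2, 4 bytes for arrays; 4, 4, 24 bytes for pairs) are taken from the annotated example in this document, and all fields are treated as unsigned little-endian integers.

```python
import struct

def parse_array_descriptor(raw: bytes) -> dict:
    """Unpack a 24-byte array descriptor (type tag 12)."""
    count, offset, func, mlen, type_tag = struct.unpack("<QQHHI", raw[:24])
    return {"count": count, "offset": offset, "func": func,
            "mlen": mlen, "type_tag": type_tag}

def read_c_string(memory, address: int) -> str:
    """Read a zero-terminated string from memory (ascii assumed)."""
    end = memory.index(0, address)   # position of the terminating zero byte
    return bytes(memory[address:end]).decode("ascii")

def parse_pair(raw: bytes, memory):
    """Unpack a 32-byte name-value pair (type tag 13).

    Returns (value_type_tag, name, value_bytes); the value's own encoding
    starts at byte 8 and unused trailing bytes are simply carried along.
    """
    type_tag, name_addr = struct.unpack("<II", raw[:8])
    if type_tag == 13:
        raise ValueError("name-value pairs can't be nested")
    return type_tag, read_c_string(memory, name_addr), bytes(raw[8:32])
```

the caller would interpret the returned value bytes according to the value's type tag, for example as a little-endian u32 for tag 5.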

decoding begins as if in an array with a single element of type tag 12 (array). so the first (and only) element produced by the initial decoding process must be a 24-byte array descriptor. the array described therein will be considered the root array of the final output.

if a function produces multiple output elements, those elements are read from webassembly memory without gaps. that is: producing 10 booleans will read 10 bytes; producing 10 name-value pairs will read 320 bytes. name-value pairs are always 32 bytes long, no matter what the value type is.
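the byte counts above follow from a fixed size per type tag. a small python sketch of that arithmetic, with ELEMENT_SIZE and output_byte_count as hypothetical names; the string type (tag 11) is left out because its encoded width is not stated in this document.

```python
# bytes read from webassembly memory per output element, by type tag
ELEMENT_SIZE = {
    0: 1, 1: 1,      # s8, u8
    2: 2, 3: 2,      # s16, u16
    4: 4, 5: 4,      # s32, u32
    6: 8, 7: 8,      # s64, u64
    8: 4, 9: 8,      # f32, f64
    10: 1,           # bool
    12: 24,          # array descriptor
    13: 32,          # name-value pair, regardless of the value type
}

def output_byte_count(type_tag: int, count: int) -> int:
    """Number of contiguous bytes read when producing `count` elements."""
    return ELEMENT_SIZE[type_tag] * count
```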

annotated example

this is the simple-pattern.wmc.img file, which you can view using the wmc.img decoder. the hexadecimal numbers on the left are bytes as you'd see them in a hex editor. webassembly instruction names and comments appear on the right.

0C 03                   compile the following 12-byte function and install it at index 3
00                      this function has no local variables
4100                    i32.const 0
4100                    i32.const 0
4100                    i32.const 0
4101                    i32.const 1
4100                    i32.const 0
0B                      end the function, returning the values [0, 0, 0, 1, 0]
                         that is, produce one output, encoded at memory address 0
                         since the root array only has one element, the decoding process stops

0C 04                   compile the following 12-byte function and install it at index 4
00                      this function has no local variables
4100                    i32.const 0
4100                    i32.const 0
4100                    i32.const 0
4103                    i32.const 3
4118                    i32.const 24
0B                      end the function, returning the values [0, 0, 0, 3, 24]
                         that is, produce three outputs, encoded starting at memory address 24
                         all three elements of the array are produced, and the decoding process stops

1F 05                   compile the following 31-byte function and install it at index 5
00                      this function has no local variables
41DE00                  i32.const 94
41DE00                  i32.const 94
2F0000                  i32.load16_u align:0 offset:0
413C                    i32.const 60
6A                      i32.add
4105                    i32.const 5
6C                      i32.mul
3B0000                  i32.store16 align:0 offset:0
4105                    i32.const 5
4100                    i32.const 0
4100                    i32.const 0
413F                    i32.const 63
41DE00                  i32.const 94
0B                      end the function, returning the values [5, 0, 0, 63, 94]
                         that is, produce 63 outputs, encoded starting at memory address 94, and continue in function 5
                         this writes a bunch of bytes that happen to be in memory as output
                         while modifying the bytes each time (add 60 and multiply by 5)

0D 01                   compile the following 13-byte function and install it at index 1
00                      this function has no local variables
4103                    i32.const 3
418901                  i32.const 137
4101                    i32.const 1
4100                    i32.const 0
4100                    i32.const 0
0B                      end the function, returning the values [3, 137, 1, 0, 0]
                         that is, read 137 bytes, write those bytes into memory starting at address 1, and continue in function 3

                        the following 138 bytes are read into memory (the first byte by the preinstalled function at index 2, the other 137 by the function at index 1);
                        the function at index 3 then outputs the first 24 of them as the root array descriptor:

0300000000000000        this encodes an array with three elements
0000000000000000        whose data begins after an offset of 0 bytes
0400                    the decoding process begins in function 4
8900                    with a copy of the first 137 bytes of memory
0D000000                and the array contains name-value pairs (type tag 13)

05000000                this encodes a name-value pair with a u32 value (type tag 5)
78000000                whose name is the string at address 120 ("width")
E8030000                and whose value is 1000
0000000000000000        (the value is padded out to 24 bytes)
0000000000000000
00000000

05000000                this encodes a name-value pair with a u32 value (type tag 5)
7E000000                whose name is the string at address 126 ("height")
64000000                and whose value is 100
0000000000000000        (the value is padded out to 24 bytes)
0000000000000000
00000000

0C000000                this encodes a name-value pair with an array value (type tag 12)
85000000                whose name is the string at address 133 ("rgba")
801A060000000000        and whose value is an array with 400000 elements
0000000000000000        the data of which begins after an offset of 0 bytes
0500                    the array's decoding process begins in function 5
8900                    with a copy of the first 137 bytes of memory
01000000                and the array contains values of type u8 (type tag 1)

776964746800            the string "width" in ascii

68656967687400          the string "height"

7267626100              the string "rgba"
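
the function installed at index 5 in the example can be mimicked in python to see what one invocation does. this is only a sketch: run_function_5 is a hypothetical name, memory is assumed to be a bytearray large enough to cover addresses 94 through 156, and the multiplier of 5 follows the comments in the listing.

```python
def run_function_5(memory: bytearray):
    """Imitate one call of the example's function 5.

    Returns (next_index, output_bytes) for 63 u8 elements.
    """
    value = int.from_bytes(memory[94:96], "little")   # i32.load16_u at address 94
    value = ((value + 60) * 5) & 0xFFFF               # i32.add, i32.mul; store16 keeps the low 16 bits
    memory[94:96] = value.to_bytes(2, "little")       # i32.store16 at address 94
    output = bytes(memory[94 : 94 + 63])              # 63 one-byte elements starting at address 94
    return 5, output
```

because the function returns its own index, it would be invoked again and again, each time producing 63 more bytes, until the array's element count is reached.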