sutf8: a structured text encoding

An sutf8 string is a UTF-8 string which can also contain balanced pairs of 0xFE and 0xFF octets. These pairs give the string a tree structure.

For example, if name is a valid UTF-8 string,

"select * from users where name=\xFE"+name+"\xFF;"

is a valid sutf8 string representing a database query. The tree structure of this string is

select * from users where name= ;
                               |
                               John (for example).
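
One way to recover that tree in Python is sketched below. The name parse_tree is hypothetical, and nested lists are just one possible in-memory representation; nothing here is prescribed by sutf8 itself.

```python
import re

def parse_tree(s: bytes) -> list:
    """Parse an sutf8 octet string into nested lists of str (a sketch)."""
    root = []
    stack = [root]
    # Splitting on a captured group keeps the grouping octets as pieces.
    for piece in re.split(b"([\xfe\xff])", s):
        if piece == b"\xfe":
            child = []                 # 0xFE opens a new subtree
            stack[-1].append(child)
            stack.append(child)
        elif piece == b"\xff":
            if len(stack) == 1:
                raise ValueError("0xFF without a matching 0xFE")
            stack.pop()                # 0xFF closes the current subtree
        elif piece:
            # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
            stack[-1].append(piece.decode("utf-8"))
    if len(stack) != 1:
        raise ValueError("0xFE without a matching 0xFF")
    return root

tree = parse_tree(b"select * from users where name=\xfeJohn\xff;")
# tree == ["select * from users where name=", ["John"], ";"]
```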

Since the grouping octets 0xFE and 0xFF can't appear in a valid UTF-8 string, name can neither open nor close a group. As long as name has been validated as UTF-8, this method of embedding is secure against injection attacks.
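
Here is a minimal sketch of that embedding in Python. The helper name embed is hypothetical; the important part is that the value is checked with a strict UTF-8 decode before it is wrapped.

```python
OPEN, CLOSE = b"\xfe", b"\xff"  # the sutf8 grouping octets

def embed(value: bytes) -> bytes:
    """Wrap a UTF-8 octet string in a grouping pair (hypothetical helper)."""
    # Python's default (strict) decoding raises UnicodeDecodeError on invalid
    # UTF-8, which includes any 0xFE or 0xFF octet, so a validated value
    # cannot add structure of its own.
    value.decode("utf-8")
    return OPEN + value + CLOSE

name = "John'; drop table users; --".encode("utf-8")
query = b"select * from users where name=" + embed(name) + b";"
```

However hostile the name looks, it stays inside its own group. An input that tries to smuggle in a grouping octet is rejected by the decode step instead of being embedded.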

motivation

The sutf8 encoding was designed to solve the "escaping" problem that tends to afflict structured textual formats. Take HTML, for example. When generating HTML, writing code like

print_html(user_name + " wrote " + user_comment)

is usually a mistake. Both user_name and user_comment will appear on the resulting page as full HTML, not just innocent plain text. For example, if user_name includes a <div style="color: red"> tag, even text outside the user name could turn red. If user_comment includes a <script> tag, the JavaScript inside that tag will run, giving the user total control over your web page. Or the user may simply have typed < to mean "less than", and will be confused when it's interpreted as the start of an HTML tag.

If you want user_name and user_comment to be treated as plain text, you have to escape them, replacing structuring characters that have special meanings (like <) with their "escaped" equivalents (like &lt;). It's only safe to concatenate or substitute a string once it's free of these structuring characters.

And it's not just HTML: any textual data format with structuring characters has this problem. At some point, you want to use one of the structuring characters as a normal character—then what?

To solve this problem, sutf8 uses structuring bit patterns that can't appear as characters in a normal UTF-8 string. Since these bit patterns can't appear as normal characters, there's no need to escape them: the UTF-8 validation procedure built into your programming language has already "escaped" your string for you. [1]

Going back to our HTML example, if we had used "SUML" instead of HTML (where tags are written like "\xFEtag\xFF" rather than "<tag>"), the code

print_html(user_name + " wrote " + user_comment)

would work as expected. The fact that user_name and user_comment are valid UTF-8 strings means they can't include any structure. All UTF-8 strings become plain text in an sutf8 format.

[1] This is related to the classic type-based solution, where you define a SafeString type to represent escaped strings. In sutf8, that SafeString type is just "valid UTF-8 string".

formal definition

An sutf8 string is an octet string S where:

  1. S can be broken down as S = concat(p1, …, pn), where each substring pi is either the single-octet string "\xFE", the single-octet string "\xFF", or a valid UTF-8 string.
  2. The number of 0xFE octets in S is equal to the number of 0xFF octets in S.
  3. Every 0xFF can be matched with an earlier 0xFE—that is, each prefix of S contains no more 0xFF octets than 0xFE octets.

The last two properties, which together say that all the 0xFE and 0xFF octets are properly matched, let you concatenate two sutf8 strings together or substitute one into another without causing structural problems.
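
The three properties translate fairly directly into a checker. The following Python sketch is one possible reading of the definition, not a reference implementation, and the name is_sutf8 is made up for illustration.

```python
import re

def is_sutf8(s: bytes) -> bool:
    """Check the three properties of the definition (illustrative sketch)."""
    depth = 0  # number of 0xFE octets seen minus number of 0xFF octets seen
    # Splitting on a captured group keeps the grouping octets as pieces.
    for piece in re.split(b"([\xfe\xff])", s):
        if piece == b"\xfe":
            depth += 1
        elif piece == b"\xff":
            depth -= 1
            if depth < 0:
                return False          # property 3: 0xFF with no earlier 0xFE
        else:
            try:
                piece.decode("utf-8")  # property 1: the rest is valid UTF-8
            except UnicodeDecodeError:
                return False
    return depth == 0                  # property 2: equal counts
```

Because the check only tracks a running depth, two strings that pass it also pass when concatenated, which is the closure property described above.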

sutf8 object notation

sutf8 object notation, or SUON, is a round-trippable encoding for JSON data which uses sutf8 grouping octets in place of quotation marks.

Here's what JSON objects and strings look like in SUON form, with ( and ) representing 0xFE and 0xFF octets:

    {"a":b, "c":d} -> {(a):b, (c):d}
"this is a string" -> (this is a string)
 "quotation\"mark" -> (quotation"mark)

Since UTF-8 strings can't contain grouping octets, there's no need for escape sequences in SUON. This makes SUON easier to produce (you can output string keys and values directly) and faster to decode (you never have to copy strings which include escape sequences).
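
To illustrate how direct the encoding is, here is a rough Python sketch of a SUON encoder for data that came out of json.loads. The function name to_suon is hypothetical, and the handling of arrays, numbers, true, false and null (kept exactly as in JSON) is an assumption, since only objects and strings are described above.

```python
import json

OPEN, CLOSE = b"\xfe", b"\xff"  # grouping octets used in place of quotes

def to_suon(value) -> bytes:
    """Encode JSON-style data as SUON octets (illustrative sketch)."""
    if isinstance(value, str):
        # No escaping: the UTF-8 octets of the string are written as-is.
        return OPEN + value.encode("utf-8") + CLOSE
    if isinstance(value, dict):
        items = (to_suon(k) + b":" + to_suon(v) for k, v in value.items())
        return b"{" + b",".join(items) + b"}"
    if isinstance(value, list):
        return b"[" + b",".join(to_suon(v) for v in value) + b"]"
    # Assumption: numbers, true, false and null keep their JSON spelling.
    return json.dumps(value).encode("ascii")

data = json.loads('{"a": 1, "c": "quotation\\"mark"}')
encoded = to_suon(data)
```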

display and editing

While sutf8 puts constraints on the placement of the 0xFE and 0xFF grouping octets, it's still fundamentally a textual format. Displaying sutf8 is like displaying UTF-8, but with special grouping glyphs rendered in place of 0xFE and 0xFF.
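
As a small sketch of the display side in Python: the glyphs ⟦ and ⟧ and the function name render are arbitrary choices made for this example, not something sutf8 specifies.

```python
def render(s: bytes, open_glyph: str = "⟦", close_glyph: str = "⟧") -> str:
    """Turn sutf8 octets into displayable text (illustrative sketch)."""
    # 0xFE and 0xFF never occur inside valid UTF-8 text, so replacing them
    # octet by octet cannot damage a multi-octet character.
    text = s.replace(b"\xfe", open_glyph.encode("utf-8"))
    text = text.replace(b"\xff", close_glyph.encode("utf-8"))
    # Strict decoding still rejects any octets that were not valid UTF-8.
    return text.decode("utf-8")

shown = render(b"select * from users where name=\xfeJohn\xff;")
```

This only handles rendering; it does not check that the grouping octets are balanced.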

To enter a 0xFE or 0xFF octet, an sutf8 editor could use a key shortcut (e.g., ctrl+[/ctrl+]) or a repeated key press (e.g., [[/]]). An sutf8 editing system should make sure each grouping octet is properly paired whenever a text field is validated (on blur or submit) and whenever a document is saved.

related work

D. J. Bernstein's netstrings solve the escaping problem by including the length as a prefix of each embedded string. Netstrings are easier to decode than sutf8 strings, but they're slightly harder to encode, since you need to know the length of a string in order to embed it.
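
The encoding difference can be sketched in Python as follows. Both function names are made up for this comparison, and the sutf8 side leaves out UTF-8 validation to keep the contrast visible.

```python
from typing import Iterable, Iterator

def netstring(payload: bytes) -> bytes:
    # Netstring framing: decimal length, a colon, the payload, a comma.
    # The whole payload must be in hand before the first octet is written.
    return str(len(payload)).encode("ascii") + b":" + payload + b","

def sutf8_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    # sutf8 framing: the group can be opened before the length is known.
    # Assumption: the chunks together form valid UTF-8 (not checked here).
    yield b"\xfe"
    for chunk in chunks:
        yield chunk
    yield b"\xff"

framed = netstring(b"hello world!")
streamed = b"".join(sutf8_stream([b"hello ", b"world!"]))
```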

R. Rivest's Canonical S-Expressions support various forms of quoting, including quoted strings with escaping and netstring-style length prefixing.

Ian Henderson <ian@ianhenderson.org>
originally posted on 20 may 2018; last edited on 4 jun 2018