TAML Grammar Reference
Hint
This page is aimed at format support implementors.
For a user manual (even when using a TAML library as developer), see TAML by Example.
TK: Use singular for headings.
All grammar is defined in terms of Unicode codepoint identity.
Where available, the canonical binary or at-rest encoding of TAML is UTF-8, while its runtime text-API representation should use the canonical representation of arbitrary Unicode strings in the target ecosystem.
Note
Where no standard Unicode text representation exists, it’s likely best to provide only a binary UTF-8 API.
Whitespace
Note
TK: Format as regex section
[ \t]+
Whitespace is meaningless except when separating otherwise-joined tokens.
Note that line breaks are not included here.
Comment
Note
TK: Format as regex section
//[^\r\n]+
At (nearly) any point in the document, a line comment can be written as follows:
// This is a comment. It stretches for the rest of the line.
// This is another comment.
The only limitation to comment placement is that the line up to that point must be otherwise complete.
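As a rough illustration of the comment regex above, the following sketch checks whether a line consists only of optional leading whitespace and a comment. The helper name `is_comment_only_line` is made up for this example and is not part of any TAML library.

```python
import re

# Comment regex copied from this page. As written, it requires at least
# one character after the two slashes.
COMMENT = re.compile(r"//[^\r\n]+")


def is_comment_only_line(line: str) -> bool:
    """Return True if the line is optional whitespace followed by a comment.

    `line` is expected to be a single line without its line break.
    """
    stripped = line.lstrip(" \t")  # whitespace on this page is [ \t]+
    return COMMENT.fullmatch(stripped) is not None
```

This does not attempt to find comments that follow other tokens on the same line, since that requires real lexing (a `//` inside a quoted string is not a comment).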
Line break
Note
TK: Format as regex section
\r?\n
TAML does not use commas to delineate values, outside of inline lists and rows.
Instead, line breaks are a grammar token that separates comments, headings, key-value pairs and table rows.
Note
“Line break” more specifically refers to Unicode code point U+000A LINE FEED (LF), which can optionally be prefixed with a single U+000D CARRIAGE RETURN (CR).
This is the only position in which verbatim carriage return characters are legal.
Note that occurrences of the line feed character in quotes are not considered to be a line break token!
Correct the literal in question by either replacing all verbatim carriage return characters with \r or deleting them.
Empty lines outside of quotes and lines containing only a comment can always be removed without changing the structure or contents of the document.
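To make the carriage return rule concrete, here is a minimal sketch that reports every CR that is not directly followed by LF. The function name `find_bare_carriage_returns` is hypothetical. The sketch deliberately does not look inside quotes, so a CR LF pair inside a quoted literal is not reported even though the CR there is not part of a line break token.

```python
import re

# Line break token as given on this page: LF, optionally preceded by one CR.
LINE_BREAK = re.compile(r"\r?\n")

# A CR that is not immediately followed by LF can never be part of that token.
BARE_CR = re.compile(r"\r(?!\n)")


def find_bare_carriage_returns(text: str) -> list[int]:
    """Return the offsets of all CR characters not directly followed by LF."""
    return [match.start() for match in BARE_CR.finditer(text)]
```

A caller could use the returned offsets to produce a diagnostic that points at each offending character.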
Hint
taml fmt preserves single empty lines but collapses longer blank parts of the document.
taml fix can fix your line endings for you without changing the meaning of quotes. (TODO)
It warns about any occurrence of the character it doesn’t fix by default, in either sense. (TODO)
Identifier
Note
TK: Format as regex section
[a-zA-Z_][a-zA-Z\-_0-9]*
`([^\\`\r]|\\\\|\\`|\\r)*`
Identifiers in TAML are arbitrary Unicode strings and can appear in two forms, verbatim and quoted:
Verbatim
Verbatim identifiers must start with an ASCII letter or underscore (_). They may contain only those codepoints plus ASCII digits and the hyphen-minus character (-).
Hint
Support for - is a compatibility affordance.
When outlining a new configuration structure, I recommend for example a_b over a-b, as the former is treated as a single “word” by most text editors. (Try double-clicking each.)
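The verbatim form can be checked directly with the first regex from the identifier section. The sketch below assumes the whole candidate token has already been isolated, and the helper name `is_verbatim_identifier` is invented for illustration.

```python
import re

# Verbatim identifier regex copied from this page.
VERBATIM_IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z\-_0-9]*")


def is_verbatim_identifier(token: str) -> bool:
    """Return True if the entire token is a valid verbatim identifier."""
    return VERBATIM_IDENTIFIER.fullmatch(token) is not None
```

Anything that fails this check, including the empty identifier, would have to be written in the quoted form instead.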
Quoted
Backtick (`)-quoted identifiers are parsed as completely arbitrary Unicode strings.
Only the following characters are backslash-escaped:
\ as \\
` as \`
All other sequences starting with a backslash are invalid in quoted strings and must lead to an error.
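A possible way to decode a quoted identifier is sketched below. It is an assumption-laden illustration: the function name is hypothetical, and the `\r` escape is accepted only because the regex at the top of this section lists it, even though the prose here names just the backslash and the backtick.

```python
# Escape table for quoted identifiers. The "r" entry is an assumption taken
# from the regex on this page, not from the prose list of escapes.
IDENTIFIER_ESCAPES = {"\\": "\\", "`": "`", "r": "\r"}


def unescape_quoted_identifier(literal: str) -> str:
    """Decode a backtick-quoted identifier, raising ValueError on bad input."""
    if len(literal) < 2 or literal[0] != "`" or literal[-1] != "`":
        raise ValueError("not a backtick-quoted identifier")
    body = literal[1:-1]
    decoded = []
    index = 0
    while index < len(body):
        char = body[index]
        if char == "\\":
            # A backslash must be followed by one of the known escape characters.
            if index + 1 >= len(body):
                raise ValueError("dangling backslash")
            escaped = body[index + 1]
            if escaped not in IDENTIFIER_ESCAPES:
                raise ValueError(f"invalid escape sequence \\{escaped}")
            decoded.append(IDENTIFIER_ESCAPES[escaped])
            index += 2
        elif char == "`" or char == "\r":
            raise ValueError("unescaped backtick or carriage return")
        else:
            decoded.append(char)
            index += 1
    return "".join(decoded)
```

Raising on every unknown escape keeps the behaviour in line with the rule that all other backslash sequences must lead to an error.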
Warning
Identifiers formally may be empty or contain U+0000 NULL.
However, parsers for ecosystems where this cannot be safely supported are free to limit support here, as long as this limitation is prominently declared.
(A parser written in for example C# or Rust very much should support both, though. A parser written in C or C++ should consider not supporting NULL due to its common special meaning.)
TK: Define an error code that should be used here. Something like TAML-L0001?
Key
Only identifiers may be keys. Keys appear in section headers, enum variants and as part of key-value pairs like the following:
key: value
(value is a unit variant here, but could be replaced with any other value.)
Value
A value is any one of the following:
data literal, decimal, enum variant, integer, list, string, struct.
Warning
TAML processors should be as strict as is sensible regarding value types. For example, if a string is expected, don’t accept an integer, and vice versa.
In some cases, remapping TAML value types is a good idea, like when parsing rust_decimal values using Serde, which should still be written as decimals in TAML but internally processed as strings. Such remappings should be done explicitly on a case-by-case basis.
Integer
Note
TK: Format as regex section
-?(0|[1-9]\d*)
A whole number with base 10.
Note that -0 is legal and may be interpreted differently from 0.
Additional leading zeroes are disallowed to avoid confusion with languages and/or parsing systems where this would denote base 8.
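The integer rules can be sketched as follows. The function name and the (sign, magnitude) return shape are assumptions made for this example; the shape is chosen only so that -0 stays distinguishable from 0. The regex uses [0-9] where this page writes \d, on the assumption that only ASCII digits are meant, because Python's \d also matches other Unicode digits.

```python
import re

# Integer regex from this page, with \d spelled out as ASCII digits.
INTEGER = re.compile(r"-?(0|[1-9][0-9]*)")


def parse_integer(token: str) -> tuple[bool, int]:
    """Parse an integer token into (is_negative, magnitude).

    Keeping the sign separate preserves the difference between -0 and 0.
    Python integers are arbitrary precision, so no range check is needed here.
    """
    if INTEGER.fullmatch(token) is None:
        raise ValueError(f"not a TAML integer: {token!r}")
    is_negative = token.startswith("-")
    magnitude = int(token.lstrip("-"))
    return is_negative, magnitude
```

A consumer that has no use for negative zero can simply collapse the pair into a plain signed integer afterwards.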
Hint
If your configuration requires setting a bitfield, consider accepting it as a data literal instead, for example:
some_bitfield: <bits:1000_0001 1111_0000>
another_encoding: <hex:81 F0>
Decimal
Note
TK: Format as regex section
-?(0|[1-9]\d*)\.\d+
A fractional base 10 number.
Note that -0.0 is legal and may be interpreted differently from 0.0.
Additional leading zeroes are disallowed for consistency with integers. Additional trailing zeroes are considered idempotent and must not make a difference when parsing a value.
Note
Integers and decimals should be considered disjoint. Don’t accept one for the other unless not doing so would be unusually inconvenient.
Note
Decimals, like integers, are not required to fit any particular binary representation.
For example, they could be parsed and processed with arbitrary precision rather than as IEEE 754 float.
Warning
taml fmt removes idempotent trailing zeroes from decimals.
serde_taml excludes them while lexing, which also affects reserde.
Absolutely do not make any distinction regarding additional trailing zeroes in decimals when writing a lexer or parser.
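One way to honour the trailing-zero rule is to drop the extra zeroes while lexing, as sketched below. This is an illustration only, not a description of how serde_taml is implemented. The names `lex_decimal` and `parse_decimal` are hypothetical, and [0-9] replaces \d on the assumption that only ASCII digits are intended.

```python
import re
from decimal import Decimal

# Decimal regex from this page, with \d spelled out as ASCII digits.
DECIMAL = re.compile(r"-?(0|[1-9][0-9]*)\.[0-9]+")


def lex_decimal(token: str) -> str:
    """Validate a decimal token and strip additional trailing zeroes."""
    if DECIMAL.fullmatch(token) is None:
        raise ValueError(f"not a TAML decimal: {token!r}")
    trimmed = token.rstrip("0")
    if trimmed.endswith("."):
        # Keep one fractional digit so the result is still a decimal token.
        trimmed += "0"
    return trimmed


def parse_decimal(token: str) -> Decimal:
    """Parse a decimal token exactly, without going through binary floats."""
    return Decimal(lex_decimal(token))
```

Constructing `Decimal` from a string is exact, so this sketch does not tie the value to IEEE 754 precision, and the sign of -0.0 is kept.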
String
Note
TK: Format as regex section
"([^\\"\r]|\\\\|\\"|\\r)*"
Strings are written as quoted Unicode literals. The characters \, " and U+000D CARRIAGE RETURN (CR) must be escaped as \\, \" and \r, respectively.
The character U+0000 NULL may be unsupported in environments where processing it would be unreasonably error-prone.
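The string regex above can double as a validator, with the escapes resolved in a second pass. The sketch below assumes that any escape sequence not listed in the regex is an error, and the function name `parse_string` is made up for this example.

```python
import re

# String regex from this page, with a capture group around the body.
STRING = re.compile(r'"((?:[^\\"\r]|\\\\|\\"|\\r)*)"')

# The three escapes named in this section.
STRING_ESCAPES = {"\\": "\\", '"': '"', "r": "\r"}


def parse_string(literal: str) -> str:
    """Decode a quoted string literal, raising ValueError if it is malformed."""
    match = STRING.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a valid TAML string literal: {literal!r}")
    # The body is already validated, so every backslash starts a known escape.
    return re.sub(r"\\(.)", lambda esc: STRING_ESCAPES[esc.group(1)], match.group(1))
```

Verbatim line feeds pass through unchanged, matching the rule that a line feed inside quotes is not a line break token, while a verbatim carriage return is rejected.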
Enum Variants
TK
Unit Variant
Unit variants are written as single identifiers.
Notable unit variants are the boolean values true and false, which are not associated with more specific grammar in TAML.
Sections
TAML’s grammar is, roughly speaking, split into three contexts:
structural sections
headings
tabular sections
Structural Sections
The initial context is a structural section. Structural sections can contain key-value pairs and nested sections, which can be structural sections.
first: 1
second: 2
# third
first: 3.1
second: 3.2
Each nested section is introduced by a heading nested exactly one deeper than the surrounding section’s.
It continues until a heading with at most equal depth is encountered or up to the end of the file. An empty nested heading can be used to semantically (but not grammatically!) return to its immediately surrounding structural section.
first: 1
second: 2
# third
first: 3.1
second: 3.2
## third
first: "3.3.1"
second: "3.3.2"
## fourth
first: "3.4.1"
second: "3.4.2"
#
fourth: 4
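The heading depth rules can be sketched with a small stack-based reader. This is a simplified illustration under several assumptions: only plain key-value lines and structural headings are handled, values are kept as raw token text, keys are not unquoted, comments are recognised only on lines of their own, and the name `parse_structure` is hypothetical.

```python
import re


def parse_structure(text: str) -> dict:
    """Read key-value pairs and structural headings into nested dicts."""
    root: dict = {}
    # stack[d] is the struct that receives pairs after a heading of depth d.
    stack = [root]
    for raw_line in re.split(r"\r?\n", text):
        line = raw_line.strip(" \t")
        if not line or line.startswith("//"):
            continue  # empty lines and comment-only lines carry no structure
        if line.startswith("#"):
            depth = len(line) - len(line.lstrip("#"))
            name = line[depth:].strip(" \t")
            if depth > len(stack):
                raise ValueError("heading is nested more than one level deeper")
            del stack[depth:]  # leave all sections at this depth or deeper
            parent = stack[-1]
            if name:
                section: dict = {}
                parent[name] = section
                stack.append(section)
            else:
                # Empty heading: following pairs go to the surrounding struct.
                stack.append(parent)
            continue
        key, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"expected a key-value pair: {line!r}")
        stack[-1][key.strip(" \t")] = value.strip(" \t")
    return root
```

Applied to the example above, the pair after the empty heading ends up in the top-level struct next to first, second and third, which is the semantic "return" the empty heading provides.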
Headings
Tabular Sections
Tabular sections are a special shorthand to quickly define lists with structured content.
The following are equivalent:
# [[dishes].{id, name, [price].{currency, amount}}]
<luid:d6fce69d-9c9d>, "A", EUR, 10.95
<luid:c37dcc6a-2002>, "B", EUR, 5.50
<luid:00000000-0000>, "Test Item", EUR, 0.0
# [dishes]
id: <luid:d6fce69d-9c9d>
name: "A"
## price
currency: EUR
amount: 10.95
# [dishes]
id: <luid:c37dcc6a-2002>
name: "B"
## price
currency: EUR
amount: 5.50
# [dishes]
id: <luid:00000000-0000>
name: "Test Item"
## price
currency: EUR
amount: 0.0
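The equivalence shown above can be illustrated by expanding rows into nested records. In this sketch the column paths are written out by hand instead of being parsed from the table heading, cell values are kept as raw token text, and the function name `expand_rows` is hypothetical.

```python
def expand_rows(list_key: str, columns: list[tuple[str, ...]], rows: list[list[str]]) -> dict:
    """Turn table rows into a list of nested structs stored under list_key.

    Each entry in `columns` is the key path that the cell at the same
    position in a row is assigned to, for example ("price", "amount").
    """
    records = []
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length does not match the number of columns")
        record: dict = {}
        for path, value in zip(columns, row):
            target = record
            for key in path[:-1]:
                # Create intermediate structs on first use.
                target = target.setdefault(key, {})
            target[path[-1]] = value
        records.append(record)
    return {list_key: records}


# Hand-written column paths for the dishes example above.
DISH_COLUMNS = [("id",), ("name",), ("price", "currency"), ("price", "amount")]
dishes = expand_rows(
    "dishes",
    DISH_COLUMNS,
    [
        ["<luid:d6fce69d-9c9d>", '"A"', "EUR", "10.95"],
        ["<luid:c37dcc6a-2002>", '"B"', "EUR", "5.50"],
    ],
)
```

Each resulting record has the same shape as one of the expanded `# [dishes]` sections, with price as a nested struct.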
Hint
As of right now, there is intentionally no way to define common values once per table.
I haven’t found a way to express this that both is intuitive and won’t make copy/paste errors much more likely.
Row
TK