tomo/docs/patterns.md

5.6 KiB

Text Pattern Matching

As an alternative to full regular expressions, Tomo provides a limited string matching pattern syntax that is intended to solve 80% of use cases in under 1% of the code size (PCRE's codebase is roughly 150k lines of code, and Tomo's pattern matching code is a bit under 1k lines of code). Tomo's pattern matching syntax is highly readable and works well for matching literal text without getting leaning toothpick syndrome.

For more advanced use cases, consider linking against a C library for regular expressions or pattern matching.

Pattern is a domain-specific language, in other words, it's like a Text, but it has a distinct type. As a convenience, you can use $/.../ to write pattern literals instead of using the general-purpose DSL syntax of $Pattern"...".

Patterns are used in a small, but very powerful API that handles many text functions that would normally be handled by a more extensive API:

Text.has(pattern:Pattern -> Bool)
Text.each(pattern:Pattern, fn:func(m:Match), recursive=yes -> Text)
Text.find(pattern:Pattern, start=1 -> Match?)
Text.find_all(pattern:Pattern -> [Match])
Text.matches(pattern:Pattern -> [Text]?)
Text.map(pattern:Pattern, fn:func(m:Match -> Text), recursive=yes -> Text)
Text.replace(pattern:Pattern, replacement:Text, placeholder:Pattern=$//, recursive=yes -> [Text])
Text.replace_all(replacements:{Pattern,Text}, placeholder:Pattern=$//, recursive=yes -> [Text])
Text.split(pattern:Pattern -> [Text])
Text.trim(pattern=$/{whitespace}/, trim_left=yes, trim_right=yes -> [Text])

Matches

Pattern matching functions work with a type called Match that has three fields:

  • text: The full text of the match.
  • index: The index in the text where the match was found.
  • captures: An array containing the matching text of each non-literal pattern group.

See Text Functions for the full API documentation.

Syntax

Patterns have three types of syntax:

  • { followed by an optional count (n, n-m, or n+), followed by an optional ! to negate the pattern, followed by an optional pattern name or Unicode character name, followed by a required }.

  • Any matching pair of quotes or parentheses or braces with a ? in the middle (e.g. "?" or (?)).

  • Any other character is treated as a literal to be matched exactly.

Named Patterns

Named patterns match certain pre-defined patterns that are commonly useful. To use a named pattern, use the syntax {name}. Names are case-insensitive and mostly ignore spaces, underscores, and dashes.

  • .. - Any character (note that a single . would mean the literal period character).
  • digit - A unicode digit
  • email - an email address
  • emoji - an emoji
  • end - the very end of the text
  • id - A unicode identifier
  • int - One or more digits with an optional - (minus sign) in front
  • ip - an IP address (IPv4 or IPv6)
  • ipv4 - an IPv4 address
  • ipv6 - an IPv6 address
  • nl/newline/crlf - A line break (either \r\n or \n)
  • num - One or more digits with an optional - (minus sign) in front and an optional . and more digits after
  • start - the very start of the text
  • uri - a URI
  • url - a URL (URI that specifically starts with http://, https://, ws://, wss://, or ftp://)
  • word - A unicode identifier (same as id)

For non-alphabetic characters, any single character is treated as matching exactly that character. For example, {1{} matches exactly one { character. Or, {1.} matches exactly one . character.

Patterns can also use any Unicode property name. Some helpful ones are:

  • hex - Hexidecimal digits
  • lower - Lowercase letters
  • space - The space character
  • upper - Uppercase letters
  • whitespace - Whitespace characters

Patterns may also use exact Unicode codepoint names. For example: {1 latin small letter A} matches a.

Negating Patterns

If an exclamation mark (!) is placed before a pattern's name, then characters are matched only when they don't match the pattern. For example, {!alpha} will match all characters except alphabetic ones.

Interpolating Text and Escaping

To escape a character in a pattern (e.g. if you want to match the literal character ?), you can use the syntax {1 ?}. This is almost never necessary unless you have text that looks like a Tomo text pattern and has something like { or (?) inside it.

However, if you're trying to do an exact match of arbitrary text values, you'll want to have the text automatically escaped. Fortunately, Tomo's injection-safe DSL text interpolation supports automatic text escaping. This means that if you use text interpolation with the $ sign to insert a text value, the value will be automatically escaped using the {1 ?} rule described above:

# Risk of code injection (would cause an error because 'xxx' is not a valid
# pattern name:
>> user_input := get_user_input()
= "{xxx}"

# Interpolation automatically escapes:
>> $/$user_input/
= $/{1{}..xxx}/

# This is: `{ 1{ }` (one open brace) followed by the literal text "..xxx}"

# No error:
>> some_text:find($/$user_input/)
= 0

If you prefer, you can also use this to insert literal characters:

>> $/literal $"{..}"/
= $/literal {1{}..}/

Repetitions

By default, named patterns match 1 or more repetitions, but you can specify how many repetitions you want by putting a number or range of numbers first using n (exactly n repetitions), n-m (between n and m repetitions), or n+ (n or more repetitions):

{4-5 alpha}
0x{hex}
{4 digit}-{2 digit}-{2 digit}
{2+ space}
{0-1 question mark}