Text Pattern Matching
As an alternative to full regular expressions, Tomo provides a limited string matching pattern syntax that is intended to solve 80% of use cases in under 1% of the code size (PCRE's codebase is roughly 150k lines of code, and Tomo's pattern matching code is a bit under 1k lines of code). Tomo's pattern matching syntax is highly readable and works well for matching literal text without getting leaning toothpick syndrome.
For more advanced use cases, consider linking against a C library for regular expressions or pattern matching.
Pat is a domain-specific language; in other words, it is like a Text, but it has a distinct type.
Patterns are used in a small, but very powerful API that handles many text functions that would normally be handled by a more extensive API:
matches_pattern(text:Text, pattern:Pat -> [Text]?)
replace_pattern(text:Text, pattern:Pat, replacement:Text, backref="@", recursive=yes -> Text)
translate_patterns(text:Text, replacements:{Pat,Text}, backref="@", recursive=yes -> Text)
has_pattern(text:Text, pattern:Pat -> Bool)
find_patterns(text:Text, pattern:Pat -> [PatternMatch])
by_pattern(text:Text, pattern:Pat -> func(->PatternMatch?))
each_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch), recursive=yes)
map_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch -> Text), recursive=yes -> Text)
split_pattern(text:Text, pattern:Pat -> [Text])
by_pattern_split(text:Text, pattern:Pat -> func(->Text?))
trim_pattern(text:Text, pattern=$Pat"{space}", left=yes, right=yes -> Text)
Matches
Pattern matching functions work with a type called PatternMatch that has three fields:
- text: The full text of the match.
- index: The index in the text where the match was found.
- captures: An array containing the matching text of each non-literal pattern group.
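As a rough illustration of the shape of a match value, here is a minimal Python model of the three fields. The Python class and the sample values are assumptions made for illustration only; this is not Tomo's implementation, and the index value shown does not imply a particular indexing base.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatternMatch:
    """Hypothetical Python stand-in for Tomo's PatternMatch type."""
    text: str                                           # full text of the match
    index: int                                          # where the match was found
    captures: List[str] = field(default_factory=list)   # non-literal groups only


# Sample values for a pattern like {4 digit}-{2 digit} matching "2024-05".
example = PatternMatch(text="2024-05", index=1, captures=["2024", "05"])
print(example.text, example.captures)
```

The literal - is part of the matched text but does not appear in captures, because only non-literal pattern groups are captured.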
See Text Functions for the full API documentation.
Syntax
Patterns have three types of syntax:
- { followed by an optional count (n, n-m, or n+), followed by an optional ! to negate the pattern, followed by an optional pattern name or Unicode character name, followed by a required }.
- Any matching pair of quotes or parentheses or braces with a ? in the middle (e.g. "?" or (?)).
- Any other character is treated as a literal to be matched exactly.
Named Patterns
Named patterns match certain pre-defined patterns that are commonly useful. To
use a named pattern, use the syntax {name}. Names are case-insensitive and
mostly ignore spaces, underscores, and dashes.
- ..: Any character (note that a single . would mean the literal period character)
- digit: A Unicode digit
- email: An email address
- emoji: An emoji
- end: The very end of the text
- id: A Unicode identifier
- int: One or more digits with an optional - (minus sign) in front
- ip: An IP address (IPv4 or IPv6)
- ipv4: An IPv4 address
- ipv6: An IPv6 address
- nl/newline/crlf: A line break (either \r\n or \n)
- num: One or more digits with an optional - (minus sign) in front and an optional . and more digits after
- start: The very start of the text
- uri: A URI
- url: A URL (a URI that specifically starts with http://, https://, ws://, wss://, or ftp://)
- word: A Unicode identifier (same as id)
For non-alphabetic characters, any single character is treated as matching
exactly that character. For example, {1{} matches exactly one { character.
Or, {1.} matches exactly one . character.
Patterns can also use any Unicode property name. Some helpful ones are:
- hex: Hexadecimal digits
- lower: Lowercase letters
- space: The space character
- upper: Uppercase letters
- whitespace: Whitespace characters
Patterns may also use exact Unicode codepoint names. For example: {1 latin small letter A} matches a.
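The name-matching rule for named patterns (case-insensitive, mostly ignoring spaces, underscores, and dashes) can be modeled roughly as a normalization step applied before lookup. The Python function below is a hypothetical sketch of that idea, not Tomo's actual lookup code; since names only "mostly" ignore these characters, the real rule may have exceptions this sketch does not capture.

```python
def normalize_pattern_name(name: str) -> str:
    """Lowercase the name and drop spaces, underscores, and dashes.

    Hypothetical helper: a simplified model of how two spellings of a
    pattern name could be treated as the same name.
    """
    return "".join(ch for ch in name.lower() if ch not in " _-")


# Different spellings collapse to the same key under this model.
print(normalize_pattern_name("Question Mark"))
print(normalize_pattern_name("question_mark"))
```

Under this model, {question mark}, {Question_Mark}, and {question-mark} would all refer to the same pattern.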
Negating Patterns
If an exclamation mark (!) is placed before a pattern's name, then characters
are matched only when they don't match the pattern. For example, {!alpha}
will match all characters except alphabetic ones.
Interpolating Text and Escaping
To escape a character in a pattern (e.g. if you want to match the literal
character ?), you can use the syntax {1 ?}. This is almost never necessary
unless you have text that looks like a Tomo text pattern and has something like
{ or (?) inside it.
However, if you're trying to do an exact match of arbitrary text values, you'll
want to have the text automatically escaped. Fortunately, Tomo's injection-safe
DSL text interpolation supports automatic text escaping. This means that if you
use text interpolation with the $ sign to insert a text value, the value will
be automatically escaped using the {1 ?} rule described above:
# Risk of code injection (would cause an error because 'xxx' is not a valid
# pattern name):
>> user_input := get_user_input()
= "{xxx}"
# Interpolation automatically escapes:
>> $/$user_input/
= $/{1{}xxx}/
# This is: `{ 1{ }` (one open brace) followed by the literal text "xxx}"
# No error:
>> some_text:find($/$user_input/)
= 0
If you prefer, you can also use this to insert literal characters:
>> $/literal $"{..}"/
= $/literal {1{}..}/
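To make the escaping rule concrete, here is a minimal Python model of what interpolation does to an inserted text value. It assumes, purely for illustration, that only { and ? need escaping; the exact set of characters Tomo escapes is not stated here, and the function name is hypothetical.

```python
# Assumption: these are the only characters that can start pattern syntax.
SPECIAL_CHARS = {"{", "?"}


def escape_for_pattern(text: str) -> str:
    """Rewrite each special character X as {1X}, i.e. "exactly one X"."""
    return "".join("{1" + ch + "}" if ch in SPECIAL_CHARS else ch for ch in text)


# Mirrors the interpolation example: the opening brace is escaped and the
# rest of the text is left as ordinary literal characters.
print(escape_for_pattern("{..}"))
```

Because a closing } with no matching opening { is just a literal character, only the opening brace has to be rewritten in this model.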
Repetitions
By default, named patterns match 1 or more repetitions, but you can specify how
many repetitions you want by putting a number or range of numbers first using
n (exactly n repetitions), n-m (between n and m repetitions), or n+ (n or more
repetitions):
{4-5 alpha}
0x{hex}
{4 digit}-{2 digit}-{2 digit}
{2+ space}
{0-1 question mark}
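The structure of a braced pattern (optional count, optional !, then a name) can be sketched as a small parser. The Python code below is a hypothetical model of that grammar, using a regular expression to pull the pieces apart; the names BraceSpec and parse_brace_body are invented for this sketch, and Tomo's real parser may treat edge cases differently.

```python
import re
from typing import NamedTuple, Optional


class BraceSpec(NamedTuple):
    min_count: int
    max_count: Optional[int]  # None means "no upper bound"
    negated: bool
    name: str


# Optional count (n, n-m, or n+), optional "!", then the rest is the name.
_SPEC = re.compile(r"\s*(?:(\d+)(?:-(\d+)|(\+))?)?\s*(!)?\s*(.*)", re.DOTALL)


def parse_brace_body(body: str) -> BraceSpec:
    """Parse the text between { and } into counts, negation, and name."""
    lo, hi, plus, bang, name = _SPEC.fullmatch(body).groups()
    if lo is None:
        min_count, max_count = 1, None          # default: 1 or more
    elif hi is not None:
        min_count, max_count = int(lo), int(hi)  # n-m
    elif plus:
        min_count, max_count = int(lo), None     # n+
    else:
        min_count = max_count = int(lo)          # exactly n
    return BraceSpec(min_count, max_count, bang is not None, name.strip())


for body in ["4-5 alpha", "hex", "2+ space", "0-1 question mark", "1{"]:
    print(repr(body), parse_brace_body(body))
```

Note how {1{} falls out of the same rule: the count is 1 and the name is the single non-alphabetic character {.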
Methods
matches_pattern
Returns an array of text segments that match the given pattern.
func matches_pattern(text:Text, pattern:Pat -> [Text]?)
- text: The text to search within.
- pattern: The pattern to match.
Returns:
An optional array of matched text segments. Returns none if no matches are found.
replace_pattern
Replaces occurrences of a pattern with a replacement string, supporting backreferences.
func replace_pattern(text:Text, pattern:Pat, replacement:Text, backref="@", recursive=yes -> Text)
- text: The text to modify.
- pattern: The pattern to match.
- replacement: The text to replace matches with.
- backref: The symbol for backreferences in the replacement.
- recursive: If yes, applies replacements recursively.
Returns:
A new text with replacements applied.
translate_patterns
Replaces multiple patterns using a mapping of patterns to replacement texts.
func translate_patterns(text:Text, replacements:{Pat,Text}, backref="@", recursive=yes -> Text)
- text: The text to modify.
- replacements: A table mapping patterns to their replacements.
- backref: The symbol for backreferences in replacements.
- recursive: If yes, applies replacements recursively.
Returns:
A new text with all specified replacements applied.
has_pattern
Checks whether a given pattern appears in the text.
func has_pattern(text:Text, pattern:Pat -> Bool)
- text: The text to search.
- pattern: The pattern to check for.
Returns:
yes if a match is found, otherwise no.
find_patterns
Finds all occurrences of a pattern in a text and returns them as PatternMatch objects.
func find_patterns(text:Text, pattern:Pat -> [PatternMatch])
- text: The text to search.
- pattern: The pattern to match.
Returns:
An array of PatternMatch objects.
by_pattern
Returns an iterator function that yields PatternMatch objects for each occurrence.
func by_pattern(text:Text, pattern:Pat -> func(->PatternMatch?))
- text: The text to search.
- pattern: The pattern to match.
Returns:
An iterator function that yields PatternMatch objects one at a time.
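The return type func(->PatternMatch?) describes a function that takes no arguments and returns either the next match or none once the matches run out. A minimal Python sketch of that calling convention follows. It uses Python's re module as a stand-in for Tomo patterns, and the name by_regex is hypothetical, so this shows only the iterator-function shape and not Tomo's matching behavior.

```python
import re
from typing import Callable, Optional


def by_regex(text: str, regex: str) -> Callable[[], Optional[re.Match]]:
    """Return a zero-argument function that yields one match per call."""
    matches = re.finditer(regex, text)

    def next_match() -> Optional[re.Match]:
        # None plays the role of Tomo's `none` when matches are exhausted.
        return next(matches, None)

    return next_match


next_match = by_regex("a1b22c333", r"\d+")
found = []
while (m := next_match()) is not None:
    found.append(m.group())
print(found)
```

Returning a closure instead of a full array lets the caller stop early without computing the remaining matches.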
each_pattern
Applies a function to each occurrence of a pattern in the text.
func each_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch), recursive=yes)
- text: The text to search.
- pattern: The pattern to match.
- fn: The function to apply to each match.
- recursive: If yes, applies the function recursively on modified text.
map_pattern
Transforms matches of a pattern using a mapping function.
func map_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch -> Text), recursive=yes -> Text)
- text: The text to modify.
- pattern: The pattern to match.
- fn: A function that transforms matches.
- recursive: If yes, applies transformations recursively.
Returns:
A new text with the transformed matches.
split_pattern
Splits a text into segments using a pattern as the delimiter.
func split_pattern(text:Text, pattern:Pat -> [Text])
- text: The text to split.
- pattern: The pattern to use as a separator.
Returns:
An array of text segments.
by_pattern_split
Returns an iterator function that yields text segments split by a pattern.
func by_pattern_split(text:Text, pattern:Pat -> func(->Text?))
- text: The text to split.
- pattern: The pattern to use as a separator.
Returns:
An iterator function that yields text segments.
trim_pattern
Removes matching patterns from the beginning and/or end of a text.
func trim_pattern(text:Text, pattern=$Pat"{space}", left=yes, right=yes -> Text)
- text: The text to trim.
- pattern: The pattern to trim (defaults to {space}).
- left: If yes, trims from the beginning.
- right: If yes, trims from the end.
Returns:
The trimmed text.
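The effect of the left and right flags can be sketched in Python. This is a hypothetical model, not Tomo's implementation: it uses Python's re module in place of Tomo patterns, with the regex " +" standing in for the default {space} pattern (one or more space characters), and the name trim_regex is invented for the sketch.

```python
import re


def trim_regex(text: str, regex: str = r" +", left: bool = True, right: bool = True) -> str:
    """Remove one match of `regex` from the start and/or end of `text`."""
    if left:
        # \A anchors the match to the very start of the text.
        text = re.sub(r"\A(?:%s)" % regex, "", text)
    if right:
        # \Z anchors the match to the very end of the text.
        text = re.sub(r"(?:%s)\Z" % regex, "", text)
    return text


print(repr(trim_regex("  hi  ")))
print(repr(trim_regex("  hi  ", left=False)))
print(repr(trim_regex("  hi  ", right=False)))
```

Text in the middle is never touched, because each substitution is anchored to one end of the input.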