# Text Pattern Matching As an alternative to full regular expressions, Tomo provides a limited text matching pattern syntax that is intended to solve 80% of use cases in under 1% of the code size (PCRE's codebase is roughly 150k lines of code, and Tomo's pattern matching code is a bit under 1k lines of code). Tomo's pattern matching syntax is highly readable and works well for matching literal text without getting [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome). For more advanced use cases, consider linking against a C library for regular expressions or pattern matching. `Pat` is a [domain-specific language](docs/langs.md), in other words, it's like a `Text`, but it has a distinct type. Patterns are used in a small, but very powerful API that handles many text functions that would normally be handled by a more extensive API: - [`by_pattern(text:Text, pattern:Pat -> func(->PatternMatch?))`](#by_pattern) - [`by_pattern_split(text:Text, pattern:Pat -> func(->Text?))`](#by_pattern_split) - [`each_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch), recursive=yes)`](#each_pattern) - [`find_patterns(text:Text, pattern:Pat -> [PatternMatch])`](#find_patterns) - [`has_pattern(text:Text, pattern:Pat -> Bool)`](#has_pattern) - [`map_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch -> Text), recursive=yes -> Text)`](#map_pattern) - [`matches_pattern(text:Text, pattern:Pat -> Bool)`](#matches_pattern) - [`pattern_captures(text:Text, pattern:Pat -> [Text]?)`](#pattern_captures) - [`replace_pattern(text:Text, pattern:Pat, replacement:Text, backref="@", recursive=yes -> Text)`](#replace_pattern) - [`split_pattern(text:Text, pattern:Pat -> [Text])`](#split_pattern) - [`translate_patterns(text:Text, replacements:{Pat,Text}, backref="@", recursive=yes -> Text)`](#translate_patterns) - [`trim_pattern(text:Text, pattern=$Pat"{space}", left=yes, right=yes -> Text)`](#trim_pattern) ## Matches Pattern matching functions work with a type called `PatternMatch` that has three fields: - `text`: The full text of the match. - `index`: The index in the text where the match was found. - `captures`: An array containing the matching text of each non-literal pattern group. See [Text Functions](text.md#Text-Functions) for the full API documentation. ## Syntax Patterns have three types of syntax: - `{` followed by an optional count (`n`, `n-m`, or `n+`), followed by an optional `!` to negate the pattern, followed by an optional pattern name or Unicode character name, followed by a required `}`. - Any matching pair of quotes or parentheses or braces with a `?` in the middle (e.g. `"?"` or `(?)`). - Any other character is treated as a literal to be matched exactly. ## Named Patterns Named patterns match certain pre-defined patterns that are commonly useful. To use a named pattern, use the syntax `{name}`. Names are case-insensitive and mostly ignore spaces, underscores, and dashes. - `..` - Any character (note that a single `.` would mean the literal period character). - `digit` - A unicode digit - `email` - an email address - `emoji` - an emoji - `end` - the very end of the text - `id` - A unicode identifier - `int` - One or more digits with an optional `-` (minus sign) in front - `ip` - an IP address (IPv4 or IPv6) - `ipv4` - an IPv4 address - `ipv6` - an IPv6 address - `nl`/`newline`/`crlf` - A line break (either `\r\n` or `\n`) - `num` - One or more digits with an optional `-` (minus sign) in front and an optional `.` and more digits after - `start` - the very start of the text - `uri` - a URI - `url` - a URL (URI that specifically starts with `http://`, `https://`, `ws://`, `wss://`, or `ftp://`) - `word` - A unicode identifier (same as `id`) For non-alphabetic characters, any single character is treated as matching exactly that character. For example, `{1{}` matches exactly one `{` character. Or, `{1.}` matches exactly one `.` character. Patterns can also use any Unicode property name. Some helpful ones are: - `hex` - Hexidecimal digits - `lower` - Lowercase letters - `space` - The space character - `upper` - Uppercase letters - `whitespace` - Whitespace characters Patterns may also use exact Unicode codepoint names. For example: `{1 latin small letter A}` matches `a`. ## Negating Patterns If an exclamation mark (`!`) is placed before a pattern's name, then characters are matched only when they _don't_ match the pattern. For example, `{!alpha}` will match all characters _except_ alphabetic ones. ## Interpolating Text and Escaping To escape a character in a pattern (e.g. if you want to match the literal character `?`), you can use the syntax `{1 ?}`. This is almost never necessary unless you have text that looks like a Tomo text pattern and has something like `{` or `(?)` inside it. However, if you're trying to do an exact match of arbitrary text values, you'll want to have the text automatically escaped. Fortunately, Tomo's injection-safe DSL text interpolation supports automatic text escaping. This means that if you use text interpolation with the `$` sign to insert a text value, the value will be automatically escaped using the `{1 ?}` rule described above: ```tomo # Risk of code injection (would cause an error because 'xxx' is not a valid # pattern name: >> user_input := get_user_input() = "{xxx}" # Interpolation automatically escapes: >> $/$user_input/ = $/{1{}..xxx}/ # This is: `{ 1{ }` (one open brace) followed by the literal text "..xxx}" # No error: >> some_text:find($/$user_input/) = 0 ``` If you prefer, you can also use this to insert literal characters: ```tomo >> $/literal $"{..}"/ = $/literal {1{}..}/ ``` ## Repetitions By default, named patterns match 1 or more repetitions, but you can specify how many repetitions you want by putting a number or range of numbers first using `n` (exactly `n` repetitions), `n-m` (between `n` and `m` repetitions), or `n+` (`n` or more repetitions): ``` {4-5 alpha} 0x{hex} {4 digit}-{2 digit}-{2 digit} {2+ space} {0-1 question mark} ``` # Methods ### `by_pattern` Returns an iterator function that yields `PatternMatch` objects for each occurrence. ```tomo func by_pattern(text:Text, pattern:Pat -> func(->PatternMatch?)) ``` - `text`: The text to search. - `pattern`: The pattern to match. **Returns:** An iterator function that yields `PatternMatch` objects one at a time. **Example:** ```tomo text := "one, two, three" for word in text:by_pattern($Pat"{id}"): say(word.text) ``` --- ### `by_pattern_split` Returns an iterator function that yields text segments split by a pattern. ```tomo func by_pattern_split(text:Text, pattern:Pat -> func(->Text?)) ``` - `text`: The text to split. - `pattern`: The pattern to use as a separator. **Returns:** An iterator function that yields text segments. **Example:** ```tomo text := "one two three" for word in text:by_pattern_split($Pat"{whitespace}"): say(word.text) ``` --- ### `each_pattern` Applies a function to each occurrence of a pattern in the text. ```tomo func each_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch), recursive=yes) ``` - `text`: The text to search. - `pattern`: The pattern to match. - `fn`: The function to apply to each match. - `recursive`: If `yes`, applies the function recursively on modified text. **Example:** ```tomo text := "one two three" text:each_pattern($Pat"{id}", func(m:PatternMatch): say(m.txt) ) ``` --- ### `find_patterns` Finds all occurrences of a pattern in a text and returns them as `PatternMatch` objects. ```tomo func find_patterns(text:Text, pattern:Pat -> [PatternMatch]) ``` - `text`: The text to search. - `pattern`: The pattern to match. **Returns:** An array of `PatternMatch` objects. **Example:** ```tomo text := "one! two three!" >> text:find_patterns($Pat"{id}!") = [PatternMatch(text="one!", index=1, captures=["one"]), PatternMatch(text="three!", index=10, captures=["three"])] ``` --- ### `has_pattern` Checks whether a given pattern appears in the text. ```tomo func has_pattern(text:Text, pattern:Pat -> Bool) ``` - `text`: The text to search. - `pattern`: The pattern to check for. **Returns:** `yes` if a match is found, otherwise `no`. **Example:** ```tomo text := "...okay..." >> text:has_pattern($Pat"{id}") = yes ``` --- ### `map_pattern` Transforms matches of a pattern using a mapping function. ```tomo func map_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch -> Text), recursive=yes -> Text) ``` - `text`: The text to modify. - `pattern`: The pattern to match. - `fn`: A function that transforms matches. - `recursive`: If `yes`, applies transformations recursively. **Returns:** A new text with the transformed matches. **Example:** ```tomo text := "I have #apples and #oranges and #plums" fruits := {"apples"=4, "oranges"=5} >> text:map_pattern($Pat'#{id}', func(match:PatternMatch): fruit := match.captures[1] "$(fruits[fruit] or 0) $fruit" ) = "I have 4 apples and 5 oranges and 0 plums" ``` --- ### `matches_pattern` Returns whether or not text matches a pattern completely. ```tomo func matches_pattern(text:Text, pattern:Pat -> Bool) ``` - `text`: The text to match against. - `pattern`: The pattern to match. **Returns:** `yes` if the whole text matches the pattern, otherwise `no`. **Example:** ```tomo >> "Hello!!!":matches_pattern($Pat"{id}") = no >> "Hello":matches_pattern($Pat"{id}") = yes ``` --- ### `pattern_captures` Returns an array of pattern captures for the given pattern. ```tomo func pattern_captures(text:Text, pattern:Pat -> [Text]?) ``` - `text`: The text to match against. - `pattern`: The pattern to match. **Returns:** An optional array of matched pattern captures. Returns `none` if the text does not match the pattern. **Example:** ```tomo >> "123 boxes":pattern_captures($Pat"{int} {id}") = ["123", "boxes"]? >> "xxx":pattern_captures($Pat"{int} {id}") = none ``` --- ### `replace_pattern` Replaces occurrences of a pattern with a replacement text, supporting backreferences. ```tomo func replace_pattern(text:Text, pattern:Pat, replacement:Text, backref="@", recursive=yes -> Text) ``` - `text`: The text to modify. - `pattern`: The pattern to match. - `replacement`: The text to replace matches with. - `backref`: The symbol for backreferences in the replacement. - `recursive`: If `yes`, applies replacements recursively. **Returns:** A new text with replacements applied. **Example:** ```tomo >> "I have 123 apples and 456 oranges":replace_pattern($Pat"{int}", "some") = "I have some apples and some oranges" >> "I have 123 apples and 456 oranges":replace_pattern($Pat"{int}", "(@1)") = "I have (123) apples and (456) oranges" >> "I have 123 apples and 456 oranges":replace_pattern($Pat"{int}", "(?1)", backref="?") = "I have (123) apples and (456) oranges" >> "bad(fn(), bad(notbad))":replace_pattern($Pat"bad(?)", "good(@1)") = "good(fn(), good(notbad))" >> "bad(fn(), bad(notbad))":replace_pattern($Pat"bad(?)", "good(@1)", recursive=no) = "good(fn(), bad(notbad))" ``` --- ### `split_pattern` Splits a text into segments using a pattern as the delimiter. ```tomo func split_pattern(text:Text, pattern:Pat -> [Text]) ``` - `text`: The text to split. - `pattern`: The pattern to use as a separator. **Returns:** An array of text segments. **Example:** ```tomo >> "one two three":split_pattern($Pat"{whitespace}") = ["one", "two", "three"] ``` --- ### `translate_patterns` Replaces multiple patterns using a mapping of patterns to replacement texts. ```tomo func translate_patterns(text:Text, replacements:{Pat,Text}, backref="@", recursive=yes -> Text) ``` - `text`: The text to modify. - `replacements`: A table mapping patterns to their replacements. - `backref`: The symbol for backreferences in replacements. - `recursive`: If `yes`, applies replacements recursively. **Returns:** A new text with all specified replacements applied. **Example:** ```tomo >> text := "foo(x, baz(1))" >> text:translate_patterns({ $Pat"{id}(?)"="call(fn('@1'), @2)", $Pat"{id}"="var('@1')", $Pat"{int}"="int(@1)", }) = "call(fn('foo'), var('x'), call(fn('baz'), int(1)))" ``` --- ### `trim_pattern` Removes matching patterns from the beginning and/or end of a text. ```tomo func trim_pattern(text:Text, pattern=$Pat"{space}", left=yes, right=yes -> Text) ``` - `text`: The text to trim. - `pattern`: The pattern to trim (defaults to whitespace). - `left`: If `yes`, trims from the beginning. - `right`: If `yes`, trims from the end. **Returns:** The trimmed text. **Example:** ```tomo >> "123abc456":trim_pattern($Pat"{digit}") = "abc" ```