diff options
| author | Bruce Hill <bruce@bruce-hill.com> | 2025-03-03 13:45:30 -0500 |
|---|---|---|
| committer | Bruce Hill <bruce@bruce-hill.com> | 2025-03-03 13:45:30 -0500 |
| commit | f330f06c218a0903530cdc51c0fac245cb51ae75 (patch) | |
| tree | 7e4220790a846cbd05190149edd3a61ce795b752 /docs/patterns.md | |
| parent | 80475ad02d6b20d6c667c3be8bc939a83632bd3f (diff) | |
Add `recursive` argument to text:each() and text:map(), plus update docs
Diffstat (limited to 'docs/patterns.md')
| -rw-r--r-- | docs/patterns.md | 153 |
1 files changed, 153 insertions, 0 deletions
diff --git a/docs/patterns.md b/docs/patterns.md new file mode 100644 index 00000000..dd57ada9 --- /dev/null +++ b/docs/patterns.md @@ -0,0 +1,153 @@ +# Text Pattern Matching + +As an alternative to full regular expressions, Tomo provides a limited string +matching pattern syntax that is intended to solve 80% of use cases in under 1% +of the code size (PCRE's codebase is roughly 150k lines of code, and Tomo's +pattern matching code is a bit under 1k lines of code). Tomo's pattern matching +syntax is highly readable and works well for matching literal text without +getting [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome). + +For more advanced use cases, consider linking against a C library for regular +expressions or pattern matching. + +`Pattern` is a [domain-specific language](docs/langs.md), in other words, it's +like a `Text`, but it has a distinct type. As a convenience, you can use +`$/.../` to write pattern literals instead of using the general-purpose DSL +syntax of `$Pattern"..."`. + +Patterns are used in a small, but very powerful API that handles many text +functions that would normally be handled by a more extensive API: + +``` +Text.has(pattern:Pattern -> Bool) +Text.each(pattern:Pattern, fn:func(m:Match), recursive=yes -> Text) +Text.find(pattern:Pattern, start=1 -> Match?) +Text.find_all(pattern:Pattern -> [Match]) +Text.matches(pattern:Pattern -> [Text]?) +Text.map(pattern:Pattern, fn:func(m:Match -> Text), recursive=yes -> Text) +Text.replace(pattern:Pattern, replacement:Text, placeholder:Pattern=$//, recursive=yes -> [Text]) +Text.replace_all(replacements:{Pattern,Text}, placeholder:Pattern=$//, recursive=yes -> [Text]) +Text.split(pattern:Pattern -> [Text]) +Text.trim(pattern=$/{whitespace}/, trim_left=yes, trim_right=yes -> [Text]) +``` + +## Matches + +Pattern matching functions work with a type called `Match` that has three fields: + +- `text`: The full text of the match. +- `index`: The index in the text where the match was found. +- `captures`: An array containing the matching text of each non-literal pattern group. + +See [Text Functions](text.md#Text-Functions) for the full API documentation. + +## Syntax + +Patterns have three types of syntax: + +- `{` followed by an optional count (`n`, `n-m`, or `n+`), followed by an + optional `!` to negate the pattern, followed by an optional pattern name or + Unicode character name, followed by a required `}`. + +- Any matching pair of quotes or parentheses or braces with a `?` in the middle + (e.g. `"?"` or `(?)`). + +- Any other character is treated as a literal to be matched exactly. + +## Named Patterns + +Named patterns match certain pre-defined patterns that are commonly useful. To +use a named pattern, use the syntax `{name}`. Names are case-insensitive and +mostly ignore spaces, underscores, and dashes. + +- `..` - Any character (note that a single `.` would mean the literal period + character). +- `digit` - A unicode digit +- `email` - an email address +- `emoji` - an emoji +- `end` - the very end of the text +- `id` - A unicode identifier +- `int` - One or more digits with an optional `-` (minus sign) in front +- `ip` - an IP address (IPv4 or IPv6) +- `ipv4` - an IPv4 address +- `ipv6` - an IPv6 address +- `nl`/`newline`/`crlf` - A line break (either `\r\n` or `\n`) +- `num` - One or more digits with an optional `-` (minus sign) in front and an optional `.` and more digits after +- `start` - the very start of the text +- `uri` - a URI +- `url` - a URL (URI that specifically starts with `http://`, `https://`, `ws://`, `wss://`, or `ftp://`) +- `word` - A unicode identifier (same as `id`) + +For non-alphabetic characters, any single character is treated as matching +exactly that character. For example, `{1{}` matches exactly one `{` +character. Or, `{1.}` matches exactly one `.` character. + +Patterns can also use any Unicode property name. Some helpful ones are: + +- `hex` - Hexidecimal digits +- `lower` - Lowercase letters +- `space` - The space character +- `upper` - Uppercase letters +- `whitespace` - Whitespace characters + +Patterns may also use exact Unicode codepoint names. For example: `{1 latin +small letter A}` matches `a`. + +## Negating Patterns + +If an exclamation mark (`!`) is placed before a pattern's name, then characters +are matched only when they _don't_ match the pattern. For example, `{!alpha}` +will match all characters _except_ alphabetic ones. + +## Interpolating Text and Escaping + +To escape a character in a pattern (e.g. if you want to match the literal +character `?`), you can use the syntax `{1 ?}`. This is almost never necessary +unless you have text that looks like a Tomo text pattern and has something like +`{` or `(?)` inside it. + +However, if you're trying to do an exact match of arbitrary text values, you'll +want to have the text automatically escaped. Fortunately, Tomo's injection-safe +DSL text interpolation supports automatic text escaping. This means that if you +use text interpolation with the `$` sign to insert a text value, the value will +be automatically escaped using the `{1 ?}` rule described above: + +```tomo +# Risk of code injection (would cause an error because 'xxx' is not a valid +# pattern name: +>> user_input := get_user_input() += "{xxx}" + +# Interpolation automatically escapes: +>> $/$user_input/ += $/{1{}..xxx}/ + +# This is: `{ 1{ }` (one open brace) followed by the literal text "..xxx}" + +# No error: +>> some_text:find($/$user_input/) += 0 +``` + +If you prefer, you can also use this to insert literal characters: + +```tomo +>> $/literal $"{..}"/ += $/literal {1{}..}/ +``` + +## Repetitions + +By default, named patterns match 1 or more repetitions, but you can specify how +many repetitions you want by putting a number or range of numbers first using +`n` (exactly `n` repetitions), `n-m` (between `n` and `m` repetitions), or `n+` +(`n` or more repetitions): + +``` +{4-5 alpha} +0x{hex} +{4 digit}-{2 digit}-{2 digit} +{2+ space} +{0-1 question mark} +``` + |
