| author | Bruce Hill <bruce@bruce-hill.com> | 2025-04-01 14:05:10 -0400 |
|---|---|---|
| committer | Bruce Hill <bruce@bruce-hill.com> | 2025-04-01 14:05:10 -0400 |
| commit | 4d59fc2987e52da0274e6b204a9d2885613f74b7 (patch) | |
| tree | 8c262f99cb6ae9b550b9f8abf0ab0477044d087a /docs/patterns.md | |
| parent | 7a2c99de74f5870e1dea5b59d049678ad0ef8e44 (diff) | |
Move patterns into a module
Diffstat (limited to 'docs/patterns.md')
| -rw-r--r-- | docs/patterns.md | 152 |
1 files changed, 0 insertions, 152 deletions
diff --git a/docs/patterns.md b/docs/patterns.md
deleted file mode 100644
index 728b978e..00000000
--- a/docs/patterns.md
+++ /dev/null
@@ -1,152 +0,0 @@
-# Text Pattern Matching
-
-As an alternative to full regular expressions, Tomo provides a limited string
-matching pattern syntax that is intended to solve 80% of use cases in under 1%
-of the code size (PCRE's codebase is roughly 150k lines of code, and Tomo's
-pattern matching code is a bit under 1k lines of code). Tomo's pattern matching
-syntax is highly readable and works well for matching literal text without
-getting [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome).
-
-For more advanced use cases, consider linking against a C library for regular
-expressions or pattern matching.
-
-`Pattern` is a [domain-specific language](docs/langs.md), in other words, it's
-like a `Text`, but it has a distinct type. As a convenience, you can use
-`$/.../` to write pattern literals instead of using the general-purpose DSL
-syntax of `$Pattern"..."`.
-
-Patterns are used in a small, but very powerful API that handles many text
-functions that would normally be handled by a more extensive API:
-
-```
-Text.has(pattern:Pattern -> Bool)
-Text.each(pattern:Pattern, fn:func(m:Match), recursive=yes -> Text)
-Text.find(pattern:Pattern, start=1 -> Match?)
-Text.find_all(pattern:Pattern -> [Match])
-Text.matches(pattern:Pattern -> [Text]?)
-Text.map(pattern:Pattern, fn:func(m:Match -> Text), recursive=yes -> Text)
-Text.replace(pattern:Pattern, replacement:Text, placeholder:Pattern=$//, recursive=yes -> [Text])
-Text.replace_all(replacements:{Pattern,Text}, placeholder:Pattern=$//, recursive=yes -> [Text])
-Text.split(pattern:Pattern -> [Text])
-Text.trim(pattern=$/{whitespace}/, trim_left=yes, trim_right=yes -> [Text])
-```
-
-## Matches
-
-Pattern matching functions work with a type called `Match` that has three fields:
-
-- `text`: The full text of the match.
-- `index`: The index in the text where the match was found.
-- `captures`: An array containing the matching text of each non-literal pattern group.
-
-See [Text Functions](text.md#Text-Functions) for the full API documentation.
-
-## Syntax
-
-Patterns have three types of syntax:
-
-- `{` followed by an optional count (`n`, `n-m`, or `n+`), followed by an
-  optional `!` to negate the pattern, followed by an optional pattern name or
-  Unicode character name, followed by a required `}`.
-
-- Any matching pair of quotes or parentheses or braces with a `?` in the middle
-  (e.g. `"?"` or `(?)`).
-
-- Any other character is treated as a literal to be matched exactly.
-
-## Named Patterns
-
-Named patterns match certain pre-defined patterns that are commonly useful. To
-use a named pattern, use the syntax `{name}`. Names are case-insensitive and
-mostly ignore spaces, underscores, and dashes.
-
-- `..` - Any character (note that a single `.` would mean the literal period
-  character).
-- `digit` - A unicode digit
-- `email` - an email address
-- `emoji` - an emoji
-- `end` - the very end of the text
-- `id` - A unicode identifier
-- `int` - One or more digits with an optional `-` (minus sign) in front
-- `ip` - an IP address (IPv4 or IPv6)
-- `ipv4` - an IPv4 address
-- `ipv6` - an IPv6 address
-- `nl`/`newline`/`crlf` - A line break (either `\r\n` or `\n`)
-- `num` - One or more digits with an optional `-` (minus sign) in front and an optional `.` and more digits after
-- `start` - the very start of the text
-- `uri` - a URI
-- `url` - a URL (URI that specifically starts with `http://`, `https://`, `ws://`, `wss://`, or `ftp://`)
-- `word` - A unicode identifier (same as `id`)
-
-For non-alphabetic characters, any single character is treated as matching
-exactly that character. For example, `{1{}` matches exactly one `{`
-character. Or, `{1.}` matches exactly one `.` character.
-
-Patterns can also use any Unicode property name. Some helpful ones are:
-
-- `hex` - Hexidecimal digits
-- `lower` - Lowercase letters
-- `space` - The space character
-- `upper` - Uppercase letters
-- `whitespace` - Whitespace characters
-
-Patterns may also use exact Unicode codepoint names. For example: `{1 latin
-small letter A}` matches `a`.
-
-## Negating Patterns
-
-If an exclamation mark (`!`) is placed before a pattern's name, then characters
-are matched only when they _don't_ match the pattern. For example, `{!alpha}`
-will match all characters _except_ alphabetic ones.
-
-## Interpolating Text and Escaping
-
-To escape a character in a pattern (e.g. if you want to match the literal
-character `?`), you can use the syntax `{1 ?}`. This is almost never necessary
-unless you have text that looks like a Tomo text pattern and has something like
-`{` or `(?)` inside it.
-
-However, if you're trying to do an exact match of arbitrary text values, you'll
-want to have the text automatically escaped. Fortunately, Tomo's injection-safe
-DSL text interpolation supports automatic text escaping. This means that if you
-use text interpolation with the `$` sign to insert a text value, the value will
-be automatically escaped using the `{1 ?}` rule described above:
-
-```tomo
-# Risk of code injection (would cause an error because 'xxx' is not a valid
-# pattern name:
->> user_input := get_user_input()
-= "{xxx}"
-
-# Interpolation automatically escapes:
->> $/$user_input/
-= $/{1{}..xxx}/
-
-# This is: `{ 1{ }` (one open brace) followed by the literal text "..xxx}"
-
-# No error:
->> some_text:find($/$user_input/)
-= 0
-```
-
-If you prefer, you can also use this to insert literal characters:
-
-```tomo
->> $/literal $"{..}"/
-= $/literal {1{}..}/
-```
-
-## Repetitions
-
-By default, named patterns match 1 or more repetitions, but you can specify how
-many repetitions you want by putting a number or range of numbers first using
-`n` (exactly `n` repetitions), `n-m` (between `n` and `m` repetitions), or `n+`
-(`n` or more repetitions):
-
-```
-{4-5 alpha}
-0x{hex}
-{4 digit}-{2 digit}-{2 digit}
-{2+ space}
-{0-1 question mark}
-```
