aboutsummaryrefslogtreecommitdiff
path: root/lib/patterns/README.md
diff options
context:
space:
mode:
authorBruce Hill <bruce@bruce-hill.com>2025-04-07 18:14:20 -0400
committerBruce Hill <bruce@bruce-hill.com>2025-04-07 18:14:20 -0400
commit3efd7d9cfbd330ebb45f39648ee96a3e429a06f9 (patch)
tree29acb9e2b2370bd155fed24fba79d01b553a24f3 /lib/patterns/README.md
parent15fabfb9be3e3620e4b96983a49017116cea40e2 (diff)
Move core libraries into their own folder
Diffstat (limited to 'lib/patterns/README.md')
-rw-r--r--lib/patterns/README.md444
1 files changed, 444 insertions, 0 deletions
diff --git a/lib/patterns/README.md b/lib/patterns/README.md
new file mode 100644
index 00000000..faf2854e
--- /dev/null
+++ b/lib/patterns/README.md
@@ -0,0 +1,444 @@
+# Text Pattern Matching
+
+As an alternative to full regular expressions, Tomo provides a limited text
+matching pattern syntax that is intended to solve 80% of use cases in under 1%
+of the code size (PCRE's codebase is roughly 150k lines of code, and Tomo's
+pattern matching code is a bit under 1k lines of code). Tomo's pattern matching
+syntax is highly readable and works well for matching literal text without
+getting [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome).
+
+For more advanced use cases, consider linking against a C library for regular
+expressions or pattern matching.
+
+`Pat` is a [domain-specific language](docs/langs.md), in other words, it's
+like a `Text`, but it has a distinct type.
+
+Patterns are used in a small, but very powerful API that handles many text
+functions that would normally be handled by a more extensive API:
+
+- [`by_pattern(text:Text, pattern:Pat -> func(->PatternMatch?))`](#by_pattern)
+- [`by_pattern_split(text:Text, pattern:Pat -> func(->Text?))`](#by_pattern_split)
+- [`each_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch), recursive=yes)`](#each_pattern)
+- [`find_patterns(text:Text, pattern:Pat -> [PatternMatch])`](#find_patterns)
+- [`has_pattern(text:Text, pattern:Pat -> Bool)`](#has_pattern)
+- [`map_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch -> Text), recursive=yes -> Text)`](#map_pattern)
+- [`matches_pattern(text:Text, pattern:Pat -> Bool)`](#matches_pattern)
+- [`pattern_captures(text:Text, pattern:Pat -> [Text]?)`](#pattern_captures)
+- [`replace_pattern(text:Text, pattern:Pat, replacement:Text, backref="@", recursive=yes -> Text)`](#replace_pattern)
+- [`split_pattern(text:Text, pattern:Pat -> [Text])`](#split_pattern)
+- [`translate_patterns(text:Text, replacements:{Pat,Text}, backref="@", recursive=yes -> Text)`](#translate_patterns)
+- [`trim_pattern(text:Text, pattern=$Pat"{space}", left=yes, right=yes -> Text)`](#trim_pattern)
+
+## Matches
+
+Pattern matching functions work with a type called `PatternMatch` that has three fields:
+
+- `text`: The full text of the match.
+- `index`: The index in the text where the match was found.
+- `captures`: A list containing the matching text of each non-literal pattern group.
+
+See [Text Functions](text.md#Text-Functions) for the full API documentation.
+
+## Syntax
+
+Patterns have three types of syntax:
+
+- `{` followed by an optional count (`n`, `n-m`, or `n+`), followed by an
+ optional `!` to negate the pattern, followed by an optional pattern name or
+ Unicode character name, followed by a required `}`.
+
+- Any matching pair of quotes or parentheses or braces with a `?` in the middle
+ (e.g. `"?"` or `(?)`).
+
+- Any other character is treated as a literal to be matched exactly.
+
+## Named Patterns
+
+Named patterns match certain pre-defined patterns that are commonly useful. To
+use a named pattern, use the syntax `{name}`. Names are case-insensitive and
+mostly ignore spaces, underscores, and dashes.
+
+- `..` - Any character (note that a single `.` would mean the literal period
+ character).
+- `digit` - A unicode digit
+- `email` - an email address
+- `emoji` - an emoji
+- `end` - the very end of the text
+- `id` - A unicode identifier
+- `int` - One or more digits with an optional `-` (minus sign) in front
+- `ip` - an IP address (IPv4 or IPv6)
+- `ipv4` - an IPv4 address
+- `ipv6` - an IPv6 address
+- `nl`/`newline`/`crlf` - A line break (either `\r\n` or `\n`)
+- `num` - One or more digits with an optional `-` (minus sign) in front and an optional `.` and more digits after
+- `start` - the very start of the text
+- `uri` - a URI
+- `url` - a URL (URI that specifically starts with `http://`, `https://`, `ws://`, `wss://`, or `ftp://`)
+- `word` - A unicode identifier (same as `id`)
+
+For non-alphabetic characters, any single character is treated as matching
+exactly that character. For example, `{1{}` matches exactly one `{`
+character. Or, `{1.}` matches exactly one `.` character.
+
+Patterns can also use any Unicode property name. Some helpful ones are:
+
+- `hex` - Hexidecimal digits
+- `lower` - Lowercase letters
+- `space` - The space character
+- `upper` - Uppercase letters
+- `whitespace` - Whitespace characters
+
+Patterns may also use exact Unicode codepoint names. For example: `{1 latin
+small letter A}` matches `a`.
+
+## Negating Patterns
+
+If an exclamation mark (`!`) is placed before a pattern's name, then characters
+are matched only when they _don't_ match the pattern. For example, `{!alpha}`
+will match all characters _except_ alphabetic ones.
+
+## Interpolating Text and Escaping
+
+To escape a character in a pattern (e.g. if you want to match the literal
+character `?`), you can use the syntax `{1 ?}`. This is almost never necessary
+unless you have text that looks like a Tomo text pattern and has something like
+`{` or `(?)` inside it.
+
+However, if you're trying to do an exact match of arbitrary text values, you'll
+want to have the text automatically escaped. Fortunately, Tomo's injection-safe
+DSL text interpolation supports automatic text escaping. This means that if you
+use text interpolation with the `$` sign to insert a text value, the value will
+be automatically escaped using the `{1 ?}` rule described above:
+
+```tomo
+# Risk of code injection (would cause an error because 'xxx' is not a valid
+# pattern name:
+>> user_input := get_user_input()
+= "{xxx}"
+
+# Interpolation automatically escapes:
+>> $/$user_input/
+= $/{1{}..xxx}/
+
+# This is: `{ 1{ }` (one open brace) followed by the literal text "..xxx}"
+
+# No error:
+>> some_text.find($/$user_input/)
+= 0
+```
+
+If you prefer, you can also use this to insert literal characters:
+
+```tomo
+>> $/literal $"{..}"/
+= $/literal {1{}..}/
+```
+
+## Repetitions
+
+By default, named patterns match 1 or more repetitions, but you can specify how
+many repetitions you want by putting a number or range of numbers first using
+`n` (exactly `n` repetitions), `n-m` (between `n` and `m` repetitions), or `n+`
+(`n` or more repetitions):
+
+```
+{4-5 alpha}
+0x{hex}
+{4 digit}-{2 digit}-{2 digit}
+{2+ space}
+{0-1 question mark}
+```
+
+
+# Methods
+
+### `by_pattern`
+Returns an iterator function that yields `PatternMatch` objects for each occurrence.
+
+```tomo
+func by_pattern(text:Text, pattern:Pat -> func(->PatternMatch?))
+```
+
+- `text`: The text to search.
+- `pattern`: The pattern to match.
+
+**Returns:**
+An iterator function that yields `PatternMatch` objects one at a time.
+
+**Example:**
+```tomo
+text := "one, two, three"
+for word in text.by_pattern($Pat"{id}"):
+ say(word.text)
+```
+
+---
+
+### `by_pattern_split`
+Returns an iterator function that yields text segments split by a pattern.
+
+```tomo
+func by_pattern_split(text:Text, pattern:Pat -> func(->Text?))
+```
+
+- `text`: The text to split.
+- `pattern`: The pattern to use as a separator.
+
+**Returns:**
+An iterator function that yields text segments.
+
+**Example:**
+```tomo
+text := "one two three"
+for word in text.by_pattern_split($Pat"{whitespace}"):
+ say(word.text)
+```
+
+---
+
+### `each_pattern`
+Applies a function to each occurrence of a pattern in the text.
+
+```tomo
+func each_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch), recursive=yes)
+```
+
+- `text`: The text to search.
+- `pattern`: The pattern to match.
+- `fn`: The function to apply to each match.
+- `recursive`: If `yes`, applies the function recursively on modified text.
+
+**Example:**
+```tomo
+text := "one two three"
+text.each_pattern($Pat"{id}", func(m:PatternMatch):
+ say(m.txt)
+)
+```
+
+---
+
+### `find_patterns`
+Finds all occurrences of a pattern in a text and returns them as `PatternMatch` objects.
+
+```tomo
+func find_patterns(text:Text, pattern:Pat -> [PatternMatch])
+```
+
+- `text`: The text to search.
+- `pattern`: The pattern to match.
+
+**Returns:**
+A list of `PatternMatch` objects.
+
+**Example:**
+```tomo
+text := "one! two three!"
+>> text.find_patterns($Pat"{id}!")
+= [PatternMatch(text="one!", index=1, captures=["one"]), PatternMatch(text="three!", index=10, captures=["three"])]
+```
+
+---
+
+### `has_pattern`
+Checks whether a given pattern appears in the text.
+
+```tomo
+func has_pattern(text:Text, pattern:Pat -> Bool)
+```
+
+- `text`: The text to search.
+- `pattern`: The pattern to check for.
+
+**Returns:**
+`yes` if a match is found, otherwise `no`.
+
+**Example:**
+```tomo
+text := "...okay..."
+>> text.has_pattern($Pat"{id}")
+= yes
+```
+
+---
+
+### `map_pattern`
+Transforms matches of a pattern using a mapping function.
+
+```tomo
+func map_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch -> Text), recursive=yes -> Text)
+```
+
+- `text`: The text to modify.
+- `pattern`: The pattern to match.
+- `fn`: A function that transforms matches.
+- `recursive`: If `yes`, applies transformations recursively.
+
+**Returns:**
+A new text with the transformed matches.
+
+**Example:**
+```tomo
+text := "I have #apples and #oranges and #plums"
+fruits := {"apples"=4, "oranges"=5}
+>> text.map_pattern($Pat'#{id}', func(match:PatternMatch):
+ fruit := match.captures[1]
+ "$(fruits[fruit] or 0) $fruit"
+)
+= "I have 4 apples and 5 oranges and 0 plums"
+```
+
+---
+
+### `matches_pattern`
+Returns whether or not text matches a pattern completely.
+
+```tomo
+func matches_pattern(text:Text, pattern:Pat -> Bool)
+```
+
+- `text`: The text to match against.
+- `pattern`: The pattern to match.
+
+**Returns:**
+`yes` if the whole text matches the pattern, otherwise `no`.
+
+**Example:**
+```tomo
+>> "Hello!!!".matches_pattern($Pat"{id}")
+= no
+>> "Hello".matches_pattern($Pat"{id}")
+= yes
+```
+
+---
+
+### `pattern_captures`
+Returns a list of pattern captures for the given pattern.
+
+```tomo
+func pattern_captures(text:Text, pattern:Pat -> [Text]?)
+```
+
+- `text`: The text to match against.
+- `pattern`: The pattern to match.
+
+**Returns:**
+An optional list of matched pattern captures. Returns `none` if the text does
+not match the pattern.
+
+**Example:**
+```tomo
+>> "123 boxes".pattern_captures($Pat"{int} {id}")
+= ["123", "boxes"]?
+>> "xxx".pattern_captures($Pat"{int} {id}")
+= none
+```
+
+---
+
+### `replace_pattern`
+Replaces occurrences of a pattern with a replacement text, supporting backreferences.
+
+```tomo
+func replace_pattern(text:Text, pattern:Pat, replacement:Text, backref="@", recursive=yes -> Text)
+```
+
+- `text`: The text to modify.
+- `pattern`: The pattern to match.
+- `replacement`: The text to replace matches with.
+- `backref`: The symbol for backreferences in the replacement.
+- `recursive`: If `yes`, applies replacements recursively.
+
+**Returns:**
+A new text with replacements applied.
+
+**Example:**
+```tomo
+>> "I have 123 apples and 456 oranges".replace_pattern($Pat"{int}", "some")
+= "I have some apples and some oranges"
+
+>> "I have 123 apples and 456 oranges".replace_pattern($Pat"{int}", "(@1)")
+= "I have (123) apples and (456) oranges"
+
+>> "I have 123 apples and 456 oranges".replace_pattern($Pat"{int}", "(?1)", backref="?")
+= "I have (123) apples and (456) oranges"
+
+>> "bad(fn(), bad(notbad))".replace_pattern($Pat"bad(?)", "good(@1)")
+= "good(fn(), good(notbad))"
+
+>> "bad(fn(), bad(notbad))".replace_pattern($Pat"bad(?)", "good(@1)", recursive=no)
+= "good(fn(), bad(notbad))"
+```
+
+---
+
+### `split_pattern`
+Splits a text into segments using a pattern as the delimiter.
+
+```tomo
+func split_pattern(text:Text, pattern:Pat -> [Text])
+```
+
+- `text`: The text to split.
+- `pattern`: The pattern to use as a separator.
+
+**Returns:**
+A list of text segments.
+
+**Example:**
+```tomo
+>> "one two three".split_pattern($Pat"{whitespace}")
+= ["one", "two", "three"]
+```
+
+---
+
+### `translate_patterns`
+Replaces multiple patterns using a mapping of patterns to replacement texts.
+
+```tomo
+func translate_patterns(text:Text, replacements:{Pat,Text}, backref="@", recursive=yes -> Text)
+```
+
+- `text`: The text to modify.
+- `replacements`: A table mapping patterns to their replacements.
+- `backref`: The symbol for backreferences in replacements.
+- `recursive`: If `yes`, applies replacements recursively.
+
+**Returns:**
+A new text with all specified replacements applied.
+
+**Example:**
+```tomo
+>> text := "foo(x, baz(1))"
+>> text.translate_patterns({
+ $Pat"{id}(?)"="call(fn('@1'), @2)",
+ $Pat"{id}"="var('@1')",
+ $Pat"{int}"="int(@1)",
+})
+= "call(fn('foo'), var('x'), call(fn('baz'), int(1)))"
+```
+
+---
+
+### `trim_pattern`
+Removes matching patterns from the beginning and/or end of a text.
+
+```tomo
+func trim_pattern(text:Text, pattern=$Pat"{space}", left=yes, right=yes -> Text)
+```
+
+- `text`: The text to trim.
+- `pattern`: The pattern to trim (defaults to whitespace).
+- `left`: If `yes`, trims from the beginning.
+- `right`: If `yes`, trims from the end.
+
+**Returns:**
+The trimmed text.
+
+**Example:**
+```tomo
+>> "123abc456".trim_pattern($Pat"{digit}")
+= "abc"
+```