Add recursive argument to text:each() and text:map(), plus update docs

Bruce Hill 2025-03-03 13:45:30 -05:00
parent 80475ad02d
commit f330f06c21
6 changed files with 236 additions and 192 deletions

View File

@ -33,6 +33,7 @@ Information about Tomo's built-in types can be found here:
- [Structs](structs.md)
- [Tables](tables.md)
- [Text](text.md)
- [Text Pattern Matching](patterns.md)
- [Threads](threads.md)
## Built-in Functions

docs/patterns.md (new file, 153 lines)

@ -0,0 +1,153 @@
# Text Pattern Matching
As an alternative to full regular expressions, Tomo provides a limited string
matching pattern syntax that is intended to solve 80% of use cases in under 1%
of the code size (PCRE's codebase is roughly 150k lines of code, and Tomo's
pattern matching code is a bit under 1k lines of code). Tomo's pattern matching
syntax is highly readable and works well for matching literal text without
getting [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome).
For more advanced use cases, consider linking against a C library for regular
expressions or pattern matching.
`Pattern` is a [domain-specific language](docs/langs.md); in other words, it's
like a `Text`, but it has a distinct type. As a convenience, you can use
`$/.../` to write pattern literals instead of using the general-purpose DSL
syntax of `$Pattern"..."`.
Patterns are used in a small but powerful API that covers many text operations
that would otherwise require a more extensive API:
```
Text.has(pattern:Pattern -> Bool)
Text.each(pattern:Pattern, fn:func(m:Match), recursive=yes)
Text.find(pattern:Pattern, start=1 -> Match?)
Text.find_all(pattern:Pattern -> [Match])
Text.matches(pattern:Pattern -> [Text]?)
Text.map(pattern:Pattern, fn:func(m:Match -> Text), recursive=yes -> Text)
Text.replace(pattern:Pattern, replacement:Text, placeholder:Pattern=$//, recursive=yes -> Text)
Text.replace_all(replacements:{Pattern,Text}, placeholder:Pattern=$//, recursive=yes -> Text)
Text.split(pattern:Pattern -> [Text])
Text.trim(pattern=$/{whitespace}/, trim_left=yes, trim_right=yes -> Text)
```
## Matches
Pattern matching functions work with a type called `Match` that has three fields:
- `text`: The full text of the match.
- `index`: The index in the text where the match was found.
- `captures`: An array containing the matching text of each non-literal pattern group.
See [Text Functions](text.md#Text-Functions) for the full API documentation.
## Syntax
Patterns have three types of syntax:
- `{` followed by an optional count (`n`, `n-m`, or `n+`), followed by an
optional `!` to negate the pattern, followed by an optional pattern name or
Unicode character name, followed by a required `}`.
- Any matching pair of quotes or parentheses or braces with a `?` in the middle
(e.g. `"?"` or `(?)`).
- Any other character is treated as a literal to be matched exactly.
## Named Patterns
Named patterns match certain pre-defined patterns that are commonly useful. To
use a named pattern, use the syntax `{name}`. Names are case-insensitive and
mostly ignore spaces, underscores, and dashes.
- `..` - Any character (note that a single `.` would mean the literal period
character).
- `digit` - A unicode digit
- `email` - an email address
- `emoji` - an emoji
- `end` - the very end of the text
- `id` - A unicode identifier
- `int` - One or more digits with an optional `-` (minus sign) in front
- `ip` - an IP address (IPv4 or IPv6)
- `ipv4` - an IPv4 address
- `ipv6` - an IPv6 address
- `nl`/`newline`/`crlf` - A line break (either `\r\n` or `\n`)
- `num` - One or more digits with an optional `-` (minus sign) in front and an optional `.` and more digits after
- `start` - the very start of the text
- `uri` - a URI
- `url` - a URL (URI that specifically starts with `http://`, `https://`, `ws://`, `wss://`, or `ftp://`)
- `word` - A unicode identifier (same as `id`)
For non-alphabetic characters, any single character is treated as matching
exactly that character. For example, `{1{}` matches exactly one `{`
character. Or, `{1.}` matches exactly one `.` character.
Patterns can also use any Unicode property name. Some helpful ones are:
- `hex` - Hexadecimal digits
- `lower` - Lowercase letters
- `space` - The space character
- `upper` - Uppercase letters
- `whitespace` - Whitespace characters
Patterns may also use exact Unicode codepoint names. For example: `{1 latin
small letter A}` matches `a`.
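The name matching described in this section (case-insensitive, ignoring spaces, underscores, and dashes) can be sketched in C roughly as follows. This is only an illustration, not Tomo's actual lookup code: `pattern_names_equal` and `is_separator` are hypothetical names, and the sketch handles ASCII case folding only.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical helper (not part of Tomo): true for characters that the
// name comparison skips over.
static bool is_separator(char c) {
    return c == ' ' || c == '_' || c == '-';
}

// Hypothetical helper (not part of Tomo): compare two pattern names while
// ignoring ASCII case, spaces, underscores, and dashes.
bool pattern_names_equal(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    for (;;) {
        // Skip separators on both sides before comparing.
        while (is_separator(*a)) a++;
        while (is_separator(*b)) b++;
        // If either name has ended, they match only if both have ended.
        if (*a == '\0' || *b == '\0')
            return *a == *b;
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
}
```

Under these assumptions, `pattern_names_equal("Latin_Small-Letter A", "latin small letter a")` is true, while `"digit"` and `"digits"` compare as different names.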
## Negating Patterns
If an exclamation mark (`!`) is placed before a pattern's name, then characters
are matched only when they _don't_ match the pattern. For example, `{!alpha}`
will match all characters _except_ alphabetic ones.
## Interpolating Text and Escaping
To escape a character in a pattern (e.g. if you want to match the literal
character `?`), you can use the syntax `{1 ?}`. This is almost never necessary
unless you have text that looks like a Tomo text pattern and has something like
`{` or `(?)` inside it.
However, if you're trying to do an exact match of arbitrary text values, you'll
want to have the text automatically escaped. Fortunately, Tomo's injection-safe
DSL text interpolation supports automatic text escaping. This means that if you
use text interpolation with the `$` sign to insert a text value, the value will
be automatically escaped using the `{1 ?}` rule described above:
```tomo
# Risk of code injection (would cause an error because 'xxx' is not a valid
# pattern name):
>> user_input := get_user_input()
= "{xxx}"
# Interpolation automatically escapes:
>> $/$user_input/
= $/{1{}xxx}/
# This is: `{1{}` (one open brace) followed by the literal text "xxx}"
# No error:
>> some_text:find($/$user_input/)
= 0
```
If you prefer, you can also use this to insert literal characters:
```tomo
>> $/literal $"{..}"/
= $/literal {1{}..}/
```
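To make the escaping rule concrete, here is a minimal C sketch that rewrites each open brace as `{1{}` so that it matches a literal `{`. It is an assumption-laden illustration rather than Tomo's implementation: `escape_braces` is a hypothetical name, and the sketch only handles `{`, not the `?` forms mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper (not Tomo's implementation): copy `in` to `out`,
// rewriting each `{` as `{1{}` so it matches a literal open brace.
// Returns the escaped length, or (size_t)-1 if `out` is too small.
size_t escape_braces(const char *in, char *out, size_t cap) {
    assert(in != NULL && out != NULL);
    size_t n = 0;
    for (; *in != '\0'; in++) {
        const char *piece = (*in == '{') ? "{1{}" : NULL;
        size_t len = piece ? strlen(piece) : 1;
        // Always leave room for the terminating NUL.
        if (n + len + 1 > cap)
            return (size_t)-1;
        if (piece)
            memcpy(out + n, piece, len);
        else
            out[n] = *in;
        n += len;
    }
    if (n + 1 > cap)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

With a large enough buffer, escaping the text `{xxx}` this way yields `{1{}xxx}`: only the open brace needs rewriting, because a `}` with no preceding unescaped `{` is just a literal character.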
## Repetitions
By default, named patterns match 1 or more repetitions, but you can specify how
many repetitions you want by putting a number or range of numbers first using
`n` (exactly `n` repetitions), `n-m` (between `n` and `m` repetitions), or `n+`
(`n` or more repetitions):
```
{4-5 alpha}
0x{hex}
{4 digit}-{2 digit}-{2 digit}
{2+ space}
{0-1 question mark}
```
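The count forms above (`n`, `n-m`, `n+`, or nothing) could be parsed along these lines. This is a hedged sketch, not the parser Tomo uses: `repeat_t` and `parse_repeat` are hypothetical names, and `LONG_MAX` is an arbitrary choice for representing "no upper bound".

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

// Hypothetical type, for illustration only.
typedef struct {
    long min;
    long max; // LONG_MAX means "no upper bound"
} repeat_t;

// Parse an optional count prefix (`n`, `n-m`, or `n+`) at `*s`.
// With no count, fall back to the documented default of one or more.
// On return, `*s` points just past the count (unchanged if none was given).
repeat_t parse_repeat(const char **s) {
    assert(s != NULL && *s != NULL);
    repeat_t r = {1, LONG_MAX};
    const char *p = *s;
    if (*p < '0' || *p > '9')
        return r; // no count given
    char *end;
    r.min = strtol(p, &end, 10);
    r.max = r.min; // a plain `n` means exactly n repetitions
    if (*end == '+') {
        r.max = LONG_MAX; // `n+`: n or more
        end++;
    } else if (*end == '-' && end[1] >= '0' && end[1] <= '9') {
        r.max = strtol(end + 1, &end, 10); // `n-m`: between n and m
    }
    *s = end;
    return r;
}
```

For the body of `{4-5 alpha}`, this returns `min = 4`, `max = 5` and leaves the cursor on the space before `alpha`; for `{hex}` it returns the default of one or more without consuming anything.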

View File

@ -264,153 +264,9 @@ finding the value because the two texts are equivalent under normalization.
# Patterns
As an alternative to full regular expressions, Tomo provides a limited string
matching pattern syntax that is intended to solve 80% of use cases in under 1%
of the code size (PCRE's codebase is roughly 150k lines of code, and Tomo's
pattern matching code is a bit under 1k lines of code). Tomo's pattern matching
syntax is highly readable and works well for matching literal text without
getting [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome).
For more advanced use cases, consider linking against a C library for regular
expressions or pattern matching.
`Pattern` is a [domain-specific language](docs/langs.md), in other words, it's
like a `Text`, but it has a distinct type. As a convenience, you can use
`$/.../` to write pattern literals instead of using the general-purpose DSL
syntax of `$Pattern"..."`.
Patterns are used in a small, but very powerful API that handles many text
functions that would normally be handled by a more extensive API:
```
Text.has(pattern:Pattern -> Bool)
Text.find(pattern:Pattern, start=1 -> Match?)
Text.find_all(pattern:Pattern -> [Match])
Text.matches(pattern:Pattern -> [Text]?)
Text.map(pattern:Pattern, fn:func(m:Match -> Text) -> Text)
Text.replace(pattern:Pattern, replacement:Text, placeholder:Pattern=$//, recursive=yes -> [Text])
Text.replace_all(replacements:{Pattern,Text}, placeholder:Pattern=$//, recursive=yes -> [Text])
Text.split(pattern:Pattern -> [Text])
Text.trim(pattern=$/{whitespace}/, trim_left=yes, trim_right=yes -> [Text])
```
Pattern matching functions work with a type called `Match` that has three fields:
- `text`: The full text of the match.
- `index`: The index in the text where the match was found.
- `captures`: An array containing the matching text of each non-literal pattern group.
See [Text Functions](#Text-Functions) for the full API documentation.
## Syntax
Patterns have three types of syntax:
- `{` followed by an optional count (`n`, `n-m`, or `n+`), followed by an
optional `!` to negate the pattern, followed by an optional pattern name or
Unicode character name, followed by a required `}`.
- Any matching pair of quotes or parentheses or braces with a `?` in the middle
(e.g. `"?"` or `(?)`).
- Any other character is treated as a literal to be matched exactly.
## Named Patterns
Named patterns match certain pre-defined patterns that are commonly useful. To
use a named pattern, use the syntax `{name}`. Names are case-insensitive and
mostly ignore spaces, underscores, and dashes.
- `..` - Any character (note that a single `.` would mean the literal period
character).
- `digit` - A unicode digit
- `email` - an email address
- `emoji` - an emoji
- `end` - the very end of the text
- `id` - A unicode identifier
- `int` - One or more digits with an optional `-` (minus sign) in front
- `ip` - an IP address (IPv4 or IPv6)
- `ipv4` - an IPv4 address
- `ipv6` - an IPv6 address
- `nl`/`newline`/`crlf` - A line break (either `\r\n` or `\n`)
- `num` - One or more digits with an optional `-` (minus sign) in front and an optional `.` and more digits after
- `start` - the very start of the text
- `uri` - a URI
- `url` - a URL (URI that specifically starts with `http://`, `https://`, `ws://`, `wss://`, or `ftp://`)
- `word` - A unicode identifier (same as `id`)
For non-alphabetic characters, any single character is treated as matching
exactly that character. For example, `{1{}` matches exactly one `{`
character. Or, `{1.}` matches exactly one `.` character.
Patterns can also use any Unicode property name. Some helpful ones are:
- `hex` - Hexadecimal digits
- `lower` - Lowercase letters
- `space` - The space character
- `upper` - Uppercase letters
- `whitespace` - Whitespace characters
Patterns may also use exact Unicode codepoint names. For example: `{1 latin
small letter A}` matches `a`.
## Negating Patterns
If an exclamation mark (`!`) is placed before a pattern's name, then characters
are matched only when they _don't_ match the pattern. For example, `{!alpha}`
will match all characters _except_ alphabetic ones.
## Interpolating Text and Escaping
To escape a character in a pattern (e.g. if you want to match the literal
character `?`), you can use the syntax `{1 ?}`. This is almost never necessary
unless you have text that looks like a Tomo text pattern and has something like
`{` or `(?)` inside it.
However, if you're trying to do an exact match of arbitrary text values, you'll
want to have the text automatically escaped. Fortunately, Tomo's injection-safe
DSL text interpolation supports automatic text escaping. This means that if you
use text interpolation with the `$` sign to insert a text value, the value will
be automatically escaped using the `{1 ?}` rule described above:
```tomo
# Risk of code injection (would cause an error because 'xxx' is not a valid
# pattern name:
>> user_input := get_user_input()
= "{xxx}"
# Interpolation automatically escapes:
>> $/$user_input/
= $/{1{}..xxx}/
# This is: `{ 1{ }` (one open brace) followed by the literal text "..xxx}"
# No error:
>> some_text:find($/$user_input/)
= 0
```
If you prefer, you can also use this to insert literal characters:
```tomo
>> $/literal $"{..}"/
= $/literal {1{}..}/
```
## Repetitions
By default, named patterns match 1 or more repetitions, but you can specify how
many repetitions you want by putting a number or range of numbers first using
`n` (exactly `n` repetitions), `n-m` (between `n` and `m` repetitions), or `n+`
(`n` or more repetitions):
```
{4-5 alpha}
0x{hex}
{4 digit}-{2 digit}-{2 digit}
{2+ space}
{0-1 question mark}
```
Texts use a custom pattern matching syntax for text matching and replacement as
a lightweight, but powerful alternative to regular expressions. See [the
pattern documentation](patterns.md) for more details.
# Text Functions
@ -515,7 +371,7 @@ func by_match(text: Text, pattern: Pattern -> func(->Match?))
**Parameters:**
- `text`: The text to be iterated over looking for matches.
- `pattern`: The pattern to look for.
- `pattern`: The [pattern](patterns.md) to look for.
**Returns:**
An iterator function that returns one match result at a time, until it runs out
@ -546,7 +402,7 @@ func by_split(text: Text, pattern: Pattern = $// -> func(->Text?))
**Parameters:**
- `text`: The text to be iterated over in pattern-delimited chunks.
- `pattern`: The pattern to split the text on.
- `pattern`: The [pattern](patterns.md) to split the text on.
**Returns:**
An iterator function that returns one chunk of text at a time, separated by the
@ -639,6 +495,37 @@ An array of 32-bit integer Unicode code points (`[Int32]`).
---
## `each`
**Description:**
Iterates over each match of a [pattern](patterns.md) and passes the match to
the given function.
**Signature:**
```tomo
func each(text: Text, pattern: Pattern, fn: func(m: Match), recursive: Bool = yes)
```
**Parameters:**
- `text`: The text to be searched.
- `pattern`: The [pattern](patterns.md) to search for.
- `fn`: A function to be called on each match that was found.
- `recursive`: For each match, if recursive is set to `yes`, then call `each()`
recursively on its captures before calling `fn` on the match.
**Returns:**
None.
**Example:**
```tomo
>> " #one #two #three ":each($/#{word}/, func(m:Match):
    say("Found word $(m.captures[1])")
)
```
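The effect of `recursive` can be illustrated with a toy C sketch that uses balanced parentheses as a stand-in for a capturing pattern. It is not the real `Text$each`: `each_paren`, `visit_fn`, and `count_match` are hypothetical names, and the only "pattern" supported is a `(...)` group whose capture is the text between the parentheses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical callback type: receives one match as a (start, length) slice.
typedef void (*visit_fn)(const char *match, size_t len, void *ctx);

// Toy stand-in for a capturing pattern: visit every balanced `(...)`
// group in text[0..len). When `recursive` is set, the text inside the
// parentheses (the capture) is searched first, so inner groups are
// reported before the group that contains them.
void each_paren(const char *text, size_t len, bool recursive,
                visit_fn fn, void *ctx) {
    assert(text != NULL && fn != NULL);
    size_t i = 0;
    while (i < len) {
        if (text[i] != '(') { i++; continue; }
        // Find the matching close paren, tracking nesting depth.
        size_t depth = 0, j = i;
        for (; j < len; j++) {
            if (text[j] == '(') depth++;
            else if (text[j] == ')' && --depth == 0) break;
        }
        if (j == len) { i++; continue; } // unbalanced: no match here
        if (recursive)
            each_paren(text + i + 1, j - i - 1, recursive, fn, ctx);
        fn(text + i, j - i + 1, ctx);
        i = j + 1; // resume after the whole match
    }
}

// Example callback: count how many matches were visited.
void count_match(const char *match, size_t len, void *ctx) {
    (void)match;
    (void)len;
    ++*(int *)ctx;
}
```

On the text `a (b (c)) d`, the recursive form visits `(c)` and then `(b (c))`, while the non-recursive form visits only the outer `(b (c))`, because scanning resumes after each complete match.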
---
## `ends_with`
**Description:**
@ -780,8 +667,8 @@ A new text based on the input UTF8 bytes after normalization has been applied.
## `find`
**Description:**
Finds the first occurrence of a pattern in the given text (if any).
See: [Patterns](#Patterns) for more information on patterns.
Finds the first occurrence of a [pattern](patterns.md) in the given text (if
any).
**Signature:**
```tomo
@ -791,12 +678,12 @@ func find(text: Text, pattern: Pattern, start: Int = 1 -> Int?)
**Parameters:**
- `text`: The text to be searched.
- `pattern`: The pattern to search for.
- `pattern`: The [pattern](patterns.md) to search for.
- `start`: The index to start the search.
**Returns:**
`!Match` if the target pattern is not found, otherwise a `Match` struct
containing information about the match.
`!Match` if the target [pattern](patterns.md) is not found, otherwise a `Match`
struct containing information about the match.
**Example:**
```tomo
@ -815,8 +702,7 @@ containing information about the match.
## `find_all`
**Description:**
Finds all occurrences of a pattern in the given text.
See: [Patterns](#Patterns) for more information on patterns.
Finds all occurrences of a [pattern](patterns.md) in the given text.
**Signature:**
```tomo
@ -826,10 +712,10 @@ func find_all(text: Text, pattern: Pattern -> [Match])
**Parameters:**
- `text`: The text to be searched.
- `pattern`: The pattern to search for.
- `pattern`: The [pattern](patterns.md) to search for.
**Returns:**
An array of every match of the pattern in the given text.
An array of every match of the [pattern](patterns.md) in the given text.
Note: if `text` or `pattern` is empty, an empty array will be returned.
**Example:**
@ -887,7 +773,7 @@ the length of the string.
## `has`
**Description:**
Checks if the `Text` contains a target pattern (see: [Patterns](#Patterns)).
Checks if the `Text` contains a target [pattern](patterns.md).
**Signature:**
```tomo
@ -897,7 +783,7 @@ func has(text: Text, pattern: Pattern -> Bool)
**Parameters:**
- `text`: The text to be searched.
- `pattern`: The pattern to search for.
- `pattern`: The [pattern](patterns.md) to search for.
**Returns:**
`yes` if the target pattern is found, `no` otherwise.
@ -1004,9 +890,9 @@ The lowercase version of the text.
## `matches`
**Description:**
Checks if the `Text` matches target pattern (see: [Patterns](#Patterns)) and
returns an array of the matching text captures or a null value if the entire
text doesn't match the pattern.
Checks if the `Text` matches the target [pattern](patterns.md) and returns an
array of the matching text captures, or a null value if the entire text doesn't
match the pattern.
**Signature:**
```tomo
@ -1016,7 +902,7 @@ func matches(text: Text, pattern: Pattern -> [Text])
**Parameters:**
- `text`: The text to be searched.
- `pattern`: The pattern to search for.
- `pattern`: The [pattern](patterns.md) to search for.
**Returns:**
An array of the matching text captures if the entire text matches the pattern,
@ -1036,19 +922,21 @@ or a null value otherwise.
## `map`
**Description:**
For each occurrence of the given pattern, replace the text with the result of
calling the given function on that match.
For each occurrence of the given [pattern](patterns.md), replace the text with
the result of calling the given function on that match.
**Signature:**
```tomo
func map(text: Text, pattern: Pattern, fn: func(text:Match)->Text -> Text)
func map(text: Text, pattern: Pattern, fn: func(text:Match)->Text, recursive: Bool = yes -> Text)
```
**Parameters:**
- `text`: The text to be searched.
- `pattern`: The pattern to search for.
- `pattern`: The [pattern](patterns.md) to search for.
- `fn`: The function to apply to each match.
- `recursive`: Whether to recursively map `fn` to each of the captures of the
pattern before handing them to `fn`.
**Returns:**
The text with the matching parts replaced with the result of applying the given
@ -1119,9 +1007,8 @@ The text repeated the given number of times.
## `replace`
**Description:**
Replaces occurrences of a pattern in the text with a replacement string.
See [Patterns](#patterns) for more information about patterns.
Replaces occurrences of a [pattern](patterns.md) in the text with a replacement
string.
**Signature:**
```tomo
@ -1131,7 +1018,7 @@ func replace(text: Text, pattern: Pattern, replacement: Text, backref: Pattern =
**Parameters:**
- `text`: The text in which to perform replacements.
- `pattern`: The pattern to be replaced.
- `pattern`: The [pattern](patterns.md) to be replaced.
- `replacement`: The text to replace the pattern with.
- `backref`: If non-empty, the replacement text will have occurrences of this
pattern followed by a number replaced with the corresponding backreference.
@ -1186,11 +1073,12 @@ The text with occurrences of the pattern replaced.
## `replace_all`
**Description:**
Takes a table mapping patterns to replacement texts and performs all the
replacements in the table on the whole text. At each position, the first
matching pattern's replacement is applied and the pattern matching moves on to
*after* the replacement text, so replacement text is not recursively modified.
See [`replace()`](#replace) for more information about replacement behavior.
Takes a table mapping [patterns](patterns.md) to replacement texts and performs
all the replacements in the table on the whole text. At each position, the
first matching pattern's replacement is applied and the pattern matching moves
on to *after* the replacement text, so replacement text is not recursively
modified. See [`replace()`](#replace) for more information about replacement
behavior.
**Signature:**
```tomo
@ -1200,8 +1088,8 @@ func replace_all(replacements:{Pattern,Text}, backref: Pattern = $/\/, recursive
**Parameters:**
- `text`: The text in which to perform replacements.
- `replacements`: A table mapping from patterns to the replacement text
associated with that pattern.
- `replacements`: A table mapping from [pattern](patterns.md) to the
replacement text associated with that pattern.
- `backref`: If non-empty, the replacement text will have occurrences of this
pattern followed by a number replaced with the corresponding backreference.
By default, the backreference pattern is a single backslash, so
@ -1295,8 +1183,7 @@ the string.
## `split`
**Description:**
Splits the text into an array of substrings based on a pattern.
See [Patterns](#patterns) for more information about patterns.
Splits the text into an array of substrings based on a [pattern](patterns.md).
**Signature:**
```tomo
@ -1306,8 +1193,8 @@ func split(text: Text, pattern: Pattern = "" -> [Text])
**Parameters:**
- `text`: The text to be split.
- `pattern`: The pattern used to split the text. If the pattern is the empty
string, the text will be split into individual grapheme clusters.
- `pattern`: The [pattern](patterns.md) used to split the text. If the pattern
is the empty string, the text will be split into individual grapheme clusters.
**Returns:**
An array of substrings resulting from the split.
@ -1415,8 +1302,7 @@ the string.
## `trim`
**Description:**
Trims the matching pattern from the left and/or right side of the text
See [Patterns](#patterns) for more information about patterns.
Trims the matching [pattern](patterns.md) from the left and/or right side of the text.
**Signature:**
```tomo
@ -1426,7 +1312,7 @@ func trim(text: Text, pattern: Pattern = $/{whitespace/, trim_left: Bool = yes,
**Parameters:**
- `text`: The text to be trimmed.
- `pattern`: The pattern that will be trimmed away.
- `pattern`: The [pattern](patterns.md) that will be trimmed away.
- `trim_left`: Whether or not to trim from the front of the text.
- `trim_right`: Whether or not to trim from the back of the text.

View File

@ -393,7 +393,7 @@ env_t *new_compilation_unit(CORD libname)
{"bytes", "Text$utf8_bytes", "func(text:Text -> [Byte])"},
{"codepoint_names", "Text$codepoint_names", "func(text:Text -> [Text])"},
{"ends_with", "Text$ends_with", "func(text,suffix:Text -> Bool)"},
{"each", "Text$each", "func(text:Text, pattern:Pattern, fn:func(match:Match))"},
{"each", "Text$each", "func(text:Text, pattern:Pattern, fn:func(match:Match), recursive=yes)"},
{"find", "Text$find", "func(text:Text, pattern:Pattern, start=1 -> Match?)"},
{"find_all", "Text$find_all", "func(text:Text, pattern:Pattern -> [Match])"},
{"from", "Text$from", "func(text:Text, first:Int -> Text)"},
@ -406,7 +406,7 @@ env_t *new_compilation_unit(CORD libname)
{"join", "Text$join", "func(glue:Text, pieces:[Text] -> Text)"},
{"lines", "Text$lines", "func(text:Text -> [Text])"},
{"lower", "Text$lower", "func(text:Text -> Text)"},
{"map", "Text$map", "func(text:Text, pattern:Pattern, fn:func(match:Match -> Text) -> Text)"},
{"map", "Text$map", "func(text:Text, pattern:Pattern, fn:func(match:Match -> Text), recursive=yes -> Text)"},
{"matches", "Text$matches", "func(text:Text, pattern:Pattern -> [Text]?)"},
{"quoted", "Text$quoted", "func(text:Text, color=no -> Text)"},
{"repeat", "Text$repeat", "func(text:Text, count:Int -> Text)"},

View File

@ -1042,7 +1042,7 @@ public Text_t Text$trim(Text_t text, Pattern_t pattern, bool trim_left, bool tri
return Text$slice(text, I(first+1), I(last+1));
}
public Text_t Text$map(Text_t text, Pattern_t pattern, Closure_t fn)
public Text_t Text$map(Text_t text, Pattern_t pattern, Closure_t fn, bool recursive)
{
Text_t ret = EMPTY_TEXT;
@ -1073,6 +1073,8 @@ public Text_t Text$map(Text_t text, Pattern_t pattern, Closure_t fn)
};
for (int i = 0; captures[i].occupied; i++) {
Text_t capture = Text$slice(text, I(captures[i].index+1), I(captures[i].index+captures[i].length));
if (recursive)
capture = Text$map(capture, pattern, fn, recursive);
Array$insert(&m.captures, &capture, I(0), sizeof(Text_t));
}
@ -1093,7 +1095,7 @@ public Text_t Text$map(Text_t text, Pattern_t pattern, Closure_t fn)
return ret;
}
public void Text$each(Text_t text, Pattern_t pattern, Closure_t fn)
public void Text$each(Text_t text, Pattern_t pattern, Closure_t fn, bool recursive)
{
int32_t first_grapheme = Text$get_grapheme(pattern, 0);
bool find_first = (first_grapheme != '{'
@ -1120,6 +1122,8 @@ public void Text$each(Text_t text, Pattern_t pattern, Closure_t fn)
};
for (int i = 0; captures[i].occupied; i++) {
Text_t capture = Text$slice(text, I(captures[i].index+1), I(captures[i].index+captures[i].length));
if (recursive)
Text$each(capture, pattern, fn, recursive);
Array$insert(&m.captures, &capture, I(0), sizeof(Text_t));
}

View File

@ -34,8 +34,8 @@ Array_t Text$find_all(Text_t text, Pattern_t pattern);
Closure_t Text$by_match(Text_t text, Pattern_t pattern);
PUREFUNC bool Text$has(Text_t text, Pattern_t pattern);
OptionalArray_t Text$matches(Text_t text, Pattern_t pattern);
Text_t Text$map(Text_t text, Pattern_t pattern, Closure_t fn);
void Text$each(Text_t text, Pattern_t pattern, Closure_t fn);
Text_t Text$map(Text_t text, Pattern_t pattern, Closure_t fn, bool recursive);
void Text$each(Text_t text, Pattern_t pattern, Closure_t fn, bool recursive);
#define Pattern$hash Text$hash
#define Pattern$compare Text$compare