author Bruce Hill <bruce@bruce-hill.com> 2024-09-03 00:54:48 -0400
committer Bruce Hill <bruce@bruce-hill.com> 2024-09-03 00:54:48 -0400
commit 5441e6f287608f4998d85cc39652ce1adaebb6a1 (patch)
tree c8f624b5357b7f334fde4921105e7e00eb1d2e79 /docs
parent 3df85ee6d8e26c6d69544be3352239c63108a554 (diff)
Update docs
Diffstat (limited to 'docs')
-rw-r--r-- docs/text.md 478
1 files changed, 326 insertions, 152 deletions
diff --git a/docs/text.md b/docs/text.md
index a1a8edd4..6f43d92b 100644
--- a/docs/text.md
+++ b/docs/text.md
@@ -273,6 +273,123 @@ created that has text with the codepoint `U+E9` as a key, then a lookup with
the same text but with `U+65 U+301` instead of `U+E9` will still succeed in
finding the value because the two texts are equivalent under normalization.
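The lookup behavior described above can be illustrated outside Tomo. The Python sketch below is only an analogy: it assumes NFC as the normalization form (this excerpt does not say which form Tomo uses), and `normalized_key` is a hypothetical helper, not part of any Tomo API.

```python
import unicodedata

def normalized_key(text):
    """Return the text in NFC form so canonically equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text)

precomposed = "\u00e9"   # U+E9, a single precomposed codepoint
decomposed = "e\u0301"   # U+65 followed by U+301 (combining acute accent)

# Store under the normalized key, then look up with the other spelling.
table = {normalized_key(precomposed): "value"}
found = table.get(normalized_key(decomposed))
```

Without the normalization step the two strings are different Python keys, which is the situation the normalized `Text` type is described as avoiding.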
+# Patterns
+
+As an alternative to full regular expressions, Tomo provides a limited string
+matching pattern syntax that is intended to solve 80% of use cases in 1% of the
+code size (PCRE's codebase is roughly 150k lines of code, and Tomo's entire
+Text codebase is around 1.5K lines of code).
+
+For more advanced use cases, consider linking against a C library for regular
+expressions or pattern matching.
+
+Patterns are used in a small but powerful API that covers many text
+operations that would otherwise require a more extensive API:
+
+```
+Text.find(pattern:Text, start=1, length=!&Int64?)->Int
+Text.find_all(pattern:Text)->[Text]
+Text.split(pattern:Text)->[Text]
+Text.replace(pattern:Text, replacement:Text)->Text
+Text.has(pattern:Text)->Bool
+```
+
+See [Text Functions](#Text-Functions) for the full API documentation.
+
+## Syntax
+
+Patterns have three types of syntax:
+
+- `[..` followed by an optional count (`n`, `n-m`, or `n+`), followed by an
+ optional `!` to negate the pattern, followed by an optional pattern name or
+ Unicode character name, followed by a required `]`.
+
+- Any matching pair of quotes or parentheses or braces with a `?` in the middle
+ (e.g. `"?"` or `(?)`).
+
+- Any other character is treated as a literal to be matched exactly.
+
+## Named Patterns
+
+Named patterns match certain pre-defined patterns that are commonly useful. To
+use a named pattern, use the syntax `[..name]`. Names are case-insensitive and
+mostly ignore spaces, underscores, and dashes.
+
+- ` ` - If no name is given, any character is accepted.
+- `digit` - A Unicode digit
+- `email` - An email address
+- `emoji` - An emoji
+- `end` - The very end of the text
+- `id` - A Unicode identifier
+- `int` - One or more digits with an optional `-` (minus sign) in front
+- `ip` - An IP address (IPv4 or IPv6)
+- `ipv4` - An IPv4 address
+- `ipv6` - An IPv6 address
+- `num` - One or more digits with an optional `-` (minus sign) in front and an optional `.` and more digits after
+- `start` - The very start of the text
+- `uri` - A URI
+- `url` - A URL (a URI that specifically starts with `http://`, `https://`, `ws://`, `wss://`, or `ftp://`)
+
+For non-alphabetic characters, any single character is treated as matching
+exactly that character. For example, `[..1 []` matches exactly one `[`
+character, and `[..1 (]` matches exactly one `(` character.
+
+Patterns can also use any Unicode property name. Some helpful ones are:
+
+- `hex` - Hexadecimal digits
+- `lower` - Lowercase letters
+- `space` - The space character
+- `upper` - Uppercase letters
+- `whitespace` - Whitespace characters
+
+Patterns may also use exact Unicode codepoint names. For example: `[..1 latin
+small letter A]` matches `a`.
+
+## Negating Patterns
+
+If an exclamation mark (`!`) is placed before a pattern's name, then characters
+are matched only when they _don't_ match the pattern. For example, `[..!alpha]`
+will match all characters _except_ alphabetic ones.
+
+## Repetitions
+
+By default, named patterns match 1 or more repetitions, but you can specify how
+many repetitions you want by putting a number or range of numbers first using
+`n` (exactly `n` repetitions), `n-m` (between `n` and `m` repetitions), or `n+`
+(`n` or more repetitions):
+
+```
+[..4-5 alpha]
+0x[..hex]
+[..4 digit]-[..2 digit]-[..2 digit]
+[..2+ space]
+[..0-1 question mark]
+```
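For readers who know regular expressions, the count syntax maps fairly directly onto regex quantifiers. The Python sketch below shows that mapping only as a rough analogy; `count_to_quantifier` is a hypothetical helper, and Python's `\d` is assumed to be a close enough stand-in for the `digit` pattern.

```python
import re

def count_to_quantifier(count):
    """Translate a pattern count ('', 'n', 'n-m', or 'n+') into a regex quantifier."""
    if count == "":
        return "+"                                  # default: one or more
    if count.endswith("+"):
        return "{%d,}" % int(count[:-1])            # n or more
    if "-" in count:
        low, high = count.split("-")
        return "{%d,%d}" % (int(low), int(high))    # between n and m
    return "{%d}" % int(count)                      # exactly n

# Approximate regex counterpart of `[..4 digit]-[..2 digit]-[..2 digit]`
date_regex = re.compile(
    r"\d" + count_to_quantifier("4")
    + "-" + r"\d" + count_to_quantifier("2")
    + "-" + r"\d" + count_to_quantifier("2")
)
```

Here `date_regex.fullmatch("2024-09-03")` succeeds, while a two-digit year such as `"24-09-03"` does not match.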
+
+## Some Examples
+
+URL query string parameters:
+
+```
+text := "example.com/page?a=b&c=d"
+>> text:find(before=$Pat`?`, $Pat`[..]`):split($Pat`&`)
+= ["a=b", "c=d"]
+```
+
+Remove or get file extension:
+
+```
+filename := "foo.txt"
+>> filename:without($Pat`.[:id:]`, where=End)
+= "foo"
+
+>> filename:find(before=$Pat`.`, $Pat`[:id:][:end:]`)
+= MatchResult.Success(match="txt")
+
+>> filename := "foo.tar.gz"
+>> ".":join(filename:split($Pat`.`):from(2))
+= "tar.gz"
+```
# Text Functions
@@ -301,10 +418,11 @@ A C-style string (`CString`) representing the text.
---
-## `bytes`
+## `utf8_bytes`
**Description:**
-Converts a `Text` value to an array of bytes.
+Converts a `Text` value to an array of bytes representing a UTF8 encoding of
+the text.
**Usage:**
```tomo
@@ -313,466 +431,522 @@ bytes(text: Text) -> [Int8]
**Parameters:**
-- `text`: The text to be converted to bytes.
+- `text`: The text to be converted to UTF8 bytes.
**Returns:**
-An array of bytes (`[Int8]`) representing the text.
+An array of bytes (`[Int8]`) representing the text in UTF8 encoding.
**Example:**
```tomo
->> "Amélie":bytes()
-= [65_i8, 109_i8, 101_i8, -52_i8, -127_i8, 108_i8, 105_i8, 101_i8]
+>> "Amélie":utf8_bytes()
+= [65_i8, 109_i8, -61_i8, -87_i8, 108_i8, 105_i8, 101_i8] : [Int8]
```
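The negative values in the example output come from viewing each UTF8 byte as a signed 8-bit integer. As a cross-check, here is a small Python sketch (not Tomo code; `utf8_signed_bytes` is a hypothetical helper) that reproduces the same numbers for the precomposed spelling of the text:

```python
def utf8_signed_bytes(text):
    """Encode text as UTF-8 and reinterpret each byte as a signed 8-bit value."""
    # Bytes above 127 wrap around to negative numbers in two's complement.
    return [b - 256 if b > 127 else b for b in text.encode("utf-8")]

signed = utf8_signed_bytes("Am\u00e9lie")
# signed == [65, 109, -61, -87, 108, 105, 101]
```

The pair `-61, -87` corresponds to the unsigned bytes `0xC3 0xA9`, the UTF8 encoding of `U+E9`.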
---
-## `character_names`
+## `codepoint_names`
**Description:**
-Returns a list of character names from the text.
+Returns an array of the names of each codepoint in the text.
**Usage:**
```tomo
-character_names(text: Text) -> [Text]
+codepoint_names(text: Text) -> [Text]
```
**Parameters:**
-- `text`: The text from which to extract character names.
+- `text`: The text from which to extract codepoint names.
**Returns:**
-A list of character names (`[Text]`).
+An array of codepoint names (`[Text]`).
**Example:**
```tomo
->> "Amélie":character_names()
-= ["LATIN CAPITAL LETTER A", "LATIN SMALL LETTER M", "LATIN SMALL LETTER E", "COMBINING ACUTE ACCENT", "LATIN SMALL LETTER L", "LATIN SMALL LETTER I", "LATIN SMALL LETTER E"]
+>> "Amélie":codepoint_names()
+= ["LATIN CAPITAL LETTER A", "LATIN SMALL LETTER M", "LATIN SMALL LETTER E WITH ACUTE", "LATIN SMALL LETTER L", "LATIN SMALL LETTER I", "LATIN SMALL LETTER E"]
```
---
-## `clusters`
+## `utf32_codepoints`
**Description:**
-Breaks the text into a list of unicode graphical clusters. Clusters are what
-you typically think of when you think of "letters" or "characters". If you're
-in a text editor and you hit the left or right arrow key, it will move the
-cursor by one graphical cluster.
+Returns an array of Unicode code points for UTF32 encoding of the text.
**Usage:**
```tomo
-clusters(text: Text) -> [Text]
+utf32_codepoints(text: Text) -> [Int32]
```
**Parameters:**
-- `text`: The text to be broken into graphical clusters.
+- `text`: The text from which to extract Unicode code points.
**Returns:**
-A list of graphical clusters (`[Text]`) within the text.
+An array of 32-bit integer Unicode code points (`[Int32]`).
**Example:**
```tomo
->> "Amélie":clusters()
-= ["A", "m", "é", "l", "i", "e"] : [Text]
+>> "Amélie":utf32_codepoints()
+= [65_i32, 109_i32, 233_i32, 108_i32, 105_i32, 101_i32] : [Int32]
```
---
-## `codepoints`
+## `from_c_string`
**Description:**
-Returns a list of Unicode code points for the text.
+Converts a C-style string to a `Text` value.
**Usage:**
```tomo
-codepoints(text: Text) -> [Int32]
+from_c_string(str: CString) -> Text
```
**Parameters:**
-- `text`: The text from which to extract Unicode code points.
+- `str`: The C-style string to be converted.
**Returns:**
-A list of Unicode code points (`[Int32]`).
+A `Text` value representing the C-style string.
**Example:**
```tomo
->> "Amélie":codepoints()
-= [65_i32, 109_i32, 101_i32, 769_i32, 108_i32, 105_i32, 101_i32] : [Int32]
+>> Text.from_c_string(CString("Hello"))
+= "Hello"
```
---
-## `from_c_string`
+## `from_codepoint_names`
**Description:**
-Converts a C-style string to a `Text` value.
+Returns text that has the given codepoint names (according to the Unicode
+specification) as its codepoints. Note: the text will be normalized, so the
+resulting text's codepoints may not exactly match the input codepoints.
**Usage:**
```tomo
-from_c_string(str: CString) -> Text
+from_codepoint_names(codepoint_names: [Text]) -> Text
```
**Parameters:**
-- `str`: The C-style string to be converted.
+- `codepoint_names`: The names of each codepoint in the desired text. Names
+ are case-insensitive.
**Returns:**
-A `Text` value representing the C-style string.
+A new text with the specified codepoints after normalization has been applied.
+Any invalid names are ignored.
**Example:**
```tomo
->> Text.from_c_string(CString("Hello"))
-= "Hello"
+>> Text.from_codepoint_names([
+ "LATIN CAPITAL LETTER A WITH RING ABOVE",
+ "LATIN SMALL LETTER K",
+ "LATIN SMALL LETTER E",
+])
+= "Åke"
```
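The described behavior (look up each name, skip invalid ones, normalize the result) can be approximated in Python with the standard `unicodedata` module. This is a sketch under assumptions, not Tomo's implementation: NFC is assumed as the normalization form, and the function below merely borrows the documented name.

```python
import unicodedata

def from_codepoint_names(names):
    """Build text from Unicode codepoint names, skipping names that do not exist."""
    chars = []
    for name in names:
        try:
            chars.append(unicodedata.lookup(name))
        except KeyError:
            continue  # invalid names are ignored, as described above
    # Assumption: NFC normalization, so combining sequences may be composed.
    return unicodedata.normalize("NFC", "".join(chars))

composed = from_codepoint_names(["LATIN CAPITAL LETTER A", "COMBINING RING ABOVE"])
```

In this sketch `composed` is the single codepoint `U+C5`, which shows why the resulting codepoints may not exactly match the input names.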
---
-## `has`
+## `from_codepoints`
**Description:**
-Checks if the `Text` contains a target substring.
+Returns text that has been constructed from the given UTF32 codepoints. Note:
+the text will be normalized, so the resulting text's codepoints may not exactly
+match the input codepoints.
**Usage:**
```tomo
-has(text: Text, target: Text, where: Where = Where.Anywhere) -> Bool
+from_codepoints(codepoints: [Int32]) -> Text
```
**Parameters:**
-- `text`: The text to be searched.
-- `target`: The substring to search for.
-- `where`: The location to search (`Where.Anywhere` by default). This can
- also be `Start` or `End`.
+- `codepoints`: The UTF32 codepoints in the desired text.
**Returns:**
-`yes` if the target substring is found, `no` otherwise.
+A new text with the specified codepoints after normalization has been applied.
**Example:**
```tomo
->> "hello world":has("wo")
-= yes
->> "hello world":has("wo", where=Start)
-= no
->> "hello world":has("he", where=Start)
-= yes
+>> Text.from_codepoints([197_i32, 107_i32, 101_i32])
+= "Åke"
```
---
-## `join`
+## `from_bytes`
**Description:**
-Joins a list of text pieces with a specified glue.
+Returns text that has been constructed from the given UTF8 bytes. Note: the
+text will be normalized, so the resulting text's UTF8 bytes may not exactly
+match the input.
**Usage:**
```tomo
-join(glue: Text, pieces: [Text]) -> Text
+from_bytes(bytes: [Int8]) -> Text
```
**Parameters:**
-- `glue`: The text used to join the pieces.
-- `pieces`: The list of text pieces to be joined.
+- `bytes`: The UTF8 bytes of the desired text.
**Returns:**
-A single `Text` value with the pieces joined by the glue.
+A new text based on the input UTF8 bytes after normalization has been applied.
**Example:**
```tomo
->> ", ":join(["one", "two", "three"])
-= "one, two, three"
+>> Text.from_bytes([-61_i8, -123_i8, 107_i8, 101_i8])
+= "Åke"
```
---
-## `lower`
+## `find`
**Description:**
-Converts all characters in the text to lowercase.
+Finds the first occurrence of a pattern in the given text (if any).
+See: [Patterns](#Patterns) for more information on patterns.
**Usage:**
```tomo
-lower(text: Text) -> Text
+find(text: Text, pattern: Text, start: Int = 1, length: &Int64? = !&Int64) -> Int
```
**Parameters:**
-- `text`: The text to be converted to lowercase.
+- `text`: The text to be searched.
+- `pattern`: The pattern to search for.
+- `start`: The index to start the search.
+- `length`: If non-null, this pointer's value will be set to the length of the
+ match, or `-1` if there is no match.
**Returns:**
-The lowercase version of the text.
+`0` if the target pattern is not found, otherwise the index where the match was
+found.
**Example:**
```tomo
->> "AMÉLIE":lower()
-= "amélie"
+>> " one two three ":find("[..id]", start=-999)
+= 0
+>> " one two three ":find("[..id]", start=999)
+= 0
+>> " one two three ":find("[..id]")
+= 2
+>> " one two three ":find("[..id]", start=5)
+= 8
+
+>> len := 0_i64
+>> " one ":find("[..id]", length=&len)
+= 4
+>> len
+= 3_i64
```
---
-## `num_bytes`
+## `find_all`
**Description:**
-Returns the number of bytes used by the text.
+Finds all occurrences of a pattern in the given text.
+See: [Patterns](#Patterns) for more information on patterns.
**Usage:**
```tomo
-num_bytes(text: Text) -> Int
+find_all(text: Text, pattern: Text) -> [Text]
```
**Parameters:**
-- `text`: The text to measure.
+- `text`: The text to be searched.
+- `pattern`: The pattern to search for.
**Returns:**
-The number of bytes used by the text.
+An array of every match of the pattern in the given text.
+Note: if `text` or `pattern` is empty, an empty array will be returned.
**Example:**
```tomo
->> "Amélie":num_bytes()
-= 8
+>> " one two three ":find_all("[..alpha]")
+= ["one", "two", "three"]
+
+>> " one two three ":find_all("[..!space]")
+= ["one", "two", "three"]
+
+>> " ":find_all("[..alpha]")
+= []
+
+>> " foo(baz(), 1) doop() ":find_all("[..id](?)")
+= ["foo(baz(), 1)", "doop()"]
+
+>> "":find_all("")
+= []
+
+>> "Hello":find_all("")
+= []
```
---
-## `num_clusters`
+## `has`
**Description:**
-Returns the number of clusters in the text.
+Checks if the `Text` contains a target pattern (see: [Patterns](#Patterns)).
**Usage:**
```tomo
-num_clusters(text: Text) -> Int
+has(text: Text, pattern: Text) -> Bool
```
**Parameters:**
-- `text`: The text to measure.
+- `text`: The text to be searched.
+- `pattern`: The pattern to search for.
**Returns:**
-The number of clusters in the text.
+`yes` if the target pattern is found, `no` otherwise.
**Example:**
```tomo
->> "Amélie":num_clusters()
-= 6
+>> "hello world":has("wo")
+= yes
+>> "hello world":has("[..alpha]")
+= yes
+>> "hello world":has("[..digit]")
+= no
+>> "hello world":has("[..start]he")
+= yes
```
---
-## `num_codepoints`
+## `join`
**Description:**
-Returns the number of Unicode code points in the text.
+Joins an array of text pieces with a specified glue.
**Usage:**
```tomo
-num_codepoints(text: Text) -> Int
+join(glue: Text, pieces: [Text]) -> Text
```
**Parameters:**
-- `text`: The text to measure.
+- `glue`: The text used to join the pieces.
+- `pieces`: The array of text pieces to be joined.
**Returns:**
-The number of Unicode code points in the text.
+A single `Text` value with the pieces joined by the glue.
**Example:**
```tomo
->> "Amélie":num_codepoints()
-= 7
+>> ", ":join(["one", "two", "three"])
+= "one, two, three"
```
---
-## `quoted`
+## `lines`
**Description:**
-Formats the text as a quoted string.
+Splits the text into an array of lines of text, preserving blank lines,
+ignoring trailing newlines, and handling `\r\n` the same as `\n`.
**Usage:**
```tomo
-quoted(text: Text, color: Bool = no) -> Text
+lines(text: Text) -> [Text]
```
**Parameters:**
-- `text`: The text to be quoted.
-- `color`: Whether to add color formatting (default is `no`).
+- `text`: The text to be split into lines.
**Returns:**
-The text formatted as a quoted string.
+An array of substrings resulting from the split.
**Example:**
```tomo
->> "one$(\n)two":quoted()
-= "\"one\\ntwo\""
+>> "one$(\n)two$(\n)three":lines()
+= ["one", "two", "three"]
+>> "one$(\n)two$(\n)three$(\n)":lines()
+= ["one", "two", "three"]
+>> "one$(\n)two$(\n)three$(\n\n)":lines()
+= ["one", "two", "three", ""]
+>> "one$(\r\n)two$(\r\n)three$(\r\n)":lines()
+= ["one", "two", "three"]
+>> "":lines()
+= []
```
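The three rules in the description (keep blank lines, ignore a single trailing newline, treat `\r\n` like `\n`) can be sketched in Python as follows. This is an illustrative reimplementation that agrees with the examples above, not the actual Tomo code; inputs not covered by those examples may behave differently in Tomo.

```python
def lines(text):
    """Split text into lines, mirroring the documented examples."""
    if text == "":
        return []                        # empty text has no lines
    text = text.replace("\r\n", "\n")    # handle \r\n the same as \n
    if text.endswith("\n"):
        text = text[:-1]                 # ignore one trailing newline
    return text.split("\n")              # blank lines are preserved
```

Only one trailing newline is dropped, which is why text ending in two newlines yields a final empty string.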
---
-## `replace`
+## `lower`
**Description:**
-Replaces occurrences of a pattern in the text with a replacement string.
+Converts all characters in the text to lowercase.
**Usage:**
```tomo
-replace(text: Text, pattern: Text, replacement: Text, limit: Int = -1) -> Text
+lower(text: Text) -> Text
```
**Parameters:**
-- `text`: The text in which to perform replacements.
-- `pattern`: The substring to be replaced.
-- `replacement`: The text to replace the pattern with.
-- `limit`: The maximum number of replacements (default is `-1`, meaning no limit).
+- `text`: The text to be converted to lowercase.
**Returns:**
-The text with occurrences of the pattern replaced.
+The lowercase version of the text.
**Example:**
```tomo
->> "Hello world":replace("world", "there")
-= "Hello there"
-
->> "xxxx":replace("x", "y", limit=2)
-= "yyxx"
+>> "AMÉLIE":lower()
+= "amélie"
```
---
-## `split`
+## `quoted`
**Description:**
-Splits the text into a list of substrings based on a delimiter.
+Formats the text as a quoted string.
**Usage:**
```tomo
-split(text: Text, split: Text) -> [Text]
+quoted(text: Text, color: Bool = no) -> Text
```
**Parameters:**
-- `text`: The text to be split.
-- `split`: The delimiter used to split the text.
+- `text`: The text to be quoted.
+- `color`: Whether to add color formatting (default is `no`).
**Returns:**
-A list of substrings resulting from the split.
+The text formatted as a quoted string.
**Example:**
```tomo
->> "one,two,three":split(",")
-= ["one", "two", "three"]
+>> "one$(\n)two":quoted()
+= "\"one\\ntwo\""
```
---
-## `title`
+## `replace`
**Description:**
-Converts the text to title case (capitalizing the first letter of each word).
+Replaces occurrences of a pattern in the text with a replacement string.
+See [Patterns](#patterns) for more information about patterns.
**Usage:**
```tomo
-title(text: Text) -> Text
+replace(text: Text, pattern: Text, replacement: Text) -> Text
```
**Parameters:**
-- `text`: The text to be converted to title case.
+- `text`: The text in which to perform replacements.
+- `pattern`: The pattern to be replaced.
+- `replacement`: The text to replace the pattern with.
**Returns:**
-The text in title case.
+The text with occurrences of the pattern replaced.
**Example:**
```tomo
->> "amélie":title()
-= "Amélie"
+>> "Hello world":replace("world", "there")
+= "Hello there"
+
+>> "Hello world":replace("[..id]", "xxx")
+= "xxx xxx"
```
---
-## `trimmed`
+## `split`
**Description:**
-Trims characters from the beginning and end of the text.
+Splits the text into an array of substrings based on a pattern.
+See [Patterns](#patterns) for more information about patterns.
**Usage:**
```tomo
-trimmed(text: Text, trim: Text = " {\n\r\t}", where: Where = Where.Anywhere) -> Text
+split(text: Text, pattern: Text = "") -> [Text]
```
**Parameters:**
-- `text`: The text to be trimmed.
-- `trim`: The set of characters to remove (default is `" {\n\r\t}"`).
-- `where`: Specifies where to trim (`Where.Anywhere` by default).
+- `text`: The text to be split.
+- `pattern`: The pattern used to split the text. If the pattern is the empty
+ string, the text will be split into individual grapheme clusters.
**Returns:**
-The trimmed text.
+An array of substrings resulting from the split.
**Example:**
```tomo
->> " xxx ":trimmed()
-= "xxx"
+>> "one,two,three":split(",")
+= ["one", "two", "three"]
+
+>> "abc":split()
+= ["a", "b", "c"]
->> "xxyyxx":trimmed("x", where=Start)
-= "yyxx"
+>> "a b c":split("[..space]")
+= ["a", "b", "c"]
+
+>> "a,b,c,":split(",")
+= ["a", "b", "c", ""]
```
---
-## `upper`
+## `title`
**Description:**
-Converts all characters in the text to uppercase.
+Converts the text to title case (capitalizing the first letter of each word).
**Usage:**
```tomo
-upper(text: Text) -> Text
+title(text: Text) -> Text
```
**Parameters:**
-- `text`: The text to be converted to uppercase.
+- `text`: The text to be converted to title case.
**Returns:**
-The uppercase version of the text.
+The text in title case.
**Example:**
```tomo
->> "amélie":upper()
-= "AMÉLIE"
+>> "amélie":title()
+= "Amélie"
```
---
-## `without`
+## `upper`
**Description:**
-Removes all occurrences of a target substring from the text.
+Converts all characters in the text to uppercase.
**Usage:**
```tomo
-without(text: Text, target: Text, where: Where = Where.Anywhere) -> Text
+upper(text: Text) -> Text
```
**Parameters:**
-- `text`: The text from which to remove substrings.
-- `target`: The substring to remove.
-- `where`: The location to remove the target (`Where.Anywhere` by default).
+- `text`: The text to be converted to uppercase.
**Returns:**
-The text with occurrences of the target removed.
+The uppercase version of the text.
**Example:**
```tomo
->> "banana":without("na")
-= "ba"
->> "banana":without("na", where=End)
-= "bana"
+>> "amélie":upper()
+= "AMÉLIE"
```