aboutsummaryrefslogtreecommitdiff
path: root/docs/patterns.md
blob: 728b978e7b305a95ecfed73b8d585e955b4ec10c (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
# Text Pattern Matching

As an alternative to full regular expressions, Tomo provides a limited string
matching pattern syntax that is intended to solve 80% of use cases in under 1%
of the code size (PCRE's codebase is roughly 150k lines of code, and Tomo's
pattern matching code is a bit under 1k lines of code). Tomo's pattern matching
syntax is highly readable and works well for matching literal text without
getting [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome).

For more advanced use cases, consider linking against a C library for regular
expressions or pattern matching.

`Pattern` is a [domain-specific language](docs/langs.md), in other words, it's
like a `Text`, but it has a distinct type. As a convenience, you can use
`$/.../` to write pattern literals instead of using the general-purpose DSL
syntax of `$Pattern"..."`.

Patterns are used in a small, but very powerful API that handles many text
functions that would normally be handled by a more extensive API:

```
Text.has(pattern:Pattern -> Bool)
Text.each(pattern:Pattern, fn:func(m:Match), recursive=yes -> Text)
Text.find(pattern:Pattern, start=1 -> Match?)
Text.find_all(pattern:Pattern -> [Match])
Text.matches(pattern:Pattern -> [Text]?)
Text.map(pattern:Pattern, fn:func(m:Match -> Text), recursive=yes -> Text)
Text.replace(pattern:Pattern, replacement:Text, placeholder:Pattern=$//, recursive=yes -> [Text])
Text.replace_all(replacements:{Pattern,Text}, placeholder:Pattern=$//, recursive=yes -> [Text])
Text.split(pattern:Pattern -> [Text])
Text.trim(pattern=$/{whitespace}/, trim_left=yes, trim_right=yes -> [Text])
```

## Matches

Pattern matching functions work with a type called `Match` that has three fields:

- `text`: The full text of the match.
- `index`: The index in the text where the match was found.
- `captures`: An array containing the matching text of each non-literal pattern group.

See [Text Functions](text.md#Text-Functions) for the full API documentation.

## Syntax

Patterns have three types of syntax:

- `{` followed by an optional count (`n`, `n-m`, or `n+`), followed by an
  optional `!` to negate the pattern, followed by an optional pattern name or
  Unicode character name, followed by a required `}`.

- Any matching pair of quotes or parentheses or braces with a `?` in the middle
  (e.g. `"?"` or `(?)`).

- Any other character is treated as a literal to be matched exactly.

## Named Patterns

Named patterns match certain pre-defined patterns that are commonly useful. To
use a named pattern, use the syntax `{name}`. Names are case-insensitive and
mostly ignore spaces, underscores, and dashes.

- `..` - Any character (note that a single `.` would mean the literal period
  character).
- `digit` - A unicode digit
- `email` - an email address
- `emoji` - an emoji
- `end` - the very end of the text
- `id` - A unicode identifier
- `int` - One or more digits with an optional `-` (minus sign) in front
- `ip` - an IP address (IPv4 or IPv6)
- `ipv4` - an IPv4 address
- `ipv6` - an IPv6 address
- `nl`/`newline`/`crlf` - A line break (either `\r\n` or `\n`)
- `num` - One or more digits with an optional `-` (minus sign) in front and an optional `.` and more digits after
- `start` - the very start of the text
- `uri` - a URI
- `url` - a URL (URI that specifically starts with `http://`, `https://`, `ws://`, `wss://`, or `ftp://`)
- `word` - A unicode identifier (same as `id`)

For non-alphabetic characters, any single character is treated as matching
exactly that character. For example, `{1{}` matches exactly one `{`
character. Or, `{1.}` matches exactly one `.` character.

Patterns can also use any Unicode property name. Some helpful ones are:

- `hex` - Hexidecimal digits
- `lower` - Lowercase letters
- `space` - The space character
- `upper` - Uppercase letters
- `whitespace` - Whitespace characters

Patterns may also use exact Unicode codepoint names. For example: `{1 latin
small letter A}` matches `a`.

## Negating Patterns

If an exclamation mark (`!`) is placed before a pattern's name, then characters
are matched only when they _don't_ match the pattern. For example, `{!alpha}`
will match all characters _except_ alphabetic ones.

## Interpolating Text and Escaping

To escape a character in a pattern (e.g. if you want to match the literal
character `?`), you can use the syntax `{1 ?}`. This is almost never necessary
unless you have text that looks like a Tomo text pattern and has something like
`{` or `(?)` inside it.

However, if you're trying to do an exact match of arbitrary text values, you'll
want to have the text automatically escaped. Fortunately, Tomo's injection-safe
DSL text interpolation supports automatic text escaping. This means that if you
use text interpolation with the `$` sign to insert a text value, the value will
be automatically escaped using the `{1 ?}` rule described above:

```tomo
# Risk of code injection (would cause an error because 'xxx' is not a valid
# pattern name:
>> user_input := get_user_input()
= "{xxx}"

# Interpolation automatically escapes:
>> $/$user_input/
= $/{1{}..xxx}/

# This is: `{ 1{ }` (one open brace) followed by the literal text "..xxx}"

# No error:
>> some_text:find($/$user_input/)
= 0
```

If you prefer, you can also use this to insert literal characters:

```tomo
>> $/literal $"{..}"/
= $/literal {1{}..}/
```

## Repetitions

By default, named patterns match 1 or more repetitions, but you can specify how
many repetitions you want by putting a number or range of numbers first using
`n` (exactly `n` repetitions), `n-m` (between `n` and `m` repetitions), or `n+`
(`n` or more repetitions):

```
{4-5 alpha}
0x{hex}
{4 digit}-{2 digit}-{2 digit}
{2+ space}
{0-1 question mark}
```