# Text Pattern Matching

As an alternative to full regular expressions, Tomo provides a limited text
matching pattern syntax that is intended to solve 80% of use cases in under 1%
of the code size (PCRE's codebase is roughly 150k lines of code, and Tomo's
pattern matching code is a bit under 1k lines of code). Tomo's pattern matching
syntax is highly readable and works well for matching literal text without
getting [leaning toothpick syndrome](https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome).

For more advanced use cases, consider linking against a C library for regular
expressions or pattern matching.

`Pat` is a [domain-specific language](docs/langs.md); in other words, it's
like a `Text`, but it has a distinct type.

Patterns are used in a small but powerful API that covers many text operations
that would otherwise require a much larger API:

- [`by_pattern(text:Text, pattern:Pat -> func(->PatternMatch?))`](#by_pattern)
- [`by_pattern_split(text:Text, pattern:Pat -> func(->Text?))`](#by_pattern_split)
- [`each_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch), recursive=yes)`](#each_pattern)
- [`find_patterns(text:Text, pattern:Pat -> [PatternMatch])`](#find_patterns)
- [`has_pattern(text:Text, pattern:Pat -> Bool)`](#has_pattern)
- [`map_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch -> Text), recursive=yes -> Text)`](#map_pattern)
- [`matches_pattern(text:Text, pattern:Pat -> Bool)`](#matches_pattern)
- [`pattern_captures(text:Text, pattern:Pat -> [Text]?)`](#pattern_captures)
- [`replace_pattern(text:Text, pattern:Pat, replacement:Text, backref="@", recursive=yes -> Text)`](#replace_pattern)
- [`split_pattern(text:Text, pattern:Pat -> [Text])`](#split_pattern)
- [`translate_patterns(text:Text, replacements:{Pat,Text}, backref="@", recursive=yes -> Text)`](#translate_patterns)
- [`trim_pattern(text:Text, pattern=$Pat"{space}", left=yes, right=yes -> Text)`](#trim_pattern)

## Matches

Pattern matching functions work with a type called `PatternMatch` that has three fields:

- `text`: The full text of the match.
- `index`: The index in the text where the match was found.
- `captures`: An array containing the matching text of each non-literal pattern group.

See [Text Functions](text.md#Text-Functions) for the full API documentation.
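
As a rough illustration of these fields, the sketch below reuses the input from
the `find_patterns` example in this document and prints each match's fields. It
assumes that arrays can be iterated with `for` and that `say` accepts
interpolated text, as the other examples here suggest; it has not been checked
against a particular Tomo release.

```tomo
text := "one! two three!"
for m in text:find_patterns($Pat"{id}!"):
    # `m.index` is where the match starts in `text`
    # `m.text` is the whole match, including the literal `!`
    # `m.captures[1]` is the part matched by `{id}` alone
    say("$(m.index): $(m.text) captured $(m.captures[1])")
```

Going by the `find_patterns` example in this document, the two matches are
`one!` at index 1 and `three!` at index 10, with captures `one` and `three`.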

## Syntax

Patterns have three types of syntax:

- `{` followed by an optional count (`n`, `n-m`, or `n+`), followed by an
  optional `!` to negate the pattern, followed by an optional pattern name or
  Unicode character name, followed by a required `}`.

- Any matching pair of quotes or parentheses or braces with a `?` in the middle
  (e.g. `"?"` or `(?)`).

- Any other character is treated as a literal to be matched exactly.
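
To make the three kinds of syntax concrete, here is a sketch of a single
pattern that uses all of them. The input text is made up for illustration, and
the result is left unstated because it has not been run.

```tomo
# `{id}`  named pattern: an identifier
# `(?)`   matching pair: parentheses with arbitrary content inside
# ` = `   everything else: literal characters matched exactly
>> "total = sum(a, max(b, c))":pattern_captures($Pat"{id} = {id}(?)")
```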

## Named Patterns

Named patterns match certain pre-defined patterns that are commonly useful. To
use a named pattern, use the syntax `{name}`. Names are case-insensitive and
mostly ignore spaces, underscores, and dashes.

- `..` - Any character (note that a single `.` would mean the literal period
  character).
- `digit` - A Unicode digit
- `email` - an email address
- `emoji` - an emoji
- `end` - the very end of the text
- `id` - A Unicode identifier
- `int` - One or more digits with an optional `-` (minus sign) in front
- `ip` - an IP address (IPv4 or IPv6)
- `ipv4` - an IPv4 address
- `ipv6` - an IPv6 address
- `nl`/`newline`/`crlf` - A line break (either `\r\n` or `\n`)
- `num` - One or more digits with an optional `-` (minus sign) in front and an optional `.` and more digits after
- `start` - the very start of the text
- `uri` - a URI
- `url` - a URL (URI that specifically starts with `http://`, `https://`, `ws://`, `wss://`, or `ftp://`)
- `word` - A Unicode identifier (same as `id`)

For non-alphabetic characters, any single character is treated as matching
exactly that character. For example, `{1{}` matches exactly one `{`
character. Or, `{1.}` matches exactly one `.` character.

Patterns can also use any Unicode property name. Some helpful ones are:

- `hex` - Hexadecimal digits
- `lower` - Lowercase letters
- `space` - The space character
- `upper` - Uppercase letters
- `whitespace` - Whitespace characters

Patterns may also use exact Unicode codepoint names. For example: `{1 latin
small letter A}` matches `a`.

## Negating Patterns

If an exclamation mark (`!`) is placed before a pattern's name, then characters
are matched only when they _don't_ match the pattern. For example, `{!alpha}`
will match all characters _except_ alphabetic ones.
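
A brief sketch of negation used with functions documented in this file. The
inputs are illustrative, and the results are omitted because the calls have not
been run.

```tomo
# Trim everything that is not alphabetic from both ends of the text:
>> "123abc456":trim_pattern($Pat"{!alpha}")

# Ask whether the text contains any character that is not a digit:
>> "12345":has_pattern($Pat"{!digit}")
```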

## Interpolating Text and Escaping

To escape a character in a pattern (e.g. if you want to match the literal
character `?`), you can use the syntax `{1 ?}`. This is almost never necessary
unless you have text that looks like a Tomo text pattern and has something like
`{` or `(?)` inside it.

However, if you're trying to do an exact match of arbitrary text values, you'll
want to have the text automatically escaped. Fortunately, Tomo's injection-safe
DSL text interpolation supports automatic text escaping. This means that if you
use text interpolation with the `$` sign to insert a text value, the value will
be automatically escaped using the `{1 ?}` rule described above:

```tomo
# Risk of code injection (would cause an error because 'xxx' is not a valid
# pattern name):
>> user_input := get_user_input()
= "{xxx}"

# Interpolation automatically escapes:
>> $/$user_input/
= $/{1{}xxx}/

# This is: `{ 1{ }` (one open brace) followed by the literal text "xxx}"

# No error:
>> some_text:find($/$user_input/)
= 0
```

If you prefer, you can also use this to insert literal characters:

```tomo
>> $/literal $"{..}"/
= $/literal {1{}..}/
```

## Repetitions

By default, named patterns match 1 or more repetitions, but you can specify how
many repetitions you want by putting a number or range of numbers first using
`n` (exactly `n` repetitions), `n-m` (between `n` and `m` repetitions), or `n+`
(`n` or more repetitions):

```
{4-5 alpha}
0x{hex}
{4 digit}-{2 digit}-{2 digit}
{2+ space}
{0-1 question mark}
```
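
As a sketch of how these counted patterns might be used with the functions
documented in this file (the inputs are illustrative, and the results are
omitted because the calls have not been run):

```tomo
# Exactly 4 digits, a dash, 2 digits, a dash, then 2 digits:
>> "2024-01-15":matches_pattern($Pat"{4 digit}-{2 digit}-{2 digit}")

# Replace each run of two or more spaces with a single space:
>> "a   b  c":replace_pattern($Pat"{2+ space}", " ")
```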


## Methods

### `by_pattern`
Returns an iterator function that yields `PatternMatch` objects for each occurrence.

```tomo
func by_pattern(text:Text, pattern:Pat -> func(->PatternMatch?))
```

- `text`: The text to search.
- `pattern`: The pattern to match.

**Returns:**
An iterator function that yields `PatternMatch` objects one at a time.

**Example:**
```tomo
text := "one, two, three"
for word in text:by_pattern($Pat"{id}"):
    say(word.text)
```

---

### `by_pattern_split`
Returns an iterator function that yields text segments split by a pattern.

```tomo
func by_pattern_split(text:Text, pattern:Pat -> func(->Text?))
```

- `text`: The text to split.
- `pattern`: The pattern to use as a separator.

**Returns:**
An iterator function that yields text segments.

**Example:**
```tomo
text := "one two three"
for word in text:by_pattern_split($Pat"{whitespace}"):
    say(word)
```

---

### `each_pattern`
Applies a function to each occurrence of a pattern in the text.

```tomo
func each_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch), recursive=yes)
```

- `text`: The text to search.
- `pattern`: The pattern to match.
- `fn`: The function to apply to each match.
- `recursive`: If `yes`, applies the function recursively on modified text.

**Example:**
```tomo
text := "one two three"
text:each_pattern($Pat"{id}", func(m:PatternMatch):
    say(m.text)
)
```

---

### `find_patterns`
Finds all occurrences of a pattern in a text and returns them as `PatternMatch` objects.

```tomo
func find_patterns(text:Text, pattern:Pat -> [PatternMatch])
```

- `text`: The text to search.
- `pattern`: The pattern to match.

**Returns:**
An array of `PatternMatch` objects.

**Example:**
```tomo
text := "one! two three!"
>> text:find_patterns($Pat"{id}!")
= [PatternMatch(text="one!", index=1, captures=["one"]), PatternMatch(text="three!", index=10, captures=["three"])]
```

---

### `has_pattern`
Checks whether a given pattern appears in the text.

```tomo
func has_pattern(text:Text, pattern:Pat -> Bool)
```

- `text`: The text to search.
- `pattern`: The pattern to check for.

**Returns:**
`yes` if a match is found, otherwise `no`.

**Example:**
```tomo
text := "...okay..."
>> text:has_pattern($Pat"{id}")
= yes
```

---

### `map_pattern`
Transforms matches of a pattern using a mapping function.

```tomo
func map_pattern(text:Text, pattern:Pat, fn:func(m:PatternMatch -> Text), recursive=yes -> Text)
```

- `text`: The text to modify.
- `pattern`: The pattern to match.
- `fn`: A function that transforms matches.
- `recursive`: If `yes`, applies transformations recursively.

**Returns:**
A new text with the transformed matches.

**Example:**
```tomo
text := "I have #apples and #oranges and #plums"
fruits := {"apples"=4, "oranges"=5}
>> text:map_pattern($Pat'#{id}', func(match:PatternMatch):
    fruit := match.captures[1]
    "$(fruits[fruit] or 0) $fruit"
)
= "I have 4 apples and 5 oranges and 0 plums"
```

---

### `matches_pattern`
Returns whether or not text matches a pattern completely.

```tomo
func matches_pattern(text:Text, pattern:Pat -> Bool)
```

- `text`: The text to match against.
- `pattern`: The pattern to match.

**Returns:**
`yes` if the whole text matches the pattern, otherwise `no`.

**Example:**
```tomo
>> "Hello!!!":matches_pattern($Pat"{id}")
= no
>> "Hello":matches_pattern($Pat"{id}")
= yes
```

---

### `pattern_captures`
Returns an array of pattern captures for the given pattern.

```tomo
func pattern_captures(text:Text, pattern:Pat -> [Text]?)
```

- `text`: The text to match against.
- `pattern`: The pattern to match.

**Returns:**
An optional array of matched pattern captures. Returns `none` if the text does
not match the pattern.

**Example:**
```tomo
>> "123 boxes":pattern_captures($Pat"{int} {id}")
= ["123", "boxes"]?
>> "xxx":pattern_captures($Pat"{int} {id}")
= none
```

---

### `replace_pattern`
Replaces occurrences of a pattern with a replacement text, supporting backreferences.

```tomo
func replace_pattern(text:Text, pattern:Pat, replacement:Text, backref="@", recursive=yes -> Text)
```

- `text`: The text to modify.
- `pattern`: The pattern to match.
- `replacement`: The text to replace matches with.
- `backref`: The symbol for backreferences in the replacement.
- `recursive`: If `yes`, applies replacements recursively.

**Returns:**
A new text with replacements applied.

**Example:**
```tomo
>> "I have 123 apples and 456 oranges":replace_pattern($Pat"{int}", "some")
= "I have some apples and some oranges"

>> "I have 123 apples and 456 oranges":replace_pattern($Pat"{int}", "(@1)")
= "I have (123) apples and (456) oranges"

>> "I have 123 apples and 456 oranges":replace_pattern($Pat"{int}", "(?1)", backref="?")
= "I have (123) apples and (456) oranges"

>> "bad(fn(), bad(notbad))":replace_pattern($Pat"bad(?)", "good(@1)")
= "good(fn(), good(notbad))"

>> "bad(fn(), bad(notbad))":replace_pattern($Pat"bad(?)", "good(@1)", recursive=no)
= "good(fn(), bad(notbad))"
```

---

### `split_pattern`
Splits a text into segments using a pattern as the delimiter.

```tomo
func split_pattern(text:Text, pattern:Pat -> [Text])
```

- `text`: The text to split.
- `pattern`: The pattern to use as a separator.

**Returns:**
An array of text segments.

**Example:**
```tomo
>> "one two three":split_pattern($Pat"{whitespace}")
= ["one", "two", "three"]
```

---

### `translate_patterns`
Replaces multiple patterns using a mapping of patterns to replacement texts.

```tomo
func translate_patterns(text:Text, replacements:{Pat,Text}, backref="@", recursive=yes -> Text)
```

- `text`: The text to modify.
- `replacements`: A table mapping patterns to their replacements.
- `backref`: The symbol for backreferences in replacements.
- `recursive`: If `yes`, applies replacements recursively.

**Returns:**
A new text with all specified replacements applied.

**Example:**
```tomo
>> text := "foo(x, baz(1))"
>> text:translate_patterns({
    $Pat"{id}(?)"="call(fn('@1'), @2)",
    $Pat"{id}"="var('@1')",
    $Pat"{int}"="int(@1)",
})
= "call(fn('foo'), var('x'), call(fn('baz'), int(1)))"
```

---

### `trim_pattern`
Removes matching patterns from the beginning and/or end of a text.

```tomo
func trim_pattern(text:Text, pattern=$Pat"{space}", left=yes, right=yes -> Text)
```

- `text`: The text to trim.
- `pattern`: The pattern to trim (defaults to `$Pat"{space}"`).
- `left`: If `yes`, trims from the beginning.
- `right`: If `yes`, trims from the end.

**Returns:**
The trimmed text.

**Example:**
```tomo
>> "123abc456":trim_pattern($Pat"{digit}")
= "abc"
```