2024-02-11 18:44:06 -08:00
# Strings

Strings are implemented as immutable UTF-8-encoded values using:

- The Boehm Cord library for efficient storage and concatenation.
- GNU libunistring for Unicode functionality (grapheme cluster counts,
  capitalization, etc.)
- My own BP library for simple pattern matching operations (similar to regex)

## Syntax

Strings have a flexible syntax designed to make it easy to hold values from
different languages without the need to have lots of escape sequences and
without using printf-style string formatting.

```
// Basic string:
str := "Hello world"
str2 := 'Also a string'
```

## Line Splits

Long strings can be split across multiple lines by having two or more dots at
the start of a new line on the same indentation level that started the string:

```
str := "This is a long
.... line that is split in code"
```

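The joining rule can be illustrated outside the language. The following Python sketch is not Tomo's parser: `join_split_lines` is a hypothetical helper that assumes a continuation is a newline, optional indentation, and then two or more dots, and it does not check that the dots sit at the string's starting indentation level.

```python
import re

# Hypothetical helper, not part of Tomo: models the line-split rule on raw text.
# Assumption: a continuation is a newline, optional indentation, then 2+ dots.
_CONTINUATION = re.compile(r"\n[ \t]*\.{2,}")


def join_split_lines(raw: str) -> str:
    """Remove each newline-plus-dots marker so the pieces form one line."""
    return _CONTINUATION.sub("", raw)


print(join_split_lines("This is a long\n.... line that is split in code"))
print(join_split_lines("supercalifragilisticexpia\n....lidocious"))
```

Whitespace after the dots is preserved, which is why the first example keeps the space before `line` and the second joins the two halves of the word directly.
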
## Multi-line Strings

Multi-line strings have indented (i.e. at least one tab more than the start of
the string) text inside quotation marks. The leading and trailing newline are
ignored:

```
multi_line := "
	This string has multiple lines.
	Line two.

	You can split a line
	.... using two or more dots to make an ellipsis.

	Remember to include whitespace after the ellipsis if desired.

	Or don't if you're splitting a long word like supercalifragilisticexpia
	....lidocious

		This text is indented by one level in the string

	"quotes" are ignored unless they're at the same indentation level as the
	.... start of the string.

	The end (no newline after this).
"
```

## String Interpolations

Inside a double-quoted string, you can use curly braces (`{...}`) to insert an
expression that you want converted to a string. This is called string
interpolation:

```
// Interpolation:
str := "Sum: {1 + 2}"
// equivalent to "Sum: 3"
```

Single-quoted strings do not have interpolations:

```
// No interpolation here:
str := 'Sum: {1 + 2}'
```

## String Escapes

Unlike in many other languages, a backslash is *not* a special character inside
a string. For example, `"x\ny"` contains the characters `x`, `\`, `n`, and `y`,
not a newline. Instead, a series of character escapes acts as a complete string
literal without quotation marks:

```
newline := \n
crlf := \r\n
quote := \"
```

These string literals can be used as interpolation values:

```
two_lines := "one{\n}two"
has_quotes := "some {\"}quotes{\"} here"
```

However, in general it is best practice to use multi-line strings to avoid these problems:

```
str := "
	This has
	multiple lines and "quotes" too!
"
```

### Multi-line Strings

There are two reasons for strings to span multiple lines in code: either you
have a string that contains newlines and you want to represent it without `\n`
escapes, or you have a long single-line string that you want to split across
multiple lines for readability. To support this, you can use newlines inside of
strings with indentation-sensitivity. For splitting long lines, use two or more
dots (`.`) at the same indentation level as the start of the string literal:

```
single_line := "This is a long string that
.... spans multiple lines"
```

For strings that contain newlines, you may put multiple indented lines inside
the quotes:

```
multi_line := "
	line one
	line two
		this line is indented
	last line
"
```

Strings may only end on lines with the same indentation as the starting quote,
and nested quotes are ignored:

```
multi_line := "
	Quotes in indented regions like this: " don't count
"
```

If there is a leading or trailing newline, it is ignored and not included in
the string.

```
str := "
	one line
"

>>> str == "one line"
=== yes
```

Additional newlines *are* counted though:

```
str := "

	blank lines

"

>>> str == "{\n}blank lines{\n}"
```

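The newline and indentation rules above can be modeled in a few lines. This is a Python sketch of the described behavior, not Tomo's implementation; `multiline_value` is a hypothetical name, and it assumes one tab per indentation level.

```python
def multiline_value(body: str, indent: str = "\t") -> str:
    """Compute the string value from the raw text between the quotes.

    Hypothetical model of the rules described above, assuming tab indentation.
    """
    # Exactly one leading and one trailing newline are ignored, if present.
    if body.startswith("\n"):
        body = body[1:]
    if body.endswith("\n"):
        body = body[:-1]
    # Remove one indentation level per line; deeper indentation is kept.
    lines = [
        line[len(indent):] if line.startswith(indent) else line
        for line in body.split("\n")
    ]
    return "\n".join(lines)


print(repr(multiline_value("\n\tone line\n")))
print(repr(multiline_value("\n\n\tblank lines\n\n")))
```

Only the first and last newline are dropped, so the second call keeps one newline on each side of `blank lines`, matching the `"{\n}blank lines{\n}"` comparison above.
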
### Advanced $-Strings

Sometimes you need to use many `{`s or `"`s inside a string, but you don't want
to type `{\{}` or `{\"}` each time. In such cases, you can use the more
advanced form of strings. The advanced form lets you explicitly specify which
characters are used for interpolation and which characters are used for
opening/closing the string. Advanced strings begin with a dollar sign (`$`),
followed by the interpolation style, followed by the character used to delimit
the string, followed by the string contents and a closing string delimiter. The
interpolation style can be a matching pair (`()`, `[]`, `{}`, or `<>`) or any
other single character. When the interpolation style is a matching pair, the
interpolation is any expression enclosed in that pair (e.g.
`${}"interpolate {1 + 2}"`). When the interpolation style is a single
character, the interpolation must be either a parenthesized expression or a
single term with no infix operators (e.g. a variable), for example:
`$@"Interpolate @var or @(1 + 2)"`.

Here are some examples:

```
$[]"In here, quotes delimit the string and square brackets interpolate: [1 + 2]"
$@"For single-letter interpolations, the interpolation is a single term like @my_var without a closing symbol"
$@"But you can parenthesize expressions like: @(x + y) if you need to"
$$"Double dollars means dollar signs interpolate: $my_var $(1 + 2)"
$${If you have a string with "quotes" and 'single quotes', you can choose something else like curly braces to delimit the string}
$?#Here hashes delimit the string and question marks interpolate: ?(1 + 2)#
```

When strings are delimited by matching pairs (`()`, `[]`, `{}`, or `<>`), they
can only be closed by a matched closing character at the same indentation
level, ignoring nested pairs:

```
$$(Inside parens, you can have (nested ()) parens no problem)
$$"But only (), [], {}, and <> are matching pairs, you can't have nested quotes"
$$(
	When indented, an unmatched ) won't close the string
	An unmatched ( won't mess things up either
	Only matching pairs on the same indentation level are counted:
)
$$(Multi-line string with nested (parens) and
.. line continuation)
```

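Nested-pair matching is commonly done with a depth counter. The Python sketch below shows that general technique for the single-line case only; `find_closing` is a hypothetical helper, it is not taken from Tomo's parser, and it deliberately ignores the indentation rule for multi-line strings.

```python
# The four matching pairs named in the text above.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def find_closing(text: str, open_index: int) -> int:
    """Return the index of the delimiter closing text[open_index], or -1.

    Hypothetical single-line model: nested pairs of the same kind raise and
    lower a depth counter, and the string ends when the depth returns to zero.
    """
    opener = text[open_index]
    closer = PAIRS[opener]
    depth = 0
    for i in range(open_index, len(text)):
        if text[i] == opener:
            depth += 1
        elif text[i] == closer:
            depth -= 1
            if depth == 0:
                return i
    return -1


source = "$$(Inside parens, you can have (nested ()) parens no problem)"
print(find_closing(source, 2) == len(source) - 1)
```

Because quotes have no distinct opening and closing forms, a counter like this cannot track nesting for them, which is consistent with the restriction shown in the second example above.
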
As a special case, when you use the same character for interpolation and string
delimiting, no interpolations are allowed:

```
plain := $""This string has {no interpolations}!"
```

**Note:** Normal double-quoted strings with no dollar sign (e.g. `"foo"`) are
shorthand for `${}"foo"`. Single-quoted strings with no dollar sign (e.g.
`'foo'`) are shorthand for `$''foo'`.

## Operations

### Concatenation

Concatenation in the typical case is an O(1) operation: `"{x}{y}"` or `x ++ y`.

Because string concatenation is typically an O(1) operation, there is no need
for a separate string builder class in the language and no need to use an array
of string fragments.

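To see why concatenation can be constant-time, consider a rope-style structure in which joining two strings allocates one node instead of copying characters. The Python sketch below is a simplified illustration of that idea under my own assumptions; it is not the Boehm Cord library's implementation, and the names `Leaf`, `Concat`, `concat`, and `flatten` are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    """A flat piece of immutable text."""
    text: str

    @property
    def length(self) -> int:
        return len(self.text)


@dataclass(frozen=True)
class Concat:
    """An interior node that refers to two existing ropes."""
    left: "Rope"
    right: "Rope"
    length: int


Rope = Union[Leaf, Concat]


def concat(a: Rope, b: Rope) -> Rope:
    # O(1): one node is allocated and no characters are copied.
    return Concat(a, b, a.length + b.length)


def flatten(rope: Rope) -> str:
    # O(n): walk the tree left to right only when flat text is needed.
    parts = []
    stack = [rope]
    while stack:
        node = stack.pop()
        if isinstance(node, Leaf):
            parts.append(node.text)
        else:
            stack.append(node.right)
            stack.append(node.left)
    return "".join(parts)


greeting = concat(concat(Leaf("Hello"), Leaf(", ")), Leaf("world"))
print(flatten(greeting), greeting.length)
```

Immutability is what makes the sharing safe: since no leaf can change after creation, many ropes can refer to the same pieces.
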
### String Length

String length is an ambiguous term in the context of UTF-8 strings. There are
several possible meanings, so each of these meanings is split into a separate
method:

- Number of grapheme clusters: `string:num_graphemes()`
- Size in bytes: `string:num_bytes()`
- Number of Unicode codepoints: `string:num_codepoints()` (you probably want to
  use graphemes, not codepoints, in most applications)

Since the typical user expectation is that string length refers to "letters,"
the `#` length operator returns the number of grapheme clusters, which is the
closest Unicode equivalent to "letters."

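The three meanings diverge as soon as a string contains a combining mark. The Python sketch below illustrates the difference using only the standard library; the function names mirror the methods above but are hypothetical Python helpers, and the grapheme count is a rough approximation (it only skips combining marks), not the full Unicode segmentation algorithm that a library such as libunistring implements.

```python
import unicodedata


def num_bytes(text: str) -> int:
    """Size of the UTF-8 encoding in bytes."""
    return len(text.encode("utf-8"))


def num_codepoints(text: str) -> int:
    """Number of Unicode codepoints (Python's len on str)."""
    return len(text)


def approx_num_graphemes(text: str) -> int:
    """Rough approximation: count codepoints that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# "café" written with a combining acute accent (U+0301) after the "e".
text = "cafe\u0301"
print(num_bytes(text), num_codepoints(text), approx_num_graphemes(text))
```

A reader sees four letters, yet the string holds five codepoints and six bytes, which is the motivation for making the grapheme count the default.
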
### Iteration

Iteration is *not* supported for strings because of the ambiguity between
bytes, codepoints, and graphemes. It is instead recommended that you explicitly
iterate over bytes, codepoints, graphemes, words, lines, etc.:

### Subcomponents

- `string:bytes()` returns an array of `Int8` bytes
- `string:codepoints()` returns an array of `Int32` codepoints
- `string:graphemes()` returns an array of grapheme cluster strings
- `string:words()` returns an array of word strings
- `string:lines()` returns an array of line strings
- `string:split(",", empty=no)` returns an array of strings split by the given delimiter

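For readers more familiar with other languages, here are rough Python analogues of several of these views. This is only an analogy: the Python calls are standard, but their edge-case behavior is not guaranteed to match the Tomo methods listed above, Python reports bytes as unsigned values from 0 to 255 rather than `Int8`, and the standard library has no grapheme splitter.

```python
text = "h\u00e9 wo\nx,y"  # "hé wo" on one line, "x,y" on the next

# Bytes of the UTF-8 encoding (unsigned in Python).
byte_values = list(text.encode("utf-8"))

# One integer per Unicode codepoint.
codepoints = [ord(ch) for ch in text]

# Whitespace-separated words and newline-separated lines.
words = text.split()
lines = text.splitlines()

# Split on an explicit delimiter.
fields = text.split(",")

print(len(byte_values), len(codepoints), words, lines, fields)
```

The byte list is one element longer than the codepoint list because `é` (U+E9) takes two bytes in UTF-8.
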
### Equality, Comparison, and Hashing

All text is compared and hashed using Unicode normalization. Unicode provides
several different ways to represent the same text. For example, the single
codepoint `U+E9` (latin small e with accent) is rendered the same as the two
codepoints `U+65 U+301` (latin small e, acute combining accent) and has an
equivalent linguistic meaning. These are simply different ways to represent the
same "letter." In order to make it easy to write correct code that takes this
into account, Tomo uses Unicode normalization for all string comparisons and
hashing. Normalization does the equivalent of converting text to a canonical
form before performing comparisons or hashing. This means that if a table is
created that has text with the codepoint `U+E9` as a key, then a lookup with
the same text but with `U+65 U+301` instead of `U+E9` will still succeed in
finding the value because the two strings are equivalent under normalization.

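The table lookup described above can be reproduced in Python to show what normalization buys. This is a sketch of the concept, not of Tomo's internals: it assumes the NFC normalization form (the text above does not say which form is used), and `normalized` and `NormalizedKeyTable` are hypothetical names.

```python
import unicodedata


def normalized(text: str) -> str:
    # Assumption: NFC. Any single canonical form would work for this purpose.
    return unicodedata.normalize("NFC", text)


class NormalizedKeyTable:
    """A dict wrapper that normalizes keys on both insert and lookup."""

    def __init__(self):
        self._items = {}

    def __setitem__(self, key: str, value):
        self._items[normalized(key)] = value

    def __getitem__(self, key: str):
        return self._items[normalized(key)]


precomposed = "caf\u00e9"   # uses the single codepoint U+E9
decomposed = "cafe\u0301"   # uses U+65 followed by U+301

table = NormalizedKeyTable()
table[precomposed] = "found"

print(precomposed == decomposed)                          # raw codepoints differ
print(normalized(precomposed) == normalized(decomposed))  # equal once normalized
print(table[decomposed])                                  # lookup still succeeds
```

In Python the programmer has to remember to normalize at every comparison and hash site; building it into the language's equality and hashing removes that burden.
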
### Capitalization

- `x:capitalized()`
- `x:titlecased()`
- `x:uppercased()`
- `x:lowercased()`

### Patterns

- `string:has("target", at=Anywhere:enum(Anywhere, Start, End))->Bool` Check whether a pattern can be found
- `string:without("target", at=Anywhere:enum(Anywhere, Start, End))->Text`
- `string:trimmed("chars...", at=Anywhere:enum(Anywhere, Start, End))->Text`
- `string:find("target")->enum(Failure, Success(index:Int32))`
- `string:replace("target", "replacement", limit=Int.max)->Text` Returns a copy of the string with replacements
- `string:split("split")->[Text]`
- `string:join(["one", "two"])->Text`

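The `at` parameter in the signatures above selects where a target is matched. The Python sketch below models one plausible reading of `has` and `without` with plain substrings rather than patterns. The semantics are assumptions, in particular that `without` at `Anywhere` removes every occurrence, and the string values `"anywhere"`, `"start"`, and `"end"` stand in for the enum.

```python
def has(text: str, target: str, at: str = "anywhere") -> bool:
    """Hypothetical model: check for target anywhere, at the start, or at the end."""
    if at == "start":
        return text.startswith(target)
    if at == "end":
        return text.endswith(target)
    if at == "anywhere":
        return target in text
    raise ValueError(f"unknown position: {at}")


def without(text: str, target: str, at: str = "anywhere") -> str:
    """Hypothetical model: return a copy of text with target removed."""
    if not target:
        return text
    if at == "start":
        return text[len(target):] if text.startswith(target) else text
    if at == "end":
        return text[:-len(target)] if text.endswith(target) else text
    if at == "anywhere":
        # Assumption: every occurrence is removed.
        return text.replace(target, "")
    raise ValueError(f"unknown position: {at}")


print(has("hello world", "hello", at="start"))
print(without("hello world", "world", at="end"))
```

Folding the position into one keyword argument keeps the method count small compared with separate starts-with, ends-with, and contains methods.
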