Strings
Strings are implemented as immutable UTF-8-encoded values using:
- The Boehm Cord library for efficient storage and concatenation.
- GNU libunistring for unicode functionality (grapheme cluster counts, capitalization, etc.)
- My own BP library for simple pattern matching operations (similar to regex)
Syntax
Strings have a flexible syntax designed to make it easy to hold values from different languages without the need to have lots of escape sequences and without using printf-style string formatting.
// Basic string:
str := "Hello world"
str2 := 'Also a string'
Line Splits
Long strings can be split across multiple lines by having two or more dots at the start of a new line on the same indentation level that started the string:
str := "This is a long
.... line that is split in code"
Multi-line Strings
Multi-line strings have indented (i.e. at least one tab more than the start of the string) text inside quotation marks. The leading and trailing newline are ignored:
multi_line := "
	This string has multiple lines.
	Line two.
	You can split a line
.... using two or more dots to make an ellipsis.
	Remember to include whitespace after the ellipsis if desired.
	Or don't if you're splitting a long word like supercalifragilisticexpia
....lidocious
		This text is indented by one level in the string
	"quotes" are ignored unless they're at the same indentation level as the
.... start of the string.
	The end (no newline after this).
"
String Interpolations
Inside a double quoted string, you can use curly braces ({...}) to insert an
expression that you want converted to a string. This is called string
interpolation:
// Interpolation:
str := "Sum: {1 + 2}"
// equivalent to "Sum: 3"
Single-quoted strings do not have interpolations:
// No interpolation here:
str := 'Sum: {1 + 2}'
String Escapes
Unlike in many other languages, backslash is not a special character inside of
a string. For example, "x\ny" has the characters x, \, n, y, not a
newline. Instead, a series of character escapes act as complete string literals
without quotation marks:
newline := \n
crlf := \r\n
quote := \"
These string literals can be used as interpolation values:
two_lines := "one{\n}two"
has_quotes := "some {\"}quotes{\"} here"
However, in general it is best practice to use multi-line strings to avoid these problems:
str := "
	This has
	multiple lines and "quotes" too!
"
Multi-line Strings
There are two reasons for strings to span multiple lines in code: either you
have a string that contains newlines and you want to represent it without \n
escapes, or you have a long single-line string that you want to split across
multiple lines for readability. To support this, you can use newlines inside of
strings with indentation-sensitivity. For splitting long lines, use two or more
"."s at the same indentation level as the start of the string literal:
single_line := "This is a long string that
.... spans multiple lines"
For strings that contain newlines, you may put multiple indented lines inside the quotes:
multi_line := "
	line one
	line two
		this line is indented
	last line
"
Strings may only end on lines with the same indentation as the starting quote and nested quotes are ignored:
multi_line := "
	Quotes in indented regions like this: " don't count
"
If there is a leading or trailing newline, it is ignored and not included in the string.
str := "
	one line
"
>>> str == "one line"
=== yes
Additional newlines are counted though:
str := "

	blank lines

"
>>> str == "{\n}blank lines{\n}"
Advanced $-Strings
Sometimes you need to use many {s or "s inside a string, but you don't want
to type {\{} or {\"} each time. In such cases, you can use the more
advanced form of strings. The advanced form lets you explicitly specify which
characters are used for interpolation and which characters are used for
opening/closing the string. Advanced strings begin with a dollar sign ($),
followed by what interpolation style to use, followed by the character to use
to delimit the string, followed by the string contents and a closing string
delimiter. The interpolation style can be a matching pair ((), [], {}, or
<>) or any other single character. When the interpolation style is a matching
pair, the interpolation is any expression enclosed in that pair (e.g.
${}"interpolate {1 + 2}"). When the interpolation style is a single
character, the interpolation must be either a parenthesized expression or a
single term with no infix operators (e.g. a variable), for example:
$@"Interpolate @var or @(1 + 2)".
Here are some examples:
$[]"In here, quotes delimit the string and square brackets interpolate: [1 + 2]"
$@"For single-letter interpolations, the interpolation is a single term like @my_var without a closing symbol"
$@"But you can parenthesize expressions like: @(x + y) if you need to"
$$"Double dollars means dollar signs interpolate: $my_var $(1 + 2)"
$${If you have a string with "quotes" and 'single quotes', you can choose something else like curly braces to delimit the string}
$?#Here hashes delimit the string and question marks interpolate: ?(1 + 2)#
When strings are delimited by matching pairs ((), [], {}, or <>), they
can only be closed by a matched closing character at the same indentation
level, ignoring nested pairs:
$$(Inside parens, you can have (nested ()) parens no problem)
$$"But only (), [], {}, and <> are matching pairs, you can't have nested quotes"
$$(
	When indented, an unmatched ) won't close the string
	An unmatched ( won't mess things up either
	Only matching pairs on the same indentation level are counted:
)
$$(Multi-line string with nested (parens) and
.. line continuation)
As a special case, when you use the same character for interpolation and string delimiting, no interpolations are allowed:
plain := $""This string has {no interpolations}!"
Note: Normal doubly quoted strings with no dollar sign (e.g. "foo") are a
shorthand for ${}"foo". Singly quoted strings with no dollar sign (e.g.
'foo') are shorthand for $''foo'.
Operations
Concatenation
Concatenation is an O(1) operation in the typical case. It can be written either as an interpolation ("{x}{y}") or with the ++ operator (x ++ y).
Because concatenation is this cheap, there is no need for a separate string builder class in the language and no need to use an array of string fragments.
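To give a rough intuition for why concatenation can be constant-time, here is a minimal rope sketch in Python (not Tomo). It is an illustration of the general idea behind cord-style storage, not the Boehm Cord implementation; the names Leaf, Concat, concat, and flatten are hypothetical and exist only for this example.

```python
class Leaf:
    """A flat piece of text (hypothetical helper for this sketch)."""
    def __init__(self, text):
        self.text = text
        self.length = len(text)


class Concat:
    """A node that joins two ropes without copying either one."""
    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = left.length + right.length


def concat(a, b):
    # Constant time: one small node is allocated, whatever the operand sizes.
    return Concat(a, b)


def flatten(rope):
    # Linear time: the tree is only walked when flat text is actually needed.
    parts = []
    stack = [rope]
    while stack:
        node = stack.pop()
        if isinstance(node, Leaf):
            parts.append(node.text)
        else:
            # Push right first so the left side is visited first.
            stack.append(node.right)
            stack.append(node.left)
    return "".join(parts)


greeting = concat(concat(Leaf("Hello"), Leaf(", ")), Leaf("world"))
```

In this sketch the cost of building a long string is paid once, in flatten, rather than on every concatenation, which is the property that makes a separate string builder unnecessary.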
String Length
String length is an ambiguous term in the context of UTF-8 strings. There are several possible meanings, so each of these meanings is split into a separate method:
- Number of grapheme clusters: string:num_graphemes()
- Size in bytes: string:num_bytes()
- Number of unicode codepoints: string:num_codepoints() (you probably want to use graphemes, not codepoints in most applications)
Since the typical user expectation is that string length refers to "letters,"
the # length operator returns the number of grapheme clusters, which is the
closest unicode equivalent to "letters."
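The three meanings of "length" can give three different answers for the same text. The following Python sketch (not Tomo) shows this for a string containing a combining accent. The function names are hypothetical and only echo the method names above. The grapheme count here is a loud approximation: it just skips combining marks, whereas real grapheme segmentation follows the full Unicode rules.

```python
import unicodedata


def num_bytes(text):
    # Size of the UTF-8 encoding in bytes.
    return len(text.encode("utf-8"))


def num_codepoints(text):
    # Python strings are sequences of codepoints.
    return len(text)


def approx_num_graphemes(text):
    # Approximation only: count codepoints that are not combining marks.
    # This is NOT full grapheme cluster segmentation.
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# "cafe" followed by U+0301 (combining acute accent), displayed as "café".
word = "cafe\u0301"
sizes = (num_bytes(word), num_codepoints(word), approx_num_graphemes(word))
```

For this input, sizes is (6, 5, 4): six bytes, five codepoints, and four user-visible "letters."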
Iteration
Iteration is not supported for strings because of the ambiguity between bytes, codepoints, and graphemes. It is instead recommended that you explicitly iterate over bytes, codepoints, graphemes, words, lines, etc:
Subcomponents
- string:bytes() returns an array of Int8 bytes
- string:codepoints() returns an array of Int32 codepoints
- string:graphemes() returns an array of grapheme cluster strings
- string:words() returns an array of word strings
- string:lines() returns an array of line strings
- string:split(",", empty=no) returns an array of strings split by the given delimiter
Equality, Comparison, and Hashing
All text is compared and hashed using unicode normalization. Unicode provides
several different ways to represent the same text. For example, the single
codepoint U+E9 (latin small e with accent) is rendered the same as the two
code points U+65 U+301 (latin small e, acute combining accent) and has an
equivalent linguistic meaning. These are simply different ways to represent the
same "letter." In order to make it easy to write correct code that takes this
into account, Tomo uses unicode normalization for all string comparisons and
hashing. Normalization does the equivalent of converting text to a canonical
form before performing comparisons or hashing. This means that if a table is
created that has text with the codepoint U+E9 as a key, then a lookup with
the same text but with U+65 U+301 instead of U+E9 will still succeed in
finding the value because the two strings are equivalent under normalization.
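The table lookup described above can be imitated in Python (not Tomo) by normalizing keys explicitly. This is only a sketch of the concept: Python does not normalize on its own, the normalized helper is hypothetical, and the choice of the NFC form is an assumption, since the text above does not say which normalization form is used.

```python
import unicodedata


def normalized(text):
    # Assumption: NFC is used here purely for illustration.
    return unicodedata.normalize("NFC", text)


composed = "\u00e9"     # U+E9, a single codepoint
decomposed = "e\u0301"  # U+65 U+301, two codepoints

# Without normalization the two spellings compare as different strings.
raw_equal = composed == decomposed

# With normalization applied to both keys and lookups, they are equivalent.
table = {normalized(composed): "found"}
result = table.get(normalized(decomposed))
```

Here raw_equal is False while result is "found", which is the behavior the paragraph above describes as built in for all comparisons and hashing.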
Capitalization
x:capitalized()
x:titlecased()
x:uppercased()
x:lowercased()
Patterns
- string:has("target", at=Anywhere:enum(Anywhere, Start, End))->Bool: Check whether a pattern can be found
- string:without("target", at=Anywhere:enum(Anywhere, Start, End))->Text
- string:trimmed("chars...", at=Anywhere:enum(Anywhere, Start, End))->Text
- string:find("target")->enum(Failure, Success(index:Int32))
- string:replace("target", "replacement", limit=Int.max)->Text: Returns a copy of the string with replacements
- string:split("split")->[Text]
- string:join(["one", "two"])->Text