Updated docs

This commit is contained in:
Bruce Hill 2024-09-03 01:18:22 -04:00
parent 5441e6f287
commit 7b44044b5e

View File

@ -7,6 +7,43 @@ These are _not_ C-style NULL-terminated character arrays. GNU libunistring is
used for full Unicode functionality (grapheme cluster counts, capitalization, used for full Unicode functionality (grapheme cluster counts, capitalization,
etc.). etc.).
## Implementation
Internally, Tomo text's implementation is based on [Raku's
strings](https://docs.raku.org/language/unicode). Strings store their grapheme
cluster count and either a compact array of 8-bit ASCII characters (for ASCII
text), an array of 32-bit normal-form grapheme cluster values (see below), or a
flat subarray of multiple texts that are either ASCII or graphemes (the
structure is not arbitrarily nested).
### Normal-Form Graphemes
In order to maintain compact storage, fast indexing, and fast slicing,
non-ASCII text is stored as 32-bit normal-form graphemes. A normal-form
grapheme is either a positive value representing a Unicode codepoint that
corresponds to a grapheme cluster (most Unicode letters used in natural
language fall into this category after normalization) or a negative value
representing an index into an internal array of "synthetic grapheme cluster
codepoints." Here are some examples:
- `A` is a normal codepoint that is also a grapheme cluster, so it would
be represented as the number `65`
- `ABC` is three separate grapheme clusters, one for `A`, one for `B`, one
for `C`.
- `Å` is also a single codepoint (`LATIN CAPITAL LETTER A WITH RING ABOVE`)
that is also a grapheme cluster, so it would be represented as the number
`197`.
- `家` (Japanese for "house") is a single codepoint (`CJK Unified
Ideograph-5BB6`) that is also a grapheme cluster, so it would be represented
as the number `23478`
-`👩🏽‍🚀` is a single graphical cluster, but it's made up of several
combining codepoints (`["WOMAN", "EMOJI MODIFIER FITZPATRICK TYPE-4", "ZERO
WITDH JOINER", "ROCKET"]`). Since this can't be represented with a single
codepoint, we must create a synthetic codepoint for it. If this was the `n`th
synthetic codepoint that we've found, then we would represent it with the
number `-n`, which can be used to look it up in a lookup table. The lookup
table holds the full sequence of codepoints used in the grapheme cluster.
## Syntax ## Syntax
Text has a flexible syntax designed to make it easy to hold values from Text has a flexible syntax designed to make it easy to hold values from
@ -20,7 +57,7 @@ str2 := 'Also text'
str3 := `Backticks too` str3 := `Backticks too`
``` ```
## Line Splits ### Line Splits
Long text can be split across multiple lines by having two or more dots at the Long text can be split across multiple lines by having two or more dots at the
start of a new line on the same indentation level that started the text: start of a new line on the same indentation level that started the text:
@ -30,7 +67,7 @@ str := "This is a long
....... line that is split in code" ....... line that is split in code"
``` ```
## Multi-line Text ### Multi-line Text
Multi-line text has indented (i.e. at least one tab more than the start of the Multi-line text has indented (i.e. at least one tab more than the start of the
text) text inside quotation marks. The leading and trailing newline are text) text inside quotation marks. The leading and trailing newline are
@ -81,7 +118,7 @@ Single-quoted text does not have interpolations:
str := 'Sum: $(1 + 2)' str := 'Sum: $(1 + 2)'
``` ```
## Text Escapes ### Text Escapes
Unlike other languages, backslash is *not* a special character inside of text. Unlike other languages, backslash is *not* a special character inside of text.
For example, `"x\ny"` has the characters `x`, `\`, `n`, `y`, not a newline. For example, `"x\ny"` has the characters `x`, `\`, `n`, `y`, not a newline.
@ -111,64 +148,6 @@ str := "
" "
``` ```
### Multi-line Text
There are two reasons for text to span multiple lines in code: either you
have text that contains newlines and you want to represent it without `\n`
escapes, or you have a long single-line text that you want to split across
multiple lines for readability. To support this, you can use newlines inside of
text with indentation-sensitivity. For splitting long lines, use two or more
"."s at the same indentation level as the start of the text literal:
```
single_line := "This is a long text that
... spans multiple lines"
```
For text that contains newlines, you may put multiple indented lines inside
the quotes:
```
multi_line := "
line one
line two
this line is indented
last line
"
```
Text may only end on lines with the same indentation as the starting quote
and nested quotes are ignored:
```
multi_line := "
Quotes in indented regions like this: " don't count
"
```
If there is a leading or trailing newline, it is ignored and not included in
the text.
```
str := "
one line
"
>>> str == "one line"
=== yes
```
Additional newlines *are* counted though:
```
str := "
blank lines
"
>>> str == "{\n}blank lines{\n}"
```
### Customizable `$`-Text ### Customizable `$`-Text
Sometimes you might need to use a lot of literal `$`s or quotation marks in Sometimes you might need to use a lot of literal `$`s or quotation marks in
@ -231,11 +210,11 @@ shorthand for `${}"foo"`. Singly quoted text with no dollar sign (e.g.
### Concatenation ### Concatenation
Concatenation in the typical case is an O(1) operation: `"{x}{y}"` or `x ++ y`. Concatenation in the typical case is a fast operation: `"{x}{y}"` or `x ++ y`.
Because text concatenation is typically an O(1) operation, there is no need for Because text concatenation is typically fast, there is no need for a separate
a separate "string builder" class in the language and no need to use an array "string builder" class in the language and no need to use an array of text
of text fragments. fragments.
### Text Length ### Text Length