Add text slicing

Bruce Hill 2024-09-02 23:56:08 -04:00
parent 5aa5a5e99b
commit 6c22999ede
3 changed files with 32 additions and 19 deletions

@@ -1,10 +1,11 @@
# Text
`Text` is Tomo's datatype to represent text. The name `Text` is used instead of
"string" because Tomo represents text as an immutable UTF-8-encoded value that
uses the Boehm Cord library for efficient storage and concatenation. These are
_not_ C-style NULL-terminated character arrays. GNU libunistring is used for
full Unicode functionality (grapheme cluster counts, capitalization, etc.).
"string" because Tomo text represents immutable, normalized unicode data with
fast indexing that has an implementation that is efficient for concatenation.
These are _not_ C-style NULL-terminated character arrays. GNU libunistring is
used for full Unicode functionality (grapheme cluster counts, capitalization,
etc.).
## Syntax
@@ -238,25 +239,24 @@ of text fragments.
### Text Length
Text length is an ambiguous term in the context of UTF-8 text. There are
several possible meanings, so each of these meanings is split into a separate
method:
Text length gives you the number of grapheme clusters in the text, according to
the unicode standard. This corresponds to what you would intuitively think of
when you think of "letters" in a string. If you have text with an emoji that has
several joining modifiers attached to it, that text has a length of 1.
- Number of grapheme clusters: `text:num_clusters()`. This is probably what
you want to use, since it corresponds to the everyday notion of "letters".
- Size in bytes: `text:num_bytes()`
- Number of unicode codepoints: `text:num_codepoints()` (you probably want to
use clusters, not codepoints in most applications)
Since the typical user expectation is that text length refers to "letters,"
the `#` length operator returns the number of grapheme clusters, which is the
closest unicode equivalent to "letters."
```tomo
>> "hello".length
= 5
>> "👩🏽‍🚀".length
= 1
```
### Iteration
Iteration is *not* supported for text because of the ambiguity between bytes,
codepoints, and grapheme clusters. It is instead recommended that you
explicitly iterate over bytes, codepoints, graphemes, words, lines, etc:
Iteration is *not* supported for text. It is rarely ever the case that you will
need to iterate over text, but if you do, you can iterate over the length of
the text and retrieve 1-wide slices. Alternatively, you can split the text into
its constituent grapheme clusters with `text:split()` and iterate over those.
### Equality, Comparison, and Hashing

@@ -262,6 +262,7 @@ env_t *new_compilation_unit(CORD *libname)
{"quoted", "Text$quoted", "func(text:Text, color=no)->Text"},
{"replace", "Text$replace", "func(text:Text, pattern:Text, replacement:Text)->Text"},
{"split", "Text$split", "func(text:Text, pattern='')->[Text]"},
{"slice", "Text$slice", "func(text:Text, from=1, to=-1)->Text"},
{"title", "Text$title", "func(text:Text)->Text"},
{"trimmed", "Text$trimmed", "func(text:Text, trim=\" {\\n\\r\\t}\", where=Where.Anywhere)->Text"},
{"upper", "Text$upper", "func(text:Text)->Text"},

@@ -191,3 +191,15 @@ func main():
= 4
>> len
= 3_i64
//! Test text slicing:
>> "abcdef":slice()
= "abcdef"
>> "abcdef":slice(from=3)
= "cdef"
>> "abcdef":slice(to=-2)
= "abcde"
>> "abcdef":slice(from=2, to=4)
= "bcd"
>> "abcdef":slice(from=5, to=1)
= ""
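
The test cases above suggest that `slice` uses 1-based, inclusive bounds, with negative indices counting back from the end of the text (`-1` being the last position). A minimal Python sketch of those semantics follows. It is an illustrative model inferred from the tests, not the `Text$slice` implementation: the name `text_slice` is hypothetical, it indexes Python code points rather than grapheme clusters, and the clamping of out-of-range bounds is an assumption.

```python
def text_slice(text: str, start: int = 1, stop: int = -1) -> str:
    """Model of 1-based, inclusive slicing with negative indices from the end.

    Hypothetical helper: works on Python code points, not grapheme clusters.
    """
    n = len(text)
    # Translate negative indices so that -1 refers to the last position.
    if start < 0:
        start = n + start + 1
    if stop < 0:
        stop = n + stop + 1
    # Assumption: out-of-range bounds are clamped instead of raising an error.
    start = max(start, 1)
    stop = min(stop, n)
    # An inverted range yields empty text, matching slice(from=5, to=1).
    if start > stop:
        return ""
    # Convert the 1-based inclusive range to Python's 0-based half-open slice.
    return text[start - 1:stop]


if __name__ == "__main__":
    print(text_slice("abcdef", start=2, stop=4))
```

Under this model, the 1-wide slices mentioned in the iteration notes would correspond to `text_slice(text, i, i)` for `i` from 1 up to the length of the text.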