diff --git a/docs/text.md b/docs/text.md
index 70708fe..a1a8edd 100644
--- a/docs/text.md
+++ b/docs/text.md
@@ -1,10 +1,11 @@
 # Text
 
 `Text` is Tomo's datatype to represent text. The name `Text` is used instead of
-"string" because Tomo represents text as an immutable UTF-8-encoded value that
-uses the Boehm Cord library for efficient storage and concatenation. These are
-_not_ C-style NULL-terminated character arrays. GNU libunistring is used for
-full Unicode functionality (grapheme cluster counts, capitalization, etc.).
+"string" because Tomo text represents immutable, normalized unicode data with
+fast indexing that has an implementation that is efficient for concatenation.
+These are _not_ C-style NULL-terminated character arrays. GNU libunistring is
+used for full Unicode functionality (grapheme cluster counts, capitalization,
+etc.).
 
 ## Syntax
 
@@ -238,25 +239,24 @@ of text fragments.
 
 ### Text Length
 
-Text length is an ambiguous term in the context of UTF-8 text. There are
-several possible meanings, so each of these meanings is split into a separate
-method:
+Text length gives you the number of grapheme clusters in the text, according to
+the unicode standard. This corresponds to what you would intuitively think of
+when you think of "letters" in a string. If you have text with an emoji that has
+several joining modifiers attached to it, that text has a length of 1.
 
-- Number of grapheme clusters: `text:num_clusters()`. This is probably what
-  you want to use, since it corresponds to the everyday notion of "letters".
-- Size in bytes: `text:num_bytes()`
-- Number of unicode codepoints: `text:num_codepoints()` (you probably want to
-  use clusters, not codepoints in most applications)
-
-Since the typical user expectation is that text length refers to "letters,"
-the `#` length operator returns the number of grapheme clusters, which is the
-closest unicode equivalent to "letters."
+```tomo
+>> "hello".length
+= 5
+>> "👩🏽‍🚀".length
+= 1
+```
 
 ### Iteration
 
-Iteration is *not* supported for text because of the ambiguity between bytes,
-codepoints, and grapheme clusters. It is instead recommended that you
-explicitly iterate over bytes, codepoints, graphemes, words, lines, etc:
+Iteration is *not* supported for text. It is rarely ever the case that you will
+need to iterate over text, but if you do, you can iterate over the length of
+the text and retrieve 1-wide slices. Alternatively, you can split the text into
+its constituent grapheme clusters with `text:split()` and iterate over those.
 
 ### Equality, Comparison, and Hashing
 
diff --git a/environment.c b/environment.c
index ee277d2..dc897ae 100644
--- a/environment.c
+++ b/environment.c
@@ -262,6 +262,7 @@ env_t *new_compilation_unit(CORD *libname)
     {"quoted", "Text$quoted", "func(text:Text, color=no)->Text"},
     {"replace", "Text$replace", "func(text:Text, pattern:Text, replacement:Text)->Text"},
     {"split", "Text$split", "func(text:Text, pattern='')->[Text]"},
+    {"slice", "Text$slice", "func(text:Text, from=1, to=-1)->Text"},
     {"title", "Text$title", "func(text:Text)->Text"},
     {"trimmed", "Text$trimmed", "func(text:Text, trim=\" {\\n\\r\\t}\", where=Where.Anywhere)->Text"},
     {"upper", "Text$upper", "func(text:Text)->Text"},
diff --git a/test/text.tm b/test/text.tm
index 39e8a6e..afd0305 100644
--- a/test/text.tm
+++ b/test/text.tm
@@ -191,3 +191,15 @@ func main():
     = 4
     >> len
     = 3_i64
+
+    //! Test text slicing:
+    >> "abcdef":slice()
+    = "abcdef"
+    >> "abcdef":slice(from=3)
+    = "cdef"
+    >> "abcdef":slice(to=-2)
+    = "abcde"
+    >> "abcdef":slice(from=2, to=4)
+    = "bcd"
+    >> "abcdef":slice(from=5, to=1)
+    = ""
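
Reviewer note: the slicing tests in this diff suggest that `slice` takes 1-based, inclusive positions, that negative positions count back from the end of the text, and that a `from` past `to` yields empty text. The following is a minimal Python sketch of that rule, not the actual `Text$slice` implementation. The function name `slice_text` and the parameter names `start` and `stop` are hypothetical (`from` is a reserved word in Python), the clamping of out-of-range positions is an assumption the tests do not cover, and the sketch indexes Python code points rather than grapheme clusters.

```python
def slice_text(text: str, start: int = 1, stop: int = -1) -> str:
    """Hypothetical model of the slicing rule implied by the tests.

    Positions are 1-based and inclusive; negative positions count from
    the end (-1 is the last element). Works on code points, not on
    grapheme clusters as Tomo's Text does.
    """
    n = len(text)
    # Translate negative positions into 1-based positions from the start.
    if start < 0:
        start = n + start + 1
    if stop < 0:
        stop = n + stop + 1
    # Assumption: out-of-range positions are clamped rather than rejected.
    start = max(start, 1)
    stop = min(stop, n)
    # A start past the stop produces empty text, as in the last test case.
    if start > stop:
        return ""
    # Convert the 1-based inclusive range to Python's 0-based half-open range.
    return text[start - 1:stop]


if __name__ == "__main__":
    print(slice_text("abcdef", start=2, stop=4))
```

Under these assumptions the sketch reproduces all five expectations in `test/text.tm`; how the real method treats position 0 or positions beyond the text length is not shown in this diff.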