Add text slicing

author: Bruce Hill <bruce@bruce-hill.com> 2024-09-02 23:56:08 -0400
committer: Bruce Hill <bruce@bruce-hill.com> 2024-09-02 23:56:08 -0400
commit: 6c22999eded909b77b1d7718cf3f1dc969b55779 (patch)
tree: 2dd833569ae33890bf61d891b84e4c2eaf9db16f
parent: 5aa5a5e99b322586eed9997a14b3d616540bef07 (diff)
3 files changed, 32 insertions, 19 deletions
diff --git a/docs/text.md b/docs/text.md
index 70708fe4..a1a8edd4 100644
--- a/docs/text.md
+++ b/docs/text.md
@@ -1,10 +1,11 @@
 # Text
 
 `Text` is Tomo's datatype to represent text. The name `Text` is used instead of
-"string" because Tomo represents text as an immutable UTF-8-encoded value that
-uses the Boehm Cord library for efficient storage and concatenation. These are
-_not_ C-style NULL-terminated character arrays. GNU libunistring is used for
-full Unicode functionality (grapheme cluster counts, capitalization, etc.).
+"string" because Tomo text represents immutable, normalized Unicode data with
+fast indexing and an implementation that is efficient for concatenation.
+These are _not_ C-style NULL-terminated character arrays. GNU libunistring is
+used for full Unicode functionality (grapheme cluster counts, capitalization,
+etc.).
 
 ## Syntax
 
@@ -238,25 +239,24 @@ of text fragments.
 
 ### Text Length
 
-Text length is an ambiguous term in the context of UTF-8 text. There are
-several possible meanings, so each of these meanings is split into a separate
-method:
+Text length gives you the number of grapheme clusters in the text, according to
+the Unicode standard. This corresponds to what you would intuitively think of
+when you think of "letters" in a string. If you have text with an emoji that has
+several joining modifiers attached to it, that text has a length of 1.
 
-- Number of grapheme clusters: `text:num_clusters()`. This is probably what
-  you want to use, since it corresponds to the everyday notion of "letters".
-- Size in bytes: `text:num_bytes()`
-- Number of unicode codepoints: `text:num_codepoints()` (you probably want to
-  use clusters, not codepoints in most applications)
-
-Since the typical user expectation is that text length refers to "letters,"
-the `#` length operator returns the number of grapheme clusters, which is the
-closest unicode equivalent to "letters."
+```tomo
+>> "hello".length
+= 5
+>> "👩🏽‍🚀".length
+= 1
+```
 
 ### Iteration
 
-Iteration is *not* supported for text because of the ambiguity between bytes,
-codepoints, and grapheme clusters. It is instead recommended that you
-explicitly iterate over bytes, codepoints, graphemes, words, lines, etc:
+Iteration is *not* supported for text. You will rarely need to iterate over
+text, but if you do, you can iterate over the length of the text and retrieve
+1-wide slices. Alternatively, you can split the text into its constituent
+grapheme clusters with `text:split()` and iterate over those.
 
 ### Equality, Comparison, and Hashing
 
diff --git a/environment.c b/environment.c
index ee277d2a..dc897ae0 100644
--- a/environment.c
+++ b/environment.c
@@ -262,6 +262,7 @@ env_t *new_compilation_unit(CORD *libname)
             {"quoted", "Text$quoted", "func(text:Text, color=no)->Text"},
             {"replace", "Text$replace", "func(text:Text, pattern:Text, replacement:Text)->Text"},
             {"split", "Text$split", "func(text:Text, pattern='')->[Text]"},
+            {"slice", "Text$slice", "func(text:Text, from=1, to=-1)->Text"},
             {"title", "Text$title", "func(text:Text)->Text"},
             {"trimmed", "Text$trimmed", "func(text:Text, trim=\" {\\n\\r\\t}\", where=Where.Anywhere)->Text"},
             {"upper", "Text$upper", "func(text:Text)->Text"},
diff --git a/test/text.tm b/test/text.tm
index 39e8a6e1..afd0305a 100644
--- a/test/text.tm
+++ b/test/text.tm
@@ -191,3 +191,15 @@ func main():
 	= 4
 	>> len
 	= 3_i64
+
+	//! Test text slicing:
+	>> "abcdef":slice()
+	= "abcdef"
+	>> "abcdef":slice(from=3)
+	= "cdef"
+	>> "abcdef":slice(to=-2)
+	= "abcde"
+	>> "abcdef":slice(from=2, to=4)
+	= "bcd"
+	>> "abcdef":slice(from=5, to=1)
+	= ""