From 8ef46380189d2ce39a0ec2a489e704b059676bf9 Mon Sep 17 00:00:00 2001
From: Bruce Hill <bruce@bruce-hill.com>
Date: Sun, 11 Feb 2024 21:44:06 -0500
Subject: Improved strings and docs

---
 docs/strings.md | 300 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 300 insertions(+)
 create mode 100644 docs/strings.md

(limited to 'docs/strings.md')

diff --git a/docs/strings.md b/docs/strings.md
new file mode 100644
index 00000000..1893d70f
--- /dev/null
+++ b/docs/strings.md
@@ -0,0 +1,300 @@
+# Strings
+
+Strings are implemented as immutable UTF-8-encoded values using:
+
+- The Boehm Cord library for efficient storage and concatenation.
+- GNU libunistring for unicode functionality (grapheme cluster counts,
+  capitalization, etc.)
+- My own BP library for simple pattern matching operations (similar to regex)
+
+## Syntax
+
+Strings have a flexible syntax designed to make it easy to hold values from
+different languages without the need to have lots of escape sequences and
+without using printf-style string formatting.
+
+```
+// Basic string:
+str := "Hello world"
+str2 := 'Also a string'
+```
+
+## Line Splits
+
+Long strings can be split across multiple lines by having two or more dots at
+the start of a new line on the same indentation level that started the string:
+
+```
+str := "This is a long
+.... line that is split in code"
+```
+
+## Multi-line Strings
+
+Multi-line strings have indented (i.e. at least one tab more than the start of
+the string) text inside quotation marks. The leading and trailing newline are
+ignored:
+
+```
+multi_line := "
+    This string has multiple lines.
+    Line two.
+
+    You can split a line
+.... using two or more dots to make an elipsis.
+
+    Remember to include whitespace after the elipsis if desired.
+
+    Or don't if you're splitting a long word like supercalifragilisticexpia
+....lidocious
+
+        This text is indented by one level in the string
+
+    "quotes" are ignored unless they're at the same indentation level as the
+.... start of the string.
+
+    The end (no newline after this).
+"
+```
+
+## String Interpolations
+
+Inside a double quoted string, you can use curly braces (`{...}`) to insert an
+expression that you want converted to a string. This is called string
+interpolation:
+
+```
+// Interpolation:
+str := "Sum: {1 + 2}"
+// equivalent to "Sum: 3"
+```
+
+Single-quoted strings do not have interpolations:
+
+```
+// No interpolation here:
+str := 'Sum: {1 + 2}'
+```
+
+## String Escapes
+
+Unlike other languages, backslash is *not* a special character inside of a
+string. For example, `"x\ny"` has the characters `x`, `\`, `n`, `y`, not a
+newline. Instead, a series of character escapes act as complete string literals
+without quotation marks:
+
+```
+newline := \n
+crlf := \r\n
+quote := \"
+```
+
+These string literals can be used as interpolation values:
+
+```
+two_lines := "one{\n}two"
+has_quotes := "some {\"}quotes{\"} here"
+```
+
+However, in general it is best practice to use multi-line strings to avoid these problems:
+
+```
+str := "
+    This has
+    multiple lines and "quotes" too!
+"
+```
+
+### Multi-line Strings
+
+There are two reasons for strings to span multiple lines in code: either you
+have a string that contains newlines and you want to represent it without `\n`
+escapes, or you have a long single-line string that you want to split across
+multiple lines for readability. To support this, you can use newlines inside of
+strings with indentation-sensitivity. For splitting long lines, use two or more
+"."s at the same indentation level as the start of the string literal:
+
+```
+single_line := "This is a long string that
+.... spans multiple lines"
+```
+For strings that contain newlines, you may put multiple indented lines inside
+the quotes:
+
+```
+multi_line := "
+    line one
+    line two
+        this line is indented
+    last line
+"
+```
+
+Strings may only end on lines with the same indentation as the starting quote
+and nested quotes are ignored:
+
+```
+nested := $$(I can have (parens) inside (parens inside (parens)))
+multi_line := "
+    Quotes in indented regions like this: " don't count
+"
+```
+
+If there is a leading or trailing newline, it is ignored and not included in
+the string.
+
+```
+str := "
+    one line
+"
+
+>>> str == "one line"
+=== yes
+```
+
+Additional newlines *are* counted though:
+
+```
+str := "
+    
+    blank lines
+
+"
+
+>>> str == "{\n}blank lines{\n}"
+```
+
+### Advanced $-Strings
+
+Sometimes you need to use many `{`s or `"`s inside a string, but you don't want
+to type `{\{}` or `{\"}` each time. In such cases, you can use the more
+advanced form of strings. The advanced form lets you explicitly specify which
+characters are used for interpolation and which characters are used for
+opening/closing the string. Advanced strings begin with a dollar sign (`$`),
+followed by what interpolation style to use, followed by the character to use
+to delimit the string, followed by the string contents and a closing string
+delimiter. The interpolation style can be a matching pair (`()`, `[]`, `{}`, or
+`<>`) or any other single character. When the interpolation style is a matching
+pair, the interpolation is any expression enclosed in that pair (e.g.
+`${}"interpolate {1 + 2}"`). When the interpolation style is a single
+character, the interpolation must be either a parenthesized expression or a
+single term with no infix operators (e.g. a variable), for example:
+`$@"Interpolate @var or @(1 + 2)"`.
+
+Here are some examples:
+
+```
+$[]"In here, quotes delimit the string and square brackets interpolate: [1 + 2]"
+$@"For single-letter interpolations, the interpolation is a single term like @my_var without a closing symbol"
+$@"But you can parenthesize expressions like: @(x + y) if you need to"
+$$"Double dollars means dollar signs interpolate: $my_var $(1 + 2)"
+$${If you have a string with "quotes" and 'single quotes', you can choose something else like curly braces to delimit the string}
+$?#Here hashes delimit the string and question marks interpolate: ?(1 + 2)#
+```
+
+When strings are delimited by matching pairs (`()`, `[]`, `{}`, or `<>`), they
+can only be closed by a matched closing character at the same indentation
+level, ignoring nested pairs:
+
+```
+$$(Inside parens, you can have (nested ()) parens no problem)
+$$"But only (), [], {}, and <> are matching pairs, you can't have nested quotes"
+$$(
+    When indented, an unmatched ) won't close the string
+    An unmatched ( won't mess things up either
+    Only matching pairs on the same indentation level are counted:
+)
+$$(Multi-line string with nested (parens) and
+.. line continuation)
+```
+
+As a special case, when `!` is used as an interpolation rule, no interpolations
+are allowed and `!` itself is treated as a literal character:
+
+```
+plain := $!"This string has {no interpolations}! Not even exclamation mark!"
+```
+
+**Note:** Normal doubly quoted strings with no dollar sign (e.g. `"foo"`) are a
+shorthand for `${}"foo"`. Singly quoted strings with no dollar sign are
+shorthand for `$!'foo'`.
+
+## Operations
+
+### Concatenation
+
+Concatenation in the typical case is an O(1) operation: `"{x}{y}"` or `x ++ y`.
+
+Because string concatenation is typically an O(1) operation, there is no need
+for a separate string builder class in the language and no need to use an array
+of string fragments.
+
+### String Length
+
+String length is an ambiguous term in the context of UTF-8 strings. There are
+several possible meanings, so each of these meanings is split into a separate
+method:
+
+- Number of grapheme clusters: `string.num_graphemes()`
+- Size in bytes: `string.num_bytes()`
+- Number of unicode codepoints: `string.num_codepoints()` (you probably want to
+  use graphemes, not codepoints in most applications)
+
+### Iteration
+
+Iteration is *not* supported for strings because of the ambiguity between
+bytes, codepoints, and graphemes. It is instead recommended that you use
+higher-abstraction functions.
+
+### Subcomponents
+
+- `string.bytes()` returns an array of `Int8` bytes
+- `string.codepoints()` returns an array of `Int32` bytes
+- `string.graphemes()` returns an array of grapheme cluster strings
+- `string.words()` returns an array of word strings
+- `string.lines()` returns an array of line strings
+- `string.split(",", empty=no)` returns an array of strings split by the given delimiter
+
+### Equality and Comparison
+
+By default, strings are compared using memory comparisons of the UTF-8 representation.
+
+- `x == y` is roughly equivalent to `strcmp(x, y) == 0`
+
+To compare normalized forms of strings, use:
+
+- `x.equivalent_to(y)` returns a boolean for whether the strings are the same
+- `x.compare_normalized(y)` returns `enum(Equal, Less, Greater)`
+
+### Capitalization
+
+- `x.capitalized()`
+- `x.titlecased()`
+- `x.uppercased()`
+- `x.lowercased()`
+
+### Patterns
+
+- `string.has($/pattern/, at=Anywhere:enum(Anywhere, Start, End))` Check whether a pattern can be found
+- `string.next($/pattern/)` Returns an `enum(NotFound, Found(match:Str, rest:Str))`
+- `string.matches($/pattern/)` Returns a list of matching strings
+- `string.replace($/pattern/, "replacement")` Returns a copy of the string with replacements
+- `string.without($/pattern/, at=Anywhere:enum(Anywhere, Start, End))`
+
+### Indentation
+
+- `string.indented(type:enum(Tab, Spaces(num:Int), count=1)` (e.g. `s.indented(Tab)`, `s.indented(Spaces(4), -1)`
+
+### Properties
+
+Unicode strings have various overlapping properties. For example, a grapheme
+might be both printable and alphabetic. It can be useful to query some of these
+properties for a given string.
+
+- `string.properties() -> flags(None, WhiteSpace, Alphabetic, …, Emoji, …)`
+- `string.is(properties:flags(None, WhiteSpace, Alphabetic, …, Emoji, …)) -> Bool`
+- `string.has_property(properties:flags(None, WhiteSpace, Alphabetic, …, Emoji, …)) -> Bool`
+
+Example: `if name.is(Uppercase)`
+Example: `if name.is(Alphabetic or Numeric)`
+Example: `if name.has_property(Math or Currency)`
-- 
cgit v1.2.3