From 8ef46380189d2ce39a0ec2a489e704b059676bf9 Mon Sep 17 00:00:00 2001 From: Bruce Hill Date: Sun, 11 Feb 2024 21:44:06 -0500 Subject: Improved strings and docs --- docs/strings.md | 300 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 300 insertions(+) create mode 100644 docs/strings.md (limited to 'docs/strings.md') diff --git a/docs/strings.md b/docs/strings.md new file mode 100644 index 00000000..1893d70f --- /dev/null +++ b/docs/strings.md @@ -0,0 +1,300 @@ +# Strings + +Strings are implemented as immutable UTF-8-encoded values using: + +- The Boehm Cord library for efficient storage and concatenation. +- GNU libunistring for Unicode functionality (grapheme cluster counts, + capitalization, etc.) +- My own BP library for simple pattern matching operations (similar to regex) + +## Syntax + +Strings have a flexible syntax designed to make it easy to hold values from +different languages without the need to have lots of escape sequences and +without using printf-style string formatting. + +``` +// Basic string: +str := "Hello world" +str2 := 'Also a string' +``` + +## Line Splits + +Long strings can be split across multiple lines by having two or more dots at +the start of a new line on the same indentation level that started the string: + +``` +str := "This is a long +.... line that is split in code" +``` + +## Multi-line Strings + +Multi-line strings have indented (i.e. at least one tab more than the start of +the string) text inside quotation marks. The leading and trailing newline are +ignored: + +``` +multi_line := " + This string has multiple lines. + Line two. + + You can split a line +.... using two or more dots to make an ellipsis. + + Remember to include whitespace after the ellipsis if desired. + + Or don't if you're splitting a long word like supercalifragilisticexpia +....lidocious + + This text is indented by one level in the string + + "quotes" are ignored unless they're at the same indentation level as the +.... 
start of the string. + + The end (no newline after this). +" +``` + +## String Interpolations + +Inside a double quoted string, you can use curly braces (`{...}`) to insert an +expression that you want converted to a string. This is called string +interpolation: + +``` +// Interpolation: +str := "Sum: {1 + 2}" +// equivalent to "Sum: 3" +``` + +Single-quoted strings do not have interpolations: + +``` +// No interpolation here: +str := 'Sum: {1 + 2}' +``` + +## String Escapes + +Unlike other languages, backslash is *not* a special character inside of a +string. For example, `"x\ny"` has the characters `x`, `\`, `n`, `y`, not a +newline. Instead, a series of character escapes act as complete string literals +without quotation marks: + +``` +newline := \n +crlf := \r\n +quote := \" +``` + +These string literals can be used as interpolation values: + +``` +two_lines := "one{\n}two" +has_quotes := "some {\"}quotes{\"} here" +``` + +However, in general it is best practice to use multi-line strings to avoid these problems: + +``` +str := " + This has + multiple lines and "quotes" too! +" +``` + +### Multi-line Strings + +There are two reasons for strings to span multiple lines in code: either you +have a string that contains newlines and you want to represent it without `\n` +escapes, or you have a long single-line string that you want to split across +multiple lines for readability. To support this, you can use newlines inside of +strings with indentation-sensitivity. For splitting long lines, use two or more +"."s at the same indentation level as the start of the string literal: + +``` +single_line := "This is a long string that +.... 
spans multiple lines" +``` +For strings that contain newlines, you may put multiple indented lines inside +the quotes: + +``` +multi_line := " + line one + line two + this line is indented + last line +" +``` + +Strings may only end on lines with the same indentation as the starting quote +and nested quotes are ignored: + +``` +nested := $$(I can have (parens) inside (parens inside (parens))) +multi_line := " + Quotes in indented regions like this: " don't count +" +``` + +If there is a leading or trailing newline, it is ignored and not included in +the string. + +``` +str := " + one line +" + +>>> str == "one line" +=== yes +``` + +Additional newlines *are* counted though: + +``` +str := " + + blank lines + +" + +>>> str == "{\n}blank lines{\n}" +``` + +### Advanced $-Strings + +Sometimes you need to use many `{`s or `"`s inside a string, but you don't want +to type `{\{}` or `{\"}` each time. In such cases, you can use the more +advanced form of strings. The advanced form lets you explicitly specify which +characters are used for interpolation and which characters are used for +opening/closing the string. Advanced strings begin with a dollar sign (`$`), +followed by what interpolation style to use, followed by the character to use +to delimit the string, followed by the string contents and a closing string +delimiter. The interpolation style can be a matching pair (`()`, `[]`, `{}`, or +`<>`) or any other single character. When the interpolation style is a matching +pair, the interpolation is any expression enclosed in that pair (e.g. +`${}"interpolate {1 + 2}"`). When the interpolation style is a single +character, the interpolation must be either a parenthesized expression or a +single term with no infix operators (e.g. a variable), for example: +`$@"Interpolate @var or @(1 + 2)"`. 
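The prefix rules described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the text, not the language's actual parser: the function name `parse_dollar_prefix`, the `PAIRS` table, and the returned tuple layout are all hypothetical. It also assumes that an opening bracket immediately followed by its own closer is read as a matching-pair interpolation style, which the text implies but does not state outright.

```python
# Illustrative sketch only: every name here is hypothetical, not the real parser.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def parse_dollar_prefix(literal):
    """Split the prefix of a $-string into interpolation style and delimiters.

    Returns (interp_open, interp_close, delim_open, delim_close, body_start).
    interp_close is None when the interpolation style is a single character.
    """
    if len(literal) < 3 or literal[0] != "$":
        raise ValueError("not a $-string")
    interp_open = literal[1]
    # Assumption: an opening bracket directly followed by its closer is
    # treated as a matching-pair interpolation style.
    if interp_open in PAIRS and literal[2] == PAIRS[interp_open]:
        interp_close = PAIRS[interp_open]
        pos = 3
    else:
        interp_close = None
        pos = 2
    if pos >= len(literal):
        raise ValueError("missing string delimiter")
    delim_open = literal[pos]
    # A bracket delimiter closes with its partner; any other character
    # closes with itself.
    delim_close = PAIRS.get(delim_open, delim_open)
    return interp_open, interp_close, delim_open, delim_close, pos + 1


print(parse_dollar_prefix('${}"interpolate {1 + 2}"'))
print(parse_dollar_prefix('$@"Interpolate @var or @(1 + 2)"'))
print(parse_dollar_prefix("$$(I can have (parens) inside)"))
```

The sketch only classifies the prefix. Finding the closing delimiter would additionally need the nesting and indentation rules described elsewhere in this document.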
+ +Here are some examples: + +``` +$[]"In here, quotes delimit the string and square brackets interpolate: [1 + 2]" +$@"For single-letter interpolations, the interpolation is a single term like @my_var without a closing symbol" +$@"But you can parenthesize expressions like: @(x + y) if you need to" +$$"Double dollars means dollar signs interpolate: $my_var $(1 + 2)" +$${If you have a string with "quotes" and 'single quotes', you can choose something else like curly braces to delimit the string} +$?#Here hashes delimit the string and question marks interpolate: ?(1 + 2)# +``` + +When strings are delimited by matching pairs (`()`, `[]`, `{}`, or `<>`), they +can only be closed by a matched closing character at the same indentation +level, ignoring nested pairs: + +``` +$$(Inside parens, you can have (nested ()) parens no problem) +$$"But only (), [], {}, and <> are matching pairs, you can't have nested quotes" +$$( + When indented, an unmatched ) won't close the string + An unmatched ( won't mess things up either + Only matching pairs on the same indentation level are counted: +) +$$(Multi-line string with nested (parens) and +.. line continuation) +``` + +As a special case, when `!` is used as an interpolation rule, no interpolations +are allowed and `!` itself is treated as a literal character: + +``` +plain := $!"This string has {no interpolations}! Not even exclamation mark!" +``` + +**Note:** Normal doubly quoted strings with no dollar sign (e.g. `"foo"`) are a +shorthand for `${}"foo"`. Singly quoted strings with no dollar sign are +shorthand for `$!'foo'`. + +## Operations + +### Concatenation + +Concatenation in the typical case is an O(1) operation: `"{x}{y}"` or `x ++ y`. + +Because string concatenation is typically an O(1) operation, there is no need +for a separate string builder class in the language and no need to use an array +of string fragments. + +### String Length + +String length is an ambiguous term in the context of UTF-8 strings. 
There are +several possible meanings, so each of these meanings is split into a separate +method: + +- Number of grapheme clusters: `string.num_graphemes()` +- Size in bytes: `string.num_bytes()` +- Number of Unicode codepoints: `string.num_codepoints()` (you probably want to + use graphemes, not codepoints in most applications) + +### Iteration + +Iteration is *not* supported for strings because of the ambiguity between +bytes, codepoints, and graphemes. It is instead recommended that you use +higher-level functions. + +### Subcomponents + +- `string.bytes()` returns an array of `Int8` bytes +- `string.codepoints()` returns an array of `Int32` codepoints +- `string.graphemes()` returns an array of grapheme cluster strings +- `string.words()` returns an array of word strings +- `string.lines()` returns an array of line strings +- `string.split(",", empty=no)` returns an array of strings split by the given delimiter + +### Equality and Comparison + +By default, strings are compared using memory comparisons of the UTF-8 representation. + +- `x == y` is roughly equivalent to `strcmp(x, y) == 0` + +To compare normalized forms of strings, use: + +- `x.equivalent_to(y)` returns a boolean for whether the strings are the same after normalization +- `x.compare_normalized(y)` returns `enum(Equal, Less, Greater)` + +### Capitalization + +- `x.capitalized()` +- `x.titlecased()` +- `x.uppercased()` +- `x.lowercased()` + +### Patterns + +- `string.has($/pattern/, at=Anywhere:enum(Anywhere, Start, End))` Check whether a pattern can be found +- `string.next($/pattern/)` Returns an `enum(NotFound, Found(match:Str, rest:Str))` +- `string.matches($/pattern/)` Returns a list of matching strings +- `string.replace($/pattern/, "replacement")` Returns a copy of the string with replacements +- `string.without($/pattern/, at=Anywhere:enum(Anywhere, Start, End))` + +### Indentation + +- `string.indented(type:enum(Tab, Spaces(num:Int)), count=1)` (e.g. 
`s.indented(Tab)`, `s.indented(Spaces(4), -1)`) + +### Properties + +Unicode strings have various overlapping properties. For example, a grapheme +might be both printable and alphabetic. It can be useful to query some of these +properties for a given string. + +- `string.properties() -> flags(None, WhiteSpace, Alphabetic, …, Emoji, …)` +- `string.is(properties:flags(None, WhiteSpace, Alphabetic, …, Emoji, …)) -> Bool` +- `string.has_property(properties:flags(None, WhiteSpace, Alphabetic, …, Emoji, …)) -> Bool` + +Example: `if name.is(Uppercase)` +Example: `if name.is(Alphabetic or Numeric)` +Example: `if name.has_property(Math or Currency)`
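As a rough illustration of why the String Length and Equality and Comparison sections separate bytes, codepoints, and normalized comparison, here is a small Python sketch. The helper names echo the method names in this document but are hypothetical Python functions, not the language's API. Using NFC as the normalization form is an assumption, since the document does not say which form is used. Grapheme counting is left out because the Python standard library does not provide it; both sample strings below are a single grapheme cluster.

```python
import unicodedata

# The same visible character written two ways: one precomposed codepoint,
# versus "e" followed by a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"


def num_bytes(s):
    """Size of the UTF-8 encoding in bytes."""
    return len(s.encode("utf-8"))


def num_codepoints(s):
    """Number of Unicode codepoints (Python's len() on str counts these)."""
    return len(s)


def equivalent(a, b):
    """Compare after normalization (NFC is an assumption made for this sketch)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(num_bytes(composed), num_codepoints(composed))      # 2 1
print(num_bytes(decomposed), num_codepoints(decomposed))  # 3 2
print(composed == decomposed)            # False: the raw representations differ
print(equivalent(composed, decomposed))  # True: same text once normalized
```

The two strings render identically yet differ in byte count, codepoint count, and raw equality, which is the ambiguity that the separate length methods and the normalized comparison methods are meant to make explicit.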