| author | Bruce Hill <bruce@bruce-hill.com> | 2024-08-18 19:45:04 -0400 |
|---|---|---|
| committer | Bruce Hill <bruce@bruce-hill.com> | 2024-08-18 19:45:04 -0400 |
| commit | c972b8ba5bd61860e294322336bc9a6e0b3b6d07 (patch) | |
| tree | 54c7f2d116183b5acabd3f5f5adc826d25f68922 | |
| parent | d705355fc95f85619b5f1299f0e95145e8165107 (diff) | |
Update array docs
| -rw-r--r-- | api/arrays.md | 84 |
| -rw-r--r-- | docs/strings.md | 301 |
2 files changed, 84 insertions, 301 deletions
diff --git a/api/arrays.md b/api/arrays.md
index 27214063..8933ff1e 100644
--- a/api/arrays.md
+++ b/api/arrays.md
@@ -5,10 +5,94 @@ type in a compact format. Arrays are immutable by default, but use
 copy-on-write semantics to efficiently mutate in place when possible. **Arrays
 are 1-indexed**, which means the first item in the array has index `1`.
 
+## Syntax
+
+Arrays are written using square brackets and a list of comma-separated elements:
+
 ```tomo
 nums := [10, 20, 30]
 ```
 
+Each element must have the same type (or be easily promoted to the same type). If
+you want to have an empty array, you must specify what type goes inside the array
+like this:
+
+```tomo
+empty := [:Int]
+```
+
+### Array Comprehensions
+
+Arrays can also use comprehensions, where you specify how to dynamically create
+all the elements by iteration instead of manually specifying each:
+
+```tomo
+>> [i*10 for i in 3:to(8)]
+= [30, 40, 50, 60, 70, 80]
+>> [i*10 for i in 3:to(8) if i != 4]
+= [30, 50, 60, 70, 80]
+```
+
+Comprehensions can be combined with regular items or other comprehensions:
+
+```tomo
+>> [-1, i*10 for i in 3:to(8), i for i in 3]
+= [-1, 30, 40, 50, 60, 70, 80, 1, 2, 3]
+```
+
+## Indexing
+
+Array values are accessed using square bracket indexing. Since arrays are
+1-indexed, the index `1` corresponds to the first item in the array. Negative
+indices are used to refer to items from the back of the array, so `-1` is the
+last item, `-2` is the second-to-last, and so on.
+
+```tomo
+arr := [10, 20, 30, 40]
+>> arr[1]
+= 10
+
+>> arr[2]
+= 20
+
+>> arr[-1]
+= 40
+
+>> arr[-2]
+= 30
+```
+
+If an array index of `0` or any value larger than the length of the array is
+used, it will trigger a runtime error that will print what the invalid array
+index was, the length of the array, and a stack trace. As a performance
+operation, if array bounds checking proves to be a performance hot spot, you
+can explicitly disable bounds checking by adding `arr[i; unchecked]` to the
+array access.
+
+## Iteration
+
+You can iterate over the items in an array like this:
+
+```tomo
+for item in array:
+	...
+```
+
+Array iteration operates over the value of the array when the loop began, so
+modifying the array during iteration is safe and will not result in the loop
+iterating over any of the new values.
+
+## Concatenation
+
+Arrays can be concatenated with the `++` operator, which returns an array that
+has the items from one appended to the other. This should not be confused with
+the addition operator `+`, which does not work with arrays.
+
+```tomo
+>> [1, 2] ++ [3, 4]
+= [1, 2, 3, 4]
+```
+
 ## Array Methods
 
 ### `binary_search`
diff --git a/docs/strings.md b/docs/strings.md
deleted file mode 100644
index 8e740765..00000000
--- a/docs/strings.md
+++ /dev/null
@@ -1,301 +0,0 @@
-# Strings
-
-Strings are implemented as immutable UTF-8-encoded values using:
-
-- The Boehm Cord library for efficient storage and concatenation.
-- GNU libunistring for unicode functionality (grapheme cluster counts,
-  capitalization, etc.)
-- My own BP library for simple pattern matching operations (similar to regex)
-
-## Syntax
-
-Strings have a flexible syntax designed to make it easy to hold values from
-different languages without the need to have lots of escape sequences and
-without using printf-style string formatting.
-
-```
-// Basic string:
-str := "Hello world"
-str2 := 'Also a string'
-str3 := `Backticks too`
-```
-
-## Line Splits
-
-Long strings can be split across multiple lines by having two or more dots at
-the start of a new line on the same indentation level that started the string:
-
-```
-str := "This is a long
-.... line that is split in code"
-```
-
-## Multi-line Strings
-
-Multi-line strings have indented (i.e. at least one tab more than the start of
-the string) text inside quotation marks. The leading and trailing newline are
-ignored:
-
-```
-multi_line := "
-	This string has multiple lines.
-	Line two.
-
-	You can split a line
-.... using two or more dots to make an elipsis.
-
-	Remember to include whitespace after the elipsis if desired.
-
-	Or don't if you're splitting a long word like supercalifragilisticexpia
-....lidocious
-
-		This text is indented by one level in the string
-
-	"quotes" are ignored unless they're at the same indentation level as the
-.... start of the string.
-
-	The end (no newline after this).
-"
-```
-
-## String Interpolations
-
-Inside a double quoted string, you can use a dollar sign (`$`) to insert an
-expression that you want converted to a string. This is called string
-interpolation:
-
-```
-// Interpolation:
-my_var := 5
-str := "My var is $my_var!"
-// Equivalent to "My var is 5!"
-
-// Using parentheses:
-str := "Sum: $(1 + 2)"
-// equivalent to "Sum: 3"
-```
-
-Single-quoted strings do not have interpolations:
-
-```
-// No interpolation here:
-str := 'Sum: $(1 + 2)'
-```
-
-## String Escapes
-
-Unlike other languages, backslash is *not* a special character inside of a
-string. For example, `"x\ny"` has the characters `x`, `\`, `n`, `y`, not a
-newline. Instead, a series of character escapes act as complete string literals
-without quotation marks:
-
-```
-newline := \n
-crlf := \r\n
-quote := \"
-```
-
-These string literals can be used as interpolation values with or without
-parentheses, depending on which you find more readable:
-
-```
-two_lines := "one$(\n)two"
-has_quotes := "some $\"quotes$\" here"
-```
-
-However, in general it is best practice to use multi-line strings to avoid these problems:
-
-```
-str := "
-	This has
-	multiple lines and "quotes" too!
-"
-```
-
-### Multi-line Strings
-
-There are two reasons for strings to span multiple lines in code: either you
-have a string that contains newlines and you want to represent it without `\n`
-escapes, or you have a long single-line string that you want to split across
-multiple lines for readability. To support this, you can use newlines inside of
-strings with indentation-sensitivity. For splitting long lines, use two or more
-"."s at the same indentation level as the start of the string literal:
-
-```
-single_line := "This is a long string that
-.... spans multiple lines"
-```
-For strings that contain newlines, you may put multiple indented lines inside
-the quotes:
-
-```
-multi_line := "
-	line one
-	line two
-		this line is indented
-	last line
-"
-```
-
-Strings may only end on lines with the same indentation as the starting quote
-and nested quotes are ignored:
-
-```
-multi_line := "
-	Quotes in indented regions like this: " don't count
-"
-```
-
-If there is a leading or trailing newline, it is ignored and not included in
-the string.
-
-```
-str := "
-	one line
-"
-
->>> str == "one line"
-=== yes
-```
-
-Additional newlines *are* counted though:
-
-```
-str := "
-
-	blank lines
-
-"
-
->>> str == "{\n}blank lines{\n}"
-```
-
-### Customizable $-Strings
-
-Sometimes you might need to use a lot of literal `$`s or quotation marks in a
-string. In such cases, you can use the more customizable form of strings. The
-customizable form lets you explicitly specify which character to use for
-interpolation and which characters to use for delimiting the string.
-
-The first character after the `$` is the custom interpolation character, which
-can be any of the following symbols: `~!@#$%^&*+=\?`. If none of these
-characters is used, the default interpolation character is `$`. Since this is
-the default, you can disable interpolation altogether by using `$` here (i.e. a
-double `$$`).
-
-The next thing in a customizable string is the character used to delimit the
-string. The string delimiter can be any of the following symbols: `` "'`|/;([{< ``
-If the string delimiter is one of `([{<`, then the string will continue until a
-matching `)]}>` is found, not terminating unless the delimiters are balanced
-(i.e. nested pairs of delimiters are considered part of the string).
-
-Here are some examples:
-
-```
-$"Equivalent to a normal string with dollar interps: $(1 + 2)"
-$@"The same, but the AT symbol interpolates: @(1 + 2)"
-$$"No interpolation here, $ is just a literal character"
-$|This string is pipe-delimited, so it can have "quotes" and 'single quotes' and interpolates with dollar sign: $(1+2)|
-$(This string is parens-delimited, so you can have (nested) parens without ending the string)
-$=[This string is square-bracket delimited [which can be nested] and uses equals for interps: =(1 + 2)]
-$@/look ma, regex literals!/
-```
-
-When strings are delimited by matching pairs (`()`, `[]`, `{}`, or `<>`), they
-can only be closed by a matched closing character at the same indentation
-level, ignoring nested pairs:
-
-```
-$$(Inside parens, you can have (nested ()) parens no problem)
-$$"But only (), [], {}, and <> are matching pairs, you can't have nested quotes"
-$$(
-	When indented, an unmatched ) won't close the string
-	An unmatched ( won't mess things up either
-	Only matching pairs on the same indentation level are counted:
-)
-$$(Multi-line string with nested (parens) and
-.. line continuation)
-```
-
-As a special case, when you use the same character for interpolation and string
-delimiting, no interpolations are allowed:
-
-```
-plain := $""This string has {no interpolations}!"
-```
-
-**Note:** Normal doubly quoted strings with no dollar sign (e.g. `"foo"`) are a
-shorthand for `${}"foo"`. Singly quoted strings with no dollar sign (e.g.
-`'foo'`) are shorthand for `$''foo'`.
-
-## Operations
-
-### Concatenation
-
-Concatenation in the typical case is an O(1) operation: `"{x}{y}"` or `x ++ y`.
-
-Because string concatenation is typically an O(1) operation, there is no need
-for a separate string builder class in the language and no need to use an array
-of string fragments.
-
-### String Length
-
-String length is an ambiguous term in the context of UTF-8 strings. There are
-several possible meanings, so each of these meanings is split into a separate
-method:
-
-- Number of grapheme clusters: `string:num_graphemes()`
-- Size in bytes: `string:num_bytes()`
-- Number of unicode codepoints: `string:num_codepoints()` (you probably want to
-  use graphemes, not codepoints in most applications)
-
-Since the typical user expectation is that string length refers to "letters,"
-the `#` length operator returns the number of grapheme clusters, which is the
-closest unicode equivalent to "letters."
-
-### Iteration
-
-Iteration is *not* supported for strings because of the ambiguity between
-bytes, codepoints, and graphemes. It is instead recommended that you explicitly
-iterate over bytes, codepoints, graphemes, words, lines, etc:
-
-### Subcomponents
-
-- `string:bytes()` returns an array of `Int8` bytes
-- `string:codepoints()` returns an array of `Int32` bytes
-- `string:graphemes()` returns an array of grapheme cluster strings
-- `string:words()` returns an array of word strings
-- `string:lines()` returns an array of line strings
-- `string:split(",", empty=no)` returns an array of strings split by the given delimiter
-
-### Equality, Comparison, and Hashing
-
-All text is compared and hashed using unicode normalization. Unicode provides
-several different ways to represent the same text. For example, the single
-codepoint `U+E9` (latin small e with accent) is rendered the same as the two
-code points `U+65 U+301` (latin small e, acute combining accent) and has an
-equivalent linguistic meaning. These are simply different ways to represent the
-same "letter." In order to make it easy to write correct code that takes this
-into account, Tomo uses unicode normalization for all string comparisons and
-hashing. Normalization does the equivalent of converting text to a canonical
-form before performing comparisons or hashing. This means that if a table is
-created that has text with the codepoint `U+E9` as a key, then a lookup with
-the same text but with `U+65 U+301` instead of `U+E9` will still succeed in
-finding the value because the two strings are equivalent under normalization.
-
-### Capitalization
-
-- `x:capitalized()`
-- `x:titlecased()`
-- `x:uppercased()`
-- `x:lowercased()`
-
-### Patterns
-
-- `string:has("target", at=Where.Anywhere|Where.Start|Where.End)->Bool` Check whether a pattern can be found
-- `string:without("target", at=Where.Anywhere|Where.Start|Where.End)->Text`
-- `string:trimmed("chars...", at=Where.Anywhere|Where.Start|Where.End)->Text`
-- `string:find("target")->enum(Failure, Success(index:Int32))`
-- `string:replace("target", "replacement", limit=Int.max)->Text` Returns a copy of the string with replacements
-- `string:split("split")->[Text]`
-- `string:join(["one", "two"])->Text`
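
Reviewer note on the removed "Equality, Comparison, and Hashing" text in `docs/strings.md`: the behavior it describes (a table lookup succeeding with either `U+E9` or `U+65 U+301`) can be sketched outside Tomo. The following is a minimal illustration in Python, not Tomo code. It assumes NFC as the canonical form, which the removed doc does not specify, and `normalized_key` is a hypothetical helper name rather than anything from the Tomo API.

```python
import unicodedata


def normalized_key(text: str) -> str:
    """Return a canonical form of text (NFC here is an assumption)."""
    return unicodedata.normalize("NFC", text)


precomposed = "\u00e9"  # U+E9: latin small e with acute accent
decomposed = "e\u0301"  # U+65 U+301: e followed by a combining acute accent

# The raw code point sequences are different strings...
raw_equal = precomposed == decomposed

# ...but they compare equal once both sides are normalized.
normalized_equal = normalized_key(precomposed) == normalized_key(decomposed)

# A table keyed by normalized text finds the value with either spelling.
table = {normalized_key(precomposed): "found"}
lookup = table.get(normalized_key(decomposed))
```

Normalizing at the point of comparison and hashing, rather than on every string construction, is one way to get the lookup behavior the removed text describes; how Tomo actually implements it is not stated in this diff.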
