aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
-rw-r--r--docs/strings.md300
-rw-r--r--parse.c42
2 files changed, 324 insertions, 18 deletions
diff --git a/docs/strings.md b/docs/strings.md
new file mode 100644
index 00000000..1893d70f
--- /dev/null
+++ b/docs/strings.md
@@ -0,0 +1,300 @@
+# Strings
+
+Strings are implemented as immutable UTF-8-encoded values using:
+
+- The Boehm Cord library for efficient storage and concatenation.
+- GNU libunistring for unicode functionality (grapheme cluster counts,
+ capitalization, etc.)
+- My own BP library for simple pattern matching operations (similar to regex)
+
+## Syntax
+
+Strings have a flexible syntax designed to make it easy to hold values from
+different languages without the need to have lots of escape sequences and
+without using printf-style string formatting.
+
+```
+// Basic string:
+str := "Hello world"
+str2 := 'Also a string'
+```
+
+## Line Splits
+
+Long strings can be split across multiple lines by having two or more dots at
+the start of a new line on the same indentation level that started the string:
+
+```
+str := "This is a long
+.... line that is split in code"
+```
+
+## Multi-line Strings
+
+Multi-line strings have indented (i.e. at least one tab more than the start of
+the string) text inside quotation marks. The leading and trailing newline are
+ignored:
+
+```
+multi_line := "
+ This string has multiple lines.
+ Line two.
+
+ You can split a line
+.... using two or more dots to make an elipsis.
+
+ Remember to include whitespace after the elipsis if desired.
+
+ Or don't if you're splitting a long word like supercalifragilisticexpia
+....lidocious
+
+ This text is indented by one level in the string
+
+ "quotes" are ignored unless they're at the same indentation level as the
+.... start of the string.
+
+ The end (no newline after this).
+"
+```
+
+## String Interpolations
+
+Inside a double quoted string, you can use curly braces (`{...}`) to insert an
+expression that you want converted to a string. This is called string
+interpolation:
+
+```
+// Interpolation:
+str := "Sum: {1 + 2}"
+// equivalent to "Sum: 3"
+```
+
+Single-quoted strings do not have interpolations:
+
+```
+// No interpolation here:
+str := 'Sum: {1 + 2}'
+```
+
+## String Escapes
+
+Unlike other languages, backslash is *not* a special character inside of a
+string. For example, `"x\ny"` has the characters `x`, `\`, `n`, `y`, not a
+newline. Instead, a series of character escapes act as complete string literals
+without quotation marks:
+
+```
+newline := \n
+crlf := \r\n
+quote := \"
+```
+
+These string literals can be used as interpolation values:
+
+```
+two_lines := "one{\n}two"
+has_quotes := "some {\"}quotes{\"} here"
+```
+
+However, in general it is best practice to use multi-line strings to avoid these problems:
+
+```
+str := "
+ This has
+ multiple lines and "quotes" too!
+"
+```
+
+### Multi-line Strings
+
+There are two reasons for strings to span multiple lines in code: either you
+have a string that contains newlines and you want to represent it without `\n`
+escapes, or you have a long single-line string that you want to split across
+multiple lines for readability. To support this, you can use newlines inside of
+strings with indentation-sensitivity. For splitting long lines, use two or more
+"."s at the same indentation level as the start of the string literal:
+
+```
+single_line := "This is a long string that
+.... spans multiple lines"
+```
+For strings that contain newlines, you may put multiple indented lines inside
+the quotes:
+
+```
+multi_line := "
+ line one
+ line two
+ this line is indented
+ last line
+"
+```
+
+Strings may only end on lines with the same indentation as the starting quote
+and nested quotes are ignored:
+
+```
+nested := $$(I can have (parens) inside (parens inside (parens)))
+multi_line := "
+ Quotes in indented regions like this: " don't count
+"
+```
+
+If there is a leading or trailing newline, it is ignored and not included in
+the string.
+
+```
+str := "
+ one line
+"
+
+>>> str == "one line"
+=== yes
+```
+
+Additional newlines *are* counted though:
+
+```
+str := "
+
+ blank lines
+
+"
+
+>>> str == "{\n}blank lines{\n}"
+```
+
+### Advanced $-Strings
+
+Sometimes you need to use many `{`s or `"`s inside a string, but you don't want
+to type `{\{}` or `{\"}` each time. In such cases, you can use the more
+advanced form of strings. The advanced form lets you explicitly specify which
+characters are used for interpolation and which characters are used for
+opening/closing the string. Advanced strings begin with a dollar sign (`$`),
+followed by what interpolation style to use, followed by the character to use
+to delimit the string, followed by the string contents and a closing string
+delimiter. The interpolation style can be a matching pair (`()`, `[]`, `{}`, or
+`<>`) or any other single character. When the interpolation style is a matching
+pair, the interpolation is any expression enclosed in that pair (e.g.
+`${}"interpolate {1 + 2}"`). When the interpolation style is a single
+character, the interpolation must be either a parenthesized expression or a
+single term with no infix operators (e.g. a variable), for example:
+`$@"Interpolate @var or @(1 + 2)"`.
+
+Here are some examples:
+
+```
+$[]"In here, quotes delimit the string and square brackets interpolate: [1 + 2]"
+$@"For single-letter interpolations, the interpolation is a single term like @my_var without a closing symbol"
+$@"But you can parenthesize expressions like: @(x + y) if you need to"
+$$"Double dollars means dollar signs interpolate: $my_var $(1 + 2)"
+$${If you have a string with "quotes" and 'single quotes', you can choose something else like curly braces to delimit the string}
+$?#Here hashes delimit the string and question marks interpolate: ?(1 + 2)#
+```
+
+When strings are delimited by matching pairs (`()`, `[]`, `{}`, or `<>`), they
+can only be closed by a matched closing character at the same indentation
+level, ignoring nested pairs:
+
+```
+$$(Inside parens, you can have (nested ()) parens no problem)
+$$"But only (), [], {}, and <> are matching pairs, you can't have nested quotes"
+$$(
+ When indented, an unmatched ) won't close the string
+ An unmatched ( won't mess things up either
+ Only matching pairs on the same indentation level are counted:
+)
+$$(Multi-line string with nested (parens) and
+.. line continuation)
+```
+
+As a special case, when `!` is used as an interpolation rule, no interpolations
+are allowed and `!` itself is treated as a literal character:
+
+```
+plain := $!"This string has {no interpolations}! Not even exclamation mark!"
+```
+
+**Note:** Normal doubly quoted strings with no dollar sign (e.g. `"foo"`) are a
+shorthand for `${}"foo"`. Singly quoted strings with no dollar sign are
+shorthand for `$!'foo'`.
+
+## Operations
+
+### Concatenation
+
+Concatenation in the typical case is an O(1) operation: `"{x}{y}"` or `x ++ y`.
+
+Because string concatenation is typically an O(1) operation, there is no need
+for a separate string builder class in the language and no need to use an array
+of string fragments.
+
+### String Length
+
+String length is an ambiguous term in the context of UTF-8 strings. There are
+several possible meanings, so each of these meanings is split into a separate
+method:
+
+- Number of grapheme clusters: `string.num_graphemes()`
+- Size in bytes: `string.num_bytes()`
+- Number of unicode codepoints: `string.num_codepoints()` (you probably want to
+ use graphemes, not codepoints in most applications)
+
+### Iteration
+
+Iteration is *not* supported for strings because of the ambiguity between
+bytes, codepoints, and graphemes. It is instead recommended that you use
+higher-abstraction functions.
+
+### Subcomponents
+
+- `string.bytes()` returns an array of `Int8` bytes
+- `string.codepoints()` returns an array of `Int32` bytes
+- `string.graphemes()` returns an array of grapheme cluster strings
+- `string.words()` returns an array of word strings
+- `string.lines()` returns an array of line strings
+- `string.split(",", empty=no)` returns an array of strings split by the given delimiter
+
+### Equality and Comparison
+
+By default, strings are compared using memory comparisons of the UTF-8 representation.
+
+- `x == y` is roughly equivalent to `strcmp(x, y) == 0`
+
+To compare normalized forms of strings, use:
+
+- `x.equivalent_to(y)` returns a boolean for whether the strings are the same
+- `x.compare_normalized(y)` returns `enum(Equal, Less, Greater)`
+
+### Capitalization
+
+- `x.capitalized()`
+- `x.titlecased()`
+- `x.uppercased()`
+- `x.lowercased()`
+
+### Patterns
+
+- `string.has($/pattern/, at=Anywhere:enum(Anywhere, Start, End))` Check whether a pattern can be found
+- `string.next($/pattern/)` Returns an `enum(NotFound, Found(match:Str, rest:Str))`
+- `string.matches($/pattern/)` Returns a list of matching strings
+- `string.replace($/pattern/, "replacement")` Returns a copy of the string with replacements
+- `string.without($/pattern/, at=Anywhere:enum(Anywhere, Start, End))`
+
+### Indentation
+
+- `string.indented(type:enum(Tab, Spaces(num:Int), count=1)` (e.g. `s.indented(Tab)`, `s.indented(Spaces(4), -1)`
+
+### Properties
+
+Unicode strings have various overlapping properties. For example, a grapheme
+might be both printable and alphabetic. It can be useful to query some of these
+properties for a given string.
+
+- `string.properties() -> flags(None, WhiteSpace, Alphabetic, …, Emoji, …)`
+- `string.is(properties:flags(None, WhiteSpace, Alphabetic, …, Emoji, …)) -> Bool`
+- `string.has_property(properties:flags(None, WhiteSpace, Alphabetic, …, Emoji, …)) -> Bool`
+
+Example: `if name.is(Uppercase)`
+Example: `if name.is(Alphabetic or Numeric)`
+Example: `if name.has_property(Math or Currency)`
diff --git a/parse.c b/parse.c
index 76e95495..6488b2c5 100644
--- a/parse.c
+++ b/parse.c
@@ -880,6 +880,8 @@ PARSER(parse_string) {
char open_quote, close_quote, open_interp = '\x03', close_interp = '\x02';
if (match(&pos, "\"")) {
open_quote = '"', close_quote = '"', open_interp = '{', close_interp = '}';
+ } else if (match(&pos, "'")) {
+ open_quote = '\'', close_quote = '\'';
} else if (match(&pos, "$")) {
open_interp = *(pos++);
close_interp = closing[(int)open_interp];
@@ -901,56 +903,60 @@ PARSER(parse_string) {
const char *chunk_start = pos;
int depth = 1;
bool leading_newline = false;
- for (; pos < ctx->file->text + ctx->file->len && depth > 0; ++pos) {
- if (*pos == open_interp) {
+ for (; pos < ctx->file->text + ctx->file->len && depth > 0; ) {
+ if (*pos == open_interp) { // Interpolation
+ const char *interp_start = pos;
if (chunk) {
ast_t *literal = NewAST(ctx->file, chunk_start, pos, StringLiteral, .cord=chunk);
chunks = new(ast_list_t, .ast=literal, .next=chunks);
chunk = NULL;
}
++pos;
- for (ast_t *interp; (interp=close_interp ? optional(ctx, &pos, parse_expr) : optional(ctx, &pos, parse_term)); spaces(&pos)) {
- chunks = new(ast_list_t, .ast=interp, .next=chunks);
- chunk_start = pos;
- }
+ ast_t *interp;
if (close_interp) {
- const char *closing = pos;
- spaces(&closing);
- if (*closing == close_interp) {
- pos = closing;
- }
+ whitespace(&pos);
+ interp = expect(ctx, interp_start, &pos, parse_expr, "I expected an interpolation expression here");
+ whitespace(&pos);
+ expect_closing(ctx, &pos, (char[]){close_interp, 0}, "I was expecting a '%c' to finish this interpolation", close_interp);
+ } else {
+ if (*pos == ' ' || *pos == '\t')
+ parser_err(ctx, pos, pos+1, "Whitespace is not allowed before an interpolation here");
+ interp = expect(ctx, interp_start, &pos, parse_term, "I expected an interpolation term here");
}
- } else if (!leading_newline && *pos == open_quote && closing[(int)open_quote]) {
+ chunks = new(ast_list_t, .ast=interp, .next=chunks);
+ chunk_start = pos;
+ } else if (!leading_newline && *pos == open_quote && closing[(int)open_quote]) { // Nested pair begin
if (get_indent(ctx->file, pos) == starting_indent) {
++depth;
}
chunk = CORD_cat_char(chunk, *pos);
- } else if (!leading_newline && *pos == close_quote) {
+ ++pos;
+ } else if (!leading_newline && *pos == close_quote) { // Nested pair end
if (get_indent(ctx->file, pos) == starting_indent) {
--depth;
if (depth == 0)
break;
}
chunk = CORD_cat_char(chunk, *pos);
- } else if (newline_with_indentation(&pos, string_indent)) {
+ ++pos;
+ } else if (newline_with_indentation(&pos, string_indent)) { // Newline
if (!leading_newline && !(chunk || chunks)) {
leading_newline = true;
} else {
chunk = CORD_cat_char(chunk, '\n');
}
- --pos;
- } else if (newline_with_indentation(&pos, starting_indent)) {
+ } else if (newline_with_indentation(&pos, starting_indent)) { // Line continuation (..)
if (*pos == close_quote) {
break;
} else if (some_of(&pos, ".") >= 2) {
// Multi-line split
- --pos;
continue;
} else {
parser_err(ctx, pos, strchrnul(pos, '\n'), "This multi-line string should be either indented or have '..' at the front");
}
- } else {
+ } else { // Plain character
chunk = CORD_cat_char(chunk, *pos);
+ ++pos;
}
}