# BP - Bruce's PEG Tool

BP is a parsing expression grammar (PEG) tool for the command line.
It's written in pure C with no dependencies.

## Tutorial

Run `make tutorial` to start the tutorial, which walks through some basic pattern matching.

## Usage

```
bp [flags] <pattern> [<input files>...]
```

BP is optimized for matching literal strings, so the main pattern argument is
interpreted as a string literal. BP pattern syntax is inserted using curly
brace interpolations like `bp 'foo{..}baz'` (match the string literal "foo" up
to and including the next occurrence of "baz" on the same line).
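
To make the split between literal text and interpolated pattern syntax concrete, here is a minimal Python sketch. It is not BP's actual implementation: the helper name `split_interpolations` is hypothetical, and the sketch assumes braces are never nested or escaped.

```python
from typing import List, Tuple

def split_interpolations(arg: str) -> List[Tuple[str, str]]:
    """Split a pattern argument into ("literal", text) and ("pattern", text) parts.

    Assumption for this sketch: `{` always opens an interpolation and the next
    `}` closes it (no nesting, no escaping).
    """
    parts = []
    pos = 0
    while pos < len(arg):
        start = arg.find("{", pos)
        if start == -1:
            # No more interpolations: the rest is literal text.
            parts.append(("literal", arg[pos:]))
            break
        end = arg.find("}", start)
        if end == -1:
            raise ValueError("unclosed '{' in pattern argument")
        if start > pos:
            parts.append(("literal", arg[pos:start]))
        parts.append(("pattern", arg[start + 1:end]))
        pos = end + 1
    return parts

print(split_interpolations("foo{..}baz"))
# → [('literal', 'foo'), ('pattern', '..'), ('literal', 'baz')]
```

Read this way, `foo{..}baz` is the literal `foo`, then the pattern `..`, then the literal `baz`.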

### Flags

* `-h` `--help` print the usage and quit
* `-v` `--verbose` print verbose debugging info
* `-i` `--ignore-case` perform a case-insensitive match
* `-I` `--inplace` perform replacements or filtering in-place on files
* `-e` `--explain` print an explanation of the matches
* `-l` `--list-files` print only filenames containing matches
* `-r` `--replace <replacement>` replace the input pattern with the given replacement
* `-s` `--skip <skip pattern>` skip over the given pattern when looking for matches
* `-B` `--context-before <N>` change how many lines of context are printed before each match
* `-A` `--context-after <N>` change how many lines of context are printed after each match
* `-C` `--context <N>` change how many lines of context are printed before and after each match
* `-g` `--grammar <grammar file>` use the specified file as a grammar
* `-G` `--git` get filenames from git
* `-f` `--format` `auto|plain|fancy` set the output format (`fancy` includes colors and line numbers)

See `man ./bp.1` for more details.

## BP Patterns

BP patterns are a mixture of Parsing Expression Grammar and Regular
Expression syntax, with a preference for prefix operators instead of
suffix operators.

Pattern            | Meaning
-------------------|---------------------
`"foo"`, `'foo'`   | The literal string `foo`. There are no escape sequences within strings.
`pat1 pat2`        | `pat1` followed by `pat2`
`pat1 / pat2`      | `pat1` if it matches, otherwise `pat2`
`.. pat`           | Any text up to and including `pat` (except newlines)
`.. % skip pat`    | Any text up to and including `pat` (except newlines), skipping over instances of `skip`
`.. = repeat pat`  | Any number of repetitions of `repeat` up to and including `pat`
`.`                | Any single character (except newline)
`^^`               | The start of the input
`^`                | The start of a line
`$$`               | The end of the input
`$`                | The end of a line
`__`               | Zero or more whitespace characters (including newlines)
`_`                | Zero or more whitespace characters (excluding newlines)
`` `c ``           | The literal character `c`
`` `a-z ``         | The character range `a` through `z`
`` `a,b ``         | The character `a` or the character `b`
`\n`, `\033`, `\x0A`, etc. | An escape sequence character
`\x00-xFF`         | An escape sequence range (byte `0x00` through `0xFF` here)
`!pat`             | `pat` does not match at the current position
`[pat]`            | Zero or one occurrences of `pat` (optional pattern)
`5 pat`            | Exactly 5 occurrences of `pat`
`2-4 pat`          | Between 2 and 4 occurrences of `pat` (inclusive)
`5+ pat`           | 5 or more occurrences of `pat`
`5+ pat % sep`     | 5 or more occurrences of `pat`, separated by `sep` (e.g. `0+ int % ","` matches `1,2,3`)
`*pat`             | 0 or more occurrences of `pat` (shorthand for `0+pat`)
`+pat`             | 1 or more occurrences of `pat` (shorthand for `1+pat`)
`<pat`             | `pat` matches just before the current position (lookbehind)
`>pat`             | `pat` matches just in front of the current position (lookahead)
`@pat`             | Capture `pat` (used for text replacement)
`@foo=pat`         | Capture `pat` with the name `foo` attached (used for text replacement)
`@foo:pat`         | Let `foo` be the text of `pat` (used for backreferences)
`pat => "replacement"` | Match `pat` and replace it with `replacement`
`(pat1 @keep=pat2) => "@keep"` | Match `pat1` followed by `pat2` and replace it with the text of `pat2`
`pat1~pat2`        | `pat1` when `pat2` can be found within the result
`pat1!~pat2`       | `pat1` when `pat2` cannot be found within the result
`name: pat`        | `name` is defined to mean `pat`
`name:: pat`       | `name` is defined to mean `pat` and matches have `name` attached to the result as metadata
`# line comment`   | A line comment

See `man ./bp.1` for more details.
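
For readers new to PEGs, the following Python sketch models a few of the operators from the table as plain functions: sequence (`pat1 pat2`), ordered choice (`pat1 / pat2`), and prefix repetition with a separator (`N+ pat % sep`). It only illustrates the semantics and is not how BP is implemented; all names (`lit`, `seq`, `choice`, `repeat`, `char_range`) are made up for this example.

```python
from typing import Callable, Optional

# A matcher takes (text, pos) and returns the end position of the match,
# or None if nothing matches at pos.
Matcher = Callable[[str, int], Optional[int]]

def lit(s: str) -> Matcher:
    """A literal string, like "foo"."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None

def char_range(lo: str, hi: str) -> Matcher:
    """A single character in a range, like `a-z."""
    return lambda text, pos: (
        pos + 1 if pos < len(text) and lo <= text[pos] <= hi else None
    )

def seq(*pats: Matcher) -> Matcher:
    """pat1 pat2: every pattern must match, one after another."""
    def match(text: str, pos: int) -> Optional[int]:
        for p in pats:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return match

def choice(*pats: Matcher) -> Matcher:
    """pat1 / pat2: the first alternative that matches wins (ordered choice)."""
    def match(text: str, pos: int) -> Optional[int]:
        for p in pats:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return match

def repeat(min_count: int, pat: Matcher, sep: Optional[Matcher] = None) -> Matcher:
    """N+ pat % sep: at least min_count greedy repetitions, optionally separated."""
    def match(text: str, pos: int) -> Optional[int]:
        count = 0
        while True:
            start = pos
            if sep is not None and count > 0:
                start = sep(text, pos)
                if start is None:
                    break
            end = pat(text, start)
            if end is None:
                break
            count += 1
            if end == pos:  # zero-width match: stop to avoid looping forever
                break
            pos = end
        return pos if count >= min_count else None
    return match

integer = repeat(1, char_range("0", "9"))      # like: 1+ `0-9
int_list = repeat(0, integer, sep=lit(","))    # like: 0+ int % ","

print(int_list("1,2,3", 0))                          # → 5
print(choice(lit("foo"), lit("foobar"))("foobar", 0))  # → 3
```

The second call shows the main difference from regex alternation: once `"foo"` matches, the longer `"foobar"` alternative is never tried.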

## Grammar Files

BP comes packaged with some pattern definitions that can be useful when parsing
code of different languages. Firstly, there are a handful of general-purpose
patterns like:

Name          | Meaning
--------------|--------------------
`string`      | A string (either single- or double-quoted)
`parens`      | A matched pair of parentheses (`()`)
`braces`      | A matched pair of curly braces (`{}`)
`brackets`    | A matched pair of square brackets (`[]`)
`anglebraces` | A matched pair of angle brackets (`<>`)
`_`           | Zero or more whitespace characters (excluding newline)
`__`          | Zero or more whitespace characters, including newlines and comments
`Abc`         | The characters `a-z` and `A-Z`
`Abc123`      | The characters `a-z`, `A-Z`, and `0-9`
`int`         | 1 or more numeric characters
`number`      | An integer or floating point number
`Hex`         | A hexadecimal character
`id`          | An identifier

As well as these common definitions, BP also comes with a set of
language-specific or domain-specific grammars. These are not full language
grammars, but only implementations of some language-specific features, like
identifier rules (`id`), string syntax, and comment syntax (which affects `__`
and other rules). Some of the languages supported are:

- BP
- C++
- C
- Go
- HTML
- JavaScript
- Lisp
- Lua
- Python
- Rust
- shell script

These grammar definitions can be found in [grammars](/grammars). To use a
grammar file, use `bp -g <path-to-file>` or `bp --grammar=<path-to-file>`. Once
BP is installed, however, you can use `bp -g <grammar-name>` directly, and BP
will figure out which grammar you mean (e.g. `bp -g lua ...`). BP first
searches `~/.config/bp/` for any grammar files you keep locally, then searches
`/etc/bp/` for system-wide grammar files.

Testing for these grammar files (other than `builtins`) is iffy at this point,
so use at your own risk! These grammar files are only approximations of syntax.
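
The lookup order described above (user directory first, then the system-wide directory) can be sketched in Python as follows. This is an illustration, not BP's code: the function name `find_grammar` is hypothetical, and the `.bp` file extension is an assumption that this README does not state.

```python
from pathlib import Path
from typing import Optional, Sequence

def find_grammar(name: str, search_dirs: Optional[Sequence[Path]] = None) -> Optional[Path]:
    """Return the first grammar file found for `name`, or None.

    By default, search ~/.config/bp/ before /etc/bp/, so a local grammar
    shadows a system-wide one with the same name.
    """
    if search_dirs is None:
        search_dirs = [Path.home() / ".config" / "bp", Path("/etc/bp")]
    for directory in search_dirs:
        candidate = directory / f"{name}.bp"  # assumed extension
        if candidate.is_file():
            return candidate
    return None

# The result depends on what is installed on the machine running this.
print(find_grammar("lua"))
```

Because the user directory comes first, dropping a modified grammar into `~/.config/bp/` would override the system copy in this sketch without touching `/etc/bp/`.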

## Code Layout

File                           | Description
-------------------------------|-----------------------------------------------------
[bp.c](bp.c)                   | The main program.
[files.c](files.c)             | Loading files into memory.
[match.c](match.c)             | Pattern matching code (find occurrences of a bp pattern within an input string).
[pattern.c](pattern.c)         | Pattern compiling code (compile a bp pattern from an input string).
[printmatch.c](printmatch.c)   | Printing a visual explanation of a match.
[utf8.c](utf8.c)               | UTF-8 helper code.
[utils.c](utils.c)             | Miscellaneous helper functions.

## Lua Bindings

`bp` also comes with a set of Lua bindings, which can be found in the
[Lua/ directory](Lua). The bindings are currently a work in progress, but are
fully usable at this point. Check [the Lua bindings README](Lua/README.md) for
more details.

## Performance

Currently, `bp`'s speed is comparable to hyper-optimized regex tools like
`grep`, `ag`, and `ripgrep` when it comes to simple patterns that begin with
string literals, but `bp`'s performance may be noticeably slower for complex
patterns on large quantities of text. The aforementioned regular expression
tools are usually implemented as efficient finite state machines, but `bp` is
more expressive and capable of matching arbitrarily nested patterns, which
precludes the possibility of using a finite state machine. Instead, `bp` uses a
fairly simple recursive virtual machine implementation with memoization. `bp`
also has a decent amount of overhead because of the metadata used for
visualizing and explaining pattern matches, as well as performing string
replacements. Overall, I would say that `bp` is a great drop-in replacement for
common shell scripting tasks, but you may want to keep the other tools around
in case you have to search through a truly massive codebase for something
complex.
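
As a rough illustration of recursive matching with memoization (not `bp`'s actual virtual machine), the Python sketch below matches arbitrarily nested parentheses, which a finite state machine cannot do. The names `make_parens_matcher` and `find_parens` are hypothetical.

```python
from typing import Callable, Dict, List, Optional

def make_parens_matcher(text: str) -> Callable[[int], Optional[int]]:
    """Build a memoized matcher for balanced parentheses in `text`.

    The matcher returns the end index of the group starting at `pos`,
    or None if no balanced group starts there.
    """
    memo: Dict[int, Optional[int]] = {}

    def parens(pos: int) -> Optional[int]:
        if pos in memo:  # reuse an earlier result for this position
            return memo[pos]
        result = None
        if pos < len(text) and text[pos] == "(":
            i = pos + 1
            while i < len(text) and text[i] != ")":
                if text[i] == "(":
                    inner = parens(i)  # recurse into the nested group
                    if inner is None:
                        break
                    i = inner
                else:
                    i += 1
            if i < len(text) and text[i] == ")":
                result = i + 1
        memo[pos] = result
        return result

    return parens

def find_parens(text: str) -> List[str]:
    """Scan left to right and collect every non-overlapping balanced group."""
    parens = make_parens_matcher(text)
    matches, pos = [], 0
    while pos < len(text):
        end = parens(pos)
        if end is None:
            pos += 1
        else:
            matches.append(text[pos:end])
            pos = end
    return matches

print(find_parens("f(a(b)c) + ((x)"))
# → ['(a(b)c)', '(x)']
```

In the second group, the attempt at the outer `(` fails, but the result for the inner `(x)` is already memoized, so the retry one character later does no new matching work.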

## License

BP is provided under the MIT license with the [Commons Clause](https://commonsclause.com/)
(you can't sell this software without the developer's permission, but you're
otherwise free to use, modify, and redistribute it free of charge).
See [LICENSE](LICENSE) for details.