Add flat bril project blog post #535
Conversation
sampsyo left a comment
Really nice work here, y'all! I found your description of the flattening process detailed and useful, and it's awesome that you got all the way to implementing and evaluating a "native" interpreter. Fantastic!
I have a few questions within. And one more: would you be interested in contributing your new representation & interpreter back to the Bril monorepo? I imagine that other people might find it useful!
| 1. Infrastructure to convert existing Bril JSON files to/from our flattened format
| 2. An alternate Bril interpreter that operates directly on the flattened data structure (as opposed to the existing one, `brili`, which has to parse JSON)
| Our complete implementation can be found at https://github.com/ngernest/flat-bril/tree/main
Consider integrating this link into the text, like "our complete implementation is [open source on GitHub][…]".
| ### Flat data structures
| A key design decision was what the flattened data structures would look like. We clearly need similar fields to those of the original representation, but we need a strategy for flattening `Vec`s and `String`s and other non-flat types. Instead of storing variables, labels and function names as Strings and Vec<String>s, we aggregate all the names referenced throughout a function into three contiguous arrays of bytes that are stored in our function representation. Then instead of using Strings, our instruction representation stores pairs of indices that can be used to extract the relevant name from the function-level contiguous array. One subtlety here is that if we want to use a Vec, we cannot just index into the array containing all the names: we would not know where the boundaries between each name are! We need to use an intermediary array that stores indices into the top-level name array.
You probably want some more code backticks when you refer to Rust types. For example:
as Strings and Vec<String>s
->
as `String`s and `Vec<String>`s
| For instance, in a function with three variable names v1, v2 and v3, the function-level variable array will look like v1v2v3. If an instruction has two arguments v1 and v2, our representation stores a pair of indices into an intermediate array that itself contains pairs of indices. A lookup would proceed as (0,1) → (0,0), (1,1) → v1, v2.
A lookup would proceed as (0,1) → (0,0), (1,1) → v1, v2
Not sure I understand this sentence… are (0, 0) and (1, 1) supposed to be pairs of indices in the name buffer? Assuming half-open intervals, wouldn't that be (0, 2) and (2, 4)? Or, with closed intervals, (0, 1) and (2, 3)?
I rewrote this section in the .md file, hopefully it's clearer now!
(In hindsight, our design is a bit confusing in that we use closed intervals throughout, and we enforce the invariant that the bytes for the actual variable-name strings must be stored contiguously, and their indexes in the interstitial indexes array must also be stored contiguously.)
Got it; makes sense.
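For readers puzzling over the same point, here is a minimal sketch of the closed-interval, two-level lookup as I understand it from the discussion above; the `NameTable` type and `lookup` function are hypothetical names for illustration, not the actual flat-bril code:

```rust
// Sketch of the two-level, closed-interval name lookup described above.
// All names here (NameTable, lookup) are hypothetical.
struct NameTable<'a> {
    /// All variable names for a function, concatenated: b"v1v2v3"
    bytes: &'a [u8],
    /// Closed byte intervals into `bytes`, one per name: [(0, 1), (2, 3), (4, 5)]
    name_spans: &'a [(u32, u32)],
}

impl<'a> NameTable<'a> {
    /// Resolve a closed interval over `name_spans` into the named strings.
    /// E.g. (0, 1) -> spans (0, 1) and (2, 3) -> ["v1", "v2"].
    fn lookup(&self, lo: u32, hi: u32) -> Vec<&'a str> {
        self.name_spans[lo as usize..=hi as usize]
            .iter()
            .map(|&(start, end)| {
                std::str::from_utf8(&self.bytes[start as usize..=end as usize]).unwrap()
            })
            .collect()
    }
}

fn main() {
    let table = NameTable {
        bytes: b"v1v2v3",
        name_spans: &[(0, 1), (2, 3), (4, 5)],
    };
    // An instruction storing the interval (0, 1) resolves to its two arguments.
    assert_eq!(table.lookup(0, 1), vec!["v1", "v2"]);
}
```

With closed intervals and the contiguity invariant, an instruction's whole argument list is just one `(lo, hi)` pair over the interstitial array.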
| }
| ```
| Although our function representation does contain Vec s to enable construction, these will be flattened to slices once the entire Function has been created
Although our function representation does contain Vec s to enable construction, these will be flattened to slices once the entire Function has been created
->
Although our function representation does contain `Vec`s to enable construction, these will be flattened to slices once the entire Function has been created.
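As a small aside on the `Vec`-to-slice point above, a construction-time `Vec` can be frozen into an immutable boxed slice once building is finished; a minimal sketch (the `freeze` helper is illustrative):

```rust
// Sketch: freeze a construction-time Vec into an immutable boxed slice.
fn freeze(buffer: Vec<u8>) -> Box<[u8]> {
    // Shrinks the allocation to fit and discards any spare capacity.
    buffer.into_boxed_slice()
}

fn main() {
    let mut names = Vec::new();
    names.extend_from_slice(b"v1v2v3");
    let frozen: Box<[u8]> = freeze(names);
    assert_eq!(&frozen[..], b"v1v2v3");
}
```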
| ### Zerocopy
| In order to facilitate conversion between slices of bytes and our flat data structures, we used the zerocopy crate. In practice this meant adding the `IntoBytes` and `FromBytes` traits to our flat representations, and specifying their byte representations using the `repr` attribute. It ended up being quite challenging to get this working: zerocopy was quite finicky about what was allowable in a struct using the zerocopy traits. We ended up needing to create new “extra-flat” versions of many of our data structures to get this to work, and experimenting with different `repr` options. For instance, we discovered that zerocopy could not convert pairs of `u32`s, or `Option`s, to bytes, so we needed to create a new `I32Pair` struct that itself implemented the zerocopy traits (`I32` as opposed to `u32` because we used -1 to represent the case where the `Option` is None). We probably could have used these extra-flat data structures everywhere, rather than converting between multiple versions, but since we had written most of our flattening logic using the original data structures, we decided to stick with using new extra-flat versions.
Please use a hyperlink to the zerocopy crate. And probably also to the Rust documentation for repr.
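To make the zerocopy discussion concrete, here is a minimal sketch of an "extra-flat" pair along the lines described above, assuming zerocopy 0.8's derive and method names (these vary across zerocopy versions); `I32Pair` is the post's own name, but the field names and usage are illustrative, not the actual flat-bril definition:

```rust
use zerocopy::{FromBytes, Immutable, IntoBytes, KnownLayout};

// Sketch of an "extra-flat" pair that zerocopy can view as raw bytes.
// -1 in either field stands in for a `None`, as described above.
#[derive(FromBytes, IntoBytes, Immutable, KnownLayout, Debug, PartialEq)]
#[repr(C)]
struct I32Pair {
    lo: i32,
    hi: i32,
}

fn main() {
    let pair = I32Pair { lo: 3, hi: -1 }; // hi = -1 encodes "no value"
    let bytes: &[u8] = pair.as_bytes(); // zero-cost view of the struct as bytes
    let back = I32Pair::read_from_bytes(bytes).unwrap(); // parse it back out
    assert_eq!(back, pair);
}
```

`#[repr(C)]` plus two `i32` fields leaves no padding bytes, which is what makes the `IntoBytes` derive happy here.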
| ## Evaluation
| For our evaluation, we decided to test flat bril on 70 core Bril benchmarks. To check the correctness of our implementation, we used Turnt to verify that all benchmarks run with our flat bril interpreter returned the same result as the reference Brili interpreter, which they did. Additionally, to check the correctness of our infrastructure for converting JSON files to/from our flattened format, we manually checked that the JSON output converted back from the flat format matched the original. To measure the performance impact, we did the following using Hyperfine:
Consider adding hyperlinks to Turnt, the reference interpreter, and Hyperfine.
| Below is a table showing the time taken for JSON roundtrips. We tested this on all the core benchmarks, but due to space constraints, we only list a few here. These are averaged over 10 runs, with a warmup of 3.
| | Benchmark | Mean [ms] | Min [ms] | Max [ms] | Relative |
What is the "Relative" column here? What's being compared to what?
(By default, the Markdown table that Hyperfine outputs includes this "Relative" column, but there is no baseline here for us to compare against, so the column is inappropriate.) I removed the column -- thanks!
| | `nop` | 521.6 ± 114.0 | 438.3 | 744.3 | 1.19 ± 0.26 |
| | `perfect` | 503.1 ± 37.1 | 462.7 | 578.5 | 1.15 ± 0.09 |
| | `reverse` | 486.1 ± 61.1 | 432.5 | 616.1 | 1.11 ± 0.14 |
| | `rot13` | 438.5 ± 7.8 | 428.4 | 452.8 | 1.00 |
Is it really the case that a JSON -> flat -> JSON round-trip takes ~500 ms (half a second)? That seems like a really long time! Because these programs are pretty small, I'd have expected it to be, like, less than a millisecond. Any guesses about what is taking so long?
Thanks for the catch! We realized that the original JSON roundtrip benchmark results were faulty for two reasons:
- The command we passed to Hyperfine for benchmarking was of the form `bril2json < SOME_BRIL_FILE | cargo run -- ...`, and Samply revealed that a lot of the ~500ms was dominated by `bril2json`'s file I/O. I updated our code so that the JSON benchmarks don't require piping the output of `bril2json` to our executable.
- We forgot to do `cargo build --release` before benchmarking (so some optimizations from the `release` build might have been missed).

I've rerun the benchmarks using the `--release` build and updated the figure in the blogpost!
Sounds good. FWIW, there might also be a significant cost to using `cargo run` instead of just running the executable directly…
| Besides comparing our interpreter with the other Brili implementations, we also used the Samply profiler to figure out where our interpreter was spending most of its time. The stack chart below (obtained from the Firefox Profiler UI) demonstrates how our interpreter executable spends its time on the benchmark `catalan.bril` – we chose to highlight this benchmark since it has many recursive function calls and the difference between our interpreter and the TypeScript/Rust Brili interpreters’ performance is noticeable.
| The stack chart shows that most of the executable’s runtime was spent in interpreter-related functions: `mmap`-ing the flat file format was relatively quick, and so was converting the JSON to a flat file format (omitted from the screenshot). Zooming into the stack chart (second screenshot below) and focusing on the function calls towards the bottom, we see that when interpreting individual instructions, our interpreter calls a lot of standard library functions that manipulate `HashMap`s and `Vec`s. This is because the environment datatype in our interpreter is defined as `HashMap<&str, BrilValue>` (mapping variable names to Bril values), i.e. keys in the hashmap are (references to) strings. We realized that strings were the canonical way to represent variables in the environment, since different indexes in the arguments field of an instruction may point to the same underlying variable. However, the fast Rust Brili interpreter canonicalises variable names into a numerical representation (as described in their blogpost), allowing for faster look-ups in their environment. We suspect that this is one of the reasons why the Rust Brili interpreter out-performs our flat interpreter.
Thanks for doing this investigation! I agree that it is likely that this is the performance bottleneck here. It would be interesting to try to alleviate this bottleneck by requiring that variable name indices be unique…
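To make the bottleneck (and the suggested fix) concrete, here is a hedged sketch contrasting a string-keyed environment with an index-keyed one; `BrilValue` is the post's own name, while everything else here is illustrative:

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum BrilValue {
    Int(i64),
    Bool(bool),
}

fn main() {
    // Current design: variable names are the keys, so every get/set
    // hashes and compares a string.
    let mut by_name: HashMap<&str, BrilValue> = HashMap::new();
    by_name.insert("v1", BrilValue::Int(42));
    assert_eq!(by_name.get("v1"), Some(&BrilValue::Int(42)));

    // If every variable had a unique canonical index (as in the fast
    // Rust Brili), the environment could be a plain vector indexed by
    // that number instead, avoiding string hashing entirely.
    let num_vars = 2;
    let mut by_index: Vec<Option<BrilValue>> = vec![None; num_vars];
    by_index[0] = Some(BrilValue::Int(42)); // variable 0 = "v1"
    assert_eq!(by_index[0], Some(BrilValue::Int(42)));
}
```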
| Our interpreter compares favorably to Rust Brili in terms of peak physical memory usage (max resident set size). Notably, our interpreter has an order of magnitude fewer page faults and involuntary context switches compared to the TypeScript interpreter (250 vs 2036 and 210 vs 2480), although it’s unclear whether this is due to our choice of implementation language (Rust vs TypeScript) as opposed to our data structure flattening strategy, since Rust Brili also has similar metrics.
| ### What else could be done?
| At the moment, our implementation creates `Vec`-based buffers to store the variables/labels it sees within a Bril program. Since Bril programs are unlikely to contain excessively large no. of variables/labels, perhaps we could use Rust’s `smallvec` or `arrayvec` libraries to create these buffers instead of the standard library’s `Vec` data structure. The aforementioned libraries offer specialized short vectors which minimize heap allocations, which might offer performance benefits over heap-allocated `Vec`s. However, we would need to properly benchmark these supposed benefits to see if swapping to these libraries is worth it.
no. -> number (you're not under space limitations here :)
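For reference, a minimal sketch of what the `smallvec` swap might look like, assuming the `smallvec` crate's `SmallVec<[T; N]>` API; the inline capacity of 32 is an arbitrary choice for illustration:

```rust
use smallvec::SmallVec;

fn main() {
    // Up to 32 bytes live inline on the stack; longer name buffers
    // spill to the heap automatically.
    let mut names: SmallVec<[u8; 32]> = SmallVec::new();
    names.extend_from_slice(b"v1v2v3");
    assert!(!names.spilled()); // still inline: no heap allocation yet
    assert_eq!(&names[..], b"v1v2v3");
}
```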
Thanks Adrian for the comments! I rewrote a few sections of the blogpost and reran some benchmarks based on the comments. I'll put up a PR for the Bril monorepo later!
Looks great; thanks!!
Closes #501
@ngernest
@katherinewu312