Commit 6432541

Document martian 4.0 features. (#7)
1 parent 3bf9167 commit 6432541

2 files changed

+234
-9
lines changed

content/advanced-features/index.md

Lines changed: 53 additions & 0 deletions
@@ -49,12 +49,18 @@ pipeline DUPLICATE_FINDER(
4949
Disabled pipelines or stages will not run, and their outputs will be populated
5050
with null values. Downstream stages must be prepared to deal with this case.
5151

52+
Note that the value being bound to `disabled` must be a boolean. Martian does
53+
not have a concept of "falsey" values the way, for example, Python or JavaScript
54+
do.
55+
5256
## Parallelization
5357
Subject to resource constraints, Martian parallelizes work by breaking
5458
pipeline logic into chunks and parallelizing them in two ways. First,
5559
stages can run in parallel if they don't depend on each other's outputs.
5660
Second, individual stages may split themselves into several chunks.
5761

62+
### Chunking
63+
5864
Stages which split are specified in mro as, for example,
5965
```
6066
stage SUM_SQUARES(
@@ -73,6 +79,53 @@ as well as potentially setting thread and memory requirements for each chunk
7379
and the join. After the chunks run, the join phase aggregates the output
7480
from all of the chunks into the single output of the stage.
7581

82+
### `map call` (4.0 preview)
83+
To run a stage or sub-pipeline once for each element in an array or map, one can
84+
say
85+
```
stage SQUARE(
    in  float value,
    out float square,
    src comp  "square",
)

stage SUM(
    in  float[] values,
    out float   sum,
    src comp    "sum",
)

pipeline SUM_SQUARES(
    in  float[] values,
    out float   sum,
)
{
    map call SQUARE(
        value = split self.values,
    )

    call SUM(
        values = SQUARE.square,
    )

    return (
        sum = SUM.sum,
    )
}
```
116+
In this case the pipeline is mostly equivalent to the chunked version shown
117+
above, but `map call` opens up new possibilities for pipelines. Because the
118+
next call can be mapped over the outputs of the previous one, in some cases one
119+
can improve efficiency by avoiding a redundant `join` and `split` in between.
120+
121+
One or more parameters to a `map call` must be declared as being `split`. If
122+
more than one parameter is bound in this way, all of them must be bound to
123+
either arrays with the same lengths or typed maps with the same keys. If the
124+
split argument was an array, the output of the stage will be an array with the
125+
same length. If it was a typed map, the output will be a typed map with the
126+
same keys. Splitting an untyped map is not permitted, as there is no way to
127+
confirm compatibility of the values.
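
The rules above can be sketched in Python. This is only a model of the semantics described in this section, not Martian's implementation: `map_call`, `split_args`, and `fixed_args` are hypothetical names, Python lists stand in for arrays, and Python dicts stand in for typed maps (the model cannot tell a typed map from an untyped one).

```python
from typing import Any, Callable, Dict, List, Union

Collection = Union[List[Any], Dict[str, Any]]


def map_call(stage: Callable[..., Any],
             split_args: Dict[str, Collection],
             fixed_args: Dict[str, Any]) -> Collection:
    """Model of `map call`: run `stage` once per element of the split arguments."""
    if not split_args:
        raise ValueError("at least one parameter must be bound with `split`")
    values = list(split_args.values())
    if all(isinstance(v, list) for v in values):
        lengths = {len(v) for v in values}
        if len(lengths) != 1:
            raise ValueError("split arrays must all have the same length")
        # Array in, array out: one result per index.
        return [stage(**{k: v[i] for k, v in split_args.items()}, **fixed_args)
                for i in range(lengths.pop())]
    if all(isinstance(v, dict) for v in values):
        keys = set(values[0])
        if any(set(v) != keys for v in values):
            raise ValueError("split maps must all have the same keys")
        # Map in, map out: one result per key.
        return {key: stage(**{k: v[key] for k, v in split_args.items()}, **fixed_args)
                for key in values[0]}
    raise ValueError("split arguments must be all arrays or all maps")


def square(value: float) -> float:
    return value * value


# Mirrors SUM_SQUARES: map SQUARE over the values, then sum the results.
squares = map_call(square, {"value": [1.0, 2.0, 3.0]}, {})
total = sum(squares)
```

The output collection always has the same shape (length or keys) as the split inputs, which is what lets a following `map call` consume it directly.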
128+
76129
## Resource consumption
77130
Martian is designed to run stages in parallel. Either locally or in cluster
78131
mode, it tries to ensure sufficient threads and memory are available for each

content/writing-pipelines/index.md

Lines changed: 181 additions & 9 deletions
@@ -115,9 +115,117 @@ files for [Vim](https://www.vim.org/), [Atom](https://atom.io/),
115115
and [Sublime](https://www.sublimetext.com/) text editors. If your favorite
116116
editor is missing from this list, pull requests are welcome.
117117

118+
### Types
119+
120+
#### Built-in types
121+
122+
Martian has several supported built-in basic types:
123+
- `int`: A 64-bit signed integer type.
124+
- `float`: A double-precision floating point value.
125+
- `bool`: `true` or `false`
126+
- `string`: A UTF-8 string. String literals in pipeline source code are double-quoted and recognize JSON-style escape sequences.
127+
- `file`: A generic type indicating that the value will contain a path to a regular file. Always use absolute paths, as each stage will run in its own working directory.
128+
- `path`: A generic type indicating that the value will contain a path to a directory.
129+
- `map`: An untyped map type, with string keys and arbitrary values. For new code targeting Martian 4.0, it is strongly recommended to use a typed map or struct instead.
130+
131+
Implicit conversion from `string` to `file` or `path` is permitted, as well as
132+
from `int` to `float` (though not the reverse).
133+
134+
#### User-defined file types
135+
136+
Users may define new file types using the `filetype` directive. These behave
137+
like `file`, and can be implicitly converted to `string` or `file`, but not from
138+
one user file type to another. This allows pipelines to be more clear about
139+
the format of files being passed around, and to exploit the type checking to
140+
ensure that files are being used consistently as they are passed around.
141+
142+
#### Structured data types (Martian 4.0 preview)
143+
144+
A `struct` is analogous to a struct in C, a named tuple in Python, or an object
145+
in JavaScript. Structs are declared as follows:
146+
```
struct MyType(
    int   foo,
    float bar,
)
```
152+
Members of a struct can be extracted using the familiar `.` syntax, e.g.
153+
```
call FOO(
    foo = STRUCT_OUTPUT.struct.foo,
)
```
158+
The output of a stage or pipeline is always a `struct`. Because of this, the
159+
name of a stage or pipeline can be used as a type, to indicate a structure with
160+
the same members and types as the outputs of the stage or pipeline.
161+
162+
Martian structures support a form of "[duck typing](https://en.wikipedia.org/wiki/Duck_typing)".
163+
If one has a struct `MyType` as declared above, and another type
164+
```
struct MyBiggerType(
    int    foo,
    int    bar,
    string baz,
)
```
171+
then a value of type `MyBiggerType` may be used for the input to a stage or
172+
pipeline which asks for a `MyType`. This is because for every field in `MyType`
173+
there is a field with the same name in `MyBiggerType`, and that field in
174+
`MyBiggerType` has a type that is assignable to the type for that field in
175+
`MyType`. Because of this, one can easily take a subset of the data from a
176+
`struct` with only the values one actually needs. Values with struct types may
177+
also always be used for untyped map values.
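
As a rough illustration of the assignability rule, the following Python sketch checks one struct type against another. It is a model of the rule as stated here, not Martian's type checker: the function names are hypothetical, struct types are represented as plain dicts from field name to type name, and only the scalar conversions mentioned earlier (`int` to `float`, `string` to `file` or `path`) are covered.

```python
from typing import Dict

# Scalar conversions stated in the text: int -> float, string -> file/path.
SCALAR_CONVERSIONS = {("int", "float"), ("string", "file"), ("string", "path")}

StructType = Dict[str, str]  # field name -> scalar type name


def scalar_assignable(src: str, dst: str) -> bool:
    """True if a value of type `src` may be used where `dst` is expected."""
    return src == dst or (src, dst) in SCALAR_CONVERSIONS


def struct_assignable(src: StructType, dst: StructType) -> bool:
    """True if every field of `dst` exists in `src` with an assignable type."""
    return all(name in src and scalar_assignable(src[name], dst_type)
               for name, dst_type in dst.items())


my_type = {"foo": "int", "bar": "float"}
my_bigger_type = {"foo": "int", "bar": "int", "baz": "string"}

# MyBiggerType can be passed where MyType is expected, but not the reverse:
# MyType has no `baz`, and its `bar` is a float, which cannot become an int.
bigger_to_small = struct_assignable(my_bigger_type, my_type)
small_to_bigger = struct_assignable(my_type, my_bigger_type)
```

Extra fields in the source struct (`baz` here) are simply ignored, which is what makes taking a subset of a struct cheap.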
178+
179+
As an additional convenience, martian supports a "wildcard expansion" of a
180+
struct value when calling a stage, e.g.
181+
```
call STAGE2(
    foo = self.foo,
    *   = STAGE1,
)
```
187+
This is equivalent to `input = STAGE1.output` for every output of `STAGE1` that
188+
is an input to `STAGE2`. To prevent ambiguity, only one wildcard expansion is
189+
allowed for each call, and it is an error if one of the outputs of `STAGE1` was
190+
already assigned in the input call explicitly (e.g. `foo` in the example).
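
The expansion rule can be modeled in a few lines of Python. This is a sketch under the assumptions stated in the paragraph above, not the Martian compiler's logic; `expand_wildcard` and its parameters are hypothetical names, and bindings are represented as plain strings.

```python
from typing import Dict, List


def expand_wildcard(explicit: Dict[str, str],
                    wildcard_source: str,
                    source_outputs: List[str],
                    callee_inputs: List[str]) -> Dict[str, str]:
    """Return the full bindings for a call that uses `* = <wildcard_source>`."""
    bindings = dict(explicit)
    for name in source_outputs:
        if name not in callee_inputs:
            continue  # outputs the callee does not accept are ignored
        if name in explicit:
            raise ValueError(
                f"input '{name}' is bound both explicitly and by the wildcard")
        bindings[name] = f"{wildcard_source}.{name}"
    return bindings


# STAGE1 is assumed to have outputs `bar` and `extra`; STAGE2 takes `foo` and `bar`.
bindings = expand_wildcard(
    explicit={"foo": "self.foo"},
    wildcard_source="STAGE1",
    source_outputs=["bar", "extra"],
    callee_inputs=["foo", "bar"],
)
```

If `STAGE1` also had an output named `foo`, the same call would raise, matching the ambiguity error described above.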
191+
192+
#### Collection types
193+
194+
Martian also supports collections of values as arrays or typed maps. These are
195+
declared using a syntax that is familiar to users of C-style languages. Arrays
196+
are declared as for example `int[]`. Typed maps (available in the martian 4.0
197+
preview) always have string keys, and
198+
are declared as for example `map<int>`. These can be combined as for example
199+
`map<int[][]>[]`.
200+
201+
In order to prevent confusing data flows, maps cannot be directly nested. That
202+
is, `map<map<int>>` is not permitted, nor is it permitted to nest untyped maps,
203+
e.g. `map<map>`. It _is_ permitted to have a map of structs, and those structs
204+
may contain further maps.
205+
206+
Because the type system has no way to enforce the length of an array or the keys
207+
of a map, there is no support for indexing into one. If one knows the keys
208+
ahead of time, use a struct.
209+
210+
A struct can be assigned to a typed map value if every field in the struct has a
211+
type that can be assigned to the type of the map. For example a struct with
212+
only `int` and `float` fields may be assigned to a value of type `map<float>`.
213+
214+
Typed maps may be converted to untyped maps, and `map<T>` may be converted to
215+
`map<U>` if type `T` is convertible to type `U`. The same applies for arrays,
216+
e.g. converting `T[]` to `U[]`.
217+
218+
A very important convenience is "projection" through structs. Using the
219+
`MyType` example struct from the previous section, if we have a value `FOO` of
220+
type `map<MyType[]>` then `FOO.bar` has type `map<float[]>`.
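
To make projection concrete, here is a small Python model at the value level. It is illustrative only: `project` is a hypothetical helper, a dataclass stands in for the `MyType` struct, lists stand in for arrays, and dicts stand in for typed maps.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class MyType:
    foo: int
    bar: float


def project(value: Any, field: str) -> Any:
    """Extract `field` from every struct nested inside arrays and typed maps."""
    if isinstance(value, list):
        return [project(item, field) for item in value]
    if isinstance(value, dict):
        return {key: project(item, field) for key, item in value.items()}
    return getattr(value, field)  # a struct: take the member itself


# FOO has the shape map<MyType[]>; FOO.bar then has the shape map<float[]>.
FOO = {"a": [MyType(1, 0.5), MyType(2, 1.5)], "b": []}
FOO_bar = project(FOO, "bar")
```

The collection structure (map keys and array lengths) is preserved; only the innermost struct is replaced by the selected member.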
221+
118222
### Composability
119223

120-
Pipelines specify input and output parameters the same way stages do, so they may themselves also act as stages. This allows for the composition of an arbitrary mix of individual stages and pipelines into still larger pipelines. We refer to pipelines as "subpipelines" when they are composed into other pipelines.
224+
Pipelines specify input and output parameters the same way stages do, so they
225+
may themselves also act as stages. This allows for the composition of an
226+
arbitrary mix of individual stages and pipelines into still larger pipelines.
227+
We refer to pipelines as "subpipelines" when they are composed into other
228+
pipelines.
121229

122230
Because parameter binding is done by stage name, pipelines cannot call the same
123231
stage or sub-pipeline twice without aliasing it like so:
@@ -148,6 +256,39 @@ pipeline ADD_KEYS(
148256
}
149257
```
150258

259+
## Top-level file outputs
260+
261+
When a top-level pipeline completes, any outputs with file type are moved into
262+
the pipestance directory's `outs` subdirectory. Symbolic links are added to
263+
the original locations of those files in the stage output directories.
264+
265+
For an output with `file` or `path` type, the name of a file in the top-level
266+
output directory will be the name of the output parameter of the pipeline. If
267+
it is a user-defined file type, e.g. json, then the type will be appended to
268+
the name as an extension, e.g. `.json`.
269+
270+
If a pipeline is defined as, for example,
271+
```
pipeline PIPE(
    out json foo "help text" "special_file",
)
```
276+
then the string "help text" will be displayed in the console as a label for
277+
the output file, and the default filename (which would be `foo.json`) is
278+
overridden to `special_file`. These annotations apply when defining struct
279+
types as well.
280+
281+
In martian 4.0, if an output is a struct type, then in the top-level `outs`
282+
directory there will be a _directory_ for that value, containing files from
283+
within that structure. Nested structures are handled recursively as deeper
284+
directories.
285+
286+
An array of files will become a directory, with files named for the array index,
287+
e.g. for `json[] foo` there will be `foo/1.json` and so on. For typed maps,
288+
`map<json>`, the outputs would be `foo/<key>.json` for each key in the map.
289+
Arrays or typed maps of structs containing files end up as nested directories
290+
as one would expect.
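
The naming rules in this section can be summarized with a Python sketch. It is a model of the layout described above, not Martian's code, and it makes several assumptions: `out_paths` and `leaf_name` are hypothetical names, only file-typed members are modeled, a filename override is applied only to individual files, map elements of type `file` get no extension by analogy with the single-file case, and arrays are left out because their index naming is only shown by example here.

```python
from typing import Any, List, Optional

BUILTIN_FILE_TYPES = {"file", "path"}

# A type is one of:
#   - a file type name: "file", "path", or a user-defined one such as "json"
#   - a typed map: {"map": <element type>, "keys": [<keys in the value>]}
#   - a struct: a list of (field_name, field_type, optional_override) tuples


def leaf_name(name: str, type_name: str, override: Optional[str] = None) -> str:
    """File name for a single file-typed output."""
    if override is not None:
        return override
    if type_name in BUILTIN_FILE_TYPES:
        return name
    return f"{name}.{type_name}"  # user-defined file type becomes the extension


def out_paths(name: str, typ: Any, override: Optional[str] = None) -> List[str]:
    """Paths, relative to `outs`, produced for one output parameter."""
    if isinstance(typ, str):
        return [leaf_name(name, typ, override)]
    if isinstance(typ, dict):  # typed map: one entry per key
        children = [(key, typ["map"], None) for key in typ["keys"]]
    else:  # struct: one entry per field, handled recursively
        children = typ
    paths = []
    for child_name, child_type, child_override in children:
        for path in out_paths(child_name, child_type, child_override):
            paths.append(f"{name}/{path}")
    return paths


default_name = out_paths("foo", "json")
overridden = out_paths("foo", "json", "special_file")
map_of_json = out_paths("foo", {"map": "json", "keys": ["a", "b"]})
nested = out_paths("result", [
    ("summary", "json", None),
    ("logs", {"map": "file", "keys": ["x"]}, None),
])
```

Here `result`, `summary`, and `logs` are made-up names used only to show a struct containing a map of files.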
291+
151292
## Organizing Code
152293

153294
### MRO Files
@@ -157,7 +298,9 @@ by convention they are written in files that have an ```.mro``` extension.
157298

158299
### Preprocessing with @include
159300

160-
Martian supports lexical preprocessing with an ```@include``` directive, which takes the path to another MRO file as an argument. This directive is evaluated by splicing the contents of the included file into the file where the directive is given, replacing the directive itself. This evaluation is recursive, and Martian keeps track of the inclusion tree in order to be able to report errors using per-source file line numbers.
301+
Martian supports organizing one's pipeline definitions into multiple files which
302+
can use an `@include` directive to import the stages, pipelines, and types
303+
defined in other files.
161304

162305
`_my_stages.mro`
163306

@@ -193,7 +336,12 @@ pipeline DUPLICATE_FINDER(
193336

194337
### Stage Code vs Pipeline Code
195338

196-
By convention, the ```@include``` directive allows the developer to organize code into header files, although there is no formal distinction between header and non-header MRO files in Martian. Typically, stages that are logically grouped together are declared in one file, for example ```_sorting_stages.mro```, and that file would be included into another MRO file that declares a pipeline that calls these included stages. By convention, MRO files containing stage declarations should be named with the suffix ```_stages```.
339+
The ```@include``` directive allows the developer to organize code. Typically,
340+
stages that are logically grouped together are declared in one file, for example
341+
```_sorting_stages.mro```, and that file would be included into another MRO
342+
file that declares a pipeline that calls these included stages. By convention,
343+
MRO files containing stage declarations should be named with the suffix
344+
```_stages```.
197345

198346
### Martian Project Directory Structure
199347

@@ -226,7 +374,10 @@ martian_project/
226374

227375
## Formatting Code
228376

229-
Martian includes a canonical code formatting utility called `mrf`. It parses your MRO code into its abstract syntax tree and re-emits the code with canonical whitespacing. In particular, `mrf` performs intelligent column-wise alignment of parameter fields so that this:
377+
Martian includes a canonical code formatting utility called `mrf`. It parses
378+
your MRO code into its abstract syntax tree and re-emits the code with
379+
canonical whitespace. In particular, `mrf` performs intelligent column-wise
380+
alignment of parameter fields so that this:
230381

231382
~~~~
232383
stage SORT_ITEMS (in txt unsorted,
@@ -246,17 +397,38 @@ stage SORT_ITEMS(
246397
)
247398
~~~~
248399

249-
`mrf` is an "opinionated" formatter, inspired by tools like `gofmt`, therefore we will borrow
250-
[their explanation](https://blog.golang.org/go-fmt-your-code) of the benefits of canonical code formatting:
400+
`mrf` is an "opinionated" formatter, inspired by tools like `gofmt`, therefore
401+
we will borrow [their explanation](https://blog.golang.org/go-fmt-your-code) of
402+
the benefits of canonical code formatting:
251403

252404
- **Easier to write**: never worry about minor formatting concerns while hacking away.
253405
- **Easier to read**: when all code looks the same you need not mentally convert others' formatting style into something you can understand.
254406
- **Easier to maintain**: mechanical changes to the source don't cause unrelated changes to the file's formatting; diffs show only the real changes.
255407
- **Uncontroversial**: never have a debate about spacing or brace position ever again!
256408

257-
`mrf` takes a list of MRO filenames as arguments. By default, it will output the formatted code back to `stdout`. If given the `--rewrite` option, it will write the formatted code back into the original files. If given the `--all` option, it will rewrite all MRO files found in your `MROPATH`. For consistency of your MRO codebase, consider configuring editor save-hooks or git commit-hooks that run `mrf --rewrite` or `mrf --all`.
258-
259-
`mrf` does not support any arguments that affect the formatting, otherwise it would not be canonical!
409+
`mrf` takes a list of MRO filenames as arguments. By default, it will output
410+
the formatted code back to `stdout`. If given the `--rewrite` option, it will
411+
write the formatted code back into the original files. If given the `--all`
412+
option, it will rewrite all MRO files found in your `MROPATH`. For consistency
413+
of your MRO codebase, consider configuring editor save-hooks or git
414+
commit-hooks that run `mrf --rewrite` or `mrf --all`.
415+
416+
`mrf` does not support any arguments that affect the formatting, otherwise it
417+
would not be canonical!
418+
419+
If you run `mrf` with the `--includes` flag, it will (attempt to) fix up
420+
`@include` directives. Specifically, if a pipeline in an `mro` source file
421+
uses a stage, it will ensure that the file defining that stage is _directly_
422+
included, and that files which are not directly depended on are not included.
423+
It will only add `@include` statements referring to files in the root of your
424+
`MROPATH` or in the transitive closure of the existing includes. The reason
425+
for the convention of direct inclusions is the same as the reasons explained
426+
in the [clang include-what-you-use][] tool - briefly, if a file you depend
427+
on stops depending on another file, and you only included it transitively,
428+
then your pipeline will fail to compile if the intermediate pipeline removes
429+
its `@include`.
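
The direct-inclusion convention can be illustrated with a short Python model. This is not how `mrf --includes` is implemented; it only shows the rule that a file should include exactly the files that define the names it uses. The function names, the `defined_in` index, and the `FIND_DUPLICATES` and `HELPER` names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def direct_includes(used: Set[str],
                    defined_in: Dict[str, str],
                    this_file: str) -> Set[str]:
    """Files that must be included directly: those defining a used name."""
    return {defined_in[name] for name in used if defined_in[name] != this_file}


def fix_includes(existing: Set[str],
                 used: Set[str],
                 defined_in: Dict[str, str],
                 this_file: str) -> Tuple[List[str], List[str]]:
    """Return (includes to add, includes to remove) for `this_file`."""
    needed = direct_includes(used, defined_in, this_file)
    return sorted(needed - existing), sorted(existing - needed)


# Hypothetical index of which file defines which stage or pipeline.
defined_in = {
    "SORT_ITEMS": "_sorting_stages.mro",
    "FIND_DUPLICATES": "_my_stages.mro",
    "HELPER": "_util.mro",
    "DUPLICATE_FINDER": "pipeline.mro",
}

# pipeline.mro calls SORT_ITEMS and FIND_DUPLICATES, but currently includes
# _my_stages.mro (which happens to include _sorting_stages.mro) and _util.mro.
to_add, to_remove = fix_includes(
    existing={"_my_stages.mro", "_util.mro"},
    used={"SORT_ITEMS", "FIND_DUPLICATES"},
    defined_in=defined_in,
    this_file="pipeline.mro",
)
```

In this model `_sorting_stages.mro` is added because it was only reachable transitively, and `_util.mro` is dropped because nothing in the file uses it.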
430+
431+
[clang include-what-you-use]: https://github.com/include-what-you-use/include-what-you-use/blob/master/docs/WhyIWYU.md
260432

261433
## Compiling<sup>*</sup> Code
262434
