# The Specification File

At the core of Merlin is the specification (spec) file. This file defines how Merlin should create and execute your workflows, and it also serves as a record of your studies.

Merlin supports several blocks in the spec file, each with its own purpose:
| Block Name | Required? | Description |
| --- | --- | --- |
| `description` | Yes | General information about the study |
| `env` | No | Fixed constants and other values that are globally set and referenced |
| `global.parameters` | No | Parameters that are user varied and applied to the workflow |
| `batch` | No | Settings for submission to batch systems |
| `study` | Yes | Steps that the study is composed of, executed in a defined order |
| `merlin` | No | Worker settings and sample generation handling |
| `user` | No | YAML anchor definitions |
This module will go into detail on every block and the properties available within each.
## The `description` Block

Since Merlin is built as an extension of Maestro, most of the behavior of the `description` block is inherited directly from Maestro. Therefore, we recommend reading Maestro's documentation on the `description` block for the most accurate description of how it should be used.

There is one difference between Merlin and Maestro when it comes to the `description` block: the use of variables. With Merlin, the `description` block can use variables defined in the `env` block.
### Using Variables in the `description` Block
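For example, a study name defined once in the `env` block can be reused in the `description` block. This sketch mirrors the full specification at the end of this page:

```yaml
env:
    variables:
        NAME: feature_demo

description:
    name: $(NAME)  # expands to "feature_demo"
    description: Run 10 hello worlds.
```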
## The `env` Block

Since Merlin is built as an extension of Maestro, the behavior of the `env` block is inherited directly from Maestro. Therefore, we recommend reading Maestro's documentation on the `env` block for the most accurate description of how it should be used.
For more information on how variables defined in this block can be used, check out the Variables page (specifically the Token Syntax, User Variables, and Environment Variables sections).
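As a sketch of its most common use, the `variables` subsection of the `env` block defines user variables that later blocks reference with `$(...)` token syntax (the values here are illustrative, patterned after the full specification at the end of this page):

```yaml
env:
    variables:
        OUTPUT_PATH: ./studies  # where study workspaces will be written
        N_SAMPLES: 10           # referenced elsewhere as $(N_SAMPLES)
```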
## The `global.parameters` Block

Since Merlin is built as an extension of Maestro, the behavior of the `global.parameters` block is inherited directly from Maestro. Therefore, we recommend reading Maestro's documentation on the `global.parameters` block for the most accurate description of how it should be used.

It would also be a good idea to read through Specifying Study Parameters from Maestro, which goes into further detail on how to use parameters in your study. There you will also find details on how to programmatically generate parameters using `pgen`.
### A Basic `global.parameters` Block
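A basic `global.parameters` block, patterned after the full specification at the end of this page, lists each parameter's `values` along with a `label` in which `%%` is replaced by the parameter value:

```yaml
global.parameters:
    X2:
        values: [0.5]
        label: X2.%%     # produces the label "X2.0.5"
    N_NEW:
        values: [10]
        label: N_NEW.%%  # produces the label "N_NEW.10"
```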
## The `batch` Block

**Warning:** Although the `batch` block exists in both Maestro and Merlin spec files, this block differs slightly in Merlin.

**Tip:** This block is frequently used in conjunction with the `LAUNCHER` and `VLAUNCHER` variables.

The `batch` block is an optional block for specifying HPC scheduler information, so you can write steps that are decoupled from particular machines and are therefore more portable and reusable. Below are the base properties for this block.
| Property Name | Required? | Type | Description |
| --- | --- | --- | --- |
| `bank` | Yes | str | Account to charge computing time to |
| `dry_run` | No | bool | Execute a dry run of the study |
| `launch_args` | No | str | Extra arguments for the parallel launch command |
| `launch_pre` | No | str | Any configuration needed before the scheduler launch command (`srun`, `jsrun`, etc.) |
| `nodes` | No | int | The number of nodes to use for all workers. This can be overridden in the `resources` property of the `merlin` block. If this is unset, the number of nodes will be queried from the environment; failing that, it will be set to 1. |
| `queue` | Yes | str | Scheduler queue/partition to submit jobs (study steps) to |
| `shell` | No | str | Optional path to the shell to use for execution (default: `/bin/bash`) |
| `type` | Yes | str | Type of scheduler managing execution. One of: `local`, `flux`, `slurm`, `lsf`, `pbs` |
| `walltime` | No | str | The total walltime of the batch allocation (`hh:mm:ss`, `mm:ss`, or `ss`) |
| `worker_launch` | No | str | Override the parallel launch defined in Merlin |
If using `flux` as your batch type, there are a couple more properties that you can define here:
| Property Name | Type | Description |
| --- | --- | --- |
| `flux_exec` | str | Optional flux exec command to launch workers on all nodes if `flux_exec_workers` is True |
| `flux_exec_workers` | bool | Optional flux argument to launch workers on all nodes |
| `flux_path` | str | Optional path to the flux bin |
| `flux_start_opts` | str | Optional flux start options |
Below are examples of different scheduler setups. The only required keys in each case are `type`, `queue`, and `bank`.
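As a minimal sketch, a Slurm setup needs only the three required keys, while a Flux setup might add one of the optional Flux-specific properties (the queue and bank names are illustrative, matching those used in later examples on this page; the flux path is a hypothetical placeholder):

```yaml
# A minimal Slurm setup
batch:
    type: slurm
    queue: pbatch
    bank: baasic
---
# A Flux setup, adding an optional Flux-specific property
batch:
    type: flux
    queue: pbatch
    bank: baasic
    flux_path: /path/to/flux/bin  # hypothetical path
```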
## The `study` Block

**Warning:** Although the `study` block exists in both Maestro and Merlin spec files, this block differs slightly in Merlin.

The `study` block is where the steps to be executed in the Merlin study are defined. The steps defined here will ultimately create the DAG that's executed by Merlin.

This block represents the unexpanded set of steps that the study is composed of. Here, unexpanded means no parameter or sample substitution; the steps only contain references to the parameters and/or samples. Steps are given as a list (`-` prefixed) of properties:
| Property Name | Required? | Type | Description |
| --- | --- | --- | --- |
| `name` | Yes | str | Unique name for identifying and referring to a step |
| `description` | Yes | str | A general description of what this step is intended to do |
| `run` | Yes | dict | Properties that describe the actual specification of the step |
### The `run` Property

The `run` property contains several subproperties that define what a step does and how it relates to other steps. This is where you define the concrete shell commands the task needs to execute, the step dependencies that dictate the topology of the DAG, and any `parameter`, `env`, or `sample` tokens to inject.
| Property Name | Required? | Type | Description |
| --- | --- | --- | --- |
| `cmd` | Yes | str | The actual commands to be executed for this step |
| `depends` | No | List[str] | List of other steps which must successfully execute before this step can be executed |
| `max_retries` | No | int | The maximum number of retries allowed for this step |
| `restart` | No | str | Similar to `cmd`, providing optional alternate commands to run upon restarting, e.g. after a scheduler timeout |
| `retry_delay` | No | int | The time in seconds to delay a retry by |
| `shell` | No | str | The shell to execute `cmd` in (e.g. `/bin/bash`, `/usr/bin/env`, `python`) (default: `/bin/bash`) |
| `task_queue` | No | str | The name of the task queue to assign this step to. Workers watch task queues to find tasks to execute. (default: `merlin`) |
**Example**

```yaml
study:
    - name: create_data
      description: Use a python script to create some data
      run:
          cmd: | # (1)
              echo "Creating data..."
              python create_data.py --outfile data.npy
              echo "Data created"
          restart: | # (2)
              echo "Restarted the data creation..."
              python create_data.py --outfile data_restart.npy
              echo "Data created upon restart"
          max_retries: 3 # (3)
          retry_delay: 5 # (4)
          task_queue: create # (5)
    - name: transpose_data
      description: Use python to transpose the data
      run:
          cmd: | # (6)
              import numpy as np
              import os
              data_file = "$(create_data.workspace)/data.npy"
              if not os.path.exists(data_file):
                  data_file = "$(create_data.workspace)/data_restart.npy"
              initial_data = np.load(data_file)
              transposed_data = np.transpose(initial_data)
              np.save("$(WORKSPACE)/transposed_data.npy", transposed_data)
          shell: /usr/bin/env python3 # (7)
          task_queue: transpose
          depends: [create_data] # (8)
```
1. The `|` character allows the `cmd` to become a multi-line string.
2. The `restart` command will be run if the initial execution of `cmd` exits with a `$(MERLIN_RESTART)` return code.
3. Only allow this step to retry itself 3 times.
4. Delay by 5 seconds on each retry.
5. All tasks created by this step will be sent to the `create` queue. They will live in this queue on the broker until a worker picks them up for execution.
6. This step uses two variables, `$(create_data.workspace)` and `$(WORKSPACE)`. They point to `create_data`'s output directory and `transpose_data`'s output directory, respectively. Read the section on Reserved Variables for more information on these and other variables.
7. Setting our shell to be `python3` allows us to write Python in the `cmd` rather than bash scripting.
8. Since this step depends on the `create_data` step, it will not be run until `create_data` finishes processing.
There are also a few optional properties for describing resource requirements to pass to the scheduler and the associated `$(LAUNCHER)` tokens used to execute applications on HPC systems.
| Property Name | Required? | Type | Description |
| --- | --- | --- | --- |
| `batch` | No | dict | Override the `batch` block for this step |
| `nodes` | No | int | Number of nodes to reserve for executing this step; primarily used by `$(LAUNCHER)` expansion |
| `procs` | No | int | Number of processors needed for step execution; primarily used by `$(LAUNCHER)` expansion |
| `walltime` | No | str | Specifies the maximum amount of time to reserve HPC resources for |
**Example**

```yaml
batch:
    type: flux
    queue: pbatch
    bank: baasic

study:
    - name: create_data
      description: Use a python script to create some data
      run:
          cmd: |
              echo "Creating data..."
              $(LAUNCHER) python create_data.py # (1)
              echo "Data created"
          nodes: 2
          procs: 4
          walltime: "30:00"
          task_queue: create
```
1. The `$(LAUNCHER)` token here will be expanded to `flux run -N 2 -n 4 -t 1800.0s`.
Additionally, there are scheduler specific properties that can be used. The sections below will highlight these properties.
### Slurm Specific Properties
Merlin supports the following properties for Slurm:
| Property Name | Equivalent `srun` Option | Type | Description | Default |
| --- | --- | --- | --- | --- |
| `cores per task` | `-c`, `--cpus-per-task` | int | Number of cores to use for each task | 1 |
| `reservation` | `--reservation` | str | Reservation to schedule this step to; overrides the `batch` block | None |
| `slurm` | N/A | str | Verbatim flags only for the `srun` parallel launch. This will be expanded as follows for steps that use `LAUNCHER` or `VLAUNCHER`: `srun -N <nodes> -n <procs> ... <slurm>` | None |
**Example**

The following example will run `example_slurm_step` with the Slurm-specific options `cores per task` and `slurm`. This tells Merlin that this step needs 2 nodes, 4 cores per task, and should begin at noon.

```yaml
batch:
    type: slurm
    queue: pbatch
    bank: baasic

study:
    - name: example_slurm_step
      description: A step using slurm specific options
      run:
          cmd: |
              $(LAUNCHER) python3 do_something.py
          nodes: 2
          cores per task: 4
          slurm: --begin noon
```

Here, `$(LAUNCHER)` will become `srun -N 2 -c 4 --begin noon`.
### Flux Specific Properties

Merlin supports the following Flux properties:

| Property Name | Equivalent `flux run` Option | Type | Description | Default |
| --- | --- | --- | --- | --- |
| `cores per task` | `-c`, `--cores-per-task` | int | Number of cores to use for each task | 1 |
| `gpus per task` | `-g`, `--gpus-per-task` | int | Number of gpus to use for each task | 0 |
| `flux` | N/A | str | Verbatim flags for the flux parallel launch. This will be expanded as follows for steps that use `LAUNCHER` or `VLAUNCHER`: `flux run ... <flux>` | None |
**Example**

The following example will run `example_flux_step` with the Flux-specific options `cores per task` and `gpus per task`. This tells Merlin that this step needs 2 nodes, 4 cores per task, and 1 gpu per task.

```yaml
batch:
    type: flux
    queue: pbatch
    bank: baasic

study:
    - name: example_flux_step
      description: A step using flux specific options
      run:
          cmd: |
              $(LAUNCHER) python3 do_something.py
          nodes: 2
          cores per task: 4
          gpus per task: 1
```

Here, `$(LAUNCHER)` will become `flux run -N 2 -c 4 -g 1`.
### LSF Specific Properties

Merlin supports the following properties for LSF:

| Property Name | Equivalent `jsrun` Option | Type | Description | Default |
| --- | --- | --- | --- | --- |
| `bind` | `-b`, `--bind` | str | Flag for MPI binding of tasks on a node | `rs` |
| `cores per task` | `-c`, `--cpu_per_rs` | int | Number of cores to use for each task | 1 |
| `exit_on_error` | `-X`, `--exit_on_error` | int | Flag to exit on error. A value of 1 enables this and 0 disables it. | 1 |
| `gpus per task` | `-g`, `--gpu_per_rs` | int | Number of gpus to use for each task | 0 |
| `num resource set` | `-n`, `--nrs` | int | Number of resource sets. The `nodes` property will set this same flag for LSF, so only use one or the other. | 1 |
| `launch_distribution` | `-d`, `--launch_distribution` | str | The distribution of resources | `plane:{procs/nodes}` |
| `lsf` | N/A | str | Verbatim flags only for the LSF parallel launch. This will be expanded as follows for steps that use `LAUNCHER` or `VLAUNCHER`: `jsrun ... <lsf>` | None |
**Example**

The following example will run `example_lsf_step` with the LSF-specific options `exit_on_error` and `bind`. This tells Merlin that this step needs 2 nodes, should not exit on error, and should not use any binding.

```yaml
batch:
    type: lsf
    queue: pbatch
    bank: baasic

study:
    - name: example_lsf_step
      description: A step using lsf specific options
      run:
          cmd: |
              $(LAUNCHER) python3 do_something.py
          nodes: 2
          exit_on_error: 0
          bind: none
```

Here, `$(LAUNCHER)` will become `jsrun -N 2 -X 0 -b none`.
## The `merlin` Block

The `merlin` block is where you can customize Celery workers and generate samples to be used throughout the workflow.

This block is split into two main properties:

| Property Name | Required? | Type | Description |
| --- | --- | --- | --- |
| `resources` | No | dict | Define the task server configuration and the workers to run the tasks |
| `samples` | No | dict | Define samples to be referenced in your study steps |

Both of these properties have multiple subproperties, so we'll take a deeper dive into each one below.
### Resources

**Note:** Currently the only task server that Merlin supports is Celery.

The `resources` property of the `merlin` block allows users to customize the task server configuration and create custom workers to run tasks. This property has the following subproperties:

| Property Name | Required? | Type | Description |
| --- | --- | --- | --- |
| `task_server` | No | str | The type of task server to use. Currently `celery` is the only option. (default: `celery`) |
| `overlap` | No | bool | Flag to determine if multiple workers can pull tasks from overlapping queues. (default: `False`) |
| `workers` | No | List[dict] | A list of worker definitions |
The `workers` subproperty is where you can create custom workers to process your workflow. The keys that you provide under this property will become the names of your custom workers. For example, a `merlin` block whose `workers` subproperty contains the keys `data_creation_worker` and `data_transpose_worker` will create two workers with those names.
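A minimal sketch of such a `workers` definition (the worker and step names here follow the "Custom Worker for Each Step" example later in this section):

```yaml
merlin:
    resources:
        workers:
            data_creation_worker:
                steps: [create_data]      # watch the create_data step's queue
            data_transpose_worker:
                steps: [transpose_data]   # watch the transpose_data step's queue
```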
Each worker can be customized with the following settings:

| Setting Name | Type | Description |
| --- | --- | --- |
| `args` | str | Arguments to provide to the worker. Check out Configuring Celery Workers and/or Celery's worker options for more info on what can go here. |
| `batch` | dict | Override the main `batch` config for this worker. This setting is useful if other workers are running flux, but some component of the workflow requires the native scheduler or cannot run under flux. |
| `machines` | List[str] | A list of machines on which to run the steps provided in the `steps` setting. A full `OUTPUT_PATH` and the `steps` argument are both required for this setting, and currently all machines in the list must have access to the `OUTPUT_PATH`. Note that you'll need an allocation on any machine that you list here. |
| `nodes` | int | Number of nodes for this worker to run on. (defaults to all nodes on your allocation) |
| `steps` | List[str] | A list of step names for this worker to "watch". The worker will actually be watching the `task_queue` associated with the steps listed here. (default: `[all]`) |
#### Custom Worker for Each Step

This example showcases how to define custom workers that watch different steps in your workflow. Here, `data_creation_worker` will execute tasks created from the `create_data` step that are sent to the `create` queue, and `data_transpose_worker` will execute tasks created from the `transpose_data` step that are sent to the `transpose` queue.

We're also showing how to vary worker arguments using some of the most common arguments for workers.
```yaml
study:
    - name: create_data
      description: Use a python script to create some data
      run:
          cmd: |
              echo "Creating data..."
              python create_data.py
              echo "Data created"
          task_queue: create # (1)
    - name: transpose_data
      description: Use python to transpose the data
      run:
          cmd: |
              import numpy as np
              initial_data = np.load("$(create_data.workspace)/data.npy")
              transposed_data = np.transpose(initial_data)
              np.save("$(WORKSPACE)/transposed_data.npy", transposed_data)
          shell: /usr/bin/env python3
          task_queue: transpose
          depends: [create_data]

merlin:
    resources:
        workers:
            data_creation_worker:
                args: -l INFO --concurrency 4 --prefetch-multiplier 1 -O fair # (2)
                steps: [create_data] # (3)
            data_transpose_worker:
                args: -l INFO --concurrency 1 --prefetch-multiplier 1
                steps: [transpose_data]
```
1. The name of the queue for this step is important, as that is where the tasks required to execute this step will be stored on the broker until a worker (in this case `data_creation_worker`) pulls the tasks and executes them.
2. The arguments here can be broken down as follows:
    - `-l` sets the log level.
    - `--concurrency` sets the number of worker processes to spin up on each node that this worker is running on (Celery's default is to set `--concurrency` to the number of CPUs on your node). More info on this can be found in Celery's concurrency documentation.
    - `--prefetch-multiplier` sets the number of messages to prefetch at a time, multiplied by the number of concurrent processes (Celery's default is to set `--prefetch-multiplier` to 4). More info on this can be found in Celery's prefetch multiplier documentation.
    - `-O fair` sets the scheduling algorithm to be fair. This aims to distribute tasks more evenly based on the current workload of each worker.
3. Here we tell `data_creation_worker` to watch the `create_data` step. What this actually means is that `data_creation_worker` will monitor the `task_queue` associated with the `create_data` step, which in this case is `create`. Any tasks sent to the `create` queue will be pulled and executed by `data_creation_worker`.
#### Custom Workers to Run Across Multiple Machines

This example showcases how to define custom workers that can run on multiple machines. Here, we're assuming that both machines, `quartz` and `ruby`, have access to our `OUTPUT_PATH`.
```yaml
env:
    variables:
        OUTPUT_PATH: /path/to/shared/filespace/
        CONCURRENCY: 1

merlin:
    resources:
        workers:
            cross_machine_worker:
                args: -l INFO --concurrency $(CONCURRENCY) # (1)
                machines: [quartz, ruby] # (2)
                nodes: 2 # (3)
```
1. Variables can be used within worker customization. They can even be used to name workers!
2. This worker will be able to start on both `quartz` and `ruby`, so long as you have an allocation on both and execute `merlin run-workers` from both machines.
3. This worker will only start on 2 nodes of our allocation.
### Samples

The `samples` property of the `merlin` block allows users to generate, store, and create references to samples that can be used throughout a workflow.

This property comes with several subproperties to assist with the handling of samples:

| Property Name | Type | Description |
| --- | --- | --- |
| `column_labels` | List[str] | The names of the samples stored in `file`. This is how you reference samples in your workflow using token syntax. |
| `file` | str | The name of the samples file where your samples are stored. Must be a `.npy`, `.csv`, or `.tab` file. |
| `generate` | dict | Properties that describe how the samples should be generated |
| `level_max_dirs` | int | The number of sample output directories to generate at each level in the sample hierarchy of a step. See the "Modifying The Hierarchy Structure" example in The Sample Hierarchy section for an example of how this is used. |
Currently, the `generate` property has only one subproperty:

| Property Name | Type | Description |
| --- | --- | --- |
| `cmd` | str | The command to execute that will generate the samples |
#### Basic Sample Generation & Usage

```yaml
study:
    - name: echo_samples
      description: Echo the values of our samples
      run:
          cmd: echo "var1 - $(VAR_1) ; var2 - $(VAR_2)" # (1)

merlin:
    samples:
        generate:
            cmd: spellbook make-samples -n 25 -outfile=$(MERLIN_INFO)/samples.npy # (2)
        file: $(MERLIN_INFO)/samples.npy # (3)
        column_labels: [VAR_1, VAR_2] # (4)
```

1. Samples are referenced in steps using token syntax.
2. Generate 25 samples using Merlin Spellbook.
3. Tell Merlin where the samples file is stored.
4. Label the samples so that we can use them in our study with token syntax.
## The `user` Block

**Warning:** Any anchors/aliases you wish to use must be defined before you use them. For instance, if you want to use an alias in your `study` block, then the `user` block containing the anchor definition must come before the `study` block in your spec file.

**Tip:** This block is especially useful if you have a large chunk of code that's re-used in multiple steps.

The `user` block allows other variables in the workflow file to be propagated through to the workflow. This block uses YAML Anchors and Aliases; an anchor defines a chunk of configuration, and its alias is used to refer to that specific chunk of configuration elsewhere.

To define an anchor, use the `&` syntax. For example, the following user block will define an anchor `python3_run`. The `python3_run` anchor creates a shorthand for running a simple print statement in Python 3:
```yaml
user:
    python3:
        run: &python3_run
            cmd: |
                print("OMG is this in python3?")
            shell: /usr/bin/env python3
```
You can reference an anchor by utilizing the `<<: *` syntax to refer to its alias. Continuing with the example above, the following study block will reference the `python3_run` anchor:
```yaml
study:
    - name: python3_hello
      description: do something in python
      run:
          <<: *python3_run
          task_queue: pyth3_q
```
Here we're merging the anchored `run` value with the existing values of `run`. Therefore, this step will be expanded to:
```yaml
study:
    - name: python3_hello
      description: do something in python
      run:
          cmd: |
              print("OMG is this in python3?")
          shell: /usr/bin/env python3
          task_queue: pyth3_q
```
Notice that the existing `task_queue` value was not overridden.
## Full Specification

Below is a full YAML specification file for Merlin. To fully understand what's going on in this example spec file, see the Feature Demo page.
```yaml
description:
    name: $(NAME)
    description: Run 10 hello worlds.

batch:
    type: local

env:
    variables:
        OUTPUT_PATH: ./studies
        N_SAMPLES: 10
        WORKER_NAME: demo_worker
        VERIFY_QUEUE: default_verify_queue
        NAME: feature_demo

        SCRIPTS: $(MERLIN_INFO)/scripts
        HELLO: $(SCRIPTS)/hello_world.py
        FEATURES: $(SCRIPTS)/features.json

user:
    study:
        run:
            hello: &hello_run
                cmd: |
                    python3 $(HELLO) -outfile hello_world_output_$(MERLIN_SAMPLE_ID).json $(X0) $(X1) $(X2)
                max_retries: 1
    python3:
        run: &python3_run
            cmd: |
                print("OMG is this in python?")
                print("Variable X2 is $(X2)")
            shell: /usr/bin/env python3
    python2:
        run: &python2_run
            cmd: |
                print "OMG is this in python2? Change is bad."
                print "Variable X2 is $(X2)"
            shell: /usr/bin/env python2

study:
    - name: hello
      description: |
          process a sample with hello world
      run:
          <<: *hello_run
          task_queue: hello_queue
    - name: collect
      description: |
          process the output of the hello world samples, extracting specific features;
      run:
          cmd: |
              echo $(MERLIN_GLOB_PATH)
              echo $(hello.workspace)
              ls $(hello.workspace)/X2.$(X2)/$(MERLIN_GLOB_PATH)/hello_world_output_*.json > files_to_collect.txt
              spellbook collect -outfile results.json -instring "$(cat files_to_collect.txt)"
          depends: [hello_*]
          task_queue: collect_queue
    - name: translate
      description: |
          process the output of the hello world samples some more
      run:
          cmd: spellbook translate -input $(collect.workspace)/results.json -output results.npz -schema $(FEATURES)
          depends: [collect]
          task_queue: translate_queue
    - name: learn
      description: |
          train a learner on the results
      run:
          cmd: spellbook learn -infile $(translate.workspace)/results.npz
          depends: [translate]
          task_queue: learn_queue
    - name: make_new_samples
      description: |
          make a grid of new samples to pass to the predictor
      run:
          cmd: spellbook make-samples -n $(N_NEW) -sample_type grid -outfile grid_$(N_NEW).npy
          task_queue: make_samples_queue
    - name: predict
      description: |
          make a new prediction from new samples
      run:
          cmd: spellbook predict -infile $(make_new_samples.workspace)/grid_$(N_NEW).npy -outfile prediction_$(N_NEW).npy -reg $(learn.workspace)/random_forest_reg.pkl
          depends: [learn, make_new_samples]
          task_queue: predict_queue
    - name: verify
      description: |
          if learn and predict succeeded, output a dir to signal study completion
      run:
          cmd: |
              if [[ -f $(learn.workspace)/random_forest_reg.pkl && -f $(predict.workspace)/prediction_$(N_NEW).npy ]]
              then
                  touch FINISHED
                  exit $(MERLIN_SUCCESS)
              else
                  exit $(MERLIN_SOFT_FAIL)
              fi
          depends: [learn, predict]
          task_queue: $(VERIFY_QUEUE)
    - name: python3_hello
      description: |
          do something in python
      run:
          <<: *python3_run
          task_queue: pyth3_q
    - name: python2_hello
      description: |
          do something in python2, because change is bad
      run:
          <<: *python2_run
          task_queue: pyth2_hello

global.parameters:
    X2:
        values: [0.5]
        label: X2.%%
    N_NEW:
        values: [10]
        label: N_NEW.%%

merlin:
    resources:
        task_server: celery
        overlap: False
        workers:
            $(WORKER_NAME):
                args: -l INFO --concurrency 3 --prefetch-multiplier 1 -Ofair
    samples:
        generate:
            cmd: |
                cp -r $(SPECROOT)/scripts $(SCRIPTS)
                spellbook make-samples -n $(N_SAMPLES) -outfile=$(MERLIN_INFO)/samples.npy
        # can be a file glob of numpy sample files.
        file: $(MERLIN_INFO)/samples.npy
        column_labels: [X0, X1]
        level_max_dirs: 25
```