3. Hello, World!

This hands-on module walks through the steps of building and running a simple merlin workflow.

Prerequisites

Estimated time

  • 30 minutes

You will learn

  • The components of a merlin workflow specification.

  • How to run a simple merlin workflow.

  • How to interpret the results of your workflow.

3.1. Get Example Files

merlin example is a command line tool that makes it easy to get a basic workflow up and running. To see a list of all the examples provided with merlin, you can run:

$ merlin example list

For this tutorial we will be using the hello example. Run the following commands:

$ merlin example hello
$ cd hello/

This will create a directory called hello and move you into it. The directory contains these files:

  • my_hello.yaml – this spec file is partially blank. You will fill in the gaps as you follow this module’s steps.

  • hello.yaml – this is a complete spec without samples. You can always reference it as an example.

  • hello_samples.yaml – same as before, but with samples added.

  • make_samples.py – this is a small python script that generates samples.

  • requirements.txt – this is a text file listing this workflow’s python dependencies.

3.2. Specification File

Central to Merlin is something called a specification file, or a “spec” for short. The spec defines all aspects of your workflow. The spec is formatted in yaml. If you’re unfamiliar with yaml, it’s worth spending a few minutes reading up on it.

Warning

Stray whitespace can break yaml; make sure your indentation is consistent.

Let’s build our spec piece by piece. For each spec section listed below, fill in the blank yaml entries of my_hello.yaml with the given material.

3.2.1. Section: description

Just what it sounds like. Name and briefly summarize your workflow.

description:
    name: hello world workflow
    description: say hello in 2 languages

3.2.2. Section: global.parameters

Global parameters are constants that you want to vary across simulations. Steps that contain a global parameter, or that depend on other steps that contain one, are run once for each parameter value. The label is the pattern for a filename that will be created for each value.

global.parameters:
    GREET:
        values : ["hello","hola"]
        label  : GREET.%%
    WORLD:
        values : ["world","mundo"]
        label  : WORLD.%%

Note

%% is a special token that defines where the value in the label is placed. In this case the parameter labels will be GREET.hello, GREET.hola, etc. The label can take a custom text format, so long as the %% token is included to be able to substitute the parameter’s value in the appropriate place.

So this will give us 1) an English result, and 2) a Spanish one (you could add as many more languages as you want, as long as both parameters hold the same number of values).
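The expansion rule can be sketched in a few lines of python (an illustration only, not merlin’s actual implementation): values are paired up by index, and the %% token in each label is replaced by the value.

```python
# Illustration only (not merlin internals): parameter values pair up by
# index, and the %% token in each label is replaced with the value.
greet = {"values": ["hello", "hola"], "label": "GREET.%%"}
world = {"values": ["world", "mundo"], "label": "WORLD.%%"}

runs = [
    (greet["label"].replace("%%", g), world["label"].replace("%%", w))
    for g, w in zip(greet["values"], world["values"])
]
print(runs)  # [('GREET.hello', 'WORLD.world'), ('GREET.hola', 'WORLD.mundo')]
```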

3.2.3. Section: study

This is where you define workflow steps. While the convention is to list steps as sequentially as possible, the only factor in determining step order is the dependency directed acyclic graph (DAG) created by the depends field.

study:
    - name: step_1
      description: say hello
      run:
          cmd: echo "$(GREET), $(WORLD)!"

    - name: step_2
      description: print a success message
      run:
          cmd: print("Hurrah, we did it!")
          depends: [step_1]
          shell: /usr/bin/env python3

Note

The - denotes a list item in YAML. To add elements, simply add new elements prefixed with a hyphen.

$(GREET) and $(WORLD) expand the global parameters separately into their two values. $(step_1.workspace) resolves to the path of step_1’s workspace. The default value for shell is /bin/bash. In step_2 we override this to use python instead. Steps must be defined as nodes in a DAG, so no cyclic dependencies are allowed. Our step DAG currently looks like this:

../../_images/dag1.png

Since our global parameters have 2 values, this is actually what the DAG looks like:

../../_images/dag2.png

It looks like running step_2 twice is redundant. Instead, we can collapse it back into a single step by having it wait for both parameterized versions of step_1 to finish: add _* to the end of the step name in step_2’s depends entry. Go from this:

depends: [step_1]

…to this:

depends: [step_1_*]

Now the DAG looks like this:

../../_images/dag3.png
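Listing the dependency edges makes the collapse concrete. This is a sketch only, and the parameterized step names in it are invented for illustration, not merlin’s exact naming.

```python
# Illustration only: with depends: [step_1_*], a single step_2 waits on
# every parameterized copy of step_1. The step names here are invented
# for the sketch, not merlin's exact naming.
param_sets = [("hello", "world"), ("hola", "mundo")]
step_1_copies = [f"step_1/GREET.{g}.WORLD.{w}" for g, w in param_sets]

edges = [(copy, "step_2") for copy in step_1_copies]
for src, dst in edges:
    print(f"{src} -> {dst}")
```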

Your full hello world spec my_hello.yaml should now look like this (an exact match of hello.yaml):

description:
    name: hello
    description: a very simple merlin workflow

global.parameters:
    GREET:
        values : ["hello","hola"]
        label  : GREET.%%
    WORLD:
        values : ["world","mundo"]
        label  : WORLD.%%

study:
    - name: step_1
      description: say hello
      run:
          cmd: echo "$(GREET), $(WORLD)!"

    - name: step_2
      description: print a success message
      run:
          cmd: print("Hurrah, we did it!")
          depends: [step_1_*]
          shell: /usr/bin/env python3

The order of the spec sections doesn’t matter.
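One detail worth pausing on is step_2’s shell key. A rough sketch of the assumed mechanics (simplified, not merlin’s actual code): each step’s cmd is written out to a script file, which is then executed with that step’s shell, so step_2’s print(...) works because it runs under python3.

```python
import os
import subprocess
import tempfile

# Simplified sketch (not merlin's actual code): write the step's cmd to a
# script file, then execute that file with the step's shell.
cmd = 'print("Hurrah, we did it!")'
shell = "/usr/bin/env python3"

with tempfile.NamedTemporaryFile("w", suffix=".step", delete=False) as f:
    f.write(cmd)
    script = f.name

result = subprocess.run(shell.split() + [script], capture_output=True, text=True)
print(result.stdout.strip())  # Hurrah, we did it!
os.unlink(script)
```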

Note

At this point, my_hello.yaml is still maestro-compatible. The primary difference is that maestro won’t understand anything in the merlin block, which we will add later. If you want to try it, run: $ maestro run my_hello.yaml

3.3. Try It!

First, we’ll run merlin locally. On the command line, run:

$ merlin run --local my_hello.yaml

If your spec is free of errors, you should see a few messages proclaiming successful step completion, like this (for now we’ll ignore the warning):

       *
   *~~~~~
  *~~*~~~*      __  __           _ _
 /   ~~~~~     |  \/  |         | (_)
     ~~~~~     | \  / | ___ _ __| |_ _ __
    ~~~~~*     | |\/| |/ _ \ '__| | | '_ \
   *~~~~~~~    | |  | |  __/ |  | | | | | |
  ~~~~~~~~~~   |_|  |_|\___|_|  |_|_|_| |_|
 *~~~~~~~~~~~
   ~~~*~~~*    Machine Learning for HPC Workflows



[2020-02-07 09:35:49: WARNING] Workflow specification missing
 encouraged 'merlin' section! Run 'merlin example' for examples.
Using default configuration with no sampling.
[2020-02-07 09:35:49: INFO] Study workspace is 'hello_20200207-093549'.
[2020-02-07 09:35:49: INFO] Reading app config from file ~/.merlin/app.yaml
[2020-02-07 09:35:49: INFO] Calculating task groupings from DAG.
[2020-02-07 09:35:49: INFO] Converting graph to celery tasks.
[2020-02-07 09:35:49: INFO] Launching tasks.
[2020-02-07 09:35:49: INFO] Executing step 'step1_HELLO.hello' in 'hello_20200207-093549/step1/HELLO.hello'...
[2020-02-07 09:35:54: INFO] Step 'step1_HELLO.hello' in 'hello_20200207-093549/step1/HELLO.hello' finished successfully.
[2020-02-07 09:35:54: INFO] Executing step 'step2_HELLO.hello' in 'hello_20200207-093549/step2/HELLO.hello'...
[2020-02-07 09:35:59: INFO] Step 'step2_HELLO.hello' in 'hello_20200207-093549/step2/HELLO.hello' finished successfully.
[2020-02-07 09:35:59: INFO] Executing step 'step1_HELLO.hola' in 'hello_20200207-093549/step1/HELLO.hola'...
[2020-02-07 09:36:04: INFO] Step 'step1_HELLO.hola' in 'hello_20200207-093549/step1/HELLO.hola' finished successfully.
[2020-02-07 09:36:04: INFO] Executing step 'step2_HELLO.hola' in 'hello_20200207-093549/step2/HELLO.hola'...
[2020-02-07 09:36:09: INFO] Step 'step2_HELLO.hola' in 'hello_20200207-093549/step2/HELLO.hola' finished successfully.

Great! But what happened? We can inspect the output directory to find out.

Look for a directory named hello_<TIMESTAMP>. That’s your output directory. Within, there should be a directory for each step of the workflow, plus one called merlin_info. The whole file tree looks like this:

../../_images/merlin_output.png

A lot of stuff, right? Here’s what it means:

  • The 3 yaml files inside merlin_info/ are called the provenance specs. They are copies of the original spec that was run, some showing under-the-hood variable expansions.

  • MERLIN_FINISHED files indicate that the step ran successfully.

  • .sh files contain the command for the step.

  • .out files contain the step’s stdout. Look at one of these, and it should contain your “hello” message.

  • .err files contain the step’s stderr. Hopefully empty, and useful for debugging.
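Since every successful step leaves a MERLIN_FINISHED marker, a quick way to check a run is to walk the workspace looking for them. Here’s a small helper sketch (this is not a merlin command, and the workspace name in the comment is just an example):

```python
import os

def finished_steps(workspace):
    """Return the directories under `workspace` that contain a
    MERLIN_FINISHED marker, i.e. the steps that completed."""
    done = []
    for root, _dirs, files in os.walk(workspace):
        if "MERLIN_FINISHED" in files:
            done.append(root)
    return sorted(done)

# e.g. finished_steps("hello_20200207-093549")
```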

3.4. Run Distributed!

Important

Before trying this, make sure you’ve properly set up your merlin config file app.yaml. Run $ merlin info for information on your merlin configuration.

Now we will run the same workflow, but in parallel on our task server:

$ merlin run my_hello.yaml

If your merlin configuration is set up correctly, you should see something like this:

       *
   *~~~~~
  *~~*~~~*      __  __           _ _
 /   ~~~~~     |  \/  |         | (_)
     ~~~~~     | \  / | ___ _ __| |_ _ __
    ~~~~~*     | |\/| |/ _ \ '__| | | '_ \
   *~~~~~~~    | |  | |  __/ |  | | | | | |
  ~~~~~~~~~~   |_|  |_|\___|_|  |_|_|_| |_|
 *~~~~~~~~~~~
   ~~~*~~~*    Machine Learning for HPC Workflows



[2020-02-07 13:06:23: WARNING] Workflow specification missing
 encouraged 'merlin' section! Run 'merlin example' for examples.
Using default configuration with no sampling.
[2020-02-07 13:06:23: INFO] Study workspace is 'studies/simple_chain_20200207-130623'.
[2020-02-07 13:06:24: INFO] Reading app config from file ~/.merlin/app.yaml
[2020-02-07 13:06:25: INFO] broker: amqps://user:******@broker:5671//user
[2020-02-07 13:06:25: INFO] backend: redis://user:******@backend:6379/0
[2020-02-07 13:06:25: INFO] Calculating task groupings from DAG.
[2020-02-07 13:06:25: INFO] Converting graph to celery tasks.
[2020-02-07 13:06:25: INFO] Launching tasks.

That means we have launched our tasks! Now we need to launch the workers that will complete those tasks. Run this:

$ merlin run-workers my_hello.yaml

Here’s the expected merlin output message for running workers:

       *
   *~~~~~
  *~~*~~~*      __  __           _ _
 /   ~~~~~     |  \/  |         | (_)
     ~~~~~     | \  / | ___ _ __| |_ _ __
    ~~~~~*     | |\/| |/ _ \ '__| | | '_ \
   *~~~~~~~    | |  | |  __/ |  | | | | | |
  ~~~~~~~~~~   |_|  |_|\___|_|  |_|_|_| |_|
 *~~~~~~~~~~~
   ~~~*~~~*    Machine Learning for HPC Workflows



[2020-02-07 13:14:38: INFO] Launching workers from 'hello.yaml'
[2020-02-07 13:14:38: WARNING] Workflow specification missing
 encouraged 'merlin' section! Run 'merlin example' for examples.
Using default configuration with no sampling.
[2020-02-07 13:14:38: INFO] Starting celery workers
[2020-02-07 13:14:38: INFO] ['celery worker -A merlin  -n default_worker.%%h -l INFO -Q merlin']

Immediately after that, this will pop up:

 -------------- celery@worker_name.%machine770 v4.4.0 (cliffs)
--- ***** -----
-- ******* ---- Linux-3.10.0-1062.9.1.1chaos.ch6.x86_64-x86_64-with-redhat-7.7-Maipo 2020-02-12 09:53:10
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app:         merlin:0x2aaab20619e8
- ** ---------- .> transport:   amqps://user:**@server:5671//user
- ** ---------- .> results:     redis://user:**@server:6379/0
- *** --- * --- .> concurrency: 36 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> merlin           exchange=merlin(direct) key=merlin


[tasks]
  . merlin.common.tasks.add_merlin_expanded_chain_to_chord
  . merlin.common.tasks.expand_tasks_with_samples
  . merlin.common.tasks.merlin_step
  . merlin:chordfinisher
  . merlin:queue_merlin_study

[2020-02-12 09:53:11,549: INFO] Connected to amqps://user:**@server:5671//user
[2020-02-12 09:53:11,599: INFO] mingle: searching for neighbors
[2020-02-12 09:53:12,807: INFO] mingle: sync with 2 nodes
[2020-02-12 09:53:12,807: INFO] mingle: sync complete
[2020-02-12 09:53:12,835: INFO] celery@worker_name.%machine770 ready.

You may not see all of the INFO logs listed above after the Celery banner is displayed. If you’d like to see them, you can change the merlin workers’ log level with the --worker-args flag:

$ merlin run-workers --worker-args "-l INFO" my_hello.yaml

The terminal you ran workers in is now being taken over by Celery, the powerful task queue library that merlin uses internally. The workers will continue to report their task status here until their tasks are complete.

Workers are persistent, even after work is done. Send a stop signal to all your workers with this command:

$ merlin stop-workers

…and a successful worker stop will look like this, with the names of the specific worker(s) reported:

$ merlin stop-workers


       *
   *~~~~~
  *~~*~~~*      __  __           _ _
 /   ~~~~~     |  \/  |         | (_)
     ~~~~~     | \  / | ___ _ __| |_ _ __
    ~~~~~*     | |\/| |/ _ \ '__| | | '_ \
   *~~~~~~~    | |  | |  __/ |  | | | | | |
  ~~~~~~~~~~   |_|  |_|\___|_|  |_|_|_| |_|
 *~~~~~~~~~~~
   ~~~*~~~*    Machine Learning for HPC Workflows



[2020-03-06 09:20:08: INFO] Stopping workers...
[2020-03-06 09:20:08: INFO] Reading app config from file .merlin/app.yaml
[2020-03-06 09:20:08: INFO] broker: amqps://user:******@server:5671//user
[2020-03-06 09:20:08: INFO] backend: redis://mlsi:******@server:6379/0
all_workers: ['celery@default_worker.%machine']
spec_worker_names: []
workers_to_stop: ['celery@default_worker.%machine']
[2020-03-06 09:20:10: INFO] Sending stop to these workers: ['celery@default_worker.%machine']

3.5. Using Samples

It’s a little boring to say “hello world” in just two different ways. Let’s instead say hello to many people!

To do this, we’ll need samples. Specifically, we’ll change WORLD from a global parameter to a sample. While parameters are static, samples are generated dynamically, and can be more complex data types. In this case, WORLD will go from being “world” or “mundo” to being a randomly-generated name.

First, we remove the global parameter WORLD so it does not conflict with our new sample. Parameters now look like this:

global.parameters:
    GREET:
        values : ["hello", "hola"]
        label  : GREET.%%

Now add these yaml sections to your spec:

env:
    variables:
        N_SAMPLES: 3

This makes N_SAMPLES into a user-defined variable that you can use elsewhere in your spec.

merlin:
    samples:
        generate:
            cmd: python3 $(SPECROOT)/make_samples.py --filepath=$(MERLIN_INFO)/samples.csv --number=$(N_SAMPLES)
        file: $(MERLIN_INFO)/samples.csv
        column_labels: [WORLD]

This is the merlin block, a feature exclusive to merlin. It provides a way to generate samples for your workflow. In this case, a sample is the name of a person.

For simplicity we give column_labels the name WORLD, just like before.

It’s also important to note that $(SPECROOT) and $(MERLIN_INFO) are reserved variables. The $(SPECROOT) variable is a shorthand for the directory path of the spec file and the $(MERLIN_INFO) variable is a shorthand for the directory holding the provenance specs and sample generation results. More information on Merlin variables can be found on the variables page.
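The way file and column_labels fit together can be sketched as follows. Assumed behavior: one sample per CSV row, with the row’s columns bound to the listed labels; the three names below are made-up stand-ins for make_samples.py output.

```python
import csv
import io

# Illustration only: each row of samples.csv becomes one sample, and each
# name in column_labels is bound to the matching column of that row. The
# names below are invented stand-ins for make_samples.py output.
samples_csv = "Dana\nRiver\nAlex\n"
column_labels = ["WORLD"]

samples = {
    f"{i:02d}": dict(zip(column_labels, row))
    for i, row in enumerate(csv.reader(io.StringIO(samples_csv)))
}
print(samples)  # {'00': {'WORLD': 'Dana'}, '01': {'WORLD': 'River'}, '02': {'WORLD': 'Alex'}}
```

The two-digit keys here mirror the numerically-named sample directories you’ll see later in the study workspace.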

It’s good practice to move larger chunks of code into external scripts. In the same directory as your spec, you need a file called make_samples.py with the contents below (if you fetched the example files with merlin example, it already exists):

import argparse

import names
import numpy as np

# argument parsing
parser = argparse.ArgumentParser(description="Make some samples (names of people).")
parser.add_argument("--number", type=int, help="the number of samples you want to make")
parser.add_argument("--filepath", type=str, help="output file")
args = parser.parse_args()

# sample making: draw random first names from the data shipped with `names`
all_names = np.loadtxt(names.FILES["first:female"], dtype=str, usecols=0)
selected_names = np.random.choice(all_names, size=args.number)

# write the names to the output file, one per line
with open(args.filepath, "w") as f:
    f.write("\n".join(selected_names))

Since our environment variable N_SAMPLES is set to 3, this sample-generating command will churn out 3 randomly chosen names.
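Stripped of its external packages, the heart of the script is just “pick N names at random.” Here is a self-contained sketch; the name pool is invented, standing in for the names package’s data:

```python
import random

# Invented stand-in for the name data shipped with the `names` package.
name_pool = ["Dana", "River", "Alex", "Sam", "Noor"]
n_samples = 3

# Like np.random.choice, random.choice samples with replacement, so
# duplicate names are possible.
selected = [random.choice(name_pool) for _ in range(n_samples)]
print("\n".join(selected))
```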

Before we can run this, we must install the script’s external python library dependencies (names: a simple package that generates random names, and numpy: a scientific computing package):

$ pip3 install -r requirements.txt

Here’s our DAG with samples:

../../_images/dag4.png

Here’s your new and improved my_hello.yaml, which now should match hello_samples.yaml:

description:
    name: hello_samples
    description: a very simple merlin workflow, with samples

env:
    variables:
        N_SAMPLES: 3

global.parameters:
    GREET:
        values : ["hello","hola"]
        label  : GREET.%%

study:
    - name: step_1
      description: say hello
      run:
          cmd: echo "$(GREET), $(WORLD)!"

    - name: step_2
      description: print a success message
      run:
          cmd: print("Hurrah, we did it!")
          depends: [step_1_*]
          shell: /usr/bin/env python3

merlin:
    samples:
        generate:
            cmd: python3 $(SPECROOT)/make_samples.py --filepath=$(MERLIN_INFO)/samples.csv --number=$(N_SAMPLES)
        file: $(MERLIN_INFO)/samples.csv
        column_labels: [WORLD]

Run the workflow again!

Once finished, this is what the insides of step_1 look like:

../../_images/merlin_output2.png

  • Numerically-named directories like 00, 01, and 02 are sample directories. Instead of storing sample output in a single flattened location, merlin stores it in a tree-like sample index, which helps get around file system constraints when working with massive amounts of data.

Lastly, let’s flex merlin’s muscle a bit and scale the workflow up to 1000 samples. To do this, you could edit the spec and change the value of N_SAMPLES from 3 to 1000. Or you could just run this:

$ merlin run my_hello.yaml --vars N_SAMPLES=1000
$ merlin run-workers my_hello.yaml
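Conceptually, --vars just overrides values defined in the spec before the study is expanded. A sketch of that idea (not merlin’s actual parser):

```python
# Illustration only: a KEY=VALUE pair from --vars replaces the value
# defined in the spec's env.variables block.
spec_vars = {"N_SAMPLES": 3}
cli_overrides = ["N_SAMPLES=1000"]

for item in cli_overrides:
    key, _, value = item.partition("=")
    spec_vars[key] = int(value) if value.isdigit() else value

print(spec_vars)  # {'N_SAMPLES': 1000}
```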

Once again, to send a warm stop signal to your workers, run:

$ merlin stop-workers

Congratulations! You concurrently greeted 1000 friends in English and Spanish!