study
This module contains all of the logic for a study.
MerlinStudy
Represents a Merlin study run on a specification. Used for 'merlin run'.
This class manages the execution of a study based on a provided specification file, handling sample data, output paths, workspace management, and the generation of a Directed Acyclic Graph (DAG) for execution.
Attributes:
| Name | Type | Description |
|---|---|---|
| `dag` | `DAG` | Directed acyclic graph representing the execution flow of the study. |
| `dry_run` | `bool` | Flag indicating whether to perform a dry run of the workflow. |
| `expanded_spec` | `MerlinSpec` | The expanded specification after applying overrides. |
| `filepath` | `str` | Path to the desired specification file. |
| `flux_command` | `str` | Command for running flux jobs, if applicable. |
| `info` | `str` | Path to the 'merlin_info' directory within the workspace. |
| `level_max_dirs` | `int` | The number of directories at each level of the sample hierarchy. |
| `no_errors` | `bool` | Flag to ignore some errors for testing purposes. |
| `original_spec` | `MerlinSpec` | The original specification loaded from the filepath. |
| `output_path` | `str` | Path to the output directory for the study. |
| `override_vars` | `Dict[str, Union[str, int]]` | Dictionary of variables to override in the specification. |
| `parameter_labels` | `List[str]` | List of parameter labels used in the study. |
| `pargs` | `List[str]` | Arguments for the parameter generator. |
| `pgen_file` | `str` | Filepath for the parameter generator, if applicable. |
| `restart_dir` | `str` | Filepath to restart the study, if applicable. |
| `sample_labels` | `List[str]` | The column labels of the samples. |
| `samples` | `ndarray` | The samples in the study. |
| `samples_file` | `str` | File to load samples from, if specified. |
| `special_vars` | `Dict[str, str]` | Dictionary of special variables used in the study. |
| `timestamp` | `str` | Timestamp representing the start time of the study. |
| `user_vars` | `Dict[str, str]` | The user-defined variables in the study. |
| `workspace` | `str` | Path to the workspace directory for the study. |
Methods:
| Name | Description |
|---|---|
| `generate_samples` | Executes a command to generate sample data if the sample file is missing. |
| `get_adapter_config` | Builds and returns the adapter configuration dictionary. |
| `get_expanded_spec` | Returns a new YAML spec file with defaults, CLI overrides, and variable expansions. |
| `get_sample_labels` | Retrieves the column labels for the samples. |
| `get_user_vars` | Returns a dictionary of expanded user-defined variables from the specification. |
| `label_clash_error` | Checks for clashes between sample and parameter names. |
| `load_dag` | Generates a Directed Acyclic Graph (DAG) for the study's execution. |
| `load_pgen` | Executes a parameter generator script. |
| `load_samples` | Loads samples from disk or generates them if the file does not exist. |
| `write_original_spec` | Copies the original specification to the 'merlin_info' directory. |
Source code in merlin/study/study.py
expanded_spec
cached
property
Determines, writes to YAML, and loads into memory an expanded specification.
This property handles the expansion of the study's specification based on the original specification and any provided environment variables. If the study is being restarted, it retrieves the previously expanded specification without re-expanding it. Otherwise, it processes the original specification, expands any tokens or shell references, and updates paths accordingly.
Returns:
| Type | Description |
|---|---|
| `MerlinSpec` | The expanded specification object. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the expanded name for the workspace contains invalid characters for a filename. |
flux_command
cached
property
Returns the full path to the flux command based on the specified workflow configuration.
This property constructs the command to execute the flux binary. If a
flux_path is provided in the expanded specification's batch configuration,
it will use that path to create the full command. Otherwise, it defaults
to the standard 'flux' command.
Returns:
| Type | Description |
|---|---|
| `str` | The complete command string for executing flux. |
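The fallback described above can be sketched in a few lines. This is a minimal illustration, not the library's code: `build_flux_command` is a hypothetical helper, the batch configuration is modelled as a plain dictionary, and it is an assumption that `flux_path` names the directory containing the binary.

```python
import os


def build_flux_command(batch: dict) -> str:
    """Return the flux executable to invoke for a batch configuration."""
    flux_bin = "flux"  # default: whatever `flux` resolves to on PATH
    flux_path = batch.get("flux_path")
    if flux_path:
        # Assumption: flux_path is the directory that holds the flux binary.
        flux_bin = os.path.join(flux_path, "flux")
    return flux_bin


default_cmd = build_flux_command({"type": "flux"})
custom_cmd = build_flux_command({"type": "flux", "flux_path": "/opt/flux/bin"})
```

Here `default_cmd` is the bare `flux` command, while `custom_cmd` joins the configured directory with the binary name.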
info
cached
property
Creates and returns the path to the 'merlin_info' directory within the study's workspace.
This property checks if a restart directory is specified. If not, it creates the 'merlin_info' directory inside the study's workspace directory. This directory is intended to store metadata and other relevant information related to the study.
Returns:
| Type | Description |
|---|---|
| `str` | The absolute path to the 'merlin_info' directory. |
level_max_dirs
property
Retrieves the maximum number of directory levels for sample organization.
This property checks the expanded specification for the maximum
number of directory levels defined under the 'merlin' section.
If the value is not found, it falls back to a default value
specified in the defaults.SAMPLES dictionary.
Returns:
| Type | Description |
|---|---|
| `int` | The maximum number of directory levels. If the value is not specified in the expanded specification, the default value from |
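A minimal sketch of the lookup-with-fallback described for `level_max_dirs`. The 'merlin' section is modelled as a plain dictionary, `DEFAULT_SAMPLES` is a hypothetical stand-in for `defaults.SAMPLES`, and the value 25 is a placeholder chosen for illustration only.

```python
# Hypothetical stand-in for defaults.SAMPLES; the number is a placeholder.
DEFAULT_SAMPLES = {"level_max_dirs": 25}


def level_max_dirs(merlin_section: dict) -> int:
    """Read level_max_dirs from the 'merlin' section, else use the default."""
    # A spec without samples may store None here, so normalise to a dict.
    samples = merlin_section.get("samples") or {}
    return samples.get("level_max_dirs", DEFAULT_SAMPLES["level_max_dirs"])


from_spec = level_max_dirs({"samples": {"level_max_dirs": 5}})
from_default = level_max_dirs({"samples": None})
```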
output_path
cached
property
Determines and creates an output directory for this study.
This property checks if a restart directory is specified. If so, it validates the existence of the directory and returns its absolute path. If no restart directory is provided, it constructs the output path based on the original specification and any override variables. The output path is expanded to include user-defined variables and environment variables. If the directory does not exist, it is created.
Returns:
| Type | Description |
|---|---|
| `str` | The absolute path to the output directory for the study. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the specified restart directory does not exist. |
parameter_labels
property
Retrieves the parameter labels associated with this study.
This property extracts parameter labels from the expanded specification of the study. It accesses the parameters and their associated metadata, collecting all labels defined for each parameter.
Returns:
| Type | Description |
|---|---|
| `List[str]` | A list of parameter labels used in this study. |
sample_labels
property
Retrieves the labels of the samples associated with this study.
This property extracts the sample labels from the study's expanded specification. It returns a list of labels that correspond to the samples defined in the specification.
Returns:
| Type | Description |
|---|---|
| `List[str]` | A list of sample labels. If no labels are defined, an empty list is returned. |
samples
property
Retrieves the samples associated with this study.
This property checks if there are any samples defined in the expanded specification of the study. If samples are present, it loads and returns them; otherwise, it returns an empty list.
Returns:
| Type | Description |
|---|---|
| `ndarray` | A numpy array of samples corresponding to the study. If no samples are defined, an empty list is returned. |
timestamp
cached
property
Returns a timestamp string representing the time this study began.
This property generates a unique identifier based on the current time when the study is initiated. If a restart directory is specified, it extracts a substring from the directory name as the timestamp. Otherwise, it formats the current time in the 'YYYYMMDD-HHMMSS' format.
Returns:
| Type | Description |
|---|---|
| `str` | A string representing the timestamp of the study's initiation, which can be used as an identifier or unique key. |
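The two branches of `timestamp` can be illustrated as follows. This is a sketch under stated assumptions: `study_timestamp` is a hypothetical helper, and it assumes the restart directory name ends with the 15-character 'YYYYMMDD-HHMMSS' stamp, so the substring taken is the trailing 15 characters.

```python
import time
from typing import Optional


def study_timestamp(restart_dir: Optional[str] = None) -> str:
    """Return a 'YYYYMMDD-HHMMSS' stamp for a new or restarted study."""
    if restart_dir is not None:
        # Assumption: the directory name ends with the original timestamp,
        # which is 8 date digits + '-' + 6 time digits = 15 characters.
        return restart_dir.rstrip("/")[-15:]
    return time.strftime("%Y%m%d-%H%M%S")


restarted = study_timestamp("/tmp/studies/my_study_20240102-030405/")
fresh = study_timestamp()
```

Reusing the stamp on restart keeps the identifier stable across runs of the same workspace.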
user_vars
property
Retrieves the user-defined variables for the study.
This property accesses the original specification of the study and
retrieves the user-defined variables using the get_user_vars
method from this class.
Returns:
| Type | Description |
|---|---|
| `Dict[str, str]` | A dictionary containing the user-defined variables associated with the study. |
workspace
cached
property
Determines, creates, and returns the path to this study's workspace directory.
This property generates a unique workspace directory for the study, which contains subdirectories for each step of the study and a 'merlin_info/' directory. The name of the workspace directory is derived from the original specification name and includes a timestamp to ensure uniqueness. If a restart directory is specified, it validates the existence of the directory and returns its absolute path.
Returns:
| Type | Description |
|---|---|
| `str` | The absolute path to the workspace directory for the study. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the specified restart directory does not exist. |
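One way to picture the workspace naming and restart validation is the sketch below. `workspace_path` is a hypothetical helper; the `<name>_<timestamp>` layout and the replacement of spaces with underscores are assumptions made for illustration, not confirmed details of the implementation.

```python
import os
import tempfile
from typing import Optional


def workspace_path(output_path: str, spec_name: str, timestamp: str,
                   restart_dir: Optional[str] = None) -> str:
    """Resolve (and create, if needed) the workspace directory for a study."""
    if restart_dir is not None:
        # A restart reuses an existing workspace, so it must already exist.
        if not os.path.isdir(restart_dir):
            raise ValueError(f"Restart directory '{restart_dir}' does not exist.")
        return os.path.abspath(restart_dir)
    # Assumption: spaces in the spec name are replaced to keep paths simple.
    dir_name = f"{spec_name.replace(' ', '_')}_{timestamp}"
    path = os.path.abspath(os.path.join(output_path, dir_name))
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as out_dir:
    ws = workspace_path(out_dir, "my study", "20240102-030405")
    created = os.path.isdir(ws)
    ws_name = os.path.basename(ws)
```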
__init__(filepath, override_vars=None, restart_dir=None, samples_file=None, dry_run=False, no_errors=False, pgen_file=None, pargs=None)
Initializes a MerlinStudy object, which represents a study run based on a specification file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `filepath` | `str` | Path to the specification file for the study. | required |
| `override_vars` | `Dict[str, Union[str, int]]` | Dictionary of variables to override in the specification. | `None` |
| `restart_dir` | `str` | Path to the directory for restarting the study. | `None` |
| `samples_file` | `str` | Path to a file containing sample data. If specified, the samples will be loaded from this file. | `None` |
| `dry_run` | `bool` | Flag indicating whether to perform a dry run of the workflow without executing tasks. | `False` |
| `no_errors` | `bool` | Flag to suppress certain errors for testing purposes. | `False` |
| `pgen_file` | `str` | Path to a parameter generator file. | `None` |
| `pargs` | `List[str]` | Arguments for the parameter generator. | `None` |
generate_samples()
Generates sample data by executing the command defined in the 'generate' section of the specification if the sample file does not already exist.
This method checks if the specified sample file exists. If it does not, it retrieves the command from the YAML specification and executes it using a subprocess. The output and error logs from the command execution are saved to files for later review.
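The check-then-run behaviour described above might look roughly like this. It is a sketch only: `generate_samples_sketch` is a hypothetical helper, the log file names `cmd.stdout` and `cmd.stderr` are made up for illustration, and the command is passed as an argument list rather than read from a YAML specification.

```python
import os
import subprocess
import sys
import tempfile
from typing import List


def generate_samples_sketch(sample_file: str, cmd: List[str], log_dir: str) -> bool:
    """Run cmd to create sample_file if it is missing; return True if run."""
    if os.path.exists(sample_file):
        return False  # nothing to do, the samples are already on disk
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Hypothetical log names; the output is kept for later review.
    with open(os.path.join(log_dir, "cmd.stdout"), "w") as out:
        out.write(result.stdout)
    with open(os.path.join(log_dir, "cmd.stderr"), "w") as err:
        err.write(result.stderr)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "samples.txt")
    # Stand-in generator: a one-line Python script that writes the file.
    script = "import sys; open(sys.argv[1], 'w').write('0.1 0.2')"
    command = [sys.executable, "-c", script, target]
    first_run = generate_samples_sketch(target, command, tmp)   # file missing
    second_run = generate_samples_sketch(target, command, tmp)  # file present
    file_exists = os.path.exists(target)
    logs_written = os.path.exists(os.path.join(tmp, "cmd.stdout"))
```

The second call is skipped because the first one already produced the file.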
Example
Here's an example sample generation command:
get_adapter_config(override_type=None)
Builds and returns the adapter configuration dictionary.
This method constructs a configuration dictionary for the adapter
based on the specifications defined in self.expanded_spec.batch.
It ensures that the configuration includes a type, which can be
overridden if specified. The method also checks for a dry run
flag and adds relevant commands if the batch type is set to
"flux".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `override_type` | `str` | An optional string to override the default adapter type. If not provided, the type from the expanded specification will be used. | `None` |
Returns:
| Type | Description |
|---|---|
| `Dict[str, str]` | A dictionary containing the adapter configuration. |
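The steps described for `get_adapter_config` can be sketched as a standalone function. The names here are hypothetical, the batch block is modelled as a plain dictionary, and the `"local"` fallback type is an assumption made so that the sketch always yields a type.

```python
from typing import Optional


def build_adapter_config(batch: dict, dry_run: bool, flux_command: str,
                         override_type: Optional[str] = None) -> dict:
    """Assemble an adapter configuration from a batch block."""
    config = dict(batch)  # copy so the caller's batch block is not mutated
    config.setdefault("type", "local")  # assumption: fallback adapter type
    if override_type is not None:
        config["type"] = override_type
    config.setdefault("dry_run", dry_run)
    if config["type"] == "flux":
        # Only flux adapters need to know how to invoke the flux binary.
        config["flux_command"] = flux_command
    return config


flux_cfg = build_adapter_config({"type": "flux"}, False, "flux")
local_cfg = build_adapter_config({"type": "flux"}, True, "flux", override_type="local")
```

Overriding the type to something other than flux means no `flux_command` entry is added, as `local_cfg` shows.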
get_expanded_spec()
Generates a new YAML specification file with applied defaults, command-line interface (CLI) overrides, and variable expansions.
This method creates a modified version of the original specification by incorporating default values and user-defined overrides from the command line. It also expands user-defined variables and reserved words to produce a fully resolved specification. This is particularly useful for tracking provenance and ensuring that the specification accurately reflects all applied configurations.
Returns:
| Type | Description |
|---|---|
| `MerlinSpec` | A new instance of the |
get_sample_labels(from_spec)
Retrieves the column labels of the samples from the provided specification.
This method checks the specified MerlinSpec
object for sample information and returns the associated column labels if they
exist. If no sample labels are found, an empty list is returned.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `from_spec` | `MerlinSpec` | The specification object from which to extract sample column labels. It is expected to contain a "samples" key within its "merlin" dictionary. | required |
Returns:
| Type | Description |
|---|---|
| `List[str]` | A list of column labels for the samples. If no sample labels are present, an empty list is returned. |
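As a rough illustration of the lookup, the sketch below reads column labels from a "merlin" dictionary. `sample_labels_from` is a hypothetical helper operating on a plain dictionary rather than a `MerlinSpec` object, and the `column_labels` key name is taken from the `load_samples` description.

```python
from typing import List


def sample_labels_from(merlin_section: dict) -> List[str]:
    """Return the sample column labels, or an empty list if none exist."""
    samples = merlin_section.get("samples")
    if samples:
        return list(samples.get("column_labels", []))
    return []


labels = sample_labels_from({"samples": {"column_labels": ["X0", "X1"]}})
no_labels = sample_labels_from({"samples": None})
```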
get_user_vars(spec)
staticmethod
Retrieves and expands user-defined variables from the specification environment.
This static method examines the provided specification's environment
to collect user-defined variables and labels. It constructs a list
of these variables and passes them to the determine_user_variables
function to obtain a dictionary of expanded variables.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `spec` | `MerlinSpec` | The specification object containing the environment from which to extract user-defined variables. The environment should have keys "variables" and/or "labels" that contain the relevant data. | required |
Returns:
| Type | Description |
|---|---|
| `Dict[str, str]` | A dictionary of expanded user-defined variables, where the keys are variable names and the values are their corresponding expanded values. |
label_clash_error()
Detects illegal clashes between Merlin's sample column labels and Maestro's global parameters.
This method checks for any conflicts between the column labels
defined in the merlin section of the original specification and
the global parameters defined in the same specification. If a
column label is found to also exist in the global parameters,
a ValueError is raised to indicate the clash.
Raises:
| Type | Description |
|---|---|
| `ValueError` | If any column label in |
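The clash check amounts to a membership test between two sets of names. The sketch below is illustrative only: `check_label_clash` is a hypothetical function, both inputs are plain Python collections, and the error message wording is invented.

```python
from typing import Dict, Iterable


def check_label_clash(column_labels: Iterable[str], global_parameters: Dict[str, dict]) -> None:
    """Raise ValueError if a sample column label is also a global parameter."""
    for label in column_labels:
        if label in global_parameters:
            raise ValueError(
                f"Column label '{label}' also appears in the global parameters."
            )


# No clash: the sample columns and parameter names are disjoint.
check_label_clash(["X0", "X1"], {"N": {"values": [1, 2], "label": "N.%%"}})

try:
    check_label_clash(["X0", "N"], {"N": {"values": [1, 2], "label": "N.%%"}})
    clash_detected = False
except ValueError:
    clash_detected = True
```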
load_dag()
Generates a Directed Acyclic Graph (DAG) for the execution of
the study and assigns it to the self.dag attribute.
This method constructs a DAG based on the specifications defined
in the expanded study specification. It retrieves the study
environment, steps, and parameters, and initializes a Maestro Study
object. The method then sets up the workspace and environment
for the study, configures it, and generates the DAG using the
Maestro framework.
The generated DAG contains the execution flow of the study, ensuring that all steps are executed in the correct order without cycles.
load_pgen(filepath, pargs, env)
Loads a parameter generator script and creates a dictionary of variable names and their corresponding values.
This method reads a parameter generator script from the specified file path and extracts variable names and values defined within the script. It constructs a dictionary where each key is a variable name, and the value is another dictionary containing the variable's label and its associated values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `filepath` | `str` | The path to the parameter generator script to be loaded. | required |
| `pargs` | `List[str]` | A list of additional arguments to be passed to the parameter generator. If None, an empty list will be used. | required |
| `env` | `StudyEnvironment` | A Maestro | required |
Returns:
| Type | Description |
|---|---|
| `Dict[str, Dict[str, str]]` | A dictionary where each key is a variable name and each value is a dictionary containing: |
load_samples()
Loads the study's samples from disk, generating them if the file does not exist and is defined in the YAML specification.
This method checks if a sample file is specified in the expanded specification. If the file does not exist, it will invoke the generation command defined in the 'generate' section of the specification to create the sample file. Once the file is available, it loads the samples into a NumPy array and assigns them to the variables specified in 'column_labels'.
Returns:
| Type | Description |
|---|---|
| `ndarray` | A NumPy array containing the loaded samples. The shape of the array will be (n_samples, n_features), where n_samples is the number of samples loaded and n_features is the number of features corresponding to the column labels. |
Example
The spec file contents will look something like:
write_original_spec()
Copies the original specification file to the designated directory.
This method copies the original specification file from its
current location to the merlin_info/ directory, renaming it
to '