TaskFarmer

TaskFarmer can be downloaded from its GitHub page.

About

Execute a list of system commands from a task file one-by-one. This allows many simulations to be run within a single mpirun allocation. A new task is launched whenever a process becomes available, hence ensuring 100% utilization of the cores for the duration of the wall time, or until the task file is empty, whichever occurs first. This is useful for running many short simulations on a small number of cores, or to avoid resource wastage when individual simulations have markedly different run times. The task file can be updated dynamically, allowing simulations to be added or deleted as required.

A master-worker type scenario is avoided by exploiting a file lock. This ensures that only one process has access to the task file at any given time.
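
The locking idea can be sketched in Python. This is a minimal illustration rather than TaskFarmer's actual code: it assumes a POSIX system where fcntl.flock is available, it assumes that a claimed task is removed from the top of the task file, and the function name claim_next_task is hypothetical.

```python
import fcntl


def claim_next_task(task_file):
    """Remove and return the first line of task_file under an exclusive lock.

    Returns the command as a string, or None when the file is empty.
    """
    with open(task_file, "r+") as handle:
        # Block until no other process holds the lock on this file.
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            lines = handle.readlines()
            if not lines:
                return None
            # Write back everything except the claimed first line.
            handle.seek(0)
            handle.writelines(lines[1:])
            handle.truncate()
            return lines[0].rstrip("\n")
        finally:
            # Release the lock so the next idle process can claim a task.
            fcntl.flock(handle, fcntl.LOCK_UN)
```

Because every process runs the same claim step, no dedicated master process is needed to hand out work.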

The order of operations is as follows:

A Python implementation is provided in the python/ directory, although it is known to suffer from significant start-up lag on clusters that don't natively support Python shared libraries on their compute nodes.

Installation

A Makefile is included for building and installing TaskFarmer. You will first need to make sure that you have Open MPI installed. If you use an alternative MPI implementation, such as aprun on the Cray Linux Environment (CLE), you will need to change the compiler variable CC in the Makefile or override the variable from the command line.

To compile TaskFarmer, then install the executable and man page:

make
sudo make install

TaskFarmer can be completely removed from your system as follows:

sudo make uninstall

To build TaskFarmer using a different compiler (e.g. Cray):

make CC=cc

Usage

mpirun -np CORES taskfarmer [-h] -f FILE [-v] [-w] [-r] [-s SLEEP_TIME]
  [-m MAX_RETRIES]

TaskFarmer supports the following short- and long-form command-line options:

-h or --help
Print the help message to stdout.
-f or --file <string>
Specify the location of the task file (required).
-v or --verbose
Activate verbose mode (status updates printed to stdout).
-w or --wait-on-idle
Wait for more tasks when idle.
-r or --retry
Retry failed tasks.
-s or --sleep-time <int>
Specify the duration of time to sleep when idle (seconds).
-m or --max-retries <int>
Specify the maximum number of times to retry a failed task.

The behaviour of idle cores can be changed with the --wait-on-idle option. When it is set, a process that cannot find a task to execute sleeps for a specified period, then checks whether more tasks have been added to the task file. The sleep period can be changed with the --sleep-time option; the default is 300 seconds. This cycle continues until the wall time is reached. By default, wait-on-idle is deactivated, meaning that each process exits when the task file is empty.
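
The idle behaviour described above can be sketched as a simple polling loop. This is an illustration only, not TaskFarmer's source: worker_loop, next_task, execute, and the sleep parameter are hypothetical names, and max_idle_polls is an extra cap added so that the sketch can terminate (according to the text, real workers keep polling until the wall time is reached).

```python
import time


def worker_loop(next_task, execute, wait_on_idle=False, sleep_time=300,
                sleep=time.sleep, max_idle_polls=None):
    """Run tasks until none remain, optionally polling for new ones.

    next_task() returns a command or None; execute(command) runs it.
    Returns the number of tasks executed.
    """
    completed = 0
    idle_polls = 0
    while True:
        task = next_task()
        if task is not None:
            execute(task)
            completed += 1
            idle_polls = 0  # Work was found, so reset the idle counter.
            continue
        if not wait_on_idle:
            # Default behaviour: exit as soon as the task file is empty.
            return completed
        if max_idle_polls is not None and idle_polls >= max_idle_polls:
            # Hypothetical cap, present only so the sketch can stop.
            return completed
        # Sleep, then check the task file again.
        sleep(sleep_time)
        idle_polls += 1
```

Passing sleep as a parameter is a design choice for this sketch: it lets the loop be exercised without actually waiting 300 seconds.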

The --retry and --max-retries options allow TaskFarmer to retry failed tasks up to a maximum number of attempts. The default number of retries is 10.
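
A rough sketch of the retry behaviour follows. It rests on two assumptions that the text above does not confirm: that a task counts as failed when it exits with a non-zero status, and that the maximum applies to retries after the first attempt. The function name run_with_retries is hypothetical.

```python
import subprocess


def run_with_retries(command, retry=False, max_retries=10):
    """Run a shell command, rerunning it on failure when retry is set.

    Returns a tuple (exit_code, attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        # Assumption: a non-zero exit status means the task failed.
        exit_code = subprocess.call(command, shell=True)
        retries_used = attempts - 1
        if exit_code == 0 or not retry or retries_used >= max_retries:
            return exit_code, attempts
```

Under these assumptions, a task that always fails is run once without retry, and max_retries + 1 times with it.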

Examples

Try the following:

shuf examples/commands.txt | head -n 100 > tasks.txt && mpirun -np 4 taskfarmer -f tasks.txt

A collection of example PBS and SLURM batch scripts is included in the examples/ directory.

Tips

System commands in the task file should redirect their standard output to a separate log file to avoid littering the standard output of TaskFarmer itself. As an example, the tasks.txt file could contain a command like

echo "Hello, I'm a task" > job.log

with TaskFarmer launched as follows:

mpirun -np 4 taskfarmer -f tasks.txt > tasks.log
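
When a task file holds many commands, it can be convenient to generate it so that each task writes to its own log. The helper below is a minimal sketch; the name write_task_file and the job_NNNN.log naming pattern are illustrative choices, not TaskFarmer conventions.

```python
def write_task_file(path, commands, log_pattern="job_{index:04d}.log"):
    """Write one command per line, each redirecting stdout to its own log.

    Returns the number of tasks written.
    """
    count = 0
    with open(path, "w") as handle:
        for index, command in enumerate(commands):
            log_name = log_pattern.format(index=index)
            # Each task's standard output goes to a separate log file.
            handle.write("{} > {}\n".format(command, log_name))
            count += 1
    return count
```

For example, write_task_file("tasks.txt", ["echo one", "echo two"]) produces a two-line task file whose commands log to job_0000.log and job_0001.log.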

The wc command-line utility is handy for checking the number of remaining tasks in a task file without the need to trawl through any of TaskFarmer's logs. For example, if task files are stored in a directory called task_files then the following command will provide a concise output showing the number of remaining tasks in each file as well as the total.

wc -l task_files/*

Since tasks are read from the task file line-by-line it is possible to introduce dependencies between tasks by placing multiple tasks on a single line separated by semicolons. For example

perform_calculation > data.txt; analyze_data < data.txt
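
Note that a semicolon runs the second command even when the first one fails. If the second step should run only after the first succeeds, the standard shell operator && can be used instead. The sketch below shows the difference by passing a line to the shell; it assumes that task lines are interpreted by a POSIX shell, and run_line is a hypothetical helper.

```python
import subprocess


def run_line(line):
    """Run one task-file line through the shell.

    Returns a tuple (exit_code, captured_stdout).
    """
    result = subprocess.run(line, shell=True, capture_output=True, text=True)
    return result.returncode, result.stdout


# ';' runs the second command regardless of the first command's status.
semicolon_result = run_line("false; echo analyzed")

# '&&' skips the second command when the first one fails.
and_result = run_line("false && echo analyzed")
```

Here semicolon_result still contains the output of the second command, whereas and_result has empty output and a non-zero exit status.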

Words of caution

export OMPI_MCA_mpi_warn_on_fork=0
export OMPI_MCA_btl_openib_want_fork_support=0