SGE – Parallel environment (PE)

A parallel environment (PE) is a central notion of SGE: it is a set of settings that tells Grid Engine how to start, stop, and manage jobs run by the class of queues that uses this PE. It also sets some parameters for the parallel messaging framework (such as MPI) used by parallel jobs.


Show the list of all PEs that have been created:

qconf -spl

The usual syntax applies:

  • Show a PE                       qconf -sp <PE name>
  • Modify a PE                     qconf -mp <PE name>
  • Create a new PE from a file     qconf -Ap ./my-PE-template.txt
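
Before changing anything, it can be handy to dump every existing PE definition to its own file; a minimal sketch (the /tmp destination is just an example):

#!/bin/sh
# Sketch: save each PE definition to a separate file for backup/inspection
for pe in `qconf -spl`; do
   qconf -sp $pe > /tmp/$pe.pe
done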

The parallel environment is a defining characteristic of each queue and needs to be specified correctly for the queue to work. It is specified in the pe_list attribute, which can contain a single PE or a list of PEs. For example:

pe_list               make mpi mpi_fill_up
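
For instance, to attach an existing PE to a queue's pe_list and verify the change (the PE name mpi and queue name all.q are just examples):

# add the PE "mpi" to the pe_list of queue "all.q"
qconf -aattr queue pe_list mpi all.q
# confirm that it is now listed
qconf -sq all.q | grep pe_list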

Each parallel environment determines a class of queues that use it. Its most important attributes are:

  1. slots – the maximum number of job slots that the parallel environment is allowed to occupy at once
  2. allocation_rule – see the man page. $pe_slots allocates all slots for the job on a single host; other rules allow the job to be scheduled across multiple machines.
  3. control_slaves – when set to TRUE, Grid Engine takes care of starting the slave MPI tasks. In this case the MPI library should be built with SGE support (for example, Open MPI's --with-sge configure option).
  4. job_is_first_task – can be set to TRUE or FALSE. A value of TRUE indicates that the Sun Grid Engine job script already contains one of the tasks of the parallel application (the number of slots reserved for the job is the number of slots requested with the -pe switch), while a value of FALSE indicates that the job script (and its child processes) is not part of the parallel program (the number of slots reserved for the job is the number of slots requested with the -pe switch + 1). If wallclock accounting is used (execd_params ACCT_RESERVED_USAGE and/or SHARETREE_RESERVED_USAGE set to TRUE) and control_slaves is set to FALSE, the job_is_first_task parameter also influences the accounting for the job: a value of TRUE means that accounting for CPU and requested memory gets multiplied by the number of slots requested with the -pe switch; if job_is_first_task is set to FALSE, the accounting information gets multiplied by the number of slots + 1. (See the sketch after this list for how this setting limits the number of slave tasks that may be started.)
  5. accounting_summary – this parameter is only checked if control_slaves (see above) is set to TRUE, i.e., when Sun Grid Engine creates the slave tasks of a parallel application via sge_execd(8) and sge_shepherd(8); in this case, accounting information is available for every single slave task started by Sun Grid Engine. The accounting_summary parameter can be set to TRUE or FALSE. A value of TRUE indicates that only a single accounting record is written to the accounting(5) file, containing the accounting summary of the whole job including all slave tasks, while a value of FALSE indicates that an individual accounting(5) record is written for every slave task, as well as for the master task.

    Note: When running tightly integrated jobs with SHARETREE_RESERVED_USAGE set and accounting_summary enabled in the parallel environment, reserved usage will only be reported by the master task of the parallel job. No per-task usage records will be sent from execd to qmaster, which can significantly reduce the load on qmaster when running large tightly integrated parallel jobs.
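
The following sketch illustrates the slot arithmetic behind items 3 and 4 above (the PE name mpi and the request of 4 slots are hypothetical):

#   qsub -pe mpi 4 job.sh
#
# job_is_first_task TRUE  -> the job script itself counts as one of the tasks,
#                            so at most 3 additional qrsh -inherit slave tasks
#                            may be started
# job_is_first_task FALSE -> the job script is only a launcher, so all 4 slots
#                            may be used for qrsh -inherit slave tasks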

Some important details are well explained in the blog post Configuring a New Parallel Environment

By templedf, DanT’s Grid Blog, Mar 23, 2007

Since this seems to be a regular topic on the user mailing list, here’s a quick guide to setting up a parallel environment on Grid Engine:

  1. First, create/borrow/steal the startup and shutdown scripts for the parallel environment you’re using. You can find MPI and PVM scripts in the $SGE_ROOT/mpi and $SGE_ROOT/pvm directories, respectively. If you cannot find scripts for your parallel environment, you’ll have to create them. The startup script must prepare the parallel environment for use. With most MPI implementations, that’s just a matter of creating a “machines” file that lists the machines which are to run the parallel job. The shutdown script must clean up after the parallel job’s execution. The MPI shutdown script can just delete or rename the “machines” file.
  2. Next, you have to tell Grid Engine about your parallel environment. You can do that interactively with qmon or qconf -ap <pe_name>, or you can write the data to a file and use qconf -Ap <file_name>. For an example of what such a file would look like, see $SGE_ROOT/mpi/mpi.template or $SGE_ROOT/pvm/pvm.template.

    Let’s look at what the parallel environment configuration contains.

    pe_name           template
    slots             0
    user_lists        NONE
    xuser_lists       NONE
    start_proc_args   /bin/true
    stop_proc_args    /bin/true
    allocation_rule   $pe_slots
    control_slaves    FALSE
    job_is_first_task FALSE
    urgency_slots     min
    • pe_name – the name by which the parallel environment will be known to Grid Engine
    • slots – the maximum number of job slots that the parallel environment is allowed to occupy at once
    • user_lists – an ACL specifying the users who are allowed to use the parallel environment. If set to NONE, any user can use it
    • xuser_lists – an ACL specifying the users who are not allowed to use the parallel environment. Users in both user_lists and xuser_lists are not allowed to use the parallel environment
    • start_proc_args – the path to the startup script for the parallel environment followed by any needed arguments. Grid Engine provides some inline variables that you can use as arguments:
      • $pe_hostfile – the path to a file written by Grid Engine which contains information about how and where the parallel job should be run
      • $host – the host on which the parallel environment is being started
      • $job_owner – the name of the user who owns the parallel job
      • $job_id – the id of the parallel job
      • $job_name – the name of the parallel job
      • $pe – the name of the parallel environment
      • $pe_slots – the number of job slots assigned to the job
      • $queue – the name of the queue in which the parallel job is running

      The value of this setting is the command that will be run to start the parallel environment for every parallel job.

    • stop_proc_args – the path to the shutdown script for the parallel environment followed by any needed arguments. The same inline variables are available as with start_proc_args.
    • allocation_rule – this setting controls how job slots are assigned to hosts. It can have four possible values:
      • a number – if set to a number, Grid Engine will assign that many slots to the parallel job on each host until the assigned number of job slots is met. Setting this attribute to 1, for example, would mean that the job gets a single job slot on each host where it is assigned. Grid Engine will not assign the job more job slots than the number of assigned hosts multiplied by this attribute’s value.
      • $fill_up – use all of the job slots on a given host before moving to the next host
      • $round_robin – select one slot from each host in a round-robin fashion until all job slots are assigned. This setting can result in more than one job slot per host.
      • $pe_slots – place all the job slots on a single machine. Grid Engine will only schedule such a job to a machine that can host the maximum number of slots requested by the job. (See below.)
    • control_slaves – this setting tells Grid Engine whether the parallel environment integration is “tight” or “loose”. See your parallel environment’s documentation for more details.
    • job_is_first_task – this setting tells Grid Engine whether the first task of the parallel job is actually a job task or whether it’s just there to kick off the rest of the jobs. This setting is also determined by your parallel environment integration.
    • urgency_slots – this setting affects how resource requests influence job priority for parallel jobs. The values can be “min,” “max,” “avg,” or a number. For more information about resource-based job priorities, see this white paper

    For more information about these settings, see the sge_pe man page.
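
    As an illustration of the $pe_hostfile variable mentioned above, each line of that file describes one host granted to the job: the host name, the number of slots granted on that host, the queue instance, and a processor range (the exact trailing fields can vary between Grid Engine versions; the PeHostfile2MachineFile helper borrowed from startmpi.sh further down this page relies only on the first two fields). A sketch of its contents for a 4-slot job spread over two hypothetical hosts:

    node01 2 all.q@node01 UNDEFINED
    node02 2 all.q@node02 UNDEFINED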

  3. The next step is to enable your parallel environment for the queues where it should be available. You can add the parallel environment to a queue interactively with qmon or qconf -mq <queue> or in a single action with qconf -aattr queue pe_list <pe_name> <queue>.
  4. Now you’re ready to test your parallel environment. Run qsub -pe <pe_name> <slots>. Aside from the usual output and error files (<job_name>.o<job_id> and <job_name>.e<job_id>, respectively), you should also look for the parallel environment startup output and error files, <job_name>.po<job_id> and <job_name>.pe<job_id>.
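
    Before wiring in a real parallel framework, a minimal smoke-test job can confirm that the PE grants slots as expected. A sketch, assuming the standard environment variables Grid Engine sets for parallel jobs ($NSLOTS, $NHOSTS, $PE_HOSTFILE):

    #!/bin/sh
    #$ -S /bin/sh
    # print what the parallel environment handed to the job
    echo "granted $NSLOTS slots on $NHOSTS hosts"
    echo "pe_hostfile contents:"
    cat $PE_HOSTFILE

    Submit it with something like qsub -pe <pe_name> 2 test_pe.sh and check the usual output file for the slot count and host list.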

That’s all there is to it! Just to make sure we’re clear on everything, let’s do an example. Let’s create a parallel environment that starts up an RMI registry and stores the port number in a file so that the job can find it.

First thing we have to do is write the startup and shutdown scripts for the RMI parallel environment. Here’s what they look like:

rmi_startup.sh

#!/bin/sh
# $TMPDIR and $JOB_ID are set by Grid Engine automatically

# Borrowed from $SGE_ROOT/mpi/startmpi.sh
PeHostfile2MachineFile()
{
   cat $1 | while read line; do
      host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
      nslots=`echo $line|cut -f2 -d" "`
      i=1

      while [ $i -le $nslots ]; do
         echo $host
         i=`expr $i + 1`
      done
   done
}

# get arguments
pe_hostfile=$1

# script name, used in error messages below
me=`basename $0`

# ensure pe_hostfile is readable
if [ ! -r $pe_hostfile ]; then
   echo "$me: can't read $pe_hostfile" >&2
   exit 1
fi

# create machines file
machines="$TMPDIR/machines"
PeHostfile2MachineFile $pe_hostfile >> $machines

# We use ports 40000-40999
port=`expr \\( $JOB_ID % 1000 \\) + 40000`

# Start the registry
/usr/java/bin/rmiregistry $port &

# Save the registry's PID so that we can stop it later
echo $! > $TMPDIR/pid

# Save the port number so the job can find it
echo $port > $TMPDIR/port

rmi_shutdown.sh

#!/bin/sh
# $TMPDIR is set by Grid Engine automatically

# Get the registry's PID
pid=`cat $TMPDIR/pid`

# Kill the registry
kill $pid

# Clean up the files the startup script created
rm $TMPDIR/pid
rm $TMPDIR/port
rm $TMPDIR/machines

Next thing we have to do is add our parallel environment to Grid Engine. First we create a file, say /tmp/rmi_pe, with the following contents:

pe_name           rmi
slots             4
user_lists        NONE
xuser_lists       NONE
start_proc_args   /home/dant/rmi_startup.sh $pe_hostfile
stop_proc_args    /home/dant/rmi_shutdown.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

Note that control_slaves is TRUE and job_is_first_task is FALSE. Because we’re writing the integration scripts, the choice is somewhat arbitrary, but it affects how the job scripts must be written, as we’ll see below. It also affects whether the qmaster is able to keep accounting records on the slave tasks: if control_slaves is FALSE, the qmaster has no record of how many resources the slave tasks consumed.

Now we add the parallel environment with qconf -Ap /tmp/rmi_pe. We could have skipped a step by running qconf -ap rmi and entering the data in the editor that comes up, but the way we’ve done it here is scriptable.

The next step is to add our parallel environment to our queue with qconf -aattr queue pe_list rmi all.q. Again, we could have run qconf -mq all.q and edited the pe_list attribute in the editor, but the way we’ve done it is scriptable.

Last thing to do is test out our parallel environment. First we need a job script:

#!/bin/sh
#$ -S /bin/sh

port=`cat $TMPDIR/port`
qrsh=$SGE_ROOT/bin/$ARC/qrsh

cat $TMPDIR/machines | while read host; do
   $qrsh -inherit $host /usr/bin/java -cp ~/rmi.jar RMIApp $port &
done

Let’s look at this job script for a moment. The first thing to notice is the use of qrsh -inherit. The -inherit switch is specifically for kicking off slave tasks. It requires that the target host name be supplied. In order to get the target host name, we read the machines file that the startup script generated from the one Grid Engine supplied.

The second thing to notice is how ugly the use of qrsh -inherit is. RMI is not really a parallel environment. It’s a communications framework. It doesn’t do the work of kicking off remote processes for you. So, instead, we have to do it ourselves in the job script. With a true parallel environment, like any of the MPI flavors, the framework also takes care of starting the remote processes, often through rsh. In the MPI scripts included with Grid Engine, an rsh wrapper script is included, which transparently replaces calls to rsh with calls to qrsh -inherit. By using that wrapper script, the parallel environment’s calls to rsh can be rerouted through the grid via qrsh without having to modify the parallel environment itself to work with Grid Engine.

The last thing to notice is how this script correlates to the control_slaves and job_is_first_task attributes of the parallel environment configuration. Let’s start with job_is_first_task. In our configuration, we set it to false. That means that this master job script is not counted as one of the job’s tasks and does no real work. That is why our script doesn’t do anything but kick off sub-tasks. If job_is_first_task had been true, our job script would be expected to run one of the RMIApp instances itself.

Now let’s talk about the control_slaves attribute. If control_slaves is true, we are allowed to use qrsh -inherit to kick off our sub-tasks. The qmaster will not, however, allow us to kick off more subtasks than the number of slots we’ve been assigned (minus 1 if job_is_first_task is true). The advantage of using qrsh -inherit is that the sub-tasks are tracked by Grid Engine like regular jobs. If control_slaves is false, we have to use some mechanism external to Grid Engine, such as rsh or ssh, to kick off our sub-tasks, meaning that Grid Engine cannot track them and is actually fully unaware of them. That’s why job_is_first_task is meaningless when control_slaves is false.

In order to test our job we need a Java application called RMIApp. As that’s outside the scope of the example, let’s just pretend we have a parallel Java application that uses the RMI registry for inter-process communication. To submit our job we use qsub -pe rmi 2-4 rmi_job.sh. The -pe rmi 2-4 argument tells the qmaster that we’re using the rmi parallel environment and we want 4 job slots assigned to our job, but we will accept as few as 2. Because our job script starts a sub-task for every entry in the machines file, it will start the right number of sub-tasks, no matter how many slots we are assigned. Had we written the job script to start exactly two sub-tasks, we would have to use -pe rmi 2 so that we could be sure we got exactly two job slots.

While the job is running, run qstat -f. You’ll see output something like this:

% qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@ultra20                  BIP   4/10      0.08     sol-amd64
    253 0.55500 rmi_job.sh dant         r     03/23/2007 11:46:51     4

From this output we can see that the job has been scheduled and has been assigned four job slots. Those four job slots only account for the four sub-tasks. The master job itself is not counted because the job_is_first_task attribute is false.
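
To see the individual tasks rather than the aggregate slot count, the extended task view is useful; assuming a tightly integrated PE, the qrsh -inherit sub-tasks show up as slave entries alongside the master:

qstat -g t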

After our job completes, if we look in our home directory (which is where Grid Engine will put the output files since we didn’t tell it otherwise), we will find four new files: rmi_job.sh.e253, rmi_job.sh.o253, rmi_job.sh.pe253, and rmi_job.sh.po253, assuming, of course, that our job was number 253. The *.o253 and *.e253 files should be familiar. They’re the output and error streams from the job script. The *.po253 and *.pe253 files are new. They’re the output and error streams from the parallel environment startup and shutdown scripts.

So, there you have it. A complete, top-to-bottom example of creating, configuring, and running a parallel environment.

MPI jobs and SGE parallel environment

For most parallel jobs, including those using OpenMP and Intel MPI, an SGE parallel environment needs to be specified correctly.

For MPI you should set "job_is_first_task FALSE" and "control_slaves TRUE".

This PE acts as glue ensuring that SGE and the parallel, i.e., multi-process, program play nicely together. Two parameters are especially important:

  • control_slaves  This parameter can be set to TRUE or FALSE (the default). It indicates whether Sun Grid Engine is the creator of the slave tasks of a parallel application via sge_execd(8) and sge_shepherd(8) and thus has full control over all processes in a parallel application, which enables capabilities such as resource limitation and correct accounting.

    To gain control over the slave tasks of a parallel application, set it to TRUE. Examples of appropriate PE configurations are available through your Sun Grid Engine support. In this case you also need to make the hostfile (which contains the list of nodes on which tasks are to run in parallel) accessible on all nodes, for example by placing it on NFS.

    Set the control_slaves parameter to FALSE for all other PE interfaces. The following shows an example of a tightly integrated MPI PE:

    qconf -sp mpi

    pe_name mpi
    slots 999
    user_lists NONE
    xuser_lists NONE
    start_proc_args NONE
    stop_proc_args NONE
    allocation_rule $fill_up
    control_slaves TRUE
    job_is_first_task FALSE
    urgency_slots min
    accounting_summary TRUE

See the sge_pe(5) man page:

man sge_pe
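
For illustration, with a tightly integrated PE such as the mpi example above and an MPI library built with Grid Engine support (for Open MPI, the --with-sge configure option), the job script can usually be a minimal sketch like the following; the job name, slot request, and program path are placeholders:

#!/bin/sh
#$ -S /bin/sh
#$ -N mpi_test
#$ -cwd
#$ -pe mpi 16

# With control_slaves TRUE the MPI launcher detects the Grid Engine
# environment and starts the ranks on the granted hosts via qrsh -inherit.
# "./my_mpi_program" is a placeholder for the real binary.
mpirun -np $NSLOTS ./my_mpi_program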

Allocation rule

The allocation rule is interpreted by sge_schedd(8) and helps the scheduler decide how to distribute parallel processes among the available machines. If, for instance, a parallel environment is built for shared-memory applications only, all parallel processes have to be assigned to a single machine. If, however, the parallel environment follows the distributed-memory paradigm, an even distribution of processes among machines may be favorable. The current version of the scheduler understands only the following allocation rules:

  1. integer_number: An integer number fixing the number of processes per host. If the number is 1, all processes have to reside on different hosts.
  2. $pe_slots: If the special denominator $pe_slots is used, the full range of processes as specified with the qsub(1) -pe switch has to be allocated on a single host.
  3. $fill_up: Starting from the best suitable host/queue, all available slots are allocated. Further hosts and queues are “filled up” as long as a job still requires slots for parallel tasks.
  4. $round_robin: The allocation scheme walks through suitable hosts in a best-suitable-first order.
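
To make the difference concrete, consider a hypothetical request for 8 slots on a cluster of three hosts with 4 free slots each (host names and numbers are purely illustrative):

# qsub -pe mpi 8 job.sh      (node01, node02, node03; 4 free slots each)
#
# allocation_rule 2            -> needs 2 slots on each of 4 hosts; with only
#                                 3 hosts available the job has to wait
# allocation_rule $pe_slots    -> needs all 8 slots on one host; impossible
#                                 with 4 slots per host, so the job has to wait
# allocation_rule $fill_up     -> 4 slots on node01, then 4 slots on node02
# allocation_rule $round_robin -> slots dealt out one at a time across hosts,
#                                 e.g. 3 on node01, 3 on node02, 2 on node03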

Typical operations

Show the available parallel environments

qconf -spl

Show a particular parallel environment:

qconf -sp mpi
pe_name           mpi
slots             96
user_lists        NONE
xuser_lists       NONE
start_proc_args   /opt/sge/n1ge6/mpi/startmpi.sh $pe_hostfile
stop_proc_args    /opt/sge/n1ge6/mpi/stopmpi.sh
allocation_rule   $round_robin
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min
# qconf -sp ms
pe_name            ms
slots              256
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

In most cases the PE make is present by default; it is created during SGE installation:

# qconf -spl
make
ms
# qconf -sp make
pe_name make
slots 999
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE

Add a new parallel environment:

Interactively

qconf -ap mpi180

From a file:

qconf -Ap /home/sgeadmin/config/mpi_fill_up.pe
root@merlin00 added "mpi_fill_up" to parallel environment list
qconf -spl
mpi
mpi_fill_up

From a template (an existing PE definition can serve as one):

# qconf -sp mpi
pe_name            mpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE

The second example shows that you can specify start-up and stop procedures:

qconf -sp mpi_fill_up
pe_name           mpi_fill_up
slots             9999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /opt/sge/mpi/startmpi.sh $pe_hostfile
stop_proc_args    /opt/sge/mpi/stopmpi.sh
allocation_rule   $fill_up
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min

Useful commands

Also see the ‘Managing Special Environment’ section in the Administration Guide from sun.com if you need more details about PE configuration.

  1. qconf -Ap <config_file> – add a new PE from the file config_file;
  2. qconf -spl – view all PEs currently available;
  3. qconf -sp <PE_name> – list particular PE;
  4. qconf -dp <PE_name> – remove a PE;
  5. qconf -mp <PE_name> – modify an existing PE.
  6. qconf -sql – to see all queues available;
  7. qconf -mq <queue_name> – to modify the queue’s settings.
  8. qconf -sq <queue_name> – list queue.
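
When removing a PE it is safer to detach it from every queue that references it first; a sketch (the PE name mpi180 and queue all.q are examples):

# remove the PE from the queue's pe_list, then delete the PE itself
qconf -dattr queue pe_list mpi180 all.q
qconf -dp mpi180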
