| SP Parallel Programming Workshop |
| LoadLeveler |
| What Is LoadLeveler? |
| LoadLeveler Overview |
| Basic LoadLeveler Tasks |
| Building a Job Command File |
# The executable is myprog. The stdin, stdout and stderr
# files are named accordingly.
#
#@ executable = myprog
#@ input = myprog.in
#@ output = myprog.out
#@ error = myprog.err
#@ account_no = ABCDE-1234-567
#@ queue
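When many similar runs are needed, a job command file like the one above can be generated from a script instead of being typed by hand. The following is a minimal sketch in POSIX sh and is not part of the original workshop material. The helper name make_jobfile is hypothetical, the account number is the placeholder from the example, and only keywords that appear in the example are used.

```shell
#!/bin/sh
# Sketch: print a serial job command file for a given program name.
# make_jobfile is a hypothetical helper, not a LoadLeveler command.
make_jobfile() {
    prog=$1
    account=${2:-ABCDE-1234-567}   # placeholder account number from the example
    cat <<EOF
# The executable is $prog. The stdin, stdout and stderr
# files are named accordingly.
#
#@ executable = $prog
#@ input = $prog.in
#@ output = $prog.out
#@ error = $prog.err
#@ account_no = $account
#@ queue
EOF
}

# Example use: write the file, then submit it with llsubmit as usual.
# make_jobfile myprog > myprog.cmd
```

The function only writes text to standard output, so it can be tried on any machine, with or without LoadLeveler installed.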
| Submitting a Job |
llsubmit myjob.cmd
submit: 1 job has been submitted, "fr2n02.32.0"
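The job identifier in the confirmation message is what the later commands (llprio, llhold, llcancel) take as an argument, so it can be convenient to capture it in a script. The sketch below assumes the message has exactly the format shown above; the function name job_id_from_submit is hypothetical.

```shell
#!/bin/sh
# Sketch: extract the job identifier from an llsubmit confirmation message.
# Assumes the message format shown above: ... submitted, "<jobid>"
job_id_from_submit() {
    sed -n 's/.*job has been submitted, "\([^"]*\)".*/\1/p'
}

# Sample message copied from the example above, so no LoadLeveler is needed.
msg='submit: 1 job has been submitted, "fr2n02.32.0"'
printf '%s\n' "$msg" | job_id_from_submit
```

In a live session the same filter could be applied as `llsubmit myjob.cmd | job_id_from_submit`, assuming llsubmit writes this message to standard output.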
| Displaying Job Status |
Id           Owner   Submitted   ST PRI Size Class     Running On
------------ ------- ----------- -- --- ---- --------- -----------
fr3n05.14.0  rjz     7/9 09:13   R  50  0.0  large     fr7n09
fr7n05.15.0  bmr3    7/9 10:34   R  50  0.0  large     fr7n15
fr5n09.13.4  ksmith  7/9 11:05   R  50  0.0  medium    fr3n13
fr5n09.13.3  ksmith  7/9 11:05   R  50  0.0  medium    fr7n03
fr5n09.13.2  ksmith  7/9 11:05   P  50  0.0  medium
fr5n09.13.0  ksmith  7/9 11:05   I  50  0.0  medium
fr5n09.13.1  ksmith  7/9 11:05   I  50  0.0  medium
fr2n09.33.1  sokel   7/9 12:19   I  50  0.0  bigmem
fr3n11.02.1  salay   7/9 12:23   I  50  0.0  bigmem
fr3n01.02.1  caldwel 7/8 02:49   H  50  0.0  large

10 jobs in queue 4 waiting, 1 pending, 4 running, 1 held.
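The summary line at the bottom of the listing can be reproduced from the ST column, which is a handy check when filtering the queue by owner or class. The sketch below is an illustration in POSIX sh and awk, not a LoadLeveler feature. It assumes the column layout shown above, where the state is the fifth whitespace-separated field; count_states is a hypothetical name.

```shell
#!/bin/sh
# Sketch: count jobs per state (ST column) in llq-style output.
# Assumes the layout shown above: Id, Owner, date, time, ST, ...
count_states() {
    awk '$1 ~ /^[A-Za-z0-9]+\.[0-9]+\.[0-9]+$/ { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Sample rows copied from the listing above.
count_states <<'EOF'
Id           Owner   Submitted   ST PRI Size Class     Running On
------------ ------- ----------- -- --- ---- --------- -----------
fr3n05.14.0  rjz     7/9 09:13   R  50  0.0  large     fr7n09
fr7n05.15.0  bmr3    7/9 10:34   R  50  0.0  large     fr7n15
fr5n09.13.4  ksmith  7/9 11:05   R  50  0.0  medium    fr3n13
fr5n09.13.3  ksmith  7/9 11:05   R  50  0.0  medium    fr7n03
fr5n09.13.2  ksmith  7/9 11:05   P  50  0.0  medium
fr5n09.13.0  ksmith  7/9 11:05   I  50  0.0  medium
fr5n09.13.1  ksmith  7/9 11:05   I  50  0.0  medium
fr2n09.33.1  sokel   7/9 12:19   I  50  0.0  bigmem
fr3n11.02.1  salay   7/9 12:23   I  50  0.0  bigmem
fr3n01.02.1  caldwel 7/8 02:49   H  50  0.0  large
EOF
```

Only lines whose first field looks like a job identifier (host.number.number) are counted, so the header, the ruler line, and the summary line are skipped.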
| Changing a Job's Priority |
llprio +10 mymachine.23.0
llprio -p 75 mymachine.23.0
| Holding a Job |
llhold mymachine.23.0
llhold -r mymachine.23.0
| Displaying a Machine's Status |
Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys
cws1.mhpcc.edu Down 0 0 Down 0 0.00 22 R6000 AIX41
fr1n05.mhpcc.edu Avail 0 0 Busy 1 0.00 9999 R6000 AIX41
fr1n07.mhpcc.edu Avail 0 0 Idle 0 0.07 9999 R6000 AIX41
fr2n01.mhpcc.edu Avail 0 0 Busy 1 0.00 9999 R6000 AIX41
fr2n05.mhpcc.edu Avail 0 0 Busy 1 1.00 9999 R6000 AIX41
fr9n05.mhpcc.edu Avail 0 0 Idle 0 0.77 9999 R6000 AIX41
fr9n05.mhpcc.edu Avail 0 0 Busy 1 1.01 9999 R6000 AIX41
. . . . . . . . . .
. . . . . . . . . .
fr28n15.mhpcc.edu Avail 0 0 Busy 1 1.01 9999 R6000 AIX41
fr28n16.mhpcc.edu Avail 0 0 Busy 1 1.01 9999 R6000 AIX41
R6000/AIX41 217 machines 43 jobs 198 running
217 machines 43 jobs 198 running
The Central Manager is defined on cws1.class.mhpcc.edu
All machines on the machine_list are present
llstatus -l
By default, llstatus -l produces a long listing for every machine. The output can be piped to grep to extract specific fields. For example, the command below lists every machine together with its memory configuration.
llstatus -l | grep -E "Machine|Memory"
llstatus -l mymachine
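The short llstatus listing shown earlier can be filtered in the same spirit as the grep example. The sketch below picks out machines whose Startd column reads Idle. It assumes the column order of that listing (Name, Schedd, InQ, Act, Startd, ...), so Startd is the fifth field; idle_machines is a hypothetical name, and the sample rows are copied from the listing.

```shell
#!/bin/sh
# Sketch: print the names of machines whose Startd state is Idle.
# Assumes the llstatus column order shown earlier, with Startd in field 5.
idle_machines() {
    awk '$5 == "Idle" { print $1 }'
}

# Sample rows copied from the llstatus listing in this section.
idle_machines <<'EOF'
Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys
cws1.mhpcc.edu Down 0 0 Down 0 0.00 22 R6000 AIX41
fr1n05.mhpcc.edu Avail 0 0 Busy 1 0.00 9999 R6000 AIX41
fr1n07.mhpcc.edu Avail 0 0 Idle 0 0.07 9999 R6000 AIX41
fr2n01.mhpcc.edu Avail 0 0 Busy 1 0.00 9999 R6000 AIX41
EOF
```

The header row is skipped automatically because its fifth field is the word Startd, not a state.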
| Canceling a Job |
llcancel mymachine.23.0
| Displaying the Central Manager |
| Submitting Multiple Jobs |
#@ executable = longjob
#
#@ input = longjob.in1
#@ output = longjob.out1
#@ error = longjob.err1
#@ account_no = ABCDE-1234-567
#@ queue
#
#@ input = longjob.in2
#@ output = longjob.out2
#@ error = longjob.err2
#@ account_no = ABCDE-1234-567
#@ queue
#@ executable = longjob
#
#@ input = longjob.in.$(Process)
#@ output = longjob.out.$(Cluster).$(Process)
#@ error = longjob.err.$(Cluster).$(Process)
#@ account_no = ABCDE-1234-567
#@ queue
#@ queue
#@ queue
#@ queue
#@ queue
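To see which files the five queue statements above refer to, the substitutions can be spelled out by hand. The sketch below assumes that $(Process) starts at 0 and increases by one for each queue statement, and that $(Cluster) is a single number assigned at submission time. This matches the job identifiers fr5n09.13.0 through fr5n09.13.4 in the llq listing earlier, but it is an inference from that listing, not a statement taken from the LoadLeveler manual. The cluster number 13 and the function name expand_names are illustrative only.

```shell
#!/bin/sh
# Sketch: list the file names implied by N queue statements.
# Assumes $(Process) counts from 0 and $(Cluster) is fixed per submission.
expand_names() {
    cluster=$1    # illustrative cluster number; the real one is assigned at submit time
    nqueue=$2     # number of "#@ queue" statements in the job command file
    p=0
    while [ "$p" -lt "$nqueue" ]; do
        echo "longjob.in.$p longjob.out.$cluster.$p longjob.err.$cluster.$p"
        p=$((p + 1))
    done
}

expand_names 13 5
```

Under these assumptions the input files longjob.in.0 through longjob.in.4 would need to exist before the job is submitted, since their names do not depend on the cluster number.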
| Using the Job Command File as the Executable |
#!/bin/csh
#
# Beginning of LoadLeveler commands
#@ initialdir = /u/jsmith/LoadLeveler
#@ error = run1.$(Cluster).err
#@ output = run1.$(Cluster).out
#@ environment = MP_SHARED_MEMORY=yes
#@ account_no = ABCDE-1234-567
#@ queue
# Beginning of script commands
echo 'Copying input file to /localscratch'
cp input.1 /localscratch/input.1
echo 'Running the program'
run1
echo 'Copying output file back'
cp /localscratch/output.1 output.1
rm /localscratch/input.1
echo 'Cleanup done. Job completed.'
| Submitting Parallel Jobs - General Notes |
#@ job_type = serial -default
#@ job_type = parallel -MPI
#@ node = 2
#@ tasks_per_node = 4
#@ total_tasks = 4
#@ network.MPI = css0,not_shared,US
#@ environment = MP_SHARED_MEMORY=yes; MP_INFOLEVEL=5; MP_SAVEHOSTFILE=myhosts.txt;
echo $LOADL_PROCESSOR_LIST > myhosts
#@ notification = complete
#@ notify_user = jsmith@mhpcc.hpc.mil
#@ error = myjob.err
#@ environment = MP_INFOLEVEL=2
#!/bin/csh
#@ initialdir = /u/jsmith/LoadLeveler
#@ error = run1.err
#@ output = run1.out
#@ job_type = parallel
#@ network.MPI = css0,not_shared,US
#@ environment = MP_SHARED_MEMORY=yes
#@ node = 4
#@ tasks_per_node = 4
#@ wall_clock_limit = 12000
#@ account_no = ABCDE-1234-567
#@ queue
#
set mydir = "/u/jsmith/LoadLeveler/"
set infile = "input.1"
set nodes = `echo $LOADL_PROCESSOR_LIST`
# Pre-execution setup
foreach node (${nodes})
rcp ${mydir}${infile} ${node}:/localscratch
echo "copied ${mydir}${infile} to ${node}:/localscratch"
end
run1
# Post-execution cleanup
foreach node (${nodes})
rsh $node "cd /localscratch; rm -f ${infile}"
echo "cleanup on ${node} done"
end
echo "Job completed"
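The copy and cleanup loops in the script above run once per entry in LOADL_PROCESSOR_LIST. If that list names a host once per task, which is an assumption here and not something the script itself confirms, a job with several tasks per node would copy the same file to the same node several times. The sketch below, written in POSIX sh although the job script is csh, shows one way to reduce such a list to unique host names. The function name unique_hosts and the host list are hypothetical.

```shell
#!/bin/sh
# Sketch: reduce a host list to unique names, keeping first-seen order.
# The repeated host names below are an assumed shape for the list, for illustration.
unique_hosts() {
    printf '%s\n' "$@" | awk 'NF && !seen[$0]++'
}

# Hypothetical list for a job with 2 nodes and 2 tasks per node.
LIST="fr7n09 fr7n09 fr7n15 fr7n15"

for node in $(unique_hosts $LIST); do
    echo "would copy input.1 to ${node}:/localscratch"
done
```

The loop only prints what it would do, so it can be tried without rcp or rsh access to any node.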
| Submitting Parallel MPI Jobs |
# Example MPI job command file
#
#@ initialdir = /u/smith/proj01
#@ notification = complete
#@ notify_user = smith@favorite.machine.mail
#@ input = myprog.in
#@ error = myprog.$(Cluster).err
#@ output = myprog.$(Cluster).out
#@ job_type = parallel
#@ network.MPI = css0,not_shared,US
#@ environment = MP_INFOLEVEL=3;MP_LABELIO=yes;MP_SHARED_MEMORY=yes
#@ node = 4
#@ tasks_per_node = 4
#@ checkpoint = no
#@ wall_clock_limit = 14000
#@ account_no = ABCDE-1234-567
#@ queue
myprog
| LoadLeveler Internals |




| How LoadLeveler Schedules Parallel Jobs |
| Note: This section does not apply to MHPCC users. The MHPCC has implemented its own batch scheduler, which supersedes the LoadLeveler scheduling mechanisms. Please see the "LoadLeveler at the MHPCC" section of this tutorial for details. |
| LoadLeveler at the MHPCC |
| Important: The MHPCC has implemented a batch scheduler which replaces most of LoadLeveler's scheduling mechanisms. There are also a number of site-specific details unique to the MHPCC. Users should read this section before attempting to submit batch jobs on the MHPCC systems. |
set path=($path /usr/lpp/LoadL/full/bin)
| Time Limit (hrs) | Number of processors |
|---|---|
| 8 | 65 - 128 |
| 16 | 33 - 64 |
| 24 | 5 - 32 |
| 36 | 1 - 4 |
Currently (8/8/97), the scheduler uses the FIRSTFIT algorithm to backfill jobs. Users can take advantage of this by setting the wall_clock_limit keyword to the shortest time their job actually requires, since a job with a shorter limit is more likely to fit into an idle gap in the schedule.
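The time limit table above can be written as a small lookup, which makes it easy to check a planned wall_clock_limit before submitting. This is a sketch only: it assumes that a plain number given to wall_clock_limit is a count of seconds and that each task occupies one processor, neither of which is stated in this section. The function names max_hours and fits_limit are hypothetical.

```shell
#!/bin/sh
# Sketch: look up the time limit (hours) for a processor count, per the table above.
max_hours() {
    n=$1
    if   [ "$n" -lt 1 ];   then return 1
    elif [ "$n" -le 4 ];   then echo 36
    elif [ "$n" -le 32 ];  then echo 24
    elif [ "$n" -le 64 ];  then echo 16
    elif [ "$n" -le 128 ]; then echo 8
    else return 1          # more processors than the table covers
    fi
}

# fits_limit PROCESSORS SECONDS: succeed if SECONDS is within the table limit.
# Assumes wall_clock_limit values such as 12000 are in seconds.
fits_limit() {
    hours=$(max_hours "$1") || return 1
    [ "$2" -le $((hours * 3600)) ]
}

# Example: node = 4 and tasks_per_node = 4 give 16 tasks; limit 12000 seconds.
if fits_limit 16 12000; then
    echo "16 processors, 12000 s: within the $(max_hours 16)-hour limit"
fi
```

Under the seconds assumption, 12000 is 3 hours 20 minutes, well inside the 24-hour limit for 5 to 32 processors.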
xlf -bmaxdata:512000000 -o myprog myprog.f
See the xlf man page for details about the -bmaxdata: option.
cp $WORKSHOP/samples/loadl/Xdefaults.xloadl .
Other helpful hints
#@ environment = MP_SHARED_MEMORY=yes;MP_INFOLEVEL=3;MP_LABELIO=yes
#@ network.MPI = css0,not_shared,US
if ($?prompt) then
setenv TERM vt100
set filec
set prompt = "`hostname -s`% "
setenv MP_EUILIB us
:
:
endif
| LoadLeveler Job Command File Keywords Reference |
An alphabetical list of the keywords you can use in a LoadLeveler job command file is provided below. All of these keywords are linked to their descriptions from the llsubmit man page.
account_no
arguments
checkpoint
class
comment
core_limit
cpu_limit
data_limit
dependency
environment
error
executable
file_limit
group
hold
image_size
initialdir
input
job_cpu_limit
job_name
job_type
max_processors
min_processors
notification
notify_user
output
parallel_path
preferences
queue
restart
requirements
rss_limit
shell
stack_limit
startdate
stepname
user_priority
wall_clock_limit
| LoadLeveler Commands Reference |
The following commands permit you to perform LoadLeveler related activities. Each is linked to its LoadLeveler man page.
| References and More Information |