LSF Introduction
IBM Spectrum LSF (load sharing facility) is software to distribute work across heterogeneous resources.
Here is some general terminology used by LSF:
- Cluster – a group of hosts running LSF (command: lsclusters)
- Host – a computer in the cluster (command: lshosts)
- Job – a unit of work running on the LSF system (command: bjobs)
- Job slot – a bucket into which a single unit of work is assigned in the LSF system (command: bslots)
- Queue – a cluster-wide container for jobs (command: bqueues)
- Resources – objects in the system that can run work, including hosts, CPU slots, and licenses (command: lsinfo)
lsid
The lsid command shows general information about how LSF is set up: the LSF version (10.1.0.9), the cluster name (lsf-cluster), and the master host name (lsf-master).
➜ lsid
IBM Spectrum LSF Standard 10.1.0.9, Oct 16 2019
Copyright International Business Machines Corp. 1992, 2016
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
My cluster name is lsf-cluster
My master name is lsf-master
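When scripting against more than one cluster, it can be handy to pull the cluster and master names out of the lsid output. The following is a minimal sketch that assumes the output format shown above; lsid_cluster_name and lsid_master_name are hypothetical helper names, not LSF commands.

```shell
# Hypothetical helpers: read lsid output on stdin and print a single field.
# They rely on the "My cluster name is ..." and "My master name is ..." lines
# appearing exactly as in the sample output above.
lsid_cluster_name() {
    awk '/^My cluster name is / { print $5 }'
}

lsid_master_name() {
    awk '/^My master name is / { print $5 }'
}
```

For the cluster above, lsid | lsid_cluster_name would print lsf-cluster.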
lsinfo
The lsinfo command lists the resources available in the cluster, followed by the host types and host models that are defined.
➜ lsinfo
RESOURCE_NAME TYPE ORDER DESCRIPTION
...
TYPE_NAME
...
MODEL_NAME CPU_FACTOR
...
lshosts
The lshosts command lists the resources defined for each host. The -o option selects which fields are displayed, and the -json option displays the information in JSON format.
➜ lshosts
HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
lsf-host0 X86_64 INTEL_EM 60.0 64 256G 256G Yes ()
lsf-host1 X86_64 INTEL_EM 60.0 32 256G 256G Yes ()
➜ lshosts -o "HOST_NAME ncpus nprocs ncores nthreads"
HOST_NAME ncpus nprocs ncores nthreads
lsf-host0 64 2 32 2
lsf-host1 32 1 32 2
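As a small illustration of working with this output, the sketch below adds up the ncpus column to get a rough CPU count for the whole cluster. It assumes the default lshosts layout shown above, where ncpus is the fifth field; total_ncpus is a hypothetical helper name.

```shell
# Hypothetical helper: sum the ncpus column (5th field) of default lshosts
# output read from stdin. The header line is skipped, and so is any row whose
# ncpus field is not a plain number.
total_ncpus() {
    awk 'NR > 1 && $5 ~ /^[0-9]+$/ { sum += $5 } END { print sum + 0 }'
}
```

It would be used as lshosts | total_ncpus.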
lsload
The lsload command displays load information for the hosts. The fields are:
- status – the load status of the host
- r15s, r1m, r15m – the CPU load (run queue length) averaged over 15 seconds, 1 minute, and 15 minutes
- ut – the percentage of time the CPU is in use
- pg – the paging rate
- ls – the number of users logged in
- it – the idle time of the host, in minutes
- tmp – the available temporary disk space
- swp – the available swap space
- mem – the available RAM
The -l option reports more information about each host. The -o option selects which fields are displayed, as with lshosts above.
The lsmon command shows a continuously updating display of load information, much like running watch lsload.
➜ lsload
HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem
lsf-host0 ok 1.1 1.2 1.5 1% 0.0 0 102 33G 3.9G 624G
lsf-host1 ok 0.0 0.3 0.2 0% 0.0 0 8209 42G 3.9G 712G
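One way to put the load fields to use is to pick the least busy host. The sketch below assumes the default lsload layout shown above (status in the second field, r1m in the fourth); least_loaded_host is a hypothetical helper name.

```shell
# Hypothetical helper: read default lsload output on stdin and print the host
# with status "ok" that has the lowest 1-minute load (r1m, 4th field).
# Prints nothing if no host is in the "ok" state.
least_loaded_host() {
    awk 'NR > 1 && $2 == "ok" {
        load = $4 + 0
        if (best == "" || load < min) { min = load; best = $1 }
    }
    END { if (best != "") print best }'
}
```

It would be used as lsload | least_loaded_host, and the result could be passed to the -m option of bsub.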
bsub
Jobs are submitted to LSF using the bsub command. Some common options include:
- -I – submit an interactive job
- -Is – submit an interactive job with a pseudo-terminal (such as for vi)
- -J <job_name> – assign a name to the job
- -Jd <job_description> – assign a job description to the job
- -P <project_name> – assign the job to the specified project
- -eo <error_file> – write the standard error output of the job to the specified file path, overwriting the file if it exists
- -m <host_name> – submit the job to be run on specific hosts
- -oo <output_file> – write the standard output of the job to the specified file path, overwriting the file if it exists
- -q <queue_name> – submit the job to one of the specified queues
➜ bsub date
Job <1234> is submitted to default queue <normal>.
➜ bsub -I date
Job <1235> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on lsf-host0>>
Wed Feb 2 11:28:25 EST 2022
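Several of the options listed earlier are often combined. The wrapper below is a sketch of one such combination, not a recommended convention: submit_named_job and the log file naming scheme are assumptions made for illustration. The %J in the file names is expanded by LSF to the job ID.

```shell
# Hypothetical wrapper: submit a named job to a given queue and overwrite
# per-job log files in the current directory.
# Usage: submit_named_job <job_name> <queue_name> <command> [args...]
submit_named_job() {
    name=$1
    queue=$2
    shift 2
    # -J names the job, -q picks the queue, -oo/-eo overwrite stdout/stderr files.
    bsub -J "$name" -q "$queue" -oo "$name.%J.out" -eo "$name.%J.err" "$@"
}
```

For example, submit_named_job nightly normal ./run -test would submit ./run -test to the normal queue under the job name nightly.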
bqueues
The bqueues command lists the available job queues.
A "-" means that the field does not apply to that queue. The PRIO field shows the priority of the queue – the larger the number, the higher the priority. The STATUS field shows the status of the queue (open or closed, and active or inactive). The MAX field shows the maximum number of job slots that can be used by jobs from the queue. JL/U shows the per-user job slot limit, and JL/P shows the per-processor job slot limit. NJOBS shows the total number of slots used by jobs in the queue, summing the pending, running, and suspended tasks. PEND shows the pending tasks, RUN shows the running tasks, and SUSP shows the suspended tasks.
The -l option shows much more information about each queue, and the -o option selects which fields are shown for each queue.
➜ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P NJOBS PEND RUN SUSP
priority 49 Open::Active - - - 0 0 0 0
normal 30 Open::Active - - - 5 0 5 0
interactive 30 Open::Active - - - 3 1 2 0
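To spot queues where work is waiting, the PEND column can be filtered with awk. This is a sketch that assumes the default bqueues layout shown above, where PEND is the eighth field; queues_with_pending is a hypothetical helper name.

```shell
# Hypothetical helper: read default bqueues output on stdin and print
# "<queue> <pending>" for every queue whose PEND column (8th field) is above zero.
queues_with_pending() {
    awk 'NR > 1 && $8 + 0 > 0 { print $1, $8 }'
}
```

It would be used as bqueues | queues_with_pending.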
bslots
The bslots command displays the job slots that are currently available and how long they remain available (the RUNTIME field).
The -l option displays information in long format, including information about how many slots each host contains.
Note that the number of job slots is not limited by the number of cores or threads available – the operating system can easily switch between the running jobs as needed.
➜ bslots
SLOTS RUNTIME
150 UNLIMITED
➜ bslots -l
SLOTS: 150
RUNTIME: UNLIMITED
HOSTS: 100*lsf-host1 50*lsf-host2
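The HOSTS line of the long format packs the per-host slot counts into count*host pairs. The sketch below splits those pairs into one line per host; it assumes the bslots -l layout shown above, and slots_per_host is a hypothetical helper name.

```shell
# Hypothetical helper: read "bslots -l" output on stdin and print
# "<host> <slots>" for each "<slots>*<host>" entry on the HOSTS line.
slots_per_host() {
    awk '/^HOSTS:/ {
        for (i = 2; i <= NF; i++) {
            # "[*]" matches a literal asterisk between the count and the host name.
            split($i, part, "[*]")
            print part[2], part[1]
        }
    }'
}
```

It would be used as bslots -l | slots_per_host.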
busers
The busers command lists information about one or more LSF user accounts.
Most of the fields are the same as in the bqueues output, with SSUSP showing the number of tasks in system-suspended jobs, USUSP showing the number of tasks in user-suspended jobs, and RSV showing the number of tasks that reserve slots.
Running busers all shows information for all users.
➜ busers
USER/GROUP JL/P MAX NJOBS PEND RUN SSUSP USUSP RSV
cody - - 3 0 3 0 0 0
bjobs
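The bjobs command shows information about jobs submitted to LSF. By default, it lists only your own unfinished jobs (pending, running, and suspended); the -u all option lists the jobs of all users.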
➜ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1234 cody RUN normal cody-mbp lsf-host0 ./do Feb 2 13:18
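When there are many jobs, a per-state count is often easier to read than the full list. The sketch below tallies the STAT column; it assumes the default bjobs layout shown above (header line present, STAT in the third field), and jobs_by_state is a hypothetical helper name.

```shell
# Hypothetical helper: read default bjobs output on stdin and print
# "<state> <count>" for each job state, sorted by state name.
jobs_by_state() {
    awk 'NR > 1 { count[$3]++ } END { for (s in count) print s, count[s] }' | sort
}
```

It would be used as bjobs -u all | jobs_by_state.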
bpeek
The bpeek command shows the stdout and stderr output of an unfinished job.
If the -f option is used, the output is displayed with tail -f. Otherwise, cat is used to display the output.
➜ bpeek 1234
<< output from stdout >>
Wed Feb 2 13:18:41 EST 2022
Wed Feb 2 13:18:46 EST 2022
Wed Feb 2 13:18:51 EST 2022
bkill
The bkill command sends a signal to kill, suspend, or resume unfinished jobs.
By default, bkill sends signals that terminate the job: on UNIX, SIGINT and SIGTERM first, so that the job has a chance to clean up, and then SIGKILL. With the -s option, the STOP signal can be used to suspend a job, and the CONT signal to resume it. Note that the bstop and bresume commands can also be used to suspend or resume a job.
Using bkill with a job ID of 0 kills all jobs that match the other options (-app, -g, -m, -q, -u, and -J).
➜ bkill -s STOP 1234
Job <1234> is being stopped
➜ bjobs 1234
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1234 cody USUSP normal cody-mbp lsf-host0 ./do Feb 2 13:18
➜ bkill -s CONT 1234
Job <1234> is being resumed
➜ bjobs 1234
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1234 cody RUN normal cody-mbp lsf-host0 ./do Feb 2 13:18
➜ bkill 1234
Job <1234> is being terminated
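The job ID of 0 described earlier is easiest to see in combination with a filter. The wrapper below is a minimal sketch that kills every job of yours in a single queue; kill_my_jobs_in_queue is a hypothetical name, and it is worth checking the matching jobs with bjobs -q first, since the kill cannot be undone.

```shell
# Hypothetical wrapper: kill all of your own jobs in one queue.
# The job ID 0 tells bkill to act on every job that matches the -q filter.
# Usage: kill_my_jobs_in_queue <queue_name>
kill_my_jobs_in_queue() {
    bkill -q "$1" 0
}
```

For example, kill_my_jobs_in_queue interactive would run bkill -q interactive 0.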
Common commands
To get a high-level view of LSF performance, I typically run the following commands, normally under watch so that they update continuously.
# Get a list of all jobs being run
➜ bjobs -w -u all -noheader | sort -k 2,2
1234 cody RUN normal hostA lsf-host0 ./do
1235 cody RUN normal hostA lsf-host0 ./run
1236 cody RUN normal hostA lsf-host1 ./run -test
# See if the hosts are running and if they are accepting jobs
➜ bhosts
HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
lsf-host0 X86_64 INTEL_EM 60.0 64 256G 256G Yes ()
lsf-host1 X86_64 INTEL_EM 60.0 32 256G 256G Yes ()
# Check the status of the queues
➜ bqueues
QUEUE_NAME PRIO STATUS MAX JL/U JL/P NJOBS PEND RUN SUSP
priority 49 Open::Active - - - 0 0 0 0
normal 30 Open::Active - - - 5 0 5 0
interactive 30 Open::Active - - - 3 1 2 0
# Look at which users are using LSF
➜ busers -w all
USER/GROUP JL/P MAX NJOBS PEND RUN SSUSP USUSP RSV
ashley - - 0 0 0 0 0 0
cody - - 3 0 3 0 0 0
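The four commands above can be bundled into a single overview. The function below is only a sketch of that idea; lsf_overview and the section headings are made up for illustration.

```shell
# Hypothetical helper: print the four views above in one pass,
# each under a short heading.
lsf_overview() {
    echo "== Jobs =="
    bjobs -w -u all -noheader | sort -k 2,2
    echo "== Hosts =="
    bhosts
    echo "== Queues =="
    bqueues
    echo "== Users =="
    busers -w all
}
```

Because watch starts its command in a new shell, a function defined in the current shell is not visible to it. One option is to save the function in a script that calls it (say, a hypothetical lsf_overview.sh) and run watch -n 10 ./lsf_overview.sh.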