4 02. Running jobs on Hazel-HPC

4.0.0.1 What is a job?

An HPC job is the unit of work that a user submits to the cluster’s Workload Manager (or job scheduler) for execution. It’s essentially your program, simulation code, or analysis script, bundled with all the necessary instructions for the system to run it. These instructions are typically contained within a job script (or submission script), which includes vital directives for the scheduler. By encapsulating the executable code, resource requirements, and runtime parameters, the job acts as a complete, self-contained package that the scheduler can manage and allocate hardware for, ensuring efficient use of the shared cluster resources.

4.0.0.2 LSF Job Scheduler

Hazel uses the LSF (Load Sharing Facility) job scheduler, currently known as IBM Spectrum LSF. Users submit jobs using the bsub command and include specific directives (e.g., #BSUB -n 16) to communicate their resource needs to the scheduler. LSF then handles the entire lifecycle of the job—from submission to completion, including fault tolerance and output logging.
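
For example, a complete submission can be as small as this (a minimal sketch; myjob.sh is a placeholder name for a script containing #BSUB directives, like the one dissected later in this chapter):

bsub < myjob.sh     # the input redirection lets LSF read and parse the #BSUB lines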



4.0.1 🖥️ Interactive Jobs

For quick testing, debugging, or running applications that require a graphical interface (GUI), do not work on the login node; instead, request a temporary allocation on a compute node using an interactive job.

$ bsub -Is -n 1 -W 10 bash
Option  Meaning
-I      Submits the job as an interactive job.
-s      Combined with -I (as -Is), allocates a pseudo-terminal in shell mode so the session behaves like a regular shell.
-n 1    Requests 1 CPU core.
-W 10   Requests 10 minutes of wall-clock time.
bash    The shell to run on the allocated node.


4.0.1.1 Requesting multiple cores on Interactive

It is possible to request multiple cores for an interactive session. For example, to request an interactive session with 4 cores (-n 4), all cores on the same node (-R "span[hosts=1]"), and 10 minutes of wall-clock time (-W 10), use:

bsub -Is -n 4 -R "span[hosts=1]" -W 10 bash
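
Once the scheduler grants the session, your prompt moves to a compute node. A quick way to sanity-check the allocation from inside the session (a sketch; LSB_DJOB_NUMPROC is the LSF environment variable holding the number of allocated cores):

hostname                  # should print a compute node, not a login node
echo $LSB_DJOB_NUMPROC    # number of cores LSF allocated (4 in this example)
exit                      # ends the session and releases the allocation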


4.0.2 📋 Anatomy of a Job Script

An LSF batch job combines two things: the script you want to run and information about the resources you need to run it. The annotated example below shows a complete job script; the following sections break it down block by block.

4.0.2.1 0. Example of a job script:

#!/bin/bash
# ==================================================
# Hello World python Job Script for Hazel HPC
# ==================================================
# This job script submits a python program to print "Hello World"


# --------------------------------------------------
# Request resources here
# --------------------------------------------------
#BSUB -J hello                        # job name
#BSUB -n 2                            # number of CPUs required per task
#BSUB -q shared_memory                # the queue to run on
#BSUB -R "span[hosts=1]"              # number of hosts to spread the jobs across, 1 host used here
#BSUB -R "rusage[mem=4GB]"            # required total memory for the job 
#BSUB -o "./output.%J.log"            # standard output file (%J is job name)
#BSUB -e "./error.%J.log"             # standard error file (%J is job ID)
#BSUB -W 10:00                        # time to run

# --------------------------------------------------
# Load modules here
# --------------------------------------------------

module load python

# --------------------------------------------------
# Execute commands here
# --------------------------------------------------

python hello.py
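
The script expects a file named hello.py in the submission directory. A minimal companion file and the submission command (assuming the script above was saved as hello_world.sh; the input redirection is what lets LSF parse the #BSUB lines):

echo 'print("Hello World")' > hello.py    # one-line test program
bsub < hello_world.sh                     # submit the job script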

4.0.2.2 1. Request resources block

The “Request Resources Block” is the most critical section, as it uses LSF directives (lines starting with #BSUB) to tell the Workload Manager exactly how much computing power and time your job needs.

Directive Command Description
Job Name #BSUB -J hello Sets the job name to hello. This is used to identify the job in the queue and in output files.
CPU Cores #BSUB -n 2 Requests 2 CPU cores per task. For a parallel job, this is the number of cores dedicated to that single task.
Queue #BSUB -q shared_memory Specifies the queue (or partition) where the job should run.
Host Span #BSUB -R "span[hosts=1]" Requests that all required resources (the 2 cores) reside on a single compute host (node).
Memory #BSUB -R "rusage[mem=4GB]" Requests 4 GB of total memory (RAM) for the job. The job will be scheduled only on a node with at least this much available memory.
Output File #BSUB -o "./output.%J.log" Directs standard output (stdout) to this file; %J is replaced with the job ID.
Error File #BSUB -e "./error.%J.log" Directs standard error (stderr) to this file; %J is replaced with the job ID.
Wall Time #BSUB -W 10 Sets the maximum wall-clock time (elapsed real time) to 10 minutes; -W takes minutes or hours:minutes (e.g., -W 2:30 for 2.5 hours). If the job runs longer, LSF terminates it.
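
The same block scales up for heavier work. As a sketch (the values here are illustrative, not site recommendations), a longer multi-core job might request:

#BSUB -J align_sample
#BSUB -n 8
#BSUB -R "span[hosts=1]"
#BSUB -R "rusage[mem=32GB]"
#BSUB -W 8:00                         # hours:minutes form, i.e. 8 hours
#BSUB -o "./output.%J.log"
#BSUB -e "./error.%J.log"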

4.0.2.3 2. Load modules block

This block is crucial for setting up the software environment your specific job requires. HPC systems use a tool called Modules to manage different versions of software, compilers, and libraries without conflicts.

This section uses the module load command to prepare the job’s environment. This avoids conflicts by allowing different users and different jobs to use specific, isolated versions of software on the shared cluster.

Sometimes the software you need won't be available as a pre-installed module on the HPC; in that case you will need to install it yourself before you run it. We recommend trying containers first, and if that is not possible, installing the software with a virtual environment manager such as conda.

There will be more information on this in later documentation, but in case you installed your software with either of these options, this is how you would load/activate it in your job script:

Command Example Description
Setup Option 1 module load conda Loads the Conda module, which manages virtual environments. This makes the conda command available for use.
Activation conda activate my_env If using Conda, this line activates the specific virtual environment containing your job’s required tools.
Setup Option 2 module load apptainer Loads the Apptainer module for running containerized applications. Containers package software with all its dependencies.

Key Principle: Always load the modules your job needs before trying to run the main command.
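
Putting this together, a modules block that activates a Conda environment might look like the sketch below (it assumes an environment named my_env was created beforehand; on some clusters conda activate only works in batch jobs after sourcing conda's init script, in which case use source activate my_env instead):

module load conda
conda activate my_env     # environment containing the job's tools (name is illustrative)
python --version          # quick check that the environment's interpreter is active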


4.0.2.4 3. Execute commands block

This is the main body of the script, where the actual scientific or computational work takes place. The commands here are standard shell commands, executed sequentially on the allocated compute node(s). It is often helpful to wrap the core task with simple echo statements that log key steps, which aids debugging and monitoring.

The example above is far simpler than what you would normally write or encounter in bioinformatics. Here is a more realistic execution block, running the software fastqc:


# --------------------------------------------------
# Load modules here
# --------------------------------------------------

module load apptainer

# --------------------------------------------------
# Execute commands here
# --------------------------------------------------

# Print job start information
echo "=================================================="
echo "Job started: $(date)"
echo "Job ID: $LSB_JOBID"
echo "Running on host: $(hostname)"
echo "Working directory: $(pwd)"
echo "=================================================="

FASTQC_SIF="/gpfs_backup/bioinfo_data/containers/images/quay.io_biocontainers_fastqc:0.12.1--hdfd78af_0.sif"
IN_DIR="/path/to/your/repo/data"
SAMPLE="sample_001"
CPUS=2                                # must match #BSUB -n
BIND=""                               # add "--bind /your/data/path" if data lives outside Apptainer's default bind paths

OUT_DIR="/path/to/working/dir/fastqc_results"
mkdir -p "${OUT_DIR}"

# --------------------------------------------------
# Validate inputs
# --------------------------------------------------

# Check if container exists
if [ ! -f "${FASTQC_SIF}" ]; then
    echo "ERROR: Container not found at ${FASTQC_SIF}"
    exit 1
fi

# Check if input directory exists
if [ ! -d "${IN_DIR}" ]; then
    echo "ERROR: Input directory not found at ${IN_DIR}"
    exit 1
fi

# --------------------------------------------------
# Execute FastQC
# --------------------------------------------------

echo "Running FastQC on sample: ${SAMPLE}"

apptainer exec $BIND ${FASTQC_SIF} fastqc \
    --threads $CPUS \
    --outdir ${OUT_DIR} \
    ${IN_DIR}/${SAMPLE}_*.fastq*

# Check if FastQC completed successfully (capture the exit code before testing it)
fastqc_exit=$?
if [ ${fastqc_exit} -eq 0 ]; then
    echo "FastQC completed successfully"
else
    echo "ERROR: FastQC failed with exit code ${fastqc_exit}"
    exit 1
fi

# --------------------------------------------------
# Job completion
# --------------------------------------------------

echo "=================================================="
echo "Job completed: $(date)"
echo "Results saved to: ${OUT_DIR}"
echo "=================================================="

As you can see, this script includes several techniques commonly used in job scripting:

  • Variable creation and value setting
  • Printing job information to standard output to help with debugging and monitoring
  • Creation of output directories
  • Error checking

Additionally, note that the software fastqc is executed from an Apptainer container, not a module, so the script also needs the path to the container image in addition to the input and output paths.

Here is a detailed explanation of each block, in case you would like to use any of these techniques in your own job scripts:

Start Log

    echo "Job started: $(date)"
    echo "Job ID: $LSB_JOBID"
    echo "Running on host: $(hostname)"
    echo "Working directory: $(pwd)"

Logs the start time and environment details into your standard output (output.%J.log) file:
  • date - records when the job started
  • $LSB_JOBID - shows the unique job ID assigned by LSF
  • hostname - shows which compute node is running the job
  • pwd - prints the current working directory

Set Variables

    FASTQC_SIF="/gpfs_backup/bioinfo_data/containers/images/quay.io_biocontainers_fastqc:0.12.1--hdfd78af_0.sif"
    IN_DIR="/path/to/your/repo/data"
    SAMPLE="sample_001"

Defines paths and parameters used throughout the script:
  • Container image location
  • Input data directory
  • Sample identifier to process

Create Directories

    OUT_DIR="/path/to/working/dir/fastqc_results"
    mkdir -p "${OUT_DIR}"

Creates the output directory if it doesn't already exist. The -p flag prevents errors if the directory already exists.

Validate Inputs

    if [ ! -f "${FASTQC_SIF}" ]; then
        echo "ERROR: Container not found"
        exit 1
    fi

Checks that required files and directories exist before running the main task:
  • Verifies the container image exists
  • Confirms the input directory is accessible
  • Exits with an error code if validation fails

Core Task

    apptainer exec ${BIND} ${FASTQC_SIF} fastqc \
        --threads ${CPUS} \
        --outdir ${OUT_DIR} \
        ${IN_DIR}/${SAMPLE}_*.fastq*

Runs the main application, using Apptainer to execute fastqc within the container:
  • --threads ${CPUS} - uses 2 CPU cores (matches #BSUB -n 2)
  • --outdir - specifies where to save results
  • Processes all FASTQ files matching the sample pattern

Error Checking

    fastqc_exit=$?
    if [ ${fastqc_exit} -eq 0 ]; then
        echo "FastQC completed successfully"
    else
        echo "ERROR: FastQC failed with exit code ${fastqc_exit}"
        exit 1
    fi

Checks the exit status of the previous command:
  • $? - contains the exit code (0 = success); it is captured into fastqc_exit immediately, because the if test itself would otherwise overwrite it
  • Logs whether the task completed successfully
  • Exits with an error code if the task failed

End Log

    echo "Job completed: $(date)"
    echo "Results saved to: ${OUT_DIR}"

Logs the end time and output location. This helps you:
  • Calculate actual runtime
  • Confirm the job completed
  • Know where to find results

Key Principle: Every command listed here runs on the allocated compute resources, not on the login node. This ensures your intensive computations don’t overload the shared login environment.

Best Practices Shown:

  • Logging: Track when and where your job runs
  • Variables: Make paths easy to change and reuse
  • Validation: Catch problems before wasting compute time
  • Error handling: Know immediately if something goes wrong
  • Documentation: Clear comments explain what each section does
Important

All images should be in /rs1/shares/brc/admin/containers/images, if your tool is not there feel free to contact us and request it.



4.0.3 🛠️ Job Monitoring and Management

These are some of the most relevant commands you will need to monitor and manage your jobs:

Command Category Command Syntax Description
Submission bsub < yourjob.sh Submits the job script (yourjob.sh) to the LSF scheduler. The scheduler processes the #BSUB directives within the script.
Status Check bjobs Displays a list of all your jobs (running, pending, suspended, etc.) and their current status.
Detailed Status bjobs -l [JOBID] Provides a detailed report on a specific job, including the execution host, resource usage, and waiting reasons (if pending).
Termination bkill [JOBID] Terminates (kills) a running or pending job instantly. Use this if your job is faulty or you no longer need the results.
Resource Inquiry bhosts Shows the status of the compute nodes in the cluster, indicating which are available, busy, or otherwise unavailable.

4.0.3.1 1. Get job information with bjobs

The bjobs command monitors the status of jobs after they are submitted to LSF. A job is usually in one of two states (STAT): PEND means the job is queued, waiting for resources to become available, and RUN means the job is currently executing.

A typical bjobs output looks like:

[unityID@login01 ~]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
948851  unityID  RUN   standard   login01.hpc bc2j6       myjob1 

Once you have the JOBID you can obtain more information about the job using:

bjobs -l JOBID

For jobs in a pending state, this will give information about why the job is pending. It is possible to make resource requests that are impossible to satisfy; jobs that have been pending for a long time may have made that type of request. It is possible that the chosen resources are currently being used, but it is also possible that the specified resources do not exist on the system. For example, LSF will not give an error message upon requesting a 64 core node with 500 GB of memory; it will simply wait until such a node is installed, leaving the job in a forever pending state.

Other job information commands:

  • bjobs -l: gives more detailed information on each job than plain bjobs
  • bjobs -l -p3: also lists the reasons the job is pending, and may include an estimate of when the job will start (e.g., Job will start no sooner than indicated time stamp).
  • bjobs -u all | grep gpu: find jobs running in a particular queue
  • bjobs -r -X -o "jobid queue cpu_used run_time avg_mem max_mem slots delimiter=','": return a CSV formatted list of your jobs showing the job ID, queue, total CPU time, elapsed wall clock time, average memory utilized, maximum memory utilized, and the number of cores reserved

Note: Job priority is determined by several factors including fair share priority, queue priority, and time of submission.

4.0.3.2 2. Modify job requests with bmod

The bmod command modifies the resource requests of a job that has already been submitted. For example, to change the wall-clock limit for a pending or running job, use:

bmod -W [new time] [job ID]
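
For instance, to extend a job's wall-clock limit to 2 hours (the job ID 948851 is illustrative, borrowed from the bjobs example above):

bmod -W 120 948851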

4.0.3.3 3. Kill jobs with bkill

A pending job may be removed from the queue or a running job may be terminated by using the bkill command. The job ID is used to specify which job to remove.

bkill JOBID

To terminate all of your own running and pending jobs at once, use bkill 0.



4.0.4 🚦 Queues on Hazel (from HazelWiki)

To specify a queue, use -q queue_name.

Note

In general, users should not specify a queue. When no queue is specified, LSF will choose the most appropriate queue based on the number of cores and time requested from the set of default queues. The exceptions are partner queues and specialty queues, which are queues with special resources.

4.0.4.1 Default Queues

  • debug
  • serial
  • short
  • single_chassis
  • standard
  • long

4.0.4.2 Specialty Queues

  • shared_memory - Nodes intended for running OpenMP or other shared-memory executables, particularly those with large memory requirements
  • gpu - Nodes with attached NVIDIA GPUs
  • short_gpu - Access to partner GPU nodes for up to 2 hour run time

The queues available to a user can be displayed with bqueues -u user_name, and the properties of a queue with bqueues -l queue_name. For help interpreting the output of bqueues, see the example below:

(base) [mtouced@login03 ~]$ bqueues -u mtouced
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
debug            80  Open:Active       -   64    -    -    64    64     0     0
short            75  Open:Active    2560 1024    -    -  4229  2792  1237     0
shared_memory    70  Open:Active       -  256    -    -  2164  1086  1028     0
single_chassis   65  Open:Active       - 1024    -    -  6090  2759  2595     0
standard         64  Open:Active       - 1024    -    -   112   112     0     0
long             63  Open:Active    1536  512    -    -  1167   644   523     0
short_gpu        62  Open:Active       -   64    -    -    78    20    58     0
gpu              60  Open:Active       -   64    -    -   380   100   280     0
serial           30  Open:Active    2048 1024    -    -   342    63   279     0
4.0.4.2.1 bqueues output:
  • NJOBS: number of total jobs in the queue
  • RUN: Number of jobs actually running
  • PEND: Number of jobs pending
  • MAX: The maximum number of cores available to the queue. For some queues, like gpu, no MAX is shown; a dash means no limit is configured.

4.0.4.3 Other commands to get queue information

  • lshosts | grep gpu or bqueues -l gpu: find which hosts have GPUs


4.0.5 📤 Standard Output and Standard Error

When you submit a job to the HPC, your program’s output needs to go somewhere. By default, programs write to two streams:

  • Standard Output (stdout): Normal program output and results
  • Standard Error (stderr): Error messages and warnings

4.0.5.1 Specifying Output Files in LSF

In your batch script, use these directives to control where output goes:

#BSUB -o stdout.%J    # Standard output file
#BSUB -e stderr.%J    # Standard error file

The %J is automatically replaced by your job ID when the job starts. For example, if your job ID is 948851, the files will be named:

  • stdout.948851
  • stderr.948851

You can customize the names:

#BSUB -o logs/myanalysis_out.%J
#BSUB -e logs/myanalysis_err.%J

Note that LSF will not create the logs/ directory for you; make sure it exists before you submit the job.
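
Inside the script itself, you can also give an individual command its own log files, independent of the job-level stdout/stderr files (a generic sketch; my_program is a placeholder):

my_program input.txt > program_out.log 2> program_err.log    # 2> captures stderr separately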


4.0.6 ⚠️ Common Errors and How to Fix Them

4.0.6.1 1. File or Directory Missing

Error message in stderr:

/bin/bash: line 12: reads_R1.fastq: No such file or directory

or

cannot access 'data/samples.txt': No such file or directory

Causes:

  • The file doesn’t exist where you think it does
  • You’re using a relative path, but the job runs from a different directory
  • Typo in the filename

Solutions:

  • Use absolute paths: /path/to/your/data/reads_R1.fastq
  • Verify the file exists before submitting: ls -l reads_R1.fastq
  • Check your current directory: pwd (jobs run from the directory where you submit)
  • Use #BSUB -cwd /path/to/working/directory to specify working directory

4.0.6.2 2. Out of Memory (OOM)

Error message in stderr:

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit

or

Killed
java.lang.OutOfMemoryError: Java heap space

Cause: Your job exceeded the memory allocated or available on the node.

Solutions:

  • Request more memory using -R "rusage[mem=XXXX]". Check how your cluster interprets the value (per core or per job, and the default unit); the example script above requests a job total with an explicit unit, mem=4GB
  • Check if your program has memory limit options
  • Use a memory-efficient algorithm or subsample your data
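
For instance, to raise the request from the example script earlier in this chapter (a sketch; choose the value based on the max_mem that bjobs can report for past runs, as shown in the monitoring section):

#BSUB -R "rusage[mem=16GB]"           # replaces the earlier 4GB request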

4.0.6.3 3. Wall Time Exceeded

Error message in stderr:

TERM_RUNLIMIT: job killed after reaching LSF run time limit

Cause: Your job ran longer than the time specified with -W.

Solutions:

  • Increase the wall time: -W 480 or -W 8:00 (both request 8 hours)
  • Optimize your code or reduce input size
  • Break the job into smaller tasks
  • Use checkpointing if your program supports it

4.0.6.4 4. Wrong Type (WT) - Incorrect Resource Specification

Error message:

Job not started: requested resource [avx512] not available

Cause: You requested a resource (CPU instruction set, GPU, specific node type) that doesn’t exist or isn’t available.

Solutions:

  • Check available resources at the cluster status page
  • Remove or modify the resource specification
  • Use more general resource requests

Common resource specifications on NCSU HPC:

#BSUB -R "select[avx2]"     # AVX2 instruction set
#BSUB -R "select[qc]"       # 8-core nodes
#BSUB -R "select[sc]"       # 16-core nodes
#BSUB -R "select[dc]"       # 32-core nodes
#BSUB -q gpu                # GPU queue

4.0.6.5 5. Module Not Found

Error message in stderr:

module: command not found

or

Lmod has detected the following error: The following module(s) are unknown: "blast"

Causes:

  • Module system not initialized (rare on NCSU HPC)
  • Module name is incorrect or doesn’t exist
  • Module is not available on the compute nodes

Solutions:

  • Check available modules: module avail
  • Search for the module: module spider blast
  • Verify correct module name and version
  • Load prerequisite modules first

4.0.6.6 6. Permission Denied

Error message in stderr:

./my_script.sh: Permission denied

Cause: The script or executable doesn’t have execute permissions.

Solution: Make the file executable before submitting:

chmod +x my_script.sh

4.0.7 🐛 Debugging Tips

  1. Check your error file first: Most problems will show up in the stderr file
   cat error.JOBID
   tail -50 error.JOBID  # Last 50 lines often contain the error
  2. Test your script interactively first: Request an interactive session and run commands manually
   bsub -Is -n 1 -W 30 bash
   # Test your commands here before submitting a batch job
  3. Use verbose/debug flags: Many programs have options for detailed output
   my_program --verbose input.txt
  4. Add checks in your script:
   #!/bin/bash
   #BSUB -n 1
   #BSUB -W 30
   #BSUB -J mytest
   #BSUB -o output.%J
   #BSUB -e error.%J
   
   # Exit on any error
   set -e
   
   # Print commands as they execute
   set -x
   
   # Check if input file exists
   if [ ! -f "input.txt" ]; then
       echo "ERROR: input.txt not found!"
       exit 1
   fi
   
   # Your analysis here
   my_program input.txt
  5. Monitor your job: Check on running jobs regularly
   bjobs -l JOBID  # Detailed job info