4 Running Jobs with SLURM
4.1 What is a Job?
A job is the unit of work you submit to the cluster. It packages your analysis script together with its resource requirements (cores, memory, time). A job scheduler called SLURM queues the job, assigns it to compute nodes when resources are available, and manages its execution.
Since real work cannot be done on the login nodes, learning how to submit jobs to the powerful compute nodes is an essential part of using Hazel effectively.
4.2 What is SLURM?
When you work on your laptop, you’re the only user—so you can run anything you want, anytime. A cluster is different. Hazel has hundreds of compute nodes, but thousands of researchers share them. Without a coordinator, jobs would compete for resources, nodes would sit idle while other work waited, and no one could predict when their analysis would actually run.
That coordinator is SLURM (Simple Linux Utility for Resource Management). SLURM is the job scheduler that sits between you and the compute nodes. You describe what you need—cores, memory, time—and SLURM queues your request, waits until a suitable node is available, launches your work there, and cleans up afterward.
Your workflow follows a consistent pattern every time:
- Write a job script — a shell script that declares your resource requirements and the commands to run
- Submit it with sbatch; you immediately get a job ID
- SLURM queues the job and schedules it when resources open up
- Your job runs on a compute node; output goes to log files you specify
- You check results when it finishes—no babysitting required
You don’t need to be active on Hazel for your job to run. Once a job is submitted, you can close your laptop and it will still run.
4.3 Anatomy of a Job Script
A SLURM job script is a regular shell script with two distinguishing features: a block of #SBATCH directives near the top that tell SLURM what resources to allocate, and the analysis commands that follow. Here is an example job script in a file named hello_world.sh.
#!/bin/bash
# ---------------------------------------
# Hello World job script
# ---------------------------------------
# --- Resources ---
#SBATCH --job-name=hello_world
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --partition=shared
#SBATCH --output=logs/hello_world%j.out
#SBATCH --error=logs/hello_world%j.err
#SBATCH --time=0:10:00
# --- Environment ---
module load python
# --- Execute ---
python hello.py
To submit a job script to SLURM, use the sbatch command:
$ sbatch hello_world.sh
Submitted batch job 12345
In this example, 12345 is the job ID.
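Scripts that submit jobs often need the job ID for later squeue, scancel, or sacct calls. The following is a minimal sketch, assuming sbatch prints its confirmation in the format shown above. The helper name extract_job_id is hypothetical, and the sample line stands in for real sbatch output so the snippet runs anywhere:

```shell
# extract_job_id (hypothetical helper): print the last field of the
# confirmation line that sbatch writes on success, e.g. "12345".
extract_job_id() {
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Sample line in the format shown above. On the cluster you would capture
# real output instead:  submit_line=$(sbatch hello_world.sh)
submit_line="Submitted batch job 12345"

job_id=$(extract_job_id "$submit_line")
echo "Job ID: ${job_id}"
```

On the cluster, sbatch --parsable prints only the job ID, which avoids the parsing step entirely.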
4.3.1 How #SBATCH Directives Work
Lines starting with #SBATCH look like comments to bash but are read by SLURM when you submit the script. A few rules to keep in mind:
- Directives must appear before any executable command in the script. Once SLURM hits the first real command, it stops parsing directives.
- Each directive uses the same long-form flag you would pass to sbatch on the command line. #SBATCH --time=1:00:00 is equivalent to sbatch --time=1:00:00 my_job.sh.
- Command-line flags override directives in the script, which is useful for one-off overrides without editing the file: sbatch --time=4:00:00 my_job.sh.
- The shebang (#!/bin/bash) must be the first line. SBATCH directives come immediately after.
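Because a directive placed after the first command is silently ignored, it can help to check a script before submitting it. The sketch below is one possible check, not a SLURM feature: it writes a throwaway job script with a deliberately misplaced directive and uses awk to report any #SBATCH line that follows an executable command. The file contents and variable names are made up for illustration.

```shell
# Create a throwaway job script with one misplaced directive (line 5).
job_script=$(mktemp)
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --time=0:10:00
echo "starting"
#SBATCH --mem=4G
EOF

# Print every #SBATCH line that appears after the first executable command.
misplaced=$(awk '
    /^#SBATCH/           { if (seen_cmd) print NR ": " $0; next }
    /^[[:space:]]*(#|$)/ { next }   # skip shebang, comments, blank lines
                         { seen_cmd = 1 }
' "$job_script")

echo "Ignored directives: ${misplaced:-none}"
rm -f "$job_script"
```

Here the --mem request on line 5 is reported, since SLURM would stop reading directives at the echo command.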
Group related directives together (resources, output paths, notifications) and add a blank line between groups. It makes the script much easier to scan.
4.3.2 Resource Directives Reference
| Directive | Example | Description |
|---|---|---|
| --job-name | --job-name=hello | Label shown in squeue output |
| --nodes | --nodes=1 | Number of compute nodes |
| --ntasks | --ntasks=1 | Number of parallel tasks (MPI ranks) |
| --cpus-per-task | --cpus-per-task=2 | CPU cores per task (use for threaded tools) |
| --mem | --mem=4G | Total RAM for the job |
| --partition | --partition=shared | Queue to submit to |
| --output | --output=out.%j.log | Standard output file (%j = job ID) |
| --error | --error=err.%j.log | Standard error file |
| --time | --time=2:00:00 | Wall-clock time limit (HH:MM:SS) |
See Chapter 7 Job Performance for information on how to estimate resource needs for your job.
4.3.3 Environment Setup
HPC systems use modules to manage software versions. In your job script, load the module for each tool your job needs before running it. Examples:
$ module load python # system-provided python
$ module load samtools/1.17 # load samtools version 1.17
$ module load apptainer # run containerized software
For more information on modules, see Chapter 2.4.3.
Shared BRC container images live in /rs1/shares/brc/admin/containers/images. See Chapter 14 Loading BRC Modules.
SLURM sets several environment variables inside every job. Reference them in your job script as needed:
| Variable | Value |
|---|---|
| $SLURM_JOB_ID | Unique job ID |
| $SLURM_CPUS_PER_TASK | Cores allocated (matches --cpus-per-task) |
| $SLURM_MEM_PER_NODE | Memory allocated in MB |
| $SLURM_SUBMIT_DIR | Directory where sbatch was run |
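A common use of these variables is to keep a tool's thread count in step with the cores SLURM actually allocated. The sketch below assumes a threaded tool that honors OMP_NUM_THREADS; the fallback values (1, local, and the current directory) are assumptions chosen so the same lines also run outside a job, for example during a quick test:

```shell
# Use SLURM's values when present; otherwise fall back to safe defaults.
threads="${SLURM_CPUS_PER_TASK:-1}"
job_id="${SLURM_JOB_ID:-local}"
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"

# Many OpenMP-based tools read their thread count from this variable.
export OMP_NUM_THREADS="$threads"

echo "Job ${job_id}: ${threads} thread(s), submitted from ${submit_dir}"
```

Reading the core count from $SLURM_CPUS_PER_TASK means that changing --cpus-per-task in the directives is enough; the command lines below it do not need editing.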
4.4 Job Monitoring and Management
| Task | Command |
|---|---|
| Submit job | sbatch job.sh |
| List your jobs | squeue -u $USER |
| Detailed job info | scontrol show job JOBID |
| Cancel a job | scancel JOBID |
| Cancel all your jobs | scancel -u $USER |
| Modify a pending job | scontrol update JobId=JOBID TimeLimit=NEW_HH:MM:SS |
| Node/partition status | sinfo |
To receive additional information about job progress without being logged onto Hazel, use the following SBATCH directives in your job script to send you email updates:
#SBATCH --mail-user=<unityid>@ncsu.edu
#SBATCH --mail-type=ALL
Options for mail-type:
| Option | Timing |
|---|---|
| BEGIN | Email at job start |
| END | Email at successful job completion |
| FAIL | Email at job failure |
| ALL | BEGIN + END + FAIL + REQUEUE, plus INVALID_DEPEND and STAGE_OUT |
| NONE | No emails (default) |
| REQUEUE | Email if job is requeued |
4.4.1 Reading squeue Output
JOBID PARTITION NAME USER ST TIME NODES NODELIST
948851 shared hello_world uid R 0:23 1 node042
948852 shared bigrun uid PD 0:00 1 (Resources)
Status codes: R = Running · PD = Pending · CG = Completing
When a job is pending, (Resources) means nodes are busy. (Priority) means your job is waiting behind higher-priority submissions.
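When many jobs are queued, a quick tally by status code is easier to read than the full listing. The following sketch counts jobs per state from the ST column; the sample text is the listing shown above, stored in a variable so the snippet runs without access to the cluster:

```shell
# Sample squeue listing (same rows as above), used here in place of live output.
squeue_output='JOBID PARTITION NAME USER ST TIME NODES NODELIST
948851 shared hello_world uid R 0:23 1 node042
948852 shared bigrun uid PD 0:00 1 (Resources)'

# Skip the header row, then count occurrences of each code in column 5 (ST).
summary=$(printf '%s\n' "$squeue_output" |
    awk 'NR > 1 { n[$5]++ } END { printf "running=%d pending=%d\n", n["R"], n["PD"] }')

echo "$summary"
```

On the cluster, the same awk command can read from squeue -u $USER instead of the sample variable.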
4.4.2 Why is my job pending?
$ squeue -j JOBID --format="%i %T %r" # %r = reason the job is pending
Common reasons:
- Resources: cluster at capacity; your job will start when nodes free up
- Priority: other jobs have higher priority
- QOSMaxJobsPerUser: you’ve hit the per-user job limit
- ReqNodeNotAvail: the resources you requested don’t exist or aren’t available (check your directives)
4.5 Partitions (Queues)
SLURM organizes nodes into partitions. In most cases, omit --partition and let SLURM choose based on your resource request.
$ sinfo # show all partitions
$ sinfo -p shared # details for one partition
Typical partitions on Hazel:
| Partition | Purpose |
|---|---|
TODO: Add partition info once SLURM transition is finalized.
4.6 Standard Output and Error
SLURM separates program output into two streams:
- stdout (--output): normal results and print statements
- stderr (--error): warnings and error messages
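The split between the two streams can be reproduced in plain shell. In this sketch the redirections play the role of --output and --error; the job ID 12345, the file names, and the temporary directory standing in for logs/ are all hypothetical:

```shell
# Hypothetical job ID and a temporary directory standing in for logs/.
job_id=12345
log_dir=$(mktemp -d)
out_file="${log_dir}/analysis.${job_id}.out"
err_file="${log_dir}/analysis.${job_id}.err"

# One line goes to each stream; "> file" captures stdout, "2> file" captures stderr.
{
    echo "result: 42"
    echo "warning: low coverage" >&2
} > "$out_file" 2> "$err_file"

out_text=$(cat "$out_file")
err_text=$(cat "$err_file")
echo "stdout log: ${out_text}"
echo "stderr log: ${err_text}"
rm -rf "$log_dir"
```

Keeping the streams in separate files means a failed run can be diagnosed from the error log alone, without scrolling through normal output.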
The %j token in filenames is replaced by the job ID at runtime:
#SBATCH --output=logs/analysis.%j.out
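#SBATCH --error=logs/analysis.%j.err
Always create the log directory before submitting. SLURM will not create it, and a job whose output path does not exist fails without writing a log:
$ mkdir -p logs
$ sbatch job.sh
4.7 Common Errors and Fixes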
When a job fails, the error file is almost always your first stop:
$ cat logs/analysis.12345.err
$ tail -n 50 logs/analysis.12345.err
For more detail on what SLURM recorded about the run (exit code, allocated resources, completion state), use sacct:
$ sacct -j 12345 --format=JobID,State,ExitCode,Elapsed,MaxRSS,ReqMem
Some common errors and fixes are shown below:
4.7.1 File or Directory Not Found
Error:
/bin/bash: reads_R1.fastq: No such file or directory
- Use absolute paths everywhere: /rs1/researchers/s/smith/data/reads_R1.fastq
- Verify files exist before submitting: ls -l reads_R1.fastq
- Note for relative paths: jobs run from the directory where you called sbatch.
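The "verify before submitting" step can be turned into a small pre-flight check that runs before sbatch. This is a sketch only: check_inputs is a hypothetical helper, and the demonstration uses a temporary file plus a path that is assumed not to exist.

```shell
# check_inputs (hypothetical helper): return non-zero if any listed file is missing.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "Missing input: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Demonstration: one file that exists and one path assumed not to exist.
present=$(mktemp)
if check_inputs "$present" "/no/such/dir/reads_R1.fastq" 2>/dev/null; then
    status="ok"
else
    status="missing"
fi
echo "Pre-flight check: ${status}"
rm -f "$present"
```

In a submission wrapper, the same test could guard the sbatch call so that a job with missing inputs is never queued.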
4.7.2 Out of Memory
Error:
slurmstepd: error: Detected 1 oom-kill event(s)
- Increase memory: #SBATCH --mem=16G
- Check what the job actually needed: seff JOBID (after it finishes)
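Once a comparable run has finished, its peak memory gives a starting point for the next request. The arithmetic below is a rough sketch: the MaxRSS value is hypothetical (sacct reports it in a form such as 3512344K), and the 25% headroom is an assumption rather than a site recommendation.

```shell
# Hypothetical peak memory from sacct's MaxRSS column, in KiB (reported as 3512344K).
max_rss_kb=3512344

# Add 25% headroom (an assumed margin) and round up to a whole number of GiB.
suggest_gb=$(awk -v kb="$max_rss_kb" 'BEGIN {
    gb = kb * 1.25 / 1048576
    whole = int(gb)
    if (whole < gb) whole++
    print whole
}')

echo "#SBATCH --mem=${suggest_gb}G"
```

For this sample value, about 3.35 GiB of measured use becomes a 5G request.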
4.7.3 Wall Time Exceeded
Error:
slurmstepd: error: Job 12345 exceeded time limit, sending SIGTERM
- Increase time limit: #SBATCH --time=8:00:00
- Test with a subset of data first to estimate real runtime
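A subset run can be scaled up to a first guess at the time limit. The sketch below assumes runtime grows roughly linearly with data size, which will not hold for every tool; the measured time, subset fraction, and 50% safety margin are all hypothetical values.

```shell
# Hypothetical measurements from a test run on a subset of the data.
subset_minutes=12     # wall-clock time of the subset run
subset_percent=5      # subset size as a percentage of the full data set
margin_percent=50     # assumed safety margin on top of the estimate

# Scale linearly to 100% of the data, then add the margin.
total_minutes=$(( subset_minutes * 100 / subset_percent * (100 + margin_percent) / 100 ))

# Format as H:MM:SS for the --time directive.
time_limit=$(printf '%d:%02d:00' $(( total_minutes / 60 )) $(( total_minutes % 60 )))
echo "#SBATCH --time=${time_limit}"
```

With these numbers the estimate is 240 minutes for the full data, or 6:00:00 after the margin is applied.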
4.7.4 Module Not Found
Error:
ERROR: Unable to locate a modulefile for '<module_name>'
- Search for the correct name: module av
- Check if module has access to the directory the module file is in with module path; if not, use module use /path/to/module/dir.
- Check if a prerequisite module must be loaded first
4.7.5 Permission Denied
Error:
./my_script.sh: Permission denied
Fix:
$ chmod +x my_script.sh
4.8 Interactive Jobs
Sometimes you may need to get on a compute node for an interactive session to test, debug, or use GUI applications. The srun --pty command gives you a shell directly on a compute node. Most SBATCH directives can also be passed to srun --pty as flags. Some examples are below:
# 1 core, 10 minutes
$ srun --pty -n 1 --time=0:10:00 bash
# 4 cores on a single node, 30 minutes
$ srun --pty --nodes=1 --ntasks=1 --cpus-per-task=4 --time=0:30:00 bash
4.9 Next Steps
For tips on writing scripts that fail more loudly and informatively, see the next chapter on best practices.