7 Job Performance
7.1 Why Resource Estimation Matters
| Too little | Too much |
|---|---|
| Job fails or times out | Longer queue wait |
| Must resubmit, wasting time | Cluster resources sit idle |
| Wastes your fair-share allocation | Reduces your future priority |
The goal is to request what you need with a reasonable buffer — not to over-provision.
7.2 Estimating Resources Before Your First Run
7.2.1 Cores
- Serial code: request 1 core
- Multi-threaded (OpenMP, pthreads): typically 4–16 cores; check the tool’s documentation
- MPI: can span multiple nodes; test scaling before committing large core counts
- GPU: request a node in the gpu partition
Always start small: run on 1–4 cores, measure actual runtime, then decide whether more parallelism helps.
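A minimal sketch of the start-small advice: the helper below only prints one sbatch command per core count, so the commands can be reviewed before anything is submitted. The function name and the scale_Nc job names are hypothetical; ./test_run.sh is the test script used elsewhere in this chapter.

```bash
# Print (do not submit) one sbatch command per core count in the 1-4 range.
scaling_test_commands() {
    local script=$1 cores
    for cores in 1 2 4; do
        # --job-name is only a label so the runs are easy to tell apart later
        echo "sbatch --time=0:30:00 --ntasks=1 --cpus-per-task=${cores} --job-name=scale_${cores}c ${script}"
    done
}

scaling_test_commands ./test_run.sh
```

Once the printed commands look right, they can be pasted into the terminal one at a time and the runtimes compared with seff.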
7.2.2 Memory
| Method | How |
|---|---|
| Check documentation | Many tools list minimum/recommended RAM |
| Look at past jobs | sacct shows actual peak memory usage |
| Run a small test | Use a subset of data, then check with seff |
| Estimate from data | If a tool loads input as a matrix and makes N copies: memory ≈ N × file size |
Available node memory configurations on Hazel (GB): 64 / 128 / 192 / 256 / 512* / 1024**
* Limited outside partner queues · ** Rare; expect long waits
Request slightly below the node maximum — the OS also needs RAM. To target a 128 GB node, request --mem=120G.
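The estimate-from-data rule and the node sizes above can be combined in a rough sketch. The function names are hypothetical, and the 15/16 usable fraction is an assumption generalized from the single 128 GB → 120G example; the real headroom on Hazel may differ.

```bash
# memory ≈ N × file size, rounded up to whole GB (1 GB taken as 1024^3 bytes)
estimate_mem_gb() {
    local bytes=$1 copies=$2
    local gb=$(( 1024 * 1024 * 1024 ))
    echo $(( (bytes * copies + gb - 1) / gb ))
}

# Value to pass to --mem for a given node size.
# ASSUMPTION: about 15/16 of node RAM is usable (128 GB node -> 120G).
node_mem_request() {
    local node_gb=$1
    echo "$(( node_gb * 15 / 16 ))G"
}

# Smallest listed node size whose usable memory covers the estimate.
pick_node_gb() {
    local need_gb=$1 node
    for node in 64 128 192 256 512 1024; do
        if (( need_gb <= node * 15 / 16 )); then
            echo "$node"
            return 0
        fi
    done
    return 1   # larger than any listed node
}
```

For example, a 20 GB input that a tool copies four times gives estimate_mem_gb 21474836480 4, which can then be passed to pick_node_gb and node_mem_request to get a starting --mem value.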
7.2.3 Time
Run a small test job and extrapolate:
# Run a quick test
$ sbatch --time=0:30:00 -n 1 --cpus-per-task=4 ./test_run.sh
# Check how long it actually took after it finishes
$ seff JOBID
Add a 20–30% buffer over your measured time. For data-dependent runtimes, linear scaling is a reasonable first assumption (2× data ≈ 2× time), though some algorithms are super-linear and need a larger buffer.
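The extrapolation can be written as a small helper. This is a sketch under the linear-scaling assumption stated above; the function name and argument order are illustrative only.

```bash
# estimate_walltime TEST_SECONDS SCALE BUFFER_PCT
#   TEST_SECONDS  measured wall-clock time of the small test
#   SCALE         how many times larger the full data set is (linear assumption)
#   BUFFER_PCT    safety margin in percent, e.g. 25
estimate_walltime() {
    local test_seconds=$1 scale=$2 buffer_pct=$3
    local total=$(( test_seconds * scale * (100 + buffer_pct) / 100 ))
    # Print as HH:MM:SS, a format --time accepts
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}
```

A test on a 10% sample that took 504 seconds, with a 25% buffer, would be estimate_walltime 504 10 25, and the result can go straight into --time.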
7.3 Monitoring Running Jobs
$ squeue -u $USER # all your jobs and their state
$ squeue -j JOBID # status of one job
$ squeue -j JOBID -o "%i %T %r" # job ID, state, and the reason a pending job is waiting
$ scontrol show job JOBID # full details: nodes, resources, time left
7.3.1 Job States
| Code | Meaning |
|---|---|
| PD | Pending (waiting for resources) |
| R | Running |
| CG | Completing (cleaning up) |
| CD | Completed successfully (in sacct) |
| F | Failed (in sacct) |
| TO | Timed out |
| OOM | Out of memory |
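As a rough sketch of how these states can drive a script, the loop below polls squeue until the job is no longer pending, running, or completing. The function name and the POLL_SECONDS variable are hypothetical; %T is squeue's long-form state (PENDING, RUNNING, ...). It only looks at the first line of output, so job arrays would need extra handling.

```bash
# Block until JOBID has left the active states, then report the last state seen.
wait_for_job() {
    local jobid=$1 state
    while true; do
        # -h drops the header; -o "%T" prints only the long state name
        state=$(squeue -h -j "$jobid" -o "%T" 2>/dev/null | head -n 1)
        case "$state" in
            PENDING|CONFIGURING|RUNNING|SUSPENDED|COMPLETING)
                sleep "${POLL_SECONDS:-30}" ;;   # still active: wait and poll again
            *)
                break ;;                         # finished, or no longer known to squeue
        esac
    done
    echo "job ${jobid} left the queue (last state seen: ${state:-none})"
}
```

Usage would be wait_for_job JOBID followed by seff JOBID. Polling every 30 seconds or slower keeps the load on the scheduler low.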
7.4 Analyzing a Finished Job
7.4.1 seff — Quick Efficiency Report
seff JOBID is the fastest way to see whether your resource requests were appropriate:
Job ID: 948851
Cluster: hazel
Nodes: 1
Cores per node: 4
CPU Utilized: 00:28:43
CPU Efficiency: 85.5% of 00:33:36 core-walltime
Job Wall-clock time: 00:08:24
Memory Utilized: 3.52 GB
Memory Efficiency: 88.0% of 4.00 GB
What to look for:
- CPU Efficiency < 50%: you requested more cores than the tool can use; reduce --cpus-per-task
- Memory Efficiency < 50%: halve your --mem for future runs
- Memory Efficiency > 95%: you were close to OOM; increase --mem
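The three checks above can be applied automatically by filtering seff's output. This is a sketch that assumes the line layout shown in the sample report (a percentage in the third whitespace-separated field); the function name is hypothetical.

```bash
# Read a seff report on stdin and apply the 50% / 95% rules of thumb.
seff_check() {
    awk '
        /^CPU Efficiency:/    { cpu = $3 + 0; have_cpu = 1 }   # "85.3%" -> 85.3
        /^Memory Efficiency:/ { mem = $3 + 0; have_mem = 1 }
        END {
            if (have_cpu) {
                if (cpu < 50) printf "CPU %.1f%%: reduce --cpus-per-task\n", cpu
                else          printf "CPU %.1f%%: ok\n", cpu
            }
            if (have_mem) {
                if (mem < 50)      printf "Memory %.1f%%: halve --mem\n", mem
                else if (mem > 95) printf "Memory %.1f%%: close to OOM, increase --mem\n", mem
                else               printf "Memory %.1f%%: ok\n", mem
            }
        }'
}
```

It would be used as seff JOBID | seff_check.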
7.4.2 sacct — Detailed Accounting
# Basic resource summary for one job
$ sacct -j JOBID \
--format=JobID,Elapsed,CPUTime,MaxRSS,AveRSS,ReqMem,AllocCPUs
# CSV output for scripting
$ sacct -j JOBID \
--format=JobID,Partition,Elapsed,CPUTime,MaxRSS,AllocCPUs \
--parsable2 --delimiter=',' --noheader
# All your jobs from the past week
$ sacct --starttime=$(date -d '7 days ago' +%Y-%m-%d) \
--format=JobID,JobName,State,Elapsed,MaxRSS,CPUTime
Key sacct fields:
| Field | Meaning |
|---|---|
| Elapsed | Wall-clock time (actual runtime) |
| CPUTime | Elapsed × AllocCPUs (core time allocated and charged, not CPU time actually used) |
| MaxRSS | Peak RAM used by any single task in a job step (resident set size) |
| AveRSS | Average RAM across all tasks in a job step |
| ReqMem | Memory you requested |
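MaxRSS is reported per job step and usually carries a unit suffix such as K, so comparing it with ReqMem takes a small conversion. The sketch below assumes input in the form JobID|MaxRSS, as produced with --parsable2 --noheader --format=JobID,MaxRSS; the function name is hypothetical, and values without a suffix are treated as bytes, which is an assumption.

```bash
# Print the largest MaxRSS across all job steps, in GB with two decimals.
peak_rss_gb() {
    awk -F'|' '
        function to_gb(v,    n, unit) {
            n = v + 0                        # numeric part, e.g. 3691012 from "3691012K"
            unit = substr(v, length(v))      # last character is the unit suffix
            if (unit == "K") return n / (1024 * 1024)
            if (unit == "M") return n / 1024
            if (unit == "G") return n
            if (unit == "T") return n * 1024
            return n / (1024 * 1024 * 1024)  # ASSUMPTION: no suffix means bytes
        }
        $2 != "" { gb = to_gb($2); if (gb > peak) peak = gb }
        END { printf "%.2f\n", peak }
    '
}
```

A typical call would be sacct -j JOBID --parsable2 --noheader --format=JobID,MaxRSS | peak_rss_gb. For multi-task jobs this is the peak of a single task, not the sum over tasks.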
7.5 Performance Red Flags
| Symptom | Likely Cause | Fix |
|---|---|---|
| CPU efficiency < 30% | Tool isn’t using all cores | Reduce --cpus-per-task |
| Memory efficiency < 20% | Way over-requested | Halve --mem |
| Job OOM-killed | Under-requested memory | Double --mem, then tune with seff |
| Job timed out | Under-estimated runtime | Run small test first |
| High I/O wait | Too many tiny files on scratch | Bundle files into archives; use local node scratch /tmp |
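The state-based rows of the table can be turned into a small triage helper. This is only a sketch: the function names are hypothetical, the state strings are the long names sacct prints (OUT_OF_MEMORY, TIMEOUT, COMPLETED), and the memory argument is assumed to be the value you originally passed to --mem.

```bash
# Double a memory request such as 8G or 4000M; a bare number is doubled as-is.
double_mem() {
    local req=$1
    case "$req" in
        *[KMGT]) echo "$(( ${req%?} * 2 ))${req: -1}" ;;
        *)       echo "$(( req * 2 ))" ;;
    esac
}

# Suggest a next step from the final job state and the original --mem value.
retry_hint() {
    local state=$1 req_mem=$2
    case "$state" in
        OUT_OF_MEMORY) echo "resubmit with --mem=$(double_mem "$req_mem"), then tune with seff" ;;
        TIMEOUT)       echo "re-estimate runtime with a small test before raising --time" ;;
        COMPLETED)     echo "no retry needed; check efficiency with seff" ;;
        *)             echo "inspect the job output and error logs" ;;
    esac
}
```

The state could come from sacct -j JOBID -X --noheader --parsable2 --format=State, for example retry_hint "$state" 8G.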
7.6 Exercise: Profile a Job
# 1. Submit a test job
$ sbatch --time=0:15:00 --ntasks=1 --cpus-per-task=4 --mem=8G ./test_program.sh
# 2. Note the job ID from the output, then monitor
$ squeue -u $USER
# 3. After it finishes, get the efficiency report
$ seff JOBID
# 4. Get detailed accounting
$ sacct -j JOBID --format=JobID,Elapsed,MaxRSS,AllocCPUs
Analyze:
- Was CPU efficiency above 70%?
- Was memory efficiency above 60%?
- What would you change for the production run?
7.7 Workflow for Job Optimization
- Start small: test with a single sample and conservative resources
- Measure: use seff and sacct to see actual usage
- Identify the bottleneck: CPU, memory, I/O, or scaling limit
- Scale: increase resources or sample count incrementally
- Document: record optimal settings in your config.sh
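For the last step, a config.sh might look like the sketch below. The variable names are hypothetical, and the values simply echo the sample seff report and exercise in this chapter rather than a recommendation.

```bash
# config.sh (hypothetical layout; variable names are illustrative)
# Record where the numbers came from so they can be re-checked later:
# tool and version, input size, and the job ID whose seff report was used.
JOB_CPUS=4        # CPU efficiency was about 85% at 4 cores
JOB_MEM=4G        # peak usage 3.52 GB, memory efficiency 88%
JOB_TIME=0:15:00  # measured wall-clock 00:08:24 plus buffer

# Example use from a submit wrapper:
#   source config.sh
#   sbatch --cpus-per-task="$JOB_CPUS" --mem="$JOB_MEM" --time="$JOB_TIME" ./test_program.sh
```

Options given on the sbatch command line take precedence over #SBATCH lines in the script, so the same job script can be reused with different recorded settings.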
7.8 Common Job Profiles
CPU-bound (aligners, variant callers): Scale well with cores up to the tool’s thread limit. Find that limit with a core-scaling test (1, 2, 4, 8 cores) and plot runtime vs. cores.
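Before plotting, the core-scaling test can be summarized as speedup and parallel efficiency. The sketch below reads lines of "cores seconds" on stdin and uses the first line as the baseline; the function name is hypothetical, and the timings in the usage line are made-up numbers for illustration.

```bash
# Input: one "CORES SECONDS" pair per line, smallest core count first.
# speedup    = baseline time / this time
# efficiency = speedup scaled by the ratio of baseline cores to these cores
scaling_report() {
    awk '
        NR == 1 { base_cores = $1; base_time = $2 }
        {
            speedup = base_time / $2
            eff = 100 * speedup * base_cores / $1
            printf "%d cores, %d s, speedup %.2f, efficiency %.0f%%\n", $1, $2, speedup, eff
        }'
}

# Hypothetical timings from a 1/2/4/8-core test:
printf '1 1000\n2 520\n4 290\n8 250\n' | scaling_report
```

The core count at which efficiency drops sharply is a reasonable estimate of the tool's practical thread limit.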
Memory-bound (de novo assembly, large reference loading): RAM is the limiting factor. Cores help less; choose a high-memory node.
I/O-bound (many small file operations): More cores don’t help. Minimize file open/close operations; stage data locally if your cluster provides /tmp on compute nodes; use scratch storage for intermediates.
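Staging to node-local storage could look like the sketch below. It assumes compute nodes offer a writable /tmp (or set TMPDIR); the function name and the result.out file name are hypothetical. SLURM_JOB_ID is set by Slurm inside a job and falls back to "manual" elsewhere.

```bash
# stage_and_run INPUT OUTDIR COMMAND [ARGS...]
# Copies INPUT to node-local scratch, runs COMMAND on the local copy,
# copies the result back to OUTDIR, and removes the scratch directory.
stage_and_run() {
    local input=$1 outdir=$2
    shift 2
    local work status=0
    work=$(mktemp -d "${TMPDIR:-/tmp}/job_${SLURM_JOB_ID:-manual}_XXXXXX") || return 1
    cp "$input" "$work/" || { rm -rf "$work"; return 1; }
    # Run inside the local directory so intermediates stay off shared storage
    ( cd "$work" && "$@" "$(basename "$input")" > result.out ) || status=$?
    mkdir -p "$outdir" && cp "$work/result.out" "$outdir/"
    rm -rf "$work"
    return "$status"
}
```

Inside a job script this might be called as stage_and_run /path/to/input.dat "$PWD/results" my_tool, where my_tool stands for whatever program reads the input file.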
7.9 Practical Tips
- Test with subsets of data — a 10% sample usually exposes errors and gives usable timing data
- Share resource findings with your research group — optimal settings for common tools rarely change
- Don’t over-optimize — spending 2 hours tuning a job that runs in 20 minutes has diminishing returns
- Check output files for built-in timing summaries — many tools (BWA, STAR, GATK) print runtime stats