11 GNU Parallel on Login Nodes
Compute nodes on Hazel do not have internet access, so tasks that require it — downloading databases, fetching sequencing data from SRA — must run on a login node. However, login nodes are shared infrastructure; running resource-intensive commands directly can affect all other users.
GNU Parallel solves this by letting you run multiple commands simultaneously while capping the number of concurrent processes. You control the load, stay within policy, and finish faster than running tasks sequentially.
GNU Parallel is for login-node tasks that require internet access (downloads, installs). For actual analyses, submit jobs with sbatch.
11.1 What GNU Parallel Does
Without Parallel, downloading 100 SRA accessions means running prefetch 100 times in sequence. With Parallel, you specify -j 4 and it keeps 4 downloads running at all times — as one finishes, the next starts automatically. Progress, logging, and retry-on-failure are built in.
11.2 Example 1: Downloading from SRA
Create a file named sra_accessions.txt listing SRA accession numbers, one per line:
SRR28473915
SRR28473916
SRR28473917
SRR28473918
SRR28473919
Basic parallel download (3 simultaneous):
#!/bin/bash
TASKS=3
PARALLEL=/rs1/shares/brc/admin/tools/parallel-20250922/bin/parallel
SRA_SIF=/rs1/shares/brc/admin/containers/images/quay.io_biocontainers_sra-tools:3.2.1--h4304569_1.sif
module load apptainer
cat sra_accessions.txt \
| $PARALLEL -j $TASKS \
"apptainer exec $SRA_SIF prefetch {}" \
> output.log 2> error.log
Use double quotes "..." around the parallel command so that shell variables like $SRA_SIF expand before being passed to each task.
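The quoting rule can be checked without GNU Parallel. The sketch below uses `bash -c` as a stand-in for the shell that runs each task; the variable name `SIF` and the path `/tmp/example.sif` are placeholders chosen for illustration, not values from this guide.

```bash
#!/bin/bash
# Hypothetical, non-exported variable standing in for $SRA_SIF.
unset SIF
SIF=/tmp/example.sif

# Double quotes: the calling shell expands $SIF before the child shell starts.
double=$(bash -c "echo image=$SIF")

# Single quotes: the child shell receives the literal text $SIF and expands it
# itself; the variable was never exported, so it is empty there.
single=$(bash -c 'echo image=$SIF')

echo "double: $double"   # double: image=/tmp/example.sif
echo "single: $single"   # single: image=
```

If a command must be single-quoted, or is read from a file, exporting the variable (`export SIF=...`) makes it visible to the child shell as well.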
With progress monitoring and job logging:
#!/bin/bash
TASKS=3
PARALLEL=/rs1/shares/brc/admin/tools/parallel-20250922/bin/parallel
SRA_SIF=/rs1/shares/brc/admin/containers/images/quay.io_biocontainers_sra-tools:3.2.1--h4304569_1.sif
JOBLOG=/your_workdir/download_log.txt
module load apptainer
cat sra_accessions.txt \
| $PARALLEL -j $TASKS --progress --joblog $JOBLOG \
"apptainer exec $SRA_SIF prefetch {}" \
> output.log 2> error.log

| Option | Effect |
|---|---|
| -j $TASKS | Maximum concurrent jobs |
| --progress | Live progress report (jobs running and completed) |
| --joblog $JOBLOG | Tab-separated file recording the exit status and timing of each task |
| > output.log 2> error.log | Capture stdout and stderr separately |
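Once a run finishes, the job log can be scanned for tasks that did not succeed. The helper below is a minimal sketch, not part of GNU Parallel: the function name `list_failed` is hypothetical, and it assumes the default joblog layout of one header line followed by tab-separated columns Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command, with no tab characters inside the command itself.

```bash
#!/bin/bash
# list_failed (hypothetical helper): print the command of every task whose
# Exitval (column 7) or Signal (column 8) is non-zero.
# Assumes the default joblog layout described above.
list_failed() {
    awk -F'\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $9 }' "$1"
}

# Usage, with the log path from the script above:
#   list_failed /your_workdir/download_log.txt
```

GNU Parallel also documents a --retry-failed option that works with --joblog to rerun such tasks; check the man page of the installed version before relying on it.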
11.3 Example 2: Running a Command List
When you need to download from multiple sources (NCBI, Ensembl, SRA, S3), put every command in a file and let Parallel execute them with controlled concurrency.
# all_commands.txt — one command per line
wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt -O bacteria_summary.txt
curl -O https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
apptainer exec $SRA_SIF prefetch SRR123456
apptainer exec $SRA_SIF prefetch SRR123457
wget https://ftp.ncbi.nlm.nih.gov/blast/db/16S_ribosomal_RNA.tar.gz
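aws s3 cp s3://1000genomes/data/sample1.vcf.gz . --no-sign-request
#!/bin/bash
TASKS=6
PARALLEL=/rs1/shares/brc/admin/tools/parallel-20250922/bin/parallel
JOBLOG=/your_workdir/commands_log.txt
# Export SRA_SIF so the $SRA_SIF references in all_commands.txt resolve inside each task
export SRA_SIF=/rs1/shares/brc/admin/containers/images/quay.io_biocontainers_sra-tools:3.2.1--h4304569_1.sif
module load apptainer
cat all_commands.txt \
| $PARALLEL -j $TASKS --progress --joblog $JOBLOG \
> output.log 2> error.log
The -a - flag is optional when piping from cat; Parallel reads stdin by default.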
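Because a command list mixes several tools, it can help to confirm that each one is on PATH before launching anything. The following is a rough pre-flight sketch; the function name `check_commands` is hypothetical, and it only inspects the first word of each line, so it does not validate arguments, URLs, or variables such as $SRA_SIF.

```bash
#!/bin/bash
# check_commands (hypothetical helper): report tools named in a command list
# that cannot be found on PATH. Returns 1 if any tool is missing, 0 otherwise.
check_commands() {
    local file=$1 missing=0 tool
    # Read the first word of each line; the trailing test handles a final
    # line that lacks a newline.
    while read -r tool _ || [[ -n $tool ]]; do
        # Skip blank lines and comment lines.
        [[ -z $tool || $tool == \#* ]] && continue
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done < "$file"
    return $missing
}

# Usage: run after "module load apptainer" so module-provided tools are visible.
#   check_commands all_commands.txt && echo "all tools found"
```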
11.4 Sending Processes to the Background
Since you’re running on a login node, you’ll want to keep using your terminal while downloads proceed.
Background immediately:
$ ./download.sh &
Start, then background:
$ ./download.sh # start normally
# Press Ctrl+Z to suspend
$ bg # resume in background
Save the PID for later:
$ ./download.sh &
$ echo $! > download.pid
$ echo "Running as PID $(cat download.pid)"
$ echo "To kill: kill $(cat download.pid)"
Check or kill a background process:
$ ps aux | grep download.sh # find the process
$ kill -9 [PID] # force-stop it
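When the PID has been saved to a file, checking on the job does not require searching the process table. The sketch below assumes a pid file such as download.pid written as shown earlier; the function name `is_running` is hypothetical. It relies on `kill -0`, which sends no signal and only tests whether the process exists and can be signalled by you.

```bash
#!/bin/bash
# is_running (hypothetical helper): succeed if the PID stored in the given
# pid file refers to a live process owned by the current user.
is_running() {
    local pidfile=$1 pid
    [[ -r $pidfile ]] || return 1
    pid=$(<"$pidfile")
    # Require a plain number, then probe the process without signalling it.
    [[ $pid =~ ^[0-9]+$ ]] && kill -0 "$pid" 2>/dev/null
}

# Usage:
#   is_running download.pid && echo "still downloading" || echo "finished or stopped"
```

PIDs are eventually reused by the system, so a stale pid file from an old session can point at an unrelated process; remove the file once the download has finished.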