11  GNU Parallel on Login Nodes

Compute nodes on Hazel do not have internet access, so tasks that require it (downloading databases, fetching sequencing data from SRA) must run on a login node. However, login nodes are shared infrastructure: a resource-intensive command run directly on one can slow it down for every other user.

GNU Parallel solves this by letting you run multiple commands simultaneously while capping the number of concurrent processes. You control the load, stay within policy, and finish faster than running tasks sequentially.

Important

GNU Parallel is for login-node tasks that require internet access (downloads, installs). For actual analyses, submit jobs with sbatch.

11.1 What GNU Parallel Does

Without Parallel, downloading 100 SRA accessions means running prefetch 100 times, one after another. With Parallel, you specify -j 4 and it keeps up to 4 downloads running at a time: as one finishes, the next starts automatically. Progress reporting, job logging, and retries of failed jobs are available through options such as --progress, --joblog, and --retries.

11.2 Example 1: Downloading from SRA

Create a file named sra_accessions.txt that lists SRA accession numbers, one per line and nothing else:

SRR28473915
SRR28473916
SRR28473917
SRR28473918
SRR28473919
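
Parallel passes every input line to the command as an argument, so a hand-edited list that has picked up comment lines, blank lines, or duplicates would trigger pointless prefetch calls. The following is a minimal sketch of a cleanup step, not part of the original workflow; the scratch directory and the name sra_accessions.clean.txt are assumptions made for illustration.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Sample input: the accessions from above plus the kinds of stray lines
# that tend to creep into hand-edited lists.
cat > "$workdir/sra_accessions.txt" <<'EOF'
# sra_accessions.txt
SRR28473915
SRR28473916

SRR28473917
SRR28473916
SRR28473918
SRR28473919
EOF

# Drop comment lines and blank lines, then remove duplicates.
grep -Ev '^[[:space:]]*(#|$)' "$workdir/sra_accessions.txt" \
  | sort -u > "$workdir/sra_accessions.clean.txt"

# Count what is left; this is the number of downloads Parallel would start.
clean_count=$(wc -l < "$workdir/sra_accessions.clean.txt" | tr -d ' ')
echo "accessions to download: $clean_count"
```

The cleaned file can then take the place of sra_accessions.txt in the download scripts.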

Basic parallel download (3 simultaneous):

#!/bin/bash
TASKS=3
PARALLEL=/rs1/shares/brc/admin/tools/parallel-20250922/bin/parallel
SRA_SIF=/rs1/shares/brc/admin/containers/images/quay.io_biocontainers_sra-tools:3.2.1--h4304569_1.sif

module load apptainer

cat sra_accessions.txt \
  | $PARALLEL -j $TASKS \
      "apptainer exec $SRA_SIF prefetch {}" \
  > output.log 2> error.log

Note

Use double quotes "..." around the parallel command so that shell variables like $SRA_SIF expand before being passed to each task.
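
The difference between the two quoting styles can be checked without running any download. This is a small sketch; the image path is a placeholder, not the real container location.

```shell
# Placeholder path, used only to make the expansion visible.
SRA_SIF=/path/to/sra-tools.sif

# Double quotes: the current shell substitutes $SRA_SIF immediately.
double="apptainer exec $SRA_SIF prefetch {}"

# Single quotes: the text $SRA_SIF is kept literally.
single='apptainer exec $SRA_SIF prefetch {}'

printf 'double-quoted: %s\n' "$double"
printf 'single-quoted: %s\n' "$single"
```

With single quotes, the literal text $SRA_SIF reaches the shell that Parallel starts for each task, where it expands only if the variable has been exported.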

With progress monitoring and job logging:

#!/bin/bash
TASKS=3
PARALLEL=/rs1/shares/brc/admin/tools/parallel-20250922/bin/parallel
SRA_SIF=/rs1/shares/brc/admin/containers/images/quay.io_biocontainers_sra-tools:3.2.1--h4304569_1.sif
JOBLOG=/your_workdir/download_log.txt

module load apptainer

cat sra_accessions.txt \
  | $PARALLEL -j $TASKS --progress --joblog $JOBLOG \
      "apptainer exec $SRA_SIF prefetch {}" \
  > output.log 2> error.log

Option                       Effect
-j $TASKS                    Maximum number of concurrent jobs
--progress                   Live progress display while jobs run
--joblog $JOBLOG             Tab-separated file recording the exit status and timing of each task
> output.log 2> error.log    Capture stdout and stderr in separate files
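
Once a run has finished, the job log can be used to find the downloads that failed. The sketch below assumes the tab-separated layout documented for --joblog (Exitval in column 7, Command in column 9). The rows are made-up sample data, and sra.sif is a shortened placeholder for the container path.

```shell
JOBLOG=$(mktemp)

# Made-up sample rows in the joblog layout: one header line, then one row per job.
printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n' > "$JOBLOG"
printf '1\t:\t1700000000.000\t61.200\t0\t0\t0\t0\tapptainer exec sra.sif prefetch SRR28473915\n' >> "$JOBLOG"
printf '2\t:\t1700000000.100\t12.500\t0\t0\t1\t0\tapptainer exec sra.sif prefetch SRR28473916\n' >> "$JOBLOG"
printf '3\t:\t1700000000.200\t58.900\t0\t0\t0\t0\tapptainer exec sra.sif prefetch SRR28473917\n' >> "$JOBLOG"

# Skip the header, keep rows with a non-zero exit value, and print the last
# word of the command, which in these rows is the accession.
failed=$(awk -F'\t' 'NR > 1 && $7 != 0 { n = split($9, word, " "); print word[n] }' "$JOBLOG")

echo "failed accessions:"
echo "$failed"
rm -f "$JOBLOG"
```

Redirecting the result to a new accession file gives a ready-made input for a second, smaller download run.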

11.3 Example 2: Running a Command List

When you need to download from multiple sources (NCBI, Ensembl, SRA, S3), put every command in a file and let Parallel execute them with controlled concurrency.

# all_commands.txt — one command per line
wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt -O bacteria_summary.txt
curl -O https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
apptainer exec $SRA_SIF prefetch SRR123456
apptainer exec $SRA_SIF prefetch SRR123457
wget https://ftp.ncbi.nlm.nih.gov/blast/db/16S_ribosomal_RNA.tar.gz
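aws s3 cp s3://1000genomes/data/sample1.vcf.gz . --no-sign-request

Then run the list with a script such as:

#!/bin/bash
TASKS=6
PARALLEL=/rs1/shares/brc/admin/tools/parallel-20250922/bin/parallel
JOBLOG=/your_workdir/commands_log.txt
# Exported so that $SRA_SIF in all_commands.txt expands inside each job's shell
export SRA_SIF=/rs1/shares/brc/admin/containers/images/quay.io_biocontainers_sra-tools:3.2.1--h4304569_1.sif

module load apptainer

cat all_commands.txt \
  | $PARALLEL -j $TASKS --progress --joblog $JOBLOG \
  > output.log 2> error.log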

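The -a - flag is optional when piping from cat, because Parallel reads stdin by default. With no command given on the Parallel command line, each line of input is run as a command in its own right.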
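
Commands read from a file are expanded by the shell that runs each job, not by the script that launches Parallel, so a variable such as $SRA_SIF in all_commands.txt has to be in the environment. The sketch below illustrates the effect with a plain sh -c child shell standing in for a job; the image path is a placeholder.

```shell
# Start from a clean slate in case the variable is already set.
unset SRA_SIF

# One line as it might appear in a command file; single quotes keep it literal.
line='echo "image=${SRA_SIF:-UNSET}"'

# Set but not exported: the child shell cannot see the variable.
SRA_SIF=/path/to/sra-tools.sif
not_exported=$(sh -c "$line")

# Exported: the child shell inherits the variable.
export SRA_SIF
exported=$(sh -c "$line")

echo "without export: $not_exported"
echo "with export:    $exported"
```

In the launcher script this amounts to writing export SRA_SIF=... before the call to Parallel.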

11.4 Sending Processes to the Background

Since you’re running on a login node, you’ll want to keep using your terminal while downloads proceed.

Background immediately:

$ ./download.sh &

Start, then background:

$ ./download.sh     # start normally
# Press Ctrl+Z to suspend
$ bg                # resume in background
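
Depending on shell settings, a job started with & may still be terminated when the SSH session ends. One common way around this is nohup, sketched below with a stand-in download.sh created in a scratch directory; the script contents and the file name download.out are assumptions for illustration.

```shell
# Scratch directory with a stand-in for the real download script.
workdir=$(mktemp -d)
cat > "$workdir/download.sh" <<'EOF'
#!/bin/sh
echo "download started"
EOF

# nohup makes the job ignore the hangup signal sent at logout.
# Output is redirected so it ends up in a known file instead of nohup.out.
nohup sh "$workdir/download.sh" > "$workdir/download.out" 2>&1 &
echo "started PID $!"

# Only for this demonstration: wait for the job, then show its output.
wait $!
cat "$workdir/download.out"
```

For a real run the equivalent would be nohup ./download.sh > download.out 2>&1 &, after which it is safe to log out.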

Save the PID for later:

$ ./download.sh &
$ echo $! > download.pid
$ echo "Running as PID $(cat download.pid)"
$ echo "To kill: kill $(cat download.pid)"

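Check or kill a background process:

$ ps aux | grep '[d]ownload.sh'   # find the process (the brackets keep grep from matching itself)
$ kill [PID]                      # ask it to stop (SIGTERM)
$ kill -9 [PID]                   # force-stop it only if it ignores SIGTERM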
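
The saved PID can also be used to check on the job and stop it from a script. The following is a rough sketch in which sleep 30 stands in for ./download.sh; the helper is_running and the temporary PID file are names introduced here for illustration.

```shell
PIDFILE=$(mktemp)

# Stand-in for ./download.sh so the example is safe to run anywhere.
sleep 30 &
echo $! > "$PIDFILE"
pid=$(cat "$PIDFILE")

# kill -0 sends no signal; it only reports whether the process exists.
is_running() { kill -0 "$1" 2>/dev/null; }

if is_running "$pid"; then before=running; else before=stopped; fi

# Ask the process to stop (SIGTERM), then wait for the shell to reap it.
kill "$pid"
wait "$pid" 2>/dev/null || true

if is_running "$pid"; then after=running; else after=stopped; fi

echo "before: $before, after: $after"
rm -f "$PIDFILE"
```

Stopping the launcher script does not necessarily stop downloads that Parallel has already started, so it is worth checking with ps afterwards.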