
fasterp

High-performance FASTQ preprocessing in Rust. A drop-in replacement for fastp with blazing-fast performance.

Installation

cargo install fasterp

Or build from source:

git clone https://github.com/drbh/fasterp
cd fasterp
cargo build --release

Quick Start

# Single-end
fasterp -i input.fq -o output.fq

# Paired-end
fasterp -i R1.fq -I R2.fq -o out1.fq -O out2.fq

# With quality filtering
fasterp -i input.fq -o output.fq -q 20 -l 50

# With adapter trimming
fasterp -i input.fq -o output.fq -a AGATCGGAAGAGC

# Multi-threaded with memory limit
fasterp -i large.fq -o output.fq -w 8 --max-memory 4096

Quick Reference

Copy-paste commands for common tasks:

| Task | Command |
|---|---|
| Basic cleanup | `fasterp -i in.fq -o out.fq` |
| Strict quality filter | `fasterp -i in.fq -o out.fq -q 20 -u 30 -l 50` |
| Aggressive trimming | `fasterp -i in.fq -o out.fq --cut-front --cut-tail --cut-mean-quality 20` |
| Remove adapters only | `fasterp -i in.fq -o out.fq -a AGATCGGAAGAGC` |
| Paired-end basic | `fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq` |
| Paired-end + merge | `fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq -m --merged-out merged.fq` |
| Paired-end + correction | `fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq -c` |
| Fast (16 threads) | `fasterp -i in.fq -o out.fq -w 16` |
| Memory-limited (4 GB) | `fasterp -i in.fq -o out.fq --max-memory 4096` |
| Compressed in/out | `fasterp -i in.fq.gz -o out.fq.gz` |
| Split into 10 files | `fasterp -i in.fq -o out.fq -s 10` |
| Skip all filtering | `fasterp -i in.fq -o out.fq -A -G -Q -L` |
| With UMI extraction | `fasterp -i in.fq -o out.fq --umi --umi-len 8` |
| Deduplicate | `fasterp -i in.fq -o out.fq -D` |
| Custom report name | `fasterp -i in.fq -o out.fq -j report.json --html report.html` |

Pro tip: Combine flags freely. Example for production pipeline:

fasterp -i R1.fq.gz -I R2.fq.gz -o o1.fq.gz -O o2.fq.gz \
  -q 20 -l 50 --cut-tail --cut-mean-quality 20 \
  -c -w 16 --max-memory 8192

CLI Reference

Input/Output

FASTQ files contain sequence data as text. Each “read” is a short string of letters (A, C, G, T) representing a fragment of genetic material, plus a confidence score for each letter. fasterp reads these files, processes them, and writes cleaned versions.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-i` | `--in1` | | Input FASTQ file or URL (use `-` for stdin; supports `.gz` and `http(s)://`) |
| `-o` | `--out1` | | Output FASTQ file (use `-` for stdout; a `.gz` extension enables compression) |
| `-I` | `--in2` | | Read2 input file or URL (paired-end mode; supports `.gz` and `http(s)://`) |
| `-O` | `--out2` | | Read2 output file (paired-end mode) |
| | `--unpaired1` | | Output file for unpaired read1 |
| | `--unpaired2` | | Output file for unpaired read2 |
| | `--interleaved-in` | | Input is interleaved paired-end |

Flow: input.fq.gz → process → output.fq.gz + report.json
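
The record structure fasterp parses can be seen in a minimal hand-made file (the name `example.fq` is arbitrary):

```shell
# A FASTQ record is four lines: name, bases, a '+' separator, and one
# quality character per base (here 'I', a high-confidence score).
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > example.fq
cat example.fq
```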

Length Filtering

Reads that are too short often come from errors or failed sequencing. Think of it like filtering out incomplete sentences - if a read is shorter than your minimum threshold, it’s probably not useful data and gets discarded.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-l` | `--length-required` | 15 | Minimum length required |
| `-L` | `--disable-length-filter` | | Disable length filtering (fastp compatibility) |
| `-b` | `--max-len1` | 0 | Maximum length for read1; trim if longer (0 = disabled) |
| `-B` | `--max-len2` | 0 | Maximum length for read2; trim if longer (0 = disabled) |

Decision: read length ≥ 15 → keep; otherwise → discard.
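
The filter's core decision can be sketched with awk on a two-read toy file. This mimics `-l 15`; it is not fasterp's actual code:

```shell
# keep only records whose sequence is at least 15 bases long
printf '@r1\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n@r2\nACGT\n+\nIIII\n' > in.fq
awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
     NR%4==0{if (length(s) >= 15) print h "\n" s "\n" p "\n" $0}' in.fq
```

Only `@r1` (20 bases) survives; `@r2` (4 bases) is discarded.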

Quality Filtering

Each letter in a read has a confidence score (0-40+) indicating how certain the machine was about that letter. Higher is better. Quality filtering removes reads with too many low-confidence letters, too many unknown letters (N), or repetitive patterns that suggest errors. It’s like spell-checking - removing text that’s too garbled to trust.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-q` | `--qualified-quality-phred` | 15 | Quality threshold for a base to count as qualified (Phred ≥ this) |
| `-u` | `--unqualified-percent-limit` | 40 | Percent of bases allowed to be unqualified (0-100) |
| `-e` | `--average-qual` | 0 | Average quality threshold; discard if mean < this (0 = disabled) |
| `-n` | `--n-base-limit` | 5 | Maximum number of N bases allowed |
| `-y` | `--low-complexity-filter` | | Enable low-complexity filter |
| `-Y` | `--complexity-threshold` | 30 | Complexity threshold (0-100); 30 = 30% complexity required |

Decision: N bases ≤ 5 and mean quality ≥ threshold → pass; otherwise → discard.
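
Quality characters map to scores via the common Phred+33 encoding: the score is the character's ASCII code minus 33. A quick check in shell:

```shell
# 'I' is ASCII 73, so it encodes Phred quality 73 - 33 = 40
q='I'
score=$(( $(printf '%d' "'$q") - 33 ))
echo "$score"
```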

Trimming - Fixed Position

Sometimes the beginning or end of reads contain known bad data (like primer sequences or low-quality regions). Fixed trimming removes a set number of characters from the start or end of every read - like cutting off the first and last few words of every sentence because you know they’re always wrong.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--trim-front` | 0 | Trim N bases from the 5' (front) end |
| | `--trim-tail` | 0 | Trim N bases from the 3' (tail) end |
| `-f` | `--trim-front1` | 0 | Trim N bases from the front of read1 |
| `-t` | `--trim-tail1` | 0 | Trim N bases from the tail of read1 |
| `-F` | `--trim-front2` | 0 | Trim N bases from the front of read2 |
| `-T` | `--trim-tail2` | 0 | Trim N bases from the tail of read2 |

Before: 5'─ NNN ACGTACGTACGT NN ─3'
After (-f 3 -t 2): 5'─ ACGTACGTACGT ─3'
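
The diagram's trim can be reproduced with awk as an illustration of what `-f 3 -t 2` does (not fasterp's implementation):

```shell
# drop 3 bases from the front and 2 from the tail of both the sequence
# line and the quality line
printf '@r1\nNNNACGTACGTACGTNN\n+\n###IIIIIIIIIIII##\n' > in.fq
awk 'NR%4==2 || NR%4==0 {$0 = substr($0, 4, length($0) - 5)} {print}' in.fq
```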

Trimming - Quality Based

Instead of cutting a fixed amount, this looks at actual quality scores and trims where the data gets bad. A sliding window moves along the read, checking average quality. When it drops below your threshold, everything past that point is cut off. It’s like editing a document and stopping where the text becomes unreadable.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--cut-mean-quality` | 0 | Quality cutoff for sliding-window trimming (0 = disabled) |
| | `--cut-window-size` | 4 | Sliding-window size for quality trimming |
| | `--cut-front` | | Enable quality trimming at the 5' (front) end |
| | `--cut-tail` | | Enable quality trimming at the 3' (tail) end |
| | `--disable-quality-trimming` | | Disable polyG/quality tail trimming |

Example (window = 4, cutoff = 20): quality scores 10 12 14 35 36 38 37 35 20 15 10 8 → keep the region where the sliding-window mean ≥ 20; cut the low-quality tail.
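
The window means for the example scores above can be computed directly; windows below the cutoff mark where trimming would cut (a sketch of the idea, not fasterp's code):

```shell
# print the mean of each 4-score window and whether it passes the cutoff 20
echo "10 12 14 35 36 38 37 35 20 15 10 8" |
awk -v W=4 -v Q=20 '{for (i = 1; i <= NF - W + 1; i++) {
  s = 0; for (j = i; j < i + W; j++) s += $j
  printf "window %d mean %.2f %s\n", i, s / W, (s / W >= Q ? "keep" : "cut")}}'
```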

PolyG/PolyX Trimming

Some sequencing machines produce false runs of the same letter (like “GGGGGGGG”) at the end of reads when they run out of real signal. These aren’t real data - they’re artifacts. This feature detects and removes these repetitive tails, like removing a stuck key’s output from a document.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--trim-poly-g` | | Enable polyG tail trimming |
| `-G` | `--disable-trim-poly-g` | | Disable polyG tail trimming |
| | `--trim-poly-x` | | Enable generic polyX tail trimming (any homopolymer) |
| | `--poly-g-min-len` | 10 | Minimum length for polyG/polyX detection |

Before: ACGTACGT GGGGGGGGGGGG (polyG tail)
After: ACGTACGT (NovaSeq artifact removed)
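
The effect can be approximated with a single sed substitution (an illustration only; fasterp's detection also handles mismatches within the run):

```shell
# strip a trailing run of 10 or more G's, matching --poly-g-min-len 10
printf 'ACGTACGTGGGGGGGGGGGG\n' | sed -E 's/G{10,}$//'
# prints ACGTACGT
```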

Adapter Trimming

Adapters are short synthetic sequences added during sample preparation - they’re not part of the actual sample. They appear at the ends of reads and must be removed, like stripping metadata headers from a file. fasterp can auto-detect common adapters or you can specify them manually.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-A` | `--disable-adapter-trimming` | | Disable adapter trimming |
| `-a` | `--adapter-sequence` | | Adapter for read1 (auto-detected if not specified) |
| | `--adapter-sequence-r2` | | Adapter sequence for read2 |
| | `--disable-adapter-detection` | | Disable adapter auto-detection |

Auto-detect: sample reads → k-mer analysis → find adapter sequence
Before: ACGTACGT AGATCGGAAGAGC (adapter)
After: ACGTACGT
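
For an exact match, the trim amounts to cutting at the first adapter occurrence. A sed sketch of the idea (real adapter trimming also handles partial matches at the read end):

```shell
# cut everything from the first exact adapter match onward
printf 'ACGTACGTAGATCGGAAGAGC\n' | sed 's/AGATCGGAAGAGC.*$//'
# prints ACGTACGT
```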

Paired-End Options

Paired-end sequencing reads the same fragment from both ends, giving you two reads that overlap in the middle. When they overlap, you can compare them: if one says “A” with high confidence and the other says “T” with low confidence, you trust the “A”. You can also merge overlapping pairs into a single, longer read. It’s like having two people transcribe the same audio and combining their best parts.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-c` | `--correction` | | Enable base correction using overlap analysis |
| | `--overlap-len-require` | 30 | Minimum overlap length required for correction |
| | `--overlap-diff-limit` | 5 | Maximum allowed differences in the overlap region |
| | `--overlap-diff-percent-limit` | 20 | Maximum allowed difference percentage (0-100) |
| `-m` | `--merge` | | Merge overlapping paired-end reads into single reads |
| | `--merged-out` | | Output file for merged reads |
| | `--include-unmerged` | | Write unmerged reads to the merged output file |

Overlap detection: R1 and R2 read the same fragment from opposite ends and share an overlap region in the middle.
Base correction (-c): R1 reads A (Q30) vs R2 reads T (Q15) → use A.
Merge (-m): the overlapping pair becomes a single longer read.
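
The correction rule at a disagreeing overlap position reduces to a quality comparison, as in this toy sketch:

```shell
# at a mismatch in the overlap, keep the base with the higher quality
b1=A q1=30 b2=T q2=15
if [ "$q1" -ge "$q2" ]; then keep=$b1; else keep=$b2; fi
echo "corrected base: $keep"
# prints: corrected base: A
```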

UMI Processing

A UMI (Unique Molecular Identifier) is a short random tag added to each original molecule before copying. Since copies of the same molecule have the same UMI, you can later identify and remove duplicates that came from the same original. Think of it like adding a unique serial number to each document before photocopying - you can then tell which copies came from which original.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--umi` | | Enable UMI preprocessing |
| | `--umi-loc` | read1 | UMI location: read1/read2/index1/index2/per_read/per_index |
| | `--umi-len` | 0 | UMI length (required when `--umi` is enabled) |
| | `--umi-prefix` | UMI | Prefix in the read name for the UMI |
| | `--umi-skip` | 0 | Skip N bases before the UMI |

Before: the first 8 bp of the read are the UMI (e.g. ACGTACGT).
After: the UMI is removed from the sequence and recorded in the read name (e.g. @READ_NAME UMI:ACGTACGT).
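
The extraction step can be sketched with awk: move the first 8 bases out of the sequence and into the name, as `--umi --umi-len 8` would. The `:` separator here is illustrative, not necessarily fasterp's exact name format:

```shell
printf '@r1\nACGTACGTTTTTCCCC\n+\nIIIIIIIIIIIIIIII\n' > in.fq
# take the 8 bp UMI off the sequence and quality and append it to the name
awk 'NR%4==1{h=$0} NR%4==2{u=substr($0,1,8); s=substr($0,9)} NR%4==3{p=$0}
     NR%4==0{print h ":" u "\n" s "\n" p "\n" substr($0, 9)}' in.fq
```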

Deduplication

During sample preparation, molecules get copied many times. Identical reads are likely copies of the same original, which can skew your analysis (like counting the same vote multiple times). Deduplication detects and removes these duplicates using a memory-efficient probabilistic filter. It estimates duplication rates without needing to store every read in memory.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-D` | `--dedup` | | Enable deduplication to remove duplicate reads |
| | `--dup-calc-accuracy` | 0 | Deduplication accuracy (1-6); higher = more memory |
| | `--dont-eval-duplication` | | Don't evaluate the duplication rate (saves time/memory) |

Flow: hash read → already in Bloom filter? yes → discard; no → add to filter and keep.
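
The first-seen-wins logic can be sketched with an exact seen-hash in awk. fasterp instead uses a memory-bounded Bloom filter, which trades exactness for constant memory (it can rarely misjudge a new read as already seen):

```shell
# keep only the first record for each distinct sequence
printf '@a\nACGT\n+\nIIII\n@b\nACGT\n+\nIIII\n@c\nTTTT\n+\nIIII\n' > in.fq
awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
     NR%4==0{if (!(s in seen)) {seen[s] = 1; print h "\n" s "\n" p "\n" $0}}' in.fq
```

`@b` duplicates `@a`'s sequence and is dropped; `@a` and `@c` survive.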

Output Splitting

Large datasets can be split into multiple smaller files for parallel downstream processing. You can split by total file count (distribute reads evenly across N files) or by lines per file (each file gets up to N lines). This is useful when you want to process chunks in parallel on different machines.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-s` | `--split` | | Split output by limiting the total number of files (2-999) |
| `-S` | `--split-by-lines` | | Split output by limiting lines per file (≥ 1000) |
| `-d` | `--split-prefix-digits` | 4 | Digits for file-number padding (1-10) |

Flow: input.fq → out.0001.fq, out.0002.fq, out.0003.fq, …
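
Splitting by lines resembles the standard `split` tool, shown here on a tiny 3-record file. Note that fasterp pads file numbers numerically (per `-d`), while `split`'s default suffixes are alphabetic:

```shell
printf '@a\nAC\n+\nII\n@b\nAC\n+\nII\n@c\nAC\n+\nII\n' > in.fq
# one 4-line record per output file; the 'chunk.' prefix is arbitrary
split -l 4 -a 4 in.fq chunk.
ls chunk.*
```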

Reporting

fasterp generates detailed statistics about your data before and after processing: total reads, quality distributions, how many were filtered out and why. The JSON report is machine-readable for pipelines; the HTML report has interactive charts for visual inspection.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-j` | `--json` | fasterp.json | JSON report file |
| | `--html` | fasterp.html | HTML report file |
| | `--stats-format` | compact | Stats output format: compact, pretty, off, or jsonl |

Performance

Control how fasterp uses your system resources. More threads = faster processing but more CPU/memory usage. The batch size affects memory consumption - larger batches are more efficient but use more RAM. Use --max-memory to set a limit and let fasterp auto-tune the other settings.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-w` | `--thread` | auto | Number of worker threads |
| `-z` | `--compression` | 6 | Compression level for gzip output (0-9) |
| | `--parallel-compression` | true | Use parallel compression for gzip output |
| | `--batch-bytes` | 33554432 | Batch size in bytes (32 MiB) |
| | `--max-backlog` | threads+1 | Maximum backlog of batches |
| | `--max-memory` | | Maximum memory usage in MB |
| | `--skip-kmer-counting` | | Skip k-mer counting for performance |

Architecture

fasterp processes data in a pipeline with three stages running in parallel. The Producer reads chunks from disk. Multiple Workers process those chunks simultaneously (filtering, trimming, counting). The Merger collects results in order and writes output. This design keeps all CPU cores busy while maintaining correct output order.

Pipeline: Producer (read & parse, 32 MB batches) → Workers 1..N (process) → Merger (reorder, write, aggregate stats) → output.fq + report.json

fasterp uses a 3-stage pipeline for multi-threaded processing:

  1. Producer: Reads input in 32MB batches, parses FASTQ records
  2. Workers: Process batches in parallel (filtering, trimming, stats)
  3. Merger: Writes output in order, aggregates statistics

Key optimizations:

  • SIMD acceleration (AVX2/NEON) for quality statistics
  • Zero-copy buffers with Arc<Vec<u8>>
  • Lookup tables for base/quality checks
  • Bloom filter for duplication detection

Examples

Quality Control Only

fasterp -i input.fq -o output.fq -A -G

Aggressive Trimming

fasterp -i input.fq -o output.fq \
  --cut-front --cut-tail --cut-mean-quality 20 \
  -q 20 -u 30 -l 50

Paired-End with Merging

fasterp -i R1.fq -I R2.fq -o out1.fq -O out2.fq \
  -m --merged-out merged.fq -c

High-Throughput Pipeline

fasterp -i large.fq.gz -o output.fq.gz \
  -w 16 --max-memory 8192 -z 4

With UMI

fasterp -i input.fq -o output.fq \
  --umi --umi-loc read1 --umi-len 12

Comparison with fastp

fasterp produces byte-exact output for standard parameters:

fastp -i input.fq -o fastp_out.fq
fasterp -i input.fq -o fasterp_out.fq
sha256sum fastp_out.fq fasterp_out.fq  # hashes match

Tutorial: Compare fasterp and fastp

This tutorial walks through downloading real sequencing data, running both fasterp and fastp, and comparing their outputs and reports.

Prerequisites

  • fasterp installed (cargo install --git https://github.com/drbh/fasterp.git)
  • fastp installed (installation guide)
  • curl for downloading files
  • jq (optional) for inspecting JSON reports

Step 1: Download Test Data

Download paired-end FASTQ files from SRX2987343 - a chromatin accessibility study on mouse hair follicle stem cells, examining how DNA packaging changes during differentiation into hair-related cell types.

# Create a working directory
mkdir -p fasterp_tutorial && cd fasterp_tutorial

# Download R1 (forward reads) - 365 MB
curl -LO https://genedata.dholtz.com/SRX2987343/SRR5808766_1.fastq.gz

# Download R2 (reverse reads) - 364 MB
curl -LO https://genedata.dholtz.com/SRX2987343/SRR5808766_2.fastq.gz

# Verify downloads
ls -lh *.gz

Expected output:

-rw-r--r--  1 user  staff   365M  SRR5808766_1.fastq.gz
-rw-r--r--  1 user  staff   364M  SRR5808766_2.fastq.gz

Step 2: Run fastp

Process the paired-end data with fastp:

fastp \
  -i SRR5808766_1.fastq.gz -I SRR5808766_2.fastq.gz \
  -o fastp_out_R1.fq.gz -O fastp_out_R2.fq.gz \
  -j fastp_report.json \
  -h fastp_report.html

Expected output:

Read1 before filtering:
total reads: 9799076
total bases: 499752876
Q20 bases: 494840792(99.0171%)
Q30 bases: 490032403(98.0549%)
Q40 bases: 333003632(66.6337%)

Read2 before filtering:
total reads: 9799076
total bases: 499752876
Q20 bases: 491032411(98.255%)
Q30 bases: 485393847(97.1268%)
Q40 bases: 335965361(67.2263%)

Read1 after filtering:
total reads: 9761787
total bases: 485985544
Q20 bases: 481444894(99.0657%)
Q30 bases: 476939326(98.1386%)
Q40 bases: 324904111(66.8547%)

Read2 after filtering:
total reads: 9761787
total bases: 485985544
Q20 bases: 479080210(98.5791%)
Q30 bases: 473663787(97.4646%)
Q40 bases: 327920585(67.4754%)

Filtering result:
reads passed filter: 19523574
reads failed due to low quality: 74384
reads failed due to too many N: 194
reads failed due to too short: 0
reads with adapter trimmed: 3113628
bases trimmed due to adapters: 23734734

Duplication rate: 28.3885%

Insert size peak (evaluated by paired-end reads): 51

JSON report: fastp_report.json
HTML report: fastp_report.html

fastp v1.0.1, time used: 11 seconds

Step 3: Run fasterp

Process the same data with fasterp:

fasterp \
  -i SRR5808766_1.fastq.gz -I SRR5808766_2.fastq.gz \
  -o fasterp_out_R1.fq.gz -O fasterp_out_R2.fq.gz \
  -j fasterp_report.json \
  --html fasterp_report.html

Expected output:

Detecting adapter sequence for read1...
No adapter detected for read1

Detecting adapter sequence for read2...
No adapter detected for read2

Read1 before filtering:
total reads: 9799076
total bases: 499752876
Q20 bases: 494840792(99.0171%)
Q30 bases: 490032403(98.0549%)
Q40 bases: 333003632(66.6337%)

Read2 before filtering:
total reads: 9799076
total bases: 499752876
Q20 bases: 491032411(98.2550%)
Q30 bases: 485393847(97.1268%)
Q40 bases: 335965361(67.2263%)

Read1 after filtering:
total reads: 9761787
total bases: 485985544
Q20 bases: 481444894(99.0657%)
Q30 bases: 476939326(98.1386%)
Q40 bases: 324904111(66.8547%)

Read2 after filtering:
total reads: 9761787
total bases: 485985544
Q20 bases: 479080210(98.5791%)
Q30 bases: 473663787(97.4646%)
Q40 bases: 327920585(67.4754%)

Filtering result:
reads passed filter: 19523574
reads failed due to low quality: 74384
reads failed due to too many N: 194
reads failed due to too short: 0
reads with adapter trimmed: 3113628
bases trimmed due to adapters: 23734734

Duplication rate: 28.3885%

Insert size peak (evaluated by paired-end reads): 51

JSON report: fasterp_report.json
HTML report: fasterp_report.html

fasterp v0.1.0, time used: 6 seconds

Note: fasterp processed ~10 million read pairs in 6 seconds vs fastp’s 11 seconds.

Step 4: Compare Outputs

Verify identical FASTQ output

When comparing gzipped files, the compressed hashes may differ due to compression metadata. Compare the decompressed content:

# Compare decompressed R1 content
gunzip -c fastp_out_R1.fq.gz | shasum -a 256
gunzip -c fasterp_out_R1.fq.gz | shasum -a 256

# Compare decompressed R2 content
gunzip -c fastp_out_R2.fq.gz | shasum -a 256
gunzip -c fasterp_out_R2.fq.gz | shasum -a 256

Expected output:

716d6bc9b5aa075e5f7ff527b1638de1ea56b67439b7a7646bfe25fb14d132e7  -
716d6bc9b5aa075e5f7ff527b1638de1ea56b67439b7a7646bfe25fb14d132e7  -

c28e4e14c58ded4f4a7b776f6fb9f92cabc83056253f0ff4e268e5db1653b656  -
c28e4e14c58ded4f4a7b776f6fb9f92cabc83056253f0ff4e268e5db1653b656  -

The hashes match - both tools produced identical output.

Compare key statistics

# Total reads before/after filtering
echo "=== fastp ==="
jq '{
  before: .summary.before_filtering.total_reads,
  after: .summary.after_filtering.total_reads,
  passed_rate: .summary.after_filtering.total_reads / .summary.before_filtering.total_reads * 100
}' fastp_report.json

echo "=== fasterp ==="
jq '{
  before: .summary.before_filtering.total_reads,
  after: .summary.after_filtering.total_reads,
  passed_rate: .summary.after_filtering.total_reads / .summary.before_filtering.total_reads * 100
}' fasterp_report.json

Expected output:

=== fastp ===
{
  "before": 19598152,
  "after": 19523574,
  "passed_rate": 99.61946412090282
}

=== fasterp ===
{
  "before": 19598152,
  "after": 19523574,
  "passed_rate": 99.61946412090282
}

Compare k-mer counts

# Top k-mers should be identical
echo "=== fastp k-mers ==="
jq '.read1_before_filtering.kmer_count | to_entries | sort_by(-.value) | .[0:5]' fastp_report.json

echo "=== fasterp k-mers ==="
jq '.read1_before_filtering.kmer_count | to_entries | sort_by(-.value) | .[0:5]' fasterp_report.json

Expected output:

=== fastp k-mers ===
[
  { "key": "CTGTC", "value": 1651445 },
  { "key": "TGTCT", "value": 1651009 },
  { "key": "TCTCT", "value": 1576849 },
  { "key": "GTCTC", "value": 1358458 },
  { "key": "CTCTT", "value": 1314738 }
]

=== fasterp k-mers ===
[
  { "key": "CTGTC", "value": 1651445 },
  { "key": "TGTCT", "value": 1651009 },
  { "key": "TCTCT", "value": 1576849 },
  { "key": "GTCTC", "value": 1358458 },
  { "key": "CTCTT", "value": 1314738 }
]

Step 5: View Reports

Open the HTML reports in your browser to compare visualizations:

# macOS
open fastp_report.html fasterp_report.html

# Linux
xdg-open fastp_report.html && xdg-open fasterp_report.html

# Or start a local server
python3 -m http.server 8000
# Then visit http://localhost:8000

Example Reports

Here are the actual reports generated from this tutorial:

| Tool | Report |
|---|---|
| fastp | fastp_report.html |
| fasterp | fasterp_report.html |

Compare these side-by-side to see identical quality metrics, base content graphs, and k-mer distributions.

Cleanup

# Remove generated files
rm -f *.gz fastp_report.* fasterp_report.*
cd .. && rmdir fasterp_tutorial

Next Steps

  • Try with your own FASTQ data
  • Experiment with different filtering parameters
  • Check the CLI Reference for all available options

Playground

fasterp WASM Sandbox