# fasterp

High-performance FASTQ preprocessing in Rust. A drop-in replacement for fastp.

## Installation

```bash
cargo install fasterp
```

Or build from source:

```bash
git clone https://github.com/drbh/fasterp
cd fasterp
cargo build --release
```

## Quick Start

```bash
# Single-end
fasterp -i input.fq -o output.fq

# Paired-end
fasterp -i R1.fq -I R2.fq -o out1.fq -O out2.fq

# With quality filtering
fasterp -i input.fq -o output.fq -q 20 -l 50

# With adapter trimming
fasterp -i input.fq -o output.fq -a AGATCGGAAGAGC

# Multi-threaded with memory limit
fasterp -i large.fq -o output.fq -w 8 --max-memory 4096
```

## Quick Reference

Copy-paste commands for common tasks:

| Task | Command |
|---|---|
| Basic cleanup | `fasterp -i in.fq -o out.fq` |
| Strict quality filter | `fasterp -i in.fq -o out.fq -q 20 -u 30 -l 50` |
| Aggressive trimming | `fasterp -i in.fq -o out.fq --cut-front --cut-tail --cut-mean-quality 20` |
| Remove adapters only | `fasterp -i in.fq -o out.fq -a AGATCGGAAGAGC` |
| Paired-end basic | `fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq` |
| Paired-end + merge | `fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq -m --merged-out merged.fq` |
| Paired-end + correction | `fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq -c` |
| Fast (16 threads) | `fasterp -i in.fq -o out.fq -w 16` |
| Memory-limited (4GB) | `fasterp -i in.fq -o out.fq --max-memory 4096` |
| Compressed in/out | `fasterp -i in.fq.gz -o out.fq.gz` |
| Split into 10 files | `fasterp -i in.fq -o out.fq -s 10` |
| Skip all filtering | `fasterp -i in.fq -o out.fq -A -G -Q -L` |
| With UMI extraction | `fasterp -i in.fq -o out.fq --umi --umi-len 8` |
| Deduplicate | `fasterp -i in.fq -o out.fq -D` |
| Custom report name | `fasterp -i in.fq -o out.fq -j report.json --html report.html` |

Pro tip: flags can be combined freely. Example production pipeline:

```bash
fasterp -i R1.fq.gz -I R2.fq.gz -o o1.fq.gz -O o2.fq.gz \
  -q 20 -l 50 --cut-tail --cut-mean-quality 20 \
  -c -w 16 --max-memory 8192
```

## CLI Reference

### Input/Output

FASTQ files contain sequence data as text. Each “read” is a short string of letters (A, C, G, T) representing a fragment of genetic material, plus a confidence score for each letter. fasterp reads these files, processes them, and writes cleaned versions.
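
For reference, each FASTQ record spans four lines: an `@`-prefixed read ID, the raw sequence, a `+` separator, and one quality character per base. The record below is illustrative:

```
@read_001
GATTTGGGGTTCAAAGCAGT
+
IIIIIIIIIIIIIIIIIIII
```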

| Flag | Long | Default | Description |
|---|---|---|---|
| `-i` | `--in1` | | Input FASTQ file or URL (use `-` for stdin; supports .gz and http(s)://) |
| `-o` | `--out1` | | Output FASTQ file (use `-` for stdout; .gz extension enables compression) |
| `-I` | `--in2` | | Read2 input file or URL (paired-end mode; supports .gz and http(s)://) |
| `-O` | `--out2` | | Read2 output file (paired-end mode) |
| | `--unpaired1` | | Output file for unpaired read1 |
| | `--unpaired2` | | Output file for unpaired read2 |
| | `--interleaved-in` | | Input is interleaved paired-end |

### Length Filtering

Reads that are too short often come from errors or failed sequencing. Think of it like filtering out incomplete sentences: if a read is shorter than your minimum threshold, it’s probably not useful data and gets discarded.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-l` | `--length-required` | 15 | Minimum length required |
| `-L` | `--disable-length-filter` | | Disable length filtering (fastp compatibility) |
| `-b` | `--max-len1` | 0 | Maximum length for read1; trim if longer (0 = disabled) |
| `-B` | `--max-len2` | 0 | Maximum length for read2; trim if longer (0 = disabled) |
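
For example, to drop reads shorter than 50 bases and trim read1 down to at most 150 bases (values are illustrative):

```bash
fasterp -i input.fq -o output.fq -l 50 -b 150
```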

### Quality Filtering

Each letter in a read has a confidence score (0-40+) indicating how certain the machine was about that letter. Higher is better. Quality filtering removes reads with too many low-confidence letters, too many unknown letters (N), or repetitive patterns that suggest errors. It’s like spell-checking: removing text that’s too garbled to trust.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-q` | `--qualified-quality-phred` | 15 | Quality threshold for a base to count as qualified (phred >= this) |
| `-u` | `--unqualified-percent-limit` | 40 | Percent of bases allowed to be unqualified (0-100) |
| `-e` | `--average-qual` | 0 | Average quality threshold; discard if mean < this (0 = disabled) |
| `-n` | `--n-base-limit` | 5 | Maximum number of N bases allowed |
| `-y` | `--low-complexity-filter` | | Enable the low-complexity filter |
| `-Y` | `--complexity-threshold` | 30 | Complexity threshold (0-100); 30 = 30% complexity required |
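
A stricter filter might require phred >= 20, allow at most 30% unqualified bases and 2 N bases, and enable the low-complexity filter (thresholds are illustrative):

```bash
fasterp -i input.fq -o output.fq -q 20 -u 30 -n 2 -y
```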

### Trimming - Fixed Position

Sometimes the beginning or end of reads contain known bad data (like primer sequences or low-quality regions). Fixed trimming removes a set number of characters from the start or end of every read, like cutting off the first and last few words of every sentence because you know they’re always wrong.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--trim-front` | 0 | Trim N bases from the 5’ (front) end |
| | `--trim-tail` | 0 | Trim N bases from the 3’ (tail) end |
| `-f` | `--trim-front1` | 0 | Trim N bases from the front of read1 |
| `-t` | `--trim-tail1` | 0 | Trim N bases from the tail of read1 |
| `-F` | `--trim-front2` | 0 | Trim N bases from the front of read2 |
| `-T` | `--trim-tail2` | 0 | Trim N bases from the tail of read2 |
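
For instance, to cut 5 bases from the front and 3 from the tail of read1 while leaving read2 untouched (counts are illustrative):

```bash
fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq -f 5 -t 3
```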

### Trimming - Quality Based

Instead of cutting a fixed amount, this looks at actual quality scores and trims where the data gets bad. A sliding window moves along the read, checking average quality. When it drops below your threshold, everything past that point is cut off. It’s like editing a document and stopping where the text becomes unreadable.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--cut-mean-quality` | 0 | Mean-quality cutoff for sliding-window trimming (0 = disabled) |
| | `--cut-window-size` | 4 | Sliding-window size for quality trimming |
| | `--cut-front` | | Enable quality trimming at the 5’ (front) end |
| | `--cut-tail` | | Enable quality trimming at the 3’ (tail) end |
| | `--disable-quality-trimming` | | Disable polyG/quality tail trimming |
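
For example, trimming both ends with the default 4-base window and a mean-quality cutoff of 20 (the cutoff is illustrative):

```bash
fasterp -i input.fq -o output.fq --cut-front --cut-tail \
  --cut-mean-quality 20 --cut-window-size 4
```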

### PolyG/PolyX Trimming

Some sequencing machines produce false runs of the same letter (like “GGGGGGGG”) at the end of reads when they run out of real signal. These aren’t real data; they’re artifacts. This feature detects and removes these repetitive tails, like removing a stuck key’s output from a document.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--trim-poly-g` | | Enable polyG tail trimming |
| `-G` | `--disable-trim-poly-g` | | Disable polyG tail trimming |
| | `--trim-poly-x` | | Enable generic polyX tail trimming (any homopolymer) |
| | `--poly-g-min-len` | 10 | Minimum length for polyG/polyX detection |
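
For example, to trim any homopolymer tail (not just polyG) once it reaches 8 bases (the threshold is illustrative; the default is 10):

```bash
fasterp -i input.fq -o output.fq --trim-poly-x --poly-g-min-len 8
```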

### Adapter Trimming

Adapters are short synthetic sequences added during sample preparation; they’re not part of the actual sample. They appear at the ends of reads and must be removed, like stripping metadata headers from a file. fasterp can auto-detect common adapters or you can specify them manually.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-A` | `--disable-adapter-trimming` | | Disable adapter trimming |
| `-a` | `--adapter-sequence` | | Adapter sequence for read1 (auto-detected if not specified) |
| | `--adapter-sequence-r2` | | Adapter sequence for read2 |
| | `--disable-adapter-detection` | | Disable adapter auto-detection |
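
To supply explicit adapters for both reads instead of relying on detection (the sequence shown is the common Illumina adapter stub, used for both reads here purely as an illustration):

```bash
fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq \
  -a AGATCGGAAGAGC --adapter-sequence-r2 AGATCGGAAGAGC
```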

### Paired-End Options

Paired-end sequencing reads the same fragment from both ends, giving you two reads that overlap in the middle. When they overlap, you can compare them: if one says “A” with high confidence and the other says “T” with low confidence, you trust the “A”. You can also merge overlapping pairs into a single, longer read. It’s like having two people transcribe the same audio and combining their best parts.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-c` | `--correction` | | Enable base correction using overlap analysis |
| | `--overlap-len-require` | 30 | Minimum overlap length required for correction |
| | `--overlap-diff-limit` | 5 | Maximum allowed differences in the overlap region |
| | `--overlap-diff-percent-limit` | 20 | Maximum allowed difference percentage (0-100) |
| `-m` | `--merge` | | Merge overlapping paired-end reads into single reads |
| | `--merged-out` | | Output file for merged reads |
| | `--include-unmerged` | | Write unmerged reads to the merged output file |
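
For example, merging overlapping pairs while requiring a longer overlap and tolerating fewer mismatches than the defaults (values are illustrative):

```bash
fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq \
  -m --merged-out merged.fq -c --overlap-len-require 40 --overlap-diff-limit 3
```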

### UMI Processing

A UMI (Unique Molecular Identifier) is a short random tag added to each original molecule before copying. Since copies of the same molecule have the same UMI, you can later identify and remove duplicates that came from the same original. Think of it like adding a unique serial number to each document before photocopying: you can then tell which copies came from which original.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--umi` | | Enable UMI preprocessing |
| | `--umi-loc` | read1 | UMI location: read1/read2/index1/index2/per_read/per_index |
| | `--umi-len` | 0 | UMI length (required when `--umi` is enabled) |
| | `--umi-prefix` | UMI | Prefix for the UMI in the read name |
| | `--umi-skip` | 0 | Skip N bases before the UMI |
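
For instance, to extract an 8-base UMI from the start of read1 after skipping 2 bases (lengths here are illustrative):

```bash
fasterp -i input.fq -o output.fq --umi --umi-loc read1 --umi-len 8 --umi-skip 2
```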

### Deduplication

During sample preparation, molecules get copied many times. Identical reads are likely copies of the same original, which can skew your analysis (like counting the same vote multiple times). Deduplication detects and removes these duplicates using a memory-efficient probabilistic filter. It estimates duplication rates without needing to store every read in memory.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-D` | `--dedup` | | Enable deduplication to remove duplicate reads |
| | `--dup-calc-accuracy` | 0 | Deduplication accuracy (1-6); higher = more memory |
| | `--dont-eval-duplication` | | Don’t evaluate the duplication rate (saves time/memory) |
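
For example, deduplicating with a higher accuracy setting, which trades memory for precision (the value is illustrative):

```bash
fasterp -i input.fq -o output.fq -D --dup-calc-accuracy 3
```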

### Output Splitting

Large datasets can be split into multiple smaller files for parallel downstream processing. You can split by total file count (distribute reads evenly across N files) or by lines per file (each file gets up to N lines). This is useful when you want to process chunks in parallel on different machines.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-s` | `--split` | | Split output by limiting the total number of files (2-999) |
| `-S` | `--split-by-lines` | | Split output by limiting lines per file (>= 1000) |
| `-d` | `--split-prefix-digits` | 4 | Digits for file number padding (1-10) |
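
You can split by file count or by lines per file; both commands below use illustrative values:

```bash
# Distribute reads evenly across 8 output files
fasterp -i input.fq -o output.fq -s 8

# Or cap each output file at 4,000,000 lines, with file numbers padded to 3 digits
fasterp -i input.fq -o output.fq -S 4000000 -d 3
```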

### Reporting

fasterp generates detailed statistics about your data before and after processing: total reads, quality distributions, how many were filtered out and why. The JSON report is machine-readable for pipelines; the HTML report has interactive charts for visual inspection.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-j` | `--json` | fasterp.json | JSON report file |
| | `--html` | fasterp.html | HTML report file |
| | `--stats-format` | compact | Stats output format: compact, pretty, off, or jsonl |
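
For example, writing reports under custom names and selecting the pretty stats format (file names are illustrative):

```bash
fasterp -i input.fq -o output.fq -j qc.json --html qc.html --stats-format pretty
```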

### Performance

Control how fasterp uses your system resources. More threads mean faster processing but more CPU/memory usage. The batch size affects memory consumption: larger batches are more efficient but use more RAM. Use `--max-memory` to set a limit and let fasterp auto-tune the other settings.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-w` | `--thread` | auto | Number of worker threads |
| `-z` | `--compression` | 6 | Compression level for gzip output (0-9) |
| | `--parallel-compression` | true | Use parallel compression for gzip output |
| | `--batch-bytes` | 33554432 | Batch size in bytes (32 MiB) |
| | `--max-backlog` | threads+1 | Maximum backlog of batches |
| | `--max-memory` | | Maximum memory usage in MB |
| | `--skip-kmer-counting` | | Skip k-mer counting for performance |
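
For example, a throughput-oriented run with 16 threads, 64 MiB batches, a lighter gzip level, and an 8 GB memory cap (all values are illustrative):

```bash
fasterp -i large.fq.gz -o output.fq.gz \
  -w 16 --batch-bytes 67108864 -z 4 --max-memory 8192
```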

## Architecture

fasterp processes data in a pipeline with three stages running in parallel. The Producer reads chunks from disk. Multiple Workers process those chunks simultaneously (filtering, trimming, counting). The Merger collects results in order and writes output. This design keeps all CPU cores busy while maintaining correct output order.
fasterp uses a 3-stage pipeline for multi-threaded processing:
- Producer: Reads input in 32MB batches, parses FASTQ records
- Workers: Process batches in parallel (filtering, trimming, stats)
- Merger: Writes output in order, aggregates statistics
Key optimizations:
- SIMD acceleration (AVX2/NEON) for quality statistics
- Zero-copy buffers with `Arc<Vec<u8>>`
- Lookup tables for base/quality checks
- Bloom filter for duplication detection

## Examples

### Quality Control Only

```bash
fasterp -i input.fq -o output.fq -A -G
```

### Aggressive Trimming

```bash
fasterp -i input.fq -o output.fq \
  --cut-front --cut-tail --cut-mean-quality 20 \
  -q 20 -u 30 -l 50
```

### Paired-End with Merging

```bash
fasterp -i R1.fq -I R2.fq -o out1.fq -O out2.fq \
  -m --merged-out merged.fq -c
```

### High-Throughput Pipeline

```bash
fasterp -i large.fq.gz -o output.fq.gz \
  -w 16 --max-memory 8192 -z 4
```

### With UMI

```bash
fasterp -i input.fq -o output.fq \
  --umi --umi-loc read1 --umi-len 12
```

## Comparison with fastp

fasterp produces byte-identical output to fastp for standard parameters:

```bash
fastp -i input.fq -o fastp_out.fq
fasterp -i input.fq -o fasterp_out.fq
sha256sum fastp_out.fq fasterp_out.fq  # hashes match
```