# fasterp

High-performance FASTQ preprocessing in Rust. A drop-in replacement for fastp.

## Installation

```bash
cargo install fasterp
```

Or build from source:

```bash
git clone https://github.com/drbh/fasterp
cd fasterp
cargo build --release
```

## Quick Start

```bash
# Single-end
fasterp -i input.fq -o output.fq

# Paired-end
fasterp -i R1.fq -I R2.fq -o out1.fq -O out2.fq

# With quality filtering
fasterp -i input.fq -o output.fq -q 20 -l 50

# With adapter trimming
fasterp -i input.fq -o output.fq -a AGATCGGAAGAGC

# Multi-threaded with memory limit
fasterp -i large.fq -o output.fq -w 8 --max-memory 4096
```

## Quick Reference

Copy-paste commands for common tasks:

| Task | Command |
|---|---|
| Basic cleanup | `fasterp -i in.fq -o out.fq` |
| Strict quality filter | `fasterp -i in.fq -o out.fq -q 20 -u 30 -l 50` |
| Aggressive trimming | `fasterp -i in.fq -o out.fq --cut-front --cut-tail --cut-mean-quality 20` |
| Remove adapters only | `fasterp -i in.fq -o out.fq -a AGATCGGAAGAGC` |
| Paired-end basic | `fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq` |
| Paired-end + merge | `fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq -m --merged-out merged.fq` |
| Paired-end + correction | `fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq -c` |
| Fast (16 threads) | `fasterp -i in.fq -o out.fq -w 16` |
| Memory-limited (4GB) | `fasterp -i in.fq -o out.fq --max-memory 4096` |
| Compressed in/out | `fasterp -i in.fq.gz -o out.fq.gz` |
| Split into 10 files | `fasterp -i in.fq -o out.fq -s 10` |
| Skip all filtering | `fasterp -i in.fq -o out.fq -A -G -Q -L` |
| With UMI extraction | `fasterp -i in.fq -o out.fq --umi --umi-len 8` |
| Deduplicate | `fasterp -i in.fq -o out.fq -D` |
| Custom report name | `fasterp -i in.fq -o out.fq -j report.json --html report.html` |

Pro tip: flags can be combined freely. Example production pipeline:

```bash
fasterp -i R1.fq.gz -I R2.fq.gz -o o1.fq.gz -O o2.fq.gz \
  -q 20 -l 50 --cut-tail --cut-mean-quality 20 \
  -c -w 16 --max-memory 8192
```

## CLI Reference

### Input/Output

FASTQ files contain sequence data as text. Each “read” is a short string of letters (A, C, G, T) representing a fragment of genetic material, plus a confidence score for each letter. fasterp reads these files, processes them, and writes cleaned versions.
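
For reference, each FASTQ record spans four lines: an `@`-prefixed read ID, the raw sequence, a `+` separator, and one quality character per base. The record below is illustrative:

```
@read_001
GATTTGGGGTTCAAAGCAGT
+
IIIIIIIIIIIIIIIIIIII
```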

| Flag | Long | Default | Description |
|---|---|---|---|
| `-i` | `--in1` | | Input FASTQ file or URL (use `-` for stdin; supports .gz and http(s)://) |
| `-o` | `--out1` | | Output FASTQ file (use `-` for stdout; .gz extension enables compression) |
| `-I` | `--in2` | | Read2 input file or URL (paired-end mode; supports .gz and http(s)://) |
| `-O` | `--out2` | | Read2 output file (paired-end mode) |
| | `--unpaired1` | | Output file for unpaired read1 |
| | `--unpaired2` | | Output file for unpaired read2 |
| | `--interleaved-in` | | Input is interleaved paired-end |

### Length Filtering

Reads that are too short often come from errors or failed sequencing. Think of it like filtering out incomplete sentences: if a read is shorter than your minimum threshold, it’s probably not useful data and gets discarded.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-l` | `--length-required` | 15 | Minimum length required |
| `-L` | `--disable-length-filter` | | Disable length filtering (fastp compatibility) |
| `-b` | `--max-len1` | 0 | Maximum length for read1; trim if longer (0 = disabled) |
| `-B` | `--max-len2` | 0 | Maximum length for read2; trim if longer (0 = disabled) |
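
For example, to drop reads shorter than 50 bases and trim read1 down to at most 150 bases (values are illustrative):

```bash
fasterp -i input.fq -o output.fq -l 50 -b 150
```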

### Quality Filtering

Each letter in a read has a confidence score (0-40+) indicating how certain the machine was about that letter. Higher is better. Quality filtering removes reads with too many low-confidence letters, too many unknown letters (N), or repetitive patterns that suggest errors. It’s like spell-checking: removing text that’s too garbled to trust.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-q` | `--qualified-quality-phred` | 15 | Quality threshold for a base to count as qualified (phred >= this) |
| `-u` | `--unqualified-percent-limit` | 40 | Percent of bases allowed to be unqualified (0-100) |
| `-e` | `--average-qual` | 0 | Average quality threshold; discard if mean < this (0 = disabled) |
| `-n` | `--n-base-limit` | 5 | Maximum number of N bases allowed |
| `-y` | `--low-complexity-filter` | | Enable the low-complexity filter |
| `-Y` | `--complexity-threshold` | 30 | Complexity threshold (0-100); 30 = 30% complexity required |
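
A stricter filter might require phred >= 20, allow at most 30% unqualified bases and 2 N bases, and enable the low-complexity filter (thresholds are illustrative):

```bash
fasterp -i input.fq -o output.fq -q 20 -u 30 -n 2 -y
```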

### Trimming - Fixed Position

Sometimes the beginning or end of reads contain known bad data (like primer sequences or low-quality regions). Fixed trimming removes a set number of characters from the start or end of every read, like cutting off the first and last few words of every sentence because you know they’re always wrong.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--trim-front` | 0 | Trim N bases from the 5’ (front) end |
| | `--trim-tail` | 0 | Trim N bases from the 3’ (tail) end |
| `-f` | `--trim-front1` | 0 | Trim N bases from the front of read1 |
| `-t` | `--trim-tail1` | 0 | Trim N bases from the tail of read1 |
| `-F` | `--trim-front2` | 0 | Trim N bases from the front of read2 |
| `-T` | `--trim-tail2` | 0 | Trim N bases from the tail of read2 |
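
For instance, to cut 5 bases from the front and 3 from the tail of read1 while leaving read2 untouched (counts are illustrative):

```bash
fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq -f 5 -t 3
```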

### Trimming - Quality Based

Instead of cutting a fixed amount, this looks at actual quality scores and trims where the data gets bad. A sliding window moves along the read, checking average quality. When it drops below your threshold, everything past that point is cut off. It’s like editing a document and stopping where the text becomes unreadable.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--cut-mean-quality` | 0 | Mean-quality cutoff for sliding-window trimming (0 = disabled) |
| | `--cut-window-size` | 4 | Sliding-window size for quality trimming |
| | `--cut-front` | | Enable quality trimming at the 5’ (front) end |
| | `--cut-tail` | | Enable quality trimming at the 3’ (tail) end |
| | `--disable-quality-trimming` | | Disable polyG/quality tail trimming |
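
For example, trimming both ends with the default 4-base window and a mean-quality cutoff of 20 (the cutoff is illustrative):

```bash
fasterp -i input.fq -o output.fq --cut-front --cut-tail \
  --cut-mean-quality 20 --cut-window-size 4
```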

### PolyG/PolyX Trimming

Some sequencing machines produce false runs of the same letter (like “GGGGGGGG”) at the end of reads when they run out of real signal. These aren’t real data; they’re artifacts. This feature detects and removes these repetitive tails, like removing a stuck key’s output from a document.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--trim-poly-g` | | Enable polyG tail trimming |
| `-G` | `--disable-trim-poly-g` | | Disable polyG tail trimming |
| | `--trim-poly-x` | | Enable generic polyX tail trimming (any homopolymer) |
| | `--poly-g-min-len` | 10 | Minimum length for polyG/polyX detection |
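
For example, to trim any homopolymer tail (not just polyG) once it reaches 8 bases (the threshold is illustrative; the default is 10):

```bash
fasterp -i input.fq -o output.fq --trim-poly-x --poly-g-min-len 8
```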

### Adapter Trimming

Adapters are short synthetic sequences added during sample preparation; they’re not part of the actual sample. They appear at the ends of reads and must be removed, like stripping metadata headers from a file. fasterp can auto-detect common adapters or you can specify them manually.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-A` | `--disable-adapter-trimming` | | Disable adapter trimming |
| `-a` | `--adapter-sequence` | | Adapter sequence for read1 (auto-detected if not specified) |
| | `--adapter-sequence-r2` | | Adapter sequence for read2 |
| | `--disable-adapter-detection` | | Disable adapter auto-detection |
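
To supply explicit adapters for both reads instead of relying on detection (the sequence shown is the common Illumina adapter stub, used for both reads here purely as an illustration):

```bash
fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq \
  -a AGATCGGAAGAGC --adapter-sequence-r2 AGATCGGAAGAGC
```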

### Paired-End Options

Paired-end sequencing reads the same fragment from both ends, giving you two reads that overlap in the middle. When they overlap, you can compare them: if one says “A” with high confidence and the other says “T” with low confidence, you trust the “A”. You can also merge overlapping pairs into a single, longer read. It’s like having two people transcribe the same audio and combining their best parts.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-c` | `--correction` | | Enable base correction using overlap analysis |
| | `--overlap-len-require` | 30 | Minimum overlap length required for correction |
| | `--overlap-diff-limit` | 5 | Maximum allowed differences in the overlap region |
| | `--overlap-diff-percent-limit` | 20 | Maximum allowed difference percentage (0-100) |
| `-m` | `--merge` | | Merge overlapping paired-end reads into single reads |
| | `--merged-out` | | Output file for merged reads |
| | `--include-unmerged` | | Write unmerged reads to the merged output file |
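
For example, merging overlapping pairs while requiring a longer overlap and tolerating fewer mismatches than the defaults (values are illustrative):

```bash
fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq \
  -m --merged-out merged.fq -c --overlap-len-require 40 --overlap-diff-limit 3
```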

### UMI Processing

A UMI (Unique Molecular Identifier) is a short random tag added to each original molecule before copying. Since copies of the same molecule have the same UMI, you can later identify and remove duplicates that came from the same original. Think of it like adding a unique serial number to each document before photocopying: you can then tell which copies came from which original.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--umi` | | Enable UMI preprocessing |
| | `--umi-loc` | read1 | UMI location: read1/read2/index1/index2/per_read/per_index |
| | `--umi-len` | 0 | UMI length (required when `--umi` is enabled) |
| | `--umi-prefix` | UMI | Prefix for the UMI in the read name |
| | `--umi-skip` | 0 | Skip N bases before the UMI |
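
For instance, to extract an 8-base UMI from the start of read1 after skipping 2 bases (lengths here are illustrative):

```bash
fasterp -i input.fq -o output.fq --umi --umi-loc read1 --umi-len 8 --umi-skip 2
```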

### Deduplication

During sample preparation, molecules get copied many times. Identical reads are likely copies of the same original, which can skew your analysis (like counting the same vote multiple times). Deduplication detects and removes these duplicates using a memory-efficient probabilistic filter. It estimates duplication rates without needing to store every read in memory.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-D` | `--dedup` | | Enable deduplication to remove duplicate reads |
| | `--dup-calc-accuracy` | 0 | Deduplication accuracy (1-6); higher = more memory |
| | `--dont-eval-duplication` | | Don’t evaluate the duplication rate (saves time/memory) |
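
For example, deduplicating with a higher accuracy setting, which trades memory for precision (the value is illustrative):

```bash
fasterp -i input.fq -o output.fq -D --dup-calc-accuracy 3
```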

### Output Splitting

Large datasets can be split into multiple smaller files for parallel downstream processing. You can split by total file count (distribute reads evenly across N files) or by lines per file (each file gets up to N lines). This is useful when you want to process chunks in parallel on different machines.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-s` | `--split` | | Split output by limiting the total number of files (2-999) |
| `-S` | `--split-by-lines` | | Split output by limiting lines per file (>= 1000) |
| `-d` | `--split-prefix-digits` | 4 | Digits for file number padding (1-10) |
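
You can split by file count or by lines per file; both commands below use illustrative values:

```bash
# Distribute reads evenly across 8 output files
fasterp -i input.fq -o output.fq -s 8

# Or cap each output file at 4,000,000 lines, with file numbers padded to 3 digits
fasterp -i input.fq -o output.fq -S 4000000 -d 3
```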

### Reporting

fasterp generates detailed statistics about your data before and after processing: total reads, quality distributions, how many were filtered out and why. The JSON report is machine-readable for pipelines; the HTML report has interactive charts for visual inspection.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-j` | `--json` | fasterp.json | JSON report file |
| | `--html` | fasterp.html | HTML report file |
| | `--stats-format` | compact | Stats output format: compact, pretty, off, or jsonl |
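
For example, writing reports under custom names and selecting the pretty stats format (file names are illustrative):

```bash
fasterp -i input.fq -o output.fq -j qc.json --html qc.html --stats-format pretty
```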

### Performance

Control how fasterp uses your system resources. More threads mean faster processing but more CPU/memory usage. The batch size affects memory consumption: larger batches are more efficient but use more RAM. Use `--max-memory` to set a limit and let fasterp auto-tune the other settings.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-w` | `--thread` | auto | Number of worker threads |
| `-z` | `--compression` | 6 | Compression level for gzip output (0-9) |
| | `--parallel-compression` | true | Use parallel compression for gzip output |
| | `--batch-bytes` | 33554432 | Batch size in bytes (32 MiB) |
| | `--max-backlog` | threads+1 | Maximum backlog of batches |
| | `--max-memory` | | Maximum memory usage in MB |
| | `--skip-kmer-counting` | | Skip k-mer counting for performance |
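
For example, a throughput-oriented run with 16 threads, 64 MiB batches, a lighter gzip level, and an 8 GB memory cap (all values are illustrative):

```bash
fasterp -i large.fq.gz -o output.fq.gz \
  -w 16 --batch-bytes 67108864 -z 4 --max-memory 8192
```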

## Architecture

fasterp processes data in a pipeline with three stages running in parallel. The Producer reads chunks from disk. Multiple Workers process those chunks simultaneously (filtering, trimming, counting). The Merger collects results in order and writes output. This design keeps all CPU cores busy while maintaining correct output order.
fasterp uses a 3-stage pipeline for multi-threaded processing:
- Producer: Reads input in 32MB batches, parses FASTQ records
- Workers: Process batches in parallel (filtering, trimming, stats)
- Merger: Writes output in order, aggregates statistics
Key optimizations:
- SIMD acceleration (AVX2/NEON) for quality statistics
- Zero-copy buffers with `Arc<Vec<u8>>`
- Lookup tables for base/quality checks
- Bloom filter for duplication detection

## Examples

### Quality Control Only

```bash
fasterp -i input.fq -o output.fq -A -G
```

### Aggressive Trimming

```bash
fasterp -i input.fq -o output.fq \
  --cut-front --cut-tail --cut-mean-quality 20 \
  -q 20 -u 30 -l 50
```

### Paired-End with Merging

```bash
fasterp -i R1.fq -I R2.fq -o out1.fq -O out2.fq \
  -m --merged-out merged.fq -c
```

### High-Throughput Pipeline

```bash
fasterp -i large.fq.gz -o output.fq.gz \
  -w 16 --max-memory 8192 -z 4
```

### With UMI

```bash
fasterp -i input.fq -o output.fq \
  --umi --umi-loc read1 --umi-len 12
```

## Comparison with fastp

fasterp produces byte-identical output to fastp for standard parameters:

```bash
fastp -i input.fq -o fastp_out.fq
fasterp -i input.fq -o fasterp_out.fq
sha256sum fastp_out.fq fasterp_out.fq  # hashes match
```