
fasterp

High-performance FASTQ preprocessing in Rust. A drop-in replacement for fastp with blazing-fast performance.

Installation

cargo install fasterp

Or build from source:

git clone https://github.com/drbh/fasterp
cd fasterp
cargo build --release

Quick Start

# Single-end
fasterp -i input.fq -o output.fq

# Paired-end
fasterp -i R1.fq -I R2.fq -o out1.fq -O out2.fq

# With quality filtering
fasterp -i input.fq -o output.fq -q 20 -l 50

# With adapter trimming
fasterp -i input.fq -o output.fq -a AGATCGGAAGAGC

# Multi-threaded with memory limit
fasterp -i large.fq -o output.fq -w 8 --max-memory 4096

Quick Reference

Copy-paste commands for common tasks:

| Task | Command |
|---|---|
| Basic cleanup | `fasterp -i in.fq -o out.fq` |
| Strict quality filter | `fasterp -i in.fq -o out.fq -q 20 -u 30 -l 50` |
| Aggressive trimming | `fasterp -i in.fq -o out.fq --cut-front --cut-tail --cut-mean-quality 20` |
| Remove adapters only | `fasterp -i in.fq -o out.fq -a AGATCGGAAGAGC` |
| Paired-end basic | `fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq` |
| Paired-end + merge | `fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq -m --merged-out merged.fq` |
| Paired-end + correction | `fasterp -i R1.fq -I R2.fq -o o1.fq -O o2.fq -c` |
| Fast (16 threads) | `fasterp -i in.fq -o out.fq -w 16` |
| Memory-limited (4 GB) | `fasterp -i in.fq -o out.fq --max-memory 4096` |
| Compressed in/out | `fasterp -i in.fq.gz -o out.fq.gz` |
| Split into 10 files | `fasterp -i in.fq -o out.fq -s 10` |
| Skip all filtering | `fasterp -i in.fq -o out.fq -A -G -Q -L` |
| With UMI extraction | `fasterp -i in.fq -o out.fq --umi --umi-len 8` |
| Deduplicate | `fasterp -i in.fq -o out.fq -D` |
| Custom report name | `fasterp -i in.fq -o out.fq -j report.json --html report.html` |

Pro tip: Combine flags freely. Example for production pipeline:

fasterp -i R1.fq.gz -I R2.fq.gz -o o1.fq.gz -O o2.fq.gz \
  -q 20 -l 50 --cut-tail --cut-mean-quality 20 \
  -c -w 16 --max-memory 8192

CLI Reference

Input/Output

FASTQ files contain sequence data as text. Each “read” is a short string of letters (A, C, G, T) representing a fragment of genetic material, plus a confidence score for each letter. fasterp reads these files, processes them, and writes cleaned versions.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-i` | `--in1` | | Input FASTQ file or URL (use `-` for stdin; supports `.gz` and `http(s)://`) |
| `-o` | `--out1` | | Output FASTQ file (use `-` for stdout; a `.gz` extension enables compression) |
| `-I` | `--in2` | | Read2 input file or URL (paired-end mode; supports `.gz` and `http(s)://`) |
| `-O` | `--out2` | | Read2 output file (paired-end mode) |
| | `--unpaired1` | | Output file for unpaired read1 |
| | `--unpaired2` | | Output file for unpaired read2 |
| | `--interleaved-in` | | Input is interleaved paired-end |

Flow: input.fq.gz → process → output.fq.gz + report.json
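
The record structure fasterp parses can be seen in a minimal hand-made file (the name `example.fq` is arbitrary):

```shell
# A FASTQ record is four lines: name, bases, a '+' separator, and one
# quality character per base (here 'I', a high-confidence score).
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > example.fq
cat example.fq
```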

Length Filtering

Reads that are too short often come from errors or failed sequencing. Think of it like filtering out incomplete sentences - if a read is shorter than your minimum threshold, it’s probably not useful data and gets discarded.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-l` | `--length-required` | 15 | Minimum length required |
| `-L` | `--disable-length-filter` | | Disable length filtering (fastp compatibility) |
| `-b` | `--max-len1` | 0 | Maximum length for read1; trim if longer (0 = disabled) |
| `-B` | `--max-len2` | 0 | Maximum length for read2; trim if longer (0 = disabled) |

Decision: read length ≥ 15 → keep; otherwise → discard.
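
The filter's core decision can be sketched with awk on a two-read toy file. This mimics `-l 15`; it is not fasterp's actual code:

```shell
# keep only records whose sequence is at least 15 bases long
printf '@r1\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n@r2\nACGT\n+\nIIII\n' > in.fq
awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
     NR%4==0{if (length(s) >= 15) print h "\n" s "\n" p "\n" $0}' in.fq
```

Only `@r1` (20 bases) survives; `@r2` (4 bases) is discarded.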

Quality Filtering

Each letter in a read has a confidence score (0-40+) indicating how certain the machine was about that letter. Higher is better. Quality filtering removes reads with too many low-confidence letters, too many unknown letters (N), or repetitive patterns that suggest errors. It’s like spell-checking - removing text that’s too garbled to trust.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-q` | `--qualified-quality-phred` | 15 | Quality threshold for a base to count as qualified (Phred ≥ this) |
| `-u` | `--unqualified-percent-limit` | 40 | Percent of bases allowed to be unqualified (0-100) |
| `-e` | `--average-qual` | 0 | Average quality threshold; discard if mean < this (0 = disabled) |
| `-n` | `--n-base-limit` | 5 | Maximum number of N bases allowed |
| `-y` | `--low-complexity-filter` | | Enable low-complexity filter |
| `-Y` | `--complexity-threshold` | 30 | Complexity threshold (0-100); 30 = 30% complexity required |

Decision: N bases ≤ 5 and mean quality ≥ threshold → pass; otherwise → discard.
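
Quality characters map to scores via the common Phred+33 encoding: the score is the character's ASCII code minus 33. A quick check in shell:

```shell
# 'I' is ASCII 73, so it encodes Phred quality 73 - 33 = 40
q='I'
score=$(( $(printf '%d' "'$q") - 33 ))
echo "$score"
```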

Trimming - Fixed Position

Sometimes the beginning or end of reads contain known bad data (like primer sequences or low-quality regions). Fixed trimming removes a set number of characters from the start or end of every read - like cutting off the first and last few words of every sentence because you know they’re always wrong.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--trim-front` | 0 | Trim N bases from the 5' (front) end |
| | `--trim-tail` | 0 | Trim N bases from the 3' (tail) end |
| `-f` | `--trim-front1` | 0 | Trim N bases from the front of read1 |
| `-t` | `--trim-tail1` | 0 | Trim N bases from the tail of read1 |
| `-F` | `--trim-front2` | 0 | Trim N bases from the front of read2 |
| `-T` | `--trim-tail2` | 0 | Trim N bases from the tail of read2 |

Before: 5'─ NNN ACGTACGTACGT NN ─3'
After (-f 3 -t 2): 5'─ ACGTACGTACGT ─3'
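
The diagram's trim can be reproduced with awk as an illustration of what `-f 3 -t 2` does (not fasterp's implementation):

```shell
# drop 3 bases from the front and 2 from the tail of both the sequence
# line and the quality line
printf '@r1\nNNNACGTACGTACGTNN\n+\n###IIIIIIIIIIII##\n' > in.fq
awk 'NR%4==2 || NR%4==0 {$0 = substr($0, 4, length($0) - 5)} {print}' in.fq
```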

Trimming - Quality Based

Instead of cutting a fixed amount, this looks at actual quality scores and trims where the data gets bad. A sliding window moves along the read, checking average quality. When it drops below your threshold, everything past that point is cut off. It’s like editing a document and stopping where the text becomes unreadable.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--cut-mean-quality` | 0 | Quality cutoff for sliding-window trimming (0 = disabled) |
| | `--cut-window-size` | 4 | Sliding-window size for quality trimming |
| | `--cut-front` | | Enable quality trimming at the 5' (front) end |
| | `--cut-tail` | | Enable quality trimming at the 3' (tail) end |
| | `--disable-quality-trimming` | | Disable polyG/quality tail trimming |

Example (window = 4, cutoff = 20): quality scores 10 12 14 35 36 38 37 35 20 15 10 8 → keep the region where the sliding-window mean ≥ 20; cut the low-quality tail.
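
The window means for the example scores above can be computed directly; windows below the cutoff mark where trimming would cut (a sketch of the idea, not fasterp's code):

```shell
# print the mean of each 4-score window and whether it passes the cutoff 20
echo "10 12 14 35 36 38 37 35 20 15 10 8" |
awk -v W=4 -v Q=20 '{for (i = 1; i <= NF - W + 1; i++) {
  s = 0; for (j = i; j < i + W; j++) s += $j
  printf "window %d mean %.2f %s\n", i, s / W, (s / W >= Q ? "keep" : "cut")}}'
```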

PolyG/PolyX Trimming

Some sequencing machines produce false runs of the same letter (like “GGGGGGGG”) at the end of reads when they run out of real signal. These aren’t real data - they’re artifacts. This feature detects and removes these repetitive tails, like removing a stuck key’s output from a document.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--trim-poly-g` | | Enable polyG tail trimming |
| `-G` | `--disable-trim-poly-g` | | Disable polyG tail trimming |
| | `--trim-poly-x` | | Enable generic polyX tail trimming (any homopolymer) |
| | `--poly-g-min-len` | 10 | Minimum length for polyG/polyX detection |

Before: ACGTACGT GGGGGGGGGGGG (polyG tail)
After: ACGTACGT (NovaSeq artifact removed)
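
The effect can be approximated with a single sed substitution (an illustration only; fasterp's detection also handles mismatches within the run):

```shell
# strip a trailing run of 10 or more G's, matching --poly-g-min-len 10
printf 'ACGTACGTGGGGGGGGGGGG\n' | sed -E 's/G{10,}$//'
# prints ACGTACGT
```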

Adapter Trimming

Adapters are short synthetic sequences added during sample preparation - they’re not part of the actual sample. They appear at the ends of reads and must be removed, like stripping metadata headers from a file. fasterp can auto-detect common adapters or you can specify them manually.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-A` | `--disable-adapter-trimming` | | Disable adapter trimming |
| `-a` | `--adapter-sequence` | | Adapter for read1 (auto-detected if not specified) |
| | `--adapter-sequence-r2` | | Adapter sequence for read2 |
| | `--disable-adapter-detection` | | Disable adapter auto-detection |

Auto-detect: sample reads → k-mer analysis → find adapter sequence
Before: ACGTACGT AGATCGGAAGAGC (adapter)
After: ACGTACGT
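
For an exact match, the trim amounts to cutting at the first adapter occurrence. A sed sketch of the idea (real adapter trimming also handles partial matches at the read end):

```shell
# cut everything from the first exact adapter match onward
printf 'ACGTACGTAGATCGGAAGAGC\n' | sed 's/AGATCGGAAGAGC.*$//'
# prints ACGTACGT
```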

Paired-End Options

Paired-end sequencing reads the same fragment from both ends, giving you two reads that overlap in the middle. When they overlap, you can compare them: if one says “A” with high confidence and the other says “T” with low confidence, you trust the “A”. You can also merge overlapping pairs into a single, longer read. It’s like having two people transcribe the same audio and combining their best parts.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-c` | `--correction` | | Enable base correction using overlap analysis |
| | `--overlap-len-require` | 30 | Minimum overlap length required for correction |
| | `--overlap-diff-limit` | 5 | Maximum allowed differences in the overlap region |
| | `--overlap-diff-percent-limit` | 20 | Maximum allowed difference percentage (0-100) |
| `-m` | `--merge` | | Merge overlapping paired-end reads into single reads |
| | `--merged-out` | | Output file for merged reads |
| | `--include-unmerged` | | Write unmerged reads to the merged output file |

Overlap detection: R1 and R2 read the same fragment from opposite ends and share an overlap region in the middle.
Base correction (-c): R1 reads A (Q30) vs R2 reads T (Q15) → use A.
Merge (-m): the overlapping pair becomes a single longer read.
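
The correction rule at a disagreeing overlap position reduces to a quality comparison, as in this toy sketch:

```shell
# at a mismatch in the overlap, keep the base with the higher quality
b1=A q1=30 b2=T q2=15
if [ "$q1" -ge "$q2" ]; then keep=$b1; else keep=$b2; fi
echo "corrected base: $keep"
# prints: corrected base: A
```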

UMI Processing

A UMI (Unique Molecular Identifier) is a short random tag added to each original molecule before copying. Since copies of the same molecule have the same UMI, you can later identify and remove duplicates that came from the same original. Think of it like adding a unique serial number to each document before photocopying - you can then tell which copies came from which original.

| Flag | Long | Default | Description |
|---|---|---|---|
| | `--umi` | | Enable UMI preprocessing |
| | `--umi-loc` | read1 | UMI location: read1/read2/index1/index2/per_read/per_index |
| | `--umi-len` | 0 | UMI length (required when `--umi` is enabled) |
| | `--umi-prefix` | UMI | Prefix in the read name for the UMI |
| | `--umi-skip` | 0 | Skip N bases before the UMI |

Before: the first 8 bp of the read are the UMI (e.g. ACGTACGT).
After: the UMI is removed from the sequence and recorded in the read name (e.g. @READ_NAME UMI:ACGTACGT).
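
The extraction step can be sketched with awk: move the first 8 bases out of the sequence and into the name, as `--umi --umi-len 8` would. The `:` separator here is illustrative, not necessarily fasterp's exact name format:

```shell
printf '@r1\nACGTACGTTTTTCCCC\n+\nIIIIIIIIIIIIIIII\n' > in.fq
# take the 8 bp UMI off the sequence and quality and append it to the name
awk 'NR%4==1{h=$0} NR%4==2{u=substr($0,1,8); s=substr($0,9)} NR%4==3{p=$0}
     NR%4==0{print h ":" u "\n" s "\n" p "\n" substr($0, 9)}' in.fq
```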

Deduplication

During sample preparation, molecules get copied many times. Identical reads are likely copies of the same original, which can skew your analysis (like counting the same vote multiple times). Deduplication detects and removes these duplicates using a memory-efficient probabilistic filter. It estimates duplication rates without needing to store every read in memory.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-D` | `--dedup` | | Enable deduplication to remove duplicate reads |
| | `--dup-calc-accuracy` | 0 | Deduplication accuracy (1-6); higher = more memory |
| | `--dont-eval-duplication` | | Don't evaluate the duplication rate (saves time/memory) |

Flow: hash read → already in Bloom filter? yes → discard; no → add to filter and keep.
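
The first-seen-wins logic can be sketched with an exact seen-hash in awk. fasterp instead uses a memory-bounded Bloom filter, which trades exactness for constant memory (it can rarely misjudge a new read as already seen):

```shell
# keep only the first record for each distinct sequence
printf '@a\nACGT\n+\nIIII\n@b\nACGT\n+\nIIII\n@c\nTTTT\n+\nIIII\n' > in.fq
awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
     NR%4==0{if (!(s in seen)) {seen[s] = 1; print h "\n" s "\n" p "\n" $0}}' in.fq
```

`@b` duplicates `@a`'s sequence and is dropped; `@a` and `@c` survive.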

Output Splitting

Large datasets can be split into multiple smaller files for parallel downstream processing. You can split by total file count (distribute reads evenly across N files) or by lines per file (each file gets up to N lines). This is useful when you want to process chunks in parallel on different machines.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-s` | `--split` | | Split output by limiting the total number of files (2-999) |
| `-S` | `--split-by-lines` | | Split output by limiting lines per file (≥ 1000) |
| `-d` | `--split-prefix-digits` | 4 | Digits for file-number padding (1-10) |

Flow: input.fq → out.0001.fq, out.0002.fq, out.0003.fq, …
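
Splitting by lines resembles the standard `split` tool, shown here on a tiny 3-record file. Note that fasterp pads file numbers numerically (per `-d`), while `split`'s default suffixes are alphabetic:

```shell
printf '@a\nAC\n+\nII\n@b\nAC\n+\nII\n@c\nAC\n+\nII\n' > in.fq
# one 4-line record per output file; the 'chunk.' prefix is arbitrary
split -l 4 -a 4 in.fq chunk.
ls chunk.*
```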

Reporting

fasterp generates detailed statistics about your data before and after processing: total reads, quality distributions, how many were filtered out and why. The JSON report is machine-readable for pipelines; the HTML report has interactive charts for visual inspection.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-j` | `--json` | fasterp.json | JSON report file |
| | `--html` | fasterp.html | HTML report file |
| | `--stats-format` | compact | Stats output format: compact, pretty, off, or jsonl |

Performance

Control how fasterp uses your system resources. More threads = faster processing but more CPU/memory usage. The batch size affects memory consumption - larger batches are more efficient but use more RAM. Use --max-memory to set a limit and let fasterp auto-tune the other settings.

| Flag | Long | Default | Description |
|---|---|---|---|
| `-w` | `--thread` | auto | Number of worker threads |
| `-z` | `--compression` | 6 | Compression level for gzip output (0-9) |
| | `--parallel-compression` | true | Use parallel compression for gzip output |
| | `--batch-bytes` | 33554432 | Batch size in bytes (32 MiB) |
| | `--max-backlog` | threads+1 | Maximum backlog of batches |
| | `--max-memory` | | Maximum memory usage in MB |
| | `--skip-kmer-counting` | | Skip k-mer counting for performance |

Architecture

fasterp processes data in a pipeline with three stages running in parallel. The Producer reads chunks from disk. Multiple Workers process those chunks simultaneously (filtering, trimming, counting). The Merger collects results in order and writes output. This design keeps all CPU cores busy while maintaining correct output order.

Pipeline: Producer (read & parse, 32 MB batches) → Workers 1..N (process) → Merger (reorder, write, aggregate stats) → output.fq + report.json

fasterp uses a 3-stage pipeline for multi-threaded processing:

  1. Producer: Reads input in 32MB batches, parses FASTQ records
  2. Workers: Process batches in parallel (filtering, trimming, stats)
  3. Merger: Writes output in order, aggregates statistics

Key optimizations:

  • SIMD acceleration (AVX2/NEON) for quality statistics
  • Zero-copy buffers with Arc<Vec<u8>>
  • Lookup tables for base/quality checks
  • Bloom filter for duplication detection

Examples

Quality Control Only

fasterp -i input.fq -o output.fq -A -G

Aggressive Trimming

fasterp -i input.fq -o output.fq \
  --cut-front --cut-tail --cut-mean-quality 20 \
  -q 20 -u 30 -l 50

Paired-End with Merging

fasterp -i R1.fq -I R2.fq -o out1.fq -O out2.fq \
  -m --merged-out merged.fq -c

High-Throughput Pipeline

fasterp -i large.fq.gz -o output.fq.gz \
  -w 16 --max-memory 8192 -z 4

With UMI

fasterp -i input.fq -o output.fq \
  --umi --umi-loc read1 --umi-len 12

Comparison with fastp

fasterp produces byte-exact output for standard parameters:

fastp -i input.fq -o fastp_out.fq
fasterp -i input.fq -o fasterp_out.fq
sha256sum fastp_out.fq fasterp_out.fq  # hashes match

Tutorial: Compare fasterp and fastp

This tutorial walks through downloading real sequencing data, running both fasterp and fastp, and comparing their outputs and reports.

Prerequisites

  • fasterp installed (cargo install --git https://github.com/drbh/fasterp.git)
  • fastp installed (installation guide)
  • curl for downloading files
  • jq (optional) for inspecting JSON reports

Step 1: Download Test Data

Download paired-end FASTQ files from SRX2987343 - a chromatin accessibility study on mouse hair follicle stem cells, examining how DNA packaging changes during differentiation into hair-related cell types.

# Create a working directory
mkdir -p fasterp_tutorial && cd fasterp_tutorial

# Download R1 (forward reads) - 365 MB
curl -LO https://genedata.dholtz.com/SRX2987343/SRR5808766_1.fastq.gz

# Download R2 (reverse reads) - 364 MB
curl -LO https://genedata.dholtz.com/SRX2987343/SRR5808766_2.fastq.gz

# Verify downloads
ls -lh *.gz

Expected output:

-rw-r--r--  1 user  staff   365M  SRR5808766_1.fastq.gz
-rw-r--r--  1 user  staff   364M  SRR5808766_2.fastq.gz

Step 2: Run fastp

Process the paired-end data with fastp:

fastp \
  -i SRR5808766_1.fastq.gz -I SRR5808766_2.fastq.gz \
  -o fastp_out_R1.fq.gz -O fastp_out_R2.fq.gz \
  -j fastp_report.json \
  -h fastp_report.html

Expected output:

Read1 before filtering:
total reads: 9799076
total bases: 499752876
Q20 bases: 494840792(99.0171%)
Q30 bases: 490032403(98.0549%)
Q40 bases: 333003632(66.6337%)

Read2 before filtering:
total reads: 9799076
total bases: 499752876
Q20 bases: 491032411(98.255%)
Q30 bases: 485393847(97.1268%)
Q40 bases: 335965361(67.2263%)

Read1 after filtering:
total reads: 9761787
total bases: 485985544
Q20 bases: 481444894(99.0657%)
Q30 bases: 476939326(98.1386%)
Q40 bases: 324904111(66.8547%)

Read2 after filtering:
total reads: 9761787
total bases: 485985544
Q20 bases: 479080210(98.5791%)
Q30 bases: 473663787(97.4646%)
Q40 bases: 327920585(67.4754%)

Filtering result:
reads passed filter: 19523574
reads failed due to low quality: 74384
reads failed due to too many N: 194
reads failed due to too short: 0
reads with adapter trimmed: 3113628
bases trimmed due to adapters: 23734734

Duplication rate: 28.3885%

Insert size peak (evaluated by paired-end reads): 51

JSON report: fastp_report.json
HTML report: fastp_report.html

fastp v1.0.1, time used: 11 seconds

Step 3: Run fasterp

Process the same data with fasterp:

fasterp \
  -i SRR5808766_1.fastq.gz -I SRR5808766_2.fastq.gz \
  -o fasterp_out_R1.fq.gz -O fasterp_out_R2.fq.gz \
  -j fasterp_report.json \
  --html fasterp_report.html

Expected output:

Detecting adapter sequence for read1...
No adapter detected for read1

Detecting adapter sequence for read2...
No adapter detected for read2

Read1 before filtering:
total reads: 9799076
total bases: 499752876
Q20 bases: 494840792(99.0171%)
Q30 bases: 490032403(98.0549%)
Q40 bases: 333003632(66.6337%)

Read2 before filtering:
total reads: 9799076
total bases: 499752876
Q20 bases: 491032411(98.2550%)
Q30 bases: 485393847(97.1268%)
Q40 bases: 335965361(67.2263%)

Read1 after filtering:
total reads: 9761787
total bases: 485985544
Q20 bases: 481444894(99.0657%)
Q30 bases: 476939326(98.1386%)
Q40 bases: 324904111(66.8547%)

Read2 after filtering:
total reads: 9761787
total bases: 485985544
Q20 bases: 479080210(98.5791%)
Q30 bases: 473663787(97.4646%)
Q40 bases: 327920585(67.4754%)

Filtering result:
reads passed filter: 19523574
reads failed due to low quality: 74384
reads failed due to too many N: 194
reads failed due to too short: 0
reads with adapter trimmed: 3113628
bases trimmed due to adapters: 23734734

Duplication rate: 28.3885%

Insert size peak (evaluated by paired-end reads): 51

JSON report: fasterp_report.json
HTML report: fasterp_report.html

fasterp v0.1.0, time used: 6 seconds

Note: fasterp processed ~10 million read pairs in 6 seconds vs fastp’s 11 seconds.

Step 4: Compare Outputs

Verify identical FASTQ output

When comparing gzipped files, the compressed hashes may differ due to compression metadata. Compare the decompressed content:

# Compare decompressed R1 content
gunzip -c fastp_out_R1.fq.gz | shasum -a 256
gunzip -c fasterp_out_R1.fq.gz | shasum -a 256

# Compare decompressed R2 content
gunzip -c fastp_out_R2.fq.gz | shasum -a 256
gunzip -c fasterp_out_R2.fq.gz | shasum -a 256

Expected output:

716d6bc9b5aa075e5f7ff527b1638de1ea56b67439b7a7646bfe25fb14d132e7  -
716d6bc9b5aa075e5f7ff527b1638de1ea56b67439b7a7646bfe25fb14d132e7  -

c28e4e14c58ded4f4a7b776f6fb9f92cabc83056253f0ff4e268e5db1653b656  -
c28e4e14c58ded4f4a7b776f6fb9f92cabc83056253f0ff4e268e5db1653b656  -

The hashes match - both tools produced identical output.

Compare key statistics

# Total reads before/after filtering
echo "=== fastp ==="
jq '{
  before: .summary.before_filtering.total_reads,
  after: .summary.after_filtering.total_reads,
  passed_rate: .summary.after_filtering.total_reads / .summary.before_filtering.total_reads * 100
}' fastp_report.json

echo "=== fasterp ==="
jq '{
  before: .summary.before_filtering.total_reads,
  after: .summary.after_filtering.total_reads,
  passed_rate: .summary.after_filtering.total_reads / .summary.before_filtering.total_reads * 100
}' fasterp_report.json

Expected output:

=== fastp ===
{
  "before": 19598152,
  "after": 19523574,
  "passed_rate": 99.61946412090282
}

=== fasterp ===
{
  "before": 19598152,
  "after": 19523574,
  "passed_rate": 99.61946412090282
}

Compare k-mer counts

# Top k-mers should be identical
echo "=== fastp k-mers ==="
jq '.read1_before_filtering.kmer_count | to_entries | sort_by(-.value) | .[0:5]' fastp_report.json

echo "=== fasterp k-mers ==="
jq '.read1_before_filtering.kmer_count | to_entries | sort_by(-.value) | .[0:5]' fasterp_report.json

Expected output:

=== fastp k-mers ===
[
  { "key": "CTGTC", "value": 1651445 },
  { "key": "TGTCT", "value": 1651009 },
  { "key": "TCTCT", "value": 1576849 },
  { "key": "GTCTC", "value": 1358458 },
  { "key": "CTCTT", "value": 1314738 }
]

=== fasterp k-mers ===
[
  { "key": "CTGTC", "value": 1651445 },
  { "key": "TGTCT", "value": 1651009 },
  { "key": "TCTCT", "value": 1576849 },
  { "key": "GTCTC", "value": 1358458 },
  { "key": "CTCTT", "value": 1314738 }
]

Step 5: View Reports

Open the HTML reports in your browser to compare visualizations:

# macOS
open fastp_report.html fasterp_report.html

# Linux
xdg-open fastp_report.html && xdg-open fasterp_report.html

# Or start a local server
python3 -m http.server 8000
# Then visit http://localhost:8000

Example Reports

Here are the actual reports generated from this tutorial:

| Tool | Report |
|---|---|
| fastp | fastp_report.html |
| fasterp | fasterp_report.html |

Compare these side-by-side to see identical quality metrics, base content graphs, and k-mer distributions.

Cleanup

# Remove generated files
rm -f *.gz fastp_report.* fasterp_report.*
cd .. && rmdir fasterp_tutorial

Next Steps

  • Try with your own FASTQ data
  • Experiment with different filtering parameters
  • Check the CLI Reference for all available options

Playground

fasterp WASM Sandbox