phredsort

phredsort is a command-line tool for sorting FASTQ reads by quality metrics. It can also sort FASTA inputs when the quality metrics are already present in the sequence headers (using the `headersort` subcommand).

Usage

Basic usage:

# Read from `input.fastq.gz` and write to `output.fastq.gz`
phredsort -i input.fastq.gz -o output.fastq.gz

# Read from stdin and write to stdout (default when -i/-o not specified)
zcat input.fastq.gz | phredsort | less -S

# Explicit stdin/stdout (equivalent to above)
zcat input.fastq.gz | phredsort -i - -o - | less -S

Sort sequences using pre-computed maxEE scores in headers

phredsort headersort -i input.fasta -o output.fasta --metric maxee

Sort by avgphred scores with quality filtering

phredsort headersort -i input.fastq -o output.fastq --metric avgphred --minqual 20 --maxqual 40

Sort in ascending order (lower quality first)

phredsort headersort -i input.fa -o output.fa --metric meep --ascending

Examples of supported header formats:

Space-separated: ">seq1 maxee=2.5 size=100"
Semicolon-separated: ">seq1;maxee=2.5;size=100"
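
To make the two header layouts concrete, here is a minimal Python sketch of how a `key=value` metric could be pulled out of either format. The function name `parse_header_metric` and the regular expression are illustrative assumptions, not phredsort's actual parser, and the best-first ordering at the end simply sorts by the numeric value.

```python
import re

def parse_header_metric(header, metric):
    """Return the float value of `metric=...` in a header, or None if absent.

    Hypothetical helper: accepts both space- and semicolon-separated fields.
    """
    # Drop the leading FASTA/FASTQ marker, then split on spaces or semicolons
    for field in re.split(r"[;\s]+", header.lstrip(">@")):
        key, sep, value = field.partition("=")
        if sep and key == metric:
            try:
                return float(value)
            except ValueError:
                return None
    return None

headers = [
    ">seq1 maxee=2.5 size=100",
    ">seq2;maxee=0.7;size=40",
    ">seq3 maxee=1.2",
]

# Lower maxee means fewer expected errors, so the smallest value comes first
ranked = sorted(headers, key=lambda h: parse_header_metric(h, "maxee"))
for h in ranked:
    print(h)
```

Headers that lack the requested metric return `None` here; a real pipeline would need to decide whether to skip or reject such records.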

Installation

Download compiled binary (for Linux)

wget https://github.com/vmikk/phredsort/releases/download/1.4.0/phredsort
chmod +x phredsort
./phredsort --help

Build from source

git clone --depth 1 https://github.com/vmikk/phredsort
cd phredsort
go build -ldflags="-s -w" -o phredsort .
./phredsort --help

Quality metrics

phredsort supports several metrics (--metric parameter) to assess sequence quality:

1. (Back-transformed) average Phred score (`avgphred`)

Properly calculated mean quality score that accounts for the logarithmic nature of Phred scores
Converts Phred scores to error probabilities, calculates their arithmetic mean, then converts back to Phred scale
Formula: -10 * log10(mean(10^(-Q/10)))
More accurate than simple arithmetic mean of Phred scores, which would overestimate quality
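
The difference between the two averages is easy to see on a tiny example. The sketch below is a Python illustration of the formula above, not phredsort's own code; the helper names are hypothetical and the quality string is assumed to use Phred+33 encoding.

```python
import math

def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]

def avg_phred(scores):
    """Back-transformed mean: -10 * log10(mean(10^(-Q/10))). Assumes a non-empty read."""
    mean_error = sum(10 ** (-q / 10) for q in scores) / len(scores)
    return -10 * math.log10(mean_error)

scores = phred_scores("+I")          # '+' is Q10, 'I' is Q40
naive = sum(scores) / len(scores)    # arithmetic mean of the Phred scores
proper = avg_phred(scores)           # mean taken on the error-probability scale

print(naive)              # 25.0
print(round(proper, 2))   # 13.01
```

One Q10 base dominates the mean error probability, so the back-transformed average stays close to Q13 while the naive mean reports Q25.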

2. Maximum expected error (`maxee`) (as per Edgar & Flyvbjerg, 2014)

Sum of error probabilities for all bases in a sequence
Formula: sum(10^(-Q/10))
Higher values indicate lower quality
Depends on sequence length (longer sequences tend to have higher MaxEE)
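
As a rough illustration of the length dependence, the following Python sketch applies the formula above to two reads with identical per-base quality. The function name `expected_errors` is an assumption made for this example.

```python
def expected_errors(scores):
    """Sum of per-base error probabilities: sum(10^(-Q/10))."""
    return sum(10 ** (-q / 10) for q in scores)

short_read = [20] * 100   # 100 bases, each with error probability 0.01
long_read = [20] * 250    # same per-base quality, 2.5 times the length

print(round(expected_errors(short_read), 3))  # 1.0
print(round(expected_errors(long_read), 3))   # 2.5
```

Both reads are Q20 throughout, yet the longer one accumulates 2.5 times as many expected errors.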

3. Maximum expected error percentage (`meep`)

MaxEE standardized by sequence length
Represents expected number of errors per 100 bases
Formula: (MaxEE * 100) / sequence_length
Higher values indicate lower quality
Allows fair comparison between sequences of different lengths
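
A short Python sketch of the normalization, under the same assumptions as the formula above (the function name `meep` is chosen for illustration and the read is assumed to be non-empty):

```python
def meep(scores):
    """Expected errors per 100 bases: (MaxEE * 100) / sequence_length."""
    max_ee = sum(10 ** (-q / 10) for q in scores)
    return max_ee * 100 / len(scores)

short_read = [20] * 100   # 100 bases at Q20
long_read = [20] * 250    # 250 bases at Q20

print(round(meep(short_read), 3))  # 1.0
print(round(meep(long_read), 3))   # 1.0
```

Dividing by length removes the penalty for longer reads: both Q20 reads score one expected error per 100 bases.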

4. Low quality base count (`lqcount`)

Number of bases below specified quality threshold
Useful for binned quality scores (e.g., data from Illumina NovaSeq platform)
Counts bases with Phred score < threshold (default: 15)
Higher values indicate lower quality

5. Low quality base percentage (`lqpercent`)

Percentage of bases below quality threshold
Formula: (lqcount * 100) / sequence_length
Higher values indicate lower quality
Normalizes low-quality base count by sequence length
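
The two low-quality metrics can be sketched together in Python. The function names and the example scores are illustrative assumptions; the threshold default of 15 follows the description above.

```python
def lq_count(scores, threshold=15):
    """Number of bases with Phred score strictly below the threshold."""
    return sum(1 for q in scores if q < threshold)

def lq_percent(scores, threshold=15):
    """Low-quality bases as a percentage of read length (non-empty read assumed)."""
    return lq_count(scores, threshold) * 100 / len(scores)

# Illustrative binned scores, where only a few distinct values occur
binned = [37, 37, 12, 23, 2, 37, 37, 12]

print(lq_count(binned))     # 3
print(lq_percent(binned))   # 37.5
```

With binned data the probability-based metrics take only a few distinct per-base values, so counting bases below a cutoff can be an easier quantity to interpret.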
