procs-maker (ProCs Maker)

Protein-Clusters Maker — bacterial and viral protein extraction, MMseqs2 clustering, and protein-count + presence/absence matrix construction.

Given a set of target GCA accessions and/or a viral contigs FASTA, this tool:

1. Downloads and streams the ProGenomes3 protein reference (~8 GB compressed, only if a bacterial input is provided)
2. Extracts bacterial proteins for the target genomes (streaming in batches)
3. Extracts viral proteins via Prodigal gene prediction
4. Clusters the combined (or single-side) collection with MMseqs2 easy-cluster
5. Builds two matrices:
   - pc_matrix.csv: protein-count matrix (rows = genomes/viruses, cols = cluster IDs)
   - pb_matrix.csv: binary presence/absence matrix derived from pc_matrix

The bacterial and viral sides are independently optional; at least one is required.

Installation

pip install git+https://github.com/bio-datascience/procs-maker.git

Requires MMseqs2 (mmseqs/mmseqs2) and Prodigal (prodigal) on PATH. Prodigal is only used when a viral input is provided.
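
Before a long run it can help to confirm that the external tools are reachable. The snippet below is a minimal sketch, not part of procs-maker: the helper name `missing_tools` is hypothetical, and it assumes the MMseqs2 executable is installed under the name `mmseqs`.

```python
import shutil


def missing_tools(need_viral: bool) -> list[str]:
    """Return the external executables that cannot be found on PATH.

    Assumption: MMseqs2 is installed as `mmseqs`; adjust the name if your
    installation exposes a different executable.
    """
    required = ["mmseqs"]
    if need_viral:
        # Prodigal is only needed when a viral contigs FASTA is supplied.
        required.append("prodigal")
    return [tool for tool in required if shutil.which(tool) is None]


missing = missing_tools(need_viral=True)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools found.")
```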

Bacterial input modes

There are two ways to specify which bacterial genomes to target. They are mutually exclusive: use one or the other, not both.

Mode 1 — Harmonized taxonomy table (`--bac-gca-table`)

This is the recommended mode when running ProCs Maker as part of a workflow. You pass a .csv file produced by an earlier workflow step or by another tool (e.g. progenomes-harmonizer), and the tool extracts the GCA target set automatically.

Required CSV format

The file must be a comma-separated CSV with a row index and at least three columns named exactly as follows:

Column	Content
`GCA_species`	Species-level GCA accession (e.g. `GCA_000001405`)
`GCA_genus`	Genus-level representative GCA accession
`GCA_family`	Family-level representative GCA accession

The exact file produced by progenomes-harmonizer already has these columns. Do not rename them.

How GCAs are extracted from the table

By default (genus-level mode), the target set is:

GCA_genus ∪ GCA_family

With --species-level, the target set becomes:

GCA_species ∪ GCA_family

NaN values in any column are silently ignored. This means that if a row has no genus-level representative it is naturally skipped at genus level, and a family-level genome is always included as a fallback when present.

Example

Suppose taxonomy_table_gut_withIDs.csv contains:

,GCA_species,GCA_genus,GCA_family
0,GCA_000001405,GCA_000001405,GCA_000007305
1,GCA_000013425,GCA_000013425,GCA_000007305
2,GCA_000148985,,GCA_000007305

At genus level the extracted set is {GCA_000001405, GCA_000013425, GCA_000007305}. Row 2 has no genus entry so only its family GCA contributes.

procs_maker \
    --bac-gca-table taxonomy_table_gut_withIDs.csv \
    --output-dir output/ \
    --download-path downloads/
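
The target-set rule described above (genus or species column, united with the family column, empty cells ignored) can be sketched in a few lines. This is an illustrative reimplementation using only the standard library, not the code procs-maker runs; the function name `extract_gca_targets` is hypothetical.

```python
import csv
import io


def extract_gca_targets(csv_text: str, species_level: bool = False) -> set[str]:
    """Union of the level column and GCA_family, skipping empty/NaN cells."""
    level_column = "GCA_species" if species_level else "GCA_genus"
    targets = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in (level_column, "GCA_family"):
            value = (row.get(column) or "").strip()
            # Treat empty cells and a literal "nan" as missing values.
            if value and value.lower() != "nan":
                targets.add(value)
    return targets


# Same table as in the example above.
table = """,GCA_species,GCA_genus,GCA_family
0,GCA_000001405,GCA_000001405,GCA_000007305
1,GCA_000013425,GCA_000013425,GCA_000007305
2,GCA_000148985,,GCA_000007305
"""

print(sorted(extract_gca_targets(table)))
# ['GCA_000001405', 'GCA_000007305', 'GCA_000013425']
```

With `species_level=True` the same table would additionally contribute `GCA_000148985`, because row 2 has a species-level entry even though its genus cell is empty.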

Mode 2 — Plain GCA list (`--gca-list`)

Pass a plain text file with one GCA accession per line. Blank lines are skipped. Use this mode when you already know exactly which genomes you want and do not need the taxonomy-driven selection logic.

File format

GCA_000001405
GCA_000013425
GCA_000007305

No header, no extra columns — just one accession per line.

procs_maker \
    --gca-list target_gcas.txt \
    --output-dir output/ \
    --download-path downloads/
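
For reference, reading such a list amounts to stripping whitespace and dropping blank lines. The sketch below shows that behaviour under the assumption that nothing else (such as deduplication or accession validation) is applied; the helper name `read_gca_list` is hypothetical.

```python
def read_gca_list(lines) -> list[str]:
    """Return the non-blank lines of a GCA list, stripped of whitespace."""
    return [line.strip() for line in lines if line.strip()]


# Works on any iterable of lines, e.g. an open file handle.
example = ["GCA_000001405\n", "\n", "GCA_000013425\n", "  \n", "GCA_000007305\n"]
print(read_gca_list(example))
# ['GCA_000001405', 'GCA_000013425', 'GCA_000007305']
```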

Usage examples

# ── Bi-modal (joint bacterial + viral clustering) ──────────────────────────
procs_maker \
    --bac-gca-table taxonomy_table_gut_withIDs.csv \
    --viral-contigs phage_contigs.fasta \
    --output-dir output/ \
    --download-path downloads/

# ── Bacterial-only (table mode) ────────────────────────────────────────────
procs_maker \
    --bac-gca-table taxonomy_table_gut_withIDs.csv \
    --output-dir output/ \
    --download-path downloads/

# ── Bacterial-only (list mode) ─────────────────────────────────────────────
procs_maker \
    --gca-list target_gcas.txt \
    --output-dir output/ \
    --download-path downloads/

# ── Viral-only (no --download-path needed) ────────────────────────────────
procs_maker \
    --viral-contigs phage_contigs.fasta \
    --output-dir output/

All flags

Bacterial input (mutually exclusive)

Flag	Description
`--bac-gca-table PATH`	Harmonized taxonomy CSV (`taxonomy_table_*_withIDs.csv`). Required columns: `GCA_species`, `GCA_genus`, `GCA_family`. The GCA target set is derived automatically (see above).
`--gca-list PATH`	Plain text file, one GCA accession per line. Simpler alternative when you already have the exact accession list.

Viral input

Flag	Description
`--viral-contigs PATH`	FASTA file of viral (phage) contigs. Prodigal is run on these to predict proteins. Omit for a bacterial-only run.

Required common argument

Flag	Description
`--output-dir PATH`	Root directory for all outputs. Created if it does not exist.

Download and reference

Flag	Default	Description
`--download-path PATH`	(none)	Directory where the ProGenomes3 protein reference (`progenomes3.proteins.representatives.fasta.bz2`, ~8 GB) is cached or downloaded. Required whenever a bacterial input is given (`--bac-gca-table` or `--gca-list`); ignored in viral-only mode.
`--protein-reference-url URL`	ProGenomes3 EMBL URL	Override the download URL for the protein reference.

Extraction options

Flag	Default	Description
`--batch-size INT`	`1500`	Number of genomes processed per streaming batch during bacterial protein extraction. The reference archive is large (~8 GB compressed); a larger batch size reduces the number of passes through the archive but increases peak RAM usage. Reduce this value if you run out of memory; increase it on machines with large RAM to speed up extraction.
`--species-level`	off	Switch the GCA target-set extraction to species level (`GCA_species ∪ GCA_family`) instead of the default genus level (`GCA_genus ∪ GCA_family`). Only meaningful with `--bac-gca-table`; has no effect with `--gca-list` (the list is used verbatim).
`--save-single-genomes`	off	In addition to the combined `BacterialProteinsCollection.fasta`, save an individual FASTA file per genome in `Targets Proteins Extraction/single_genomes/`. Useful for downstream per-genome analyses.
`--keep-coords`	off	By default the Prodigal `coords.gbk` annotation file produced during viral protein prediction is deleted after extraction. Pass this flag to keep it (e.g. for inspecting gene coordinates).
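
The trade-off behind `--batch-size` can be illustrated with a small sketch: each batch of target genomes costs one full pass over the reference, and only the proteins of the current batch are held in memory. This is a simplified illustration under stated assumptions, not the tool's implementation. The names `extract_in_batches`, `open_reference`, and `genome_of` are hypothetical, and the `>GENOME|protein` header layout in the demo is invented for the example (the real ProGenomes3 headers differ).

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def extract_in_batches(open_reference, targets, batch_size, genome_of):
    """Yield one {genome: [fasta lines]} dict per batch of target genomes.

    open_reference() must return a fresh iterator over the FASTA lines,
    e.g. `lambda: bz2.open(path, "rt")` for a compressed reference.
    genome_of(header) maps a FASTA header line to its genome identifier.
    """
    for batch in batched(sorted(targets), batch_size):
        wanted = set(batch)
        records = {}
        current, keep = None, False
        for line in open_reference():  # one full pass per batch
            if line.startswith(">"):
                current = genome_of(line)
                keep = current in wanted
            if keep:
                records.setdefault(current, []).append(line.rstrip("\n"))
        yield records  # only this batch is held in memory


# Toy reference with an invented header layout: ">GENOME|protein".
reference = [">G1|p1\n", "MKT\n", ">G2|p1\n", "MAA\n", ">G3|p1\n", "MCC\n"]
for records in extract_in_batches(
    lambda: iter(reference), {"G1", "G3"}, 1, lambda h: h[1:].split("|")[0]
):
    print(records)
```

With two targets and a batch size of 1 the reference is read twice; a batch size of 2 or more would need a single pass at the cost of holding both genomes' proteins at once.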

Clustering options

Flag	Default	Description
`--filter-1bac-1vir`	off	After clustering, discard any cluster that does not contain at least one bacterial protein and at least one viral protein. This enforces cross-domain clusters and is only meaningful in bi-modal runs. Requires both a bacterial input (`--bac-gca-table`/`--gca-list`) and `--viral-contigs` — the tool will error if you pass this flag with a single-side input.
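
The effect of `--filter-1bac-1vir` can be pictured as follows. The sketch assumes the MMseqs2 cluster TSV has already been parsed into (representative, member) pairs and that the sets of bacterial and viral protein IDs are known; how procs-maker actually tells the two apart is not documented here, so `bacterial_ids` and `viral_ids` are assumptions, as is the function name.

```python
from collections import defaultdict


def filter_cross_domain(cluster_rows, bacterial_ids, viral_ids):
    """Keep only clusters with >= 1 bacterial and >= 1 viral member."""
    clusters = defaultdict(list)
    for representative, member in cluster_rows:
        clusters[representative].append(member)
    return {
        rep: members
        for rep, members in clusters.items()
        if any(m in bacterial_ids for m in members)
        and any(m in viral_ids for m in members)
    }


rows = [
    ("bac_p1", "bac_p1"), ("bac_p1", "vir_p7"),  # mixed cluster: kept
    ("bac_p2", "bac_p2"), ("bac_p2", "bac_p3"),  # bacterial only: dropped
    ("vir_p9", "vir_p9"),                        # viral only: dropped
]
kept = filter_cross_domain(rows, {"bac_p1", "bac_p2", "bac_p3"}, {"vir_p7", "vir_p9"})
print(kept)
# {'bac_p1': ['bac_p1', 'vir_p7']}
```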

Cleanup options

Flag	Default	Description
`--ref-removal`	off	Delete the ProGenomes3 protein reference archive (`progenomes3.proteins.representatives.fasta.bz2`) from `--download-path` after bacterial protein extraction completes. Frees ~8 GB of disk space. The file will be re-downloaded on the next run if bacterial input is given.
`--remove-collections`	off	Delete the intermediate protein FASTA collections (`BacterialProteinsCollection.fasta`, `ViralProteinsCollection.fasta`, and `CombinedProteinsCollection.fasta` in bi-modal runs) after clustering finishes. The matrices are already written at that point, so these files are no longer needed unless you want to inspect the raw protein sequences.

Logging

Flag	Description
`-v`, `--verbose`	Enable verbose (`INFO`-level) logging. By default only warnings and errors are shown.
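
The constraints among the flags listed above (mutually exclusive bacterial inputs, at least one input side, `--download-path` required for bacterial runs, `--filter-1bac-1vir` only in bi-modal runs) can be restated as a small `argparse` sketch. It is a summary of the documented rules, not the tool's actual parser; `build_parser` and `validate` are hypothetical names and only a subset of the flags is declared.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="procs_maker")
    # argparse rejects the command line if both bacterial flags are given.
    bacterial = parser.add_mutually_exclusive_group()
    bacterial.add_argument("--bac-gca-table")
    bacterial.add_argument("--gca-list")
    parser.add_argument("--viral-contigs")
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--download-path")
    parser.add_argument("--filter-1bac-1vir", dest="filter_cross",
                        action="store_true")
    return parser


def validate(args: argparse.Namespace) -> list[str]:
    """Return a list of rule violations (empty when the combination is valid)."""
    has_bacterial = bool(args.bac_gca_table or args.gca_list)
    has_viral = bool(args.viral_contigs)
    problems = []
    if not (has_bacterial or has_viral):
        problems.append("provide a bacterial input, --viral-contigs, or both")
    if has_bacterial and not args.download_path:
        problems.append("--download-path is required with a bacterial input")
    if args.filter_cross and not (has_bacterial and has_viral):
        problems.append("--filter-1bac-1vir needs bacterial and viral inputs")
    return problems


args = build_parser().parse_args(
    ["--viral-contigs", "phage_contigs.fasta", "--output-dir", "output/"]
)
print(validate(args))
# []
```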

Outputs

<output-dir>/
├── Targets Proteins Extraction/
│   ├── BacterialProteinsCollection.fasta   (bacterial or bi-modal runs)
│   ├── ViralProteinsCollection.fasta        (viral or bi-modal runs)
│   └── CombinedProteinsCollection.fasta    (bi-modal only; input to MMseqs2)
├── Clustering/
│   ├── clusterRes_cluster.tsv              (MMseqs2 cluster assignments)
│   └── ...                                  (other MMseqs2 output files)
├── pc_matrix.csv                            (protein-count matrix)
└── pb_matrix.csv                            (binary presence/absence matrix)

pc_matrix.csv: rows are genome/virus identifiers, columns are cluster representative IDs, values are the count of proteins from that genome/virus assigned to each cluster.

pb_matrix.csv: same shape as pc_matrix.csv but binarized — 1 if the genome/virus has at least one protein in that cluster, 0 otherwise. This is the matrix used downstream for network inference.
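
The relationship between the two matrices can be sketched as below: counts per (genome, cluster) pair give the protein-count matrix, and thresholding at one protein gives the presence/absence matrix. This is an illustration with plain Python lists, not the tool's code. It assumes the cluster assignments are available as (representative, member) pairs and that a hypothetical `genome_of` function maps a protein ID to its genome or virus; the `genome|protein` ID layout in the demo is invented.

```python
from collections import Counter


def build_matrices(cluster_rows, genome_of):
    """Return (genomes, clusters, pc, pb) from (representative, member) pairs."""
    counts = Counter()
    for representative, member in cluster_rows:
        counts[(genome_of(member), representative)] += 1
    genomes = sorted({genome for genome, _ in counts})
    clusters = sorted({cluster for _, cluster in counts})
    # pc: number of proteins of each genome/virus in each cluster.
    pc = [[counts[(g, c)] for c in clusters] for g in genomes]
    # pb: 1 where the count is at least one, 0 otherwise.
    pb = [[1 if value > 0 else 0 for value in row] for row in pc]
    return genomes, clusters, pc, pb


rows = [
    ("b1|p1", "b1|p1"), ("b1|p1", "b1|p2"), ("b1|p1", "v1|p1"),
    ("v1|p2", "v1|p2"),
]
genomes, clusters, pc, pb = build_matrices(rows, lambda pid: pid.split("|")[0])
print(genomes, clusters)  # ['b1', 'v1'] ['b1|p1', 'v1|p2']
print(pc)                 # [[2, 0], [1, 1]]
print(pb)                 # [[1, 0], [1, 1]]
```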
