Protein-Clusters Maker — bacterial and viral protein extraction, MMseqs2 clustering, and protein-count + presence/absence matrix construction.
Given a set of target GCA accessions and/or a viral contigs FASTA, this tool:
- Downloads and streams the ProGenomes3 protein reference (~8 GB compressed, only if a bacterial input is provided)
- Extracts bacterial proteins for the target genomes (streaming in batches)
- Extracts viral proteins via Prodigal gene prediction
- Clusters the combined (or single-side) collection with MMseqs2 easy-cluster
- Builds two matrices:
  - pc_matrix.csv: protein-count matrix (rows = genomes/viruses, cols = cluster IDs)
  - pb_matrix.csv: binary presence/absence matrix derived from pc_matrix
The bacterial and viral sides are independently optional; at least one is required.
pip install git+https://github.com/bio-datascience/procs-maker.git

Requires mmseqs/mmseqs2 and prodigal on PATH (Prodigal is only used when
a viral input is provided).
There are two ways to specify which bacterial genomes to target. They are mutually exclusive, so use only one.
Table mode (--bac-gca-table) is the recommended mode when running ProCs Maker as part of a workflow.
You pass a .csv file produced by a workflow step or by another tool
(e.g. progenomes-harmonizer), and the tool extracts the GCA target set automatically.
Required CSV format
The file must be a comma-separated CSV with a row index and at least three columns named exactly as follows:
| Column | Content |
|---|---|
| GCA_species | Species-level GCA accession (e.g. GCA_000001405) |
| GCA_genus | Genus-level representative GCA accession |
| GCA_family | Family-level representative GCA accession |
The exact file produced by progenomes-harmonizer already has these columns.
Do not rename them.
How GCAs are extracted from the table
By default (genus-level mode), the target set is:
GCA_genus ∪ GCA_family
With --species-level, the target set becomes:
GCA_species ∪ GCA_family
NaN values in any column are silently ignored. This means that if a row has
no genus-level representative it is naturally skipped at genus level, and a
family-level genome is always included as a fallback when present.
Example
Suppose taxonomy_table_gut_withIDs.csv contains:
,GCA_species,GCA_genus,GCA_family
0,GCA_000001405,GCA_000001405,GCA_000007305
1,GCA_000013425,GCA_000013425,GCA_000007305
2,GCA_000148985,,GCA_000007305
At genus level the extracted set is {GCA_000001405, GCA_000013425, GCA_000007305}.
Row 2 has no genus entry so only its family GCA contributes.
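
The selection rule above can be sketched in a few lines of Python. This is only an illustration built on the standard library, not the tool's actual implementation: the function name extract_target_gcas is hypothetical, and empty CSV cells stand in for NaN values.

```python
import csv
import io


def extract_target_gcas(csv_text, species_level=False):
    """Return the union of the chosen rank column and GCA_family.

    Hypothetical helper: empty cells (and literal "nan") are skipped,
    mirroring the "NaN values are silently ignored" rule.
    """
    rank_column = "GCA_species" if species_level else "GCA_genus"
    targets = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in (rank_column, "GCA_family"):
            value = (row.get(column) or "").strip()
            if value and value.lower() != "nan":
                targets.add(value)
    return targets


# Same three rows as the example table above.
table = """,GCA_species,GCA_genus,GCA_family
0,GCA_000001405,GCA_000001405,GCA_000007305
1,GCA_000013425,GCA_000013425,GCA_000007305
2,GCA_000148985,,GCA_000007305
"""

genus_targets = extract_target_gcas(table)
species_targets = extract_target_gcas(table, species_level=True)
print(sorted(genus_targets))
print(sorted(species_targets))
```

With species_level=True the same table additionally yields GCA_000148985, because row 2 has a species entry even though its genus entry is missing.
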
procs_maker \
--bac-gca-table taxonomy_table_gut_withIDs.csv \
--output-dir output/ \
--download-path downloads/

In list mode (--gca-list), pass a plain text file with one GCA accession per line. Blank lines are skipped. Use this mode when you already know exactly which genomes you want and do not need the taxonomy-driven selection logic.
File format
GCA_000001405
GCA_000013425
GCA_000007305
No header, no extra columns — just one accession per line.
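
As a quick way to check a list file before a run, the format can be parsed as sketched below. The helper read_gca_list is hypothetical and only mirrors the rules stated here (one accession per line, blank lines skipped); whether the tool also removes duplicates is not specified, so this sketch keeps them.

```python
def read_gca_list(lines):
    """Collect accessions from an iterable of lines, skipping blank ones."""
    return [line.strip() for line in lines if line.strip()]


# In practice the lines would come from open("target_gcas.txt").
example = "GCA_000001405\n\nGCA_000013425\nGCA_000007305\n"
accessions = read_gca_list(example.splitlines())
print(accessions)
```
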
procs_maker \
--gca-list target_gcas.txt \
--output-dir output/ \
--download-path downloads/

# ── Bi-modal (joint bacterial + viral clustering) ──────────────────────────
procs_maker \
--bac-gca-table taxonomy_table_gut_withIDs.csv \
--viral-contigs phage_contigs.fasta \
--output-dir output/ \
--download-path downloads/
# ── Bacterial-only (table mode) ────────────────────────────────────────────
procs_maker \
--bac-gca-table taxonomy_table_gut_withIDs.csv \
--output-dir output/ \
--download-path downloads/
# ── Bacterial-only (list mode) ─────────────────────────────────────────────
procs_maker \
--gca-list target_gcas.txt \
--output-dir output/ \
--download-path downloads/
# ── Viral-only (no --download-path needed) ────────────────────────────────
procs_maker \
--viral-contigs phage_contigs.fasta \
--output-dir output/

| Flag | Description |
|---|---|
| --bac-gca-table PATH | Harmonized taxonomy CSV (taxonomy_table_*_withIDs.csv). Required columns: GCA_species, GCA_genus, GCA_family. The GCA target set is derived automatically (see above). |
| --gca-list PATH | Plain text file, one GCA accession per line. Simpler alternative when you already have the exact accession list. |

| Flag | Description |
|---|---|
| --viral-contigs PATH | FASTA file of viral (phage) contigs. Prodigal is run on these to predict proteins. Omit for a bacterial-only run. |

| Flag | Description |
|---|---|
| --output-dir PATH | Root directory for all outputs. Created if it does not exist. |

| Flag | Default | Description |
|---|---|---|
| --download-path PATH | (none) | Directory where the ProGenomes3 protein reference (progenomes3.proteins.representatives.fasta.bz2, ~8 GB) is cached or downloaded. Required whenever a bacterial input is given (--bac-gca-table or --gca-list); ignored in viral-only mode. |
| --protein-reference-url URL | ProGenomes3 EMBL URL | Override the download URL for the protein reference. |

| Flag | Default | Description |
|---|---|---|
| --batch-size INT | 1500 | Number of genomes processed per streaming batch during bacterial protein extraction. The reference archive is large (~8 GB compressed); a larger batch size reduces the number of passes through the archive but increases peak RAM usage. Reduce this value if you run out of memory; increase it on machines with large RAM to speed up extraction. |
| --species-level | off | Switch the GCA target-set extraction to species level (GCA_species ∪ GCA_family) instead of the default genus level (GCA_genus ∪ GCA_family). Only meaningful with --bac-gca-table; has no effect with --gca-list (the list is used verbatim). |
| --save-single-genomes | off | In addition to the combined BacterialProteinsCollection.fasta, save an individual FASTA file per genome in Targets Proteins Extraction/single_genomes/. Useful for downstream per-genome analyses. |
| --keep-coords | off | By default the Prodigal coords.gbk annotation file produced during viral protein prediction is deleted after extraction. Pass this flag to keep it (e.g. for inspecting gene coordinates). |
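
To get a feel for the --batch-size trade-off, the sketch below splits a target set into batches and counts them. It assumes, as the description implies, one pass over the reference archive per batch; the helper make_batches and the synthetic accessions are hypothetical and do not reflect the tool's internals.

```python
def make_batches(accessions, batch_size=1500):
    """Split the target accessions into consecutive batches of at most batch_size."""
    ordered = sorted(accessions)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# 4000 synthetic accessions, purely for illustration.
targets = [f"GCA_{i:09d}" for i in range(4000)]

batches = make_batches(targets, batch_size=1500)
batch_sizes = [len(batch) for batch in batches]
print(len(batches), batch_sizes)

# A batch size covering the whole target set needs a single batch.
single = make_batches(targets, batch_size=4000)
print(len(single))
```

Under that assumption, 4000 genomes at the default batch size need three passes, while a batch size of 4000 needs one pass at the cost of holding more genomes in memory at once.
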

| Flag | Default | Description |
|---|---|---|
| --filter-1bac-1vir | off | After clustering, discard any cluster that does not contain at least one bacterial protein and at least one viral protein. This enforces cross-domain clusters and is only meaningful in bi-modal runs. Requires both a bacterial input (--bac-gca-table/--gca-list) and --viral-contigs; the tool will error if you pass this flag with a single-side input. |
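
The effect of --filter-1bac-1vir can be illustrated with a small sketch. How the tool tells bacterial from viral proteins is not described here, so this example assumes a known set of viral protein IDs; the function keep_cross_domain_clusters and all IDs are hypothetical.

```python
def keep_cross_domain_clusters(clusters, viral_ids):
    """Keep clusters that hold at least one viral and one bacterial protein.

    Assumption: any member not listed in viral_ids is bacterial.
    """
    kept = {}
    for representative, members in clusters.items():
        has_viral = any(member in viral_ids for member in members)
        has_bacterial = any(member not in viral_ids for member in members)
        if has_viral and has_bacterial:
            kept[representative] = members
    return kept


# Cluster representative -> member protein IDs (toy data).
clusters = {
    "bac_p1": ["bac_p1", "bac_p2", "vir_p1"],  # mixed, kept
    "bac_p3": ["bac_p3"],                      # bacterial only, dropped
    "vir_p2": ["vir_p2", "vir_p3"],            # viral only, dropped
}
viral_ids = {"vir_p1", "vir_p2", "vir_p3"}

kept = keep_cross_domain_clusters(clusters, viral_ids)
print(sorted(kept))
```
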

| Flag | Default | Description |
|---|---|---|
| --ref-removal | off | Delete the ProGenomes3 protein reference archive (progenomes3.proteins.representatives.fasta.bz2) from --download-path after bacterial protein extraction completes. Frees ~8 GB of disk space. The file will be re-downloaded on the next run if bacterial input is given. |
| --remove-collections | off | Delete the intermediate protein FASTA collections (BacterialProteinsCollection.fasta, ViralProteinsCollection.fasta, and CombinedProteinsCollection.fasta in bi-modal runs) after clustering finishes. The matrices are already written at that point, so these files are no longer needed unless you want to inspect the raw protein sequences. |

| Flag | Description |
|---|---|
| -v, --verbose | Enable verbose (INFO-level) logging. By default only warnings and errors are shown. |

<output-dir>/
├── Targets Proteins Extraction/
│ ├── BacterialProteinsCollection.fasta (bacterial or bi-modal runs)
│ ├── ViralProteinsCollection.fasta (viral or bi-modal runs)
│ └── CombinedProteinsCollection.fasta (bi-modal only; input to MMseqs2)
├── Clustering/
│ ├── clusterRes_cluster.tsv (MMseqs2 cluster assignments)
│ └── ... (other MMseqs2 output files)
├── pc_matrix.csv (protein-count matrix)
└── pb_matrix.csv (binary presence/absence matrix)
pc_matrix.csv: rows are genome/virus identifiers, columns are cluster
representative IDs, values are the count of proteins from that genome/virus
assigned to each cluster.
pb_matrix.csv: same shape as pc_matrix.csv but binarized — 1 if the
genome/virus has at least one protein in that cluster, 0 otherwise. This is
the matrix used downstream for network inference.
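
The relationship between the two matrices can be sketched as follows. The input mimics the two-column (representative, member) layout of an MMseqs2 cluster TSV; the protein_to_genome mapping, the function build_matrices, and all IDs are hypothetical, since how the tool maps protein IDs back to genomes or viruses is not described here.

```python
from collections import Counter, defaultdict


def build_matrices(cluster_rows, protein_to_genome):
    """Build count (pc) and binary (pb) matrices as dicts of row lists."""
    counts = defaultdict(Counter)  # genome -> cluster representative -> count
    for representative, member in cluster_rows:
        counts[protein_to_genome[member]][representative] += 1
    clusters = sorted({representative for representative, _ in cluster_rows})
    genomes = sorted(counts)
    pc = {g: [counts[g][c] for c in clusters] for g in genomes}
    # Binarize: 1 if the genome has at least one protein in the cluster.
    pb = {g: [1 if n > 0 else 0 for n in row] for g, row in pc.items()}
    return clusters, pc, pb


# (representative, member) pairs, as in a cluster TSV (toy data).
cluster_rows = [("p1", "p1"), ("p1", "p2"), ("p1", "v1"), ("p3", "p3")]
protein_to_genome = {"p1": "GCA_A", "p2": "GCA_A", "v1": "phage_1", "p3": "GCA_B"}

clusters, pc, pb = build_matrices(cluster_rows, protein_to_genome)
print(clusters)
print(pc)
print(pb)
```

Here GCA_A contributes two proteins to cluster p1, so its pc entry is 2 while its pb entry is 1.
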