bio-datascience/procs-maker

procs-maker (ProCs Maker)

Protein-Clusters Maker — bacterial and viral protein extraction, MMseqs2 clustering, and protein-count + presence/absence matrix construction.

Given a set of target GCA accessions and/or a viral contigs FASTA, this tool:

  1. Downloads and streams the ProGenomes3 protein reference (~8 GB compressed, only if a bacterial input is provided)
  2. Extracts bacterial proteins for the target genomes (streaming in batches)
  3. Extracts viral proteins via Prodigal gene prediction
  4. Clusters the combined (or single-side) collection with MMseqs2 easy-cluster
  5. Builds two matrices:
    • pc_matrix.csv — protein-count matrix (rows = genomes/viruses, cols = cluster IDs)
    • pb_matrix.csv — binary presence/absence matrix derived from pc_matrix

The bacterial and viral sides are independently optional; at least one is required.


Installation

pip install git+https://github.com/bio-datascience/procs-maker.git

Requires MMseqs2 (the mmseqs executable) and Prodigal (the prodigal executable) on PATH. Prodigal is only used when a viral input is provided.
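
Before a long run it can help to confirm that the external tools are reachable. The snippet below is a minimal sketch, not part of ProCs Maker; the helper name missing_tools is hypothetical, and the executable names mmseqs and prodigal are assumed to match a standard installation.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust them if your installation differs.
    missing = missing_tools(["mmseqs", "prodigal"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```

For a bacterial-only run, only mmseqs needs to be present.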


Bacterial input modes

There are two ways to specify which bacterial genomes to target. They are mutually exclusive: use only one per run.

Mode 1 — Harmonized taxonomy table (--bac-gca-table)

This is the recommended mode when running ProCs Maker as part of a workflow. Pass a CSV file produced by an earlier workflow step or by another tool (e.g. progenomes-harmonizer), and ProCs Maker extracts the GCA target set automatically.

Required CSV format

The file must be a CSV with a row index in the first column and at least three columns named exactly as follows:

| Column | Content |
|---|---|
| GCA_species | Species-level GCA accession (e.g. GCA_000001405) |
| GCA_genus | Genus-level representative GCA accession |
| GCA_family | Family-level representative GCA accession |

The exact file produced by progenomes-harmonizer already has these columns. Do not rename them.

How GCAs are extracted from the table

By default (genus-level mode), the target set is:

GCA_genus ∪ GCA_family

With --species-level, the target set becomes:

GCA_species ∪ GCA_family

NaN values in any column are silently ignored. This means that if a row has no genus-level representative it is naturally skipped at genus level, and a family-level genome is always included as a fallback when present.

Example

Suppose taxonomy_table_gut_withIDs.csv contains:

,GCA_species,GCA_genus,GCA_family
0,GCA_000001405,GCA_000001405,GCA_000007305
1,GCA_000013425,GCA_000013425,GCA_000007305
2,GCA_000148985,,GCA_000007305

At genus level the extracted set is {GCA_000001405, GCA_000013425, GCA_000007305}. Row 2 has no genus entry so only its family GCA contributes.

procs_maker \
    --bac-gca-table taxonomy_table_gut_withIDs.csv \
    --output-dir output/ \
    --download-path downloads/
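
The selection rule for table mode (level column ∪ GCA_family, with missing values ignored) can be sketched in a few lines. This is an illustration of the rule as documented, not the tool's actual code; the function name extract_gca_targets is hypothetical, and empty cells or the literal string "nan" are assumed to represent missing values.

```python
import csv
import io


def extract_gca_targets(csv_text, species_level=False):
    """Union of the selected level column and GCA_family, skipping missing cells."""
    level_col = "GCA_species" if species_level else "GCA_genus"
    targets = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col in (level_col, "GCA_family"):
            value = (row.get(col) or "").strip()
            # Assumption: a missing value appears as an empty cell or "nan".
            if value and value.lower() != "nan":
                targets.add(value)
    return targets


table = """,GCA_species,GCA_genus,GCA_family
0,GCA_000001405,GCA_000001405,GCA_000007305
1,GCA_000013425,GCA_000013425,GCA_000007305
2,GCA_000148985,,GCA_000007305
"""

genus_targets = extract_gca_targets(table)
species_targets = extract_gca_targets(table, species_level=True)
print(sorted(genus_targets))
print(sorted(species_targets))
```

On the example table, the genus-level call yields the three accessions listed above, and the species-level call additionally includes GCA_000148985 from row 2.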

Mode 2 — Plain GCA list (--gca-list)

Pass a plain text file with one GCA accession per line. Blank lines are skipped. Use this mode when you already know exactly which genomes you want and do not need the taxonomy-driven selection logic.

File format

GCA_000001405
GCA_000013425
GCA_000007305

No header, no extra columns — just one accession per line.

procs_maker \
    --gca-list target_gcas.txt \
    --output-dir output/ \
    --download-path downloads/
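
For comparison with table mode, reading a plain list amounts to stripping whitespace and skipping blank lines. A minimal sketch, with read_gca_list as a hypothetical helper name (the tool's own parser may apply further checks):

```python
def read_gca_list(lines):
    """Return accessions in input order, skipping blank lines."""
    return [line.strip() for line in lines if line.strip()]


sample = "GCA_000001405\n\nGCA_000013425\nGCA_000007305\n"
accessions = read_gca_list(sample.splitlines())
print(accessions)
```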

Usage examples

# ── Bi-modal (joint bacterial + viral clustering) ──────────────────────────
procs_maker \
    --bac-gca-table taxonomy_table_gut_withIDs.csv \
    --viral-contigs phage_contigs.fasta \
    --output-dir output/ \
    --download-path downloads/

# ── Bacterial-only (table mode) ────────────────────────────────────────────
procs_maker \
    --bac-gca-table taxonomy_table_gut_withIDs.csv \
    --output-dir output/ \
    --download-path downloads/

# ── Bacterial-only (list mode) ─────────────────────────────────────────────
procs_maker \
    --gca-list target_gcas.txt \
    --output-dir output/ \
    --download-path downloads/

# ── Viral-only (no --download-path needed) ────────────────────────────────
procs_maker \
    --viral-contigs phage_contigs.fasta \
    --output-dir output/

All flags

Bacterial input (mutually exclusive)

| Flag | Description |
|---|---|
| --bac-gca-table PATH | Harmonized taxonomy CSV (taxonomy_table_*_withIDs.csv). Required columns: GCA_species, GCA_genus, GCA_family. The GCA target set is derived automatically (see above). |
| --gca-list PATH | Plain text file, one GCA accession per line. Simpler alternative when you already have the exact accession list. |

Viral input

| Flag | Description |
|---|---|
| --viral-contigs PATH | FASTA file of viral (phage) contigs. Prodigal is run on these to predict proteins. Omit for a bacterial-only run. |

Required common argument

| Flag | Description |
|---|---|
| --output-dir PATH | Root directory for all outputs. Created if it does not exist. |

Download and reference

| Flag | Default | Description |
|---|---|---|
| --download-path PATH | (none) | Directory where the ProGenomes3 protein reference (progenomes3.proteins.representatives.fasta.bz2, ~8 GB) is cached or downloaded. Required whenever a bacterial input is given (--bac-gca-table or --gca-list); ignored in viral-only mode. |
| --protein-reference-url URL | ProGenomes3 EMBL URL | Override the download URL for the protein reference. |

Extraction options

| Flag | Default | Description |
|---|---|---|
| --batch-size INT | 1500 | Number of genomes processed per streaming batch during bacterial protein extraction. The reference archive is large (~8 GB compressed); a larger batch size reduces the number of passes through the archive but increases peak RAM usage. Reduce this value if you run out of memory; increase it on machines with large RAM to speed up extraction. |
| --species-level | off | Switch the GCA target-set extraction to species level (GCA_species ∪ GCA_family) instead of the default genus level (GCA_genus ∪ GCA_family). Only meaningful with --bac-gca-table; has no effect with --gca-list (the list is used verbatim). |
| --save-single-genomes | off | In addition to the combined BacterialProteinsCollection.fasta, save an individual FASTA file per genome in Targets Proteins Extraction/single_genomes/. Useful for downstream per-genome analyses. |
| --keep-coords | off | By default the Prodigal coords.gbk annotation file produced during viral protein prediction is deleted after extraction. Pass this flag to keep it (e.g. for inspecting gene coordinates). |
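
The --batch-size trade-off (fewer passes through the archive versus higher peak RAM) can be illustrated with a small streaming sketch. This is not the tool's implementation: the function names, the genome_of callback, and the "genome|protein" header layout used in the demo are all hypothetical, chosen only to make the example self-contained.

```python
import bz2
import io


def batches(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def iter_fasta(handle):
    """Yield (header, sequence) pairs from a text-mode FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_in_batches(open_reference, targets, batch_size, genome_of):
    """Make one pass over the reference per batch of target genomes.

    Only the proteins of the current batch are held in memory at once.
    """
    for batch in batches(sorted(targets), batch_size):
        wanted = set(batch)
        collected = {genome: [] for genome in batch}
        with open_reference() as handle:
            for header, seq in iter_fasta(handle):
                genome = genome_of(header)
                if genome in wanted:
                    collected[genome].append((header, seq))
        yield collected


# Tiny in-memory stand-in for the compressed reference (hypothetical headers).
raw = b">G1|p1\nMKT\n>G2|p1\nMAA\n>G1|p2\nMLL\n>G3|p1\nMVV\n"
archive = bz2.compress(raw)
open_ref = lambda: bz2.open(io.BytesIO(archive), "rt")
genome_of = lambda header: header.split("|")[0]

result = list(extract_in_batches(open_ref, {"G1", "G2"}, 1, genome_of))
print(len(result), "passes over the archive")
```

With a batch size of 1, the two targets need two passes. A batch size of 2 or more would need a single pass but would hold both genomes' proteins in memory together.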

Clustering options

| Flag | Default | Description |
|---|---|---|
| --filter-1bac-1vir | off | After clustering, discard any cluster that does not contain at least one bacterial protein and at least one viral protein. This enforces cross-domain clusters and is only meaningful in bi-modal runs. Requires both a bacterial input (--bac-gca-table/--gca-list) and --viral-contigs; the tool will error if you pass this flag with a single-side input. |
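
The cross-domain rule behind --filter-1bac-1vir can be sketched as a filter over cluster memberships. The function name, the is_viral predicate, and the "vir|" prefix in the demo are hypothetical; how ProCs Maker actually distinguishes bacterial from viral proteins is not described here.

```python
def filter_cross_domain(clusters, is_viral):
    """Keep clusters that contain at least one bacterial and one viral member."""
    kept = {}
    for representative, members in clusters.items():
        domains = {bool(is_viral(member)) for member in members}
        if domains == {True, False}:
            kept[representative] = members
    return kept


clusters = {
    "c1": ["bac|p1", "vir|p1"],   # mixed: kept
    "c2": ["bac|p2", "bac|p3"],   # bacterial only: dropped
    "c3": ["vir|p2"],             # viral only: dropped
}
is_viral = lambda member: member.startswith("vir|")  # hypothetical naming scheme

kept = filter_cross_domain(clusters, is_viral)
print(kept)
```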

Cleanup options

| Flag | Default | Description |
|---|---|---|
| --ref-removal | off | Delete the ProGenomes3 protein reference archive (progenomes3.proteins.representatives.fasta.bz2) from --download-path after bacterial protein extraction completes. Frees ~8 GB of disk space. The file will be re-downloaded on the next run if bacterial input is given. |
| --remove-collections | off | Delete the intermediate protein FASTA collections (BacterialProteinsCollection.fasta, ViralProteinsCollection.fasta, and CombinedProteinsCollection.fasta in bi-modal runs) after clustering finishes. The matrices are already written at that point, so these files are no longer needed unless you want to inspect the raw protein sequences. |

Logging

| Flag | Description |
|---|---|
| -v, --verbose | Enable verbose (INFO-level) logging. By default only warnings and errors are shown. |

Outputs

<output-dir>/
├── Targets Proteins Extraction/
│   ├── BacterialProteinsCollection.fasta   (bacterial or bi-modal runs)
│   ├── ViralProteinsCollection.fasta        (viral or bi-modal runs)
│   └── CombinedProteinsCollection.fasta    (bi-modal only; input to MMseqs2)
├── Clustering/
│   ├── clusterRes_cluster.tsv              (MMseqs2 cluster assignments)
│   └── ...                                  (other MMseqs2 output files)
├── pc_matrix.csv                            (protein-count matrix)
└── pb_matrix.csv                            (binary presence/absence matrix)

pc_matrix.csv: rows are genome/virus identifiers, columns are cluster representative IDs, values are the count of proteins from that genome/virus assigned to each cluster.

pb_matrix.csv: same shape as pc_matrix.csv but binarized — 1 if the genome/virus has at least one protein in that cluster, 0 otherwise. This is the matrix used downstream for network inference.
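
The relationship between the two matrices can be illustrated from a two-column cluster table (representative, member, tab-separated, as written by MMseqs2 easy-cluster). This is a sketch under assumptions, not the tool's code: build_matrices and genome_of are hypothetical names, and the "genome|protein" ID layout in the demo is invented for illustration.

```python
import csv
import io
from collections import Counter, defaultdict


def build_matrices(cluster_tsv, genome_of):
    """Return (columns, pc, pb) built from representative/member pairs."""
    counts = defaultdict(Counter)  # genome -> cluster representative -> count
    reps = {}                      # insertion-ordered set of representatives
    for row in csv.reader(io.StringIO(cluster_tsv), delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        representative, member = row
        reps[representative] = None
        counts[genome_of(member)][representative] += 1
    columns = list(reps)
    genomes = sorted(counts)
    pc = {g: [counts[g][c] for c in columns] for g in genomes}
    # The presence/absence matrix is the count matrix thresholded at > 0.
    pb = {g: [int(value > 0) for value in values] for g, values in pc.items()}
    return columns, pc, pb


tsv = "G1|p1\tG1|p1\nG1|p1\tG1|p2\nG1|p1\tV1|p1\nG2|p1\tG2|p1\n"
genome_of = lambda protein_id: protein_id.split("|")[0]  # hypothetical ID layout

columns, pc, pb = build_matrices(tsv, genome_of)
print(columns)
print(pc)
print(pb)
```

Here genome G1 contributes two proteins to the first cluster, so its count is 2 in pc and 1 in pb.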
