Gene_fetch
This tool fetches gene sequences from NCBI databases based on taxonomy IDs (taxids) or taxonomic information. It can retrieve both protein and nucleotide sequences for various genes, including protein-coding genes (e.g., cox1, cytb, rbcl, matk) and rRNA genes (e.g., 16S, 18S).
Feature highlights
- Fetch protein and/or nucleotide sequences from the NCBI GenBank database.
- Handles both direct nucleotide sequences and protein-linked nucleotide searches (CDS extraction includes fallback mechanisms for atypical annotation formats).
- Support for both protein-coding and rDNA genes.
- Single-taxid mode (-s/--single) for retrieving a specified number of target sequences for a particular taxon. Default length thresholds are reduced in this mode (protein: 50 aa, nucleotide: 100 bp).
- Customisable length filtering thresholds for protein and nucleotide sequences.
- Automatic taxonomy traversal: uses the NCBI taxonomic lineage fetched for a given taxid when no sequences are found at the input taxonomic level. That is, the search starts at the rank of the given taxid (e.g., species) and, if no sequences are found, escalates rank by rank (species -> phylum) until a suitable sequence is found.
- Validates fetched sequence using higher taxonomy, avoiding potential taxonomic homonyms.
- Robust error handling, error and progress logging, and adherence to NCBI API rate limits (10 requests/second).
- Handles complex sequence features (e.g., complement strands, joined sequences, WGS entries) in addition to 'simple' CDS extraction (if --type nucleotide/both). The tool avoids "unverified" sequences and WGS entries not containing sequence data (i.e., master records).
- 'Checkpointing': if a run fails/crashes, the script can be rerun using the same arguments and it will resume from where it stopped.
- When more than 50 matching sequences are found for a sample, the tool fetches summary information for all matches (using NCBI esummary API), orders them by length, and processes the top 10 longest sequences.
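The automatic taxonomy traversal listed above can be pictured with a minimal sketch. This is not the code in gene_fetch.py: the `find_with_escalation` function, the `search` callback and the `EXAMPLE_ACC` accession are hypothetical stand-ins for illustration, and the toy "database" replaces any real NCBI query.

```python
from typing import Callable, Optional, Sequence, Tuple

# Ranks ordered from most to least specific, mirroring the
# species -> phylum escalation described in the feature list.
RANK_ORDER = ("species", "genus", "family", "order", "class", "phylum")


def find_with_escalation(
    lineage: dict,
    search: Callable[[str], Sequence[str]],
) -> Optional[Tuple[str, str, str]]:
    """Return (rank, taxon_name, accession) for the first rank with a hit.

    `lineage` maps rank -> taxon name. `search` is a hypothetical stand-in
    for a database query and returns the accessions found for a taxon name.
    """
    for rank in RANK_ORDER:
        taxon = lineage.get(rank)
        if not taxon:
            continue  # rank missing from this lineage; try the next one up
        hits = search(taxon)
        if hits:
            return rank, taxon, hits[0]
    return None  # nothing found at any rank


# Toy stand-in for the database: only the family has a record.
fake_db = {"Limnephilidae": ["EXAMPLE_ACC"]}
lineage = {
    "species": "Dicosmoecus palatus",
    "genus": "Dicosmoecus",
    "family": "Limnephilidae",
    "order": "Trichoptera",
}
result = find_with_escalation(lineage, lambda name: fake_db.get(name, []))
print(result)
```

In this toy run the species and genus searches return nothing, so the match is reported at family level, which is the kind of information the tool records as the matched rank.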
Contents
- Installation
- Usage
- Examples
- Input
- Output
- Cluster
- Supported targets
- Notes
- Benchmarking
- Future developments
- Contributions and citation
Installation
First, clone the Gene Fetch GitHub repository into your current directory, then enter the Gene Fetch installation directory:
git clone https://github.com/bge-barcoding/gene_fetch
cd gene_fetch
Run the commands below to install the necessary dependencies and activate the Conda environment. Conda must be installed.
conda env create -n fetch -f fetch.yaml
conda activate fetch
Alternatively, you can install the dependencies below directly or in your own Conda environment
conda install "python>=3.9" pip
pip install "ratelimit>=2.2.1"
pip install "biopython>=1.80"
Usage
python gene_fetch.py -g/--gene --type -i/--in -o/--out
-h/--help
: Show help and exit.
Required arguments
-g/--gene
: Name of gene to search for in NCBI GenBank database (e.g., cox1/16s/rbcl).
--type
: Sequence type to fetch; 'protein', 'nucleotide', or 'both' ('both' first searches for and fetches a protein sequence, then fetches the corresponding nucleotide CDS for that protein sequence).
-i/--in
: Path to input CSV file containing sample IDs and TaxIDs (see Input section below).
-i2/--in2
: Path to alternative input CSV file containing sample IDs and taxonomic information for each sample (see Input section below).
-o/--out
: Path to output directory. The directory will be created if it does not exist.
-e/--email and -k/--api-key
: Email address and associated API key for NCBI account. An NCBI account is required to run this tool (due to otherwise strict API limitations) - information on how to create an NCBI account and find your API key can be found here.
Optional arguments
--protein_size
: Minimum protein sequence length filter. Applies to both 'normal' and 'single-taxid' search modes (default: 500).
--nucleotide_size
: Minimum nucleotide sequence length filter. Applies to both 'normal' and 'single-taxid' search modes (default: 1500).
-s/--single
: Taxonomic ID for 'single-taxid' sequence search mode (-i and -i2 are ignored when run with -s). 'Single-taxid' mode will fetch all target gene or protein sequences on GenBank for a specific taxonomic ID.
--max-sequences
: Maximum number of sequences to fetch for a specific taxonomic ID (only applies when run in 'single-taxid' mode).
Examples
Fetch both protein and nucleotide sequences for COI with default sequence length thresholds.
python gene_fetch.py -e your.email@domain.com -k your_api_key \
-g cox1 -o ./output_dir -i ./samples.csv \
--type both
Fetch rbcL nucleotide sequences using sample taxonomic information, applying a minimum nucleotide sequence length of 1000 bp.
python gene_fetch.py -e your.email@domain.com -k your_api_key \
-g rbcl -o ./output_dir -i2 ./taxonomy.csv \
--type nucleotide --nucleotide_size 1000
Retrieve up to 1000 available matK protein sequences of at least 400 aa for Arabidopsis thaliana (taxid: 3702).
python gene_fetch.py -e your.email@domain.com -k your_api_key \
-g matk -o ./output_dir -s 3702 \
--type protein --protein_size 400 --max-sequences 1000
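When several runs need to be launched from a Python script (for example, one per gene), the command lines above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of Gene Fetch; it only uses flags documented in the Usage section, and its rule of accepting exactly one input source is a simplification chosen for this sketch.

```python
import sys
from typing import List, Optional


def build_gene_fetch_command(
    gene: str,
    seq_type: str,
    out_dir: str,
    email: str,
    api_key: str,
    samples_csv: Optional[str] = None,    # passed as -i
    taxonomy_csv: Optional[str] = None,   # passed as -i2
    single_taxid: Optional[int] = None,   # passed as -s
    extra_args: Optional[List[str]] = None,
) -> List[str]:
    """Assemble an argument list using only the documented flags."""
    if seq_type not in {"protein", "nucleotide", "both"}:
        raise ValueError("seq_type must be 'protein', 'nucleotide' or 'both'")
    sources = [s for s in (samples_csv, taxonomy_csv, single_taxid) if s is not None]
    if len(sources) != 1:
        # Simplification for this sketch: require exactly one input source.
        raise ValueError("give exactly one of samples_csv, taxonomy_csv, single_taxid")

    cmd = [sys.executable, "gene_fetch.py", "-e", email, "-k", api_key,
           "-g", gene, "-o", out_dir, "--type", seq_type]
    if samples_csv is not None:
        cmd += ["-i", samples_csv]
    elif taxonomy_csv is not None:
        cmd += ["-i2", taxonomy_csv]
    else:
        cmd += ["-s", str(single_taxid)]
    return cmd + list(extra_args or [])


cmd = build_gene_fetch_command(
    "matk", "protein", "./output_dir", "your.email@domain.com", "your_api_key",
    single_taxid=3702,
    extra_args=["--protein_size", "400", "--max-sequences", "1000"],
)
print(" ".join(cmd[1:]))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems with paths or email addresses.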
Input
Example 'samples.csv' input file (-i/--in)
ID | taxid |
---|---|
sample-1 | 177658 |
sample-2 | 177627 |
sample-3 | 3084599 |
Example 'samples_taxonomy.csv' input file (-i2/--in2)
ID | phylum | class | order | family | genus | species |
---|---|---|---|---|---|---|
sample-1 | Arthropoda | Insecta | Diptera | Acroceridae | Astomella | Astomella hispaniae |
sample-2 | Arthropoda | Insecta | Hemiptera | Cicadellidae | Psammotettix | Psammotettix sabulicola |
sample-3 | Arthropoda | Insecta | Trichoptera | Limnephilidae | Dicosmoecus | Dicosmoecus palatus |
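Before a long run it can be useful to check that an input file has the expected shape. The sketch below is not taken from gene_fetch.py: it assumes comma-separated files whose header row matches the column names in the example tables exactly, and the `detect_input_format` function is hypothetical. The script itself may be more lenient about headers.

```python
import csv
import io

# Column names copied from the two example tables above.
SAMPLES_COLUMNS = ["ID", "taxid"]
TAXONOMY_COLUMNS = ["ID", "phylum", "class", "order", "family", "genus", "species"]


def detect_input_format(handle) -> str:
    """Return 'samples' or 'taxonomy' based on the CSV header row."""
    reader = csv.reader(handle)
    header = [col.strip() for col in next(reader, [])]
    if header == SAMPLES_COLUMNS:
        fmt = "samples"
    elif header == TAXONOMY_COLUMNS:
        fmt = "taxonomy"
    else:
        raise ValueError(f"unexpected header: {header}")
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # tolerate blank lines
        if len(row) != len(header):
            raise ValueError(f"line {line_no}: expected {len(header)} fields")
        if fmt == "samples" and not row[1].strip().isdigit():
            raise ValueError(f"line {line_no}: taxid is not an integer")
    return fmt


example = io.StringIO("ID,taxid\nsample-1,177658\nsample-2,177627\n")
print(detect_input_format(example))
```

For a file on disk, pass an open handle, e.g. `with open("samples.csv", newline="") as fh: detect_input_format(fh)`.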
Output
'Normal' mode
output_dir/
├── nucleotide/ # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.
│ ├── sample-1_dna.fasta
│ ├── sample-2_dna.fasta
│ └── ...
├── sample-1.fasta # Protein sequences.
├── sample-2.fasta
├── sequence_references.csv # Sequence metadata.
├── failed_searches.csv # Failed search attempts (if any).
└── gene_fetch.log # Log.
sequence_references.csv output example
ID | taxid | protein_accession | protein_length | nucleotide_accession | nucleotide_length | matched_rank | ncbi_taxonomy | reference_name | protein_reference_path | nucleotide_reference_path |
---|---|---|---|---|---|---|---|---|---|---|
sample-1 | 177658 | AHF21732.1 | 510 | KF756944.1 | 1530 | genus:Apatania | Eukaryota; ...; Apataniinae; Apatania | sample-1 | abs/path/to/protein_references/sample-1.fasta | abs/path/to/protein_references/sample-1_dna.fasta |
sample-2 | 2719103 | QNE85983.1 | 518 | MT410852.1 | 1557 | species:Isoptena serricornis | Eukaryota; ...; Chloroperlinae; Isoptena | sample-2 | abs/path/to/protein_references/sample-2.fasta | abs/path/to/protein_references/sample-2_dna.fasta |
sample-3 | 1876143 | YP_009526503.1 | 512 | NC_039659.1 | 1539 | genus:Triaenodes | Eukaryota; ...; Triaenodini; Triaenodes | sample-3 | abs/path/to/protein_references/sample-3.fasta | abs/path/to/protein_references/sample-3_dna.fasta |
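The `matched_rank` column shows how far up the lineage each search had to go. A small sketch for tallying it is given below; it assumes `sequence_references.csv` is comma-separated with the column names shown in the example table, and `count_matched_ranks` is a hypothetical helper, not part of the tool.

```python
import csv
import io
from collections import Counter


def count_matched_ranks(handle) -> Counter:
    """Count rows per taxonomic rank in the 'matched_rank' column."""
    counts = Counter()
    for row in csv.DictReader(handle):
        # Values look like 'genus:Apatania'; keep the part before ':'.
        rank = row["matched_rank"].split(":", 1)[0].strip()
        counts[rank] += 1
    return counts


# Trimmed-down example using three columns from the table above.
example = io.StringIO(
    "ID,taxid,matched_rank\n"
    "sample-1,177658,genus:Apatania\n"
    "sample-2,2719103,species:Isoptena serricornis\n"
    "sample-3,1876143,genus:Triaenodes\n"
)
counts = count_matched_ranks(example)
print(counts)
```

A high share of matches above species level suggests that many input taxa lack sequences for the chosen target at their own rank.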
'Single-taxid' mode
output_dir/
├── nucleotide/ # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.
│ ├── ACCESSION1_dna.fasta
│ ├── ACCESSION2_dna.fasta
│ └── ...
├── ACCESSION1.fasta # Protein sequences.
├── ACCESSION2.fasta
├── fetched_nucleotide_sequences.csv # Only populated if '--type nucleotide/both' utilised. Sequence metadata.
├── fetched_protein_sequences.csv # Only populated if '--type protein/both' utilised. Sequence metadata.
├── failed_searches.csv # Failed search attempts (if any).
└── gene_fetch.log # Log.
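For downstream scripts, the FASTA files in either output layout can be gathered with the standard library alone. The sketch below assumes only what the directory trees show (protein FASTA files at the top level, nucleotide files ending in `_dna.fasta` under `nucleotide/`); `collect_outputs` is a hypothetical helper, and the temporary directory merely imitates a finished run.

```python
import tempfile
from pathlib import Path


def collect_outputs(output_dir):
    """Group FASTA file names by sequence type, following the documented layout."""
    out = Path(output_dir)
    nucleotide_dir = out / "nucleotide"
    return {
        # Protein FASTA files sit directly in the output directory.
        "protein": sorted(p.name for p in out.glob("*.fasta")),
        # Nucleotide FASTA files carry a '_dna' suffix inside 'nucleotide/'.
        "nucleotide": sorted(p.name for p in nucleotide_dir.glob("*_dna.fasta"))
        if nucleotide_dir.is_dir() else [],
    }


# Build a throwaway directory that mimics the layout, then scan it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nucleotide").mkdir()
    (root / "ACCESSION1.fasta").write_text(">ACCESSION1\nMK\n")
    (root / "nucleotide" / "ACCESSION1_dna.fasta").write_text(">ACCESSION1\nATGAAA\n")
    (root / "gene_fetch.log").write_text("")
    found = collect_outputs(tmp)

print(found)
```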
fetched_protein|nucleotide_sequences.csv output example
ID | taxid | Description |
---|---|---|
PQ645072.1 | 1501 | Ochlerotatus nigripes isolate Pool11 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
PQ645071.1 | 1537 | Ochlerotatus nigripes isolate Pool10 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
PQ645070.1 | 1501 | Ochlerotatus impiger isolate Pool2 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
PQ645069.1 | 1518 | Ochlerotatus impiger isolate Pool1 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
PP355486.1 | 581 | Aedes scutellaris isolate NC.033 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial |
Running gene_fetch on a cluster
- See '1_gene_fetch.sh' for running gene_fetch.py on an HPC cluster (SLURM job scheduler).
- Edit 'mem' and/or 'cpus-per-task' to set memory and CPU/threads allocation.
- Change paths and variables as needed.
- Run '1_gene_fetch.sh' with:
sbatch 1_gene_fetch.sh
Supported targets
Gene Fetch works with targets other than those listed below, but it has hard-coded name variations and 'smarter' searching for the listed targets. More targets can be added to the script (see 'class config').
- cox1/COI/cytochrome c oxidase subunit I
- cox2/COII/cytochrome c oxidase subunit II
- cox3/COIII/cytochrome c oxidase subunit III
- cytb/cob/cytochrome b
- nd1/NAD1/NADH dehydrogenase subunit 1
- nd2/NAD2/NADH dehydrogenase subunit 2
- rbcL/RuBisCO/ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit
- matK/maturase K/maturase type II intron splicing factor
- 16S ribosomal RNA/16s
- SSU/18s
- LSU/28s
- 12S ribosomal RNA/12s
- ITS (ITS1-5.8S-ITS2)
- ITS1/internal transcribed spacer 1
- ITS2/internal transcribed spacer 2
- tRNA-Leucine/trnL
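One way to picture "hard-coded name variations" is a lookup from any alias to a single canonical target name. The sketch below is an illustration only: `TARGET_ALIASES` and `canonical_target` are hypothetical and cover a few entries from the list above, and the real structure inside 'class config' may differ.

```python
# A few aliases copied from the list above; keys are illustrative canonical names.
TARGET_ALIASES = {
    "cox1": ("cox1", "COI", "cytochrome c oxidase subunit I"),
    "cytb": ("cytb", "cob", "cytochrome b"),
    "rbcl": ("rbcL", "RuBisCO"),
    "matk": ("matK", "maturase K"),
    "16s": ("16S ribosomal RNA", "16s"),
}

# Case-insensitive reverse lookup: alias -> canonical name.
_LOOKUP = {
    alias.lower(): key
    for key, aliases in TARGET_ALIASES.items()
    for alias in aliases
}


def canonical_target(name: str) -> str:
    """Map a user-supplied gene name to its canonical target.

    Unknown names are returned lower-cased and otherwise unchanged, since
    the tool also accepts targets that are not in the list.
    """
    cleaned = " ".join(name.split()).lower()  # trim and collapse whitespace
    return _LOOKUP.get(cleaned, cleaned)


for query in ("COI", " Cob ", "Maturase  K", "atp6"):
    print(repr(query), "->", canonical_target(query))
```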
Benchmarking
Sample Description | Run Mode | Target | Input File | Data Type | Memory | CPUs | Run Time |
---|---|---|---|---|---|---|---|
570 Arthropod samples | Normal | COX1 | taxonomy.csv | Both | 10G | 18 | 02:51:06 |
570 Arthropod samples | Normal | COX1 | samples.csv | Nucleotide | 5G | 4 | 02:04:01 |
570 Arthropod samples | Normal | COX1 | samples.csv | Protein | 5G | 4 | 01:50:31 |
570 Arthropod samples | Normal | 18S | samples.csv | Nucleotide | 10G | 8 | 01:38:16 |
570 Arthropod samples | Normal | ND1 | samples.csv | Nucleotide | 10G | 4 | 01:58:35 |
All (159) A. thaliana sequences >300aa | Single-taxid | rbcL | N/A | Protein | 5G | 1 | 00:02:39 |
1000 Culicidae sequences >500bp | Single-taxid | COX1 | N/A | Nucleotide | 20G | 16 | 00:30:36 |
1000 M. tuberculosis sequences | Single-taxid | 16S | N/A | Nucleotide | 20G | 16 | 00:10:33 |
Future developments
- Add optional alignment of retrieved sequences
- Add support for direct GenBank submission format output
- Enhance LRU caching for taxonomy lookups to reduce API calls
- Further improve efficiency of record searching and selecting the longest sequence
- Add support for additional genetic markers beyond the currently supported set
Contributions and citations
GeneFetch was written by Dan Parsons & Ben Price @ NHMUK (2024).
If you use GeneFetch, please cite our publication: XYZ()
If you have any questions or suggested improvements, please do get in touch in the issues!
Version History
main @ 976e79d (latest) Created 17th Apr 2025 at 14:29 by Dan Parsons
Delete snakemakeSFU/workflow directory
main @ 9adc5e6 (earliest) Created 17th Apr 2025 at 13:57 by Dan Parsons
Update gene_fetch.py

Created: 17th Apr 2025 at 13:57
Last updated: 28th May 2025 at 14:49

