Gene_fetch

This tool fetches gene sequences from NCBI databases based on taxonomy IDs (taxids) or taxonomic information. It can retrieve both protein and nucleotide sequences for various genes, including protein-coding genes (e.g., cox1, cytb, rbcl, matk) and rRNA genes (e.g., 16S, 18S).

Feature highlight

Fetch protein and/or nucleotide sequences from NCBI GenBank database.
Handles both direct nucleotide sequences and protein-linked nucleotide searches (CDS extraction includes fallback mechanisms for atypical annotation formats). .
Support for both protein-coding and rDNA genes.
Single-taxid mode (-s/--single) for retrieving a specified number of target sequences for a particular taxon (default length thresholds are reduced (protein: 50aa, nucleotide: 100bp)).
Customisable length filtering thresholds for protein and nucleotide sequences.
Automatic taxonomy traversal: Uses fetched NCBI taxonomic lineage for a given taxid when sequences are not found at the input taxonomic level. I.e., Search at given taxid level (e.g., species), if no sequences are found, escalate species->phylum until a suitable sequence is found.
Validates fetched sequence using higher taxonomy, avoiding potential taxonomic homonyms.
Robust error handling, error and progress logging, and NCBI API rate limits (10 requests/second).
Handles complex sequence features (e.g., complement strands, joined sequences, WGS entries) in addition to 'simple' cds extaction (if --type nucleotide/both). The tool avoids "unverified" sequences and WGS entries not containing sequence data (i.e. master records).
'Checkpointing': if a run fails/crashes, the script can be rerun using the same arguments and it will resume from where it stopped.
When more than 50 matching sequences are found for a sample, the tool fetches summary information for all matches (using NCBI esummary API), orders them by length, and processes the top 10 longest sequences.

Installation
Usage
Examples
Input
Output
Cluster
Supported targets
Notes
Benchmarking
Future developments
Contributions and citation

Installation

First, clone the Gene Fetch GitHub repository to your current path, and enter the Gene Fetch installation directory

git clone https://github.com/bge-barcoding/gene_fetch

cd gene_fetch

Run the commands below to install the necessary dependencies and activate the Conda environment. Conda must be installed.

conda env create -n fetch -f fetch.yaml

conda activate fetch

Alternatively, you can install the dependencies below directly or in your own Conda environment

conda install python>=3.9 pip
pip install ratelimit>=2.2.1
pip install biopython>=1.80

Usage

python gene_fetch.py -g/--gene  --type  -i/--in  -o/--out

--h/--help: Show help and exit.

Required arguments

-g/--gene: Name of gene to search for in NCBI GenBank database (e.g., cox1/16s/rbcl).
--type: Sequence type to fetch; 'protein', 'nucleotide', or 'both' ('both' will initially search and fetch a protein sequence, and then fetches the corresponding nucleotide CDS for that protein sequence).
-i/--in: Path to input CSV file containing sample IDs and TaxIDs (see Input section below).
i2/--in2: Path to alternative input CSV file containing sample IDs and taxonomic information for each sample (see Input section below).
o/--out: Path to output directory. The directory will be created if it does not exist.
e/--email and -k/--api-key: Email address and associated API key for NCBI account. An NCBI account is required to run this tool (due to otherwise strict API limitations) - information on how to create an NCBI account and find your API key can be found here. ####= Optional arguments
--protein_size: Minimum protein sequence length filter. Applicable to mode 'normal' and 'single-taxid' search modes (default: 500).
--nucleotide_size: Minimum nucleotide sequence length filter. Applicable to mode 'normal' and 'single-taxid' search modes (default: 1500).
s/--single: Taxonomic ID for 'single-taxid' sequence search mode (-i and -i2 ignored when run with -s mode). 'Single-taxid' mode will fetch all target gene or protein sequences on GenBank for a specific taxonomic ID.
--max-sequences: Maximum number of sequences to fetch for a specific taxonomic ID (only applies when run in 'single-taxid' mode).

Examples

Fetch both protein and nucleotide sequences for COI with default sequence length thresholds.

python gene_fetch.py -e your.email@domain.com -k your_api_key \
                    -g cox1 -o ./output_dir -i ./samples.csv \
                    --type both

Fetch rbcL nucleotide sequences using sample taxonomic information, applying a minimum nucleotide sequence length of 1000bp

python gene_fetch.py -e your.email@domain.com -k your_api_key \
                    -g rbcl -o ./output_dir -i2 ./taxonomy.csv \
                    --type nucleotide --nucleotide_size 1000

Retrieve 1000 available matK protein sequences >400aa for Arabidopsis thaliana (taxid: 3702).

python gene_fetch.py -e your.email@domain.com -k your_api_key \
                    -g matk -o ./output_dir -s 3702 \
                    --type protein --protein_size 400 --max-sequences 1000

Input

Example 'samples.csv' input file (-i/--in)

ID	taxid
sample-1	177658
sample-2	177627
sample-3	3084599

Example 'samples_taxonomy.csv' input file (-i2/--in2)

ID	phylum	class	order	family	genus	species
sample-1	Arthropoda	Insecta	Diptera	Acroceridae	Astomella	Astomella hispaniae
sample-2	Arthropoda	Insecta	Hemiptera	Cicadellidae	Psammotettix	Psammotettix sabulicola
sample-3	Arthropoda	Insecta	Trichoptera	Limnephilidae	Dicosmoecus	Dicosmoecus palatus

Output

'Normal' mode

output_dir/
├── nucleotide/                 # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.
│   ├── sample-1_dna.fasta   
│   ├── sample-2_dna.fasta
│   └── ...
├── sample-1.fasta              # Protein sequences.
├── sample-2.fasta
├── sequence_references.csv     # Sequence metadata.
├── failed_searches.csv         # Failed search attempts (if any).
└── gene_fetch.log              # Log.

sequence_references.csv output example

ID	taxid	protein_accession	protein_length	nucleotide_accession	nucleotide_length	matched_rank	ncbi_taxonomy	reference_name	protein_reference_path	nucleotide_reference_path
sample-1	177658	AHF21732.1	510	KF756944.1	1530	genus:Apatania	Eukaryota; ...; Apataniinae; Apatania	sample-1	abs/path/to/protein_references/sample-1.fasta	abs/path/to/protein_references/sample-1_dna.fasta
sample-2	2719103	QNE85983.1	518	MT410852.1	1557	species:Isoptena serricornis	Eukaryota; ...; Chloroperlinae; Isoptena	sample-2	abs/path/to/protein_references/sample-2.fasta	abs/path/to/protein_references/sample-2_dna.fasta
sample-3	1876143	YP_009526503.1	512	NC_039659.1	1539	genus:Triaenodes	Eukaryota; ...; Triaenodini; Triaenodes	sample-3	abs/path/to/protein_references/sample-3.fasta	abs/path/to/protein_references/sample-3_dna.fasta

'Single-taxid' mode

output_dir/
├── nucleotide/                      # Nucleotide sequences. Only populated if '--type nucleotide/both' utilised.
│   ├── ACCESSION1_dna.fasta   
│   ├── ACCESSION2_dna.fasta
│   └── ...
├── ACCESSION1.fasta                 # Protein sequences.
├── ACCESSION2.fasta
├── fetched_nucleotide_sequences.csv # Only populated if '--type nucleotide/both' utilised. Sequence metadata.
├── fetched_protein_sequences.csv    # Only populated if '--type protein/both' utilised. Sequence metadata.
├── failed_searches.csv              # Failed search attempts (if any).
└── gene_fetch.log                   # Log.

fetched_protein|nucleotide_sequences.csv output example

ID	taxid	Description
PQ645072.1	1501	Ochlerotatus nigripes isolate Pool11 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial
PQ645071.1	1537	Ochlerotatus nigripes isolate Pool10 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial
PQ645070.1	1501	Ochlerotatus impiger isolate Pool2 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial
PQ645069.1	1518	Ochlerotatus impiger isolate Pool1 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial
PP355486.1	581	Aedes scutellaris isolate NC.033 cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial

Running gene_fetch on a cluster

See '1_gene_fetch.sh' for running gene_fetch.py on a HPC cluster (SLURM job schedular).
Edit 'mem' and/or 'cpus-per-task' to set memory and CPU/threads allocation.
Change paths and variables as needed.
Run '1_gene_fetch.sh' with:

sbatch 1_gene_fetch.sh

Supported targets

Gene Fetch does function with other targets than those listed below, but it has hard-coded name variations and 'smarter' searching for the below targets. More targets can be added into script (see 'class config').

cox1/COI/cytochrome c oxidase subunit I
cox2/COII/cytochrome c oxidase subunit II
cox3/COIIIcytochrome c oxidase subunit III
cytb/cob/cytochrome b
nd1/NAD1/NADH dehydrogenase subunit 1
nd2/NAD2/NADH dehydrogenase subunit 2
rbcL/RuBisCO/ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit
matK/maturase K/maturase type II intron splicing factor
16S ribosomal RNA/16s
SSU/18s
LSU/28s
12S ribosomal RNA/12s
ITS (ITS1-5.8S-ITS2)
ITS1/internal transcribed spacer 1
ITS2/internal transcribed spacer 2
tRNA-Leucine/trnL

Benchmarking

Sample Description	Run Mode	Target	Input File	Data Type	Memory	CPUs	Run Time
570 Arthropod samples	Normal	COX1	taxonomy.csv	Both	10G	18	02:51:06
570 Arthropod samples	Normal	COX1	samples.csv	Nucleotide	5G	4	02:04:01
570 Arthropod samples	Normal	COX1	samples.csv	Protein	5G	4	01:50:31
570 Arthropod samples	Normal	18S	samples.csv	Nucleotide	10G	8	01:38:16
570 Arthropod samples	Normal	ND1	samples.csv	Nucleotide	10G	4	01:58:35
All (159) A. thaliana sequences >300aa	Single-taxid	rbcL	N/A	Protein	5G	1	00:02:39
1000 Culicidae sequences >500bp	Single-taxid	COX1	N/A	nucleotide	20G	16	00:30:36
1000 M. tubercolisis sequences	Single-taxid	16S	N/A	nucleotide	20G	16	00:10:33

Future Development

Add optional alignment of retrieved sequences
Add support for direct GenBank submission format output
Enhance LRU caching for taxonomy lookups to reduce API calls
Further improve efficiency of record searching and selecting the longest sequence
Add support for additional genetic markers beyond the currently supported set

Contributions and citations

GeneFetch was written by Dan Parsons & Ben Price @ NHMUK (2024).

If you use GeneFetch, please cite our publication: XYZ()

If you have any questions or suggested improvements, please do get in touch in the issues!

Gene Fetch
main @ 976e79d (latest)

main @ 976e79d (latest)

main @ 9adc5e6 (earliest)

Gene_fetch

Feature highlight

Contents

Installation

Usage

Required arguments

Examples

Input

Output

'Normal' mode

'Single-taxid' mode

Running gene_fetch on a cluster

Supported targets

Benchmarking

Future Development

Contributions and citations

Version History

main @ 976e79d (latest) Created 17th Apr 2025 at 14:29 by Dan Parsons

main @ 9adc5e6 (earliest) Created 17th Apr 2025 at 13:57 by Dan Parsons

Creators

Submitter

Gene Fetch main @ 976e79d (latest) main @ 976e79d (latest) main @ 9adc5e6 (earliest)

Gene_fetch

Feature highlight

Contents

Installation

Usage

Required arguments

Examples

Input

Output

'Normal' mode

'Single-taxid' mode

Running gene_fetch on a cluster

Supported targets

Benchmarking

Future Development

Contributions and citations

Version History

main @ 976e79d (latest) Created 17th Apr 2025 at 14:29 by Dan Parsons

main @ 9adc5e6 (earliest) Created 17th Apr 2025 at 13:57 by Dan Parsons

Creators

Submitter

Related items

Gene Fetch
main @ 976e79d (latest)

main @ 976e79d (latest)

main @ 9adc5e6 (earliest)