Library curation BOLD
This repository contains scripts and synonymy data for pipelining the automated curation of BOLD data dumps in BCDM TSV format. The goal is to implement the classification of barcode reference sequences as is being developed by the BGE consortium. A living document in which these criteria are being developed is located here.
A further goal of this project is to develop the code in this repository according to community standards for automation, reproducibility, and provenance. In practice, this means including the scripts in a pipeline system such as snakemake, adopting an environment configuration system such as conda, and organizing the folder structure in compliance with the requirements of WorkflowHub. The latter will provide the project with a DOI and will help generate RO-Crate documents, so that the entire tool chain is FAIR compliant according to the current state of the art.
Clone the repo:
git clone
Change directory:
cd Library_curation_BOLD
The code in this repo depends on various tools. These are managed using the mamba program (a drop-in replacement for conda). The following sets up an environment in which all needed tools are installed:
mamba env create -f environment.yml
Once set up, this is activated like so:
mamba activate bold-curation
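Once the environment is active, it can be worth confirming that the tools used in the rest of this README are actually on the PATH. The following is a minimal sketch, not part of the repository: the function name check_tools is hypothetical, and the tool list in the example (perl, snakemake) is an assumption based on the commands shown in this README rather than on the contents of environment.yml.

```shell
# Hypothetical helper: report which of the named programs are missing from PATH.
# Returns 0 if every program is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example (assumed tool list; adjust to match environment.yml):
#   check_tools perl snakemake && echo "environment looks usable"
```

If anything is reported as missing, the environment is probably not activated or was not created completely.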
How to run
Although the aim of this project is to integrate all steps of the process in a simple snakemake pipeline, at present this is not implemented. Instead, the steps are executed individually on the command line as Perl scripts within the conda/mamba environment. Because the current project has its own Perl modules in the lib folder, every script needs to be run with the additional include flag (-I) to add the module folder to the search path. Hence, the invocation looks like the following inside the scripts folder:
perl -I../../lib <script>.pl -arg1 val1 -arg2 val2
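As an alternative to passing -I on every invocation, Perl also reads extra module directories from the PERL5LIB environment variable. The sketch below assumes the modules live in a lib folder at the repository root, as the ../../lib path above suggests but this README does not state outright. The function name add_perl_lib and the script name in the example are hypothetical.

```shell
# Hypothetical helper: prepend a module directory to PERL5LIB.
# The directory is resolved to an absolute path so it keeps working
# after changing directory. Prints the path that was added.
add_perl_lib() {
    lib_dir="$(cd "$1" && pwd)" || return 1
    export PERL5LIB="$lib_dir${PERL5LIB:+:$PERL5LIB}"
    echo "$lib_dir"
}

# Example, from the repository root (assumed layout, hypothetical script name):
#   add_perl_lib lib
#   perl some_script.pl -arg1 val1 -arg2 val2   # no -I needed in this shell
```

The setting only lasts for the current shell session, so the explicit -I flag remains the more self-documenting choice in saved commands.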
Follow the installation instructions above.
Update config/config.yml to define your input data.
Navigate to the directory "workflow" and type:
snakemake -p -c {number of cores} target
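The core count passed to -c can be detected rather than typed by hand, and snakemake's -n (dry-run) flag lists the jobs that would run without executing them. The following sketch assumes it is used from the workflow directory. The helper names are hypothetical, and because nproc and getconf are not both available on every system, the detection falls back to 1.

```shell
# Hypothetical helper: print the number of available cores, or 1 if unknown.
detect_cores() {
    n="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
    # Guard against empty or non-numeric output from either tool.
    case "$n" in
        ''|*[!0-9]*) n=1 ;;
    esac
    echo "$n"
}

# Hypothetical helper: do a dry run first and only start the real run
# if the dry run succeeds.
run_target() {
    cores="$(detect_cores)"
    snakemake -n -p -c "$cores" target && snakemake -p -c "$cores" target
}

# Example, from the workflow directory:
#   run_target
```

Doing the dry run first tends to surface problems in config/config.yml before any long-running step starts.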
If running on an HPC cluster with a SLURM scheduler you could use a bash script like this one:
#!/bin/bash
#SBATCH --partition=hour
#SBATCH --output=job_curate_bold_%j.out
#SBATCH --error=job_curate_bold_%j.err
#SBATCH --mem=24G
#SBATCH --cpus-per-task=2
source activate bold-curation
snakemake -p -c 2 target
echo Complete!
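In the script above the core count appears twice (--cpus-per-task=2 and -c 2), so the two values can drift apart when one is edited. SLURM sets the SLURM_CPUS_PER_TASK variable in the job environment when --cpus-per-task is given, so one option is to derive the snakemake core count from it. This is a sketch only: the function name slurm_cores is hypothetical, and the fallback of 2 is simply the value used in the script above.

```shell
# Hypothetical helper: print the core count SLURM allocated to this task,
# falling back to 2 when run outside a SLURM job.
slurm_cores() {
    echo "${SLURM_CPUS_PER_TASK:-2}"
}

# Inside the batch script, the snakemake line could then read:
#   snakemake -p -c "$(slurm_cores)" target
```

The batch script itself is submitted with sbatch, for example sbatch curate_bold.sh, where the file name is whatever the script was saved as (the name here is made up).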
Version History
main @ 4a78148 (earliest) Created 24th Apr 2024 at 09:51 by Rutger Vos
omg it works

Additional credit
Special thanks to Sujeevan Ratnasingham and the team at CBG for creating the BCDM data exchange format that this pipeline operates on.
Created: 24th Apr 2024 at 09:51
Last updated: 24th Apr 2024 at 10:09
