How to create a workflow

CulebrONT allows you to build a workflow using a simple config.yaml configuration file. In this file:

First, provide data paths.
Second, activate tools from assembly to correction.
Third, activate tools for quality checking of assemblies.
And last, manage tool parameters.

To create this file, just run:

culebrONT create_config

Create config.yaml for run

culebrONT create_config [OPTIONS]

Options

-c, --configyaml <configyaml>: Required. Path to create config.yaml

Then edit the relevant sections of the file to customize your flavor of a workflow.

1. Providing data

For FASTQ files, the naming conventions accepted by CulebrONT are NAME.fastq.gz, NAME.fq.gz, NAME.fastq or NAME.fq. Prefer short names and avoid special characters, because they can cause the report to fail. Avoid using the long names given directly by the sequencer.

All FASTQ files must share the same extension, and can be compressed or not.

The reference FASTA file must be uncompressed and have a .fasta or .fa extension.
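
The naming rules above can be checked before launching a run. The following is a minimal sketch and is not part of CulebrONT: the helper names `split_fastq_name` and `check_fastq_names` are hypothetical, and the definition of "special characters" (anything other than letters, digits, `_` and `-`) is an assumption made for illustration.

```python
import re
from pathlib import Path

# Extensions accepted by CulebrONT according to the naming convention above.
ACCEPTED_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def split_fastq_name(filename):
    """Return (sample_name, extension), or None if the extension is not accepted."""
    for ext in ACCEPTED_EXTENSIONS:
        if filename.endswith(ext):
            return filename[: -len(ext)], ext
    return None


def check_fastq_names(filenames):
    """Return a list of human-readable problems found in FASTQ file names."""
    problems = []
    extensions = set()
    for name in filenames:
        parsed = split_fastq_name(Path(name).name)
        if parsed is None:
            problems.append(f"{name}: extension not accepted")
            continue
        sample, ext = parsed
        extensions.add(ext)
        # Assumption: "special characters" means anything outside [A-Za-z0-9_-].
        if not re.fullmatch(r"[A-Za-z0-9_-]+", sample):
            problems.append(f"{name}: sample name contains special characters")
    # All FASTQ files must share the same extension.
    if len(extensions) > 1:
        problems.append(f"mixed extensions: {sorted(extensions)}")
    return problems
```

An empty list means the names follow the convention; for instance, mixing `run1.fastq.gz` and `run2.fq` in the same directory would be reported as mixed extensions.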

2. Choose assemblers, polishers and correctors

Activate/deactivate assemblers, polishers and correctors as you wish. Feel free to activate only assembly, assembly+polishing or assembly+polishing+correction.

Note

If you expect your genome to include a circular replicon (e.g. in prokaryotes), it is recommended to activate the CIRCULAR step.

Example:

ASSEMBLY:
    CANU: true
    FLYE: true
    MINIASM: false
    RAVEN: false
    SMARTDENOVO:  false
    SHASTA: false

POLISHING:
    RACON: true

CIRCULAR: false

CORRECTION:
    NANOPOLISH: false
    MEDAKA: false
    PILON: true

3. Choose quality tools

With CulebrONT you can use several quality tools to check assemblies.

If BUSCO or QUAST are used, they will run on every FASTA assembly generated along the various steps of the pipeline.
If BLOBTOOLS, ASSEMBLYTICS, FLAGSTATS and KAT are activated, only the FASTA assembly generated after the last sequence processing step of the pipeline will be used.
The KAT quality tool can be activated, but Illumina reads are mandatory in this case. These reads can be compressed or not.

# BUSCO and QUAST will be launched on all activated steps (ASSEMBLY, POLISHING, CORRECTION)
QUALITY:
    BUSCO: true
    QUAST: true
#### Other quality tools are launched only on the final assemblies
    BLOBTOOLS: true
    ASSEMBLYTICS: true
#### Other quality tools, for which Illumina reads are required
    FLAGSTATS: true

If several assemblers are activated, a multiple alignment of the various assemblies for small genomes (<10-20Mbp) can be computed with Mauve.

If you want to improve alignment with MAUVE on circular molecules, it is recommended to activate the Fixstart step.
Only activate MAUVE if you have more than one assembler per sample and more than one quality step.

#### Alignment of the various assemblies derived from a fastq file for small genomes (<10-20Mbp);
MSA:
    MAUVE: false

4. Parameters for some specific tools

You can manage tool parameters in the params section of the config.yaml file.

Specifically for Racon:

Racon can be run iteratively, from 1 to 9 rounds.

Specifically for Medaka:

If ‘MEDAKA_TRAIN_WITH_REF’ is activated, Medaka launches training using the reference found in the ‘DATA/REF’ param. In this case Medaka ignores the other Medaka model parameters and uses the trained model instead.
If ‘MEDAKA_TRAIN_WITH_REF’ is deactivated, Medaka does not launch training but instead uses the model provided in ‘MEDAKA_MODEL_PATH’. Give CulebrONT either the path of a Medaka model OR just the model name in order to correct assemblies. This parameter cannot be empty.
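
The two cases above can be summarised as a small decision function. This is only an illustrative sketch, not CulebrONT's actual code: the function name `resolve_medaka_source` and its return values are hypothetical, and requiring a non-empty reference when training is requested is an assumption.

```python
def resolve_medaka_source(medaka_params, ref_path):
    """Decide whether Medaka trains on the reference or uses a provided model.

    medaka_params: dict mirroring the MEDAKA section of params in config.yaml.
    ref_path: value of the DATA/REF parameter.
    Returns ("train", ref_path) or ("model", model_name_or_path).
    """
    if medaka_params.get("MEDAKA_TRAIN_WITH_REF", False):
        # Assumption: training cannot start without a reference.
        if not ref_path:
            raise ValueError("MEDAKA_TRAIN_WITH_REF is true but DATA/REF is empty")
        # MEDAKA_MODEL_PATH is ignored here: the trained model is used instead.
        return ("train", ref_path)

    model = (medaka_params.get("MEDAKA_MODEL_PATH") or "").strip()
    if not model:
        raise ValueError("MEDAKA_MODEL_PATH cannot be empty when training is deactivated")
    return ("model", model)
```

With the default parameters (`MEDAKA_TRAIN_WITH_REF: false` and `MEDAKA_MODEL_PATH: 'r941_min_high_g303'`), this sketch selects the named model.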

Important

Medaka models can be downloaded from the medaka repository. You need to install git lfs (see the documentation here: https://git-lfs.github.com/) to download the largest files before running git clone https://github.com/nanoporetech/medaka.git

Specifically for Pilon:

We set the Java memory in Singularity.culebront_tools to 8G. If you need to allocate more memory, you can change the line sed -i "s/-Xmx1g/-Xmx8g/g" /usr/local/miniconda/miniconda3/envs/pilon/bin/pilon in the Containers/Singularity.culebront_tools.def recipe file before building the Singularity image.

Specifically for Busco:

If BUSCO is activated, you must provide CulebrONT with the path of a BUSCO database OR only the database name (see the BUSCO documentation). This parameter cannot be empty.

Specifically for Blobtools:

The nodes and names files from the NCBI taxdump database can be downloaded from here: https://github.com/DRL/blobtools#download-ncbi-taxdump-and-create-nodesdb

Here are the standard parameters used by CulebrONT. Feel free to adapt them to your own requirements.

params:
    #### ASSEMBLY
    MINIMAP2:
        PRESET_OPTION: 'map-ont' # -x minimap2 preset option is map-pb by default (map-pb, map-ont etc)
    FLYE:
        MODE : '--nano-raw'
        OPTIONS: '' ## use --scaffold if flye>=2.9 # you can also use --resume option
    CANU:
        MODE : '-nanopore'
        OPTIONS: 'useGrid=false'
    SMARTDENOVO:
        KMER_SIZE: 16
        OPTIONS: '-J 5000'
    SHASTA:
        MEM_MODE: 'filesystem'
        MEM_BACKING: 'disk'
        OPTIONS: '--Reads.minReadLength 0'


    #### CIRCULAR
    CIRCLATOR:
        OPTIONS: ''


    #### POLISHING
    RACON:
        RACON_ROUNDS: 2                 #1 to 9


    #### CORRECTION
    CORRECTION_MAKERANGE:
        SEGMENT_LEN: 50000              # segment length to split assembly and correct it  default=50000
        OVERLAP_LEN: 200                # overlap length between segments  default=200

    NANOPOLISH:
        OPTIONS: ''

    MEDAKA:
        MEDAKA_TRAIN_WITH_REF: false    # if 'MEDAKA_TRAIN_WITH_REF' is true, training uses the reference found in the DATA/REF param.

        # Medaka does not take into account the other parameters below if MEDAKA_TRAIN_WITH_REF is true.
        MEDAKA_MODEL_PATH: 'r941_min_high_g303' # use a path if you have downloaded a model (or you want to use your own trained model) OR a simple string like 'r941_min_high_g303'
        MEDAKA_FEATURES_OPTIONS: '--batch_size 10 --chunk_len 100 --chunk_ovlp 10'
        MEDAKA_TRAIN_OPTIONS: '--batch_size 10 --epochs 500 '
        MEDAKA_CONSENSUS_OPTIONS: '--batch 200 '

    PILON:
        PILON_ROUNDS: 2                 #1 to 9
        OPTIONS: ''

    #### QUALITY
    BUSCO:
        #DATABASE: "DATA_DIR/Data-Xoo-sub/bacteria_odb10"
        DATABASE: 'bacteria_odb10 --update-data ' # use a path if you have downloaded a taxonomic database from busco OR a simple string like 'bacteria_odb10'
        MODEL: 'genome'
        SP: ''                         #--augustus-specie parameter on busco

    QUAST:
        GFF: ''
        OPTIONS: '--large'

    DIAMOND:
        DATABASE: 'DATA_DIR/Data-Xoo-sub/testBacteria.dmnd'

    MUMMER:
        MINMATCH: 100                  # is -l option with default 20 on MUMMER
        MINCLUSTER: 500                 # is -c option with default 65 on MUMMER

    ASSEMBLYTICS:
        UNIQUE_ANCHOR_LEN: 10000
        MIN_VARIANT_SIZE: 50
        MAX_VARIANT_SIZE: 10000

    BLOBTOOLS:
        NAMES: 'DATA_DIR/Data-Xoo-sub/blobtools/names.dmp'
        NODES: 'DATA_DIR/Data-Xoo-sub/blobtools/nodes.dmp'
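
Since the round counts for Racon and Pilon are documented as ranging from 1 to 9, a quick sanity check can be run on the loaded parameters before starting the pipeline. The sketch below is hypothetical and not part of CulebrONT; it assumes the params section has already been loaded into a Python dictionary, and the function name `check_rounds` is illustrative.

```python
def check_rounds(params):
    """Return error messages for round counts outside the documented 1 to 9 range."""
    errors = []
    # (tool section, key) pairs taken from the params example above.
    for tool, key in (("RACON", "RACON_ROUNDS"), ("PILON", "PILON_ROUNDS")):
        value = params.get(tool, {}).get(key)
        if value is None:
            continue  # tool not configured, nothing to check
        # bool is a subclass of int in Python, so reject it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 1 <= value <= 9:
            errors.append(f"{tool}.{key} must be an integer from 1 to 9, got {value!r}")
    return errors
```

An empty list means both values are within range.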

Warning

Please check documentation of each tool and make sure that the settings are correct!

How to run the workflow

Before attempting to run CulebrONT, make sure you have already modified the config.yaml file as explained in 1. Providing data.

If you installed CulebrONT on a HPC cluster with a job scheduler, you can run:

culebrONT run_cluster

Run snakemake command line with mandatory parameters.
SNAKEMAKE_OTHER: You can also pass additional Snakemake parameters
through this argument, using the same syntax.
These take precedence over Snakemake parameters defined in the profile.
See: https://snakemake.readthedocs.io/en/stable/executing/cli.html

Example: culebrONT run_cluster -c config.yaml --dry-run --jobs 200

culebrONT run_cluster [OPTIONS] [SNAKEMAKE_OTHER]...

Options

-c, --config <config>: Required. Configuration file to run culebrONT

-pdf, --pdf

Run snakemake with --dag, --rulegraph and --filegraph

Default: False

Arguments

SNAKEMAKE_OTHER: Optional argument(s)

culebrONT run_local

Run snakemake command line with mandatory parameters.
SNAKEMAKE_OTHER: You can also pass additional Snakemake parameters
through this argument, using the same syntax.
These take precedence over Snakemake parameters defined in the profile.
See: https://snakemake.readthedocs.io/en/stable/executing/cli.html

Example:

culebrONT run_local -c config.yaml --threads 8 --dry-run
culebrONT run_local -c config.yaml --threads 8 --singularity-args '--bind /mnt:/mnt'
# in LOCAL using 6 threads for Canu assembly from the total 8 threads
culebrONT run_local -c config.yaml --threads 8 --set-threads run_canu=6

culebrONT run_local [OPTIONS] [SNAKEMAKE_OTHER]...

Options

-c, --config <config>: Required. Configuration file to run culebrONT

-t, --threads <threads>: Required. Number of threads

-p, --pdf: Run snakemake with --dag, --rulegraph and --filegraph

Arguments

SNAKEMAKE_OTHER: Optional argument(s)

Advanced run

Provide more resources

If the cluster default resources are not sufficient, you can edit cluster_config.yaml (see 2. Adapting cluster_config.yaml):

culebrONT edit_cluster_config

Edit the cluster_config.yaml used by the profile

culebrONT edit_cluster_config [OPTIONS]

Now, take a coffee or tea, and enjoy !!!!!!

Provide your own tools_config.yaml

To change the tools used in the CulebrONT workflow, you can run the command below (see 3. How to configure tools_path.yaml):

culebrONT edit_tools

Edit own tools version

culebrONT edit_tools [OPTIONS]

Options

-r, --restore

Restore default tools_config.yaml (from install)

Default: False

Output on CulebrONT

The architecture of CulebrONT output is designed as follows:

OUTPUT_CULEBRONT_CIRCULAR/
├── SAMPLE-1
│   ├── AGGREGATED_QC
│   │   ├── DATA
│   │   ├── MAUVE_ALIGN
│   │   └── QUAST_RESULTS
│   ├── ASSEMBLERS
│   │   ├── CANU
│   │   │   ├── ASSEMBLER
│   │   │   ├── CORRECTION
│   │   │   ├── FIXSTART
│   │   │   ├── POLISHING
│   │   │   └── QUALITY
│   │   ├── FLYE
│   │   │   ├── ...
│   │   ├── MINIASM
│   │   │   ├── ...
│   │   ├── RAVEN
│   │   │   ├── ...
│   │   ├── SHASTA
│   │   │   ├── ...
│   │   └── SMARTDENOVO
│   │       ├── ...
│   ├── DIVERS
│   │   └── FASTQ2FASTA
│   ├── LOGS
│   └── REPORT
├── SAMPLE-2 ...
└── FINAL_REPORT

Report

CulebrONT generates a useful report including the versions of tools used and, for each FASTQ, a summary of interesting statistics. Please discover an example … and enjoy !!

Note

Because of constraints imposed by Snakemake, we have not been able to include the version of bwa and seqtk in the report https://snakemake.readthedocs.io/en/stable/tutorial/advanced.html#step-5-loggin. If you want to know the versions of these tools, go check by yourself ^^.

Important

To visualise the report created by CulebrONT, transfer the folder FINAL_RESULTS to your local computer and open it in a web browser.

Input	Description
FASTQ	Every FASTQ file should contain the whole set of reads to be assembled. Each fastq file will be assembled independently.
REF	Only one REFERENCE genome file will be used by CulebrONT. This REFERENCE will be used for quality steps (ASSEMBLYTICS, QUAST and MAUVE)
GENOME_SIZE	Estimated genome size of the assembly, expressed in megabases (Mb), gigabases (Gb) or kilobases (Kb). This size is used by some assemblers (CANU) and also by the QUAST quality step
FAST5	Nanopolish uses FAST5 files for polishing and Medaka needs FAST5 files if a model training step is requested. Please give the path of the FAST5 folder in the FAST5 DATA parameter. Inside this directory, a subdirectory with the exact same name as the corresponding FASTQ (before the .fastq.gz) is required. For instance, if in the FASTQ directory we have run1.fastq.gz and run2.fastq.gz, CulebrONT is expecting the run1/ and run2/ subdirectories in the FAST5 main directory
ILLUMINA	Indicate the path to the directory with Illumina sequence data (in fastq or fastq.gz format) to perform pilon correction and for KAT on quality. Use preferentially paired-end data. All fastq files need to be homogeneous on their extension name. Please use run1_R1 and run1_R2 nomenclature.
OUTPUT	Path to the output directory
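
The FAST5 layout described in the table (one subdirectory per FASTQ file, named after the FASTQ file without its extension) can be verified with a short script. This is a minimal sketch and not part of CulebrONT: the function names are hypothetical, and stripping the other accepted extensions (.fq.gz, .fastq, .fq) in the same way as .fastq.gz is an assumption.

```python
from pathlib import Path

# FASTQ extensions accepted by CulebrONT; longer suffixes are tested first.
FASTQ_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def sample_name(fastq_filename):
    """Strip the FASTQ extension, e.g. 'run1.fastq.gz' -> 'run1'."""
    for ext in FASTQ_EXTENSIONS:
        if fastq_filename.endswith(ext):
            return fastq_filename[: -len(ext)]
    return None  # not a FASTQ file


def missing_fast5_subdirs(fastq_filenames, fast5_subdirs):
    """Return the sorted sample names that have no matching FAST5 subdirectory."""
    expected = {sample_name(name) for name in fastq_filenames} - {None}
    return sorted(expected - set(fast5_subdirs))


def check_fast5_layout(fastq_dir, fast5_dir):
    """Compare a FASTQ directory with a FAST5 directory on disk."""
    fastq_files = [p.name for p in Path(fastq_dir).iterdir() if p.is_file()]
    subdirs = [p.name for p in Path(fast5_dir).iterdir() if p.is_dir()]
    return missing_fast5_subdirs(fastq_files, subdirs)
```

For the example in the table, a FASTQ directory containing run1.fastq.gz and run2.fastq.gz and a FAST5 directory containing only run1/ would report run2 as missing.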