Usage¶
Configure workflow¶
Configure the workflow according to your needs by editing the file config.yaml. For use without any sequence database downloads, you need to provide the files and pathnames in cleannames.txt and atccs.txt.
Gather your ribosomal protein sequences as protein sequence fasta files in 15 separate files and place in folder named “seqs”. These will be used for either protein or DNA sequence trees as per your choice laid out in the config.yaml. They should be named within the file as L14_rplN, L16_rplP, L18_rplR, L2_rplB, L22_rplV, L24_rplX, L3_rplC, L4_rplD, L5_rplE, L6_rplF, S10_rpsJ, S17_rpsQ, S19_rpsS, S3_rpsC and S8_rpsH, respectively. Staphylococcus sequences are included in the github folder and for examples. Other sets of sequences are coming soon!
Gather all of your genome sequences as annotated genbank files in the same directory as you are working in.
Create a file, “genus.txt” containing the name of the genus and species you want to build a tree for, one per line. Note that the genus or species name must be enclosed with quotes.
If you do not want to download any genomes from the sequence databases, create a file, “cleannames.txt” containing the name of each genome file you want to include in your tree, one per line, and place “no” in the config.yaml under download_genbank options. To use a mixture of download and provided genomes place “yes” in the config.yaml and do not provide a cleannames.txt file.
To update any previous trees, collect any files generated as concatenated deduplicated fasta files that you want to update, and edit the config.yaml file with the filenames as appropriate. NOTE: If you are not updating any previous trees, you will need to include an empty file named “previous.fasta” or filename as specified in the config.yaml
Add Genbank files, ribosomal protein sequence files and a list of the files: If you do not want to download any genomes, add the names of all provided genbank format files to cleannames.txt on a one-line per file basis. The suffix of the genbank files should end in .gbff (following the genbank download nomenclature), .gbk, .gb or .genbank. Add the names of your ribosomal sequence files on a one-line per file basis to atccs.txt. An easy way to generate these text files in unix is with: ls *gbff > cleannames.txt
Example files are contained in example_files folder with files for a testrun in the testfiles folder