| 30 | | Notes: |
|---|
| 31 | | 1) RepeatMasker requires Wu-BLAST and a single-file executable called TRF (see the RepeatMasker website for details), so please install these before installing RepeatMasker. |
|---|
| 32 | | 2) Exonerate binaries can be downloaded from the website. If you have Mac OSX, however, binaries are only available for version 1.0. This version will work too. If you would like to compile exonerate, it requires GLIB, a C library that is linked from the exonerate website. If you have Mac OSX, GLIB can be downloaded using FINK. |
|---|
| 33 | | 3) RepeatMasker requires a repeat library file, which is downloaded from Repbase (http://www.girinst.org/). This is explained on the RepeatMasker website. |
|---|
| 34 | | 4) Please note the location of all of the programs that you have installed. You will need this information in the maker_exe.ctl file, one of MAKER's 3 control files. |
|---|
| | 38 | Notes: |
|---|
| | 39 | 1) Wu-BLAST is becoming AB-BLAST. Once AB-BLAST becomes available we will do some testing to see if it is compatible with MAKER. Wu-BLAST is no longer available online, so if you don't already have it, you will have to use NCBI BLAST instead. |
|---|
| | 40 | 2) RepeatMasker requires Wu-BLAST or Cross_Match and a single-file executable called TRF (see the RepeatMasker website for details), so please install these before installing RepeatMasker. |
|---|
| | 41 | 3) Exonerate binaries can be downloaded from the website. If you use Mac OSX, however, binaries are only available for version 1.0. This version will work too. If you would like to compile exonerate, it requires GLIB, a C library that is linked from the exonerate website. If you use Mac OSX, GLIB can be downloaded using FINK. |
|---|
| | 42 | 4) RepeatMasker requires a repeat library file, which can be downloaded from Repbase upon registration (http://www.girinst.org/). This is explained on the RepeatMasker website. |
|---|
| | 43 | 5) Please note the location of all of the programs that you have installed, and add them to your $PATH variable in your .profile file. You will need this information in the maker_exe.ctl file, one of MAKER's 3 control files. |
|---|
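As a quick check for the note about $PATH, the following minimal sketch reports where each program is found. The list of program names is an assumption chosen for illustration; edit it to match the tools you actually installed.

```python
import shutil

# Illustrative list only (an assumption): adjust to the programs you installed.
REQUIRED_PROGRAMS = ["RepeatMasker", "exonerate", "snap", "blastn", "blastx"]


def locate_programs(names):
    """Map each program name to its full path on $PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_programs(REQUIRED_PROGRAMS).items():
        print(f"{name}\t{path if path else 'NOT FOUND on $PATH'}")
```

The paths printed for the programs that are found are the values you would copy into the executable control file.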
| 102 | | Usage: |
|---|
| 103 | | |
|---|
| 104 | | maker [options] <maker_opts.ctl> <maker_bopts.ctl> <maker_exe.ctl> |
|---|
| 105 | | |
|---|
| 106 | | The three input arguments are user control files that specify how maker should behave. |
|---|
| 107 | | All input files listed in the control options files must be in fasta format. Please |
|---|
| 108 | | see maker documentation to learn more about control file format. The program will |
|---|
| 109 | | automatically try and locate the user control files in the current working |
|---|
| 110 | | directory if these arguments are not supplied when initializing maker. |
|---|
| 111 | | |
|---|
| 112 | | It is important to note that maker does not try to recalculate data that it has |
|---|
| 113 | | already calculated. For example, if you run an analysis twice on the same fasta file |
|---|
| 114 | | you will notice that maker does not rerun any of the blast analyses but instead uses |
|---|
| 115 | | the blast analyses stored from the previous run. To force maker to rerun all |
|---|
| 116 | | analyses, use the -f flag. |
|---|
| 117 | | |
|---|
| 118 | | Options: |
|---|
| 119 | | |
|---|
| 120 | | -genome|g <file_name> Give MAKER a different genome file (this overrides the |
|---|
| 121 | | control file value) |
|---|
| 122 | | |
|---|
| 123 | | -predictor <snap> Selects the gene predictor to use when building annotations (Default |
|---|
| 124 | | <augustus> is 'snap'). The option 'est2genome' builds annotations directly |
|---|
| 125 | | <est2genome> from the EST evidence. |
|---|
| 126 | | |
|---|
| 127 | | -GFF Use an input gff3 format file of repeat elements for repeat masking. |
|---|
| 128 | | You must set rm_gff in maker_opts.ctl to the file's location. This |
|---|
| 129 | | option turns off all other repeat masking. |
|---|
| 130 | | |
|---|
| 131 | | -RM_off|R Turns repeat masking off (* See Warning) |
|---|
| 132 | | |
|---|
| 133 | | -force|f Forces maker to rerun all analyses (replaces all previous output). |
|---|
| 134 | | |
|---|
| 135 | | -datastore|d Causes output to be written using datastore. This option is |
|---|
| 136 | | automatically enabled if there are more than 1000 fasta entries |
|---|
| 137 | | in the input file. Output can then be accessed using the |
|---|
| 138 | | master_datastore_index file created by the program. |
|---|
| 139 | | |
|---|
| 140 | | -PREDS Outputs ab-initio predictions that do not overlap maker annotation |
|---|
| 141 | | as gene annotations in the final gff3 output file (based on the |
|---|
| 142 | | -predictor flag ). |
|---|
| 143 | | |
|---|
| 144 | | -CTL Generates generic control files in the current working directory. |
|---|
| 145 | | |
|---|
| 146 | | -retry <integer> Re-run failed contigs up to the specified number of re-tries. |
|---|
| 147 | | |
|---|
| 148 | | -cpus|c <integer> Tells how many cpus to use for Blast analysis (this overrides |
|---|
| 149 | | control file value). |
|---|
| 150 | | |
|---|
| 151 | | -help|? Prints this usage statement. |
|---|
| 152 | | |
|---|
| 153 | | |
|---|
| 154 | | Warning: |
|---|
| 155 | | |
|---|
| 156 | | *When using the -R flag, maker expects that the input genome file is already masked. |
|---|
| 157 | | Also if your genome file contains lower case characters, maker will consider those |
|---|
| 158 | | characters to be soft masked. |
|---|
| 159 | | |
|---|
| 160 | | |
|---|
| 161 | | #---------- |
|---|
| | 113 | |
|---|
| | 114 | |
|---|
| | 115 | |
|---|
| | 116 | #---------------------------------------------------- |
|---|
| 171 | | MAKER will create at least the following files/directory: |
|---|
| 172 | | |
|---|
| 173 | | seq_name.gff - a gff file that can be loaded into GMOD, GBROWSE, or Apollo |
|---|
| 174 | | |
|---|
| 175 | | seq_name.maker.transcripts.fasta - a file of the maker transcript sequences |
|---|
| 176 | | seq_name.maker.snap.transcript.fasta - a file of ab-initio snap transcript sequences |
|---|
| 177 | | seq_name.maker.proteins.fasta - a file of the maker protein sequences |
|---|
| 178 | | seq_name.maker.snap.proteins.fasta - a file of ab-initio snap protein sequences |
|---|
| 179 | | |
|---|
| 180 | | theVoid.seq_name - a directory containing all of the results files produced by maker, including BLAST reports, SNAP output, exonerate output and the masked sequence |
|---|
| | 126 | MAKER will create at least the following files/directories: |
|---|
| | 127 | |
|---|
| | 128 | XXX.maker.output/ - contains all output for a given run of maker |
|---|
| | 129 | XXX.maker.output/XXX_master_datastore_index.log - log of MAKER run progress as well as an index for traversing XXX.maker.output/XXX_datastore/ |
|---|
| | 130 | XXX.maker.output/XXX_datastore/ - contains folders containing the output for each individual contig of the input fasta file |
|---|
| | 131 | *Within these folders |
|---|
| | 132 | seq_name.gff - a gff file that can be loaded into GMOD, GBROWSE, or Apollo |
|---|
| | 133 | seq_name.maker.transcripts.fasta - a file of the maker transcript sequences |
|---|
| | 134 | seq_name.maker.proteins.fasta - a file of the maker protein sequences |
|---|
| | 135 | seq_name.maker.XXX.transcript.fasta - a file of ab-initio transcript sequences from program XXX |
|---|
| | 136 | seq_name.maker.XXX.proteins.fasta - a file of ab-initio protein sequences from program XXX |
|---|
| | 137 | seq_name.maker.non_overlapping_ab_initio.transcripts.fasta - a file of filtered ab-initio transcript sequences that don't overlap annotations |
|---|
| | 138 | seq_name.maker.non_overlapping_ab_initio.proteins.fasta - a file of filtered ab-initio protein sequences that don't overlap annotations |
|---|
| | 139 | theVoid.seq_name/ - a directory containing all of the raw output files produced by maker, including BLAST reports, SNAP output, exonerate output and the masked sequence |
|---|
| 194 | | Datastore will be used by maker if there are more than 1,000 sequences in a multi-fasta file or you use the -d flag on the command line. |
|---|
| 195 | | |
|---|
| 196 | | When datastore is implemented, the output files described above will not appear where you would normally expect them to be. Instead they will be located in a series of sub-directories under a new base directory whose name is determined from the input genome file name, e.g. current_working_directory/input_genome_datastore/EE/Af/seq_name/seq_name.gff. A master_datastore_index file will be made in the current working directory to help you find the output files from each sequence. |
|---|
| 197 | | |
|---|
| 198 | | The master_datastore_index file is created to allow the user to easily find the exact output directory corresponding to each contig from the input genome file. The master_datastore_index file contains two columns of text; the first column shows the sequence identifier from each fasta header, and the second column shows the location of the output files for that sequence. |
|---|
| 199 | | |
|---|
| 200 | | |
|---|
| 201 | | |
|---|
| 202 | | #---------- |
|---|
| | 151 | A deep datastore will be used by maker if there are more than 1,000 sequences in a multi-fasta file. |
|---|
| | 152 | |
|---|
| | 153 | When a datastore is implemented, the output files described above will not appear where you would normally expect them to be. Instead they will be located in a series of sub-directories under a new base directory whose name is determined from the input genome file name, e.g. current_working_directory/genome_datastore/EE/Af/Contig1/Contig1.gff. A master_datastore_index file will be made in the current working directory to help you find the output files from each sequence. |
|---|
| | 154 | |
|---|
| | 155 | The master_datastore_index file is created to allow the user to easily find the exact output directory corresponding to each contig from the input genome file. The master_datastore_index file contains three columns of text; the first column shows the sequence identifier from each fasta header, and the second column shows the location of the output files for that sequence. The third column is for logging the status of data related to an individual contig. The values of the third column are as follows: |
|---|
| | 156 | STARTED - Indicates that maker has started processing this contig. |
|---|
| | 157 | FINISHED - Indicates that maker has finished processing this contig and all data is currently available in that subdirectory. |
|---|
| | 158 | DIED - Indicates that maker failed. |
|---|
| | 159 | DIED_SKIPPED_PERMANENT - Indicates that maker failed up to the specified number of retries and will not try again. |
|---|
| | 160 | RETRY - Indicates that maker is retrying the contig after a failure. |
|---|
| | 161 | SKIPPED_SMALL - Indicates that this contig was skipped because it is too short (based on control file values set by the user) |
|---|
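The three-column index described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: it assumes the columns are tab-separated and that a contig may appear on several lines as its status changes (for example STARTED followed by FINISHED). Neither detail is stated in this document, and the sample paths are hypothetical.

```python
def parse_datastore_index(lines):
    """Return {seq_id: (output_dir, [statuses in logged order])}.

    Assumes tab-separated columns: identifier, output location, status.
    """
    entries = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip lines that do not have all three columns
        seq_id, out_dir, status = fields[0], fields[1], fields[2]
        if seq_id not in entries:
            entries[seq_id] = (out_dir, [])
        entries[seq_id][1].append(status)
    return entries


def not_finished(entries):
    """List sequence ids whose most recent status is not FINISHED."""
    return [seq_id for seq_id, (_, statuses) in entries.items()
            if statuses[-1] != "FINISHED"]


if __name__ == "__main__":
    # Hypothetical index content, for illustration only.
    sample = [
        "Contig1\tgenome_datastore/EE/Af/Contig1\tSTARTED",
        "Contig1\tgenome_datastore/EE/Af/Contig1\tFINISHED",
        "Contig2\tgenome_datastore/1B/C3/Contig2\tSTARTED",
        "Contig2\tgenome_datastore/1B/C3/Contig2\tDIED",
    ]
    index = parse_datastore_index(sample)
    print(not_finished(index))
```

Note that contigs marked SKIPPED_SMALL are also reported by `not_finished`, since their last status is not FINISHED.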
| | 162 | |
|---|
| | 163 | |
|---|
| | 164 | #---------------------------------------------------- |
|---|
| 211 | | |
|---|
| 212 | | Here is what the standard maker_exe.ctl control file looks like: |
|---|
| 213 | | ==================================== |
|---|
| 214 | | |
|---|
| 215 | | #-----Location of executables required by Maker |
|---|
| 216 | | xdformat:/usr/local/wublast/xdformat #location of xdformat executable |
|---|
| 217 | | blastn:/usr/local/wublast/blastn #location of blastn executable |
|---|
| 218 | | blastx:/usr/local/wublast/blastx #location of blastx executable |
|---|
| 219 | | snap:/usr/local/snap/snap #location of snap executable |
|---|
| 220 | | augustus:/usr/local/augustus/bin/augustus #location of augustus executable (optional) |
|---|
| 221 | | RepeatMasker:/usr/local/RepeatMasker/RepeatMasker #location of RepeatMasker executable |
|---|
| 222 | | exonerate:/usr/local/exonerate/bin/exonerate #location of exonerate executable |
|---|
| 223 | | |
|---|
| 224 | | ==================================== |
|---|
| 225 | | |
|---|
| 226 | | Note that for all control files, the comments written to help users begin with a pound sign (#). In addition, the option names before the colon (:) cannot be changed, nor should there be a space before or after the colon. |
|---|
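The control-file layout described above (an option name, a colon with no surrounding spaces, a value, then an optional comment starting with #) can be read with a short parser. This is a minimal sketch, not MAKER's own parser. It assumes that a # never appears inside a value.

```python
def parse_control_line(line):
    """Parse one 'name:value #comment' line; return (name, value) or None."""
    body = line.split("#", 1)[0].strip()  # drop the trailing comment, if any
    if not body or ":" not in body:
        return None  # blank line, comment-only line, or not an option
    name, value = body.split(":", 1)  # split on the first colon only
    return name, value


def parse_control_file(lines):
    """Collect all options from an iterable of control-file lines."""
    pairs = (parse_control_line(line) for line in lines)
    return dict(pair for pair in pairs if pair is not None)


if __name__ == "__main__":
    example = [
        "#-----Location of executables required by Maker",
        "blastn:/usr/local/wublast/blastn #location of blastn executable",
        "snap:/usr/local/snap/snap #location of snap executable",
    ]
    for name, value in parse_control_file(example).items():
        print(name, "->", value)
```

Splitting on the first colon only keeps values that themselves contain a colon intact.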
| | 175 | Here is an example of a section of the maker_exe.ctl file: |
|---|
| | 176 | ==================================== |
|---|
| | 177 | #-----Location of Executables Used by Maker/Evaluator |
|---|
| | 178 | formatdb:/usr/local/bin/formatdb #location of NCBI formatdb executable |
|---|
| | 179 | blastall:/usr/local/bin/blastall #location of NCBI blastall executable |
|---|
| | 180 | xdformat:/usr/local/bin/xdformat #location of WUBLAST xdformat executable |
|---|
| | 181 | blastn:/usr/local/bin/blastn #location of WUBLAST blastn executable |
|---|
| | 182 | blastx:/usr/local/bin/blastx #location of WUBLAST blastx executable |
|---|
| | 183 | tblastx:/usr/local/bin/tblastx #location of WUBLAST tblastx executable |
|---|
| | 184 | RepeatMasker:/home/cholt/usr/local/RepeatMasker/RepeatMasker #location of RepeatMasker executable |
|---|
| | 185 | exonerate:/home/cholt/usr/local/exonerate/bin/exonerate #location of exonerate executable |
|---|
| | 186 | |
|---|
| | 187 | #-----Ab-initio Gene Prediction Algorithms |
|---|
| | 188 | snap:/home/cholt/usr/local/snap/snap #location of snap executable |
|---|
| | 189 | gmhmme3:/home/cholt/usr/local/gmes/gmhmme3 #location of eukaryotic genemark executable |
|---|
| | 190 | augustus:/home/cholt/usr/local/augustus/bin/augustus #location of augustus executable |
|---|
| | 191 | fgenesh:/home/cholt/usr/local/fgenesh/fgenesh #location of fgenesh executable |
|---|
| | 192 | |
|---|
| | 193 | ==================================== |
|---|
| 235 | | percov_blastn:0.80 #Blastn Percent Coverage Threshold EST-Genome Alignments |
|---|
| 236 | | percid_blastn:0.85 #Blastn Percent Identity Threshold EST-Genome Alignments |
|---|
| 237 | | eval_blastn:1e-10 #Blastn eval cutoff |
|---|
| 238 | | bit_blastn:40 #Blastn bit cutoff |
|---|
| 239 | | percov_blastx:0.50 #Blastx Percent Coverage Threshold Protein-Genome Alignments |
|---|
| 240 | | percid_blastx:0.40 #Blastx Percent Identity Threshold Protein-Genome Alignments |
|---|
| 241 | | eval_blastx:1e-6 #Blastx eval cutoff |
|---|
| 242 | | bit_blastx:30 #Blastx bit cutoff |
|---|
| 243 | | e_perc_cov:50 #Exonerate Percent Coverage Threshold EST-Genome Alignments |
|---|
| 244 | | ep_score_limit:20 #Report alignments scoring at least this percentage of the maximal score exonerate protein |
|---|
| 245 | | en_score_limit:20 #Report alignments scoring at least this percentage of the maximal score exonerate nucleotide |
|---|
| 246 | | |
|---|
| | 201 | blast_type:wublast #set to 'wublast' or 'ncbi' |
|---|
| | 202 | |
|---|
| | 203 | pcov_blastn:0.8 #Blastn Percent Coverage Threshold EST-Genome Alignments |
|---|
| | 204 | pid_blastn:0.85 #Blastn Percent Identity Threshold EST-Genome Alignments |
|---|
| | 205 | eval_blastn:1e-10 #Blastn eval cutoff |
|---|
| | 206 | bit_blastn:40 #Blastn bit cutoff |
|---|
| | 207 | |
|---|
| | 208 | pcov_blastx:0.5 #Blastx Percent Coverage Threshold Protein-Genome Alignments |
|---|
| | 209 | pid_blastx:0.4 #Blastx Percent Identity Threshold Protein-Genome Alignments |
|---|
| | 210 | eval_blastx:1e-06 #Blastx eval cutoff |
|---|
| | 211 | bit_blastx:30 #Blastx bit cutoff |
|---|
| | 212 | |
|---|
| | 213 | pcov_rm_blastx:0.5 #Blastx Percent Coverage Threshold For Transposable Element Masking |
|---|
| | 214 | pid_rm_blastx:0.4 #Blastx Percent Identity Threshold For Transposable Element Masking |
|---|
| | 215 | eval_rm_blastx:1e-06 #Blastx eval cutoff for transposable element masking |
|---|
| | 216 | bit_rm_blastx:30 #Blastx bit cutoff for transposable element masking |
|---|
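To make the four BLAST cutoffs above concrete, here is a minimal sketch of how such thresholds could be applied to a hit. It assumes that a hit must satisfy all four cutoffs at once and that the comparisons are inclusive. This document does not state how MAKER combines them, so treat the logic as illustrative. The `Hit` and `Thresholds` types are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A hypothetical summary of one BLAST hit."""
    coverage: float   # fraction of the query covered, 0.0 to 1.0
    identity: float   # fraction of identical positions, 0.0 to 1.0
    evalue: float
    bit_score: float


@dataclass
class Thresholds:
    pcov: float
    pid: float
    evalue: float
    bit: float


# Values copied from the blastn options shown above.
BLASTN = Thresholds(pcov=0.8, pid=0.85, evalue=1e-10, bit=40)


def passes(hit, t):
    """Assumed rule: keep a hit only if it meets every cutoff."""
    return (hit.coverage >= t.pcov
            and hit.identity >= t.pid
            and hit.evalue <= t.evalue
            and hit.bit_score >= t.bit)


if __name__ == "__main__":
    good = Hit(coverage=0.95, identity=0.98, evalue=1e-50, bit_score=200)
    weak = Hit(coverage=0.95, identity=0.70, evalue=1e-50, bit_score=200)
    print(passes(good, BLASTN), passes(weak, BLASTN))
```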
| 251 | | |
|---|
| 252 | | Here is an example maker_opts.ctl: |
|---|
| 253 | | ==================================== |
|---|
| 254 | | |
|---|
| 255 | | #-----sequence and library files |
|---|
| 256 | | genome:fly_assembly.fasta #genome sequence file (required) |
|---|
| 257 | | est:fly_est.fasta #EST sequence file (required) |
|---|
| 258 | | protein:uniprot.fasta #protein sequence file (required) |
|---|
| 259 | | repeat_protein:te_proteins.fasta #a database of transposable element proteins |
|---|
| 260 | | rmlib:fly_specific_repeats.fasta #an organism specific repeat library (optional) |
|---|
| 261 | | rm_gff: #a gff3 format file of repeat elements (only used with -GFF flag) |
|---|
| 262 | | |
|---|
| 263 | | #-----external application specific options |
|---|
| 264 | | snaphmm:fly #SNAP HMM model |
|---|
| 265 | | augustus_species:fly #Augustus gene prediction model |
|---|
| 266 | | model_org:all #RepeatMasker model organism |
|---|
| 267 | | alt_peptide:c #amino acid used to replace non standard amino acids in xdformat |
|---|
| 268 | | cpus:2 #max number of cpus to use in BLAST and RepeatMasker |
|---|
| 269 | | |
|---|
| 270 | | #-----Maker specific options |
|---|
| 271 | | predictor:snap #identifies which gene prediction program to use for annotations |
|---|
| 272 | | te_remove:1 #mask regions with excess similarity to transposable element proteins |
|---|
| 273 | | max_dna_len:100000 #length for dividing up contigs into chunks (larger values increase memory usage) |
|---|
| 274 | | split_hit:10000 #length of the splitting of hits (max intron size for EST and protein alignments) |
|---|
| 275 | | snap_flank:200 #length of sequence surrounding EST and protein evidence used to extend gene predictions |
|---|
| 276 | | single_exon:0 #consider EST hits aligning to single exons when generating annotations, 1 = yes, 0 = no |
|---|
| 277 | | use_seq_dir:1 #place output files in same directory as sequence file: 1 = yes, 0 = no |
|---|
| 278 | | clean_up:0 #remove theVoid directory: 1 = yes, 0 = no |
|---|
| 279 | | |
|---|
| 280 | | ==================================== |
|---|
| 281 | | |
|---|
| 282 | | |
|---|
| 283 | | #---------- |
|---|
| | 221 | Here is an example of a section of the maker_opts.ctl file: |
|---|
| | 222 | ==================================== |
|---|
| | 223 | #-----Genome (Required for De-Novo Annotations) |
|---|
| | 224 | genome:input/genome.fasta #genome sequence file in fasta format |
|---|
| | 225 | |
|---|
| | 226 | #-----Re-annotation Options |
|---|
| | 227 | genome_gff: #re-annotate genome based on this gff3 file |
|---|
| | 228 | est_pass:0 #use ests in genome_gff: 1 = yes, 0 = no |
|---|
| | 229 | altest_pass:0 #use alternate organism ests in genome_gff: 1 = yes, 0 = no |
|---|
| | 230 | protein_pass:0 #use proteins in genome_gff: 1 = yes, 0 = no |
|---|
| | 231 | rm_pass:0 #use repeats in genome_gff: 1 = yes, 0 = no |
|---|
| | 232 | model_pass:0 #use gene models in genome_gff: 1 = yes, 0 = no |
|---|
| | 233 | pred_pass:0 #use ab-initio predictions in genome_gff: 1 = yes, 0 = no |
|---|
| | 234 | other_pass:0 #passthrough everything else in genome_gff: 1 = yes, 0 = no |
|---|
| | 235 | |
|---|
| | 236 | #-----EST Evidence (you must provide a value for at least one) |
|---|
| | 237 | est:input/est.fasta #non-redundant set of assembled ESTs in fasta format (classic EST analysis) |
|---|
| | 238 | est_reads: #un-assembled EST reads in fasta format (for deep nextgen mRNASeq) |
|---|
| | 239 | altest:input/altest.fasta #EST/cDNA sequence file in fasta format from an alternate organism |
|---|
| | 240 | est_gff: #EST evidence from a separate gff3 file |
|---|
| | 241 | altest_gff: #Alternate organism EST evidence from a separate gff3 file |
|---|
| | 242 | |
|---|
| | 243 | #-----Protein Homology Evidence (you must provide a value for at least one) |
|---|
| | 244 | protein:input/protein.fasta #protein sequence file in fasta format |
|---|
| | 245 | protein_gff: #protein homology evidence from a gff3 file |
|---|
| | 246 | ==================================== |
|---|
| | 247 | |
|---|
| | 248 | #---------------------------------------------------- |
|---|
| | 249 | GFF3 Passthrough |
|---|
| | 250 | |
|---|
| | 251 | If you have data from a source that MAKER does not support, and you wish to use the data in annotating a genome, then you can pass the data to MAKER as an aligned GFF3 file. This is done by supplying the file's location to the appropriate option in the maker_opts.ctl file (e.g. est_gff:input/est.gff). Note that MAKER expects all data sent to it to be of the type specified, so don't put mixed data in a file (e.g. don't mix EST and other data in the file pointed to by est_gff, otherwise it all gets used as EST data). Also, the genome_gff option is only for MAKER-produced GFF3 files. Other GFF3 files of mixed data must be split by type and identified by the appropriate control file option (e.g. rm_gff for repeat data, pred_gff for ab-initio prediction data, est_gff for EST data, etc.). |
|---|
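Splitting a mixed GFF3 file, as required above, can be sketched as follows. GFF3 feature lines have nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). This sketch groups lines by the source column, on the assumption that each kind of evidence in your file carries its own source label. If your file instead distinguishes evidence by the type column, pass `column=2`. The sample source names are hypothetical.

```python
def split_gff3(lines, column=1):
    """Group GFF3 feature lines by the value in a 0-based column.

    column=1 is the source column, column=2 is the type column.
    Comment lines, blank lines and lines without nine columns are skipped.
    """
    groups = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue
        groups.setdefault(fields[column], []).append(line)
    return groups


if __name__ == "__main__":
    # Hypothetical mixed input, for illustration only.
    mixed = [
        "##gff-version 3",
        "Contig1\test_source\tmatch\t100\t500\t.\t+\t.\tID=est1",
        "Contig1\trepeat_source\tmatch\t900\t1200\t.\t-\t.\tID=rep1",
        "Contig1\test_source\tmatch\t2000\t2600\t.\t+\t.\tID=est2",
    ]
    for source, features in split_gff3(mixed).items():
        print(source, len(features))
```

Each group can then be written to its own file and named in the matching control file option, so that no single file mixes evidence types.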
| | 252 | |
|---|
| | 253 | #---------------------------------------------------- |
|---|