| 39 | | 1) Wu-BLAST is becoming AB-BLAST. Once AB-BLAST becomes available we will do some testing to see if it is compatible with MAKER. Wu-BLAST is no longer available online, so if you don't already have it, you will have to use NCBI BLAST instead. |
|---|
| 40 | | 2) RepeatMasker requires Wu-BLAST or Cross_Match and a single file executable called TRF (see RepeatMasker website for details), so please install these before installing RepeatMasker |
|---|
| 41 | | 3) Exonerate binaries can be downloaded from the website. If you use Mac OSX, however, binaries are only available for version 1.0. This version will work too. If you would like to compile exonerate, it requires GLIB, a C library that is linked from the exonerate website. If you use Mac OSX, GLIB can be downloaded using FINK. |
|---|
| 42 | | 4) RepeatMasker requires a repeat library file, which can be downloaded from Repbase upon registration (http://www.girinst.org/); this is explained on the RepeatMasker website. |
|---|
| 43 | | 5) Please note the location of all of the programs that you have installed, and add them to your $PATH variable in your .profile file. You will need this information in the maker_exe.ctl file, one of MAKER's 3 control files. |
|---|
| 44 | | |
|---|
| 45 | | |
|---|
| 46 | | Now that you have all the necessary programs installed, MAKER can be unpacked using: |
|---|
| 47 | | |
|---|
| 48 | | tar xvfz maker.tar.gz |
|---|
| | 50 | |
|---|
| | 51 | 1) Wu-BLAST is becoming AB-BLAST. Once AB-BLAST becomes available we |
|---|
| | 52 | will do some testing to see if it is compatible with MAKER. |
|---|
| | 53 | Wu-BLAST is no longer available online, so if you don't already |
|---|
| | 54 | have it, you will have to use NCBI BLAST instead. |
|---|
| | 55 | 2) RepeatMasker requires Wu-BLAST or Cross_Match and a single file |
|---|
| | 56 | executable called TRF (see RepeatMasker website for details), so |
|---|
| | 57 | please install these before installing RepeatMasker |
|---|
| | 58 | 3) Exonerate Binaries can be downloaded from the website. If you use |
|---|
| | 59 | Mac OSX, however, binaries are only available for version 1.0. |
|---|
| | 60 | This version will work too. If you would like to compile exonerate, |
|---|
| | 61 | it requires GLIB, a C-library, that has a link from the exonerate |
|---|
| | 62 | website. If you use Mac OSX, GLIB can be downloaded using FINK. |
|---|
| | 63 | 4) RepeatMasker requires a repeat library file, which can be |
|---|
| | 64 | downloaded from Repbase upon registration |
|---|
| | 65 | (http://www.girinst.org/); this is explained on the RepeatMasker |
|---|
| | 66 | website. |
|---|
| | 67 | 5) Please note the location of all of the programs that you have |
|---|
| | 68 | installed, and add them to your $PATH variable in your .profile |
|---|
| | 69 | file. You will need this information in the maker_exe.ctl file, one |
|---|
| | 70 | of MAKER's 3 control files. |
|---|
| | 71 | |
|---|
| | 72 | Now that you have all the necessary programs installed, MAKER can be |
|---|
| | 73 | unpacked using: |
|---|
| | 74 | |
|---|
| | 75 | tar xvfz maker.tar.gz |
|---|
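
Step 5 above can be sketched as a small helper that looks up each program on $PATH and prints lines in the key:value layout used by the control files. This is only an illustration, not part of MAKER: the key names are copied from the sample maker_exe.ctl shown later in this README, the function name `exe_ctl_lines` is hypothetical, and leaving the value empty for a program that is not found is an assumption.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Key names copied from the sample maker_exe.ctl in this README.
EXE_KEYS = ("formatdb", "blastall", "xdformat", "blastn", "blastx",
            "tblastx", "RepeatMasker", "exonerate")


def exe_ctl_lines(names: Iterable[str] = EXE_KEYS,
                  which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return one 'key:path' line per program name.

    Programs that are not found on $PATH get an empty value
    (an assumption made for this sketch), so they are easy to spot.
    """
    lines = []
    for name in names:
        path = which(name)  # None when the program is not on $PATH
        lines.append(f"{name}:{path or ''}")
    return lines


if __name__ == "__main__":
    for line in exe_ctl_lines():
        print(line)
```

The `which` parameter exists only so the lookup can be replaced in tests; by default the standard `shutil.which` is used.
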
| 95 | | If you are running maker on an MPI capable cluster, you can install an MPI version of maker by doing the following: |
|---|
| 96 | | |
|---|
| 97 | | 1. Install standard maker and verify that it runs. |
|---|
| 98 | | 2. Install MPICH2 with the --enable-sharedlibs flag set to the appropriate value (See MPICH2 documentation) |
|---|
| 99 | | 3. Use cd to change to the MPI subdirectory in the maker installation folder (i.e. maker/MPI/) |
|---|
| 100 | | 4. Run Install.PL by typing: perl Install.PL |
|---|
| 101 | | |
|---|
| 102 | | A new version of maker called mpi_maker should now be installed under maker/bin. |
|---|
| 103 | | |
|---|
| 104 | | To run mpi_maker, first verify that your mpi environment is initiated (i.e. using the mpdboot or mpd command). Now start mpi_maker via mpiexec. |
|---|
| | 140 | If you are running maker on an MPI capable cluster, you can install an |
|---|
| | 141 | MPI version of maker by doing the following: |
|---|
| | 142 | |
|---|
| | 143 | 1. Install standard maker and verify that it runs. |
|---|
| | 144 | 2. Install MPICH2 with the --enable-sharedlibs flag set to the |
|---|
| | 145 | appropriate value (See MPICH2 documentation) |
|---|
| | 146 | 3. Use cd to change to the MPI subdirectory in the maker |
|---|
| | 147 | installation folder (i.e. maker/MPI/) |
|---|
| | 148 | 4. Run Install.PL by typing: perl Install.PL |
|---|
| | 149 | |
|---|
| | 150 | A new version of maker called mpi_maker should now be installed under |
|---|
| | 151 | maker/bin. |
|---|
| | 152 | |
|---|
| | 153 | To run mpi_maker, first verify that your mpi environment is initiated |
|---|
| | 154 | (i.e. using the mpdboot or mpd command). Now start mpi_maker via |
|---|
| | 155 | mpiexec. |
|---|
| 120 | | maker [options] <maker_opts> <maker_bopts> <maker_exe> <evaluator> |
|---|
| 121 | | |
|---|
| 122 | | Maker is a program that produces gene annotations in GFF3 file format using |
|---|
| 123 | | evidence such as EST alignments and protein homology. Maker can be used to |
|---|
| 124 | | produce gene annotations for new genomes as well as update annotations from |
|---|
| 125 | | existing genome databases. |
|---|
| 126 | | |
|---|
| 127 | | The four input arguments are user control files that specify how maker |
|---|
| 128 | | should behave. The evaluator options file contains control options specific |
|---|
| 129 | | for the evaluation of gene annotations. All options for maker should be set |
|---|
| 130 | | in the control files, but a few can also be set on the command line. |
|---|
| 131 | | Command line options provide a convenient mechanism to override commonly |
|---|
| 132 | | altered control file values. |
|---|
| 133 | | |
|---|
| 134 | | Input files listed in the control options files must be in fasta format. |
|---|
| 135 | | Please see maker documentation to learn more about control file |
|---|
| 136 | | configuration. Maker will automatically try and locate the user control |
|---|
| 137 | | files in the current working directory if these arguments are not supplied |
|---|
| 138 | | when initializing maker. |
|---|
| 139 | | |
|---|
| 140 | | It is important to note that maker does not try to recalculate data that |
|---|
| 141 | | it has already calculated. For example, if you run an analysis twice on |
|---|
| 142 | | the same dataset file you will notice that maker does not rerun any of the |
|---|
| 143 | | blast analyses, but instead uses the blast analyses stored from the |
|---|
| 144 | | previous run. To force maker to rerun all analyses, use the -f flag. |
|---|
| | 171 | maker [options] <maker_opts> <maker_bopts> <maker_exe> <evaluator> |
|---|
| | 172 | |
|---|
| | 173 | Maker is a program that produces gene annotations in GFF3 file format |
|---|
| | 174 | using evidence such as EST alignments and protein homology. Maker can |
|---|
| | 175 | be used to produce gene annotations for new genomes as well as update |
|---|
| | 176 | annotations from existing genome databases. |
|---|
| | 177 | |
|---|
| | 178 | The four input arguments are user control files that specify how maker |
|---|
| | 179 | should behave. The evaluator options file contains control options |
|---|
| | 180 | specific for the evaluation of gene annotations. All options for maker |
|---|
| | 181 | should be set in the control files, but a few can also be set on the |
|---|
| | 182 | command line. Command line options provide a convenient mechanism to |
|---|
| | 183 | override commonly altered control file values. |
|---|
| | 184 | |
|---|
| | 185 | Input files listed in the control options files must be in fasta |
|---|
| | 186 | format. Please see maker documentation to learn more about control |
|---|
| | 187 | file configuration. Maker will automatically try and locate the user |
|---|
| | 188 | control files in the current working directory if these arguments are |
|---|
| | 189 | not supplied when initializing maker. |
|---|
| | 190 | |
|---|
| | 191 | It is important to note that maker does not try to recalculate data |
|---|
| | 192 | that it has already calculated. For example, if you run an analysis |
|---|
| | 193 | twice on the same dataset file you will notice that maker does not |
|---|
| | 194 | rerun any of the blast analyses, but instead uses the blast analyses |
|---|
| | 195 | stored from the previous run. To force maker to rerun all analyses, |
|---|
| | 196 | use the -f flag. |
|---|
| 149 | | -genome|g <filename> Specify the genome file. |
|---|
| 150 | | |
|---|
| 151 | | -predictor|p <type> Selects the predictor(s) to use when building |
|---|
| 152 | | annotations. Use a ',' to separate types (no spaces). |
|---|
| 153 | | i.e. -predictor=snap,augustus,fgenesh |
|---|
| 154 | | |
|---|
| 155 | | types: snap |
|---|
| 156 | | augustus |
|---|
| 157 | | fgenesh |
|---|
| 158 | | genemark |
|---|
| 159 | | est2genome (Uses EST's directly) |
|---|
| 160 | | abinit (ab-initio predictions) |
|---|
| 161 | | model_gff (Passes through GFF3 annotations) |
|---|
| 162 | | |
|---|
| 163 | | -RM_off|R Turns all repeat masking off. |
|---|
| 164 | | |
|---|
| 165 | | -retry <integer> Rerun failed contigs up to the specified count. |
|---|
| 166 | | |
|---|
| 167 | | -cpus|c <integer> Tells how many cpus to use for BLAST analysis. |
|---|
| 168 | | |
|---|
| 169 | | -force|f Forces maker to delete old files before running again. |
|---|
| 170 | | This will require all blast analyses to be rerun. |
|---|
| 171 | | |
|---|
| 172 | | -evaluate|e Run Evaluator on final annotations (under development). |
|---|
| 173 | | |
|---|
| 174 | | -quiet|q Silences most of maker's status messages. |
|---|
| 175 | | |
|---|
| 176 | | -CTL Generate empty control files in the current directory. |
|---|
| 177 | | |
|---|
| 178 | | -help|? Prints this usage statement. |
|---|
| 179 | | |
|---|
| 180 | | |
|---|
| 181 | | #---------------------------------------------------- |
|---|
| | 201 | -genome|g <filename> Specify the genome file. |
|---|
| | 202 | |
|---|
| | 203 | -predictor|p <type> Selects the predictor(s) to use when |
|---|
| | 204 | building annotations. Use a ',' to |
|---|
| | 205 | separate types (no spaces). |
|---|
| | 206 | i.e. -predictor=snap,augustus,fgenesh |
|---|
| | 207 | |
|---|
| | 208 | types: snap |
|---|
| | 209 | augustus |
|---|
| | 210 | fgenesh |
|---|
| | 211 | genemark |
|---|
| | 212 | est2genome (Uses EST's directly) |
|---|
| | 213 | abinit (ab-initio predictions) |
|---|
| | 214 | model_gff (Passes through GFF3 |
|---|
| | 215 | annotations) |
|---|
| | 216 | |
|---|
| | 217 | -RM_off|R Turns all repeat masking off. |
|---|
| | 218 | |
|---|
| | 219 | -retry <integer> Rerun failed contigs up to the specified count. |
|---|
| | 220 | |
|---|
| | 221 | -cpus|c <integer> Tells how many cpus to use for BLAST analysis. |
|---|
| | 222 | |
|---|
| | 223 | -force|f Forces maker to delete old files before running |
|---|
| | 224 | again. This will require all blast analyses to |
|---|
| | 225 | be rerun. |
|---|
| | 226 | |
|---|
| | 227 | -evaluate|e Run Evaluator on final annotations (under |
|---|
| | 228 | development). |
|---|
| | 229 | |
|---|
| | 230 | -quiet|q Silences most of maker's status messages. |
|---|
| | 231 | |
|---|
| | 232 | -CTL Generate empty control files in the current |
|---|
| | 233 | directory. |
|---|
| | 234 | |
|---|
| | 235 | -help|? Prints this usage statement. |
|---|
| | 236 | |
|---|
| | 237 | |
|---|
| | 238 | #--------------------------------------------------------------------- |
|---|
| | 239 | |
|---|
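
The -predictor option above takes a comma-separated list with no spaces. A minimal sketch of building that argument from a list of names follows; the helper `predictor_arg` is hypothetical and only checks names against the types listed in the usage statement.

```python
from typing import Iterable

# Predictor types listed in the usage statement above.
KNOWN_PREDICTORS = ("snap", "augustus", "fgenesh", "genemark",
                    "est2genome", "abinit", "model_gff")


def predictor_arg(types: Iterable[str]) -> str:
    """Build a '-predictor=a,b,c' argument with no spaces around the commas."""
    cleaned = [t.strip() for t in types]
    if not cleaned:
        raise ValueError("at least one predictor type is required")
    unknown = [t for t in cleaned if t not in KNOWN_PREDICTORS]
    if unknown:
        raise ValueError("unknown predictor type(s): " + ", ".join(unknown))
    return "-predictor=" + ",".join(cleaned)


if __name__ == "__main__":
    print(predictor_arg(["snap", "augustus", "fgenesh"]))
```

Stripping whitespace before joining guards against the most likely mistake, a stray space after a comma.
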
| 193 | | XXX.maker.output/ - contains all output for a given run of maker |
|---|
| 194 | | XXX.maker.output/XXX_master_datastore_index.log - log of MAKER run progress as well as an index for traversing XXX.maker.output/XXX_datastore/ |
|---|
| 195 | | XXX.maker.output/XXX_datastore/ - contains folders containing the output for each individual contig of the input fasta file |
|---|
| 196 | | *Within these folders |
|---|
| 197 | | seq_name.gff - a gff file that can be loaded into GMOD, GBROWSE, or Apollo |
|---|
| 198 | | seq_name.maker.transcripts.fasta - a file of the maker transcript sequences |
|---|
| 199 | | seq_name.maker.proteins.fasta - a file of the maker protein sequences |
|---|
| 200 | | seq_name.maker.XXX.transcript.fasta - a file of ab-initio transcript sequences from program XXX |
|---|
| 201 | | seq_name.maker.XXX.proteins.fasta - a file of ab-initio protein sequences from program XXX |
|---|
| 202 | | seq_name.maker.non_overlapping_ab_initio.transcripts.fasta - a file of filtered ab-initio transcript sequences that don't overlap annotations |
|---|
| 203 | | seq_name.maker.non_overlapping_ab_initio.proteins.fasta - a file of filtered ab-initio protein sequences that don't overlap annotations |
|---|
| 204 | | theVoid.seq_name/ - a directory containing all of the raw output files produced by maker, including BLAST reports, SNAP output, exonerate output and the masked sequence |
|---|
| | 256 | 1) XXX.maker.output/ - contains all output for a given run of maker |
|---|
| | 257 | 2) XXX.maker.output/XXX_master_datastore_index.log - log of MAKER run |
|---|
| | 258 | progress as well as an index for traversing |
|---|
| | 259 | XXX.maker.output/XXX_datastore/ |
|---|
| | 260 | 3) XXX.maker.output/XXX_datastore/ - contains folders containing the |
|---|
| | 261 | output for each individual contig of the input fasta file |
|---|
| | 262 | |
|---|
| | 263 | Within these folders: |
|---|
| | 264 | * seq_name.gff - a gff file that can be loaded into GMOD, GBROWSE, |
|---|
| | 265 | or Apollo |
|---|
| | 266 | * seq_name.maker.transcripts.fasta - a file of the maker |
|---|
| | 267 | transcript sequences |
|---|
| | 268 | * seq_name.maker.proteins.fasta - a file of the maker protein |
|---|
| | 269 | sequences |
|---|
| | 270 | * seq_name.maker.XXX.transcript.fasta - a file of ab-initio |
|---|
| | 271 | transcript sequences from program XXX |
|---|
| | 272 | * seq_name.maker.XXX.proteins.fasta - a file of ab-initio protein |
|---|
| | 273 | sequences from program XXX |
|---|
| | 274 | * seq_name.maker.non_overlapping_ab_initio.transcripts.fasta - a |
|---|
| | 275 | file of filtered ab-initio transcript sequences that don't |
|---|
| | 276 | overlap annotations |
|---|
| | 277 | * seq_name.maker.non_overlapping_ab_initio.proteins.fasta - a |
|---|
| | 278 | file of filtered ab-initio protein sequences that don't overlap |
|---|
| | 279 | annotations |
|---|
| | 280 | * theVoid.seq_name/ - a directory containing all of the raw |
|---|
| | 281 | output files produced by maker, including BLAST reports, SNAP |
|---|
| | 282 | output, exonerate output and the masked sequence. |
|---|
| 214 | | "Many filesystems have performance problems with large numbers of subdirectories and files within a single directory and even when the underlying filesystems handle things gracefully, access via network filesystems can be an issue. The Datastore modules create a hiearchy of subdirectory layers, starting from a 'base', and mapping end-user's identifiers to the corresponding subdirectory." - quote from http://www.yandell-lab.org/ (See site for more information on the Datastore module) |
|---|
| 215 | | |
|---|
| 216 | | A deep datastore will be used by maker if there are more than 1,000 sequences in a multi-fasta file. |
|---|
| 217 | | |
|---|
| 218 | | When a datastore is implemented, the output files described above will not appear where you would normally expect them to be. Instead they will be located in a series of sub-directories under a new base-directory whose name is determined from the input genome file name, i.e. current_working_directory/genome_datastore/EE/Af/Contig1/Contig1.gff. A master_datastore_index file will be made in the current working directory to help you find the output files from each sequence. |
|---|
| 219 | | |
|---|
| 220 | | The master_datastore_index file is a file created to allow the user to easily find the exact output directory corresponding to contigs from the input genome file. The master_datastore_index file contains three columns of text; the first column shows the sequence identifier from each fasta header, and the second column shows the location of the output files for that sequence. The third column is for logging the status of data related to an individual contig. The values of the third column are as follows: |
|---|
| 221 | | STARTED - Indicates that maker has started processing this contig. |
|---|
| 222 | | FINISHED - Indicates that maker has finished processing this contig and all data is currently available in that subdirectory. |
|---|
| 223 | | DIED - Indicates that maker failed. |
|---|
| 224 | | DIED_SKIPPED_PERMANENT - Indicates that maker failed up to the specified number of retries and will not try again. |
|---|
| 225 | | RETRY - Indicates that maker is retrying the contig after a failure. |
|---|
| 226 | | SKIPPED_SMALL - Indicates that this contig was skipped because it is too short (based on control file values set by the user) |
|---|
| 227 | | |
|---|
| 228 | | |
|---|
| 229 | | #---------------------------------------------------- |
|---|
| | 301 | "Many filesystems have performance problems with large numbers of |
|---|
| | 302 | subdirectories and files within a single directory and even when the |
|---|
| | 303 | underlying filesystems handle things gracefully, access via network |
|---|
| | 304 | filesystems can be an issue. The Datastore modules create a hiearchy |
|---|
| | 305 | of subdirectory layers, starting from a 'base', and mapping end-user's |
|---|
| | 306 | identifiers to the corresponding subdirectory." - quote from |
|---|
| | 307 | http://www.yandell-lab.org/ (See site for more information on the |
|---|
| | 308 | Datastore module) |
|---|
| | 309 | |
|---|
| | 310 | A deep datastore will be used by maker if there are more than 1,000 |
|---|
| | 311 | sequences in a multi-fasta file. |
|---|
| | 312 | |
|---|
| | 313 | When a datastore is implemented, the output files described above will |
|---|
| | 314 | not appear where you would normally expect them to be. Instead they |
|---|
| | 315 | will be located in a series of sub-directories under a new |
|---|
| | 316 | base-directory whose name is determined from the input genome file |
|---|
| | 317 | name: |
|---|
| | 318 | |
|---|
| | 319 | i.e. current_directory/genome_datastore/EE/Af/Contig1/Contig1.gff. |
|---|
| | 320 | |
|---|
| | 321 | A master_datastore_index file will be made in the current working |
|---|
| | 322 | directory to help you find the output files from each sequence. |
|---|
| | 323 | |
|---|
| | 324 | The master_datastore_index file is a file created to allow the user to |
|---|
| | 325 | easily find the exact output directory corresponding to contigs from |
|---|
| | 326 | the input genome file. The master_datastore_index file contains |
|---|
| | 327 | three columns of text; the first column shows the sequence identifier |
|---|
| | 328 | from each fasta header, and the second column shows the location of |
|---|
| | 329 | the output files for that sequence. The third column is for logging |
|---|
| | 330 | the status of data related to an individual contig. The values of the |
|---|
| | 331 | third column are as follows: |
|---|
| | 332 | * STARTED - Indicates that maker has started processing this |
|---|
| | 333 | contig. |
|---|
| | 334 | * FINISHED - Indicates that maker has finished processing this |
|---|
| | 335 | contig and all data is currently available in that subdirectory. |
|---|
| | 336 | * DIED - Indicates that maker failed. |
|---|
| | 337 | * DIED_SKIPPED_PERMANENT - Indicates that maker failed up to the |
|---|
| | 338 | specified number of retries and will not try again. |
|---|
| | 339 | * RETRY - Indicates that maker is retrying the contig after a |
|---|
| | 340 | failure. |
|---|
| | 341 | * SKIPPED_SMALL - Indicates that this contig was skipped because |
|---|
| | 342 | it is too short (based on control file values set by the user). |
|---|
| | 343 | |
|---|
| | 344 | #--------------------------------------------------------------------- |
|---|
| | 345 | |
|---|
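
The three-column index described above can be read with a few lines of Python. This is a sketch under stated assumptions: the README only says the file has three columns, so the tab delimiter is assumed, as is the idea that a contig may appear on several lines (for example STARTED followed by FINISHED) with the last line giving its current state. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Status values that indicate a failed contig, taken from the list above.
FAILED = {"DIED", "DIED_SKIPPED_PERMANENT"}


def latest_status(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each sequence id to (output location, last status seen)."""
    result: Dict[str, Tuple[str, str]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")  # assumed tab-delimited
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        seq_id, location, status = fields
        result[seq_id] = (location, status)  # later lines overwrite earlier ones
    return result


def failed_contigs(lines: Iterable[str]) -> List[str]:
    """Return the ids whose most recent status is a failure."""
    return sorted(seq_id for seq_id, (_, status) in latest_status(lines).items()
                  if status in FAILED)
```

A typical use would be `failed_contigs(open(index_path))`, where `index_path` points at the master_datastore_index file of a run.
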
| 232 | | Lines in the maker control files have the format key:value with no spaces before or after the colon(:). If the value is a file name, you can use relative paths and environmental variables, i.e. genome:$HOME/my_genome.fasta |
|---|
| 233 | | |
|---|
| 234 | | |
|---|
| 235 | | MAKER has 3 control files for configuration options. A fourth file evaluator.ctl is used to supply a MAKER related program EVALUATOR with options specific to that program (only important if 'evaluate' is set to 1 in maker_opts.ctl). |
|---|
| 236 | | |
|---|
| 237 | | Note that for all control files the comments written to help users begin with a pound sign(#). In addition, options before the colon(:) can not be changed, nor should there be a space before or after the colon. |
|---|
| 238 | | |
|---|
| 239 | | A. maker_exe.ctl - includes information about programs executed by MAKER. |
|---|
| | 348 | Lines in the maker control files have the format key:value with no |
|---|
| | 349 | spaces before or after the colon(:). If the value is a file name, you |
|---|
| | 350 | can use relative paths and environmental variables, |
|---|
| | 351 | i.e. genome:$HOME/my_genome.fasta |
|---|
| | 352 | |
|---|
| | 353 | |
|---|
| | 354 | MAKER has 3 control files for configuration options. A fourth file |
|---|
| | 355 | evaluator.ctl is used to supply a MAKER related program EVALUATOR with |
|---|
| | 356 | options specific to that program (only important if 'evaluate' is set |
|---|
| | 357 | to 1 in maker_opts.ctl). |
|---|
| | 358 | |
|---|
| | 359 | Note that for all control files the comments written to help users |
|---|
| | 360 | begin with a pound sign(#). In addition, options before the colon(:) |
|---|
| | 361 | can not be changed, nor should there be a space before or after the |
|---|
| | 362 | colon. |
|---|
| | 363 | |
|---|
| | 364 | A. maker_exe.ctl - includes information about programs executed by |
|---|
| | 365 | MAKER. |
|---|
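
The key:value layout described above is simple enough to read with a short parser. The sketch below is not MAKER's own reader: `parse_ctl` is a hypothetical helper, treating everything after a '#' as a comment is inferred from the sample files in this README, and expanding environment variables with `os.path.expandvars` is only an approximation of how MAKER may interpret values such as genome:$HOME/my_genome.fasta.

```python
import os
from typing import Dict, Iterable


def parse_ctl(lines: Iterable[str], expand: bool = False) -> Dict[str, str]:
    """Parse 'key:value' control-file lines into a dictionary.

    Text after '#' is dropped as a comment; lines without a colon
    (blank lines, '====' separators) are ignored.
    """
    options: Dict[str, str] = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # remove comment and whitespace
        if ":" not in line:
            continue
        key, value = line.split(":", 1)  # split on the first colon only
        value = value.strip()
        if expand:
            value = os.path.expandvars(value)  # e.g. $HOME/my_genome.fasta
        options[key] = value
    return options
```

Options with an empty value, such as genome_gff: in the sample maker_opts.ctl, are kept with an empty string so that callers can tell "unset" apart from "missing".
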
| 243 | | formatdb:/usr/local/bin/formatdb #location of NCBI formatdb executable |
|---|
| 244 | | blastall:/usr/local/bin/blastall #location of NCBI blastall executable |
|---|
| 245 | | xdformat:/usr/local/bin/xdformat #location of WUBLAST xdformat executable |
|---|
| 246 | | blastn:/usr/local/bin/blastn #location of WUBLAST blastn executable |
|---|
| 247 | | blastx:/usr/local/bin/blastx #location of WUBLAST blastx executable |
|---|
| 248 | | tblastx:/usr/local/bin/tblastx #location of WUBLAST tblastx executable |
|---|
| 249 | | RepeatMasker:/home/cholt/usr/local/RepeatMasker/RepeatMasker #location of RepeatMasker executable |
|---|
| 250 | | exonerate:/home/cholt/usr/local/exonerate/bin/exonerate #location of exonerate executable |
|---|
| | 369 | |
|---|
| | 370 | #location of NCBI formatdb executable |
|---|
| | 371 | formatdb:/usr/local/bin/formatdb |
|---|
| | 372 | #location of NCBI blastall executable |
|---|
| | 373 | blastall:/usr/local/bin/blastall |
|---|
| | 374 | #location of WUBLAST xdformat executable |
|---|
| | 375 | xdformat:/usr/local/bin/xdformat |
|---|
| | 376 | #location of WUBLAST blastn executable |
|---|
| | 377 | blastn:/usr/local/bin/blastn |
|---|
| | 378 | #location of WUBLAST blastx executable |
|---|
| | 379 | blastx:/usr/local/bin/blastx |
|---|
| | 380 | #location of WUBLAST tblastx executable |
|---|
| | 381 | tblastx:/usr/local/bin/tblastx |
|---|
| | 382 | #location of RepeatMasker executable |
|---|
| | 383 | RepeatMasker:/home/cholt/usr/local/RepeatMasker/RepeatMasker |
|---|
| | 384 | #location of exonerate executable |
|---|
| | 385 | exonerate:/home/cholt/usr/local/exonerate/bin/exonerate |
|---|
| 266 | | blast_type:wublast #set to 'wublast' or 'ncbi' |
|---|
| 267 | | |
|---|
| 268 | | pcov_blastn:0.8 #Blastn Percent Coverage Threshold EST-Genome Alignments |
|---|
| 269 | | pid_blastn:0.85 #Blastn Percent Identity Threshold EST-Genome Alignments |
|---|
| 270 | | eval_blastn:1e-10 #Blastn eval cutoff |
|---|
| 271 | | bit_blastn:40 #Blastn bit cutoff |
|---|
| 272 | | |
|---|
| 273 | | pcov_blastx:0.5 #Blastx Percent Coverage Threshold Protein-Genome Alignments |
|---|
| 274 | | pid_blastx:0.4 #Blastx Percent Identity Threshold Protein-Genome Alignments |
|---|
| 275 | | eval_blastx:1e-06 #Blastx eval cutoff |
|---|
| 276 | | bit_blastx:30 #Blastx bit cutoff |
|---|
| 277 | | |
|---|
| 278 | | pcov_rm_blastx:0.5 #Blastx Percent Coverage Threshold For Transposable Element Masking |
|---|
| 279 | | pid_rm_blastx:0.4 #Blastx Percent Identity Threshold For Transposable Element Masking |
|---|
| 280 | | eval_rm_blastx:1e-06 #Blastx eval cutoff for transposable element masking |
|---|
| 281 | | bit_rm_blastx:30 #Blastx bit cutoff for transposable element masking |
|---|
| 282 | | ==================================== |
|---|
| 283 | | |
|---|
| 284 | | |
|---|
| 285 | | C. maker_opts.ctl - contains options for maker and external programs used by maker |
|---|
| | 403 | #set to 'wublast' or 'ncbi' |
|---|
| | 404 | blast_type:wublast |
|---|
| | 405 | #Blastn Percent Coverage Threshold EST-Genome Alignments |
|---|
| | 406 | pcov_blastn:0.8 |
|---|
| | 407 | #Blastn Percent Identity Threshold EST-Genome Alignments |
|---|
| | 408 | pid_blastn:0.85 |
|---|
| | 409 | #Blastn eval cutoff |
|---|
| | 410 | eval_blastn:1e-10 |
|---|
| | 411 | #Blastn bit cutoff |
|---|
| | 412 | bit_blastn:40 |
|---|
| | 413 | |
|---|
| | 414 | #Blastx Percent Coverage Threshold Protein-Genome Alignments |
|---|
| | 415 | pcov_blastx:0.5 |
|---|
| | 416 | #Blastx Percent Identity Threshold Protein-Genome Alignments |
|---|
| | 417 | pid_blastx:0.4 |
|---|
| | 418 | #Blastx eval cutoff |
|---|
| | 419 | eval_blastx:1e-06 |
|---|
| | 420 | #Blastx bit cutoff |
|---|
| | 421 | bit_blastx:30 |
|---|
| | 422 | |
|---|
| | 423 | #Blastx Percent Coverage Threshold For Transposable Element Masking |
|---|
| | 424 | pcov_rm_blastx:0.5 |
|---|
| | 425 | #Blastx Percent Identity Threshold For Transposable Element Masking |
|---|
| | 426 | pid_rm_blastx:0.4 |
|---|
| | 427 | #Blastx eval cutoff for transposable element masking |
|---|
| | 428 | eval_rm_blastx:1e-06 |
|---|
| | 429 | #Blastx bit cutoff for transposable element masking |
|---|
| | 430 | bit_rm_blastx:30 |
|---|
| | 431 | |
|---|
| | 432 | C. maker_opts.ctl - contains options for maker and external programs |
|---|
| | 433 | used by maker. |
|---|
| 292 | | genome_gff: #re-annotate genome based on this gff3 file |
|---|
| 293 | | est_pass:0 #use ests in genome_gff: 1 = yes, 0 = no |
|---|
| 294 | | altest_pass:0 #use alternate organism ests in genome_gff: 1 = yes, 0 = no |
|---|
| 295 | | protein_pass:0 #use proteins in genome_gff: 1 = yes, 0 = no |
|---|
| 296 | | rm_pass:0 #use repeats in genome_gff: 1 = yes, 0 = no |
|---|
| 297 | | model_pass:0 #use gene models in genome_gff: 1 = yes, 0 = no |
|---|
| 298 | | pred_pass:0 #use ab-initio predictions in genome_gff: 1 = yes, 0 = no |
|---|
| 299 | | other_pass:0 #passthrough everything else in genome_gff: 1 = yes, 0 = no |
|---|
| | 441 | |
|---|
| | 442 | #re-annotate genome based on this gff3 file |
|---|
| | 443 | genome_gff: |
|---|
| | 444 | #use ests in genome_gff: 1 = yes, 0 = no |
|---|
| | 445 | est_pass:0 |
|---|
| | 446 | #use alternate organism ests in genome_gff: 1 = yes, 0 = no |
|---|
| | 447 | altest_pass:0 |
|---|
| | 448 | #use proteins in genome_gff: 1 = yes, 0 = no |
|---|
| | 449 | protein_pass:0 |
|---|
| | 450 | #use repeats in genome_gff: 1 = yes, 0 = no |
|---|
| | 451 | rm_pass:0 |
|---|
| | 452 | #use gene models in genome_gff: 1 = yes, 0 = no |
|---|
| | 453 | model_pass:0 |
|---|
| | 454 | #use ab-initio predictions in genome_gff: 1 = yes, 0 = no |
|---|
| | 455 | pred_pass:0 |
|---|
| | 456 | #passthrough everything else in genome_gff: 1 = yes, 0 = no |
|---|
| | 457 | other_pass:0 |
|---|
| 302 | | est:input/est.fasta #non-redundant set of assembled ESTs in fasta format (classic EST analysis) |
|---|
| 303 | | est_reads: #un-assembled EST reads in fasta format (for deep nextgen mRNASeq) |
|---|
| 304 | | altest:input/altest.fasta #EST/cDNA sequence file in fasta format from an alternate organism |
|---|
| 305 | | est_gff: #EST evidence from a separate gff3 file |
|---|
| 306 | | altest_gff: #Alternate organism EST evidence from a separate gff3 file |
|---|
| 307 | | |
|---|
| 308 | | #-----Protein Homology Evidence (you must provide a value for at least one) |
|---|
| 309 | | protein:input/protein.fasta #protein sequence file in fasta format |
|---|
| 310 | | protein_gff: #protein homology evidence from a gff3 file |
|---|
| 311 | | ==================================== |
|---|
| 312 | | |
|---|
| 313 | | #---------------------------------------------------- |
|---|
| | 460 | |
|---|
| | 461 | #non-redundant set of assembled ESTs in fasta format (classic EST |
|---|
| | 462 | #analysis) |
|---|
| | 463 | est:input/est.fasta |
|---|
| | 464 | #un-assembled EST reads in fasta format (for deep nextgen mRNASeq) |
|---|
| | 465 | est_reads: |
|---|
| | 466 | #EST/cDNA sequence file in fasta format from an alternate organism |
|---|
| | 467 | altest:input/altest.fasta |
|---|
| | 468 | #EST evidence from a separate gff3 file |
|---|
| | 469 | est_gff: |
|---|
| | 470 | #Alternate organism EST evidence from a separate gff3 file |
|---|
| | 471 | altest_gff: |
|---|
| | 472 | |
|---|
| | 473 | #-----Protein Homology Evidence (you must provide a value for at least |
|---|
| | 474 | # one) |
|---|
| | 475 | #protein sequence file in fasta format |
|---|
| | 476 | protein:input/protein.fasta |
|---|
| | 477 | #protein homology evidence from a gff3 file |
|---|
| | 478 | protein_gff: |
|---|
| | 479 | |
|---|
| | 480 | #--------------------------------------------------------------------- |
|---|
| | 481 | |
|---|
| 316 | | If you have data from a source that MAKER does not support, and you wish to use the data in annotating a genome, then you can pass the data to MAKER as an aligned GFF3 file. This is done by supplying the file's location as the value of the appropriate option in the maker_opts.ctl file (i.e. est_gff:input/est.gff). Note that MAKER expects all data sent to it to be of the type specified, so don't put mixed data in a file (i.e. don't mix EST and other data in the file pointed to by est_gff, otherwise it all gets used as EST data). Also the genome_gff option is only for MAKER-produced GFF3 files. Other GFF3 files of mixed data must be split by type and identified by the appropriate control file option (i.e. rm_gff for repeat data, pred_gff for ab-initio prediction data, est_gff for EST data, etc.). |
|---|
| 317 | | |
|---|
| 318 | | #---------------------------------------------------- |
|---|
| | 484 | If you have data from a source that MAKER does not support, and you |
|---|
| | 485 | wish to use the data in annotating a genome, then you can pass the |
|---|
| | 486 | data to MAKER as an aligned GFF3 file. This is done by supplying the |
|---|
| | 487 | file's location to the appropriate value in the maker_opts.ctl file |
|---|
| | 488 | (i.e. est_gff:input/est.gff). Note that MAKER expects all data sent |
|---|
| | 489 | to it to be of the type specified, so don't put mixed data in a file |
|---|
| | 490 | (i.e. don't mix EST and other data in the file pointed to by est_gff, |
|---|
| | 491 | otherwise it all gets used as EST data). Also the genome_gff option |
|---|
| | 492 | is only for MAKER produced GFF3 files. Other GFF3 files of mixed data |
|---|
| | 493 | must be split by type and identified by the appropriate control file |
|---|
| | 494 | option (i.e. rm_gff for repeat data, pred_gff for ab-initio prediction |
|---|
| | 495 | data, est_gff for EST data, etc.). |
|---|
| | 496 | |
|---|
| | 497 | #--------------------------------------------------------------------- |
|---|
| | 498 | |
|---|
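
The requirement above that mixed GFF3 data be split by type can be sketched in shell. This is only an illustration under stated assumptions: it assumes the source column (column 2) of the GFF3 file is enough to tell the evidence types apart, and the function name `split_gff_by_source` and the output file naming are hypothetical, not part of MAKER.

```shell
# Split a mixed GFF3 file into one file per value of the source column.
# Assumption: column 2 (source) distinguishes the evidence types.
# The function name and the <source>.gff naming scheme are hypothetical.
split_gff_by_source() {
    in_file=$1
    out_dir=$2
    mkdir -p "$out_dir"
    awk -F '\t' -v dir="$out_dir" '
        /^#/ || NF < 9 { next }   # skip comments, directives and non-feature lines
        {
            src = $2
            gsub(/[^A-Za-z0-9_.-]/, "_", src)   # make the source safe as a file name
            out = dir "/" src ".gff"
            if (!(out in seen)) {               # first feature for this source
                print "##gff-version 3" > out
                seen[out] = 1
            }
            print > out
        }
    ' "$in_file"
}
```

Each resulting file can then be checked by hand and listed under the matching control file option (est_gff, rm_gff, pred_gff, and so on). Which source name belongs under which option depends on how the original file was produced, so that mapping is left to the user.
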
| 338 | | A. First you will need to determine the genes used to model future genes, by determining a high quality gene set (annotations for the high quality gene set should be in GFF3 format). The high quality gene set can then be converted into SNAP ZFF format using maker2zff.pl found in maker/bin. |
|---|
| 339 | | |
|---|
| 340 | | This program is run with the following command: |
|---|
| 341 | | |
|---|
| 342 | | maker2zff.pl <directory> genome |
|---|
| 343 | | |
|---|
| 344 | | *<directory> is the directory where all of your GFF3 files are located |
|---|
| 345 | | *genome is the name for the outfile |
|---|
| 346 | | |
|---|
| 347 | | Files Created: |
|---|
| 348 | | |
|---|
| 349 | | genome.ann |
|---|
| 350 | | genome.dna |
|---|
| 351 | | |
|---|
| 352 | | Note: A convenient way to identify an initial high quality gene set for the HMM is to use the -predictor est2genome option in maker. This will produce gene annotations based solely on EST evidence. These annotations can then seed the first HMM. After running maker again using this new HMM and the -predictor snap option, you can use the second round of annotations as the seed for an even better HMM model. In this way the HMM model progressively improves with each run of maker. |
|---|
| 353 | | |
|---|
| 354 | | Another strategy for identifying an initial gene set to model the HMM is to use the program CEGMA (http://korflab.ucdavis.edu/software.html). CEGMA builds a highly reliable set of gene annotations in the absence of experimental data by identifying DNA regions with homology to a set of 458 proteins that are highly conserved among taxa. |
|---|
| 355 | | |
|---|
| 356 | | Combining both CEGMA and maker datasets to build the first HMM is also a good strategy. |
|---|
| 357 | | |
|---|
| 358 | | |
|---|
| 359 | | B. Next you will use the DNA and ZFF files (genome.dna and genome.ann) to produce a SNAP HMM as described in the SNAP documentation (which we have provided below): |
|---|
| 360 | | |
|---|
| 361 | | The first step is to look at some features of the genes: |
|---|
| 362 | | |
|---|
| 363 | | fathom genome.ann genome.dna -gene-stats |
|---|
| 364 | | |
|---|
| 365 | | Next, you want to verify that the genes have no obvious errors: |
|---|
| 366 | | |
|---|
| 367 | | fathom genome.ann genome.dna -validate |
|---|
| 368 | | |
|---|
| 369 | | You may find some errors and warnings. Check these out in some kind of genome |
|---|
| 370 | | browser and remove those that are real errors. Next, break up the sequences into |
|---|
| 371 | | fragments with one gene per sequence with the following command: |
|---|
| 372 | | |
|---|
| 373 | | fathom genome.ann genome.dna -categorize 1000 |
|---|
| 374 | | |
|---|
| 375 | | There will be up to 1000 bp on either side of the genes. You will find |
|---|
| 376 | | several new files. |
|---|
| 377 | | |
|---|
| 378 | | alt.ann, alt.dna (genes with alternative splicing) |
|---|
| 379 | | err.ann, err.dna (genes that have errors) |
|---|
| 380 | | olp.ann, olp.dna (genes that overlap other genes) |
|---|
| 381 | | wrn.ann, wrn.dna (genes with warnings) |
|---|
| 382 | | uni.ann, uni.dna (single gene per sequence) |
|---|
| 383 | | |
|---|
| 384 | | Convert the uni genes to plus stranded with the command: |
|---|
| 385 | | |
|---|
| 386 | | fathom uni.ann uni.dna -export 1000 -plus |
|---|
| 387 | | |
|---|
| 388 | | You will find 4 new files: |
|---|
| 389 | | |
|---|
| 390 | | export.aa proteins corresponding to each gene |
|---|
| 391 | | export.ann gene structure on the plus strand |
|---|
| 392 | | export.dna DNA of the plus strand |
|---|
| 393 | | export.tx transcripts for each gene |
|---|
| 394 | | |
|---|
| 395 | | The parameter estimation program, forge, creates a lot of files. You probably |
|---|
| 396 | | want to create a directory to keep things tidy before you execute the program. |
|---|
| 397 | | |
|---|
| 398 | | mkdir params |
|---|
| 399 | | cd params |
|---|
| 400 | | forge ../export.ann ../export.dna |
|---|
| 401 | | cd .. |
|---|
| 402 | | |
|---|
| 403 | | Last is to build an HMM. |
|---|
| 404 | | |
|---|
| 405 | | hmm-assembler.pl my-genome params > my-genome.hmm |
|---|
| 406 | | |
|---|
| 407 | | |
|---|
| 408 | | Finally, add the location of your HMM file to your maker_opts.ctl file. |
|---|
| 409 | | |
|---|
| 410 | | *For more information see SNAP documentation on how to build an HMM |
|---|
| | 529 | |
|---|
| | 530 | A) First you will need to determine the genes used to model future |
|---|
| | 531 | genes, by determining a high quality gene set (annotations for the |
|---|
| | 532 | high quality gene should be in GFF3 format). The high quality gene |
|---|
| | 533 | set can then be converted into SNAP ZFF format using maker2zff.pl |
|---|
| | 534 | found in maker/bin. |
|---|
| | 535 | |
|---|
| | 536 | This program is run with the following command: |
|---|
| | 537 | |
|---|
| | 538 | maker2zff.pl <directory> genome |
|---|
| | 539 | |
|---|
| | 540 | * <directory> is the directory where all of your GFF3 files are |
|---|
| | 541 | located |
|---|
| | 542 | * genome is the name for the outfile |
|---|
| | 543 | |
|---|
| | 544 | Files Created: |
|---|
| | 545 | genome.ann |
|---|
| | 546 | genome.dna |
|---|
| | 547 | |
|---|
| | 548 | Note: A convenient way to identify an initial high quality gene |
|---|
| | 549 | set for the HMM is to use the -predictor est2genome option in |
|---|
| | 550 | maker. This will produce gene annotations based solely on EST |
|---|
| | 551 | evidence. These annotations can then seed the first HMM. After |
|---|
| | 552 | running maker again using this new HMM and the -predictor snap |
|---|
| | 553 | option, you can use the second round of annotations as the seed |
|---|
| | 554 | for an even better HMM model. In this way the HMM model |
|---|
| | 555 | progressively improves with each run of maker. |
|---|
| | 556 | |
|---|
| | 557 | Another strategy for identifying an initial gene set to model the |
|---|
| | 558 | HMM is to use the program CEGMA |
|---|
| | 559 | (http://korflab.ucdavis.edu/software.html). CEGMA builds a highly |
|---|
| | 560 | reliable set of gene annotations in the absence of experimental |
|---|
| | 561 | data by identifying DNA regions with homology to a set of 458 |
|---|
| | 562 | proteins that are highly conserved among taxa. |
|---|
| | 563 | |
|---|
| | 564 | Combining both CEGMA and maker datasets to build the first HMM is |
|---|
| | 565 | also a good strategy. |
|---|
| | 566 | |
|---|
| | 567 | B) Next you will use the DNA and ZFF files (genome.dna and genome.ann) |
|---|
| | 568 | to produce a SNAP HMM as described in the SNAP documentation (which |
|---|
| | 569 | we have provided below): |
|---|
| | 570 | |
|---|
| | 571 | The first step is to look at some features of the genes: |
|---|
| | 572 | |
|---|
| | 573 | fathom genome.ann genome.dna -gene-stats |
|---|
| | 574 | |
|---|
| | 575 | Next, you want to verify that the genes have no obvious errors: |
|---|
| | 576 | |
|---|
| | 577 | fathom genome.ann genome.dna -validate |
|---|
| | 578 | |
|---|
| | 579 | You may find some errors and warnings. Check these out in some kind |
|---|
| | 580 | of genome browser and remove those that are real errors. Next, |
|---|
| | 581 | break up the sequences into fragments with one gene per sequence |
|---|
| | 582 | with the following command: |
|---|
| | 583 | |
|---|
| | 584 | fathom genome.ann genome.dna -categorize 1000 |
|---|
| | 585 | |
|---|
| | 586 | There will be up to 1000 bp on either side of the genes. You will |
|---|
| | 587 | find several new files. |
|---|
| | 588 | |
|---|
| | 589 | alt.ann, alt.dna (genes with alternative splicing) |
|---|
| | 590 | err.ann, err.dna (genes that have errors) |
|---|
| | 591 | olp.ann, olp.dna (genes that overlap other genes) |
|---|
| | 592 | wrn.ann, wrn.dna (genes with warnings) |
|---|
| | 593 | uni.ann, uni.dna (single gene per sequence) |
|---|
| | 594 | |
|---|
| | 595 | Convert the uni genes to plus stranded with the command: |
|---|
| | 596 | |
|---|
| | 597 | fathom uni.ann uni.dna -export 1000 -plus |
|---|
| | 598 | |
|---|
| | 599 | You will find 4 new files: |
|---|
| | 600 | |
|---|
| | 601 | export.aa proteins corresponding to each gene |
|---|
| | 602 | export.ann gene structure on the plus strand |
|---|
| | 603 | export.dna DNA of the plus strand |
|---|
| | 604 | export.tx transcripts for each gene |
|---|
| | 605 | |
|---|
| | 606 | The parameter estimation program, forge, creates a lot of files. |
|---|
| | 607 | You probably want to create a directory to keep things tidy before |
|---|
| | 608 | you execute the program. |
|---|
| | 609 | |
|---|
| | 610 | mkdir params |
|---|
| | 611 | cd params |
|---|
| | 612 | forge ../export.ann ../export.dna |
|---|
| | 613 | cd .. |
|---|
| | 614 | |
|---|
| | 615 | Last is to build an HMM. |
|---|
| | 616 | |
|---|
| | 617 | hmm-assembler.pl my-genome params > my-genome.hmm |
|---|
| | 618 | |
|---|
| | 619 | Finally, add the location of your HMM file to your |
|---|
| | 620 | maker_opts.ctl file. |
|---|
| | 621 | |
|---|
| | 622 | * For more information see SNAP documentation on how to build an HMM |
|---|
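
The SNAP training steps in part B can be collected into a single shell function, which may help when the HMM is retrained after each MAKER run. This is a sketch, not an official script: it assumes fathom, forge and hmm-assembler.pl are on $PATH and that genome.ann and genome.dna are in the current directory. The function name `train_snap` and the `DRY_RUN` switch are hypothetical and are not part of SNAP or MAKER.

```shell
# Run the SNAP training steps in order, stopping at the first failure.
# Assumptions: fathom, forge and hmm-assembler.pl are on $PATH, and
# genome.ann / genome.dna exist in the current directory.
# train_snap and DRY_RUN are hypothetical names used for illustration.
train_snap() {
    name=$1    # label for the HMM, e.g. my-genome

    # Execute a command, or only print it when DRY_RUN=1.
    run() {
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    }

    run fathom genome.ann genome.dna -gene-stats &&
    run fathom genome.ann genome.dna -validate &&
    run fathom genome.ann genome.dna -categorize 1000 &&
    run fathom uni.ann uni.dna -export 1000 -plus &&
    run mkdir -p params &&
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "(cd params && forge ../export.ann ../export.dna)"
        echo "hmm-assembler.pl $name params > $name.hmm"
    else
        # forge writes many files, so it runs inside params/
        (cd params && forge ../export.ann ../export.dna) &&
        hmm-assembler.pl "$name" params > "$name.hmm"
    fi
}
```

Running `DRY_RUN=1 train_snap my-genome` prints the commands without executing them, which is a cheap way to check the sequence first. Note that the function does not pause after the -validate step, so any genes with real errors still have to be reviewed and removed by hand as described above.
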