| 1 |
***MAKER Documentation*** |
|---|
| 2 |
|
|---|
| 3 |
#--------------------------------------------------------------------- |
|---|
| 4 |
|
|---|
| 5 |
INSTALLATION INSTRUCTIONS FOR MAKER |
|---|
| 6 |
|
|---|
| 7 |
*Step by step instructions are also available in the INSTALL text |
|---|
| 8 |
file. |
|---|
| 9 |
|
|---|
| 10 |
MAKER is an annotation pipeline. In other words, it links together |
|---|
| 11 |
many steps and programs to produce final annotations. For this |
|---|
| 12 |
reason, you must first install a number of programs that MAKER depends |
|---|
| 13 |
on. |
|---|
| 14 |
|
|---|
| 15 |
|
|---|
| 16 |
To install maker, you will first need to install the following |
|---|
| 17 |
external programs: |
|---|
| 18 |
|
|---|
| 19 |
*PERL 5.8.0 or higher |
|---|
| 20 |
*BioPerl 1.5 or higher (www.bioperl.org) |
|---|
| 21 |
*SNAP version 2009-02-03 or higher (homepage.mac.com/iankorf) |
|---|
| 22 |
*RepeatMasker 3.1.6 or higher (www.repeatmasker.org) |
|---|
| 23 |
*Exonerate 1.4 or higher (www.ebi.ac.uk/~guy/exonerate) |
|---|
| 24 |
|
|---|
| 25 |
You must also install one of the following: |
|---|
| 26 |
|
|---|
| 27 |
*Wu-BLAST 2.0 or higher (Wu-BLAST is becoming AB-BLAST which can |
|---|
| 28 |
not yet be downloaded) |
|---|
| 29 |
or |
|---|
| 30 |
*NCBI BLAST 2.2.X or higher |
|---|
| 31 |
(http://www.ncbi.nlm.nih.gov/BLAST/download.shtml) |
|---|
| 32 |
|
|---|
| 33 |
You might want to also install these optional external programs: |
|---|
| 34 |
|
|---|
| 35 |
*Augustus 2.0 or higher (augustus.gobics.de) |
|---|
| 36 |
*GeneMark.hmm-E 3.9 or higher (exon.biology.gatech.edu) |
|---|
| 37 |
*FgenesH (www.softberry.com/) - requires licence |
|---|
| 38 |
|
|---|
| 39 |
To install mpi_maker, you must have an MPI package installed. Try the |
|---|
| 40 |
following: |
|---|
| 41 |
|
|---|
| 42 |
*MPICH2 (http://www.mcs.anl.gov/research/projects/mpich2/) |
|---|
| 43 |
|
|---|
| 44 |
Note: Remember to install MPICH2 with the --enable-sharedlibs flag set |
|---|
| 45 |
to the appropriate value (See MPICH2 Installer's Guide at |
|---|
| 46 |
http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs). |
|---|
| 47 |
|
|---|
| 48 |
|
|---|
| 49 |
Notes: |
|---|
| 50 |
|
|---|
| 51 |
1) Wu-BLAST is becoming AB-BLAST. Once AB-BLAST becomes available we |
|---|
| 52 |
will do some testing to see if it is compatible with MAKER. |
|---|
| 53 |
Wu-BLAST is no longer available online, so if you don't already |
|---|
| 54 |
have it, you will have to use NCBI BLAST instead. |
|---|
| 55 |
2) RepeatMasker requires Wu-BLAST or Cross_Match and a single file |
|---|
| 56 |
executable called TRF (see RepeatMasker website for details), so |
|---|
| 57 |
please install these before installing RepeatMasker. |
|---|
| 58 |
3) Exonerate binaries can be downloaded from the website. If you use |
|---|
| 59 |
Mac OSX, however, binaries are only available for version 1.0. |
|---|
| 60 |
This version will work too. If you would like to compile exonerate, |
|---|
| 61 |
it requires GLIB, a C-library, that has a link from the exonerate |
|---|
| 62 |
website. If you use Mac OSX, GLIB can be downloaded using FINK. |
|---|
| 63 |
4) RepeatMasker requires a repeat library file, which can be |
|---|
| 64 |
downloaded from Repbase upon registration |
|---|
| 65 |
(http://www.girinst.org/); this is explained on the RepeatMasker |
|---|
| 66 |
website. |
|---|
| 67 |
5) Please note the location of all of the programs that you have |
|---|
| 68 |
installed, and add them to your $PATH variable in your .profile |
|---|
| 69 |
file. You will need this information in the maker_exe.ctl file, one |
|---|
| 70 |
of MAKER's 3 control files. |
|---|
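As a quick check for note 5, the sketch below lists which programs cannot
be found on your $PATH. The function name missing_programs is made up for
this example and is not part of MAKER. The program names in the usage line
are the executables named in the maker_exe.ctl example later in this
document, and may be named differently on your system.

```shell
# Print the name of every listed program that is NOT found on $PATH.
# Prints nothing when all of them are found.
missing_programs() {
    for prog in "$@"; do
        command -v "$prog" >/dev/null 2>&1 || echo "$prog"
    done
}

# Usage (adjust the list to the programs you actually installed):
#   missing_programs perl RepeatMasker exonerate snap blastall formatdb
```

Any name that is printed either is not installed or lives in a directory
that still needs to be added to $PATH.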
| 71 |
|
|---|
| 72 |
Now that you have all the necessary programs installed, MAKER can be |
|---|
| 73 |
unpacked using: |
|---|
| 74 |
|
|---|
| 75 |
tar xvfz maker.tar.gz |
|---|
| 76 |
|
|---|
| 77 |
This will create a directory called maker with 6 subdirectories: |
|---|
| 78 |
|
|---|
| 79 |
bin - contains the maker executables. |
|---|
| 80 |
lib - contains all the necessary perl libraries for MAKER. |
|---|
| 81 |
MPI - contains MPI specific data to configure MAKER for a |
|---|
| 82 |
cluster that supports MPI. |
|---|
| 83 |
Apollo - contains gff3.tiers file (See section titled APOLLO |
|---|
| 84 |
below) |
|---|
| 85 |
data - contains some sample data used to make sure everything |
|---|
| 86 |
works. |
|---|
| 87 |
perl - contains perl modules that need to be compiled |
|---|
| 88 |
|
|---|
| 89 |
Finally, change to the maker/perl directory and type: 'perl Install.PL' |
|---|
| 90 |
to compile required perl modules. |
|---|
| 91 |
|
|---|
| 92 |
Now you can run MAKER!! |
|---|
| 93 |
|
|---|
| 94 |
Maker uses control files to guide each run. Generic control files can |
|---|
| 95 |
be built using the -CTL flag in maker. These control files can then |
|---|
| 96 |
be edited by the user to identify the location of all required input |
|---|
| 97 |
data and statistics. Control files are run specific and separate |
|---|
| 98 |
control files will need to be built for each genome given to maker. Maker |
|---|
| 99 |
will look for control files in the current working directory, so it is |
|---|
| 100 |
recommended that maker be run in a separate directory containing |
|---|
| 101 |
unique control files for each genome. |
|---|
| 102 |
|
|---|
| 103 |
Control files: |
|---|
| 104 |
|
|---|
| 105 |
1. maker_exe.ctl - contains the path information for needed |
|---|
| 106 |
executables. |
|---|
| 107 |
|
|---|
| 108 |
2. maker_bopts.ctl - contains filtering statistics for BLAST and |
|---|
| 109 |
Exonerate. |
|---|
| 110 |
|
|---|
| 111 |
3. maker_opts.ctl - contains all other information for MAKER, |
|---|
| 112 |
including the location of the input genome file. |
|---|
| 113 |
|
|---|
| 114 |
Always remember to examine the control files before each run of MAKER |
|---|
| 115 |
on your specific data. |
|---|
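The one-directory-per-genome advice above can be sketched as a small shell
function. This is only an illustration: the function name new_maker_run and
the MAKER variable (used to point at the maker executable) are hypothetical
and not part of MAKER. Only the -CTL flag comes from this document.

```shell
# Create a run directory for one genome and generate generic control
# files inside it. Set MAKER to the full path of the maker executable
# if it is not on $PATH (MAKER is a hypothetical variable for this sketch).
new_maker_run() {
    run_dir=$1
    maker_cmd=${MAKER:-maker}
    mkdir -p "$run_dir" &&
    ( cd "$run_dir" && "$maker_cmd" -CTL )
}

# Usage:
#   new_maker_run runs/genomeA
#   new_maker_run runs/genomeB
```

Each run directory then holds its own control files, which still have to be
edited by hand before maker is started from inside that directory.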
| 116 |
|
|---|
| 117 |
Programs required by maker rely on certain environmental variables |
|---|
| 118 |
being set. If you have not set these variables per the installation |
|---|
| 119 |
instructions of the external programs, a reminder list is provided |
|---|
| 120 |
below: |
|---|
| 121 |
|
|---|
| 122 |
for tcsh: |
|---|
| 123 |
setenv PERL5LIB where_bioperl_is_installed |
|---|
| 124 |
setenv WUBLASTMAT where_wublast_is_installed/matrix |
|---|
| 125 |
setenv ZOE where_snap_is_installed |
|---|
| 126 |
setenv WUBLASTFILTER where_wublast_is_installed/filter |
|---|
| 127 |
setenv AUGUSTUS_CONFIG_PATH where_augustus_is_installed/config |
|---|
| 128 |
|
|---|
| 129 |
for bash: |
|---|
| 130 |
export PERL5LIB=where_bioperl_is_installed |
|---|
| 131 |
export WUBLASTMAT=where_wublast_is_installed/matrix |
|---|
| 132 |
export ZOE=where_snap_is_installed |
|---|
| 133 |
export WUBLASTFILTER=where_wublast_is_installed/filter |
|---|
| 134 |
export AUGUSTUS_CONFIG_PATH=where_augustus_is_installed/config |
|---|
| 135 |
|
|---|
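To see which of these variables are still unset in your current shell, a
check along the following lines may help. The function name
check_maker_env is made up for this example. Which variables you actually
need depends on the programs you installed (for instance, the WUBLAST
variables only matter if you use Wu-BLAST).

```shell
# Report every named environment variable that is unset or empty.
# Returns 0 when all are set, 1 otherwise.
check_maker_env() {
    rc=0
    for var in "$@"; do
        # Indirect lookup of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var is not set"
            rc=1
        fi
    done
    return $rc
}

# Usage:
#   check_maker_env PERL5LIB ZOE AUGUSTUS_CONFIG_PATH
```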
| 136 |
#--------------------------------------------------------------------- |
|---|
| 137 |
|
|---|
| 138 |
MPI MAKER INSTALL |
|---|
| 139 |
|
|---|
| 140 |
If you are running maker on an MPI capable cluster, you can install an |
|---|
| 141 |
MPI version of maker by doing the following: |
|---|
| 142 |
|
|---|
| 143 |
1. Install standard maker and verify that it runs. |
|---|
| 144 |
2. Install MPICH2 with the --enable-sharedlibs flag set to the |
|---|
| 145 |
appropriate value (See MPICH2 documentation) |
|---|
| 146 |
3. Use cd to change to the MPI subdirectory in the maker |
|---|
| 147 |
installation folder (i.e. maker/MPI/) |
|---|
| 148 |
4. Run Install.PL by typing: perl Install.PL |
|---|
| 149 |
|
|---|
| 150 |
A new version of maker called mpi_maker should now be installed under |
|---|
| 151 |
maker/bin. |
|---|
| 152 |
|
|---|
| 153 |
To run mpi_maker, first verify that your MPI environment is initiated |
|---|
| 154 |
(i.e. using the mpdboot or mpd command). Now start mpi_maker via |
|---|
| 155 |
mpiexec. |
|---|
| 156 |
|
|---|
| 157 |
Example: (This will run MAKER on 3 nodes or processors) |
|---|
| 158 |
|
|---|
| 159 |
mpiexec -n 3 perl maker_directory/maker/bin/mpi_maker maker_opts.ctl \ |
|---|
| 160 |
maker_bopts.ctl maker_exe.ctl |
|---|
| 161 |
|
|---|
| 162 |
Please see the documentation of the MPI environment you use for |
|---|
| 163 |
instructions on how to initiate an MPI process. |
|---|
| 164 |
|
|---|
| 165 |
#--------------------------------------------------------------------- |
|---|
| 166 |
|
|---|
| 167 |
MAKER USAGE STATEMENT |
|---|
| 168 |
|
|---|
| 169 |
Usage: |
|---|
| 170 |
|
|---|
| 171 |
maker [options] <maker_opts> <maker_bopts> <maker_exe> <evaluator> |
|---|
| 172 |
|
|---|
| 173 |
Maker is a program that produces gene annotations in GFF3 file format |
|---|
| 174 |
using evidence such as EST alignments and protein homology. Maker can |
|---|
| 175 |
be used to produce gene annotations for new genomes as well as update |
|---|
| 176 |
annotations from existing genome databases. |
|---|
| 177 |
|
|---|
| 178 |
The four input arguments are user control files that specify how maker |
|---|
| 179 |
should behave. The evaluator options file contains control options |
|---|
| 180 |
specific for the evaluation of gene annotations. All options for maker |
|---|
| 181 |
should be set in the control files, but a few can also be set on the |
|---|
| 182 |
command line. Command line options provide a convenient mechanism to |
|---|
| 183 |
override commonly altered control file values. |
|---|
| 184 |
|
|---|
| 185 |
Input files listed in the control options files must be in fasta |
|---|
| 186 |
format. Please see maker documentation to learn more about control |
|---|
| 187 |
file configuration. Maker will automatically try to locate the user |
|---|
| 188 |
control files in the current working directory if these arguments are |
|---|
| 189 |
not supplied when initializing maker. |
|---|
| 190 |
|
|---|
| 191 |
It is important to note that maker does not try to recalculate data |
|---|
| 192 |
that it has already calculated. For example, if you run an analysis |
|---|
| 193 |
twice on the same dataset file you will notice that maker does not |
|---|
| 194 |
rerun any of the blast analyses, but instead uses the blast analyses |
|---|
| 195 |
stored from the previous run. To force maker to rerun all analyses, |
|---|
| 196 |
use the -f flag. |
|---|
| 197 |
|
|---|
| 198 |
|
|---|
| 199 |
Options: |
|---|
| 200 |
|
|---|
| 201 |
-genome|g <filename> Specify the genome file. |
|---|
| 202 |
|
|---|
| 203 |
-predictor|p <type> Selects the predictor(s) to use when |
|---|
| 204 |
building annotations. Use a ',' to |
|---|
| 205 |
separate types (no spaces). |
|---|
| 206 |
i.e. -predictor=snap,augustus,fgenesh |
|---|
| 207 |
|
|---|
| 208 |
types: snap |
|---|
| 209 |
augustus |
|---|
| 210 |
fgenesh |
|---|
| 211 |
genemark |
|---|
| 212 |
est2genome (Uses ESTs directly) |
|---|
| 213 |
abinit (ab-initio predictions) |
|---|
| 214 |
model_gff (Passes through GFF3 |
|---|
| 215 |
annotations) |
|---|
| 216 |
|
|---|
| 217 |
-RM_off|R Turns all repeat masking off. |
|---|
| 218 |
|
|---|
| 219 |
-retry <integer> Rerun failed contigs up to the specified count. |
|---|
| 220 |
|
|---|
| 221 |
-cpus|c <integer> Tells how many cpus to use for BLAST analysis. |
|---|
| 222 |
|
|---|
| 223 |
-force|f Forces maker to delete old files before running |
|---|
| 224 |
again. This will require all blast analyses to |
|---|
| 225 |
be rerun. |
|---|
| 226 |
|
|---|
| 227 |
-evaluate|e Run Evaluator on final annotations (under |
|---|
| 228 |
development). |
|---|
| 229 |
|
|---|
| 230 |
-quiet|q Silences most of maker's status messages. |
|---|
| 231 |
|
|---|
| 232 |
-CTL Generate empty control files in the current |
|---|
| 233 |
directory. |
|---|
| 234 |
|
|---|
| 235 |
-help|? Prints this usage statement. |
|---|
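Because the -predictor value must be a comma-separated list with no spaces,
a small check before launching a long run can catch typos early. The
following is a sketch only: the function name valid_predictors is made up
for this example, and the accepted type names are simply the ones listed in
the usage statement above.

```shell
# Succeed (exit status 0) only if the argument is a comma-separated list
# of predictor types from the usage statement, with no spaces and no
# empty entries.
valid_predictors() {
    rest="$1,"
    while [ -n "$rest" ]; do
        p=${rest%%,*}      # text before the first comma
        rest=${rest#*,}    # text after the first comma
        case $p in
            snap|augustus|fgenesh|genemark|est2genome|abinit|model_gff) ;;
            *) echo "unknown predictor type: '$p'" >&2; return 1 ;;
        esac
    done
    return 0
}

# Usage:
#   valid_predictors snap,augustus && maker -predictor=snap,augustus
```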
| 236 |
|
|---|
| 237 |
|
|---|
| 238 |
#--------------------------------------------------------------------- |
|---|
| 239 |
|
|---|
| 240 |
RUNNING MAKER WITH EXAMPLE DATA |
|---|
| 241 |
|
|---|
| 242 |
1) Copy the files in the data directory to a temporary directory |
|---|
| 243 |
where you will run an example file. |
|---|
| 244 |
2) Type maker -CTL to generate generic maker control files |
|---|
| 245 |
3) Next you will need to edit the control files to include the path of |
|---|
| 246 |
the genome file, EST file, and protein file, as well as the paths |
|---|
| 247 |
to all required executables. See CONFIG FILE EDITING for more |
|---|
| 248 |
information. |
|---|
| 249 |
4) Then try the following command from your temporary directory: |
|---|
| 250 |
|
|---|
| 251 |
perl maker_directory/bin/maker maker_opts.ctl maker_bopts.ctl \ |
|---|
| 252 |
maker_exe.ctl |
|---|
| 253 |
|
|---|
| 254 |
MAKER will create at least the following files/directories: |
|---|
| 255 |
|
|---|
| 256 |
1) XXX.maker.output/ - contains all output for a given run of maker |
|---|
| 257 |
2) XXX.maker.output/XXX_master_datastore_index.log - log of MAKER run |
|---|
| 258 |
progress as well as an index for traversing |
|---|
| 259 |
XXX.maker.output/XXX_datastore/ |
|---|
| 260 |
3) XXX.maker.output/XXX_datastore/ - contains folders containing the |
|---|
| 261 |
output for each individual contig of the input fasta file |
|---|
| 262 |
|
|---|
| 263 |
Within these folders: |
|---|
| 264 |
* seq_name.gff - a gff file that can be loaded into GMOD, GBROWSE, |
|---|
| 265 |
or Apollo |
|---|
| 266 |
* seq_name.maker.transcripts.fasta - a file of the maker |
|---|
| 267 |
transcript sequences |
|---|
| 268 |
* seq_name.maker.proteins.fasta - a file of the maker protein |
|---|
| 269 |
sequences |
|---|
| 270 |
* seq_name.maker.XXX.transcript.fasta - a file of ab-initio |
|---|
| 271 |
transcript sequences from program XXX |
|---|
| 272 |
* seq_name.maker.XXX.proteins.fasta - a file of ab-initio protein |
|---|
| 273 |
sequences from program XXX |
|---|
| 274 |
* seq_name.maker.non_overlapping_ab_initio.transcripts.fasta - a |
|---|
| 275 |
file of filtered ab-initio transcript sequences that don't |
|---|
| 276 |
overlap annotations |
|---|
| 277 |
* seq_name.maker.non_overlapping_ab_initio.proteins.fasta - a |
|---|
| 278 |
file of filtered ab-initio protein sequences that don't overlap |
|---|
| 279 |
annotations |
|---|
| 280 |
* theVoid.seq_name/ - a directory containing all of the raw |
|---|
| 281 |
output files produced by maker, including BLAST reports, SNAP |
|---|
| 282 |
output, exonerate output and the masked sequence. |
|---|
| 283 |
|
|---|
| 284 |
WARNING: |
|---|
| 285 |
|
|---|
| 286 |
* The names of output files are based on sequence ids. If giving |
|---|
| 287 |
maker a multi-fasta file, it is important to verify that all |
|---|
| 288 |
sequence ids are unique, so files are not overwritten. |
|---|
| 289 |
|
|---|
| 290 |
* If there are more than 1,000 sequences in a multi-fasta file a deep |
|---|
| 291 |
datastore structure will be used. See DATASTORE in this document. |
|---|
| 292 |
|
|---|
| 293 |
* If sequence ids contain characters that are illegal in file names, |
|---|
| 294 |
those characters will be replaced automatically before building |
|---|
| 295 |
output file names. |
|---|
| 296 |
|
|---|
| 297 |
#--------------------------------------------------------------------- |
|---|
| 298 |
|
|---|
| 299 |
DATASTORE |
|---|
| 300 |
|
|---|
| 301 |
"Many filesystems have performance problems with large numbers of |
|---|
| 302 |
subdirectories and files within a single directory and even when the |
|---|
| 303 |
underlying filesystems handle things gracefully, access via network |
|---|
| 304 |
filesystems can be an issue. The Datastore modules create a hierarchy |
|---|
| 305 |
of subdirectory layers, starting from a 'base', and mapping end-user's |
|---|
| 306 |
identifiers to the corresponding subdirectory." - quote from |
|---|
| 307 |
http://www.yandell-lab.org/ (See site for more information on the |
|---|
| 308 |
Datastore module) |
|---|
| 309 |
|
|---|
| 310 |
A deep datastore will be used by maker if there are more than 1,000 |
|---|
| 311 |
sequences in a multi-fasta file. |
|---|
| 312 |
|
|---|
| 313 |
When a datastore is implemented, the output files described above will |
|---|
| 314 |
not appear where you would normally expect them to be. Instead they |
|---|
| 315 |
will be located in a series of sub-directories under a new |
|---|
| 316 |
base-directory whose name is determined from the input genome file |
|---|
| 317 |
name: |
|---|
| 318 |
|
|---|
| 319 |
i.e. current_directory/genome_datastore/EE/Af/Contig1/Contig1.gff. |
|---|
| 320 |
|
|---|
| 321 |
A master_datastore_index file will be made in the current working |
|---|
| 322 |
directory to help you find the output files from each sequence. |
|---|
| 323 |
|
|---|
| 324 |
The master_datastore_index file is a file created to allow the user to |
|---|
| 325 |
easily find the exact output directory corresponding to contigs from |
|---|
| 326 |
the input genome file. The master_datastore_index file contains |
|---|
| 327 |
three columns of text; the first column shows the sequence identifier |
|---|
| 328 |
from each fasta header, and the second column shows the location of |
|---|
| 329 |
the output files for that sequence. The third column is for logging |
|---|
| 330 |
the status of data related to an individual contig. The values of the |
|---|
| 331 |
third column are as follows: |
|---|
| 332 |
* STARTED - Indicates that maker has started processing this |
|---|
| 333 |
contig. |
|---|
| 334 |
* FINISHED - Indicates that maker has finished processing this |
|---|
| 335 |
contig and all data is currently available in that subdirectory. |
|---|
| 336 |
* DIED - Indicates that maker failed. |
|---|
| 337 |
* DIED_SKIPPED_PERMANENT - Indicates that maker failed up to the |
|---|
| 338 |
specified number of retries and will not try again. |
|---|
| 339 |
* RETRY - Indicates that maker is retrying the contig after a |
|---|
| 340 |
failure. |
|---|
| 341 |
* SKIPPED_SMALL - Indicates that this contig was skipped because |
|---|
| 342 |
it is too short (based on control file values set by the user). |
|---|
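Since the index doubles as a progress log, it can be scanned to find contigs
that need attention. The sketch below prints every sequence identifier whose
most recent status is anything other than FINISHED. It assumes the three
columns are separated by tabs and that later lines for the same contig
supersede earlier ones; neither detail is stated above, so check your own
index file first. The function name unfinished_contigs is made up for this
example.

```shell
# Print (sorted) the IDs of contigs whose last logged status is not FINISHED.
# Assumes tab-separated columns: id, output location, status.
unfinished_contigs() {
    awk -F '\t' '
        NF >= 3 { last[$1] = $3 }    # remember the latest status per contig
        END {
            for (id in last)
                if (last[id] != "FINISHED") print id
        }
    ' "$1" | sort
}

# Usage:
#   unfinished_contigs XXX.maker.output/XXX_master_datastore_index.log
```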
| 343 |
|
|---|
| 344 |
#--------------------------------------------------------------------- |
|---|
| 345 |
|
|---|
| 346 |
CONFIG FILE EDITING |
|---|
| 347 |
|
|---|
| 348 |
Lines in the maker control files have the format key:value with no |
|---|
| 349 |
spaces before or after the colon(:). If the value is a file name, you |
|---|
| 350 |
can use relative paths and environmental variables, |
|---|
| 351 |
i.e. genome:$HOME/my_genome.fasta |
|---|
| 352 |
|
|---|
| 353 |
|
|---|
| 354 |
MAKER has 3 control files for configuration options. A fourth file |
|---|
| 355 |
evaluator.ctl is used to supply a MAKER related program EVALUATOR with |
|---|
| 356 |
options specific to that program (only important if 'evaluate' is set |
|---|
| 357 |
to 1 in maker_opts.ctl). |
|---|
| 358 |
|
|---|
| 359 |
Note that for all control files the comments written to help users |
|---|
| 360 |
begin with a pound sign(#). In addition, options before the colon(:) |
|---|
| 361 |
can not be changed, nor should there be a space before or after the |
|---|
| 362 |
colon. |
|---|
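The key:value format is simple enough to read back with standard tools,
which is handy for checking what a control file will tell maker before a
run. The sketch below is one way to do it; the function name ctl_get is
made up for this example. It skips comment lines and returns everything
after the first colon, and it does not expand environment variables such
as $HOME in the value.

```shell
# Print the value stored for KEY in a maker control file.
# Usage: ctl_get KEY FILE
ctl_get() {
    key=$1
    file=$2
    awk -v key="$key" '
        /^#/ { next }                     # skip comment lines
        index($0, key ":") == 1 {         # line starts with "key:"
            print substr($0, length(key) + 2)
            exit
        }
    ' "$file"
}

# Usage:
#   ctl_get genome maker_opts.ctl
```

An empty result means either that the key is absent or that no value has
been set for it yet.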
| 363 |
|
|---|
| 364 |
A. maker_exe.ctl - includes information about programs executed by |
|---|
| 365 |
MAKER. |
|---|
| 366 |
Here is an example of a section of the maker_exe.ctl file: |
|---|
| 367 |
|
|---|
| 368 |
#-----Location of Executables Used by Maker/Evaluator |
|---|
| 369 |
|
|---|
| 370 |
#location of NCBI formatdb executable |
|---|
| 371 |
formatdb:/usr/local/bin/formatdb |
|---|
| 372 |
#location of NCBI blastall executable |
|---|
| 373 |
blastall:/usr/local/bin/blastall |
|---|
| 374 |
#location of WUBLAST xdformat executable |
|---|
| 375 |
xdformat:/usr/local/bin/xdformat |
|---|
| 376 |
#location of WUBLAST blastn executable |
|---|
| 377 |
blastn:/usr/local/bin/blastn |
|---|
| 378 |
#location of WUBLAST blastx executable |
|---|
| 379 |
blastx:/usr/local/bin/blastx |
|---|
| 380 |
#location of WUBLAST tblastx executable |
|---|
| 381 |
tblastx:/usr/local/bin/tblastx |
|---|
| 382 |
#location of RepeatMasker executable |
|---|
| 383 |
RepeatMasker:/home/cholt/usr/local/RepeatMasker/RepeatMasker |
|---|
| 384 |
#location of exonerate executable |
|---|
| 385 |
exonerate:/home/cholt/usr/local/exonerate/bin/exonerate |
|---|
| 386 |
|
|---|
| 387 |
#-----Ab-initio Gene Prediction Algorithms |
|---|
| 388 |
|
|---|
| 389 |
#location of snap executable |
|---|
| 390 |
snap:/home/cholt/usr/local/snap/snap |
|---|
| 391 |
#location of eukaryotic genemark executable |
|---|
| 392 |
gmhmme3:/home/cholt/usr/local/gmes/gmhmme3 |
|---|
| 393 |
#location of augustus executable |
|---|
| 394 |
augustus:/home/cholt/usr/local/augustus/bin/augustus |
|---|
| 395 |
#location of fgenesh executable |
|---|
| 396 |
fgenesh:/home/cholt/usr/local/fgenesh/fgenesh |
|---|
| 397 |
|
|---|
| 398 |
B. maker_bopts.ctl - contains statistics for filtering blast and |
|---|
| 399 |
exonerate data. |
|---|
| 400 |
Here is an example of a section of the maker_bopts.ctl file: |
|---|
| 401 |
|
|---|
| 402 |
#-----BLAST and Exonerate statistics thresholds |
|---|
| 403 |
#set to 'wublast' or 'ncbi' |
|---|
| 404 |
blast_type:wublast |
|---|
| 405 |
#Blastn Percent Coverage Threshold EST-Genome Alignments |
|---|
| 406 |
pcov_blastn:0.8 |
|---|
| 407 |
#Blastn Percent Identity Threshold EST-Genome Alignments |
|---|
| 408 |
pid_blastn:0.85 |
|---|
| 409 |
#Blastn eval cutoff |
|---|
| 410 |
eval_blastn:1e-10 |
|---|
| 411 |
#Blastn bit cutoff |
|---|
| 412 |
bit_blastn:40 |
|---|
| 413 |
|
|---|
| 414 |
#Blastx Percent Coverage Threshold Protein-Genome Alignments |
|---|
| 415 |
pcov_blastx:0.5 |
|---|
| 416 |
#Blastx Percent Identity Threshold Protein-Genome Alignments |
|---|
| 417 |
pid_blastx:0.4 |
|---|
| 418 |
#Blastx eval cutoff |
|---|
| 419 |
eval_blastx:1e-06 |
|---|
| 420 |
#Blastx bit cutoff |
|---|
| 421 |
bit_blastx:30 |
|---|
| 422 |
|
|---|
| 423 |
#Blastx Percent Coverage Threshold For Transposable Element Masking |
|---|
| 424 |
pcov_rm_blastx:0.5 |
|---|
| 425 |
#Blastx Percent Identity Threshold For Transposable Element Masking |
|---|
| 426 |
pid_rm_blastx:0.4 |
|---|
| 427 |
#Blastx eval cutoff for transposable element masking |
|---|
| 428 |
eval_rm_blastx:1e-06 |
|---|
| 429 |
#Blastx bit cutoff for transposable element masking |
|---|
| 430 |
bit_rm_blastx:30 |
|---|
| 431 |
|
|---|
| 432 |
C. maker_opts.ctl - contains options for maker and external programs |
|---|
| 433 |
used by maker. |
|---|
| 434 |
Here is an example of a section of the maker_opts.ctl file: |
|---|
| 435 |
|
|---|
| 436 |
#-----Genome (Required for De-Novo Annotations) |
|---|
| 437 |
#genome sequence file in fasta format |
|---|
| 438 |
genome:input/genome.fasta |
|---|
| 439 |
|
|---|
| 440 |
#-----Re-annotation Options |
|---|
| 441 |
|
|---|
| 442 |
#re-annotate genome based on this gff3 file |
|---|
| 443 |
genome_gff: |
|---|
| 444 |
#use ests in genome_gff: 1 = yes, 0 = no |
|---|
| 445 |
est_pass:0 |
|---|
| 446 |
#use alternate organism ests in genome_gff: 1 = yes, 0 = no |
|---|
| 447 |
altest_pass:0 |
|---|
| 448 |
#use proteins in genome_gff: 1 = yes, 0 = no |
|---|
| 449 |
protein_pass:0 |
|---|
| 450 |
#use repeats in genome_gff: 1 = yes, 0 = no |
|---|
| 451 |
rm_pass:0 |
|---|
| 452 |
#use gene models in genome_gff: 1 = yes, 0 = no |
|---|
| 453 |
model_pass:0 |
|---|
| 454 |
#use ab-initio predictions in genome_gff: 1 = yes, 0 = no |
|---|
| 455 |
pred_pass:0 |
|---|
| 456 |
#passthrough everything else in genome_gff: 1 = yes, 0 = no |
|---|
| 457 |
other_pass:0 |
|---|
| 458 |
|
|---|
| 459 |
#-----EST Evidence (you must provide a value for at least one) |
|---|
| 460 |
|
|---|
| 461 |
#non-redundant set of assembled ESTs in fasta format (classic EST |
|---|
| 462 |
#analysis) |
|---|
| 463 |
est:input/est.fasta |
|---|
| 464 |
#un-assembled EST reads in fasta format (for deep nextgen mRNASeq) |
|---|
| 465 |
est_reads: |
|---|
| 466 |
#EST/cDNA sequence file in fasta format from an alternate organism |
|---|
| 467 |
altest:input/altest.fasta |
|---|
| 468 |
#EST evidence from a separate gff3 file |
|---|
| 469 |
est_gff: |
|---|
| 470 |
#Alternate organism EST evidence from a separate gff3 file |
|---|
| 471 |
altest_gff: |
|---|
| 472 |
|
|---|
| 473 |
#-----Protein Homology Evidence (you must provide a value for at least |
|---|
| 474 |
# one) |
|---|
| 475 |
#protein sequence file in fasta format |
|---|
| 476 |
protein:input/protein.fasta |
|---|
| 477 |
#protein homology evidence from a gff3 file |
|---|
| 478 |
protein_gff: |
|---|
| 479 |
|
|---|
| 480 |
#--------------------------------------------------------------------- |
|---|
| 481 |
|
|---|
| 482 |
GFF3 Passthrough |
|---|
| 483 |
|
|---|
| 484 |
If you have data from a source that MAKER does not support, and you |
|---|
| 485 |
wish to use the data in annotating a genome, then you can pass the |
|---|
| 486 |
data to MAKER as an aligned GFF3 file. This is done by supplying the |
|---|
| 487 |
file's location to the appropriate option in the maker_opts.ctl file |
|---|
| 488 |
(i.e. est_gff:input/est.gff). Note that MAKER expects all data sent |
|---|
| 489 |
to it to be of the type specified, so don't put mixed data in a file |
|---|
| 490 |
(i.e. don't mix EST and other data in the file pointed to by est_gff, |
|---|
| 491 |
otherwise it all gets used as EST data). Also the genome_gff option |
|---|
| 492 |
is only for MAKER produced GFF3 files. Other GFF3 files of mixed data |
|---|
| 493 |
must be split by type and identified by the appropriate control file |
|---|
| 494 |
option (i.e. rm_gff for repeat data, pred_gff for ab-initio prediction |
|---|
| 495 |
data, est_gff for EST data, etc.). |
|---|
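One possible way to split a mixed GFF3 file is by its source column
(column 2), writing one output file per source. This is a sketch under the
assumption that the source column is what distinguishes the kinds of
evidence in your file; if it is not, you will need to split on something
else. The function name split_gff_by_source and the output file naming are
made up for this example.

```shell
# Split a GFF3 file into one file per value of the source column (column 2).
# Usage: split_gff_by_source INPUT.gff OUTPUT_DIR
split_gff_by_source() {
    infile=$1
    outdir=$2
    mkdir -p "$outdir" &&
    awk -F '\t' -v dir="$outdir" '
        /^#/ || NF < 9 { next }      # skip directives, comments, short lines
        {
            name = $2
            gsub(/[^A-Za-z0-9_.-]/, "_", name)   # make a safe file name
            out = dir "/" name ".gff"
            if (!(out in seen)) {
                print "##gff-version 3" > out    # header once per file
                seen[out] = 1
            }
            print > out
        }
    ' "$infile"
}

# Usage:
#   split_gff_by_source mixed.gff split/
```

Each resulting file can then be assigned to the control file option that
matches its contents (est_gff, rm_gff, pred_gff, and so on).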
| 496 |
|
|---|
| 497 |
#--------------------------------------------------------------------- |
|---|
| 498 |
|
|---|
| 499 |
ADDING UTRs for GBROWSE |
|---|
| 500 |
|
|---|
| 501 |
When using APOLLO to visualize gene annotations, UTRs are inferred |
|---|
| 502 |
based on exon and CDS locations. However GMOD and GBROWSE do not |
|---|
| 503 |
infer the UTR, so to visualize the UTR, you will have to run: |
|---|
| 504 |
add_utr_gff.pl with the following command: |
|---|
| 505 |
|
|---|
| 506 |
add_utr_gff.pl <directory> |
|---|
| 507 |
|
|---|
| 508 |
* <directory> is the directory where all of your GFF files are |
|---|
| 509 |
located. |
|---|
| 510 |
|
|---|
| 511 |
Each GFF file will have a sister file called sequence.wutr.gff3. |
|---|
| 512 |
|
|---|
| 513 |
#--------------------------------------------------------------------- |
|---|
| 514 |
|
|---|
| 515 |
APOLLO |
|---|
| 516 |
|
|---|
| 517 |
Maker is bundled with a configuration file that improves the color and |
|---|
| 518 |
display of maker annotations and evidence in the Apollo genome |
|---|
| 519 |
browser. The configuration file is called "gff3.tiers" and is located |
|---|
| 520 |
in the maker/Apollo/ directory. The file should be copied to the |
|---|
| 521 |
conf/ subdirectory which is located under the Apollo installation |
|---|
| 522 |
directory. Using the Mac version of Apollo the conf/ directory is |
|---|
| 523 |
located at /Applications/Apollo.app/Contents/Resources/app/conf/. |
|---|
| 524 |
|
|---|
| 525 |
#--------------------------------------------------------------------- |
|---|
| 526 |
|
|---|
| 527 |
HMM BUILDING (based on snap documentation) |
|---|
| 528 |
|
|---|
| 529 |
|
|---|
| 530 |
A) First you will need to determine the genes used to model future |
|---|
| 531 |
genes, by determining a high quality gene set (annotations for the |
|---|
| 532 |
high quality gene should be in GFF3 format). The high quality gene |
|---|
| 533 |
set can then be converted into snap ZFF format using maker2zff.pl |
|---|
| 534 |
found in maker/bin. |
|---|
| 535 |
|
|---|
| 536 |
This program is run with the following command: |
|---|
| 537 |
|
|---|
| 538 |
maker2zff.pl <directory> genome |
|---|
| 539 |
|
|---|
| 540 |
* <directory> is the directory where all of your GFF3 files are |
|---|
| 541 |
located |
|---|
| 542 |
* genome is the name for the outfile |
|---|
| 543 |
|
|---|
| 544 |
Files Created: |
|---|
| 545 |
genome.ann |
|---|
| 546 |
genome.dna |
|---|
| 547 |
|
|---|
| 548 |
Note: A convenient way to identify an initial high quality gene |
|---|
| 549 |
set for the HMM is to use the -predictor est2genome option in |
|---|
| 550 |
maker. This will produce gene annotations based solely on EST |
|---|
| 551 |
evidence. These annotations can then seed the first HMM. After |
|---|
| 552 |
running maker again using this new HMM and the -predictor snap |
|---|
| 553 |
option, you can use the second round of annotations as the seed |
|---|
| 554 |
for an even better HMM model. In this way the HMM model |
|---|
| 555 |
progressively improves with each run of maker. |
|---|
| 556 |
|
|---|
| 557 |
Another strategy for identifying an initial gene set to model the |
|---|
| 558 |
HMM is to use the program CEGMA (http://korflab.ucdavis.edu/ |
|---|
| 559 |
software.html). CEGMA builds a highly reliable set of gene |
|---|
| 560 |
annotations in the absence of experimental data by identifying DNA |
|---|
| 561 |
regions with homology to a set of 458 proteins that are highly |
|---|
| 562 |
conserved among taxa. |
|---|
| 563 |
|
|---|
| 564 |
Combining both CEGMA and maker datasets to build the first HMM is |
|---|
| 565 |
also a good strategy. |
|---|
| 566 |
|
|---|
| 567 |
B) Next you will use the dna and zff file (genome.dna and genome.ann) |
|---|
| 568 |
to produce a SNAP HMM as described in the SNAP documentation (which |
|---|
| 569 |
we have provided below): |
|---|
| 570 |
|
|---|
| 571 |
The first step is to look at some features of the genes: |
|---|
| 572 |
|
|---|
| 573 |
fathom genome.ann genome.dna -gene-stats |
|---|
| 574 |
|
|---|
| 575 |
Next, you want to verify that the genes have no obvious errors: |
|---|
| 576 |
|
|---|
| 577 |
fathom genome.ann genome.dna -validate |
|---|
| 578 |
|
|---|
| 579 |
You may find some errors and warnings. Check these out in some kind |
|---|
| 580 |
of genome browser and remove those that are real errors. Next, |
|---|
| 581 |
break up the sequences into fragments with one gene per sequence |
|---|
| 582 |
with the following command: |
|---|
| 583 |
|
|---|
| 584 |
fathom genome.ann genome.dna -categorize 1000 |
|---|
| 585 |
|
|---|
| 586 |
There will be up to 1000 bp on either side of the genes. You will |
|---|
| 587 |
find several new files. |
|---|
| 588 |
|
|---|
| 589 |
alt.ann, alt.dna (genes with alternative splicing) |
|---|
| 590 |
err.ann, err.dna (genes that have errors) |
|---|
| 591 |
olp.ann, olp.dna (genes that overlap other genes) |
|---|
| 592 |
wrn.ann, wrn.dna (genes with warnings) |
|---|
| 593 |
uni.ann, uni.dna (single gene per sequence) |
|---|
| 594 |
|
|---|
| 595 |
Convert the uni genes to plus stranded with the command: |
|---|
| 596 |
|
|---|
| 597 |
fathom uni.ann uni.dna -export 1000 -plus |
|---|
| 598 |
|
|---|
| 599 |
You will find 4 new files: |
|---|
| 600 |
|
|---|
| 601 |
export.aa proteins corresponding to each gene |
|---|
| 602 |
export.ann gene structure on the plus strand |
|---|
| 603 |
export.dna DNA of the plus strand |
|---|
| 604 |
export.tx transcripts for each gene |
|---|
| 605 |
|
|---|
| 606 |
The parameter estimation program, forge, creates a lot of files. |
|---|
| 607 |
You probably want to create a directory to keep things tidy before |
|---|
| 608 |
you execute the program. |
|---|
| 609 |
|
|---|
| 610 |
mkdir params |
|---|
| 611 |
cd params |
|---|
| 612 |
forge ../export.ann ../export.dna |
|---|
| 613 |
cd .. |
|---|
| 614 |
|
|---|
| 615 |
The last step is to build an HMM. |
|---|
| 616 |
|
|---|
| 617 |
hmm-assembler.pl my-genome params > my-genome.hmm |
|---|
| 618 |
|
|---|
| 619 |
Lastly, you will want to add the location of your hmm file to your |
|---|
| 620 |
maker_opts.ctl file. |
|---|
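Once the gene set has been inspected with -gene-stats and -validate, the
remaining steps of part B can be collected into one function, sketched
below. It assumes fathom, forge and hmm-assembler.pl are on your $PATH and
that genome.ann and genome.dna are in the current directory. The function
name build_snap_hmm and the DRY_RUN switch are made up for this example;
with DRY_RUN set, the commands are only printed.

```shell
# Run the SNAP training steps from part B and write NAME.hmm.
# Usage: build_snap_hmm [NAME]        (NAME defaults to my-genome)
#        DRY_RUN=1 build_snap_hmm     (print the commands, run nothing)
build_snap_hmm() {
    hmm_name=${1:-my-genome}
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' \
            "fathom genome.ann genome.dna -categorize 1000" \
            "fathom uni.ann uni.dna -export 1000 -plus" \
            "mkdir -p params" \
            "(cd params && forge ../export.ann ../export.dna)" \
            "hmm-assembler.pl $hmm_name params > $hmm_name.hmm"
        return 0
    fi
    # Each step runs only if the previous one succeeded.
    fathom genome.ann genome.dna -categorize 1000 &&
    fathom uni.ann uni.dna -export 1000 -plus &&
    mkdir -p params &&
    (cd params && forge ../export.ann ../export.dna) &&
    hmm-assembler.pl "$hmm_name" params > "$hmm_name.hmm"
}
```

The subshell around the forge step keeps the working directory unchanged,
which replaces the explicit "cd .." in the manual steps.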
| 621 |
|
|---|
| 622 |
* For more information see SNAP documentation on how to build an HMM |
|---|