***MAKER Documentation***

#---------------------------------------------------------------------

INSTALLATION INSTRUCTIONS FOR MAKER

*Step-by-step instructions are also available in the INSTALL text
 file.

MAKER is an annotation pipeline. In other words, it links together
many steps and programs to produce final annotations. For this
reason, you must first install a number of programs that MAKER depends
on. MAKER works on both eukaryotic and prokaryotic genomes.


To install MAKER, you will first need to install the following
external programs:

   *Perl 5.8.0 or higher
   *BioPerl 1.5 or higher (www.bioperl.org)
   *SNAP version 2009-02-03 or higher (homepage.mac.com/iankorf)
   *RepeatMasker 3.1.6 or higher (www.repeatmasker.org)
   *Exonerate 1.4 or higher (www.ebi.ac.uk/~guy/exonerate)

You must also install one of the following:

   *Wu-BLAST 2.0 or higher
    (Wu-BLAST is becoming AB-BLAST, which cannot yet be downloaded)
   or
   *NCBI BLAST 2.2.X or higher
    (http://www.ncbi.nlm.nih.gov/BLAST/download.shtml)

You might also want to install these optional external programs:

   *Augustus 2.0 or higher (augustus.gobics.de)
   *GeneMark-ES (exon.biology.gatech.edu)
   *FGENESH 2.6 or higher (www.softberry.com) - requires licence
   *GeneMarkS for prokaryotic genomes (exon.biology.gatech.edu)

To install mpi_maker, you must have an MPI package installed. Try the
following:

   *MPICH2 (http://www.mcs.anl.gov/research/projects/mpich2/)

   Note: Remember to install MPICH2 with the --enable-sharedlibs flag set
   to the appropriate value. See the MPICH2 Installer's Guide at:
   http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs

Also see the mpi_maker installation instructions further below.


Notes:

   1) Wu-BLAST is becoming AB-BLAST. Once AB-BLAST becomes available we
      will test whether it is compatible with MAKER. Wu-BLAST is no
      longer available online, so if you don't already have it, you
      will have to use NCBI BLAST instead.
   2) RepeatMasker requires Wu-BLAST or Cross_Match and a single-file
      executable called TRF (see the RepeatMasker website for details),
      so please install these before installing RepeatMasker.
   3) Exonerate binaries can be downloaded from the website. If you use
      Mac OS X, however, binaries are only available for version 1.0.
      This version will work too. If you would like to compile
      Exonerate, it requires GLIB, a C library that is linked from the
      Exonerate website. If you use Mac OS X, GLIB can be downloaded
      using Fink.
   4) RepeatMasker requires a repeat library file, which can be
      downloaded from Repbase upon registration
      (http://www.girinst.org/). This is explained on the RepeatMasker
      website.
   5) Please note the location of all of the programs that you have
      installed, and add them to your $PATH variable in your .profile
      file. You will need this information for the maker_exe.ctl file,
      one of MAKER's 3 control files.
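
As a quick check for note 5, the shell sketch below reports which
programs are not yet reachable through $PATH. The function name
missing_programs is hypothetical and not part of MAKER, and the
executable names in the commented example call are assumptions; use
the names your own installation provides.

```shell
# Print the name of every listed program that cannot be found on $PATH.
# Returns 1 if at least one program is missing, 0 otherwise.
missing_programs() {
    mp_rc=0
    for mp_prog in "$@"; do
        if ! command -v "$mp_prog" >/dev/null 2>&1; then
            echo "$mp_prog"
            mp_rc=1
        fi
    done
    return "$mp_rc"
}

# Example call (executable names are assumptions, adjust as needed):
# missing_programs perl snap RepeatMasker exonerate blastall
```

An empty result means every listed program was found, and the
locations printed by 'command -v <program>' can then be copied into
maker_exe.ctl.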

Now that you have all the necessary programs installed, MAKER can be
unpacked using:

   tar -zxvf maker.tar.gz

This will create a directory called maker with 6 subdirectories:

   bin    - contains the MAKER executables.
   lib    - contains all the necessary Perl libraries for MAKER.
   MPI    - contains MPI-specific data to configure MAKER for a
            cluster that supports MPI.
   Apollo - contains the gff3.tiers file (see the section titled
            APOLLO below).
   data   - contains some sample data used to make sure everything
            works.
   perl   - contains Perl modules that need to be compiled.

Finally, change to the maker/perl directory and type 'perl Install.PL'
to compile the required Perl modules.

Now you can run MAKER!!

Programs required by MAKER rely on certain environment variables
being set. If you have not set these variables per the installation
instructions of the external programs, a reminder list is provided
below:

for tcsh:
   setenv PERL5LIB where_bioperl_is_installed
   setenv WUBLASTMAT where_wublast_is_installed/matrix
   setenv ZOE where_snap_is_installed
   setenv WUBLASTFILTER where_wublast_is_installed/filter
   setenv AUGUSTUS_CONFIG_PATH where_augustus_is_installed/config

for bash:
   export PERL5LIB=where_bioperl_is_installed
   export WUBLASTMAT=where_wublast_is_installed/matrix
   export ZOE=where_snap_is_installed
   export WUBLASTFILTER=where_wublast_is_installed/filter
   export AUGUSTUS_CONFIG_PATH=where_augustus_is_installed/config
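
To see which of the variables in the reminder list are still missing
from the current shell, something like the following sketch can help.
The function name unset_variables is hypothetical; which variables you
actually need depends on which optional programs you installed (for
example, the WUBLAST variables only matter if you use Wu-BLAST).

```shell
# Print the name of every listed environment variable that is unset
# or empty in the current shell.
unset_variables() {
    for uv_name in "$@"; do
        # Indirect lookup of the variable named by $uv_name.
        eval "uv_value=\${$uv_name:-}"
        if [ -z "$uv_value" ]; then
            echo "$uv_name"
        fi
    done
}

# Example call using the names from the reminder list:
# unset_variables PERL5LIB WUBLASTMAT ZOE WUBLASTFILTER AUGUSTUS_CONFIG_PATH
```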

#---------------------------------------------------------------------

MPI MAKER INSTALL

If you are running MAKER on an MPI-capable cluster, you can install an
MPI version of MAKER by doing the following:

   1. Install standard MAKER and verify that it runs.
   2. Install MPICH2 with the --enable-sharedlibs flag set to the
      appropriate value for your OS (see the MPICH2 documentation).
   3. Use cd to change to the MPI subdirectory in the MAKER
      installation folder (i.e. maker/MPI/).
   4. Run Install.PL by typing: perl Install.PL

A new version of MAKER called mpi_maker should now be installed under
maker/bin.

To run mpi_maker, first verify that your MPI environment has been
started (e.g. using the mpdboot or mpd command). Then start mpi_maker
via mpiexec.

Example: (This will run MAKER on 4 nodes or processors)

   mpiexec -n 4 mpi_maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl

Please see the documentation of the MPI environment you use for
instructions on how to initiate an MPI process.

#---------------------------------------------------------------------

RUNNING MAKER WITH EXAMPLE DATA

   1) Copy the files in the data directory to a temporary directory
      where you will run the example.
   2) Type maker -CTL to generate generic MAKER control files.
   3) Next, edit the control files to include the paths of the genome
      file, EST file, and protein file, as well as the paths to all
      required executables. See CONTROL FILE EDITING for more
      information.
   4) Then try the following command from your temporary directory:

         maker maker_exe.ctl maker_opts.ctl maker_bopts.ctl

   5) Examine the output files. See the MAKER OUTPUT and APOLLO
      sections.
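
Steps 1 and 2 above can be wrapped in a small shell function, sketched
here under the assumption that maker is on your $PATH and that the
sample files sit directly in the data subdirectory. The function name
run_maker_example and both arguments are hypothetical. Steps 3 to 5
are left to you, because the control files must be edited by hand.

```shell
# Copy the sample data into a work directory and generate generic
# control files there. Runs in a subshell, so the caller's working
# directory is left unchanged.
# Usage: run_maker_example <maker install dir> <work dir>
run_maker_example() (
    maker_dir=$1
    work_dir=$2
    mkdir -p "$work_dir" || exit 1
    cp "$maker_dir"/data/* "$work_dir"/ || exit 1
    cd "$work_dir" || exit 1
    maker -CTL || exit 1
    echo "Control files written to $work_dir."
    echo "Edit them, then run from that directory:"
    echo "  maker maker_exe.ctl maker_opts.ctl maker_bopts.ctl"
)

# Example call (paths are placeholders):
# run_maker_example /path/to/maker /tmp/maker_example
```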

#---------------------------------------------------------------------

CONTROL FILE EDITING

MAKER uses control files to guide each run. Generic control files can
be built using the -CTL flag in maker. These control files can then
be edited by the user to identify the location of all required input
data and statistics. Control files are run-specific, and separate
control files will need to be built for each genome given to MAKER.
MAKER will look for control files in the current working directory, so
it is recommended that MAKER be run in a separate directory
containing unique control files for each genome.

Control files:

   1. maker_exe.ctl - contains the path information for needed
      executables.

   2. maker_bopts.ctl - contains filtering statistics for BLAST and
      Exonerate.

   3. maker_opts.ctl - contains all other information for MAKER,
      including the location of the input genome file.


Remember to examine the control files before each run of MAKER on your
specific data.

Lines in the MAKER control files have the format key:value with no
spaces before or after the colon (:). If the value is a file name, you
can use relative paths and environment variables,
e.g. genome:$HOME/my_genome.fasta

Note that in all control files, the comments written to help users
begin with a pound sign (#). In addition, the option names before the
colon (:) must not be changed, nor should there be a space before or
after the colon.
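
The formatting rules above are easy to break while editing by hand.
The sketch below flags lines that are neither comments nor well-formed
key:value entries. The function name check_ctl_format is hypothetical,
and the check only covers the rules stated in this section (a key, a
colon, and no whitespace on either side of the colon); it does not
know which keys MAKER accepts.

```shell
# Print "line-number: text" for every line of a control file that is
# not blank, not a comment, and not of the form key:value without
# whitespace around the colon. Returns 1 if any such line is found.
check_ctl_format() {
    awk '
        /^[ \t]*(#|$)/ { next }            # skip comments and blank lines
        !/^[^: \t]+:([^ \t]|$)/ {          # key, colon, then value or nothing
            printf "%d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Example call:
# check_ctl_format maker_opts.ctl
```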

A. maker_exe.ctl - includes information about programs executed by
   MAKER.

Here is an example of sections of the maker_opts.ctl file:

#-----Genome
genome:/fastas/genome.fasta

#-----EST Evidence
est:/fastas/est.fasta
altest:/fastas/alt_est.fasta

#-----Protein Homology Evidence
protein:protein.fasta

#-----MAKER Specific Options
evaluate:0
max_dna_len:100000
min_contig:1
min_protein:0
split_hit:10000
pred_flank:200
single_exon:0
single_length:250
keep_preds:0
map_forward:0
retry:1
clean_try:0
clean_up:0
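
When the same control files are prepared for several genomes, setting
a value from a script avoids typing mistakes. The following is only a
sketch: the function name set_ctl_value is hypothetical, it relies on
the key:value layout shown above, and it replaces the whole line, so
any text after the old value on that line is lost.

```shell
# Replace the value of KEY in a control file, leaving all other lines
# untouched. Fails (returns 1) if the key does not occur in the file.
# Usage: set_ctl_value <file> <key> <value>
set_ctl_value() {
    ctl_file=$1
    ctl_key=$2
    ctl_value=$3
    ctl_tmp=$(mktemp) || return 1
    if awk -v key="$ctl_key" -v value="$ctl_value" '
            index($0, key ":") == 1 { print key ":" value; found = 1; next }
            { print }
            END { exit !found }
        ' "$ctl_file" > "$ctl_tmp"; then
        cat "$ctl_tmp" > "$ctl_file"
        rm -f "$ctl_tmp"
    else
        rm -f "$ctl_tmp"
        return 1
    fi
}

# Example call:
# set_ctl_value maker_opts.ctl genome /fastas/genome.fasta
```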


#---------------------------------------------------------------------

MAKER OUTPUT

MAKER will create at least the following files/directories:

   1) XXX.maker.output/ - contains all output for a given run of MAKER.
   2) XXX.maker.output/XXX_datastore/ - contains subdirectories that hold
      the output for each individual contig of the input fasta file. See
      the DATASTORE DIRECTORY STRUCTURE section.
   3) XXX.maker.output/XXX_master_datastore_index.log - log of MAKER run
      progress as well as an index for traversing the output data
      structure.
   4) XXX.maker.output/mpi_blastdb/ - contains fasta indexes and
      error-corrected fasta files built from the EST and protein
      databases provided by the user.
   5) maker_opt.log, maker_exe.log, maker_bopts.log - logs of the
      control files used for this run of MAKER.
   6) XXX.maker.output/XXX.db - database of GFF3 files provided by the
      user. See the GFF3 PASSTHROUGH section.

Within the XXX_datastore/ subdirectories:
   * seq_name.gff - a GFF file that can be loaded into GMOD, GBROWSE,
     or Apollo
   * seq_name.maker.transcripts.fasta - a fasta file of the MAKER
     annotated transcript sequences
   * seq_name.maker.proteins.fasta - a fasta file of the MAKER
     annotated protein sequences
   * seq_name.maker.XXX.transcript.fasta - a fasta file of ab-initio
     predicted transcript sequences from program XXX
   * seq_name.maker.XXX.proteins.fasta - a fasta file of ab-initio
     predicted protein sequences from program XXX
   * seq_name.maker.non_overlapping_ab_initio.transcripts.fasta - a
     fasta file of filtered ab-initio transcript sequences that don't
     overlap MAKER annotations
   * seq_name.maker.non_overlapping_ab_initio.proteins.fasta - a
     fasta file of filtered ab-initio protein sequences that don't
     overlap MAKER annotations
   * theVoid.seq_name/ - a directory containing all of the raw
     output files produced by MAKER, including BLAST reports, SNAP
     output, Exonerate output and the masked genomic sequence.

WARNING:

   * The names of output files are based on sequence ids. If giving
     MAKER a multi-fasta file, it is important to verify that all
     sequence ids are unique, so that files are not overwritten.

   * If there are more than 1,000 sequences in a multi-fasta file, a
     deep datastore structure will be used. See DATASTORE DIRECTORY
     STRUCTURE in this document.

   * If sequence ids contain characters that are illegal in file names,
     those characters will be replaced automatically before building
     output file names.
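
The first warning can be checked before a run. The sketch below lists
sequence ids that occur more than once in a fasta file. It assumes the
id is the header text between ">" and the first whitespace, which may
differ from how MAKER itself derives ids; the function name
duplicate_fasta_ids is hypothetical.

```shell
# Print each sequence id that appears more than once in a fasta file.
# The id is taken as the header text after ">" up to the first
# whitespace character (an assumption, see the note above).
duplicate_fasta_ids() {
    awk '/^>/ { id = substr($1, 2); if (++seen[id] == 2) print id }' "$1"
}

# Example call:
# duplicate_fasta_ids genome.fasta
```

No output means every id in the file is unique under that assumption.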


#---------------------------------------------------------------------

DATASTORE DIRECTORY STRUCTURE

Many filesystems have performance problems with large numbers of
subdirectories and files within a single directory, and even when the
underlying filesystems handle things gracefully, access via network
filesystems can be an issue. The amount of output produced while
annotating an entire genome can easily overwhelm the file system. To
deal with all the output files, MAKER uses a Datastore module to
create a hierarchy of subdirectory layers, starting from a 'base', and
mapping each identifier to its corresponding subdirectory.

A deep datastore will be used by MAKER if there are more than 1,000
sequences in a multi-fasta file. When a deep datastore is
implemented, MAKER output files will not appear where you would
normally expect them to be. Instead they will be located in a series
of subdirectories under a new base directory whose name is derived
from the input genome file name:

   EXAMPLE: current_directory/fly_datastore/EE/Af/Contig1/Contig1.gff

To help you locate output files, a master_datastore_index file is
created which lists the exact output directory corresponding to each
contig from the input genome file. The master_datastore_index
file contains three columns of text: the first column shows the
sequence identifier from each fasta header, and the second column
shows the location of the output files for that sequence. The third
column logs the status of data related to an individual
contig. The values of the third column are as follows:
   * STARTED - Indicates that MAKER has started processing this
     contig.
   * FINISHED - Indicates that MAKER has finished processing this
     contig and all data is currently available in that subdirectory.
   * DIED - Indicates that MAKER failed on this contig.
   * DIED_SKIPPED_PERMANENT - Indicates that MAKER failed up to the
     specified number of retries and will not try again.
   * RETRY - Indicates that MAKER is retrying the contig after a
     failure.
   * SKIPPED_SMALL - Indicates that this contig was skipped because
     it is too short (based on control file values set by the user).
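
Because the index doubles as a progress log, it can be queried from
the shell. The sketch below lists contigs by their most recent status.
Two things are assumed rather than documented here: that the three
columns are separated by tabs, and that a later line for the same
identifier supersedes earlier ones. The function names latest_status
and contigs_with_status are hypothetical.

```shell
# Reduce the index to one line per contig, keeping the last directory
# and status recorded for each identifier, in order of first appearance.
latest_status() {
    awk -F '\t' '
        {
            if (!($1 in dir)) ids[++n] = $1
            dir[$1] = $2
            status[$1] = $3
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s\t%s\t%s\n", ids[i], dir[ids[i]], status[ids[i]]
        }
    ' "$1"
}

# Print "identifier<TAB>directory" for contigs whose latest status
# equals the requested value.
# Usage: contigs_with_status <index file> <STATUS>
contigs_with_status() {
    latest_status "$1" | awk -F '\t' -v want="$2" '$3 == want { print $1 "\t" $2 }'
}

# Example call:
# contigs_with_status XXX_master_datastore_index.log FINISHED
```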


#---------------------------------------------------------------------

GFF3 PASSTHROUGH

If you have data from a source that MAKER does not support, and you
wish to use the data in annotating a genome, then you can pass the
data to MAKER as an aligned GFF3 file. This is done by supplying the
file's location to the appropriate option in the maker_opts.ctl file
(e.g. est_gff:input/est.gff). Note that MAKER expects all data sent
to it to be of the type specified, so don't put mixed data in a file
(e.g. don't mix EST and other data in the file pointed to by est_gff,
otherwise it all gets used as EST data). Also, the genome_gff option
is only for MAKER-produced GFF3 files. Other GFF3 files of mixed data
must be split by type and identified by the appropriate control file
option (e.g. rm_gff for repeat data, pred_gff for ab-initio prediction
data, est_gff for EST data, etc.).
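
One way to split a mixed GFF3 file is by its source column (column 2),
which usually names the program that produced each feature. This is a
sketch under that assumption; whether the source column separates your
data cleanly into EST, repeat and prediction evidence depends on the
file, so inspect the results before pointing est_gff, rm_gff or
pred_gff at them. The function name split_gff3_by_source is
hypothetical.

```shell
# Write one GFF3 file per distinct source (column 2) into OUT_DIR.
# Characters in the source that are awkward in file names become "_".
# Usage: split_gff3_by_source <mixed.gff> <out dir>
split_gff3_by_source() {
    gff_in=$1
    gff_out_dir=$2
    mkdir -p "$gff_out_dir" || return 1
    awk -F '\t' -v dir="$gff_out_dir" '
        /^#/ || NF < 9 { next }     # skip directives, comments, non-feature lines
        {
            name = $2
            gsub(/[^A-Za-z0-9_.-]/, "_", name)
            out = dir "/" name ".gff"
            if (!(out in started)) {
                print "##gff-version 3" > out
                started[out] = 1
            }
            print > out
        }
    ' "$gff_in"
}

# Example call:
# split_gff3_by_source mixed.gff split_by_source
```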

You should use the online GFF3 validator to see if your GFF3 files
comply with all GFF3 specifications before running MAKER:

   http://dev.wormbase.org/db/validate_gff3/validate_gff3_online

#---------------------------------------------------------------------

ADDING UTRs FOR GBROWSE

When using APOLLO to visualize gene annotations, UTRs are inferred
based on exon and CDS locations. However, GMOD and GBROWSE do not
infer the UTR, so to visualize the UTR you will have to run
add_utr_gff.pl with the following command:

   add_utr_gff.pl <directory>

<directory> is the directory containing all of your GFF3 files.

Each GFF3 file will have a sister file called sequence.wutr.gff3.

There is also another script called add_utr_start_stop that adds start
and stop codon entries as well as UTR entries.

   add_utr_start_stop <GFF3 file>
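
Since add_utr_start_stop takes a single GFF3 file, a loop is needed to
process a whole datastore. The helper below is a sketch; the name
for_each_gff is hypothetical and not part of MAKER, and it assumes the
files of interest end in .gff, as listed in the MAKER OUTPUT section.

```shell
# Run a command once for every *.gff file found under a directory.
# The file path is appended as the last argument of the command.
# Usage: for_each_gff <directory> <command> [args...]
for_each_gff() {
    gff_root=$1
    shift
    find "$gff_root" -type f -name '*.gff' | sort |
    while IFS= read -r gff_file; do
        # stdin is redirected so the command cannot swallow the file list
        "$@" "$gff_file" </dev/null || echo "failed on $gff_file" >&2
    done
}

# Example call (directory name is a placeholder):
# for_each_gff XXX.maker.output/XXX_datastore add_utr_start_stop
```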

#---------------------------------------------------------------------

APOLLO

MAKER is bundled with a configuration file that improves the color and
display of MAKER annotations and evidence in the Apollo genome
browser. The configuration file is called "gff3.tiers" and is located
in the maker/Apollo/ directory. The file should be copied to the
conf/ subdirectory, which is located under the Apollo installation
directory. In the Mac version of Apollo, the conf/ directory is
located at /Applications/Apollo.app/Contents/Resources/app/conf/.
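
The copy step might look like the sketch below, which also keeps a
backup of any gff3.tiers file already present in conf/. The function
name install_tiers and the .bak suffix are choices made for this
example only.

```shell
# Copy MAKER's gff3.tiers into an Apollo conf/ directory, saving any
# existing file there as gff3.tiers.bak first.
# Usage: install_tiers <maker dir> <apollo conf dir>
install_tiers() {
    tiers_src=$1/Apollo/gff3.tiers
    tiers_conf=$2
    [ -f "$tiers_src" ] || { echo "not found: $tiers_src" >&2; return 1; }
    [ -d "$tiers_conf" ] || { echo "no such directory: $tiers_conf" >&2; return 1; }
    if [ -f "$tiers_conf/gff3.tiers" ]; then
        cp "$tiers_conf/gff3.tiers" "$tiers_conf/gff3.tiers.bak" || return 1
    fi
    cp "$tiers_src" "$tiers_conf/"
}

# Example call for the Mac location given above:
# install_tiers maker /Applications/Apollo.app/Contents/Resources/app/conf
```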

#---------------------------------------------------------------------

HMM BUILDING (based on SNAP documentation)


A) First you will need to choose the genes used to model future
   genes by assembling a high-quality gene set (annotations for the
   high-quality genes should be in GFF3 format). The high-quality gene
   set can then be converted into SNAP ZFF format using maker2zff.pl,
   found in maker/bin.

   This program is run with the following command:

      maker2zff.pl <directory> genome

   * <directory> is the directory where all of your GFF3 files are
     located
   * genome is the name for the outfile

   Files created:
      genome.ann
      genome.dna

   Note: A convenient way to identify an initial high-quality gene
   set for the HMM is to use the -predictor est2genome option in
   MAKER. This will produce gene annotations based solely on EST
   evidence. These annotations can then seed the first HMM. After
   running MAKER again using this new HMM and the -predictor snap
   option, you can use the second round of annotations as the seed
   for an even better HMM. In this way the HMM progressively
   improves with each run of MAKER.

   Another strategy for identifying an initial gene set to model the
   HMM is to use the program CEGMA (http://korflab.ucdavis.edu/
   software.html). CEGMA builds a highly reliable set of gene
   annotations in the absence of experimental data by identifying DNA
   regions with homology to a set of 458 proteins that are highly
   conserved among taxa.

   Combining both CEGMA and MAKER datasets to build the first HMM is
   also a good strategy.

B) Next you will use the DNA and ZFF files (genome.dna and genome.ann)
   to produce a SNAP HMM as described in the SNAP documentation
   (reproduced below):

   The first step is to look at some features of the genes:

      fathom genome.ann genome.dna -gene-stats

   Next, you want to verify that the genes have no obvious errors:

      fathom genome.ann genome.dna -validate

   You may find some errors and warnings. Check these in a genome
   browser and remove those that are real errors. Next, break up the
   sequences into fragments with one gene per sequence using the
   following command:

      fathom genome.ann genome.dna -categorize 1000

   There will be up to 1000 bp on either side of the genes. You will
   find several new files.

      alt.ann, alt.dna (genes with alternative splicing)
      err.ann, err.dna (genes that have errors)
      olp.ann, olp.dna (genes that overlap other genes)
      wrn.ann, wrn.dna (genes with warnings)
      uni.ann, uni.dna (single gene per sequence)

   Convert the uni genes to plus stranded with the command:

      fathom uni.ann uni.dna -export 1000 -plus

   You will find 4 new files:

      export.aa   proteins corresponding to each gene
      export.ann  gene structure on the plus strand
      export.dna  DNA of the plus strand
      export.tx   transcripts for each gene

   The parameter estimation program, forge, creates a lot of files.
   You probably want to create a directory to keep things tidy before
   you execute the program.

      mkdir params
      cd params
      forge ../export.ann ../export.dna
      cd ..

   The last step is to build the HMM:

      hmm-assembler.pl my-genome params > my-genome.hmm

   Finally, add the location of your HMM file to your maker_opts.ctl
   file.
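
Once the gene set has been inspected with -gene-stats and -validate,
the remaining commands in part B can be chained. The function below is
a sketch that simply strings together the commands shown above; the
name build_snap_hmm is hypothetical, and the two inspection steps are
left out on purpose because their output needs to be read by a person.

```shell
# Chain the categorize, export, forge and hmm-assembler.pl steps.
# Expects NAME.ann and NAME.dna in the current directory and writes
# NAME.hmm there. Runs in a subshell, so the caller's directory and
# variables are left unchanged.
# Usage: build_snap_hmm <name>
build_snap_hmm() (
    hmm_name=$1
    for hmm_tool in fathom forge hmm-assembler.pl; do
        command -v "$hmm_tool" >/dev/null 2>&1 ||
            { echo "missing program: $hmm_tool" >&2; exit 1; }
    done
    fathom "$hmm_name.ann" "$hmm_name.dna" -categorize 1000 || exit 1
    fathom uni.ann uni.dna -export 1000 -plus || exit 1
    mkdir -p params || exit 1
    ( cd params && forge ../export.ann ../export.dna ) || exit 1
    hmm-assembler.pl "$hmm_name" params > "$hmm_name.hmm"
)

# Example call, after maker2zff.pl has written genome.ann and genome.dna:
# build_snap_hmm genome
```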

* For more information see SNAP documentation on how to build an HMM