root/README

Revision 206, 23.8 kB (checked in by cholt, 7 months ago)

fix add_utr_gff

1 ***MAKER Documentation***
2
3 #---------------------------------------------------------------------
4
5 INSTALLATION INSTRUCTIONS FOR MAKER
6
7 *Step by step instructions are also available in the INSTALL text
8 file.
9
10 MAKER is an annotation pipeline.  In other words it links together
11 many steps and programs to produce final annotations.  For this
12 reason, you must first install a number of programs that MAKER depends
13 on.
14
15
16 To install maker, you will first need to install the following
17 external programs:
18
19     *PERL 5.8.0 or higher
20     *BioPerl 1.5 or higher (www.bioperl.org)
21     *SNAP version 2009-02-03  or higher (homepage.mac.com/iankorf)
22     *RepeatMasker 3.1.6  or higher (www.repeatmasker.org)
23     *Exonerate 1.4  or higher (www.ebi.ac.uk/~guy/exonerate)
24
25 You must also install one of the following:
26
27     *Wu-BLAST 2.0 or higher (Wu-BLAST is becoming AB-BLAST, which
28      cannot yet be downloaded)
29         or
30     *NCBI BLAST 2.2.X or higher
31      (http://www.ncbi.nlm.nih.gov/BLAST/download.shtml)
32  
33 You might want to also install these optional external programs:
34
35     *Augustus 2.0 or higher (augustus.gobics.de)
36     *GeneMark.hmm-E 3.9 or higher (exon.biology.gatech.edu)
37     *FgenesH (www.softberry.com/) - requires licence
38
39 To install mpi_maker, you must have an MPI package installed.  Try
40 the following:
41
42     *MPICH2 (http://www.mcs.anl.gov/research/projects/mpich2/)
43
44 Note: Remember to install MPICH2 with the --enable-sharedlibs flag set
45 to the appropriate value (See MPICH2 Installer's Guide at
46 http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs).
47
48
49 Notes:
50
51 1) Wu-BLAST is becoming AB-BLAST.  Once AB-BLAST becomes available we
52    will do some testing to see if it is compatible with MAKER.
53    Wu-BLAST is no longer available online, so if you don't already
54    have it, you will have to use NCBI BLAST instead.
55 2) RepeatMasker requires Wu-BLAST or Cross_Match and a single file
56    executable called TRF (see RepeatMasker website for details), so
57    please install these before installing RepeatMasker
58 3) Exonerate binaries can be downloaded from the website.  If you use
59    Mac OSX, however, binaries are only available for version 1.0.
60    This version will work too.  If you would like to compile exonerate,
61    it requires GLIB, a C library that is linked from the exonerate
62    website.  If you use Mac OSX, GLIB can be downloaded using FINK.
63 4) RepeatMasker requires a repeat library file, which can be
64    downloaded from Repbase upon registration
65    (http://www.girinst.org/).  This is explained on the RepeatMasker
66    website.
67 5) Please note the location of all of the programs that you have
68    installed, and add them to your $PATH variable in your .profile
69    file.  You will need this information in the maker_exe.ctl file,
70    one of MAKER's 3 control files.
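
As a quick check for note 5, the sketch below reports where each
program is found on $PATH.  It is only an illustration: the program
names are the ones shown in the maker_exe.ctl example later in this
README, and the helper name locate_programs is made up for this
example.

```python
import shutil

# Program names taken from the maker_exe.ctl example in this README.
# Adjust the list to match the programs you actually installed.
PROGRAMS = ["RepeatMasker", "exonerate", "snap", "blastall", "formatdb"]


def locate_programs(names):
    """Map each program name to its full path on $PATH, or to None
    when the program cannot be found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_programs(PROGRAMS).items():
        print(f"{name}: {path or 'NOT FOUND on $PATH'}")
```

The printed paths are the values you would copy into maker_exe.ctl.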
71
72 Now that you have all the necessary programs installed, MAKER can be
73 unpacked using:
74
75  tar xvfz maker.tar.gz
76
77 This will create a directory called maker with 6 sub directories:
78
79     bin    - contains the maker executables.
80     lib    - contains all the necessary perl libraries for MAKER.
81     MPI    - contains MPI specific data to configure MAKER for a
82              cluster that supports MPI.
83     Apollo - contains gff3.tiers file (See section titled APOLLO
84              below)
85     data   - contains some sample data used to make sure everything
86              works.
87     perl   - contains perl modules that need to be compiled
88
89 Finally change to the maker/perl directory and type: 'perl Install.PL'
90 to compile required perl modules.
91
92 Now you can run MAKER!!
93
94 Maker uses control files to guide each run.  Generic control files can
95 be built using the -CTL flag in maker.  These control files can then
96 be edited by the user to identify the location of all required input
97 data and statistics.  Control files are run specific, and separate
98 control files will need to be built for each genome given to maker.
99 Maker will look for control files in the current working directory, so
100 it is recommended that maker be run in a separate directory containing
101 unique control files for each genome.
102
103 Control files:
104
105     1. maker_exe.ctl - contains the path information for needed
106        executables.
107
108     2. maker_bopts.ctl - contains filtering statistics for BLAST and
109        Exonerate.
110
111     3. maker_opts.ctl - contains all other information for MAKER,
112        including the location of the input genome file.
113
114 Always remember to examine the control files before each run of MAKER
115 on your specific data.
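
Because maker looks for its control files in the current working
directory, it can help to confirm that all three are present before a
run.  The following is a minimal sketch, not part of MAKER; the
function name missing_control_files is hypothetical.

```python
import os

# The three control file names listed in this README.
CONTROL_FILES = ("maker_opts.ctl", "maker_bopts.ctl", "maker_exe.ctl")


def missing_control_files(directory="."):
    """Return the control files that are not present in `directory`."""
    return [name for name in CONTROL_FILES
            if not os.path.isfile(os.path.join(directory, name))]
```

An empty list means all three files were found.  If any are missing,
running maker -CTL in that directory generates generic copies.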
116
117 Programs required by maker rely on certain environmental variables
118 being set.  If you have not set these variables per the installation
119 instructions of the external programs, a reminder list is provided
120 below:
121
122 for tcsh:
123 setenv PERL5LIB where_bioperl_is_installed
124 setenv WUBLASTMAT where_wublast_is_installed/matrix
125 setenv ZOE where_snap_is_installed
126 setenv WUBLASTFILTER where_wublast_is_installed/filter
127 setenv AUGUSTUS_CONFIG_PATH where_augustus_is_installed/config
128
129 for bash:
130 export PERL5LIB=where_bioperl_is_installed
131 export WUBLASTMAT=where_wublast_is_installed/matrix
132 export ZOE=where_snap_is_installed
133 export WUBLASTFILTER=where_wublast_is_installed/filter
134 export AUGUSTUS_CONFIG_PATH=where_augustus_is_installed/config
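
The reminder list above can be checked with a short script.  This is
a sketch only.  It treats all five variables the same way, although
the WUBLAST variables presumably matter only when Wu-BLAST is used and
AUGUSTUS_CONFIG_PATH only when Augustus is installed (that split is an
assumption, not something MAKER enforces here).

```python
import os

# Variables named in the reminder list above.
MAKER_ENV_VARS = [
    "PERL5LIB",
    "WUBLASTMAT",
    "ZOE",
    "WUBLASTFILTER",
    "AUGUSTUS_CONFIG_PATH",
]


def unset_variables(names, environ=None):
    """Return the names that are unset or empty in `environ`
    (defaults to the environment of the current process)."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]
```

Calling unset_variables(MAKER_ENV_VARS) from the shell you intend to
run maker in lists the variables that still need to be set.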
135
136 #---------------------------------------------------------------------
137
138 MPI MAKER INSTALL
139
140 If you are running maker on an MPI capable cluster, you can install an
141 MPI version of maker by doing the following:
142
143     1. Install standard maker and verify that it runs.
144     2. Install MPICH2 with the --enable-sharedlibs flag set to the
145        appropriate value (See MPICH2 documentation)
146     3. Use cd to change to the MPI subdirectory in the maker
147        installation folder (i.e. maker/MPI/)
148     4. Run Install.PL by typing: perl Install.PL
149
150 A new version of maker called mpi_maker should now be installed under
151 maker/bin.
152
153 To run mpi_maker, first verify that your MPI environment is initiated
154 (i.e. using the mpdboot or mpd command).  Now start mpi_maker via
155 mpiexec.
156
157 Example: (This will run MAKER on 3 nodes or processors)
158
159 mpiexec -n 3 perl maker_directory/maker/bin/mpi_maker maker_opts.ctl \
160 maker_bopts.ctl maker_exe.ctl
161
162 Please see the documentation of the MPI environment you use for
163 instructions on how to initiate an MPI process.
164
165 #---------------------------------------------------------------------
166
167 MAKER USAGE STATEMENT
168
169 Usage:
170
171 maker [options] <maker_opts> <maker_bopts> <maker_exe> <evaluator>
172
173 Maker is a program that produces gene annotations in GFF3 file format
174 using evidence such as EST alignments and protein homology.  Maker can
175 be used to produce gene annotations for new genomes as well as update
176 annotations from existing genome databases.
177
178 The four input arguments are user control files that specify how maker
179 should behave. The evaluator options file contains control options
180 specific for the evaluation of gene annotations. All options for maker
181 should be set in the control files, but a few can also be set on the
182 command line.  Command line options provide a convenient mechanism to
183 override commonly altered control file values.
184
185 Input files listed in the control options files must be in fasta
186 format.  Please see maker documentation to learn more about control
187 file configuration.  Maker will automatically try to locate the user
188 control files in the current working directory if these arguments are
189 not supplied when initializing maker.
190
191 It is important to note that maker does not try to recalculate data
192 that it has already calculated.  For example, if you run an analysis
193 twice on the same dataset file you will notice that maker does not
194 rerun any of the blast analyses, but instead uses the blast analyses
195 stored from the previous run.  To force maker to rerun all analyses,
196 use the -f flag.
197
198
199 Options:
200
201   -genome|g <filename> Specify the genome file.
202
203   -predictor|p <type>  Selects the predictor(s) to use when
204                        building annotations.  Use a ',' to
205                        separate types (no spaces).
206                        i.e. -predictor=snap,augustus,fgenesh
207
208                        types: snap
209                               augustus
210                               fgenesh
211                               genemark
212                               est2genome (Uses EST's directly)
213                               abinit (ab-initio predictions)
214                               model_gff (Passes through GFF3
215                                          annotations)
216
217   -RM_off|R           Turns all repeat masking off.
218
219   -retry   <integer>  Rerun failed contigs up to the specified count.
220
221   -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
222
223   -force|f            Forces maker to delete old files before running
224                       again.  This will require all blast analyses to
225                       be rerun.
226
227   -evaluate|e         Run Evaluator on final annotations (under
228                       development).
229
230   -quiet|q            Silences most of maker's status messages.
231
232   -CTL                Generate empty control files in the current
233                       directory.
234
235   -help|?             Prints this usage statement.
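
When maker is launched from a script, the options above can be
assembled into an argument list.  The sketch below is illustrative
only: build_maker_args is a hypothetical helper, and it only uses
flags listed in the usage statement, with the control files in the
order given there.

```python
def build_maker_args(opts="maker_opts.ctl", bopts="maker_bopts.ctl",
                     exe="maker_exe.ctl", genome=None, predictors=None,
                     rm_off=False, retry=None, cpus=None, force=False,
                     quiet=False):
    """Build an argument list for maker from the documented options."""
    args = ["maker"]
    if genome:
        args += ["-genome", genome]
    if predictors:
        # Types are comma separated with no spaces, as described above.
        args.append("-predictor=" + ",".join(predictors))
    if rm_off:
        args.append("-RM_off")
    if retry is not None:
        args += ["-retry", str(retry)]
    if cpus is not None:
        args += ["-cpus", str(cpus)]
    if force:
        args.append("-force")
    if quiet:
        args.append("-quiet")
    # Control files go last: <maker_opts> <maker_bopts> <maker_exe>
    return args + [opts, bopts, exe]
```

The resulting list can be passed to subprocess.run(), or joined with
spaces to print the equivalent shell command.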
236
237
238 #---------------------------------------------------------------------
239
240 RUNNING MAKER WITH EXAMPLE DATA
241
242 1) Copy the files in the data directories to a temporary directory
243    where you will run an example file.
244 2) Type maker -CTL to generate generic maker control files
245 3) Next you will need to edit the control files to include the path of
246    the genome file, EST file, and protein file, as well as the paths
247    to all required executables.  See CONFIG FILE EDITING for more
248    information.
249 4) Then try the following command from your temporary directory:
250
251 perl maker_directory/bin/maker maker_opts.ctl maker_bopts.ctl \
252 maker_exe.ctl
253
254 MAKER will create at least the following files/directories:
255
256 1) XXX.maker.output/ - contains all output for a given run of maker
257 2) XXX.maker.output/XXX_master_datastore_index.log - log of MAKER run
258    progress as well as an index for traversing
259    XXX.maker.output/XXX_datastore/
260 3) XXX.maker.output/XXX_datastore/ - contains folders containing the
261    output for each individual contig of the input fasta file
262
263 Within these folders:
264     * seq_name.gff - a gff file that can be loaded into GMOD, GBROWSE,
265       or Apollo
266     * seq_name.maker.transcripts.fasta - a file of the maker
267       transcript sequences
268     * seq_name.maker.proteins.fasta - a file of the maker protein
269       sequences
270     * seq_name.maker.XXX.transcript.fasta - a file of ab-initio
271       transcript sequences from program XXX
272     * seq_name.maker.XXX.proteins.fasta - a file of ab-initio protein
273       sequences from program XXX
274     * seq_name.maker.non_overlapping_ab_initio.transcripts.fasta - a
275       file of filtered ab-initio transcript sequences that don't
276       overlap annotations
277     * seq_name.maker.non_overlapping_ab_initio.proteins.fasta - a
278       file of filtered ab-initio protein sequences that don't overlap
279       annotations
280     * theVoid.seq_name/ - a directory containing all of the raw
281       output files produced by maker, including BLAST reports, SNAP
282       output, exonerate output and the masked sequence.
283
284 WARNING:
285
286 * The names of output files are based on sequence ids.  If giving
287   maker a multi-fasta file, it is important to verify that all
288   sequence ids are unique, so files are not overwritten.
289
290 * If there are more than 1,000 sequences in a multi-fasta file a deep
291   datastore structure will be used.  See DATASTORE in this document.
292
293 * If sequence ids contain characters that are illegal in file names,
294   those characters will be replaced automatically before building
295   output file names.
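
The first warning can be checked before a run.  The sketch below
reports sequence ids that occur more than once in a multi-fasta file.
It assumes that the id is the first whitespace-delimited word after
the '>' on each header line; duplicate_fasta_ids is a hypothetical
helper, not a MAKER script.

```python
from collections import Counter


def duplicate_fasta_ids(lines):
    """Return the sorted sequence ids that appear on more than one
    fasta header line."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # An empty header is counted under the empty id.
            counts[fields[0] if fields else ""] += 1
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)
```

For example, passing an open file handle for the genome file returns
an empty list when every id is unique.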
296
297 #---------------------------------------------------------------------
298
299 DATASTORE
300
301 "Many filesystems have performance problems with large numbers of
302 subdirectories and files within a single directory and even when the
303 underlying filesystems handle things gracefully, access via network
304 filesystems can be an issue.  The Datastore modules create a hierarchy
305 of subdirectory layers, starting from a 'base', and mapping end-user's
306 identifiers to the corresponding subdirectory." - quote from
307 http://www.yandell-lab.org/ (See site for more information on the
308 Datastore module)
309
310 A deep datastore will be used by maker if there are more than 1,000
311 sequences in a multi-fasta file.
312
313 When a datastore is implemented, the output files described above will
314 not appear where you would normally expect them to be.  Instead they
315 will be located in a series of sub-directories under a new
316 base-directory whose name is determined from the input genome file
317 name:
318
319 i.e. current_directory/genome_datastore/EE/Af/Contig1/Contig1.gff.
320
321 A master_datastore_index file will be made in the current working
322 directory to help you find the output files from each sequence.
323
324 The master_datastore_index file is a file created to allow the user to
325 easily find the exact output directory corresponding to contigs from
326 the input genome file.  The master_datastore_index file contains
327 three columns of text: the first column shows the sequence identifier
328 from each fasta header, and the second column shows the location of
329 the output files for that sequence. The third column is for logging
330 the status of data related to an individual contig. The values of the
331 third column are as follows:
332     * STARTED - Indicates that maker has started processing this
333       contig.
334     * FINISHED - Indicates that maker has finished processing this
335       contig and all data is currently available in that subdirectory.
336     * DIED - Indicates that maker failed.
337     * DIED_SKIPPED_PERMANENT - Indicates that maker failed up to the
338       specified number of retries and will not try again.
339     * RETRY - Indicates that maker is retrying the contig after a
340       failure.
341     * SKIPPED_SMALL - Indicates that this contig was skipped because
342       it is too short (based on control file values set by the user).
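
The three-column layout described above is easy to read with a small
script.  The following sketch makes two assumptions that are not
stated in this README: that the columns are separated by tabs, and
that a contig may appear on several lines as its status changes, in
which case the last line wins.  The function names are hypothetical.

```python
def latest_status(index_lines):
    """Map each sequence id to (output_directory, status), keeping the
    last status recorded for that id."""
    status = {}
    for line in index_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        seq_id, directory, state = fields[0], fields[1], fields[2]
        status[seq_id] = (directory, state)
    return status


def not_finished(index_lines):
    """Return the sorted ids whose latest status is not FINISHED."""
    latest = latest_status(index_lines)
    return sorted(seq_id for seq_id, (_, state) in latest.items()
                  if state != "FINISHED")
```

Running not_finished() on the lines of the index file gives a list of
contigs to inspect, including those marked DIED or SKIPPED_SMALL.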
343
344 #---------------------------------------------------------------------
345
346 CONFIG FILE EDITING
347
348 Lines in the maker control files have the format key:value with no
349 spaces before or after the colon(:).  If the value is a file name, you
350 can use relative paths and environmental variables,
351 i.e. genome:$HOME/my_genome.fasta
352
353
354 MAKER has 3 control files for configuration options. A fourth file
355 evaluator.ctl is used to supply a MAKER related program EVALUATOR with
356 options specific to that program (only important if 'evaluate' is set
357 to 1 in maker_opts.ctl).
358
359 Note that for all control files the comments written to help users
360 begin with a pound sign(#).  In addition, options before the colon(:)
361 can not be changed, nor should there be a space before or after the
362 colon.
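
To illustrate the key:value format, here is a minimal sketch of a
reader for control files.  It is not MAKER's own parser.  It skips
blank lines and lines that begin with a pound sign, and splits each
remaining line at the first colon; parse_ctl is a hypothetical name.

```python
def parse_ctl(lines):
    """Parse control file lines of the form key:value into a dict.
    Comment lines (starting with '#') and blank lines are ignored."""
    options = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key:value line
        options[key] = value.strip()
    return options
```

Splitting at the first colon only means that a value may itself
contain colons.  Options that are left empty in the control file come
back as empty strings.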
363
364 A. maker_exe.ctl - includes information about programs executed by
365 MAKER.
366 Here is an example of a section of the maker_exe.ctl file:
367
368 #-----Location of Executables Used by Maker/Evaluator
369
370 #location of NCBI formatdb executable   
371 formatdb:/usr/local/bin/formatdb                             
372 #location of NCBI blastall executable   
373 blastall:/usr/local/bin/blastall                             
374 #location of WUBLAST xdformat executable       
375 xdformat:/usr/local/bin/xdformat                             
376 #location of WUBLAST blastn executable 
377 blastn:/usr/local/bin/blastn                                 
378 #location of WUBLAST blastx executable 
379 blastx:/usr/local/bin/blastx                                 
380 #location of WUBLAST tblastx executable
381 tblastx:/usr/local/bin/tblastx                               
382 #location of RepeatMasker executable   
383 RepeatMasker:/home/cholt/usr/local/RepeatMasker/RepeatMasker 
384 #location of exonerate executable         
385 exonerate:/home/cholt/usr/local/exonerate/bin/exonerate       
386
387 #-----Ab-initio Gene Prediction Algorithms
388
389 #location of snap executable             
390 snap:/home/cholt/usr/local/snap/snap                 
391 #location of eukaryotic genemark executable
392 gmhmme3:/home/cholt/usr/local/gmes/gmhmme3           
393 #location of augustus executable                 
394 augustus:/home/cholt/usr/local/augustus/bin/augustus 
395 #location of fgenesh executable             
396 fgenesh:/home/cholt/usr/local/fgenesh/fgenesh         
397
398 B. maker_bopts.ctl - contains statistics for filtering blast and
399 exonerate data.
400 Here is an example of a section of the maker_bopts.ctl file:
401
402 #-----BLAST and Exonerate statistics thresholds
403 #set to 'wublast' or 'ncbi'
404 blast_type:wublast   
405 #Blastn Percent Coverage Threshold EST-Genome Alignments
406 pcov_blastn:0.8
407 #Blastn Percent Identity Threshold EST-Genome Alignments
408 pid_blastn:0.85       
409 #Blastn eval cutoff                                   
410 eval_blastn:1e-10     
411 #Blastn bit cutoff                                     
412 bit_blastn:40         
413
414 #Blastx Percent Coverage Threshold Protein-Genome Alignments
415 pcov_blastx:0.5
416 #Blastx Percent Identity Threshold Protein-Genome Alignments
417 pid_blastx:0.4       
418 #Blastx eval cutoff                                     
419 eval_blastx:1e-06     
420 #Blastx bit cutoff                                       
421 bit_blastx:30         
422
423 #Blastx Percent Coverage Threshold For Transposable Element Masking
424 pcov_rm_blastx:0.5
425 #Blastx Percent Identity Threshold For Transposable Element Masking
426 pid_rm_blastx:0.4     
427 #Blastx eval cutoff for transposable element masking             
428 eval_rm_blastx:1e-06 
429 #Blastx bit cutoff for transposable element masking             
430 bit_rm_blastx:30     
431
432 C. maker_opts.ctl - contains options for maker and external programs
433 used by maker.
434 Here is an example of a section of the maker_opts.ctl file:
435
436 #-----Genome (Required for De-Novo Annotations)
437 #genome sequence file in fasta format
438 genome:input/genome.fasta
439
440 #-----Re-annotation Options
441
442 #re-annotate genome based on this gff3 file                 
443 genome_gff:     
444 #use ests in genome_gff: 1 = yes, 0 = no                   
445 est_pass:0     
446 #use alternate organism ests in genome_gff: 1 = yes, 0 = no
447 altest_pass:0   
448 #use proteins in genome_gff: 1 = yes, 0 = no               
449 protein_pass:0 
450 #use repeats in genome_gff: 1 = yes, 0 = no                 
451 rm_pass:0       
452 #use gene models in genome_gff: 1 = yes, 0 = no             
453 model_pass:0   
454 #use ab-initio predictions in genome_gff: 1 = yes, 0 = no   
455 pred_pass:0     
456 #passthrough everything else in genome_gff: 1 = yes, 0 = no
457 other_pass:0   
458
459 #-----EST Evidence (you must provide a value for at least one)
460
461 #non-redundant set of assembled ESTs in fasta format (classic EST
462 #analysis)
463 est:input/est.fasta       
464 #un-assembled EST reads in fasta format (for deep nextgen mRNASeq)
465 est_reads:                 
466 #EST/cDNA sequence file in fasta format from an alternate organism
467 altest:input/altest.fasta 
468 #EST evidence from a separate gff3 file
469 est_gff:
470 #Alternate organism EST evidence from a separate gff3 file
471 altest_gff:               
472
473 #-----Protein Homology Evidence (you must provide a value for at least
474 #     one)
475 #protein sequence file in fasta format
476 protein:input/protein.fasta
477 #protein homology evidence from a gff3 file
478 protein_gff:
479
480 #---------------------------------------------------------------------
481
482 GFF3 Passthrough
483
484 If you have data from a source that MAKER does not support, and you
485 wish to use the data in annotating a genome, then you can pass the
486 data to MAKER as an aligned GFF3 file.  This is done by supplying the
487 file's location to the appropriate value in the maker_opts.ctl file
488 (i.e. est_gff:input/est.gff).  Note that MAKER expects all data sent
489 to it to be of the type specified, so don't put mixed data in a file
490 (i.e. don't mix EST and other data in the file pointed to by est_gff,
491 otherwise it all gets used as EST data).  Also the genome_gff option
492 is only for MAKER produced GFF3 files.  Other GFF3 files of mixed data
493 must be split by type and identified by the appropriate control file
494 option (i.e. rm_gff for repeat data, pred_gff for ab-initio prediction
495 data, est_gff for EST data, etc.).
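
One way to split a mixed GFF3 file is to group its feature lines by
the source column (column 2 of the 9 tab-separated GFF3 columns) and
write each group to its own file.  The sketch below only does the
grouping.  Deciding which source belongs under est_gff, rm_gff or
pred_gff is left to you, and split_gff3_by_source is a hypothetical
helper rather than a MAKER script.

```python
from collections import defaultdict


def split_gff3_by_source(lines):
    """Group GFF3 feature lines by their source column (column 2).
    Comment lines, directives and non-feature lines are skipped."""
    groups = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # e.g. sequence lines after a ##FASTA directive
        groups[columns[1]].append(line.rstrip("\n"))
    return dict(groups)
```

Each list in the returned dict can then be written to a separate file
and named in the matching control file option.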
496
497 #---------------------------------------------------------------------
498
499 ADDING UTRs for GBROWSE
500
501 When using APOLLO to visualize gene annotations, UTRs are inferred
502 based on exon and CDS locations.  However, GMOD and GBROWSE do not
503 infer the UTR, so to visualize the UTR, you will have to run
504 add_utr_gff.pl with the following command:
505
506 add_utr_gff.pl <directory>
507
508     * <directory> is the directory where all of your GFF files are
509                   located.
510
511 Each GFF file will have a sister file called sequence.wutr.gff3.
512
513 #---------------------------------------------------------------------
514
515 APOLLO
516
517 Maker is bundled with a configuration file that improves the color and
518 display of maker annotations and evidence in the Apollo genome
519 browser.  The configuration file is called "gff3.tiers" and is located
520 in the maker/Apollo/ directory.  The file should be copied to the
521 conf/ sub_directory which is located under the Apollo installation
522 directory.  Using the Mac version of Apollo the conf/ directory is
523 located at /Applications/Apollo.app/Contents/Resources/app/conf/.
524
525 #---------------------------------------------------------------------
526
527 HMM BUILDING (based on snap documentation)
528
529
530 A) First you will need to determine the genes used to model future
531    genes, by determining a high quality gene set (annotations for the
532    high quality genes should be in GFF3 format).  The high quality gene
533    set can then be converted into snap ZFF format using maker2zff.pl
534    found in maker/bin.
535
536    This program is run with the following command:
537
538        maker2zff.pl <directory> genome
539
540        * <directory> is the directory where all of your GFF3 files are
541          located
542        * genome is the name for the outfile
543
544    Files Created:
545        genome.ann
546        genome.dna
547
548    Note: A convenient way to identify an initial high quality gene
549    set for the HMM is to use the -predictor est2genome option in
550    maker.  This will produce gene annotations based solely on EST
551    evidence.  These annotations can then seed the first HMM.  After
552    running maker again using this new HMM and the -predictor snap
553    option, you can use the second round of annotations as the seed
554    for an even better HMM model. In this way the HMM model
555    progressively improves with each run of maker.
556
557    Another strategy for identifying an initial gene set to model the
558    HMM is to use the program CEGMA (http://korflab.ucdavis.edu/
559    software.html).  CEGMA builds a highly reliable set of gene
560    annotations in the absence of experimental data by identifying DNA
561    regions with homology to a set of 458 proteins that are highly
562    conserved among taxa.
563
564    Combining both CEGMA and maker datasets to build the first HMM is
565    also a good strategy.
566
567 B) Next you will use the dna and zff file (genome.dna and genome.ann)
568    to produce a SNAP HMM as described in the SNAP documentation (which
569    we have provided below):
570
571    The first step is to look at some features of the genes:
572
573        fathom genome.ann genome.dna -gene-stats
574
575    Next, you want to verify that the genes have no obvious errors:
576
577        fathom genome.ann genome.dna -validate
578
579    You may find some errors and warnings. Check these out in some kind
580    of genome browser and remove those that are real errors. Next,
581    break up the sequences into fragments with one gene per sequence
582    with the following command:
583
584        fathom genome.ann genome.dna -categorize 1000
585
586    There will be up to 1000 bp on either side of the genes. You will
587    find several new files.
588
589        alt.ann, alt.dna (genes with alternative splicing)
590        err.ann, err.dna (genes that have errors)
591        olp.ann, olp.dna (genes that overlap other genes)
592        wrn.ann, wrn.dna (genes with warnings)
593        uni.ann, uni.dna (single gene per sequence)
594
595    Convert the uni genes to plus stranded with the command:
596
597        fathom uni.ann uni.dna -export 1000 -plus
598
599    You will find 4 new files:
600
601        export.aa   proteins corresponding to each gene
602        export.ann  gene structure on the plus strand
603        export.dna  DNA of the plus strand
604        export.tx   transcripts for each gene
605
606    The parameter estimation program, forge, creates a lot of files.
607    You probably want to create a directory to keep things tidy before
608    you execute the program.
609
610        mkdir params
611        cd params
612        forge ../export.ann ../export.dna
613        cd ..
614
615    The last SNAP step is to build an HMM.
616
617        hmm-assembler.pl my-genome params > my-genome.hmm
618
619    Finally, you will want to add the location of your hmm file to your
620    maker_opts.ctl file.
621
622 * For more information see SNAP documentation on how to build an HMM
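
The SNAP steps in part B can be collected into a small driver script.
This is a sketch under stated assumptions: it expects fathom, forge
and hmm-assembler.pl to be on $PATH, it uses the file names from the
walkthrough above, and it does not replace the manual review of
errors and warnings after the -validate step.  The function names are
hypothetical.

```python
import os
import subprocess


def snap_training_steps(name="my-genome", ann="genome.ann",
                        dna="genome.dna"):
    """List the SNAP training commands from part B, in order.
    Each step has an argv list, a working directory and, where the
    output must be captured, a stdout file name."""
    return [
        {"argv": ["fathom", ann, dna, "-gene-stats"], "cwd": "."},
        {"argv": ["fathom", ann, dna, "-validate"], "cwd": "."},
        {"argv": ["fathom", ann, dna, "-categorize", "1000"], "cwd": "."},
        {"argv": ["fathom", "uni.ann", "uni.dna", "-export", "1000",
                  "-plus"], "cwd": "."},
        {"argv": ["forge", "../export.ann", "../export.dna"],
         "cwd": "params"},
        {"argv": ["hmm-assembler.pl", name, "params"], "cwd": ".",
         "stdout": name + ".hmm"},
    ]


def run_steps(steps):
    """Run each step in turn, stopping at the first failure."""
    for step in steps:
        os.makedirs(step["cwd"], exist_ok=True)  # creates params/
        if "stdout" in step:
            with open(step["stdout"], "w") as out:
                subprocess.run(step["argv"], cwd=step["cwd"],
                               stdout=out, check=True)
        else:
            subprocess.run(step["argv"], cwd=step["cwd"], check=True)
```

Calling run_steps(snap_training_steps()) in the directory that holds
genome.ann and genome.dna would leave my-genome.hmm there, ready to be
named in maker_opts.ctl.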