root/README

Revision 180, 23.1 kB (checked in by cholt, 8 months ago)

update documentation

1 ***MAKER Documentation***
2
3 #----------------------------------------------------
4 INSTALLATION INSTRUCTIONS FOR MAKER
5
6 *Step by step instructions are also available in the INSTALL text file.
7
8 MAKER is an annotation pipeline.  In other words it links together many steps and programs to produce final annotations.  For this reason, you must first install a number of programs that MAKER depends on.
9
10
11 To install maker, you will first need to install the following external programs:
12
13      *PERL 5.8.0 or higher
14      *BioPerl 1.5 or higher (www.bioperl.org)
15      *SNAP version 2009-02-03  or higher (homepage.mac.com/iankorf)
16      *RepeatMasker 3.1.6  or higher (www.repeatmasker.org)
17      *Exonerate 1.4  or higher (www.ebi.ac.uk/~guy/exonerate)
18
19 You must also install one of the following:
20
21      *Wu-BLAST 2.0  or higher (Wu-BLAST is becoming AB-BLAST which can not yet be downloaded)
22         or
23      *NCBI BLAST 2.2.X or higher (http://www.ncbi.nlm.nih.gov/BLAST/download.shtml)
24  
25 You might want to also install these optional external programs:
26
27      *Augustus 2.0  or higher (augustus.gobics.de)
28      *GeneMark.hmm-E 3.9 or higher (exon.biology.gatech.edu)
29      *FgenesH (www.softberry.com/) - requires licence
30
31 To install mpi_maker, you must have an mpi package installed, try the following:
32
33      *MPICH2 (http://www.mcs.anl.gov/research/projects/mpich2/)
34
35 note:  Remember to install MPICH2 with the --enable-sharedlibs flag set to the appropriate value (See MPICH2 Installer's Guide at http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs).
36
37
38 Notes:
39 1) Wu-BLAST is becoming AB-BLAST.  Once AB-BLAST becomes available we will do some testing to see if it is compatible with MAKER.  Wu-BLAST is no longer available online, so if you don't already have it, you will have to use NCBI BLAST instead.
40 2) RepeatMasker requires Wu-BLAST or Cross_Match and a single file executable called TRF (see RepeatMasker website for details), so please install these before installing RepeatMasker
41 3) Exonerate binaries can be downloaded from the website.  If you use Mac OSX, however, binaries are only available for version 1.0.  This version will work too.  If you would like to compile exonerate, it requires GLIB, a C library that is linked from the exonerate website.  If you use Mac OSX, GLIB can be downloaded using FINK.
42 4) RepeatMasker requires a repeat library file, which can be downloaded from Repbase upon registration (http://www.girinst.org/), this is explained on the RepeatMasker website.
43 5) Please note the location of all of the programs that you have installed, and add them to your $PATH variable in your .profile file.  You will need this information in the maker_exe.ctl file, one of MAKER's 3 control files.
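
As a quick check for note 5, the following is a minimal sketch (not part of MAKER) that reports which programs cannot be found on $PATH.  The function name missing_programs is hypothetical, and the program names in the usage comment are simply the executable names used in the maker_exe.ctl example in this document; adjust them to what you actually installed.

```shell
# Print each program name that cannot be found on $PATH.
# Nothing here is MAKER-specific; the caller supplies the names.
missing_programs() {
    for prog in "$@"; do
        if ! command -v "$prog" >/dev/null 2>&1; then
            printf '%s\n' "$prog"
        fi
    done
}

# Example usage (names assumed from the maker_exe.ctl example):
# missing_programs perl snap RepeatMasker exonerate blastall formatdb
```

An empty result means every listed program was found; any name printed still needs to be installed or added to $PATH.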
44
45
46 Now that you have all the necessary programs installed, MAKER can be unpacked using:
47
48 tar xvfz maker.tar.gz
49
50 This will create a directory called maker with 6 subdirectories:
51
52         bin - contains the maker executables.
53         lib - contains all the necessary perl libraries for MAKER.
54         MPI - contains MPI specific data to configure MAKER for a cluster that supports MPI.
55         Apollo - contains gff3.tiers file (See section titled APOLLO below)
56         data - contains some sample data used to make sure everything works.
57         perl - contains perl modules that need to be compiled
58
59 Finally, change to the maker/perl directory and type 'perl Install.PL' to compile the required perl modules.
60
61 Now you can run MAKER!!
62
63 Maker uses control files to guide each run.  Generic control files can be built using the -CTL flag in maker.  These control files can then be edited by the user to identify the location of all required input data and statistics.  Control files are run specific, and separate control files will need to be built for each genome given to maker.  Maker will look for control files in the current working directory, so it is recommended that maker be run in a separate directory containing unique control files for each genome.
64
65 Control files:
66
67          1. maker_exe.ctl - contains the path information for needed executables
68          2. maker_bopts.ctl - contains filtering statistics for BLAST and Exonerate
69          3. maker_opts.ctl - contains all other information for MAKER, including the location of the input genome file.
70
71
72 Always remember to examine the control files before each run of MAKER on your specific data.
73
74
75 Programs required by maker rely on certain environmental variables being set.  If you have not set these variables per the installation instructions of the external programs, a reminder list is provided below:
76
77 for tcsh:
78 setenv PERL5LIB where_bioperl_is_installed
79 setenv WUBLASTMAT where_wublast_is_installed/matrix
80 setenv ZOE where_snap_is_installed
81 setenv WUBLASTFILTER where_wublast_is_installed/filter
82 setenv AUGUSTUS_CONFIG_PATH where_augustus_is_installed/config
83
84 for bash:
85 export PERL5LIB=where_bioperl_is_installed
86 export WUBLASTMAT=where_wublast_is_installed/matrix
87 export ZOE=where_snap_is_installed
88 export WUBLASTFILTER=where_wublast_is_installed/filter
89 export AUGUSTUS_CONFIG_PATH=where_augustus_is_installed/config
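
To confirm the variables above are actually set in the current shell, a small helper along these lines may be useful.  This is a sketch only: the function name unset_vars is hypothetical, and which variables you need depends on which programs you installed (for example, the WUBLAST variables presumably matter only if you use Wu-BLAST).

```shell
# Print the name of every listed environment variable that is unset or empty.
unset_vars() {
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

# Example usage with the variable names from the reminder list:
# unset_vars PERL5LIB WUBLASTMAT ZOE WUBLASTFILTER AUGUSTUS_CONFIG_PATH
```

Only pass plain variable names to this function, since the lookup relies on eval.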
90
91
92 #----------------------------------------------------
93 MPI MAKER INSTALL
94
95 If you are running maker on an MPI capable cluster, you can install an MPI version of maker by doing the following:
96
97         1. Install standard maker and verify that it runs.
98         2. Install MPICH2 with the --enable-sharedlibs flag set to the appropriate value (See MPICH2 documentation)
99         3. Use cd to change to the MPI subdirectory in the maker installation folder (i.e. maker/MPI/)
100         4. Run Install.PL by typing:     perl Install.PL
101
102 A new version of maker called mpi_maker should now be installed under maker/bin.
103
104 To run mpi_maker, first verify that your MPI environment is initiated (e.g. using the mpdboot or mpd command).  Then start mpi_maker via mpiexec.
105
106 Example: (This will run MAKER on 3 nodes or processors)
107
108         mpiexec -n 3 perl maker_directory/maker/bin/mpi_maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl
109
110
111
112 Please see the documentation of the MPI environment you use for instructions on how to initiate an MPI process.
113
114
115 #----------------------------------------------------
116 MAKER USAGE STATEMENT
117
118 Usage:
119
120      maker [options] <maker_opts> <maker_bopts> <maker_exe> <evaluator>
121
122      Maker is a program that produces gene annotations in GFF3 file format using
123      evidence such as EST alignments and protein homology.  Maker can be used to
124      produce gene annotations for new genomes as well as update annotations from
125      existing genome databases.
126
127      The four input arguments are user control files that specify how maker
128      should behave. The evaluator options file contains control options specific
129      for the evaluation of gene annotations. All options for maker should be set
130      in the control files, but a few can also be set on the command line.
131      Command line options provide a convenient mechanism to override commonly
132      altered control file values.
133
134      Input files listed in the control options files must be in fasta format.
135      Please see maker documentation to learn more about control file
136      configuration.  Maker will automatically try and locate the user control
137      files in the current working directory if these arguments are not supplied
138      when initializing maker.
139
140      It is important to note that maker does not try to recalculate data that
141      it has already calculated.  For example, if you run an analysis twice on
142      the same dataset file you will notice that maker does not rerun any of the
143      blast analyses, but instead uses the blast analyses stored from the
144      previous run.  To force maker to rerun all analyses, use the -f flag.
145
146
147 Options:
148
149      -genome|g <filename> Specify the genome file.
150
151      -predictor|p <type>  Selects the predictor(s) to use when building
152                           annotations.  Use a ',' to separate types (no spaces).
153                           i.e. -predictor=snap,augustus,fgenesh
154
155                           types: snap
156                                  augustus
157                                  fgenesh
158                                  genemark
159                                  est2genome (Uses ESTs directly)
160                                  abinit (ab-initio predictions)
161                                  model_gff (Passes through GFF3 annotations)
162
163      -RM_off|R           Turns all repeat masking off.
164
165      -retry   <integer>  Rerun failed contigs up to the specified count.
166
167      -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis.
168
169      -force|f            Forces maker to delete old files before running again.
170                          This will require all blast analyses to be rerun.
171
172      -evaluate|e         Run Evaluator on final annotations (under development).
173
174      -quiet|q            Silences most of maker's status messages.
175
176      -CTL                Generate empty control files in the current directory.
177
178      -help|?             Prints this usage statement.
179
180
181 #----------------------------------------------------
182 RUNNING MAKER WITH EXAMPLE DATA
183
184 1) Copy the files in the data directory to a temporary directory where you will run the example.
185 2) Type maker -CTL to generate generic maker control files
186 3) Next you will need to edit the control files to include the path of the genome file, EST file, and protein file, as well as the paths to all required executables.  See CONFIG FILE EDITING for more information.
187 4) Then try the following command from your temporary directory:
188
189 perl maker_directory/bin/maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl
190
191 MAKER will create at least the following files/directories:
192
193 XXX.maker.output/ - contains all output for a given run of maker
194 XXX.maker.output/XXX_master_datastore_index.log - log of MAKER run progress as well as an index for traversing XXX.maker.output/XXX_datastore/
195 XXX.maker.output/XXX_datastore/ - contains folders containing the output for each individual contig of the input fasta file
196 *Within these folders
197         seq_name.gff - a gff file that can be loaded into GMOD, GBROWSE, or Apollo
198         seq_name.maker.transcripts.fasta - a file of the maker transcript sequences
199         seq_name.maker.proteins.fasta - a file of the maker protein sequences
200         seq_name.maker.XXX.transcript.fasta - a file of ab-initio transcript sequences from program XXX
201         seq_name.maker.XXX.proteins.fasta - a file of ab-initio protein sequences from program XXX
202         seq_name.maker.non_overlapping_ab_initio.transcripts.fasta - a file of filtered ab-initio transcript sequences that don't overlap annotations
203         seq_name.maker.non_overlapping_ab_initio.proteins.fasta - a file of filtered ab-initio protein sequences that don't overlap annotations
204         theVoid.seq_name/ - a directory containing all of the raw output files produced by maker, including BLAST reports, SNAP output, exonerate output and the masked sequence
205
206 WARNING:
207 *The names of output files are based on sequence ids.  If giving maker a multi-fasta file, it is important to verify that all sequence ids are unique, so files are not overwritten.
208 *If there are more than 1,000 sequences in a multi-fasta file, a deep datastore structure will be used.  See DATASTORE in this document.
209 *If sequence ids contain characters that are illegal in file names, those characters will be replaced automatically before building output file names.
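
The uniqueness warning above can be checked before a run with a short awk one-liner.  This is a sketch, not a MAKER tool: the function name duplicate_fasta_ids is hypothetical, and it assumes the sequence id is the header text between '>' and the first whitespace.

```shell
# Print every sequence id that occurs more than once in a FASTA file.
# The id is taken to be the header text after '>' up to the first whitespace.
duplicate_fasta_ids() {
    awk '/^>/ { id = substr($1, 2); if (++seen[id] == 2) print id }' "$1"
}

# Example usage:
# duplicate_fasta_ids my_genome.fasta
```

No output means all ids are unique under that assumption.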
210
211 #----------------------------------------------------
212 DATASTORE
213
214 "Many filesystems have performance problems with large numbers of subdirectories and files within a single directory and even when the underlying filesystems handle things gracefully, access via network filesystems can be an issue.  The Datastore modules create a hiearchy of subdirectory layers, starting from a 'base', and mapping end-user's identifiers to the corresponding subdirectory." - quote from http://www.yandell-lab.org/  (See site for more information on the Datastore module)
215
216 A deep datastore will be used by maker if there are more than 1,000 sequences in a multi-fasta file.
217
218 When a datastore is implemented, the output files described above will not appear where you would normally expect them to be.  Instead they will be located in a series of sub-directories under a new base-directory whose name is determined from the input genome file name, i.e. current_working_directory/genome_datastore/EE/Af/Contig1/Contig1.gff.  A master_datastore_index file will be made in the current working directory to help you find the output files from each sequence.
219
220 The master_datastore_index file is a file created to allow the user to easily find the exact output directory corresponding to contigs from the input genome file.  The master_datastore_index file contains three columns of text: the first column shows the sequence identifier from each fasta header, and the second column shows the location of the output files for that sequence.  The third column is for logging the status of data related to an individual contig.  The values of the third column are as follows:
221         STARTED - Indicates that maker has started processing this contig.
222         FINISHED - Indicates that maker has finished processing this contig and all data is currently available in that subdirectory.
223         DIED - Indicates that maker failed.
224         DIED_SKIPPED_PERMANENT - Indicates that maker failed up to the specified number of retries and will not try again.
225         RETRY - Indicates that maker is retrying the contig after a failure.
226         SKIPPED_SMALL - Indicates that this contig was skipped because it is too short (based on control file values set by the user)
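
For large runs it can help to list the contigs whose most recent status is something other than FINISHED.  The sketch below is an illustration only: the function name unfinished_contigs is hypothetical, and it assumes the three columns are tab-separated and that a later line for the same sequence id supersedes an earlier one (e.g. STARTED followed by FINISHED).  Check the layout of your own index file before relying on it.

```shell
# List contigs whose latest logged status is not FINISHED.
# Assumes tab-separated columns: id, output location, status.
unfinished_contigs() {
    awk -F '\t' '
        { last[$1] = $3 }    # later lines overwrite earlier ones for the same id
        END {
            for (id in last)
                if (last[id] != "FINISHED") print id "\t" last[id]
        }' "$1" | sort
}

# Example usage:
# unfinished_contigs XXX.maker.output/XXX_master_datastore_index.log
```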
227
228
229 #----------------------------------------------------
230 CONFIG FILE EDITING
231
232 Lines in the maker control files have the format key:value with no spaces before or after the colon(:).  If the value is a file name, you can use relative paths and environmental variables, i.e. genome:$HOME/my_genome.fasta
233
234
235 MAKER has 3 control files for configuration options. A fourth file evaluator.ctl is used to supply a MAKER related program EVALUATOR with options specific to that program (only important if 'evaluate' is set to 1 in maker_opts.ctl).
236
237 Note that for all control files the comments written to help users begin with a pound sign(#).  In addition, options before the colon(:) can not be changed, nor should there be a space before or after the colon.
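
To illustrate the key:value format, here is a minimal sketch of a reader that prints the value stored under one key.  It is not MAKER's own parser: the function name ctl_value is hypothetical, it splits on the first colon only, and it treats everything from the first '#' onward as a comment, so a value that itself contains '#' would be cut short.

```shell
# Print the value stored under a key in a key:value control file.
# $1 = control file, $2 = key
ctl_value() {
    awk -v key="$2" '
        /^[ \t]*#/ { next }                     # skip comment-only lines
        {
            pos = index($0, ":")                # split on the first colon only
            if (pos == 0 || substr($0, 1, pos - 1) != key) next
            value = substr($0, pos + 1)
            sub(/[ \t]*#.*$/, "", value)        # drop the trailing comment
            sub(/[ \t]+$/, "", value)           # drop trailing whitespace
            print value
            exit
        }' "$1"
}

# Example usage:
# ctl_value maker_opts.ctl genome
```

This is handy for confirming, before a run, that the paths recorded in each control file are the ones you intended.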
238
239 A. maker_exe.ctl - includes information about programs executed by MAKER.
240 Here is an example of a section of the maker_exe.ctl file:
241 ====================================
242 #-----Location of Executables Used by Maker/Evaluator
243 formatdb:/usr/local/bin/formatdb                              #location of NCBI formatdb executable
244 blastall:/usr/local/bin/blastall                              #location of NCBI blastall executable
245 xdformat:/usr/local/bin/xdformat                              #location of WUBLAST xdformat executable
246 blastn:/usr/local/bin/blastn                                  #location of WUBLAST blastn executable
247 blastx:/usr/local/bin/blastx                                  #location of WUBLAST blastx executable
248 tblastx:/usr/local/bin/tblastx                                #location of WUBLAST tblastx executable
249 RepeatMasker:/home/cholt/usr/local/RepeatMasker/RepeatMasker  #location of RepeatMasker executable
250 exonerate:/home/cholt/usr/local/exonerate/bin/exonerate       #location of exonerate executable
251
252 #-----Ab-initio Gene Prediction Algorithms
253 snap:/home/cholt/usr/local/snap/snap                  #location of snap executable
254 gmhmme3:/home/cholt/usr/local/gmes/gmhmme3            #location of eukaryotic genemark executable
255 augustus:/home/cholt/usr/local/augustus/bin/augustus  #location of augustus executable
256 fgenesh:/home/cholt/usr/local/fgenesh/fgenesh         #location of fgenesh executable
257
258 ====================================
259
260
261 B. maker_bopts.ctl - contains statistics for filtering blast and exonerate data
262 Here is an example of a section of the maker_bopts.ctl file:
263 ====================================
264
265 #-----BLAST and Exonerate statistics thresholds
266 blast_type:wublast    #set to 'wublast' or 'ncbi'
267
268 pcov_blastn:0.8       #Blastn Percent Coverage Threshold EST-Genome Alignments
269 pid_blastn:0.85       #Blastn Percent Identity Threshold EST-Genome Alignments
270 eval_blastn:1e-10     #Blastn eval cutoff
271 bit_blastn:40         #Blastn bit cutoff
272
273 pcov_blastx:0.5       #Blastx Percent Coverage Threshold Protein-Genome Alignments
274 pid_blastx:0.4        #Blastx Percent Identity Threshold Protein-Genome Alignments
275 eval_blastx:1e-06     #Blastx eval cutoff
276 bit_blastx:30         #Blastx bit cutoff
277
278 pcov_rm_blastx:0.5    #Blastx Percent Coverage Threshold For Transposable Element Masking
279 pid_rm_blastx:0.4     #Blastx Percent Identity Threshold For Transposable Element Masking
280 eval_rm_blastx:1e-06  #Blastx eval cutoff for transposable element masking
281 bit_rm_blastx:30      #Blastx bit cutoff for transposable element masking
282 ====================================
283
284
285 C. maker_opts.ctl - contains options for maker and external programs used by maker
286 Here is an example of a section of the maker_opts.ctl file:
287 ====================================
288 #-----Genome (Required for De-Novo Annotations)
289 genome:input/genome.fasta  #genome sequence file in fasta format
290
291 #-----Re-annotation Options
292 genome_gff:     #re-annotate genome based on this gff3 file
293 est_pass:0      #use ests in genome_gff: 1 = yes, 0 = no
294 altest_pass:0   #use alternate organism ests in genome_gff: 1 = yes, 0 = no
295 protein_pass:0  #use proteins in genome_gff: 1 = yes, 0 = no
296 rm_pass:0       #use repeats in genome_gff: 1 = yes, 0 = no
297 model_pass:0    #use gene models in genome_gff: 1 = yes, 0 = no
298 pred_pass:0     #use ab-initio predictions in genome_gff: 1 = yes, 0 = no
299 other_pass:0    #passthrough everything else in genome_gff: 1 = yes, 0 = no
300
301 #-----EST Evidence (you must provide a value for at least one)
302 est:input/est.fasta        #non-redundant set of assembled ESTs in fasta format (classic EST analysis)
303 est_reads:                 #un-assembled EST reads in fasta format (for deep nextgen mRNASeq)
304 altest:input/altest.fasta  #EST/cDNA sequence file in fasta format from an alternate organism
305 est_gff:                   #EST evidence from a separate gff3 file
306 altest_gff:                #Alternate organism EST evidence from a separate gff3 file
307
308 #-----Protein Homology Evidence (you must provide a value for at least one)
309 protein:input/protein.fasta  #protein sequence file in fasta format
310 protein_gff:                 #protein homology evidence from a gff3 file
311 ====================================
312
313 #----------------------------------------------------
314 GFF3 Passthrough
315
316 If you have data from a source that MAKER does not support, and you wish to use the data in annotating a genome, then you can pass the data to MAKER as an aligned GFF3 file.  This is done by supplying the file's location to the appropriate option in the maker_opts.ctl file (i.e. est_gff:input/est.gff).  Note that MAKER expects all data sent to it to be of the type specified, so don't put mixed data in a file (i.e. don't mix EST and other data in the file pointed to by est_gff, otherwise it all gets used as EST data).  Also, the genome_gff option is only for MAKER produced GFF3 files.  Other GFF3 files of mixed data must be split by type and identified by the appropriate control file option (i.e. rm_gff for repeat data, pred_gff for ab-initio prediction data, est_gff for EST data, etc.).
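
One possible way to split a mixed GFF3 file is to select rows by the source column (column 2), writing one output file per kind of evidence.  This is a sketch under a stated assumption: that the different data types in your file can be told apart by their source column.  The function name gff_rows_by_source and the source names in the usage comment are hypothetical; inspect your file to see which values it really contains.

```shell
# Print the feature rows of a GFF3 file whose source column (column 2)
# equals the given value.  Comment and directive lines are skipped.
# $1 = GFF3 file, $2 = source value to keep
gff_rows_by_source() {
    awk -F '\t' -v src="$2" '/^#/ { next } $2 == src' "$1"
}

# Example usage (source names are placeholders):
# gff_rows_by_source mixed.gff my_est_aligner > input/est.gff
# gff_rows_by_source mixed.gff my_repeat_finder > input/repeats.gff
```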
317
318 #----------------------------------------------------
319 ADDING UTRs for GBROWSE
320
321 * When using APOLLO to visualize gene annotations, UTRs are inferred based on exon and CDS locations.  However, GMOD and GBROWSE do not infer the UTR, so to visualize the UTR, you will have to run add_utr_gff.pl with the following command:
322
323 add_utr_gff.pl <directory>
324 <directory> is the directory where all of your GFF files are located
325
326 Each GFF file will have a sister file called sequence.wutr.gff3
327
328
329 #----------------------------------------------------
330 APOLLO
331
332 Maker is bundled with a configuration file that improves the color and display of maker annotations and evidence in the Apollo genome browser.  The configuration file is called "gff3.tiers" and is located in the maker/Apollo/ directory.  The file should be copied to the conf/ subdirectory, which is located under the Apollo installation directory.  In the Mac version of Apollo, the conf/ directory is located at /Applications/Apollo.app/Contents/Resources/app/conf/.
333
334
335 #----------------------------------------------------
336 HMM BUILDING (based on snap documentation)
337
338 A.  First you will need to choose the genes used to model future genes by determining a high quality gene set (annotations for the high quality gene set should be in GFF3 format).  The high quality gene set can then be converted into snap ZFF format using maker2zff.pl found in maker/bin.
339
340 This program is run with the following command:
341
342       maker2zff.pl <directory> genome
343
344 *<directory> is the directory where all of your GFF3 files are located
345 *genome is the name for the outfile
346
347 Files Created:
348
349       genome.ann
350       genome.dna
351
352 Note:  A convenient way to identify an initial high quality gene set for the HMM is to use the -predictor est2genome option in maker.  This will produce gene annotations based solely on EST evidence.  These annotations can then seed the first HMM.  After running maker again using this new HMM and the -predictor snap option, you can use the second round of annotations as the seed for an even better HMM.  In this way the HMM progressively improves with each run of maker.
353
354 Another strategy for identifying an initial gene set to model the HMM is to use the program CEGMA (http://korflab.ucdavis.edu/software.html).  CEGMA builds a highly reliable set of gene annotations in the absence of experimental data by identifying DNA regions with homology to a set of 458 proteins that are highly conserved among taxa.
355
356 Combining both CEGMA and maker datasets to build the first HMM is also a good strategy.
357
358
359 B.  Next you will use the dna and zff files (genome.dna and genome.ann) to produce a SNAP HMM as described in the SNAP documentation (which we have provided below):
360
361 The first step is to look at some features of the genes:
362
363     fathom genome.ann genome.dna -gene-stats
364
365 Next, you want to verify that the genes have no obvious errors:
366
367     fathom genome.ann genome.dna -validate
368
369 You may find some errors and warnings. Check these out in some kind of genome
370 browser and remove those that are real errors. Next, break up the sequences into
371 fragments with one gene per sequence with the following command:
372
373     fathom genome.ann genome.dna -categorize 1000
374
375 There will be up to 1000 bp on either side of the genes. You will find
376 several new files.
377
378     alt.ann, alt.dna (genes with alternative splicing)
379     err.ann, err.dna (genes that have errors)
380     olp.ann, olp.dna (genes that overlap other genes)
381     wrn.ann, wrn.dna (genes with warnings)
382     uni.ann, uni.dna (single gene per sequence)
383
384 Convert the uni genes to plus stranded with the command:
385
386     fathom uni.ann uni.dna -export 1000 -plus
387
388 You will find 4 new files:
389
390     export.aa   proteins corresponding to each gene
391     export.ann  gene structure on the plus strand
392     export.dna  DNA of the plus strand
393     export.tx   transcripts for each gene
394
395 The parameter estimation program, forge, creates a lot of files. You probably
396 want to create a directory to keep things tidy before you execute the program.
397
398     mkdir params
399     cd params
400     forge ../export.ann ../export.dna
401     cd ..
402
403 Last is to build an HMM.
404
405     hmm-assembler.pl my-genome params > my-genome.hmm
406
407
408 Lastly, you will want to add the location of your hmm file to your maker_opts.ctl file.
409
410 *For more information see SNAP documentation on how to build an HMM
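
The training steps in part B can be collected into a single command list.  The sketch below only prints the commands described in this section for a given file prefix and HMM name, so they can be reviewed before anything is run; the function name snap_training_commands is hypothetical and is not part of SNAP or MAKER.

```shell
# Print the SNAP training commands from this section.
# $1 = prefix of the .ann/.dna pair (e.g. genome), $2 = name for the HMM
snap_training_commands() {
    cat <<EOF
fathom $1.ann $1.dna -gene-stats
fathom $1.ann $1.dna -validate
fathom $1.ann $1.dna -categorize 1000
fathom uni.ann uni.dna -export 1000 -plus
mkdir params
cd params
forge ../export.ann ../export.dna
cd ..
hmm-assembler.pl $2 params > $2.hmm
EOF
}

# Example usage: review the list, then run it if it looks right.
# snap_training_commands genome my-genome
# snap_training_commands genome my-genome | sh -e
```

Running the list unattended skips the manual review of errors and warnings after the -validate step, so it is probably best suited to repeat rounds of training on a gene set you have already inspected.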