root/README

Revision 227, 18.2 kB (checked in by cholt, 5 months ago)

fix documentation, fix te_proteins

1 ***MAKER Documentation***
2
3 #---------------------------------------------------------------------
4
5 INSTALLATION INSTRUCTIONS FOR MAKER
6
7 *Step by step instructions are also available in the INSTALL text
8 file.
9
10 MAKER is an annotation pipeline.  In other words it links together
11 many steps and programs to produce final annotations.  For this
12 reason, you must first install a number of programs that MAKER depends
13 on.
14
15
16 To install MAKER, you will first need to install the following
17 external programs:
18
19     *PERL 5.8.0 or higher
20     *BioPerl 1.5 or higher (www.bioperl.org)
21     *SNAP version 2009-02-03  or higher (homepage.mac.com/iankorf)
22     *RepeatMasker 3.1.6  or higher (www.repeatmasker.org)
23     *Exonerate 1.4  or higher (www.ebi.ac.uk/~guy/exonerate)
24
25 You must also install one of the following:
26
27     *Wu-BLAST 2.0 or higher
28      (Wu-BLAST is becoming AB-BLAST, which cannot yet be downloaded)
29         or
30     *NCBI BLAST 2.2.X or higher
31      (http://www.ncbi.nlm.nih.gov/BLAST/download.shtml)
32  
33 You might want to also install these optional external programs:
34
35     *Augustus 2.0 or higher (augustus.gobics.de)
36     *GeneMark-ES (exon.biology.gatech.edu)
37     *FGENESH 2.6 or higher (www.softberry.com) - requires licence
38
39 To install mpi_maker, you must have an MPI package installed.  Try the
40 following:
41
42     *MPICH2 (http://www.mcs.anl.gov/research/projects/mpich2/)
43
44 Note: Remember to install MPICH2 with the --enable-sharedlibs flag set
45 to the appropriate value.  See MPICH2 Installer's Guide at:
46 http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs
47
48 Also see the mpi_maker installation instructions found further below.
49
50
51 Notes:
52
53 1) Wu-BLAST is becoming AB-BLAST.  Once AB-BLAST becomes available we
54    will do some testing to see if it is compatible with MAKER.
55    Wu-BLAST is no longer available online, so if you don't already
56    have it, you will have to use NCBI BLAST instead.
57 2) RepeatMasker requires Wu-BLAST or Cross_Match and a single file
58    executable called TRF (see RepeatMasker website for details), so
59    please install these before installing RepeatMasker
60 3) Exonerate binaries can be downloaded from the website.  If you use
61    Mac OSX, however, binaries are only available for version 1.0.
62    This version will work too.  If you would like to compile Exonerate,
63    it requires GLIB, a C library that is linked from the Exonerate
64    website.  If you use Mac OSX, GLIB can be downloaded using FINK.
65 4) RepeatMasker requires a repeat library file, which can be
66    downloaded from Repbase upon registration
67    (http://www.girinst.org/).  This is explained on the RepeatMasker
68    website.
69 5) Please note the location of all of the programs that you have
70    installed, and add them to your $PATH variable in your .profile
71    file.  You will need this information in the maker_exe.ctl file, one
72    of MAKER's 3 control files.
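
As an illustration of note 5, the lines below sketch one way to add
program directories to $PATH from a .profile file.  The helper name
path_append and the example directories are hypothetical; substitute
the locations where you actually installed each program.

```shell
# Append a directory to PATH only if it is not already listed.
# path_append is an illustrative helper, not part of MAKER.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                  # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

# Example lines for ~/.profile (placeholder install locations):
# path_append /usr/local/snap
# path_append /usr/local/RepeatMasker
# path_append /usr/local/exonerate/bin
# export PATH
```

Checking for an existing entry keeps $PATH from growing every time
the .profile file is re-read.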
73
74 Now that you have all the necessary programs installed, MAKER can be
75 unpacked using:
76
77  tar -zxvf maker.tar.gz
78
79 This will create a directory called MAKER with 6 sub directories:
80
81     bin    - contains the MAKER executables.
82     lib    - contains all the necessary perl libraries for MAKER.
83     MPI    - contains MPI specific data to configure MAKER for a
84              cluster that supports MPI.
85     Apollo - contains gff3.tiers file (See section titled APOLLO
86              below)
87     data   - contains some sample data used to make sure everything
88              works.
89     perl   - contains perl modules that need to be compiled
90
91 Finally change to the maker/perl directory and type: 'perl Install.PL'
92 to compile required perl modules.
93
94 Now you can run MAKER!!
95
96 Programs required by MAKER rely on certain environmental variables
97 being set.  If you have not set these variables per the installation
98 instructions of the external programs, a reminder list is provided
99 below:
100
101 for tcsh:
102 setenv PERL5LIB where_bioperl_is_installed
103 setenv WUBLASTMAT where_wublast_is_installed/matrix
104 setenv ZOE where_snap_is_installed
105 setenv WUBLASTFILTER where_wublast_is_installed/filter
106 setenv AUGUSTUS_CONFIG_PATH where_augustus_is_installed/config
107
108 for bash:
109 export PERL5LIB=where_bioperl_is_installed
110 export WUBLASTMAT=where_wublast_is_installed/matrix
111 export ZOE=where_snap_is_installed
112 export WUBLASTFILTER=where_wublast_is_installed/filter
113 export AUGUSTUS_CONFIG_PATH=where_augustus_is_installed/config
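
Before running MAKER it can help to confirm that the variables above
are actually set in the current shell.  The following is a minimal
sketch; the function name check_maker_env is hypothetical, and you
should pass only the variables that apply to the programs you
installed.

```shell
# Report which of the named environment variables are unset or empty.
# Prints one line per missing variable; returns non-zero if any is missing.
# check_maker_env is an illustrative helper, not part of MAKER.
check_maker_env() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"      # look up the variable by name
        if [ -z "$value" ]; then
            echo "not set: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
# check_maker_env PERL5LIB ZOE AUGUSTUS_CONFIG_PATH
```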
114
115 #---------------------------------------------------------------------
116
117 MPI MAKER INSTALL
118
119 If you are running MAKER on an MPI capable cluster, you can install an
120 MPI version of MAKER by doing the following:
121
122     1. Install standard MAKER and verify that it runs.
123     2. Install MPICH2 with the --enable-sharedlibs flag set to the
124        appropriate value (See MPICH2 documentation)
125     3. Use cd to change to the MPI subdirectory in the MAKER
126        installation folder (i.e. maker/MPI/)
127     4. Run Install.PL by typing: perl Install.PL
128
129 A new version of MAKER called mpi_maker should now be installed under
130 maker/bin.
131
132 To run mpi_maker, first verify that your MPI environment is initiated
133 (i.e. using the mpdboot or mpd command).  Now start mpi_maker via
134 mpiexec.
135
136 Example: (This will run MAKER on 4 nodes or processors)
137
138 mpiexec -n 4 mpi_maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl
139
140 Please see the documentation of the MPI environment you use for
141 instructions on how to initiate an MPI process.
142
143 #---------------------------------------------------------------------
144
145 RUNNING MAKER WITH EXAMPLE DATA
146
147 1) Copy the files in the data directories to a temporary directory
148    where you will run an example file.
149 2) Type maker -CTL to generate generic MAKER control files
150 3) Next you will need to edit the control files to include the path of
151    the genome file, EST file, and protein file, as well as the paths
152    to all required executables.  See CONTROL FILE EDITING for more
153    information.
154 4) Then try the following command from your temporary directory:
155
156        maker maker_exe.ctl maker_opts.ctl maker_bopts.ctl
157
158 5) Examine the output files.  See MAKER OUTPUT and APOLLO sections.
159
160 #---------------------------------------------------------------------
161
162 CONTROL FILE EDITING
163
164 MAKER uses control files to guide each run.  Generic control files can
165 be built using the -CTL flag in maker.  These control files can then
166 be edited by the user to identify the location of all required input
167 data and statistics.  Control files are run specific, and separate
168 control files will need to be built for each genome given to MAKER.
169 MAKER will look for control files in the current working directory, so
170 it is recommended that MAKER be run in a separate directory
171 containing unique control files for each genome.
172
173 Control files:
174
175     1. maker_exe.ctl - contains the path information for needed
176        executables.
177
178     2. maker_bopts.ctl - contains filtering statistics for BLAST and
179        Exonerate.
180
181     3. maker_opts.ctl - contains all other information for MAKER,
182        including the location of the input genome file.
183
184
185 Remember to examine the control files before each run of MAKER on your
186 specific data.
187
188 Lines in the MAKER control files have the format key:value with no
189 spaces before or after the colon (:).  If the value is a file name, you
190 can use relative paths and environmental variables,
191 e.g. genome:$HOME/my_genome.fasta
192
193 Note that for all control files the comments written to help users
194 begin with a pound sign (#).  In addition, the option names before the
195 colon (:) cannot be changed, nor should there be a space before or
196 after the colon.
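
The formatting rules above can be checked mechanically before a run.
The sketch below flags lines with no colon or with a space or tab next
to the first colon, skipping blank lines and lines that begin with #.
The function name check_ctl_format is hypothetical and the check
covers only the rules stated here, not whether option names are valid.

```shell
# Flag control file lines that break the key:value format.
# Exits non-zero if any problem line is found.
# check_ctl_format is an illustrative helper, not part of MAKER.
check_ctl_format() {
    awk '
        /^[ \t]*#/ || /^[ \t]*$/ { next }     # skip comments and blank lines
        !/:/ {
            printf "line %d: no colon: %s\n", NR, $0
            bad = 1
            next
        }
        /^[^:]*[ \t]:/ || /^[^:]*:[ \t]/ {    # whitespace beside first colon
            printf "line %d: space next to colon: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example:
# check_ctl_format maker_opts.ctl
```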
197
198 A. maker_exe.ctl - includes information about programs executed by
199 MAKER.
200
201 Here is an example of sections of the maker_opts.ctl file:
202
203 #-----Genome
204 genome:/fastas/genome.fasta
205
206 #-----EST Evidence
207 est:/fastas/est.fasta
208 altest:/fastas/alt_est.fasta
209
210 #-----Protein Homology Evidence
211 protein:protein.fasta
212
213 #-----MAKER Specific Options
214 evaluate:0
215 max_dna_len:100000
216 min_contig:1
217 min_protein:0
218 split_hit:10000
219 pred_flank:200
220 single_exon:0
221 single_length:250
222 keep_preds:0
223 map_forward:0
224 retry:1
225 clean_try:0
226 clean_up:0
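
When the same options must be changed for several genomes, editing
the control files by script can be less error prone than editing them
by hand.  The following is a minimal sketch that replaces the value of
one existing key.  The function name set_ctl_value is hypothetical,
and the sketch assumes each option sits on its own line that starts
with key: exactly as in the example above; the whole line is
rewritten.

```shell
# Replace the value of an existing key in a control file.
# Usage: set_ctl_value <file> <key> <value>
# Returns non-zero and leaves the file untouched if the key is absent.
# set_ctl_value is an illustrative helper, not part of MAKER.
set_ctl_value() {
    file=$1 key=$2 value=$3
    tmp="$file.tmp.$$"
    CTL_KEY="$key" CTL_VALUE="$value" awk '
        BEGIN { k = ENVIRON["CTL_KEY"]; v = ENVIRON["CTL_VALUE"] }
        index($0, k ":") == 1 { print k ":" v; found = 1; next }
        { print }
        END { exit !found }
    ' "$file" > "$tmp" && mv "$tmp" "$file" || { rm -f "$tmp"; return 1; }
}

# Example:
# set_ctl_value maker_opts.ctl genome "$HOME/my_genome.fasta"
```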
227
228
229 #---------------------------------------------------------------------
230
231 MAKER OUTPUT
232
233 MAKER will create at least the following files/directories:
234
235 1) XXX.maker.output/ - contains all output for a given run of MAKER.
236 2) XXX.maker.output/XXX_datastore/ - contains subdirectories that hold
237    the output for each individual contig of the input fasta file.  See
238    DATASTORE DIRECTORY STRUCTURE section.
239 3) XXX.maker.output/XXX_master_datastore_index.log - log of MAKER run
240    progress as well as an index for traversing through the output
241    data structure.
242 4) XXX.maker.output/mpi_blastdb/ - Contains fasta indexes and error
243    corrected fasta files built from the EST and protein databases
244    provided by the user.
245 5) maker_opt.log, maker_exe.log, maker_bopts.log - These are logs of the
246    control files used for this run of MAKER.
247 6) XXX.maker.output/XXX.db - Database of GFF3 files provided by the
248    user.  See GFF3 PASSTHROUGH section.
249
250 Within the XXX_datastore/ subdirectories:
251     * seq_name.gff - a gff file that can be loaded into GMOD, GBROWSE,
252       or Apollo
253     * seq_name.maker.transcripts.fasta - a fasta file of the MAKER
254       annotated transcript sequences
255     * seq_name.maker.proteins.fasta - a fasta file of the MAKER
256       annotated protein sequences
257     * seq_name.maker.XXX.transcript.fasta - a fasta file of ab-initio
258       predicted transcript sequences from program XXX
259     * seq_name.maker.XXX.proteins.fasta - a fasta file of ab-initio
260       predicted protein sequences from program XXX
261     * seq_name.maker.non_overlapping_ab_initio.transcripts.fasta - a
262       fasta file of filtered ab-initio transcript sequences that don't
263       overlap maker annotations
264     * seq_name.maker.non_overlapping_ab_initio.proteins.fasta - a
265       fasta file of filtered ab-initio protein sequences that don't
266       overlap maker annotations
267     * theVoid.seq_name/ - a directory containing all of the raw
268       output files produced by MAKER, including BLAST reports, SNAP
269       output, Exonerate output and the masked genomic sequence.
270
271 WARNING:
272
273 * The names of output files are based on sequence ids.  If giving
274   MAKER a multi-fasta file, it is important to verify that all
275   sequence ids are unique, so files are not overwritten.
276
277 * If there are more than 1,000 sequences in a multi-fasta file a deep
278   datastore structure will be used.  See the DATASTORE DIRECTORY
279   STRUCTURE section in this document.
280
281 * If sequence ids contain characters that are illegal in file names,
282   those characters will be replaced automatically before building
283   output file names.
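
The first two warnings can be checked on the input file before
starting a run.  The sketch below counts the sequences in a
multi-fasta file and lists any ids that occur more than once.  The
function names are hypothetical, and the sketch assumes the sequence
id is the first word of each header line (the text directly after
the ">").

```shell
# Count the sequences in a multi-fasta file (one header line each).
count_fasta_seqs() {
    grep -c '^>' "$1"
}

# Print every sequence id that appears more than once.
# Assumes the id is the first whitespace-delimited word of the header.
duplicate_fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort | uniq -d
}

# Example:
# count_fasta_seqs genome.fasta      # more than 1000 means a deep datastore
# duplicate_fasta_ids genome.fasta   # no output means all ids are unique
```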
284
285
286 #---------------------------------------------------------------------
287
288 DATASTORE DIRECTORY STRUCTURE
289
290 Many filesystems have performance problems with large numbers of
291 subdirectories and files within a single directory, and even when the
292 underlying filesystems handle things gracefully, access via network
293 filesystems can be an issue.  You can imagine that the amount of
294 output produced while annotating an entire genome can be quite
295 overwhelming to the file system.  To deal with all the output files,
296 MAKER uses a Datastore module to create a hierarchy of subdirectory
297 layers, starting from a 'base', and mapping identifiers to
298 corresponding subdirectories.
299
300 A deep datastore will be used by MAKER if there are more than 1,000
301 sequences in a multi-fasta file.  When a deep datastore is
302 implemented, MAKER output files will not appear where you would
303 normally expect them to be.  Instead they will be located in a series
304 of subdirectories under a new base directory whose name is determined
305 based on the input genome file name:
306
307 EXAMPLE: current_directory/fly_datastore/EE/Af/Contig1/Contig1.gff
308
309 To help you locate output files, a master_datastore_index file is
310 created which lists the exact output directory corresponding to each
311 contig from the input genome file.  The master_datastore_index
312 file contains three columns of text: the first column shows the
313 sequence identifier from each fasta header, and the second column
314 shows the location of the output files for that sequence. The third
315 column is for logging the status of data related to an individual
316 contig. The values of the third column are as follows:
317     * STARTED - Indicates that MAKER has started processing this
318       contig.
319     * FINISHED - Indicates that MAKER has finished processing this
320       contig and all data is currently available in that subdirectory.
321     * DIED - Indicates that MAKER failed on this contig.
322     * DIED_SKIPPED_PERMANENT - Indicates that MAKER failed up to the
323       specified number of retries and will not try again.
324     * RETRY - Indicates that MAKER is retrying the contig after a
325       failure.
326     * SKIPPED_SMALL - Indicates that this contig was skipped because
327       it is too short (based on control file values set by the user).
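
After a run, the status column makes it easy to find contigs that
need attention.  The sketch below prints each contig whose most
recent status is anything other than FINISHED.  The function name
unfinished_contigs is hypothetical, and the sketch assumes the three
columns are separated by whitespace, that paths contain no spaces,
and that later lines in the file reflect later states of a contig.

```shell
# List contigs whose last recorded status is not FINISHED.
# Column 1 is the sequence id; the last column is the status.
# unfinished_contigs is an illustrative helper, not part of MAKER.
unfinished_contigs() {
    awk '
        NF >= 3 {
            if (!($1 in last)) order[++n] = $1   # remember first-seen order
            last[$1] = $NF                       # keep the latest status
        }
        END {
            for (i = 1; i <= n; i++)
                if (last[order[i]] != "FINISHED")
                    print order[i], last[order[i]]
        }
    ' "$1"
}

# Example:
# unfinished_contigs XXX.maker.output/XXX_master_datastore_index.log
```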
328
329
330 #---------------------------------------------------------------------
331
332 GFF3 PASSTHROUGH
333
334 If you have data from a source that MAKER does not support, and you
335 wish to use the data in annotating a genome, then you can pass the
336 data to MAKER as an aligned GFF3 file.  This is done by supplying the
337 file's location to the appropriate option in the maker_opts.ctl file
338 (e.g. est_gff:input/est.gff).  Note that MAKER expects all data sent
339 to it to be of the type specified, so don't put mixed data in a file
340 (i.e. don't mix EST and other data in the file pointed to by est_gff,
341 otherwise it all gets used as EST data).  Also the genome_gff option
342 is only for MAKER produced GFF3 files.  Other GFF3 files of mixed data
343 must be split by type and identified by the appropriate control file
344 option (e.g. rm_gff for repeat data, pred_gff for ab-initio prediction
345 data, est_gff for EST data, etc.).
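
One possible way to split a mixed GFF3 file is by its source column
(column 2), writing one output file per source.  This is only a
sketch: the function name split_gff_by_source is hypothetical, and it
assumes that each source value in your file corresponds to a single
data type.  Deciding which resulting file belongs with which control
file option is still up to you.

```shell
# Split a tab-delimited GFF3 file into one file per source (column 2).
# Usage: split_gff_by_source <mixed.gff> <output-prefix>
# Writes <output-prefix>.<source>.gff, each starting with a GFF3 header.
# Comment lines and lines with fewer than 9 columns are skipped.
split_gff_by_source() {
    awk -F '\t' -v prefix="$2" '
        /^#/ || NF < 9 { next }
        {
            src = $2
            gsub(/[^A-Za-z0-9_.-]/, "_", src)     # keep file names safe
            out = prefix "." src ".gff"
            if (!(out in seen)) {
                print "##gff-version 3" > out
                seen[out] = 1
            }
            print > out
        }
    ' "$1"
}

# Example:
# split_gff_by_source mixed.gff split
```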
346
347 You should use the online GFF3 validator to see if your GFF3 files
348 comply with all GFF3 specifications before running MAKER:
349
350      http://dev.wormbase.org/db/validate_gff3/validate_gff3_online
351
352 #---------------------------------------------------------------------
353
354 ADDING UTRs FOR GBROWSE
355
356 When using APOLLO to visualize gene annotations, UTRs are inferred
357 based on exon and CDS locations.  However, GMOD and GBROWSE do not
358 infer the UTR, so to visualize the UTR you will have to run
359 add_utr_gff.pl with the following command:
360
361      add_utr_gff.pl <directory>
362
363      <directory> is the directory containing all of your GFF3 files.
364
365 Each GFF3 file will have a sister file called sequence.wutr.gff3.
366
367 There is also another script called add_utr_start_stop that adds start
368 and stop codon entries as well as UTR entries.
369
370      add_utr_start_stop <GFF3 file>
371
372 #---------------------------------------------------------------------
373
374 APOLLO
375
376 MAKER is bundled with a configuration file that improves the color and
377 display of MAKER annotations and evidence in the Apollo genome
378 browser.  The configuration file is called "gff3.tiers" and is located
379 in the maker/Apollo/ directory.  The file should be copied to the
380 conf/ subdirectory, which is located under the Apollo installation
381 directory.  In the Mac version of Apollo, the conf/ directory is
382 located at /Applications/Apollo.app/Contents/Resources/app/conf/.
383
384 #---------------------------------------------------------------------
385
386 HMM BUILDING (based on snap documentation)
387
388
389 A) First you will need to determine the genes used to model future
390    genes, by determining a high quality gene set (annotations for the
391    high quality genes should be in GFF3 format).  The high quality gene
392    set can then be converted into SNAP ZFF format using maker2zff.pl
393    found in maker/bin.
394
395    This program is run with the following command:
396
397        maker2zff.pl <directory> genome
398
399        * <directory> is the directory where all of your GFF3 files are
400          located
401        * genome is the name for the outfile
402
403    Files Created:
404        genome.ann
405        genome.dna
406
407    Note: A convenient way to identify an initial high quality gene
408    set for the HMM is to use the -predictor est2genome option in
409    MAKER.  This will produce gene annotations based solely on EST
410    evidence.  These annotations can then seed the first HMM.  After
411    running MAKER again using this new HMM and the -predictor snap
412    option, you can use the second round of annotations as the seed
413    for an even better HMM model. In this way the HMM model
414    progressively improves with each run of MAKER.
415
416    Another strategy for identifying an initial gene set to model the
417    HMM is to use the program CEGMA (http://korflab.ucdavis.edu/
418    software.html).  CEGMA builds a highly reliable set of gene
419    annotations in the absence of experimental data by identifying DNA
420    regions with homology to a set of 458 proteins that are highly
421    conserved among taxa.
422
423    Combining both CEGMA and MAKER datasets to build the first HMM is
424    also a good strategy.
425
426 B) Next you will use the dna and zff files (genome.dna and genome.ann)
427    to produce a SNAP HMM as described in the SNAP documentation (which
428    we have provided below):
429
430    The first step is to look at some features of the genes:
431
432        fathom genome.ann genome.dna -gene-stats
433
434    Next, you want to verify that the genes have no obvious errors:
435
436        fathom genome.ann genome.dna -validate
437
438    You may find some errors and warnings. Check these out in some kind
439    of genome browser and remove those that are real errors. Next,
440    break up the sequences into fragments with one gene per sequence
441    with the following command:
442
443        fathom genome.ann genome.dna -categorize 1000
444
445    There will be up to 1000 bp on either side of the genes. You will
446    find several new files.
447
448        alt.ann, alt.dna (genes with alternative splicing)
449        err.ann, err.dna (genes that have errors)
450        olp.ann, olp.dna (genes that overlap other genes)
451        wrn.ann, wrn.dna (genes with warnings)
452        uni.ann, uni.dna (single gene per sequence)
453
454    Convert the uni genes to plus stranded with the command:
455
456        fathom uni.ann uni.dna -export 1000 -plus
457
458    You will find 4 new files:
459
460        export.aa   proteins corresponding to each gene
461        export.ann  gene structure on the plus strand
462        export.dna  DNA of the plus strand
463        export.tx   transcripts for each gene
464
465    The parameter estimation program, forge, creates a lot of files.
466    You probably want to create a directory to keep things tidy before
467    you execute the program.
468
469        mkdir params
470        cd params
471        forge ../export.ann ../export.dna
472        cd ..
473
474    The last SNAP step is to build an HMM.
475
476        hmm-assembler.pl my-genome params > my-genome.hmm
477
478    Finally, you will want to add the location of your hmm file to your
479    maker_opts.ctl file.
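
The non-interactive commands from step B can be collected into one
script once the gene set has been inspected and validated by hand.
The sketch below is one way to do that; the function names run and
build_snap_hmm and the DRY_RUN variable are hypothetical, and only
the fathom, forge and hmm-assembler.pl invocations shown above are
used.  With DRY_RUN=1 the commands are printed rather than executed,
which is a convenient way to review them first.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: build_snap_hmm <zff-basename> <hmm-name>
# Expects <zff-basename>.ann and <zff-basename>.dna in the current directory.
# build_snap_hmm is an illustrative wrapper, not part of MAKER or SNAP.
build_snap_hmm() {
    base=$1 name=$2
    run fathom "$base.ann" "$base.dna" -categorize 1000 || return 1
    run fathom uni.ann uni.dna -export 1000 -plus || return 1
    run mkdir -p params || return 1
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "(cd params && forge ../export.ann ../export.dna)"
        echo "hmm-assembler.pl $name params > $name.hmm"
    else
        (cd params && forge ../export.ann ../export.dna) || return 1
        hmm-assembler.pl "$name" params > "$name.hmm"
    fi
}

# Example:
# DRY_RUN=1; build_snap_hmm genome my-genome   # print the commands only
# DRY_RUN=0; build_snap_hmm genome my-genome   # actually run them
```

The -gene-stats and -validate steps are left out on purpose, since
their output is meant to be read and acted on before continuing.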
480
481 * For more information see SNAP documentation on how to build an HMM