root/README

Revision 244, 18.3 kB (checked in by cholt, 3 months ago)

bug in repeat mask cause by prok upgrade

1 ***MAKER Documentation***
2
3 #---------------------------------------------------------------------
4
5 INSTALLATION INSTRUCTIONS FOR MAKER
6
7 *Step by step instructions are also available in the INSTALL text
8 file.
9
10 MAKER is an annotation pipeline.  In other words it links together
11 many steps and programs to produce final annotations.  For this
12 reason, you must first install a number of programs that MAKER depends
13 on.  MAKER works on both eukaryotic and prokaryotic genomes.
14
15
16 To install MAKER, you will first need to install the following
17 external programs:
18
19     *PERL 5.8.0 or higher
20     *BioPerl 1.5 or higher (www.bioperl.org)
21     *SNAP version 2009-02-03 or higher (homepage.mac.com/iankorf)
22     *RepeatMasker 3.1.6 or higher (www.repeatmasker.org)
23     *Exonerate 1.4 or higher (www.ebi.ac.uk/~guy/exonerate)
24
25 You must also install one of the following:
26
27     *Wu-BLAST 2.0 or higher
28      (Wu-BLAST is becoming AB-BLAST which can not yet be downloaded)
29         or
30     *NCBI BLAST 2.2.X or higher
31      (http://www.ncbi.nlm.nih.gov/BLAST/download.shtml)
32  
33 You might want to also install these optional external programs:
34
35     *Augustus 2.0 or higher (augustus.gobics.de)
36     *GeneMark-ES (exon.biology.gatech.edu)
37     *FGENESH 2.6 or higher (www.softberry.com) - requires licence
38     *GeneMarkS for prokaryotic genomes (exon.biology.gatech.edu)
39
40 To install mpi_maker, you must have an MPI package installed.  Try the
41 following:
42
43     *MPICH2 (http://www.mcs.anl.gov/research/projects/mpich2/)
44
45 Note: Remember to install MPICH2 with the --enable-sharedlibs flag set
46 to the appropriate value.  See MPICH2 Installer's Guide at:
47 http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs
48
49 Also see the mpi_maker installation instructions found further below.
50
51
52 Notes:
53
54 1) Wu-BLAST is becoming AB-BLAST.  Once AB-BLAST becomes available we
55    will do some testing to see if it is compatible with MAKER.
56    Wu-BLAST is no longer available online, so if you don't already
57    have it, you will have to use NCBI BLAST instead.
58 2) RepeatMasker requires Wu-BLAST or Cross_Match and a single file
59    executable called TRF (see RepeatMasker website for details), so
60    please install these before installing RepeatMasker.
61 3) Exonerate binaries can be downloaded from the website.  If you use
62    Mac OSX, however, binaries are only available for version 1.0.
63    This version will work too.  If you would like to compile exonerate,
64    it requires GLIB, a C library that is linked from the exonerate
65    website.  If you use Mac OSX, GLIB can be downloaded using FINK.
66 4) RepeatMasker requires a repeat library file, which can be
67    downloaded from Repbase upon registration
68    (http://www.girinst.org/).  This is explained on the RepeatMasker
69    website.
70 5) Please note the location of all of the programs that you have
71    installed, and add them to your $PATH variable in your .profile
72    file.  You will need this information in the maker_exe.ctl file,
73    one of MAKER's 3 control files.
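
As a quick check for note 5, the sketch below reports which executables
cannot be found on $PATH.  The function name check_tools is hypothetical
(it is not part of MAKER), and the executable names in the usage comment
are assumptions about how the external programs are typically installed;
adjust them to match your system.

```shell
# Hypothetical helper (not part of MAKER): print each executable name that
# cannot be found on $PATH, one per line; return 1 if any are missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (executable names are assumptions, adjust to your installation):
#   check_tools perl snap RepeatMasker exonerate blastall
```

An empty result means every listed program was found.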
74
75 Now that you have all the necessary programs installed, MAKER can be
76 unpacked using:
77
78  tar -zxvf maker.tar.gz
79
80 This will create a directory called MAKER with 6 subdirectories:
81
82     bin    - contains the MAKER executables.
83     lib    - contains all the necessary perl libraries for MAKER.
84     MPI    - contains MPI specific data to configure MAKER for a
85              cluster that supports MPI.
86     Apollo - contains gff3.tiers file (See section titled APOLLO
87              below)
88     data   - contains some sample data used to make sure everything
89              works.
90     perl   - contains perl modules that need to be compiled
91
92 Finally, change to the maker/perl directory and type: 'perl Install.PL'
93 to compile required perl modules.
94
95 Now you can run MAKER!!
96
97 Programs required by MAKER rely on certain environmental variables
98 being set.  If you have not set these variables per the installation
99 instructions of the external programs, a reminder list is provided
100 below:
101
102 for tcsh:
103 setenv PERL5LIB where_bioperl_is_installed
104 setenv WUBLASTMAT where_wublast_is_installed/matrix
105 setenv ZOE where_snap_is_installed
106 setenv WUBLASTFILTER where_wublast_is_installed/filter
107 setenv AUGUSTUS_CONFIG_PATH where_augustus_is_installed/config
108
109 for bash:
110 export PERL5LIB=where_bioperl_is_installed
111 export WUBLASTMAT=where_wublast_is_installed/matrix
112 export ZOE=where_snap_is_installed
113 export WUBLASTFILTER=where_wublast_is_installed/filter
114 export AUGUSTUS_CONFIG_PATH=where_augustus_is_installed/config
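
Before a run, it may help to confirm that these variables are actually
set in the current shell.  The following is a minimal sketch; check_env
is a hypothetical helper, not something shipped with MAKER, and which
variables you need depends on which programs you installed.

```shell
# Hypothetical helper (not part of MAKER): print the name of each variable
# that is unset or empty; return 1 if any are.
check_env() {
    status=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   check_env PERL5LIB ZOE AUGUSTUS_CONFIG_PATH
```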
115
116 #---------------------------------------------------------------------
117
118 MPI MAKER INSTALL
119
120 If you are running MAKER on an MPI capable cluster, you can install an
121 MPI version of MAKER by doing the following:
122
123     1. Install standard MAKER and verify that it runs.
124     2. Install MPICH2 with the --enable-sharedlibs flag set to the
125        appropriate value for your OS (See MPICH2 documentation)
126     3. Use cd to change to the MPI subdirectory in the MAKER
127        installation folder (i.e. maker/MPI/)
128     4. Run Install.PL by typing: perl Install.PL
129
130 A new version of MAKER called mpi_maker should now be installed under
131 maker/bin.
132
133 To run mpi_maker, first verify that your MPI environment is initiated
134 (e.g. using the mpdboot or mpd command).  Now start mpi_maker via
135 mpiexec.
136
137 Example: (This will run MAKER on 4 nodes or processors)
138
139 mpiexec -n 4 mpi_maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl
140
141 Please see the documentation of the MPI environment you use for
142 instructions on how to initiate an MPI process.
143
144 #---------------------------------------------------------------------
145
146 RUNNING MAKER WITH EXAMPLE DATA
147
148 1) Copy the files in the data directory to a temporary directory
149    where you will run an example file.
150 2) Type maker -CTL to generate generic MAKER control files
151 3) Next you will need to edit the control files to include the path of
152    the genome file, EST file, and protein file, as well as the paths
153    to all required executables.  See CONTROL FILE EDITING for more
154    information.
155 4) Then try the following command from your temporary directory:
156
157        maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl
158
159 5) Examine the output files.  See MAKER OUTPUT and APOLLO sections.
160
161 #---------------------------------------------------------------------
162
163 CONTROL FILE EDITING
164
165 MAKER uses control files to guide each run.  Generic control files can
166 be built using the -CTL flag in maker.  These control files can then
167 be edited by the user to identify the location of all required input
168 data and statistics.  Control files are run specific and separate
169 control files will need to be built for each genome given to MAKER.
170 MAKER will look for control files in the current working directory, so
171 it is recommended that MAKER be run in a separate directory
172 containing unique control files for each genome.
173
174 Control files:
175
176     1. maker_exe.ctl - contains the path information for needed
177        executables.
178
179     2. maker_bopts.ctl - contains filtering statistics for BLAST and
180        Exonerate.
181
182     3. maker_opts.ctl - contains all other information for MAKER,
183        including the location of the input genome file.
184
185
186 Remember to examine the control files before each run of MAKER on your
187 specific data.
188
189 Lines in the MAKER control files have the format key:value with no
190 spaces before or after the colon(:).  If the value is a file name, you
191 can use relative paths and environmental variables,
192 e.g. genome:$HOME/my_genome.fasta
193
194 Note that for all control files the comments written to help users
195 begin with a pound sign(#).  In addition, options before the colon(:)
196 can not be changed, nor should there be a space before or after the
197 colon.
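
To illustrate the key:value format, here is a minimal sketch that reads a
single value back out of a control file.  The function ctl_get is
hypothetical and is not how MAKER itself parses these files; it only
assumes what is described above (comment lines start with #, and the key
is everything before the first colon).

```shell
# Hypothetical helper (not part of MAKER): print the value stored under a
# key in a key:value control file.  $1 = control file, $2 = key.
ctl_get() {
    awk -v key="$2" '
        /^#/ { next }                  # skip comment lines
        {
            i = index($0, ":")         # split on the first colon only
            if (i > 0 && substr($0, 1, i - 1) == key) {
                print substr($0, i + 1)
                exit
            }
        }
    ' "$1"
}

# Usage:
#   ctl_get maker_opts.ctl genome
```

Splitting on the first colon only keeps any later colons as part of the
value.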
198
199 A. maker_exe.ctl - includes information about programs executed by
200 MAKER.
201
202 Here is an example of sections of the maker_opts.ctl file:
203
204 #-----Genome
205 genome:/fastas/genome.fasta
206
207 #-----EST Evidence
208 est:/fastas/est.fasta
209 altest:/fastas/alt_est.fasta
210
211 #-----Protein Homology Evidence
212 protein:protein.fasta
213
214 #-----MAKER Specific Options
215 evaluate:0
216 max_dna_len:100000
217 min_contig:1
218 min_protein:0
219 split_hit:10000
220 pred_flank:200
221 single_exon:0
222 single_length:250
223 keep_preds:0
224 map_forward:0
225 retry:1
226 clean_try:0
227 clean_up:0
228
229
230 #---------------------------------------------------------------------
231
232 MAKER OUTPUT
233
234 MAKER will create at least the following files/directories:
235
236 1) XXX.maker.output/ - contains all output for a given run of MAKER.
237 2) XXX.maker.output/XXX_datastore/ - contains subdirectories that hold
238    the output for each individual contig of the input fasta file.  See
239    DATASTORE DIRECTORY STRUCTURE section.
240 3) XXX.maker.output/XXX_master_datastore_index.log - log of MAKER run
241    progress as well as an index for traversing through the output
242    data structure.
243 4) XXX.maker.output/mpi_blastdb/ - Contains fasta indexes and error
244    corrected fasta files built from the EST and protein databases
245    provided by the user.
246 5) maker_opt.log, maker_exe.log, maker_bopts.log - These are logs of the
247    control files used for this run of MAKER.
248 6) XXX.maker.output/XXX.db - Database of GFF3 files provided by the
249    user.  See GFF3 PASSTHROUGH section.
250
251 Within the XXX_datastore/ subdirectories:
252     * seq_name.gff - a gff file that can be loaded into GMOD, GBROWSE,
253       or Apollo
254     * seq_name.maker.transcripts.fasta - a fasta file of the MAKER
255       annotated transcript sequences
256     * seq_name.maker.proteins.fasta - a fasta file of the MAKER
257       annotated protein sequences
258     * seq_name.maker.XXX.transcript.fasta - a fasta file of ab-initio
259       predicted transcript sequences from program XXX
260     * seq_name.maker.XXX.proteins.fasta - a fasta file of ab-initio
261       predicted protein sequences from program XXX
262     * seq_name.maker.non_overlapping_ab_initio.transcripts.fasta - a
263       fasta file of filtered ab-initio transcript sequences that don't
264       overlap maker annotations
265     * seq_name.maker.non_overlapping_ab_initio.proteins.fasta - a
266       fasta file of filtered ab-initio protein sequences that don't
267       overlap maker annotations
268     * theVoid.seq_name/ - a directory containing all of the raw
269       output files produced by MAKER, including BLAST reports, SNAP
270       output, exonerate output and the masked genomic sequence.
271
272 WARNING:
273
274 * The names of output files are based on sequence ids.  If giving
275   MAKER a multi-fasta file, it is important to verify that all
276   sequence ids are unique, so files are not overwritten.
277
278 * If there are more than 1,000 sequences in a multi-fasta file a deep
279   datastore structure will be used. See THE DATASTORE DIRECTORY
280   STRUCTURE in this document.
281
282 * If sequence ids contain characters that are illegal in file names,
283   those characters will be replaced automatically before building
284   output file names.
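
One way to check the first warning before a run is sketched below.  The
function duplicate_fasta_ids is hypothetical, and it assumes that the
sequence id is the first whitespace-delimited word after the ">" on each
header line; this document does not state how MAKER derives ids, so treat
that as an assumption.

```shell
# Hypothetical helper (not part of MAKER): print each sequence id that
# appears more than once in a multi-fasta file, once per duplicated id.
# Assumes the id is the first word after ">" on a header line.
duplicate_fasta_ids() {
    awk '/^>/ {
        id = substr($1, 2)             # drop the leading ">"
        if (seen[id]++ == 1) print id  # report on the second occurrence
    }' "$1"
}

# Usage:
#   duplicate_fasta_ids genome.fasta
```

No output means no duplicate ids were found under that assumption.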
285
286
287 #---------------------------------------------------------------------
288
289 DATASTORE DIRECTORY STRUCTURE
290
291 Many filesystems have performance problems with large numbers of
292 subdirectories and files within a single directory, and even when the
293 underlying filesystems handle things gracefully, access via network
294 filesystems can be an issue.  You can imagine that the amount of
295 output produced while annotating an entire genome can be quite
296 overwhelming to the file system.  To deal with all the output files
297 MAKER uses a Datastore module to create a hierarchy of subdirectory
298 layers, starting from a 'base', and mapping each identifier to its
299 corresponding subdirectory.
300
301 A deep datastore will be used by MAKER if there are more than 1,000
302 sequences in a multi-fasta file.  When a deep datastore is
303 implemented, MAKER output files will not appear where you would
304 normally expect them to be.  Instead they will be located in a series
305 of subdirectories under a new base directory whose name is determined
306 based on the input genome file name:
307
308 EXAMPLE: current_directory/fly_datastore/EE/Af/Contig1/Contig1.gff
309
310 To help you locate output files, a master_datastore_index file is
311 created which lists the exact output directory corresponding to each
312 contig from the input genome file.  The master_datastore_index
313 file contains three columns of text; the first column shows the
314 sequence identifier from each fasta header, and the second column
315 shows the location of the output files for that sequence. The third
316 column is for logging the status of data related to an individual
317 contig. The values of the third column are as follows:
318     * STARTED - Indicates that MAKER has started processing this
319       contig.
320     * FINISHED - Indicates that MAKER has finished processing this
321       contig and all data is currently available in that subdirectory.
322     * DIED - Indicates that MAKER failed on this contig.
323     * DIED_SKIPPED_PERMANENT - Indicates that MAKER failed up to the
324       specified number of retries and will not try again.
325     * RETRY - Indicates that MAKER is retrying the contig after a
326       failure.
327     * SKIPPED_SMALL - Indicates that this contig was skipped because
328       it is too short (based on control file values set by the user).
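
The index can be filtered by the status column with a short command.  The
sketch below is not part of MAKER; contigs_with_status is a hypothetical
helper, and it assumes the three columns are separated by tabs, which
this document does not specify.

```shell
# Hypothetical helper (not part of MAKER): list the sequence identifiers
# whose status column matches the requested value.
# $1 = master datastore index file, $2 = status (e.g. FINISHED, DIED).
# Assumes tab-separated columns: identifier, output location, status.
contigs_with_status() {
    awk -F '\t' -v want="$2" '$3 == want { print $1 }' "$1"
}

# Usage:
#   contigs_with_status XXX_master_datastore_index.log DIED
```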
329
330
331 #---------------------------------------------------------------------
332
333 GFF3 PASSTHROUGH
334
335 If you have data from a source that MAKER does not support, and you
336 wish to use the data in annotating a genome, then you can pass the
337 data to MAKER as an aligned GFF3 file.  This is done by supplying the
338 file's location to the appropriate option in the maker_opts.ctl file
339 (e.g. est_gff:input/est.gff).  Note that MAKER expects all data sent
340 to it to be of the type specified, so don't put mixed data in a file
341 (i.e. don't mix EST and other data in the file pointed to by est_gff,
342 otherwise it all gets used as EST data).  Also the genome_gff option
343 is only for MAKER produced GFF3 files.  Other GFF3 files of mixed data
344 must be split by type and identified by the appropriate control file
345 option (i.e. rm_gff for repeat data, pred_gff for ab-initio prediction
346 data, est_gff for EST data, etc.).
347
348 You should use the online GFF3 validator to see if your GFF3 files
349 comply with all GFF3 specifications before running MAKER:
350
351      http://dev.wormbase.org/db/validate_gff3/validate_gff3_online
352
353 #---------------------------------------------------------------------
354
355 ADDING UTRs FOR GBROWSE
356
357 When using APOLLO to visualize gene annotations, UTRs are inferred
358 based on exon and CDS locations.  However GMOD and GBROWSE do not
359 infer the UTR, so to visualize the UTR, you will have to run:
360 add_utr_gff.pl with the following command:
361
362      add_utr_gff.pl <directory>
363
364      <directory> is the directory containing all of your GFF3 files.
365
366 Each GFF3 file will have a sister file called sequence.wutr.gff3.
367
368 There is also another script called add_utr_start_stop that adds start
369 and stop codon entries as well as UTR entries.
370
371      add_utr_start_stop <GFF3 file>
372
373 #---------------------------------------------------------------------
374
375 APOLLO
376
377 MAKER is bundled with a configuration file that improves the color and
378 display of MAKER annotations and evidence in the Apollo genome
379 browser.  The configuration file is called "gff3.tiers" and is located
380 in the maker/Apollo/ directory.  The file should be copied to the
381 conf/ subdirectory which is located under the Apollo installation
382 directory.  In the Mac version of Apollo, the conf/ directory is
383 located at /Applications/Apollo.app/Contents/Resources/app/conf/.
384
385 #---------------------------------------------------------------------
386
387 HMM BUILDING (based on snap documentation)
388
389
390 A) First you will need to determine the genes used to model future
391    genes, by determining a high quality gene set (annotations for the
392    high quality genes should be in GFF3 format).  The high quality gene
393    set can then be converted into snap ZFF format using maker2zff.pl
394    found in maker/bin.
395
396    This program is run with the following command:
397
398        maker2zff.pl <directory> genome
399
400        * <directory> is the directory where all of your GFF3 files are
401          located
402        * genome is the name for the outfile
403
404    Files Created:
405        genome.ann
406        genome.dna
407
408    Note: A convenient way to identify an initial high quality gene
409    set for the HMM is to use the -predictor est2genome option in
410    MAKER.  This will produce gene annotations based solely on EST
411    evidence.  These annotations can then seed the first HMM.  After
412    running MAKER again using this new HMM and the -predictor snap
413    option, you can use the second round of annotations as the seed
414    for an even better HMM model. In this way the HMM model
415    progressively improves with each run of MAKER.
416
417    Another strategy for identifying an initial gene set to model the
418    HMM is to use the program CEGMA (http://korflab.ucdavis.edu/
419    software.html).  CEGMA builds a highly reliable set of gene
420    annotations in the absence of experimental data by identifying DNA
421    regions with homology to a set of 458 proteins that are highly
422    conserved among taxa.
423
424    Combining both CEGMA and MAKER datasets to build the first HMM is
425    also a good strategy.
426
427 B) Next you will use the dna and zff files (genome.dna and genome.ann)
428    to produce a SNAP HMM as described in the SNAP documentation (which
429    we have provided below):
430
431    The first step is to look at some features of the genes:
432
433        fathom genome.ann genome.dna -gene-stats
434
435    Next, you want to verify that the genes have no obvious errors:
436
437        fathom genome.ann genome.dna -validate
438
439    You may find some errors and warnings. Check these out in some kind
440    of genome browser and remove those that are real errors. Next,
441    break up the sequences into fragments with one gene per sequence
442    with the following command:
443
444        fathom genome.ann genome.dna -categorize 1000
445
446    There will be up to 1000 bp on either side of the genes. You will
447    find several new files.
448
449        alt.ann, alt.dna (genes with alternative splicing)
450        err.ann, err.dna (genes that have errors)
451        olp.ann, olp.dna (genes that overlap other genes)
452        wrn.ann, wrn.dna (genes with warnings)
453        uni.ann, uni.dna (single gene per sequence)
454
455    Convert the uni genes to plus stranded with the command:
456
457        fathom uni.ann uni.dna -export 1000 -plus
458
459    You will find 4 new files:
460
461        export.aa   proteins corresponding to each gene
462        export.ann  gene structure on the plus strand
463        export.dna  DNA of the plus strand
464        export.tx   transcripts for each gene
465
466    The parameter estimation program, forge, creates a lot of files.
467    You probably want to create a directory to keep things tidy before
468    you execute the program.
469
470        mkdir params
471        cd params
472        forge ../export.ann ../export.dna
473        cd ..
474
475    The last SNAP step is to build the HMM.
476
477        hmm-assembler.pl my-genome params > my-genome.hmm
478
479    Finally, you will want to add the location of your hmm file to your
480    maker_opts.ctl file.
481
482 * For more information see SNAP documentation on how to build an HMM