Changeset 199

Show
Ignore:
Timestamp:
04/10/09 12:00:34 (8 months ago)
Author:
bmoore
Message:

updates to ID mapping scripts and reformatted README for printer and screen

Files:

Legend:

Unmodified
Added
Removed
Modified
Copied
Moved
  • README

    r180 r199  
    11***MAKER Documentation*** 
    22 
    3 #---------------------------------------------------- 
     3#--------------------------------------------------------------------- 
     4 
    45INSTALLATION INSTUCTIONS FOR MAKER 
    56 
    6 *Step by step instructions are also available in the INSTALL text file. 
    7  
    8 MAKER is an annotation pipeline.  In other words it links together many steps and programs to produce final annotations.  For this reason, you must first install a number of programs that MAKER depends on. 
    9  
    10  
    11 To install maker, you will first need to install the following external programs: 
    12  
    13      *PERL 5.8.0 or higher 
    14      *BioPerl 1.5 or higher (www.bioperl.org) 
    15      *SNAP version 2009-02-03  or higher (homepage.mac.com/iankorf) 
    16      *RepeatMasker 3.1.6  or higher (www.repeatmasker.org) 
    17      *Exonerate 1.4  or higher (www.ebi.ac.uk/~guy/exonerate) 
     7*Step by step instructions are also available in the INSTALL text 
     8file. 
     9 
     10MAKER is an annotation pipeline.  In other words it links together 
     11many steps and programs to produce final annotations.  For this 
     12reason, you must first install a number of programs that MAKER depends 
     13on. 
     14 
     15 
     16To install maker, you will first need to install the following 
     17external programs: 
     18 
     19    *PERL 5.8.0 or higher 
     20    *BioPerl 1.5 or higher (www.bioperl.org) 
     21    *SNAP version 2009-02-03  or higher (homepage.mac.com/iankorf) 
     22    *RepeatMasker 3.1.6  or higher (www.repeatmasker.org) 
     23    *Exonerate 1.4  or higher (www.ebi.ac.uk/~guy/exonerate) 
    1824 
    1925You must also install one of the following: 
    2026 
    21      *Wu-BLAST 2.0  or higher (Wu-BLAST is becoming AB-BLAST which can not yet be downloaded) 
     27    *Wu-BLAST 2.0 or higher (Wu-BLAST is becoming AB-BLAST which can 
     28     not yet be downloaded) 
    2229        or 
    23      *NCBI BLAST 2.2.X or higher (http://www.ncbi.nlm.nih.gov/BLAST/download.shtml) 
     30    *NCBI BLAST 2.2.X or higher 
     31     (http://www.ncbi.nlm.nih.gov/BLAST/download.shtml) 
    2432  
    2533You might want to also install these optional external programs: 
    2634 
    27      *Augustus 2.0  or higher (augustus.gobics.de) 
    28      *GeneMark.hmm-E 3.9 or higher (exon.biology.gatech.edu) 
    29      *FgenesH (www.softberry.com/) - requires licence 
    30  
    31 To install mpi_maker, you must have an mpi package installed, try the following: 
    32  
    33      *MPICH2 (http://www.mcs.anl.gov/research/projects/mpich2/) 
    34  
    35 note:  Remember to install MPICH2 with the --enable-sharedlibs flag set to the appropriate value (See MPICH2 Installer's Guide at http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs). 
     35    *Augustus 2.0 or higher (augustus.gobics.de) 
     36    *GeneMark.hmm-E 3.9 or higher (exon.biology.gatech.edu) 
     37    *FgenesH (www.softberry.com/) - requires licence 
     38 
     39To install mpi_maker, you must have an mpi package installed, try the 
     40following: 
     41 
     42    *MPICH2 (http://www.mcs.anl.gov/research/projects/mpich2/) 
     43 
     44Note: Remember to install MPICH2 with the --enable-sharedlibs flag set 
     45to the appropriate value (See MPICH2 Installer's Guide at 
     46http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs). 
    3647 
    3748 
    3849Notes: 
    39 1) Wu-BLAST is becoming AB-BLAST.  Once AB-BLAST becomes available we will do some testing to see if it is compatible with MAKER.  Wu-BLAST is no longer available online, so if you don't already have it, you will have to use NCBI BLAST instead. 
    40 2) RepeatMasker requires Wu-BLAST or Cross_Match and a single file executable called TRF (see RepeatMasker website for details), so please install these before installing RepeatMasker 
    41 3) Exonerate Binaries can be downloaded from the website.  If you use Mac OSX, however, binaries are only available for version 1.0.  This verion will work too.  If you would like to compile exonerate, it requires GLIB, a C-library, that has a link from the exonerate website.  If you use Mac OSX, GLIB can downloaded using FINK. 
    42 4) RepeatMasker requires a repeat library file, which can be downloaded from Repbase upon registration (http://www.girinst.org/), this is explained on the RepeatMasker website. 
    43 5) Please note the location of all of the programs that you have installed, and add them to you $PATH variable in your .profile file.  You will need this information in the maker.exe file, one of MAKER's 3 control files. 
    44  
    45  
    46 Now that you have all the necessary programs installed, MAKER can be unpacked using: 
    47  
    48 tar xvfz maker.tar.gz 
     50 
     511) Wu-BLAST is becoming AB-BLAST.  Once AB-BLAST becomes available we 
     52   will do some testing to see if it is compatible with MAKER. 
     53   Wu-BLAST is no longer available online, so if you don't already 
     54   have it, you will have to use NCBI BLAST instead. 
     552) RepeatMasker requires Wu-BLAST or Cross_Match and a single file 
     56   executable called TRF (see RepeatMasker website for details), so 
     57   please install these before installing RepeatMasker 
     583) Exonerate Binaries can be downloaded from the website.  If you use 
     59   Mac OSX, however, binaries are only available for version 1.0. 
     60   This verion will work too.  If you would like to compile exonerate, 
     61   it requires GLIB, a C-library, that has a link from the exonerate 
     62   website.  If you use Mac OSX, GLIB can downloaded using FINK. 
     634) RepeatMasker requires a repeat library file, which can be 
     64   downloaded from Repbase upon registration 
     65   (http://www.girinst.org/), this is explained on the RepeatMasker 
     66   website. 
     675) Please note the location of all of the programs that you have 
     68   installed, and add them to you $PATH variable in your .profile 
     69   file.  You will need this information in the maker.exe file, one 
     70   of MAKER's 3 control files. 
     71 
     72Now that you have all the necessary programs installed, MAKER can be 
     73unpacked using: 
     74 
     75 tar xvfz maker.tar.gz 
    4976 
    5077This will create a directory called maker with 5 sub directories: 
    5178 
    52         bin - contains the maker executables. 
    53         lib - contains all the necessary perl libaries for MAKER. 
    54         MPI - contains MPI specific data to configure MAKER for a cluster that supports MPI. 
    55         Apollo - contains gff3.tiers file (See section titled APOLLO below) 
    56         data - contains some sample data used to make sure everything works. 
    57         perl - contains perl modules that need to be compiled 
    58  
    59 Finally change to the maker/perl directory and type: 'perl Install.PL' to compile required perl modules. 
     79    bin    - contains the maker executables. 
     80    lib    - contains all the necessary perl libaries for MAKER. 
     81    MPI    - contains MPI specific data to configure MAKER for a 
     82             cluster that supports MPI. 
     83    Apollo - contains gff3.tiers file (See section titled APOLLO 
     84             below) 
     85    data   - contains some sample data used to make sure everything 
     86             works. 
     87    perl   - contains perl modules that need to be compiled 
     88 
     89Finally change to the maker/perl directory and type: 'perl Install.PL' 
     90to compile required perl modules. 
    6091 
    6192Now you can run MAKER!! 
    6293 
    63 Maker uses control files to guide each run.  Generic control files can be built using the -CTL flag in maker.  These control files can then be edited by the user to identify the location of all required input data and statistics.  Control files are run specific and seperate control will need to be built for each genome given to maker.  Maker will look for control files in the current working directory, so it is recomended that maker should be ran in a seperate directory containing unique control files for each genome. 
     94Maker uses control files to guide each run.  Generic control files can 
     95be built using the -CTL flag in maker.  These control files can then 
     96be edited by the user to identify the location of all required input 
     97data and statistics.  Control files are run specific and seperate 
     98control will need to be built for each genome given to maker.  Maker 
     99will look for control files in the current working directory, so it is 
     100recomended that maker should be ran in a seperate directory containing 
     101unique control files for each genome. 
    64102 
    65103Control files: 
    66104 
    67          1. maker_exe.ctl - contains the path information for needed executables 
    68          2. maker_bopts - contains filtering statistics for BLAST and Exonerate 
    69          3. maker_opts.ctl - contains all other information for MAKER, including the location of the input genome file. 
    70  
    71  
    72 Always remember to be examine the control files before each run of MAKER on your specific data 
    73  
    74  
    75 Programs required by maker rely on certain environmental variables being set.  If you have not set these variables per the installation instructions of the external programs, a reminder list is provided below: 
     105    1. maker_exe.ctl - contains the path information for needed 
     106       executables. 
     107 
     108    2. maker_bopts - contains filtering statistics for BLAST and 
     109       Exonerate. 
     110 
     111    3. maker_opts.ctl - contains all other information for MAKER, 
     112       including the location of the input genome file. 
     113 
     114Always remember to examine the control files before each run of MAKER 
     115on your specific data. 
     116 
     117Programs required by maker rely on certain environmental variables 
     118being set.  If you have not set these variables per the installation 
     119instructions of the external programs, a reminder list is provided 
     120below: 
    76121 
    77122for tcsh: 
     
    89134export AUGUSTUS_CONFIG_PATH=where_augustus_is_installed/config 
    90135 
    91  
    92 #---------------------------------------------------- 
     136#--------------------------------------------------------------------- 
     137 
    93138MPI MAKER INSTALL 
    94139 
    95 If you are running maker on an MPI capable cluster, you can install an MPI version of maker by doing the following: 
    96  
    97         1. Install standard maker and verify that it runs. 
    98         2. Install MPICH2 with the --enable-sharedlibs flag set to the appropriate value (See MPICH2 documentation) 
    99         3. Use cd to change to the MPI subdirectory in the maker instalation folder (i.e. maker/MPI/) 
    100         4. Run Install.PL by typing:     perl Install.PL 
    101  
    102 A new version of maker called mpi_maker should now be installed under maker/bin. 
    103  
    104 To run mpi_maker, first verify that your mpi environment is initiated, (i.e. using the mpdboot or mpd command). Now start mpi_maker via mpiexec. 
     140If you are running maker on an MPI capable cluster, you can install an 
     141MPI version of maker by doing the following: 
     142 
     143    1. Install standard maker and verify that it runs. 
     144    2. Install MPICH2 with the --enable-sharedlibs flag set to the 
     145       appropriate value (See MPICH2 documentation) 
     146    3. Use cd to change to the MPI subdirectory in the maker 
     147       instalation folder (i.e. maker/MPI/) 
     148    4. Run Install.PL by typing: perl Install.PL 
     149 
     150A new version of maker called mpi_maker should now be installed under 
     151maker/bin. 
     152 
     153To run mpi_maker, first verify that your mpi environment is initiated, 
     154(i.e. using the mpdboot or mpd command). Now start mpi_maker via 
     155mpiexec. 
    105156 
    106157Example: (This will run MAKER on 3 nodes or processors) 
    107158 
    108         mpiexec -n 3 perl maker_directory/maker/bin/mpi_maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl 
    109  
    110  
    111  
    112 Please see the documentation of the MPI environment you use for instructions on how to initiate an MPI process. 
    113  
    114  
    115 #---------------------------------------------------- 
     159mpiexec -n 3 perl maker_directory/maker/bin/mpi_maker maker_opts.ctl \ 
     160maker_bopts.ctl maker_exe.ctl 
     161 
     162Please see the documentation of the MPI environment you use for 
     163instructions on how to initiate an MPI process. 
     164 
     165#--------------------------------------------------------------------- 
     166 
    116167MAKER USAGE STATEMENT 
    117168 
    118169Usage: 
    119170 
    120      maker [options] <maker_opts> <maker_bopts> <maker_exe> <evaluator> 
    121  
    122      Maker is a program that produces gene annotations in GFF3 file format using 
    123      evidence such as EST alignments and protein homology.  Maker can be used to 
    124      produce gene annotations for new genomes as well as update annoations from 
    125      existing genome databases. 
    126  
    127      The four input arguments are user control files that specify how maker 
    128      should behave. The evaluator options file contains control options specific 
    129      for the evaluation of gene annotations. All options for maker should be set 
    130      in the control files, but a few can also be set on the command line. 
    131      Command line options provide a convenient machanism to override commonly 
    132      altered control file values. 
    133  
    134      Input files listed in the control options files must be in fasta format. 
    135      Please see maker documentation to learn more about control file 
    136      configuration.  Maker will automatically try and locate the user control 
    137      files in the current working directory if these arguments are not supplied 
    138      when initializing maker. 
    139  
    140      It is important to note that maker does not try and recalculated data that 
    141      it has already calculated.  For example, if you run an analysis twice on 
    142      the same dataset file you will notice that maker does not rerun any of the 
    143      blast analyses, but instead uses the blast analyses stored from the 
    144      previous run.  To force maker to rerun all analyses, use the -f flag. 
     171maker [options] <maker_opts> <maker_bopts> <maker_exe> <evaluator> 
     172 
     173Maker is a program that produces gene annotations in GFF3 file format 
     174using evidence such as EST alignments and protein homology.  Maker can 
     175be used to produce gene annotations for new genomes as well as update 
     176annoations from existing genome databases. 
     177 
     178The four input arguments are user control files that specify how maker 
     179should behave. The evaluator options file contains control options 
     180specific for the evaluation of gene annotations. All options for maker 
     181should be set in the control files, but a few can also be set on the 
     182command line.  Command line options provide a convenient machanism to 
     183override commonly altered control file values. 
     184 
     185Input files listed in the control options files must be in fasta 
     186format.  Please see maker documentation to learn more about control 
     187file configuration.  Maker will automatically try and locate the user 
     188control files in the current working directory if these arguments are 
     189not supplied when initializing maker. 
     190 
     191It is important to note that maker does not try and recalculated data 
     192that it has already calculated.  For example, if you run an analysis 
     193twice on the same dataset file you will notice that maker does not 
     194rerun any of the blast analyses, but instead uses the blast analyses 
     195stored from the previous run.  To force maker to rerun all analyses, 
     196use the -f flag. 
    145197 
    146198 
    147199Options: 
    148200 
    149      -genome|g <filename> Specify the genome file. 
    150  
    151      -predictor|p <type>  Selects the predictor(s) to use when building 
    152                           annotations.  Use a ',' to seperate types (no spaces). 
    153                           i.e. -predictor=snap,augustus,fgenesh 
    154  
    155                           types: snap 
    156                                  augustus 
    157                                  fgenesh 
    158                                  genemark 
    159                                  est2genome (Uses EST's directly) 
    160                                  abinit (ab-initio predictions) 
    161                                  model_gff (Passes through GFF3 annotations) 
    162  
    163      -RM_off|R           Turns all repeat masking off. 
    164  
    165      -retry   <integer>  Rerun failed contigs up to the specified count. 
    166  
    167      -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis. 
    168  
    169      -force|f            Forces maker to delete old files before running again. 
    170                          This will require all blast analyses to be rerun. 
    171  
    172      -evaluate|e         Run Evaluator on final annotations (under development). 
    173  
    174      -quiet|q            Silences most of maker's status messages. 
    175  
    176      -CTL                Generate empty control files in the current directory. 
    177  
    178      -help|?             Prints this usage statement. 
    179  
    180  
    181 #---------------------------------------------------- 
     201  -genome|g <filename> Specify the genome file. 
     202 
     203  -predictor|p <type>  Selects the predictor(s) to use when 
     204                       building annotations.  Use a ',' to 
     205                       seperate types (no spaces). 
     206                       i.e. -predictor=snap,augustus,fgenesh 
     207 
     208                       types: snap 
     209                              augustus 
     210                              fgenesh 
     211                              genemark 
     212                              est2genome (Uses EST's directly) 
     213                              abinit (ab-initio predictions) 
     214                              model_gff (Passes through GFF3 
     215                                         annotations) 
     216 
     217  -RM_off|R           Turns all repeat masking off. 
     218 
     219  -retry   <integer>  Rerun failed contigs up to the specified count. 
     220 
     221  -cpus|c  <integer>  Tells how many cpus to use for BLAST analysis. 
     222 
     223  -force|f            Forces maker to delete old files before running 
     224                      again.  This will require all blast analyses to 
     225                      be rerun. 
     226 
     227  -evaluate|e         Run Evaluator on final annotations (under 
     228                      development). 
     229 
     230  -quiet|q            Silences most of maker's status messages. 
     231 
     232  -CTL                Generate empty control files in the current 
     233                      directory. 
     234 
     235  -help|?             Prints this usage statement. 
     236 
     237 
     238#--------------------------------------------------------------------- 
     239 
    182240RUNNING MAKER WITH EXAMPLE DATA 
    183241 
    184 1) Copy the files in the data directories to a temporary directory where you will run an example file. 
     2421) Copy the files in the data directories to a temporary directory 
     243   where you will run an example file. 
    1852442) Type maker -CTL to generate generic maker control files 
    186 3) Next you will need to edit the control files to include the path of the genome file, EST file, and protein file, as well as the paths to all required executables.  See CONFIG FILE EDITING for more information. 
     2453) Next you will need to edit the control files to include the path of 
     246   the genome file, EST file, and protein file, as well as the paths 
     247   to all required executables.  See CONFIG FILE EDITING for more 
     248   information. 
    1872494) Then try the following command from your temporary directory: 
    188250 
    189 perl maker_directory/bin/maker maker_exe.ctl maker_opts.ctl maker_bopts.ctl 
     251perl maker_directory/bin/maker maker_exe.ctl maker_opts.ctl \ 
     252maker_bopts.ctl 
    190253 
    191254MAKER will create at least the following files/directories: 
    192255 
    193 XXX.maker.output/ - contains all output for a given run of make 
    194 XXX.maker.output/XXX_master_datastore_index.log - log of MAKER run progress as well as an index for traversing XXX.maker.output/XXX_datastore/ 
    195 XXX.maker.output/XXX_datastore/ - contains folders containing the output for each individual contig of the input fasta file 
    196 *Within these folders  
    197         seq_name.gff - a gff file that can be loaded into GMOD, GBROWSE, or Apollo 
    198         seq_name.maker.transcripts.fasta - a file of the maker transcript sequences 
    199         seq_name.maker.proteins.fasta - a file of the maker protein sequences 
    200         seq_name.maker.XXX.transcript.fasta - a file of ab-inito transcript sequences from program XXX 
    201         seq_name.maker.XXX.proteins.fasta - a file of ab-inito protein sequences from program XXX 
    202         seq_name.maker.non_overlapping_ab_initio.transcripts.fasta - a file of filtered ab-inito transcript sequences that don't overlap annotations 
    203         seq_name.maker.non_overlapping_ab_initio.proteins.fasta - a file of filtered ab-inito protein sequences that don't overlap annotations 
    204         theVoid.seq_name/ - a directory containing all of the raw output files produced by maker, including BLAST reports, SNAP output, exonnerate output and the masked sequence 
     2561) XXX.maker.output/ - contains all output for a given run of make 
     2572) XXX.maker.output/XXX_master_datastore_index.log - log of MAKER run 
     258   progress as well as an index for traversing 
     259   XXX.maker.output/XXX_datastore/ 
     2603) XXX.maker.output/XXX_datastore/ - contains folders containing the 
     261   output for each individual contig of the input fasta file 
     262 
     263Within these folders: 
     264    * seq_name.gff - a gff file that can be loaded into GMOD, GBROWSE, 
     265      or Apollo 
     266    * seq_name.maker.transcripts.fasta - a file of the maker 
     267      transcript sequences 
     268    * seq_name.maker.proteins.fasta - a file of the maker protein 
     269      sequences 
     270    * seq_name.maker.XXX.transcript.fasta - a file of ab-inito 
     271      transcript sequences from program XXX 
     272    * seq_name.maker.XXX.proteins.fasta - a file of ab-inito protein 
     273      sequences from program XXX 
     274    * seq_name.maker.non_overlapping_ab_initio.transcripts.fasta - a 
     275      file of filtered ab-inito transcript sequences that don't 
     276      overlap annotations 
     277    * seq_name.maker.non_overlapping_ab_initio.proteins.fasta - a 
     278      file of filtered ab-inito protein sequences that don't overlap 
     279      annotations 
     280    * theVoid.seq_name/ - a directory containing all of the raw 
     281      output files produced by maker, including BLAST reports, SNAP 
     282      output, exonnerate output and the masked sequence. 
    205283 
    206284WARNING: 
    207 *The names of output files are based on sequence ids.  If giving maker a multi-fasta file, it is important to verify that all sequence id are unique, so files are not overwritten. 
    208 *If there are more than 1,000 sequences in a multi-fasta file a deep datastore structure will be used. see DATASTORE in this document. 
    209 *If sequence ids contain characters that are illegal in file names, those characters will be replaced automatically before building output file names. 
    210  
    211 #---------------------------------------------------- 
     285 
     286* The names of output files are based on sequence ids.  If giving 
     287  maker a multi-fasta file, it is important to verify that all 
     288  sequence id are unique, so files are not overwritten. 
     289 
     290* If there are more than 1,000 sequences in a multi-fasta file a deep 
     291  datastore structure will be used. see DATASTORE in this document. 
     292 
     293* If sequence ids contain characters that are illegal in file names, 
     294  those characters will be replaced automatically before building 
     295  output file names. 
     296 
     297#--------------------------------------------------------------------- 
     298 
    212299DATASTORE 
    213300 
    214 "Many filesystems have performance problems with large numbers of subdirectories and files within a single directory and even when the underlying filesystems handle things gracefully, access via network filesystems can be an issue.  The Datastore modules create a hiearchy of subdirectory layers, starting from a 'base', and mapping end-user's identifiers to the corresponding subdirectory." - quote from http://www.yandell-lab.org/  (See site for more information on the Datastore module) 
    215  
    216 A deep datastore will be used by maker if there are more than 1,000 sequences in a multi-fasta file. 
    217  
    218 When a datastore is implemented, the output files described above will not appear where you would normally expect them to be.  Instead they will be located in a series of sub-directory under a new base-directory whose name is determined from the input genome file name, i.e. current_working_directory/genome_datastore/EE/Af/Contig1/Contig1.gff.  A master_datastore_index file will be made in the current working directory to help you find the output files from each sequence. 
    219  
    220 The master_datastore_index file is a file created to allow the user to easily find the exact output directory corresponding to contigs from the input genome file.  The The master_datastore_index file contains three columns of text; the first column shows the sequence identifier from each fasta header, and the second column shows the location of the output files for that sequence. The third column is for logging the status of data related to an individual contig. The values of the third column are as follows: 
    221         STARTED - Indicates that maker has started proccessing this contig. 
    222         FINISHED - Indicates that maker has finished processing this contig and all data is currently available in that subdirectory. 
    223         DIED - Indicates that maker failed. 
    224         DIED_SKIPPED_PERMANENT - Indicates that maker failed up to the specified number of retries and will not try again. 
    225         RETRY - Indicates that maker is retrying the contig after a failure. 
    226         SKIPPED_SMALL - Indicates that this contig was skipped because it is too short (based on control file values set by the user) 
    227  
    228  
    229 #---------------------------------------------------- 
     301"Many filesystems have performance problems with large numbers of 
     302subdirectories and files within a single directory and even when the 
     303underlying filesystems handle things gracefully, access via network 
     304filesystems can be an issue.  The Datastore modules create a hiearchy 
     305of subdirectory layers, starting from a 'base', and mapping end-user's 
     306identifiers to the corresponding subdirectory." - quote from 
     307http://www.yandell-lab.org/ (See site for more information on the 
     308Datastore module) 
     309 
     310A deep datastore will be used by maker if there are more than 1,000 
     311sequences in a multi-fasta file. 
     312 
     313When a datastore is implemented, the output files described above will 
     314not appear where you would normally expect them to be.  Instead they 
     315will be located in a series of sub-directory under a new 
     316base-directory whose name is determined from the input genome file 
     317name: 
     318 
     319i.e. current_directory/genome_datastore/EE/Af/Contig1/Contig1.gff. 
     320 
     321A master_datastore_index file will be made in the current working 
     322directory to help you find the output files from each sequence. 
     323 
     324The master_datastore_index file is a file created to allow the user to 
     325easily find the exact output directory corresponding to contigs from 
     326the input genome file.  The The master_datastore_index file contains 
     327three columns of text; the first column shows the sequence identifier 
     328from each fasta header, and the second column shows the location of 
     329the output files for that sequence. The third column is for logging 
     330the status of data related to an individual contig. The values of the 
     331third column are as follows: 
     332    * STARTED - Indicates that maker has started proccessing this 
     333      contig. 
     334    * FINISHED - Indicates that maker has finished processing this 
     335      contig and all data is currently available in that subdirectory. 
     336    * DIED - Indicates that maker failed. 
     337    * DIED_SKIPPED_PERMANENT - Indicates that maker failed up to the 
     338      specified number of retries and will not try again. 
     339    * RETRY - Indicates that maker is retrying the contig after a 
     340      failure. 
     341    * SKIPPED_SMALL - Indicates that this contig was skipped because 
     342      it is too short (based on control file values set by the user). 
     343 
     344#--------------------------------------------------------------------- 
     345 
    230346CONFIG FILE EDITING 
    231347 
    232 Lines in the maker control files have the format key:value whith no spaces before or after the colon(:).  If the value is a file name, you can use relative paths and environmental variables, i.e. genome:$HOME/my_genome.fasta 
    233  
    234  
    235 MAKER has 3 control files for configuration options. A fourth file evaluator.ctl is used to supply a MAKER related program EVALUATOR with options specific to that program (only important if 'evaluate' is set to 1 in maker_opts.ctl). 
    236  
    237 Note that for all control files the comments written to help users begin with a pound sign(#).  In addition, options before the colon(:) can not be changed, nor should there be a space before or after the colon. 
    238  
    239 A. maker_exe.ctl - includes information about programs executed by MAKER. 
     348Lines in the maker control files have the format key:value whith no 
     349spaces before or after the colon(:).  If the value is a file name, you 
     350can use relative paths and environmental variables, 
     351i.e. genome:$HOME/my_genome.fasta 
     352 
     353 
     354MAKER has 3 control files for configuration options. A fourth file 
     355evaluator.ctl is used to supply a MAKER related program EVALUATOR with 
     356options specific to that program (only important if 'evaluate' is set 
     357to 1 in maker_opts.ctl). 
     358 
     359Note that for all control files the comments written to help users 
     360begin with a pound sign(#).  In addition, options before the colon(:) 
     361can not be changed, nor should there be a space before or after the 
     362colon. 
     363 
     364A. maker_exe.ctl - includes information about programs executed by 
     365MAKER. 
    240366Here an example of a section of the maker_exe.ctl file: 
    241 ==================================== 
     367 
    242368#-----Location of Executables Used by Maker/Evaluator 
    243 formatdb:/usr/local/bin/formatdb                              #location of NCBI formatdb executable 
    244 blastall:/usr/local/bin/blastall                              #location of NCBI blastall executable 
    245 xdformat:/usr/local/bin/xdformat                              #location of WUBLAST xdformat executable 
    246 blastn:/usr/local/bin/blastn                                  #location of WUBLAST blastn executable 
    247 blastx:/usr/local/bin/blastx                                  #location of WUBLAST blastx executable 
    248 tblastx:/usr/local/bin/tblastx                                #location of WUBLAST tblastx executable 
    249 RepeatMasker:/home/cholt/usr/local/RepeatMasker/RepeatMasker  #location of RepeatMasker executable 
    250 exonerate:/home/cholt/usr/local/exonerate/bin/exonerate       #location of exonerate executable 
     369 
     370#location of NCBI formatdb executable    
     371formatdb:/usr/local/bin/formatdb                               
     372#location of NCBI blastall executable    
     373blastall:/usr/local/bin/blastall                               
     374#location of WUBLAST xdformat executable         
     375xdformat:/usr/local/bin/xdformat                               
     376#location of WUBLAST blastn executable   
     377blastn:/usr/local/bin/blastn                                   
     378#location of WUBLAST blastx executable   
     379blastx:/usr/local/bin/blastx                                   
     380#location of WUBLAST tblastx executable  
     381tblastx:/usr/local/bin/tblastx                                 
     382#location of RepeatMasker executable     
     383RepeatMasker:/home/cholt/usr/local/RepeatMasker/RepeatMasker   
     384#location of exonerate executable          
     385exonerate:/home/cholt/usr/local/exonerate/bin/exonerate        
    251386 
    252387#-----Ab-initio Gene Prediction Algorithms 
    253 snap:/home/cholt/usr/local/snap/snap                  #location of snap executable 
    254 gmhmme3:/home/cholt/usr/local/gmes/gmhmme3            #location of eukaryotic genemark executable 
    255 augustus:/home/cholt/usr/local/augustus/bin/augustus  #location of augustus executable 
    256 fgenesh:/home/cholt/usr/local/fgenesh/fgenesh         #location of fgenesh executable 
    257  
    258 ==================================== 
    259  
    260  
    261 B. maker_bopts.ctl - contains statistics for fltering blast and exonerate data 
     388 
     389#location of snap executable               
     390snap:/home/cholt/usr/local/snap/snap                   
     391#location of eukaryotic genemark executable  
     392gmhmme3:/home/cholt/usr/local/gmes/gmhmme3             
     393#location of augustus executable                   
     394augustus:/home/cholt/usr/local/augustus/bin/augustus   
     395#location of fgenesh executable              
     396fgenesh:/home/cholt/usr/local/fgenesh/fgenesh          
     397 
     398B. maker_bopts.ctl - contains statistics for fltering blast and 
     399exonerate data. 
    262400Here an example of a section of the maker_bopts.ctl file: 
    263 ==================================== 
    264401 
    265402#-----BLAST and Exonerate statistics thresholds 
    266 blast_type:wublast    #set to 'wublast' or 'ncbi' 
    267  
    268 pcov_blastn:0.8       #Blastn Percent Coverage Threhold EST-Genome Alignments 
    269 pid_blastn:0.85       #Blastn Percent Identity Threshold EST-Genome Aligments 
    270 eval_blastn:1e-10     #Blastn eval cutoff 
    271 bit_blastn:40         #Blastn bit cutoff 
    272  
    273 pcov_blastx:0.5       #Blastx Percent Coverage Threhold Protein-Genome Alignments 
    274 pid_blastx:0.4        #Blastx Percent Identity Threshold Protein-Genome Aligments 
    275 eval_blastx:1e-06     #Blastx eval cutoff 
    276 bit_blastx:30         #Blastx bit cutoff 
    277  
    278 pcov_rm_blastx:0.5    #Blastx Percent Coverage Threhold For Transposable Element Masking 
    279 pid_rm_blastx:0.4     #Blastx Percent Identity Threshold For Transposbale Element Masking 
    280 eval_rm_blastx:1e-06  #Blastx eval cutoff for transposable element masking 
    281 bit_rm_blastx:30      #Blastx bit cutoff for transposable element masking 
    282 ==================================== 
    283  
    284  
    285 C. maker_opts.ctl - contains options for maker and external programs used by maker 
     403#set to 'wublast' or 'ncbi' 
     404blast_type:wublast     
     405#Blastn Percent Coverage Threhold EST-Genome Alignments  
     406pcov_blastn:0.8        
     407#Blastn Percent Identity Threshold EST-Genome Aligments  
     408pid_blastn:0.85        
     409#Blastn eval cutoff                                    
     410eval_blastn:1e-10      
     411#Blastn bit cutoff                                       
     412bit_blastn:40          
     413 
     414#Blastx Percent Coverage Threhold Protein-Genome Alignments 
     415pcov_blastx:0.5        
     416#Blastx Percent Identity Threshold Protein-Genome Aligments 
     417pid_blastx:0.4         
     418#Blastx eval cutoff                                       
     419eval_blastx:1e-06      
     420#Blastx bit cutoff                                        
     421bit_blastx:30          
     422 
     423#Blastx Percent Coverage Threhold For Transposable Element Masking  
     424pcov_rm_blastx:0.5     
     425#Blastx Percent Identity Threshold For Transposbale Element Masking 
     426pid_rm_blastx:0.4      
     427#Blastx eval cutoff for transposable element masking              
     428eval_rm_blastx:1e-06   
     429#Blastx bit cutoff for transposable element masking               
     430bit_rm_blastx:30       
     431 
     432C. maker_opts.ctl - contains options for maker and external programs 
     433used by maker. 
    286434Here an example of a section of the maker_opts.ctl file: 
    287 ==================================== 
     435 
    288436#-----Genome (Required for De-Novo Annotations) 
    289 genome:input/genome.fasta  #genome sequence file in fasta format 
     437#genome sequence file in fasta format 
     438genome:input/genome.fasta 
    290439 
    291440#-----Re-annotation Options 
    292 genome_gff:     #re-annotate genome based on this gff3 file 
    293 est_pass:0      #use ests in genome_gff: 1 = yes, 0 = no 
    294 altest_pass:0   #use alternate organism ests in genome_gff: 1 = yes, 0 = no 
    295 protein_pass:0  #use proteins in genome_gff: 1 = yes, 0 = no 
    296 rm_pass:0       #use repeats in genome_gff: 1 = yes, 0 = no 
    297 model_pass:0    #use gene models in genome_gff: 1 = yes, 0 = no 
    298 pred_pass:0     #use ab-initio predictions in genome_gff: 1 = yes, 0 = no 
    299 other_pass:0    #passthrough everything else in genome_gff: 1 = yes, 0 = no 
     441 
     442#re-annotate genome based on this gff3 file                  
     443genome_gff:      
     444#use ests in genome_gff: 1 = yes, 0 = no                     
     445est_pass:0       
     446#use alternate organism ests in genome_gff: 1 = yes, 0 = no  
     447altest_pass:0    
     448#use proteins in genome_gff: 1 = yes, 0 = no                 
     449protein_pass:0   
     450#use repeats in genome_gff: 1 = yes, 0 = no                  
     451rm_pass:0        
     452#use gene models in genome_gff: 1 = yes, 0 = no              
     453model_pass:0     
     454#use ab-initio predictions in genome_gff: 1 = yes, 0 = no    
     455pred_pass:0      
     456#passthrough everything else in genome_gff: 1 = yes, 0 = no  
     457other_pass:0     
    300458 
    301459#-----EST Evidence (you must provide a value for at least one) 
    302 est:input/est.fasta        #non-redundant set of assembled ESTs in fasta format (classic EST analysis) 
    303 est_reads:                 #un-assembled EST reads in fasta format (for deep nextgen mRNASeq) 
    304 altest:input/altest.fasta  #EST/cDNA sequence file in fasta format from an alternate organism 
    305 est_gff:                   #EST evidence from a seperate gff3 file 
    306 altest_gff:                #Alternate organism EST evidence from a seperate gff3 file 
    307  
    308 #-----Protein Homology Evidence (you must provide a value for at least one) 
    309 protein:input/protein.fasta  #protein sequence file in fasta format 
    310 protein_gff:                 #protein homology evidence from a gff3 file 
    311 ==================================== 
    312  
    313 #---------------------------------------------------- 
     460 
     461#non-redundant set of assembled ESTs in fasta format (classic EST 
     462#analysis) 
     463est:input/est.fasta         
     464#un-assembled EST reads in fasta format (for deep nextgen mRNASeq) 
     465est_reads:                  
     466#EST/cDNA sequence file in fasta format from an alternate organism 
     467altest:input/altest.fasta   
     468#EST evidence from a seperate gff3 file                                
     469est_gff:                    
     470#Alternate organism EST evidence from a seperate gff3 file 
     471altest_gff:                 
     472 
     473#-----Protein Homology Evidence (you must provide a value for at least 
     474#     one) 
     475#protein sequence file in fasta format 
     476protein:input/protein.fasta 
     477#protein homology evidence from a gff3 file 
     478protein_gff: 
     479 
     480#--------------------------------------------------------------------- 
     481 
    314482GFF3 Passthrough 
    315483 
    316 If you have data from a source that MAKER does not support, and you wish to use the data in annotating a genome, then you can pass the data to MAKER as an aligned GFF3 file.  This is done by supplying the files location to the appropriate value in the maker_opt.ctl file (i.e. est_gff:input\est.gff).  Note that MAKER expects all data sent to it to be of the type specified, so don't put mixed data in a file (i.e. don't mix EST and other data in the file pointed to by est_gff, otherwise it all gets used as EST data).  Also the genome_gff option is only for MAKER produced GFF3 files.  Other GFF3 files of mixed data must be split by type and identified by the appropriate control file option (i.e. rm_gff for repeat data, pred_gff for ab-initio prediction data, est_gff for EST data, etc.).  
    317  
    318 #---------------------------------------------------- 
     484If you have data from a source that MAKER does not support, and you 
     485wish to use the data in annotating a genome, then you can pass the 
     486data to MAKER as an aligned GFF3 file.  This is done by supplying the 
     487files location to the appropriate value in the maker_opt.ctl file 
     488(i.e. est_gff:input\est.gff).  Note that MAKER expects all data sent 
     489to it to be of the type specified, so don't put mixed data in a file 
     490(i.e. don't mix EST and other data in the file pointed to by est_gff, 
     491otherwise it all gets used as EST data).  Also the genome_gff option 
     492is only for MAKER produced GFF3 files.  Other GFF3 files of mixed data 
     493must be split by type and identified by the appropriate control file 
     494option (i.e. rm_gff for repeat data, pred_gff for ab-initio prediction 
     495data, est_gff for EST data, etc.). 
     496 
     497#--------------------------------------------------------------------- 
     498 
    319499ADDING UTRs for GBROWSE 
    320500 
    321 * When using APOLLO to visualize gene annotations, UTRs are inferred based on exon and CDS locations.  However GMOD and GBROWSE do not infer the UTR, so to visualize the UTR, you will have to run: add_utr_gff.pl with the following command: 
     501When using APOLLO to visualize gene annotations, UTRs are inferred 
     502based on exon and CDS locations.  However GMOD and GBROWSE do not 
     503infer the UTR, so to visualize the UTR, you will have to run: 
     504add_utr_gff.pl with the following command: 
    322505 
    323506maker2zff.pl <directory> 
    324 <directory> is the directory where all of your GFF files are located 
    325  
    326 each GFF file will have a sister file called sequence.wutr.gff3 
    327  
    328  
    329 #---------------------------------------------------- 
     507 
     508    * <directory> is the directory where all of your GFF files are 
     509                  located. 
     510 
     511Each GFF file will have a sister file called sequence.wutr.gff3. 
     512 
     513#--------------------------------------------------------------------- 
     514 
    330515APOLLO 
    331516 
    332 Maker is bundled with a configuration file that improves the color and display of maker annotations and evidence in the Apollo genome browser.  The configuration file is called "gff3.tiers" and is located in the maker/Apollo/ directory.  The file should be copied to the conf/ sub_directory which is located under the Apollo instalation directory.  Using the Mac version of Apollo the conf/ directory is located at /Applications/Apollo.app/Contents/Resources/app/conf/. 
    333  
    334  
    335 #---------------------------------------------------- 
     517Maker is bundled with a configuration file that improves the color and 
     518display of maker annotations and evidence in the Apollo genome 
     519browser.  The configuration file is called "gff3.tiers" and is located 
     520in the maker/Apollo/ directory.  The file should be copied to the 
     521conf/ sub_directory which is located under the Apollo instalation 
     522directory.  Using the Mac version of Apollo the conf/ directory is 
     523located at /Applications/Apollo.app/Contents/Resources/app/conf/. 
     524 
     525#--------------------------------------------------------------------- 
     526 
    336527HMM BUILDING (based on snap documentation) 
    337528 
    338 A.  First you will need to determine the genes used to model future genes, by determining a high quality gene set (annotations for the high quality gene should be in GFF3 format).  The high quality gene set can then be coverted into snap ZFF format using maker2zff.pl found in maker/bin. 
    339  
    340 This program is run with the following command: 
    341  
    342       maker2zff.pl <directory> genome 
    343  
    344 *<directory> is the directory where all of your GFF3 files are located 
    345 *geneome is the name for the outfile 
    346  
    347 Files Created: 
    348  
    349       genome.ann 
    350       genome.dna 
    351  
    352 Note:  A convenient way to identify and initial high quality gene set for the HMM is to use the -predictor est2genome option in maker.  This will produce gene annotations based solely on EST evidence.  These annoations can then seed the first HMM.  After running maker again using this new HMM and the -predictor snap option, you can use the second round of annotations as the seed for an even better HMM model.  In this way the HMM model progressively improves with each run of maker. 
    353  
    354 Another strategy for identifying an initial gene set to model the HMM is to use the program CEGMA (http://korflab.ucdavis.edu/software.html).  CEGMA builds a highly reliable set of gene annotations in the absence of experimental data by identifying DNA regions with homology to a set of 458 proteins that are highly conserved among taxa. 
    355  
    356 Combining both CEGMA and maker datasets to build the first HMM is also a good strategy. 
    357  
    358  
    359 B.  Next you will use the dna and zff file (genome.dna and genome.ann) to produce a SNAP HMM as described in the SNAP documation (which we have provided below): 
    360  
    361 The first step is to look at some features of the genes: 
    362  
    363     fathom genome.ann genome.dna -gene-stats  
    364  
    365 Next, you want to verify that the genes have no obvious errors: 
    366  
    367     fathom genome.ann genome.dna -validate 
    368  
    369 You may find some errors and warnings. Check these out in some kind of genome 
    370 browser and remove those that are real errors. Next, break up the sequences into 
    371 fragments with one gene per sequence with the following command: 
    372  
    373     fathom -genome.ann genome.dna -categorize 1000 
    374  
    375 There will be up to 1000 bp on either side of the genes. You will find 
    376 several new files. 
    377  
    378     alt.ann, alt.dna (genes with alternative splicing) 
    379     err.ann, err.dna (genes that have errors) 
    380     olp.ann, olp.dna (genes that overlap other genes) 
    381     wrn.ann, wrn.dna (genes with warnings) 
    382     uni.ann, uni.dna (single gene per sequence) 
    383  
    384 Convert the uni genes to plus stranded with the command: 
    385  
    386     fathom uni.ann uni.dna -export 1000 -plus 
    387  
    388 You will find 4 new files: 
    389  
    390     export.aa   proteins corresponding to each gene 
    391     export.ann  gene structure on the plus strand 
    392     export.dna  DNA of the plus strand 
    393     export.tx   transcripts for each gene 
    394  
    395 The parameter estimation program, forge, creates a lot of files. You probably 
    396 want to create a directory to keep things tidy before you execute the program. 
    397  
    398     mkdir params 
    399     cd params 
    400     forge ../export.ann ../export.dna 
    401     cd .. 
    402  
    403 Last is to build an HMM. 
    404  
    405     hmm-assembler.pl my-genome params > my-genome.hmm 
    406  
    407  
    408 Lastly, you will want to add the location of your hmm file to your maker_opts.ctl file. 
    409  
    410 *For more information see SNAP documentation on how to build an HMM 
     529 
     530A) First you will need to determine the genes used to model future 
     531   genes, by determining a high quality gene set (annotations for the 
     532   high quality gene should be in GFF3 format).  The high quality gene 
     533   set can then be coverted into snap ZFF format using maker2zff.pl 
     534   found in maker/bin. 
     535 
     536   This program is run with the following command: 
     537 
     538       maker2zff.pl <directory> genome 
     539 
     540       * <directory> is the directory where all of your GFF3 files are 
     541         located 
     542       * geneome is the name for the outfile 
     543 
     544   Files Created: 
     545       genome.ann 
     546       genome.dna 
     547 
     548   Note: A convenient way to identify and initial high quality gene 
     549   set for the HMM is to use the -predictor est2genome option in 
     550   maker.  This will produce gene annotations based solely on EST 
     551   evidence.  These annoations can then seed the first HMM.  After 
     552   running maker again using this new HMM and the -predictor snap 
     553   option, you can use the second round of annotations as the seed 
     554   for an even better HMM model. In this way the HMM model 
     555   progressively improves with each run of maker. 
     556 
     557   Another strategy for identifying an initial gene set to model the 
     558   HMM is to use the program CEGMA (http://korflab.ucdavis.edu/ 
     559   software.html).  CEGMA builds a highly reliable set of gene 
     560   annotations in the absence of experimental data by identifying DNA 
     561   regions with homology to a set of 458 proteins that are highly 
     562   conserved among taxa. 
     563 
     564   Combining both CEGMA and maker datasets to build the first HMM is 
     565   also a good strategy. 
     566 
     567B) Next you will use the dna and zff file (genome.dna and genome.ann) 
     568   to produce a SNAP HMM as described in the SNAP documention (which 
     569   we have provided below): 
     570 
     571   The first step is to look at some features of the genes: 
     572 
     573       fathom genome.ann genome.dna -gene-stats  
     574 
     575   Next, you want to verify that the genes have no obvious errors: 
     576 
     577       fathom genome.ann genome.dna -validate 
     578 
     579   You may find some errors and warnings. Check these out in some kind 
     580   of genome browser and remove those that are real errors. Next, 
     581   break up the sequences into fragments with one gene per sequence 
     582   with the following command: 
     583 
     584       fathom -genome.ann genome.dna -categorize 1000 
     585 
     586   There will be up to 1000 bp on either side of the genes. You will 
     587   find several new files. 
     588 
     589       alt.ann, alt.dna (genes with alternative splicing) 
     590       err.ann, err.dna (genes that have errors) 
     591       olp.ann, olp.dna (genes that overlap other genes) 
     592       wrn.ann, wrn.dna (genes with warnings) 
     593       uni.ann, uni.dna (single gene per sequence) 
     594 
     595   Convert the uni genes to plus stranded with the command: 
     596 
     597       fathom uni.ann uni.dna -export 1000 -plus 
     598 
     599   You will find 4 new files: 
     600 
     601       export.aa   proteins corresponding to each gene 
     602       export.ann  gene structure on the plus strand 
     603       export.dna  DNA of the plus strand 
     604       export.tx   transcripts for each gene 
     605 
     606   The parameter estimation program, forge, creates a lot of files. 
     607   You probably want to create a directory to keep things tidy before 
     608   you execute the program. 
     609 
     610       mkdir params 
     611       cd params 
     612       forge ../export.ann ../export.dna 
     613       cd .. 
     614 
     615   Last is to build an HMM. 
     616 
     617       hmm-assembler.pl my-genome params > my-genome.hmm 
     618 
     619   Lastly, you will want to add the location of your hmm file to your 
     620   maker_opts.ctl file. 
     621 
     622* For more information see SNAP documentation on how to build an HMM 
  • bin/maker_map_ids

    r192 r199  
    1111Synopsis: 
    1212 
    13 maker_map_ids genome.all.gff > genome.all.id.map 
     13maker_map_ids --prefix PYU1_ --justify 6 genome.all.gff > genome.all.id.map 
    1414 
    1515Description: 
    1616 
    17 This script wil take the output from a maker GFF file and create a mapping 
    18 file of maker gene and transcript IDs to new human friendly unique IDs. 
     17This script wil take a GFF3 file and create a mapping file of gene and 
     18transcript IDs to more numerically incremented human friendly unique 
     19IDs. 
     20 
     21Options: 
     22 
     23  --prefix      The prefix to use for all IDs 
     24  --justify     The unique integer portion of the ID will be right justified 
     25                with '0's to this length. 
     26  --sort_order  A tab delimited file containing two columns: contig_id 
     27                and sort_order.  Genes and transcripts will be named 
     28                in consecutive order along the contigs, and the 
     29                contigs will be sorted in the order specified by the 
     30                file.  If sort_order is not given and there are 
     31                ##sequence-region directives at the top of the gff 
     32                file then naming will be ordered by decreasing contig 
     33                length. 
     34 
     35Assumptions: 
     36 
     37  The script limits ID mapping to gene and mRNA features, and in 
     38  includes 'G' and 'T' abbreviations within the IDs for genes and 
     39  mRNAs respectively.  These behaviors are hard coded. 
    1940 
    2041"; 
    2142 
    2243 
    23 my ($help); 
    24 my $opt_success = GetOptions('help'    => \$help, 
     44my ($help, $prefix, $justify, $sort_order); 
     45my $opt_success = GetOptions('help'         => \$help, 
     46                             'prefix'       => \$prefix, 
     47                             'justify=s'    => \$justify, 
     48                             'sort_order=s' => \$sort_order, 
    2549                              ); 
    2650 
    27 die $usage if $help || ! $opt_success; 
     51die $usage if $help || ! $opt_success  || ! $prefix; 
     52$justify ||= 6; 
    2853 
    29 my $file = shift; 
    30 die $usage unless $file; 
    31 open (my $IN, '<', $file) or die "Can't open $file for reading\n$!\n"; 
     54my $gff_file = shift; 
     55die $usage unless $gff_file; 
    3256 
    33 my %contig_length; 
     57my $sort_map = parse_sort_order($sort_order, $gff_file); 
     58 
     59open (my $IN, '<', $gff_file) or die "Can't open $gff_file for reading\n$!\n"; 
     60 
    3461my %counter; 
    35 my %parents; 
     62my %ids; 
     63# Build a hash of ids that need to be mapped; 
    3664while (<$IN>) { 
    37  
    38         if (/\#\#sequence-region\s+(\S+)\s+(\d+)\s+(\d+)/) { 
    39                 my ($contig_id, $start, $end) = ($1, $2, $3); 
    40                 $contig_length{$contig_id} = ($end - $start); 
    41                 next; 
    42         } 
    4365 
    4466        last if /^\#\#FASTA/; 
     
    4870            $phase, $attrb_text) = split /\t/, $_; 
    4971 
     72        # Here we explicity limit the mapping to genes and mRNAs 
    5073        if ($type =~ /^(gene|mRNA)$/) { 
    5174                my ($id) = $attrb_text =~ /ID=(.*?);/; 
    52                 $parents{$seq}{$start}{$id} = $type; 
     75                $ids{$seq}{$start}{$id} = $type; 
    5376        } 
    5477} 
    55  
    56 for my $contig_id (sort {$contig_length{$b} <=> $contig_length{$a}} keys %parents) { 
    57         my $contig = $parents{$contig_id}; 
     78# Create the new ID map.  Sort contigs by $sort_map in outer loop and  
     79# features by $start in second loop. 
     80for my $contig_id (sort {sort_contigs($a, $b, $sort_map)} keys %ids) { 
     81        my $contig = $ids{$contig_id}; 
    5882        for my $start (sort {$b <=> $a} keys %{$contig}) { 
    5983                for my $id (keys %{$contig->{$start}}) { 
    6084                        my $type  = $contig->{$start}{$id}; 
     85                        #Here we create gene and mRNA abbreviation. 
    6186                        my $abrv  = $type eq 'gene' ? 'G' : 'T'; 
    6287                        my $count = sprintf '%06s', ++$counter{$type}; 
     
    6792        } 
    6893} 
    69  
    70 __END__ 
    71  
    72  
    73  
    74                     my @attrb_list = split /\s*;\s*/, $attrbs_text; 
    75                     my %attrbs; 
    76                     for my $attrb_pair (@attrb_list) { 
    77                             my ($tag, $value_text) = split /\s*=\s*/, $attrb_pair; 
    78                             my @values = split /\s*, \s*/, $value_text; 
    79                             push @{$attrbs{$tag}} = @values; 
    80                     } 
    81                      
    82  
    83                              
    84                     } 
    85             } 
    86  
    8794#----------------------------------------------------------------------------- 
    8895#-------------------------------- SUBROUTINES -------------------------------- 
    8996#----------------------------------------------------------------------------- 
     97sub parse_sort_order { 
    9098 
    91 sub { 
     99        my ($sort_order_file, $gff_file) = @_; 
    92100 
     101        my %sort_order; 
     102 
     103        if ($sort_order_file) { 
     104                open(my $IN, '<', $sort_order_file) or die "Can't open $sort_order_file\n$!\n"; 
     105                my %sort_order = (<$IN>); 
     106                close $IN; 
     107        } 
     108        elsif (`grep -P '\#\#sequence-region' $gff_file`) { 
     109                open(my $IN, '<', $gff_file) or die "Can't open $gff_file\n$!\n"; 
     110                my $flag; 
     111                while (my $line = <$IN>) { 
     112                        if ($line =~ /\#\#sequence-region\s+(\S+)\s+(\d+)\s+(\d+)/) { 
     113                                $flag++; 
     114                                my ($contig_id, $start, $end) = ($1, $2, $3); 
     115                                $sort_order{$contig_id} = ($end - $start); 
     116                        } 
     117                        last if $flag; 
     118                } 
     119                close $IN; 
     120        } 
     121        else { 
     122                open(my $IN, '<', $gff_file) or die "Can't open $gff_file\n$!\n"; 
     123 
     124                while (<$IN>) { 
     125                        my @fields = split /\t/, $_; 
     126                        next unless @fields == 9; 
     127                        my $seq = $fields[0]; 
     128                        $sort_order{$seq}++; 
     129                } 
     130        } 
     131        return \%sort_order; 
    93132} 
    94  
     133#----------------------------------------------------------------------------- 
     134sub sort_contigs { 
     135        my ($a, $b, $sort_map) = @_; 
     136        return ($sort_order->{$a} <=> $sort_map->{$b}) if $sort_order; 
     137        return ($sort_order->{$b} <=> $sort_map->{$a}); 
     138
  • bin/map_fasta_ids

    r192 r199  
    3232die $usage unless $map_file && $fasta_file; 
    3333 
     34# Read the map file and build a map hash; 
    3435open (my $MAP, '<', $map_file) or die "Can't open $map_file for reading\n$!\n"; 
    3536my %map; 
     
    3738close $MAP; 
    3839 
     40# Open the fasta file for input unlink it to avoid clobbering it and open the  
     41# same file for output. 
    3942open (my $IN, '<', $fasta_file) or die "Can't open $fasta_file for reading\n$!\n"; 
    4043unlink($fasta_file); 
    4144open(my $OUT, '>', $fasta_file) or die "Can't open $fasta_file for writing\n$!\n"; 
    4245 
     46# Just do it! 
    4347while (<$IN>) { 
    4448        if  (/^>/) { 
  • bin/map_gff_ids

    r192 r199  
    1616 
    1717This script takes a id map file and changes the name of the 
    18 appropriate attributes in a gff file.  The map file is a two column 
    19 tab delimited file with two columns: old_name and new_name.  If the 
    20 feature has a Parent attribute, it uses this as old_name and modifies 
    21 all other attributes that cantain this name even if it's the root of 
    22 another name (i.e. an exon with a Parent=old_name and an 
    23 ID=old_name:exon_1 would become Parent=new_name and 
    24 ID=new_name:exon_1).  In features that have an ID and not a pareent, 
    25 the old_name is moved to the Alias attribute. 
     18appropriate attributes in a gff file.  The map file is tab delimited 
     19file with two columns: old_name and new_name.The script requires that 
     20old_name (or the part of old_name before the first colon) be equal to 
     21an attribute value before it will map it. 
    2622 
    2723"; 
     
    3531die $usage unless $map_file && $gff_file; 
    3632 
     33# Read map file and create a map hash 
    3734open (my $MAP, '<', $map_file) or die "Can't open $map_file for reading\n$!\n"; 
    3835my %map; 
     
    4037close $MAP; 
    4138 
     39# Open $gff_file for reading, unlink it to avoid clobbering it and then open 
     40# the same file for writing.  This allows in-place editing. 
    4241open (my $IN, '<', $gff_file) or die "Can't open $gff_file for reading\n$!\n"; 
    4342unlink($gff_file); 
    4443open(my $OUT, '>', $gff_file) or die "Can't open $gff_file for writing\n$!\n"; 
     44 
     45# Set the order in which attributes will appear in the attribute text. 
     46# Attributes not listed here will be sorted alphabetically. 
     47my %order = (ID     => 1, 
     48             Parent => 2, 
     49             Name   => 3, 
     50             Alias  => 4, 
     51            ); 
    4552 
    4653my $fasta_start; 
     
    5057        my ($seq, $source, $type, $start, $end, $score, 
    5158            $strand, $phase, $attrb_text) = split /\t/, $line; 
    52         if (! $fasta_start    && 
    53             $line !~ /^\s*\#/    && 
    54             $start =~ /^\d+$/ && 
    55             $end =~ /^\d+$/   && 
    56             $attrb_text) { 
     59        if (! $fasta_start    && # If we're not in the fasta section... 
     60            $line !~ /^\s*\#/ && # and we're not a comment line... 
     61            $start =~ /^\d+$/ && # and we have a valid start column... 
     62            $end =~ /^\d+$/   && # and a valid end column... 
     63            $attrb_text          # and attribute text 
     64           ) { 
    5765                chomp $attrb_text; 
    58                 my @attrb_list = split /\s*;\s*/, $attrb_text; 
    59                 my %attrb; 
    60                 for my $attrb_pair (@attrb_list) { 
    61                         my ($tag, $value_text) = split /\s*=\s*/, $attrb_pair; 
    62                         my @values = split /\s*,\s*/, $value_text; 
    63                         $attrb{$tag} = \@values; 
    64                 } 
     66                my %attrb = parse_attributes($attrb_text); 
     67 
     68                # Create an alias with the old_name for gene and mRNA features. 
    6569                my $alias; 
    66                 my @ids; 
    6770                if ($type eq 'mRNA' || $type eq 'gene') { 
    6871                        $alias = $attrb{ID}[0] if $type =~ /^(gene|mRNA)$/; 
    6972                } 
     73 
     74                # Collect all IDs that should be mapped for this feature 
     75                # (the values of Parent and ID tags) 
     76                my @ids; 
    7077                push @ids, @{$attrb{Parent}} if $attrb{Parent}; 
    7178                push @ids, @{$attrb{ID}}     if $attrb{ID}; 
     79 
     80                # Keep only the IDs that are in the mapping file. 
    7281                my @map_ids; 
    7382                map {push @map_ids, $_ if $map{$_}}  @ids; 
     83 
     84                # If there is nothing to map, print the line and go to 
     85                # the next feature. 
    7486                if (! @map_ids) { 
    7587                        print $OUT $line; 
    7688                        next LINE; 
    7789                }; 
     90 
     91                # Map old_name to new_name for each tag value... 
    7892                for my $tag (keys %attrb) { 
    7993                        for my $value (@{$attrb{$tag}}) { 
    8094                                for my $id (@map_ids) { 
    81                                         next unless $map{$id}; 
     95                                        # Only if the value (or the portion preceding 
     96                                        # the first colon) is equal to the map key. 
    8297                                        next unless ($value eq $id || $value =~ /$id:/); 
    8398                                        $value =~ s/$id/$map{$id}/; 
     
    85100                        } 
    86101                } 
     102                # Now that we're done mapping, add the alias to the attributes 
     103                # if we have one. 
    87104                $attrb{Alias} = [$alias] if $alias; 
    88                 my %order = (ID     => 1, 
    89                              Parent => 2, 
    90                              Name   => 3, 
    91                              Alias  => 4, 
    92                             ); 
     105 
     106                # Make sure we have all attribute keys added to sort order to 
     107                # avoid undef in <=> 
    93108                map {$order{$_} ||= 99} keys %attrb; 
     109 
     110                # Create new attribute text. 
    94111                my $new_attrb_text; 
    95112                for my $tag (sort {$order{$a} <=> $order{$b} || $a cmp $b} keys %attrb) { 
     
    98115                } 
    99116                $new_attrb_text .= "\n"; 
     117 
     118                # Create a new $line. 
    100119                $line = join "\t", ($seq, 
    101120                                 $source, 
     
    111130        print $OUT $line; 
    112131} 
     132#----------------------------------------------------------------------------- 
     133#-------------------------- Subroutines -------------------------------------- 
     134#----------------------------------------------------------------------------- 
     135sub parse_attributes { 
     136        my $attrb_text = shift; 
     137 
     138        my %attrb; 
     139        my @attrb_list = split /\s*;\s*/, $attrb_text; 
     140        for my $attrb_pair (@attrb_list) { 
     141                my ($tag, $value_text) = split /\s*=\s*/, $attrb_pair; 
     142                my @values = split /\s*,\s*/, $value_text; 
     143                $attrb{$tag} = \@values; 
     144        } 
     145        return %attrb; 
     146}