README file for Stochastic Evolutionary Model of a Protein's Production Rate (SEMPPR)
*************************************************************************************

VERSION:
    SEMPPR 1.0.1

DESCRIPTION:
    The code calculates the production rate of a protein from the degree of
    its adaptation, as measured by its codon usage pattern, in an explicitly
    evolutionary framework. The model is described in:

    "Gilchrist, M.A. (2007) Combining Models of Protein Translation and
    Population Genetics to Predict Protein Production Rates from Codon Usage
    Pattern using SEMPPR. Molecular Biology & Evolution."

LICENSE:
    Licensed under GNU General Public License. See LICENSE file for details.

SYNOPSIS:
    ./SEMPPR [options] -F

OPTIONS:
    -T          Specify the location of the tRNA abundance/codon translation
                file.
                [DEFAULT] "tRNA_files/S.cerevisiae.tRNA"

    -O          Specify the prefix for the output files.
                [DEFAULT] "output/out"

    -N          Specify the number of sequences to be simulated for
                calculating the Eta distribution.
                [DEFAULT] N=1000

    -M MLE, MOM Specify whether the parameters of the beta distribution
                should be calculated using the Maximum Likelihood method
                (MLE) or from the moments of the Eta distribution (MOM).
                The MLE approach is numerically more intensive than the MOM
                approach. The estimates from the two methods converge
                asymptotically as N gets large.
                [DEFAULT] "MLE"

    -P          Specify whether the simulated codon sequences should be
                printed. If P>0, the program creates a separate file for
                each sequence in the output directory, with the sequence ID
                as the filename.
                0 = Off; 1 = Cryptic; 2 = Codon sequences
                [DEFAULT] P=0

    -G          Specify whether the program should read the simulated
                sequences from a file or create a new set of sequences.
                Use this option only if the code was run EARLIER with the
                -P option and P>0 on the same set of sequences. If G>0,
                give the format that was used to print the sequences
                with -P.
                0 = New; 1 = Cryptic; 2 = Codon sequences
                [DEFAULT] G=0

    -V [0,1,2]  Specify the verbosity of the messages from the code, so
                that the user can keep track of SEMPPR's progress.
                0 = Off; 1 = Minimal; 2 = Extended
                [DEFAULT] "1"

BINARIES:
    *NIX and OSX:
        A precompiled binary for *NIX machines is included as bin/SEMPPR.
        This binary was compiled on a Linux machine using GCC 4.1 with the
        command:

            g++ -static -lgsl -lgslcblas -lm -mtune=generic

        The binary should ideally run on all i386 and x86_64 machines
        running Linux or OSX with the GNU Scientific Library (GSL)
        installed. If you encounter problems, we encourage you to try
        recompiling the code for your local machine before contacting the
        authors.

    WINDOWS:
        A precompiled binary for Windows XP is included as bin/SEMPPR.exe

COMPILERS:
    The code can be recompiled from source and optimized for the hardware of
    the machine it will be run on. The source code has been successfully
    compiled with the following compilers.

        Mac:     gcc, g++
        Linux:   g++, gcc -lm
        Windows: Bloodshed Dev C++, Lcc-win32

RUNTIME:
    On a 2.66 GHz single core Xeon processor, the code takes approximately

        0.061 seconds per gene for  1000 configurations and 10000 steps using MOM
        0.511 seconds per gene for 10000 configurations and 10000 steps using MOM
        0.631 seconds per gene for  1000 configurations and 10000 steps using MLE

FILE FORMATS:
    INPUT FILES:
        Sequence file:
            * The file should contain genes in the standard FASTA format.
            * The analysis of a gene terminates at the first nonsense codon,
              or at the first codon that contains any character other than
              "A", "T", "G" and "C".
            * All nucleotides should be in uppercase characters.
            * Each sequence should contain only an ORF, WITHOUT the UTRs and
              intronic regions.

        tRNA abundance/codon translation file:
            [AA] [CODON] [ABUNDANCE/RATE]

            example:
                A    GCG    7.82
                A    GCT    18.2
                C    TGC    7.47
                .    .      .
                .    .      .
                .    .      .
                W    TGG    11.7
                Y    TAC    17.0
                Y    TAT    10.4

    OUTPUT FILES:
        The output file prefix (*) is set by the -O flag at run time.

        *.param_of_eta_distbn:
            [ID] [Eta] [Eta_min] [Eta_max] [Alpha] [Beta]

            ID      = Sequence ID.
            Eta     = Observed Eta value of the sequence.
            Eta_min = Eta value of the sequence using the optimal codons.
            Eta_max = Eta value of the sequence using the least optimal
                      codons.
            Alpha   = Alpha parameter of the beta distribution fitted to the
                      Eta values of the simulated sequences.
            Beta    = Beta parameter of the beta distribution fitted to the
                      Eta values of the simulated sequences.

        *.summary_stats_of_phi:
            [ID] [phi_mode] [phi_arith_mean] [phi_geomtrc_mean] [phi_variance]

            ID               = Sequence ID.
            phi_mode         = Mode of the posterior distribution of the
                               production rates.
            phi_arith_mean   = Arithmetic mean of the posterior distribution
                               of the production rates.
            phi_geomtrc_mean = Geometric mean of the posterior distribution
                               of the production rates.
            phi_variance     = Variance of the posterior distribution of the
                               production rates.

        *.posterior_distbn_of_phi:
            [ID] [0.05]...[0.9999]

            ID   = Sequence ID.
            0.05 = phi value at the 0.05 quantile of the posterior
                   distribution.

        *.SEMPPR.log:
            Log file of all the runs for a given prefix. The information
            for each run is appended to this file.

EXAMPLES:
    Examples of output files created using ./fasta/example.fasta are in the
    ./example folder. Note that because g(\eta) is estimated by simulation,
    the exact values in the output will vary slightly between runs.

UPDATES:
    Updates for this code can be found at the following website:
    www.tiem.utk.edu/~mikeg/SupplementaryMaterials/SEMPPR/semppr.html

BUGS:
    To report bugs or other trouble with the code, send an email to
    pshah1@utk.edu
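
INPUT CHECK (SKETCH):
    The sequence file rules listed under FILE FORMATS can be checked before
    a run. The Python sketch below is not part of the SEMPPR distribution.
    The function names (read_fasta, first_problem, report) are hypothetical,
    and the stop codon set assumes the standard genetic code (TAA, TAG, TGA),
    which this file does not state. For each gene it finds the first codon at
    which, by the rules above, the analysis would terminate.

```python
# Hypothetical pre-flight check for a SEMPPR sequence file.
# Not part of the SEMPPR distribution; names and stop codons are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
VALID_BASES = set("ACGT")            # uppercase only, per the input rules


def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def first_problem(seq):
    """Return (codon_index, reason) for the first flagged codon, or None.

    codon_index is 0-based. Lowercase bases are flagged as invalid characters.
    """
    whole = len(seq) - len(seq) % 3
    for i in range(0, whole, 3):
        codon = seq[i:i + 3]
        if not set(codon) <= VALID_BASES:
            return i // 3, "character other than A, T, G, C in " + codon
        if codon in STOP_CODONS:
            return i // 3, "nonsense codon " + codon
    if len(seq) % 3:
        return len(seq) // 3, "incomplete trailing codon"
    return None


def report(path):
    """Print one line for every gene in the FASTA file that is flagged."""
    with open(path) as handle:
        for seq_id, seq in read_fasta(handle):
            problem = first_problem(seq)
            if problem is not None:
                index, reason = problem
                print(f"{seq_id}: codon {index + 1}: {reason}")
```

    For example, report("fasta/example.fasta") prints one line per flagged
    gene. A nonsense codon reported at the final codon of a gene is likely
    just the terminal stop of the ORF rather than an error.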