$Config{perladmin} = "Andrew.Torda@anu.edu.au"; =pod =head1 NAME wurst - Wurst is a Useful, Reliable Structural Tool =head1 SYNOPSIS B I =head1 DESCRIPTION use Wurst; or, depending on the directory you are running from, use FindBin; use lib "$FindBin::Bin/../src/Wurst/blib/arch"; use lib "$FindBin::Bin/../src/Wurst/blib/lib"; use Wurst; =head1 INTRODUCTION There is no monolithic program, wurst. Instead, this is a framework which works mostly with proteins and lets one do things like sequence alignments, sequence to structure alignments, and pure structural alignments. The philosophy is that there is a set of low level routines coded in C. These are called from perl. To do anything useful, you write a little perl script which calls the functions you want. Scripts have been written to do things such as parameter optimization using a simplex or protein structure-based phylogeny using a guide tree-based method. =head1 DATA TYPES We define certain kinds of pieces of data. In general, the interpreter calls a function which returns a pointer to some piece of data with a label. This may be passed into other function calls. When the last reference to the piece of data disappears, the object will be cleared up. For example, C returns something of type SeqPtr. This can then be passed to other functions like C. The scheme is such that we have some kind of type checking. For example, make_model has an interface like make_model ( p, seq, coord) and the interpreter will only make the call if the first argument is a Pair_set, the second a sequence and the third some coordinates. The set of data types includes: =over =item Aa_clssfcn =item Clssfcn =item Coord =item FXParam =item Pair_set =item Param =item Prob_vec =item Sec_s_data =item Seq =item Seq_array =item Sub_mat =item Score_mat =item Scor_set =item Seqprof =back =head1 PROBABILITY VECTOR NOTES This is a special section to point one in the right direction to read about the functions which create probability vectors. These are described in the full function list, but they can be hard to find, so they are gathered together here. =over =item seq_2_prob_vec Create a probability vector from a sequence structure, using a classification that is purely sequence based. =item struct_2_prob_vec Create a probability vector from a structure, using a classification that is purely sequence based. =item aa_strct_2_prob_vec Create a probability vector from a structure, but using a classification based on sequence and structure information. Use both the sequence and structure data. =item aa_2_prob_vec SEQ CLSSFCN =item aa_2_prob_vec SEQ CLSSFCN NORM Create a probability vector from a sequence SEQ, but using a classification CLSSFCN based on sequence and structure information. This is the same classification as above in C. The third argument can be used to turn off normalising of vectors by passing a value of S<0 (zero)>. By default, S. =back =head1 FUNCTIONS =over =item ac_nclass AA_CLSSFCN B is an amino acid fragment classification. Return the number of classes in the classification. =item ac_dump AA_CLSSFCN B is an amino acid fragment classfication. Print some information to stdout about the classication. This is useful when checking if a classification has been read properly. =item ac_nclass AA_CLSSFCN B is an amino acid fragment classfication. Return the number of classes. =item ac_size AA_CLSSFCN B is an amino acid fragment classfication. Return the fragment length. The intention is only simple for the case of sequence classes. For structural classifications, there may be more than I descriptors for I residues. =item ac_read FNAME Read an amino acid fragment classification. Return a B object if successful, undef otherwise. =item ac_size AA_CLSSFCN B is an amino acid fragment classification. Return the fragment size used in the classification. =item blst_chk_read FNAME Read a BLAST checkpoint file from B. Return a B object if successful, undef otherwise. =item blst_chk_write FNAME Seqprof Writes a BLAST checkpoint file to B. Returns 1 if successful, and 0 otherwise. This profile is functionally equivalent to profiles written by psi-blast and used by wurst, but not byte-equivalent (due to floating point noise). =item get_clssfcn FNAME abs_error Reads an classification generated by AutoClass-C. Do not forget to give an absolut error value from your original data (it will be used for the integration in computeMembership() ). Returns a B object if successful, undef otherwise. =item coord_2_bin COORD FNAME Take a B structure and write it, in our proprietary .bin format, to FNAME. Returns 1 on success and undef on failure. This function may be used when writing the results of a calculation, but more likely, it is used when converting PDB files to our C<.bin> format. Look at pdb_read. =item coord_c_n_dist COORD i, j, sqrt_flag Given a B structure, COORD and the index of two residues, B and B return the distance, in Angstroms between the carbonyl oxygen of B and the backbone nitrogen atom of B if B is set to non-zero. If the flag is left as zero or B, then the value returned will be the distance squared. In a protein, the bondlength is about 1.32 Angstrom. Many scripts just want to check for continuous chains, so they do not need the real distance and it is faster to avoid the square root. For example, to check if the third and fourth residues are bonded, you might say my $dist2 = 1.5 * 1.5; if (coord_c_n_dist ($coord, 2, 3, 0) > $dist2) { print "not connected\n"; } else { print "connected\n"; } =item coord_deletion COORD Start Coord_Length Seq_Length Returns a pointer to a new B (structure) where an affine gap has been introduced, arising from the deletion of some sequence or coordinate data from an existing structure. The gap is specified by its B [0,size), and number of residues excised from the structure (B) and sequence (B). Minimal checking ensures gap specifications are sensible. This function is provided for the generation of synthetic threading results and other forms of improper structure sets. =item coord_geo_gap COORD SCALE MAX Calculate a penalty based on the geometric damage in a structure. This is based on the carbonyl carbon to amide nitrogen distance between each pair of residues which should be connected. The ideal C..N distance is 1.32 Angstrom. No penalty is enforced if the distance is less than 2.0 Angstrom. COORD is a set of coordinates. SCALE is the number with which penalties will be multiplied. MAX is a limit on the distance to be considered. For each gap, if it is bigger than MAX, it is set to MAX. This function returns a list of values like: ($quadratic, $linear, $logistic, $num_gap) = coord_geo_gap ($coord, $scale, $max); Where =over =item $quadratic Sum over (distance - ideal)^2 for each gap, where ideal is hard coded to about 1.32 Angstrom. =item $linear This is the sum over (distance - ideal) for each gap. =item $logistic This uses a logistic activation function. The penalty is the sum of a logistic activation function applied to each gap. =item $num_gap This is the number of gaps, regardless of length. =back =item coord_get_seq COORD Returns a pointer to a B (sequence) given a pointer to a B. =item coord_has_sec_s COORD Returns non-zero if B has defined secondary structure. =item coord_get_sec_s COORD Returns a string containing the secondary structure assignment for the coordinates, if they exist. Side-effect: a non-serious warning message is generated if COORD has no assigned secondary structure. =item coord_name COORD Print the PDB acquisition code and chain identifier from the coordinates, COORD (an object of C type). =item coord_read FNAME Read coordinates from file FNAME. Returns a B pointer. =item coord_rmsd PAIR_SET, COORD1, COORD2, SUBSET_FLAG =item coord_rmsd PAIR_SET, COORD1, COORD2 Superimpose COORD1 onto COORD2 and calculate the root mean square difference of coordinates in Angstroms, based on an alignment. Return a list with shifted coordinates and the rmsd value like this ($rmsd, $new_c1, $new_c2) = rmsd ($pair_set, $c1, $c2) or ($rmsd, $new_c1, $new_c2) = rmsd ($pair_set, $c1, $c2, SUBSET_FLAG) $rmsd is the root mean square difference. $new_c1 is a coordinate object which has been moved. It may be smaller than $c1 (see note below about SUBSET_FLAG). $new_c2 may be identical to $c2. $pair_set has come from some kind of alignment such as sequence to sequence or structure to structure. C<$c1> are the coordinates to be moved. C<$c2> are the reference coordinates. If defined, the optional argument, SUBSET_FLAG, means that one walks down the list coordinates of C<$c1> and C<$c2> and copies into C<$new_c1> and C<$new_c2> those sites which were aligned. If the alignment of two proteins includes most of the original coordinates, then one will not usually want to define SUBSET_FLAG. If the two proteins are very different sizes, it makes sense to define this flag (and only see the similar regions). If SUBSET_FLAG is not set, $new_c2 is an exact copy of $c2. =item coord_size COORD Return an integer with the number of residues in COORD, a C object. =item coord_2_pdb FNAME COORD SEQ =item coord_2_pdb FNAME COORD Write the coordinates from COORD (a B pointer) to the filename given by FNAME. Returns nonzero on success and the undefined value otherwise. If the optional last argument, SEQ, is present, it should be a sequence object and will be used to print the original (hopefully) complete sequence in the pdb file in standard, PDB, SEQRES format records. =item coord_2_spdb FNAME COORD SCOR_SET =item coord_2_spdb FNAME COORD SCOR_SET SEQ Identical to B (above), this function additionally writes a temperature value in the generated PDB file, corresponding to the score given in SCOR_SET (a B pointer) for each residue in COORD. See the entry for the function B (below) for an example. =item coord_2_pnlty COORD VALUE COORD is a C type object. VALUE is a floating point number. Returns a FloatPtr (floating point array) which will be used for extra gap penalty weights during alignments. At each site in COORD, see if it is in secondary structure. If so, it gets an extra weight, given by VALUE. The exact method is more complicated. If we have a structure which looks like 1 2 3 4 5 6 7 - E E E H - - The we say that "E" and "H" refer to extended (beta strand) and helix, so residues S<2 - 5> look as if they are in interesting secondary structure. In fact, the situation is more interesting. Residue 1 is adjoining a piece of secondary structure, and residue 2 is on the edge of one. Only residues S<3 - 4> are really definitely inside secondary structure and residue 7 is definitely outside one. If VALUE is set to 10, then the coefficients will be residue coefficient 1 3.25 2 7.75 3 10 4 10 5 7.75 6 3.25 7 1 =item dmat_b_cliques MODEL REFERENCE BOUND SIZE Returns a list of intervals on MODEL's coordinates corresponding to 'flat' areas of the difference of distance matrices between MODEL and REFERENCE (which could be the native fold of the sequence modelled by MODEL). The list should be interpreted as ranges in the residues of MODEL : start1, end1, start2, end2, ... These define contiguous stretches of MODEL's structure whose distance contacts differ by less than BOUND angstroms from those in REFERENCE. Each interval is at least of size SIZE. There are some default parameters : my @good_regions_of_model = dmat_b_cliques $model, $native; Returns the flat regions longer than 9 residues where DME is less than 1 angstrom. =item dme_thresh FRACTION COORD1 COORD2 THRESHOLD Compare the distance matrices from COORD1 and COORD2. Calculate the fraction of the distance matrix left after setting a threshold of THRESHOLD. Put this fraction into FRACTION. Returns the useful answer in FRACTION, but returns undef on error. DME stands for "distance matrix error", a name coined in S, Biopolymers, 29, 1565-1585 (1990), but it is really the root mean square difference of distance matrices. To compare structures we calculate the DME based on alpha carbon coordinates. We then go back to the distance original distance matrices and find the most different matrix elements and enter a loop: while (DME > threshold) remove most different element from calculation return the fraction of the distance matrix remaining If the two sets of coordinates were very similar, then we did not have to remove any matrix elements, so we return a value near 1.0. If the coordinates were very different, we had to throw away most matrix elements, so we return a number closer to 0.0. A good value for THRESHOLD is about 3.5 or 4.0 Angstrom. This measure of similarity is bounded by 0 and 1.0 and is not too sensitive to protein size. =item dme_nice MODEL REF BOUND SIZE CORE CORE_BORDER Returns a list of intervals defining bounded regions of the difference of distance matrices between MODEL and REF. Whereas in dme_b_cliques, the intervals define bounded regions, the 'nice' regions defined by this function allow for some deviations at the boundary of the flat regions of a DME plot. SIZE refers to the structural fragment size, and CORE refers to the minimum size of a bounded region expressed as the number of overlapping fragments. CORE_BORDER is the size of the boundary region where deviation is allowed. Essentially this means that pairs of intervals that would be returned by dmat_b_cliques will be merged if they are separated by fewer than CORE_BORDER residues. =item dssp COORD Run the DSSP program on the coordinates in COORD. Store the answer in COORD. =item make_model PAIR_SET SEQ COORD Given an alignment in PAIR_SET (a B pointer) which comes from sequence SEQ aligned to coordinates COORD, return a B pointer with a model of the sequence on the coordinates. =item model_pdb_num COORD MODELNUM Given a B and a residue position, returns the original PDB number stored for that residue. =item model_res_num COORD RESIDUENUM Given a B and a residue number (according to PDB sequence numbering), returns the index for that residue in the coord object, or undefined if there is no coordinates for that sequence number. =item num_seq SEQ_ARRAY Returns the number of sequences in a sequence array. =item pair_set_coverage PAIR_SET SIZE1 SIZE2 Given an alignment, we often want to know what part of the sequence or structure is accounted for. For example A B C - - D E - F G H Q R S T U V W - Y Z In the first string, you could imagine the residues we know about are like 0 0 1 1 1 1 0 1 while in the second string, the pattern is like 1 0 0 1 1 0 1 1 Call this function with a pair set and the size of the first and second objects (sequences or sequence/structure pair). pair_set_coverage returns two lists. One for each member of the alignment respectively. This returns a pair of strings. In each string, a '0' char means the site is not covered by the alignment, but a '1' means it is. On failure, returns undef. A typical use would be my ($s1, $s2) = pair_set_coverage ($pair_set, seq_size ($seq), coord_size ($coord); print "Pattern of sequence coverage looks like\n$s1\n"; print "Pattern of struct coverage looks like \n$s2\n"; and the output strings might look like 00100111 Which would mean that the 3rd, 6th and 7th positions were properly aligned. Note also that this leaves room for some tricks. Strings can be binary &'d to look for overlap and '1's can be quickly counted using perl matching operators. =item pair_set_extend PAIR_SET $EXT_LONG =item pair_set_extend PAIR_SET $EXT_SHORT If you have done a Smith and Waterman style alignment, you may not have all residues aligned. Consider A B C D Q R S E X Y Q R S X Y The alignment from a highest scoring segment is Q R S Q R S and this is what we will likely get if we don't ask otherwise. but we could reasonably do a short extension to include residues which go out to the ends of the shorter sequence. C D Q R S E X Y Q R S X One may also want a longer extension, which would look like A B C D Q R S E - - - X Y Q R S X Y Given these definitions of extension, (long and short), you may want either result. The symbols, C<$EXT_LONG> and C<$EXT_SHORT> are constants. You should choose one of them. =item pair_set_gap PAIR_SET SCALE_OPEN SCALE_WIDEN Like C, this is for calculating extra gap penalties after doing an alignment. This returns a gap penalty for the sequence, not the structure. It returns two values like my ($open_penalty, $widen_penalty); ($open_penalty, $widen_penalty) = pair_set_gap($pair_set, $o_scl, $w_scl); For each gap, the penalty is calculated as penalty = open_scale + (n - 1) * widen_scale where C is the length of the gap. Because these things are linear, you can get the number of gaps in a whole alignment from S>. =item pair_set_score PAIR_SET This returns two floating point numbers. The first is the score from an alignment, including gaps. The second is the score, not including gaps. The intentions are to =over =item * Let one look at the pure sequence-structure score and =item * Let one use a more sophisticated gap scoring scheme than the one used in the alignment calculation. =back For example, we have have something like my $pair_set = score_mat_sum_sec (.....); my ($score, $simple_score) = pair_set_score ($pair_set); print "Score with gaps is ", $score, "without gaps is ", $simple_score, "\n"; For compatibility with old scripts, it is still OK to say my $score = pair_set_score ($pair_set) =item pair_set_string PAIR_SET SEQ1 SEQ2 Return a string for printing. PAIR_SET is an alignment corresponding to SEQ1 and SEQ2. If this came from a sequence to structure alignment, we still want the sequence corresponding to SEQ2. This is easy to get from coord_get_seq (C) but this routine should be renamed C. =item pair_set_pretty_string PAIR_SET SEQ1 SEQ2 SEC_S_DATA COORD2 =item pair_set_pretty_string PAIR_SET SEQ1 SEQ2 Return a string with a pretty alignment. If you have secondary structure data (SEC_S) which has probably come from a predictor and if you provide the coordinates (COORD2), then the output will include this secondary structure information. =item param_fx_read FNAME Read parameters for Thomas' Bayesian scoring / alignment functions. Returns an FXParam object on success, or undefined on failure. =item param_rs_read FNAME Read parameters for the tanh() based score function with all atoms. This is for rescoring, not alignment. Returns an RSParam object on success, or undefined on failure. =item pdb_read FNAME ACQ_CODE CHAIN Read the pdb file, FNAME, and return a Coord object. Note that this will not include secondary structure information. ACQ_CODE is the acquisition code. It can be blank, in which case, the program will try to guess the code from the filename. CHAIN can also be blank, or a single letter, chain specifier. This function is for rewriting lots of files, for example, when we rebuild a library. This means it tolerates errors rather than stopping. Furthermore, it contains a list of residue name substitutions. For example, it will change original new_name comment --------------------------- UNK ALA Unknown residues become alanine MSE MET Selenomethionine becomes methionine CSE CYS Selenocysteine becomes cystein There are currently more than 20 of these conversions. They are a one way affair. After reading the coordinates, the original name is thrown away. =item prob_vec_info PROB_VEC B is a matrix of class probabilities. This returns a string containing some minimal information. Functions like L<"seq_2_prob_vec"> return a probability vector. =item scor_set_simpl PAIR_SET SCORE_MAT Returns a B object, corresponding to the local score for each aligned pair of residues in PAIR_SET, based on the scores in SCORE_MAT. These local scores, when summed, yield the 'simple score' returned by B. The obvious use of this function is to generate an annotated PDB file : coord_2_spdb ("1mypC.pdb", make_model($pair_set, $seq, $template) , scor_set_simpl($pair_set, $tot_score_mat)); Where the temperature entry of the each residue in the model will be set to the local score given in the score matrix. =item scor_set_fromvec @SCORES Returns a B object containing the vector of scores given in @SCORES. =item scor_set_to_vec $scor_set Returns a vector of doubles corresponding to the scores contained in the B object. This is useful when you want to actually print out the scores for an alignment : # float_to_character maps a signed float to a # scale of alphanumerics sub d2c( $ ) { return float_to_character ( $_ ) }; my $i=0; my $cs = pair_set_coverage( $pairset, $seq_len, $coord_len ); my $scor = scor_set_to_vec( $scor_set); # from $pair_set $cs=~ s/0/./g; $cs=~ s/1/d2c($$scor[$i++]);/ge; # $cs is now a coverage string where every aligned position is # scored according to float_to_character. =item scor_set_scale SCOR_SET SCALE Returns true if it managed to divide each entry in the B by the scalar in SCALE. =item score_fx MATRIX SEQ COORD PARAM Fill out a score matrix for a sequence structure pair, based on Herr Doktor Huber's Bayesian-based score function. This score function works for sequence to structure alignments. MATRIX is a new created score matrix, probably returned from B. SEQ is a sequence object (SeqPtr) COORD is a coordinate object (CoordPtr). PARAM are parameters from C. They have type FXParam and this type checking is enforced by the interpreter. =item score_fx_prof MATRIX SP COORD PARAM This is the same as score_fx, but the second argument, C, is a sequence profile, rather than a sequence. =item score_mat_add (MAT1, MAT2, SCALE, SHIFT) Return a new score matrix where each element is given by NEW_MAT = MAT1 + ( SCALE * MAT2 + SHIFT) If only three arguments are given, SHIFT is set to zero. =item score_mat_info MAT Given a Score_mat object in MAT, return a list of the form (MIN, MAX, AV, STD_DEV) Ignore the first and last row and column, since they are used for treatment of end gaps. Note, the calling procedure for this function changed in S, so old scripts may break; =item score_mat_new N_ROWS N_COLS Create a new score matrix and return a Score_mat object. In a sequence to structure alignment, N_ROWS is the size of the sequence, N_COLS is the size of the structure. So, if C<$seq> is a sequence and C<$coord> is a structure, we might have $scr_mat = score_mat_new (seq_size ($seq), coord_size ($coord)); You B entitled to assume that the new matrix is properly zeroed. Code should not bother to zero it again. =item score_mat_scale MAT, SCALE Multiply all elements in MAT by SCALE. This returns a new matrix. =item score_mat_shift (MAT1, SHIFT) Return a new score matrix where each element is given by NEW_MAT = MAT1 + SHIFT The first and last row and column are not shifted. This is one way to implement end gaps. This function returns a new matrix. =item score_pvec MATRIX, PVEC1, PVEC2 Fill out the score matrix, B based on comparing the class membership probability vectors B and B. These probability vectors check the length of the original fragments and will return B if they do not match. =item score_mat_read FNAME Go to F and read the score matrix that was probably written by L. Return a new score matrix or undef on failure like my $scr_mat = score_mat_read ($fname); if (! $scr_mat) { print STDERR "Reading score matrix from $fname failed.\n"; } =item score_mat_string SCORE_MAT SEQ1 SEQ2 This is mainly for debugging. After calling a score function, this will return a string containing the score matrix with the sequence at the top and left hand side. If you did a sequence to structure alignment, you should still call it with the sequences. =item score_mat_sum_full RMAT, SCORE_MAT, PGAP_OPEN, PGAP_WIDEN, QGAP_OPEN, QGAP_WIDEN,P_MULT, Q_MULT, ALIGN_TYPE =item score_mat_sum_full RMAT, SCORE_MAT, PGAP_OPEN, PGAP_WIDEN, QGAP_OPEN, QGAP_WIDEN,P_MULT, Q_MULT, ALIGN_TYPE, BIAS_SET This does the summation of a score matrix. Returns non-zero on success and zero on failure. Note, there is an optional last argument. The parameters are =over 8 =item RMAT This is used for returning the summed matrix. The function does not overwrite SCORE_MAT. Instead it returns a new matrix. This can generally be ignored. =item SCORE_MAT This is the score matrix which came from a call to something like B. =item PGAP_OPEN Penalty for opening gaps in the first of the pair of sequences or structures or whatever. =item PGAP_WIDEN Penalty for extending the PGAPs. =item QGAP_OPEN Penalty for opening gaps in the second of the pair. =item QGAP_WIDEN Penalty for extending the QGAPs =item P_MULT This is an object of C type with site specific coefficients for gap penalties and will be applied to the P_GAP. Typically, this will be for extra penalties for secondary structure gaps. =item Q_MULT As for P_MULT, but for the q_gaps. =item ALIGN_TYPE Either "$N_AND_W" or "$S_AND_W" for Needleman and Wunsch or Smith and Waterman respectively. =item BIAS_SET This is a pair_set which may have come from a previous alignment. It really is an object of Pair_setPtr type. It is optional. If present, the alignment will be coerced to follow this alignment. The use is that one can do an Smith and Waterman alignment and get the best scoring segment. One can then set the ALIGN_TYPE switch to C<$N_AND_W> and do a globally optimal alignment, but passing through the optimal segment. =back =item score_mat_write SCORE_MAT FNAME Write the score matrix SCORE_MAT to the file name given by FNAME. Return undef on failure. =item score_rs COORD PARAMS This will apply the tanh() based score function which is primarily for rescoring. It is B neighbour-non-specific or unusual in any other way. COORD is a Coord object. PARAMS is a set of parameters, formally of type RSParams and probably resulting from a call to param_rs_read. Returns a floating point number. =item score_smat SCORE_MAT, SEQ1, SEQ2, SUB_MAT Fill out a score matrix based on sequence comparison of two sequences. SCORE_MAT is a score matrix which may have come from new_matrix. SEQ1 and SEQ2 are the sequences. SUB_MAT is a substitution matrix. =item score_sprof SCORE_MAT, PROF, SEQ, SUB_MAT This is very similar to B. SCORE_MAT is a score matrix, probably from B. PROF is a sequence profile, maybe from B. SEQ is a sequence and SUB_MAT is a substition matrix. This fills out a score matrix using the substition matrix it has been given and fractional sequences from the sequence profile. =item score_sec_s SCORE_MAT, SEC_S_DATA, COORD Given a SCORE_MAT which probably came from C, and some secondary structure data in SEC_S_DATA, probably from C and a set of coordinates in COORD, do a scoring. We compare each site against the predicted secondary structure. If the psi angle difference is more than 90 degrees, we set it to 90. We then return cos (difference). Note, this is different behaviour to earlier versions. If an angle is very different (diff > 90), it it not penalised. If the angle is less than 90, a value will be returned between 0 and 1.0 =item sec_s_data_read FNAME Read secondary structure data from FNAME. This may be in PHD/Rost format or our L<"manual input format"/item_manual_secondary_structure_file> format defined below. Return a pointer to an object of Sec_s_data type. =item sec_s_data_string SEC_S SEC_S is a pointer to a Sec_s_data object. Return a string with the bare secondary structure information. For example: $x = sec_s_data_read ("sec_data_file") || die "Fail on $tmp_data: $!"; print sec_s_data_string($x); =item seq_2_prob_vec SEQ, AA_CLSSFCN Given a sequence, B and an amino acid classification, B return an object of B type. This is most interesting for feeding to a score function like B. =item seq_read FNAME Read a sequence from file FNAME. Returns a SeqPtr object. =item seq_read_many FNAME Reads a number of sequences from file FNAME. Returns a SeqArrayPtr object (array of sequences). =item seq_from_string STRING Returns a sequence object C from a string. This lets one build a sequence in the interpreter like $seq = seq_from_string ('avlc'); Would store a sequence (Ala, Val, Leu, Cys) in the $seq variable. The string can have a FASTA style comment at the start. The string can span multiple lines. =item seq_get_1 SEQ_ARRAY N Return the N'th string from a sequence array, SEQ_ARRAY. Returns a SeqPtr object. =item seq_num SEQ_ARRAY Returns the number of sequences stored in the sequence array, SEQ_ARRAY. =item seq_print SEQ Return a sequence to the interpreter as a string. It is not pretty and will be fixed. =over 8 =item * The sequence will be beautified so it comes out in blocks of ten residues, probably with numbering. =back =item seq_print_many SEQ_ARRAY Returns a string containing all the sequences in SEQ_ARRAY. =item seq_size SEQ Return the number of residues in a sequence, SEQ. This is the number of residues in a sequence. There are no complications or frills like null terminators. =item seqprof_get_seq SEQPROF This is analogous to coord_get_seq. It takes a sequence profile and returns a B sequence. =item seqprof_str SEQPROF Take the sequence profile in the SEQPROF object and return a printable string. For example if ( ! ($profile = blst_chk_read ( $fname))) { return undef; } print seqprof_str ($profile); =item struct_2_prob_vec COORD CLSSFCN Computes the memberships matrix (probability vector) for a structure COORD given a CLSSFCN =item sub_mat_get_by_i SUB_MAT, I, I Return the substition matrix element indexed by B and B =item sub_mat_get_by_c SUB_MAT, I, I Return the substitution matrix element for the amino acids I and I. So S> would return the current value for cys/asp. The amino acid names are single letter codes and may be upper or lower case. =item sub_mat_set_by_c SUB_MAT, I, I, I Set the substitution matrix element for I and I to I. =item sub_mat_set_by_i SUB_MAT, I, I, I Set the value indexed by I,I in the substition matrix to I. =item sub_mat_string SUB_MAT Return a string containing the S matrix held in SUB_MAT. =item sub_mat_read (FILENAME) Go to FILENAME. Read up the substitution / score matrix and return it. =item sub_mat_shift SUBST_MATRIX, BOTTOM Given a substitution matrix (an object of type Sub_mat), shift the whole matrix so the smallest (most negative value) is of size BOTTOM. This does not return anything. It acts on the SUBST_MATRIX argument directly. =item sub_mat_scale SUBST_MATRIX, BOTTOM, TOP Given a substition matrix, SUBST_MATRIX, scale and shift it so the minimum and maximum values run from BOTTOM to TOP. =item score_mat_sum_smpl NEW_MAT SCORE_MAT PGAP_OPEN PGAP_WIDEN QGAP_OPEN QGAP_WIDEN ALIGNMENT_TYPE We have a score matrix which could be from sequence/sequence, sequence/structure or whatever. Now, do the dynamic programming work. Sum the score matrix and return a set of pairs. The parameters are =over =item NEW_MAT This is a fresh matrix with the traced back scores in it. =item SCORE_MAT This is the score matrix. =item GAP_OPEN =item GAP_WIDEN =item ALIGNMENT_TYPE There are only two values allowed, either $N_AND_W or $S_AND_W These stand for "Needleman and Wunsch" and "Smith and Waterman" respectively. Any other value will cause an error. =back =item svm_rs_cdata MODEL NATIVE SCOR_SET RS_PARAM CVTYPE *EXPERIMENTAL!* The function returns an array of training vectors suitable for use in training a support vector machine (libSVM.pm) or some other machine learning procedure. Its form is : [ [label_class, [(feature vector)], .. ] The scheme for calculating the training vectory is given in CVTYPE, and the data is formed from the local sequence to structure scores as given by SCOR_SET, the TANH forcefield based pairwise interaction terms (calculated via RS_PARAM), and the local model consistency (based on the difference of distance matrices computed between MODEL and NATIVE). Scheme 0 works as follows : (see scoranlys.c:get_svmdata for details at the moment). =item svm_rsfeat MODEL SCOR_SET RS_PARAMS CVTYPE This returns a set of feature vectors for each position in MODEL, calculated from local sequence-structure fitness and residue-specific interaction terms according to the CVTYPE scheme (see svm_rs_cdata or scoranlys.c for details). The form is as follows : my @m_fvset = svm_rsfeat MODEL, SCOR_SET, PARAMS, 0 @m_fvset is of form [ (undef), [feature vector], .., .., (undef)] and (scalar @m_fvset) == coord_size(MODEL) undefs are given for positions in the model where a full feature vector cannot be computed (at the ends, for instance). =back =head1 BUILD AND INSTALL Wurst is mostly migrated to the standard perl build system. This means you should go the top level directory and try perl Makefile.PL make make install This would try to install into the system directories, so a better tactic might be perl Makefile PREFIX=~/myperl LIB=~/myperl/lib make This usually builds without problem. Check with scripts from the F directory. Then install into your own account with make install Your scripts would then have to start with lines like use lib "$ENV{HOME}/myperl/lib If this looks OK, you might cd scripts perl hello.pl This will check if Wurst pieces appear to be in place. If that looks OK, then edit F to set the installation destination. Then make install from the top level directory. Then go back to the scripts directory and try a different file like cd scripts perl hello2.pl =head1 FILE FORMATS =over =item PHD secondary structure files Wurst can read secondary structures from the PHD prediction server. The format is empirically defined. That means we try to read anything that comes from the server. =item manual secondary structure file One may type in secondary structure predictions or assignments. The format is like: # speculative secondary structure assignments secondary structure 5 - 30 h 37 e 40-45 e 5 # This is a very unreliable guess =over =item * This means that residues 5 to 30 are helix. Residue 37 is sheet (extended). Residues 40 to 45 are sheet (extended) and they have a confidence level of 5. =item * The first non blank line must say "secondary structure". This is not optional. It is used to recognise the file format. =item * Confidence levels can be given from 0 to 9. Zero means there is no confidence. 9 means you are very confident. This number must be an integer. There is no way to give a more detailed number. =item * Confidence levels are optional. The default is confidence=9. =item * Anything after a hash C<#> is a comment. =item * Blank lines are ignored. =back =back =head1 AUTHORS Alphabetically... Steve Hoffmann, Thomas Huber, Tina Lai, Nasir Mahmood, Thomas Margraf, Martin Mosisch, James B. Procter, Gundolf Schenk, Andrew E. Torda. =head1 SEE ALSO coding.pod contains rules for adding to wurst. =cut