Algorithms in bioinformatics (CSI 5126)=1Please don't print these lecture notes unless you really need

icon

51

pages

icon

English

icon

Documents

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

icon

51

pages

icon

English

icon

Documents

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

1Algorithms in bioinformatics (CSI 5126)Marcel Turcotte(turcotte@site.uottawa.ca)School of Information Technology and EngineeringUniversity of OttawaCanadaOctober 16, 20091Please don’t print these lecture notes unless you really need to!Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsSmith & WatermanBase conditions,v(i; 0) = 0;i2 0::nv(0;j) = 0;j2 0::mGeneral case,80;> v(i;j 1) +d;:v(i 1;j 1) +s(S (i);S (j)):1 2where d is the cost for a deletion/insertion, while s(a;b) is thesubstitution score.Solution,?v = max[v(i;j) : i n;j m]) Smith & Waterman (1981) J. Mol. Biol. 147:195-197.Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsProbabilistic FrameworkI Recall that a sequence alignment should answer the question:\are the two sequences (evolutionary) related?"I In other words, is the observed sequence alignment the resultof:1. an evolutionary process, where both sequences have evolvedindependtly from a common ancestry, or2. can it be attributable to chance alone; randomly selecting twounrelated sequences could produce a similar alignment score.Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsProtein sequence probabilitiesIt’s useful to consider a \simple" probabilistic model of a proteinsequence, given p , the probability of observing the amino acid a,asuch that,p > 0a20Xp = 1aa=1Let’s de ne the ...
Voir icon arrow

Publié par

Langue

English

0a20Xp = 1aa=1Let’s de ne the ..." />
1PrptttnisaelnodeenuresotsehectlellnyruaessoynuelTurcrcelo!Maeedtis@ettocrut(ettoCSa).cwataot.uteibnifnioamroscit12I5Al6.rigomsth
1
Algorithms in bioinformatics (CSI 5126)
School of Information Technology and Engineering University of Ottawa Canada
October 16, 2009
Marcel Turcotte (turcotte@site.uottawa.ca)
MaelrcrcTuettorut(ttocis@escitamrofnioibni
General case,
v?= max[v(i,j) :in,jm]
0, v(i,j) = maxvv((ii,j1,1j+))+dd,, (i1,j1) +s(S1(i),S2(j)). v
Smith & Waterman (1981) Biol.J. Mol.147:195-197.
wheredis the cost for a deletion/insertion, whiles(a,b) is the substitution score. Solution,
v(i,0) = 0,i0..n
Base conditions,
Smith & Waterman
v(0,j) = 0,j0..m
)aSCawc.toatetu.thmsgori6.AlI512
Mcitamrof
IRecall that a sequence alignment should answer the question: are the two sequences (evolutionary) related?” Ithe observed sequence alignment the resultIn other words, is of: 1.an evolutionary process, where both sequences have evolved independtly from a common ancestry, or 2.can it be attributable to chance alone; randomly selecting two unrelated sequences could produce a similar alignment score.
Probabilistic Framework
sotrc@stee.itttuoecraruTlttocut(elgorithmsinbioinwa.aacC)IS1562A.
rcMa
pa>0 20 Xpa= 1
a=1 Let’s define the probability of a sequenceS(1)S(2). . .S(n) as,
Protein sequence probabilities
It’s useful to consider a “simple” probabilistic model of a protein sequence, givenpa, the probability of observing the amino acida, such that,
n pS(1)pS(2). . .pS(n)=YpS(i) i=1
timacsfnroibiosmniirhtAlgo126.CSI5.ca)awattou.etis@ettcour(tteotrcTuel
probabilitiesocttTlruraecMite.te@srcote(tu
> GetAaFrequency(DB); Alanine 7.62 % Arginine 5.19 % Asparagine 4.40 % Aspartic acid 5.27 % Cysteine 1.64 % Glutamine 3.94 % Glutamic acid 6.40 % Glycine 6.87 % Histidine 2.24 % Isoleucine 5.84 % Leucine 9.47 % Lysine 5.96 % Methionine 2.38 % Phenylalanine 4.10 % Proline 4.91 % Serine 7.09 % Threonine 5.64 % Tryptophan 1.23 % Tyrosine 3.18 % Valine 6.62 %
rmfoinio
A common practice consists of estimating the amino acid probabilities using the observed frequencies in a large database. Here are the amino acid frequencies observed for the database Swiss-Prot version 39.
sicatglro62A.isbntimhawa.uottSI51ca)CAminoacids
s
Probabilistic Interpretation of a Sequence Alignment
S1(1)S1(2). . .S1(n) S2(1)S2(2). . .S2(n)
Consider twoalignedsequences,S1andS2 simplicity,. For ungaped alignments are considered.
The interpretation requires weighting two issues. 1.Sequences are related (Match Model – M); 2.Sequences are unrelated (Random Model – R).
formaticsinbioinwa.aacC)ti.eouttlgorithmSI5126.AecraruTlMotrc@stettcotue(
oibnofnitamrsci
In the match model, we have,
P(S1,S2|M) =Yq(S1(i),S2(i)) i
Match model
whereq(a,b) represents the probability that both residuesaandb have both been derived independently from an ancestral residuec.
courlTcearMtis@ettocrut(ett)CSIa.cattawe.uomhisrotiA.gl1562
nformatics
Random model
Whilst the random model is simply, P(S1,S2|R) =YpS1(i)YpS2(j) i j
but since we assumed that|S1|=|S2|, P(S1,S2|R) =YpS1(i)pS2(i) i
lAogirhtsmniibioa).cwata6.12I5CS@ettocrutou.etiselTuMarcte(trcot
ottate.ua)CSwa.ct(rutoet@eisocttMarcTuelrc
The ratio of the two likelihoods is called anodds-ratio(or likelihood-ratio),
s(a,b) = log(qp(aa,bp)) b represents the log-likelihood ratio that the residue pair (a,b) will occur as an aligned pair, as opposed to unaligned. In the case of proteinss(a,b) represents a 20×20 matrix, known asscore matrixorsubstitution matrix.
where each,
taking the logarithm leads to a quantity known as thelog-odds ratio,
S(S1,S2) =Xlog(q(S1(i),S2(i))) ipS1(i)pS2(i)
P(S1,S2|M)Yq(pSS11((ii)),pSS2(2(ii))) = P(S1,S2|R)i
amroscitibnifniorigomsth12I5Al6.
oiniofmrtimhisbn
I
I I I I I I
atics
In this view, the total score of alignment is the sum of all the terms for the aligned pairs of residues and gaps. The score is interpreted as the logarithm of the relative likelihood that the sequences are related vs not related. Positive terms represent substitutions are more likely than would be expected by chance. Negative terms represent unfavorable substitutions. Finally, when the two hypotheses are equally likely the log-likelihood ratio will be zero. We see that such substitution matrix can be used for calculating local sequence alignments, since likely alignments will have a positive score and unlikely alignment will have a negative score. Additive scoring scheme means that positions along the sequence are considered independent from one another, i.e. mutations at different sites have occurred independently. It’s aworking hypothesis.
raecMrcote(tucottlTur.awattou.etis@etorlg.A2651SI)Cca
What about the Substitution Scores?
ICertain amino acids have similar properties (structure, volume, charge, hydrophobicity, etc.); ILooking at the genetic code, you can see that certain pairs of amino acids are such that the minimum number of mutations at the codon level to change the encoding from one amino acid type to another is only one (Ala and Asp, GCC and GAC), there are pairs that need a minimum of two mutations (Ala and Arg, CGA and GCA) or even three (Asn and Trp, AAC or AAU and UGG); IThe substitution score is expected to reflect both of these effects.
The substitution scores that we used were rather arbitrary, either the identity matrix or some hand made matrix. Let’s have a look at scoring schemes that are appropriate for protein sequences.
maorcstiMa)aSCawc..6lA5I21thmsgorioinfinbicruTlecrrut(ettosie@ttcotaot.ute
Voir icon more
Alternate Text