101
pages
English
Documents
Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres
101
pages
English
Documents
Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres
Publié par
Langue
English
1Algorithms in bioinformatics (CSI 5126)
Marcel Turcotte
(turcotte@site.uottawa.ca)
School of Information Technology and Engineering
University of Ottawa
Canada
October 2, 2009
1
Please don’t print these lecture notes unless you really need to!
Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsPlan
I String algorithms
I Applications of su x trees (ST)
I Generalized su x trees
I Lowest common ancestor (LCA)
I Applications of LCA + ST
Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformatics11 9monotonous$
$ us$
1 monotonous$ r s$ 10
no o
tonous$
5
us$us$ notonous$
8
7 tonous$
3
us$
tonous$
4
6
2
Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsSummary
I A su x tree can be built in linear time and space
I Su x trees were developed to determine if a string P occurs
in a text T in time proportional tojPj (after pre-processing,
i.e. building the tree)
I Indeed, P is a substring of T i P is a pre x of a su x
of T
I To locate P, it su ce to follow a unique path from the root
of the tree up to a node, explicit or implicit, that corresponds
to the end of the pattern. This takes time proportional to the
length of the pattern
I Nowadays, su x tree based algorithms have been
developed to solve a large array of problems for which no
e cient algorithm was known. This lecture presents
some of them
Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsLongest repeated substring
I Let (i; j) denote the substring of S starting at i and ending
at j, i.e. S[i::j].
0 0 0
I A repeat is a pair ((i; j); (i ; j )) such that i < i and
0 0S[i::j] = S[i ::j ].
0 0
I The longest repeated substring is the pair ((i; j); (i ; j ))
such that the length of the substring is maximum.
I The longest repeated substring of abracadabra is abra.
Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsNaive algorithm
Imagine an algorithm to nd the longest repeated substring
without using a su x tree.
What is its time complexity?
Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsSu x tree based algorithm
CATTATTAGGA$ 9
GA$
12 $ r G v A$ 10
A T
11 $ u A y GGA$w 7
TTA TA TTAGGA$
GGA$
z 48 x
GGA$CATTATTAGGA$
TTAGGA$
GGA$
6
TTAGGA$
3
5
2 1
Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsSu x tree based algorithm (cont.)
Outline a su x tree based algorithm for nding repeats? What
characterizes a repeat?
Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsSu x tree based algorithm (cont.)
Let’s de ne a branching node, sometimes called fork, as a node
having two or more children.
S δ a δ b
The path-label of a node is the concatenation of all the edge
labels along the path from the root to the node.
Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformaticsSu x tree based algorithm (cont.)
Let’s de ne a branching node, sometimes called fork, as a node
having two or more children.
S δ a δ b
It su ce to traverse the tree and nd a node 1) which is a fork
node and 2) which has the longest path-label.
Finding the longest repeated substring takesO(jTj).
Marcel Turcotte (turcotte@site.uottawa.ca) CSI 5126. Algorithms in bioinformatics