matchPattern {Biostrings} | R Documentation
A generic function that finds all matches of a pattern in a sequence (an XString object).
matchPattern(pattern, subject, algorithm="auto", max.mismatch=0, fixed=TRUE)
countPattern(pattern, subject, algorithm="auto", max.mismatch=0, fixed=TRUE)
vcountPattern(pattern, subject, algorithm="auto", max.mismatch=0, fixed=TRUE)
pattern |
The pattern string. |
subject |
An XString object containing the subject string,
or an XStringViews object, or a character vector
(converted to an XString or XStringSet object
internally) or (for vcountPattern) an XStringSet
object.
|
algorithm |
One of the following: "auto", "naive-exact",
"naive-inexact", "boyer-moore" or "shift-or".
|
max.mismatch |
The maximum number of mismatching letters allowed (see
isMatching for the details).
If non-zero, an inexact matching algorithm is used.
|
fixed |
If FALSE, IUPAC extended letters are interpreted as ambiguity codes
(see isMatching for the details).
|
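The effect of max.mismatch and fixed can be illustrated with a minimal base R sketch. This is not the Biostrings implementation: the names `iupac`, `letters_match` and `naive_match` are hypothetical, only a subset of the IUPAC codes is listed, and the sketch assumes that with fixed=FALSE two letters are compatible when the sets of bases they stand for overlap (see isMatching for the authoritative rules).

```r
# Hypothetical helpers for illustration only; NOT the Biostrings implementation.
# Subset of the IUPAC nucleotide codes, mapped to the bases each one stands for.
iupac <- list(A = "A", C = "C", G = "G", T = "T",
              R = c("A", "G"), Y = c("C", "T"),
              N = c("A", "C", "G", "T"))

# TRUE when pattern letter p and subject letter s are considered a match.
letters_match <- function(p, s, fixed = TRUE) {
  if (fixed) return(p == s)
  # Assumption: ambiguity codes are compatible when their base sets overlap.
  length(intersect(iupac[[p]], iupac[[s]])) > 0
}

# Slide the pattern along the subject and keep every start position where the
# number of mismatching letters does not exceed max_mismatch.
naive_match <- function(pattern, subject, max_mismatch = 0, fixed = TRUE) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  n_windows <- length(s) - length(p) + 1
  starts <- integer(0)
  if (n_windows < 1) return(starts)
  for (i in seq_len(n_windows)) {
    window <- s[i:(i + length(p) - 1)]
    n_mismatch <- sum(!mapply(letters_match, p, window,
                              MoreArgs = list(fixed = fixed)))
    if (n_mismatch <= max_mismatch) starts <- c(starts, i)
  }
  starts
}

naive_match("GCNNNAT", "AAGCGCGATATG")                 # integer(0): N is literal
naive_match("GCNNNAT", "AAGCGCGATATG", fixed = FALSE)  # 3 5: N acts as a wildcard
naive_match("GCGAT", "AAGCGCGATATG", max_mismatch = 2) # 3 5 7
```

The sketch returns start positions only, whereas matchPattern returns an XStringViews object.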
Available algorithms are: "naive exact", "naive inexact",
"Boyer-Moore-like" and "shift-or". Not all of them can be
used in all situations: restrictions depend on the length of
the pattern, the class of the subject and the values of
max.mismatch and fixed.
When two different algorithms can be used for a given task,
choosing one or the other affects only the performance,
not the result, so strictly speaking there is no "wrong choice".
In short, it is better to just use algorithm="auto" (the default):
this way matchPattern will choose the algorithm that is best suited
for the task.
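The point that the choice of algorithm changes the speed but not the answer can be sketched in base R with two unrelated search strategies. Neither is one of the Biostrings algorithms: `scan_starts` and `regex_starts` are hypothetical names, and the regular-expression version assumes the pattern contains only plain letters (no regex metacharacters).

```r
# Illustration only: two independent exact-matching strategies in base R.

# Strategy 1: compare every window of the subject with the pattern.
scan_starts <- function(pattern, subject) {
  n <- nchar(subject) - nchar(pattern) + 1
  if (n < 1) return(integer(0))
  starts <- seq_len(n)
  windows <- substring(subject, starts, starts + nchar(pattern) - 1)
  starts[windows == pattern]
}

# Strategy 2: regular-expression search with a zero-width lookahead, so that
# overlapping matches are reported as well.
regex_starts <- function(pattern, subject) {
  hits <- gregexpr(paste0("(?=", pattern, ")"), subject, perl = TRUE)[[1]]
  if (hits[1] == -1) integer(0) else as.integer(hits)
}

subject <- "AAGCGCGATATG"
scan_starts("GCG", subject)                                          # 3 5
identical(scan_starts("GCG", subject), regex_starts("GCG", subject)) # TRUE
```

Both functions report the same start positions, including the overlapping match at position 5.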
An XStringViews object for matchPattern.
A single integer for countPattern.
An integer vector for vcountPattern, with each element in
the vector corresponding to the number of matches in the corresponding
element of subject.
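The shape of the two counting results can be mimicked with plain character strings in base R. The helpers `count_matches` and `vcount_matches` are hypothetical stand-ins for exact matching only; they are not the Biostrings functions and do not handle XString objects, max.mismatch or fixed.

```r
# Hypothetical base R stand-ins showing the shape of the returned values.

# Number of (possibly overlapping) exact occurrences of pattern in one subject.
count_matches <- function(pattern, subject) {
  n <- nchar(subject) - nchar(pattern) + 1
  if (n < 1) return(0L)
  starts <- seq_len(n)
  sum(substring(subject, starts, starts + nchar(pattern) - 1) == pattern)
}

# One count per element of a character vector of subjects.
vcount_matches <- function(pattern, subjects) {
  vapply(subjects, function(s) count_matches(pattern, s), integer(1),
         USE.NAMES = FALSE)
}

count_matches("GCG", "AAGCGCGATATG")                        # 2
vcount_matches("GCG", c("AAGCGCGATATG", "TTTT", "GCGCGCG")) # 2 0 3
```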
isMatching, mismatch, matchPDict, matchLRPatterns, matchProbePair, maskMotif, alphabetFrequency, XStringViews-class, XString-class
## A simple inexact matching example with a short subject
x <- DNAString("AAGCGCGATATG")
m1 <- matchPattern("GCNNNAT", x)
m1
m2 <- matchPattern("GCNNNAT", x, fixed=FALSE)
m2
as.matrix(m2)

## With DNA sequence of yeast chromosome number 1
data(yeastSEQCHR1)
yeast1 <- DNAString(yeastSEQCHR1)
PpiI <- "GAACNNNNNCTC"  # a restriction enzyme pattern
match1.PpiI <- matchPattern(PpiI, yeast1, fixed=FALSE)
match2.PpiI <- matchPattern(PpiI, yeast1, max.mismatch=1, fixed=FALSE)

## With a genome containing isolated Ns
library(BSgenome.Celegans.UCSC.ce2)
chrII <- Celegans[["chrII"]]
alphabetFrequency(chrII)
matchPattern("N", chrII)
matchPattern("TGGGTGTCTTT", chrII)  # no match
matchPattern("TGGGTGTCTTT", chrII, fixed=FALSE)  # 1 match

## Using wildcards ("N") in the pattern on a genome containing N-blocks
library(BSgenome.Dmelanogaster.UCSC.dm3)
chrX <- maskMotif(Dmelanogaster$chrX, "N")
as(chrX, "XStringViews")  # 4 non masked regions
matchPattern("TTTATGNTTGGTA", chrX, fixed=FALSE)
## Can also be achieved with no mask
masks(chrX) <- NULL
matchPattern("TTTATGNTTGGTA", chrX, fixed="subject")