match-utils {Biostrings} | R Documentation |
This man page gives some background information about the concept of "match" ("exact match" or "inexact match") as understood by the various pattern matching functions available in the Biostrings package.
The nmismatchStartingAt, nmismatchEndingAt and isMatching functions implement this concept.
Other utility functions related to pattern matching are described here:
the mismatch function, for getting the positions of the mismatching
letters of a given pattern relative to its matches in a given subject,
and the coverage function, which can be used to get the "coverage"
of a subject by a given pattern or set of patterns.
nmismatchStartingAt(pattern, subject, starting.at=1, fixed=TRUE)
nmismatchEndingAt(pattern, subject, ending.at=1, fixed=TRUE)
isMatching(pattern, subject, start=1, max.mismatch=0, fixed=TRUE)
mismatch(pattern, x, fixed=TRUE)
coverage(x, start=NA, end=NA)
pattern |
The pattern string. |
subject |
An XString object (or character vector) containing the subject sequence. |
starting.at |
An integer vector specifying the starting positions of the pattern relative to the subject. |
ending.at |
An integer vector specifying the ending positions of the pattern relative to the subject. |
start |
For isMatching: an integer vector specifying the starting positions
of the pattern relative to the subject.
For coverage: a single integer specifying the position in x
where to start the extraction of the coverage.
|
max.mismatch |
The maximum number of mismatching letters allowed.
Note that isMatching doesn't support the kind of inexact matching
where a given number of insertions or deletions are allowed.
Therefore all the "matches" (i.e. the substrings in the subject that match
the pattern) have the length of the pattern.
|
fixed |
Only with a DNAString or RNAString subject can a fixed
value other than the default (TRUE) be used.
With fixed=FALSE, ambiguities (i.e. letters from the IUPAC Extended
Genetic Alphabet (see IUPAC_CODE_MAP) that are not from the
base alphabet) in the pattern _and_ in the subject are interpreted as
wildcards, i.e. they match any letter that they stand for.
fixed can also be a character vector, a subset
of c("pattern", "subject").
fixed=c("pattern", "subject") is equivalent to fixed=TRUE
(the default).
An empty vector is equivalent to fixed=FALSE.
With fixed="subject", ambiguities in the pattern only
are interpreted as wildcards.
With fixed="pattern", ambiguities in the subject only
are interpreted as wildcards.
|
x |
An XStringViews object for mismatch (typically, one returned
by matchPattern(pattern, subject)).
Typically an XStringViews or MIndex object for coverage,
but IRanges, MaskCollection and MaskedXString objects
are accepted too.
|
end |
A single integer specifying the position in x
where to end the extraction of the coverage.
|
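The interplay of starting.at, max.mismatch and fixed described above can be illustrated with a minimal base-R sketch. It is not the Biostrings implementation: the helpers letters_match, count_mismatches and is_matching are hypothetical names, the iupac table covers only a few ambiguity codes, and two behaviours are assumptions made for the illustration: positions falling outside the subject count as mismatches, and with fixed=FALSE two letters match when the sets of bases they stand for overlap. Only the logical values of fixed are sketched.

```r
# Illustrative base-R sketch; NOT the Biostrings implementation.
# A small subset of the IUPAC ambiguity codes (enough for the demo).
iupac <- list(A = "A", C = "C", G = "G", T = "T",
              R = c("A", "G"), Y = c("C", "T"),
              N = c("A", "C", "G", "T"))

# Hypothetical helper: does pattern letter p match subject letter s?
letters_match <- function(p, s, fixed = TRUE) {
  if (is.na(s)) return(FALSE)   # outside the subject: assumed to be a mismatch
  if (fixed) return(p == s)     # literal comparison, ambiguities are not wildcards
  # Assumption: letters match when the base sets they stand for overlap.
  length(intersect(iupac[[p]], iupac[[s]])) > 0
}

# Number of mismatching letters for each starting position.
count_mismatches <- function(pattern, subject, starting.at = 1, fixed = TRUE) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  vapply(starting.at, function(start) {
    pos <- start + seq_along(p) - 1L
    in_bounds <- pos >= 1 & pos <= length(s)
    s_letters <- rep(NA_character_, length(p))
    s_letters[in_bounds] <- s[pos[in_bounds]]
    sum(!mapply(letters_match, p, s_letters, MoreArgs = list(fixed = fixed)))
  }, integer(1))
}

# A position is a match when it has at most max.mismatch mismatching letters.
is_matching <- function(pattern, subject, start = 1, max.mismatch = 0,
                        fixed = TRUE) {
  count_mismatches(pattern, subject, start, fixed) <= max.mismatch
}

count_mismatches("AT", "GTATA", starting.at = 0:5)
is_matching("AT", "GTATA", start = 0:5, max.mismatch = 1)
count_mismatches("NT", "GTATA", starting.at = 1:4, fixed = FALSE)
```

Because no insertions or deletions are considered, every candidate substring has exactly the length of the pattern, which is why a single pass over the pattern letters is enough for each starting position.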
nmismatchStartingAt, nmismatchEndingAt: an integer vector of the same length as starting.at (or ending.at) reporting the number of mismatching letters for each starting (or ending) position.
A logical vector of the same length as start for isMatching.
A list of integer vectors for mismatch.
An integer vector indicating the coverage of x in the interval specified by the start and end arguments, for coverage.
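To make the shape of the value returned for mismatch more concrete, here is a minimal base-R sketch that returns, for each match, the positions in the pattern whose letter differs from the subject. The function mismatch_positions is a hypothetical name, it takes plain character strings and starting positions rather than an XStringViews object, and treating positions outside the subject as mismatches is an assumption of the sketch.

```r
# Illustrative base-R sketch; NOT the Biostrings implementation.
# Returns one integer vector per starting position: the positions
# (relative to the pattern) of the mismatching letters.
mismatch_positions <- function(pattern, subject, starts) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  lapply(starts, function(start) {
    pos <- start + seq_along(p) - 1L
    # Out-of-bounds positions become NA (assumed to be mismatches).
    s_letters <- s[ifelse(pos >= 1 & pos <= length(s), pos, NA)]
    which(is.na(s_letters) | p != s_letters)
  })
}

# "CAT" against "ACGTGCA", looking at the substrings starting at 2 and 6.
mismatch_positions("CAT", "ACGTGCA", starts = c(2, 6))
```

An exact match yields an empty integer vector, so the list always has one element per match regardless of how many letters differ.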
An integer value called the "coverage" can be associated to each position in x, indicating how many times this position is covered by the views or matches stored in x. For example, if x is an XStringViews object, the coverage of a given position in x is the number of views it belongs to.
If x is an MIndex object, the coverage of a given position in x is the number of matches (or hits) it belongs to.
Note that the positions in the returned vector are to be interpreted as relative to the interval specified by the start and end arguments.
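The definition of coverage given above can be sketched in a few lines of base R. This is an illustration of the counting rule only, not the package's implementation: coverage_sketch is a hypothetical name, and the ranges are passed as plain integer vectors of starts and widths instead of an XStringViews, MIndex or IRanges object.

```r
# Illustrative base-R sketch; NOT the Biostrings/IRanges implementation.
# Counts, for each position in [start, end], how many ranges cover it.
coverage_sketch <- function(starts, widths, start, end) {
  cov <- integer(end - start + 1L)
  for (i in seq_along(starts)) {
    # Clip the range to the requested interval.
    lo <- max(starts[i], start)
    hi <- min(starts[i] + widths[i] - 1L, end)
    if (lo <= hi) {               # empty or fully outside ranges are skipped
      idx <- (lo:hi) - start + 1L # element 1 of the result is position 'start'
      cov[idx] <- cov[idx] + 1L
    }
  }
  cov
}

# Two ranges, 1..4 and 3..4, over the interval 0..5.
coverage_sketch(starts = c(1, 3), widths = c(4, 2), start = 0, end = 5)
```

The index shift by start is what makes the returned vector relative to the requested interval: its first element describes position start, not position 1.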
matchPattern, matchPDict, IUPAC_CODE_MAP, XString-class, XStringViews-class, MIndex-class, IRanges-class, MaskCollection-class, MaskedXString-class
## ---------------------------------------------------------------------
## nmismatchStartingAt() / isMatching()
## ---------------------------------------------------------------------
subject <- DNAString("GTATA")

## Pattern "AT" matches subject "GTATA" at position 3 (exact match)
nmismatchStartingAt("AT", subject, starting.at=3)
isMatching("AT", subject, start=3)

## ... but not at position 1
nmismatchStartingAt("AT", subject)
isMatching("AT", subject)

## ... unless we allow 1 mismatching letter (inexact match)
isMatching("AT", subject, max.mismatch=1)

## Here we look at 6 different starting positions and find 3 matches if
## we allow 1 mismatching letter
isMatching("AT", subject, start=0:5, max.mismatch=1)

## No match
nmismatchStartingAt("NT", subject, starting.at=1:4)
isMatching("NT", subject, start=1:4)

## 2 matches if N is interpreted as an ambiguity (fixed=FALSE)
nmismatchStartingAt("NT", subject, starting.at=1:4, fixed=FALSE)
isMatching("NT", subject, start=1:4, fixed=FALSE)

## max.mismatch != 0 and fixed=FALSE can be used together
nmismatchStartingAt("NCA", subject, starting.at=0:5, fixed=FALSE)
isMatching("NCA", subject, start=0:5, max.mismatch=1, fixed=FALSE)

some_starts <- c(10:-10, NA, 6)
subject <- DNAString("ACGTGCA")
is_matching <- isMatching("CAT", subject, start=some_starts, max.mismatch=1)
some_starts[is_matching]

## ---------------------------------------------------------------------
## mismatch()
## ---------------------------------------------------------------------
m <- matchPattern("NCA", subject, max.mismatch=1, fixed=FALSE)
mismatch("NCA", m)

## ---------------------------------------------------------------------
## coverage()
## ---------------------------------------------------------------------
coverage(m)

x <- IRanges(start=c(-2L, 6L, 9L, -4L, 1L, 0L, -6L, 10L),
             width=c( 5L, 0L, 6L,  1L, 4L, 3L,  2L,  3L))
coverage(x, start=-6, end=20)  # 'start' and 'end' must be specified for
                               # an IRanges object.
coverage(shift(x, 2), start=-6, end=20)
coverage(restrict(x, 1, 10), start=-6, end=20)
coverage(reduce(x), start=-6, end=20)
coverage(gaps(x, start=-6, end=20), start=-6, end=20)

mask1 <- Mask(mask.width=29, start=c(11, 25, 28), width=c(5, 2, 2))
mask2 <- Mask(mask.width=29, start=c(3, 10, 27), width=c(5, 8, 1))
mask3 <- Mask(mask.width=29, start=c(7, 12), width=c(2, 4))
mymasks <- append(append(mask1, mask2), mask3)
coverage(mymasks)

## See ?matchPDict for examples of using coverage() on an MIndex object...