match-utils {Biostrings}    R Documentation

Utility functions related to pattern matching

Description

This man page gives some background information about the concept of "match" ("exact match" or "inexact match") as understood by the various pattern matching functions available in the Biostrings package.

The nmismatchStartingAt, nmismatchEndingAt and isMatching functions implement this concept.

Other utility functions related to pattern matching are described here: the mismatch function for getting the positions of the mismatching letters of a given pattern relative to its matches in a given subject, and the coverage function that can be used to get the "coverage" of a subject by a given pattern or set of patterns.

Usage

  nmismatchStartingAt(pattern, subject, starting.at=1, fixed=TRUE)
  nmismatchEndingAt(pattern, subject, ending.at=1, fixed=TRUE)
  isMatching(pattern, subject, start=1, max.mismatch=0, fixed=TRUE)
  mismatch(pattern, x, fixed=TRUE)
  coverage(x, start=NA, end=NA)

Arguments

pattern The pattern string.
subject An XString object (or character vector) containing the subject sequence.
starting.at An integer vector specifying the starting positions of the pattern relative to the subject.
ending.at An integer vector specifying the ending positions of the pattern relative to the subject.
start For isMatching: an integer vector specifying the starting positions of the pattern relative to the subject. For coverage: a single integer specifying the position in x where to start the extraction of the coverage.
max.mismatch The maximum number of mismatching letters allowed. Note that isMatching doesn't support the kind of inexact matching where a given number of insertions or deletions are allowed. Therefore all the "matches" (i.e. the substrings in the subject that match the pattern) have the length of the pattern.
fixed A value other than the default (TRUE) can be used only with a DNAString or RNAString subject.
With fixed=FALSE, ambiguities (i.e. letters from the IUPAC Extended Genetic Alphabet (see IUPAC_CODE_MAP) that are not from the base alphabet) in the pattern and in the subject are interpreted as wildcards, i.e. they match any letter that they stand for.
fixed can also be a character vector, a subset of c("pattern", "subject"). fixed=c("pattern", "subject") is equivalent to fixed=TRUE (the default). An empty vector is equivalent to fixed=FALSE. With fixed="subject", ambiguities in the pattern only are interpreted as wildcards. With fixed="pattern", ambiguities in the subject only are interpreted as wildcards.
x For mismatch: an XStringViews object (typically, one returned by matchPattern(pattern, subject)).
For coverage: typically an XStringViews or MIndex object, but IRanges, MaskCollection and MaskedXString objects are accepted too.
end A single integer specifying the position in x where to end the extraction of the coverage.

Value

nmismatchStartingAt, nmismatchEndingAt: an integer vector of the same length as starting.at (or ending.at) reporting the number of mismatching letters for each starting (or ending) position.
isMatching: a logical vector of the same length as start.
mismatch: a list of integer vectors.
coverage: an integer vector indicating the coverage of x in the interval specified by the start and end arguments. An integer value called the "coverage" can be associated with each position in x, indicating how many times this position is covered by the views or matches stored in x. For example, if x is an XStringViews object, the coverage of a given position in x is the number of views it belongs to. If x is an MIndex object, the coverage of a given position in x is the number of matches (or hits) it belongs to. Note that the positions in the returned vector are to be interpreted as relative to the interval specified by the start and end arguments.

See Also

matchPattern, matchPDict, IUPAC_CODE_MAP, XString-class, XStringViews-class, MIndex-class, IRanges-class, MaskCollection-class, MaskedXString-class

Examples

  ## ---------------------------------------------------------------------
  ## nmismatchStartingAt() / isMatching()
  ## ---------------------------------------------------------------------
  subject <- DNAString("GTATA")

  ## Pattern "AT" matches subject "GTATA" at position 3 (exact match)
  nmismatchStartingAt("AT", subject, starting.at=3)
  isMatching("AT", subject, start=3)

  ## ... but not at position 1
  nmismatchStartingAt("AT", subject)
  isMatching("AT", subject)

  ## ... unless we allow 1 mismatching letter (inexact match)
  isMatching("AT", subject, max.mismatch=1)

  ## Here we look at 6 different starting positions and find 3 matches if
  ## we allow 1 mismatching letter
  isMatching("AT", subject, start=0:5, max.mismatch=1)
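
  ## The same alignment can also be described by where the pattern ends:
  ## for a 2-letter pattern, ending at position 4 means starting at
  ## position 3, so both calls are expected to report the same count
  nmismatchEndingAt("AT", subject, ending.at=4)
  nmismatchStartingAt("AT", subject, starting.at=3)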

  ## No match
  nmismatchStartingAt("NT", subject, starting.at=1:4)
  isMatching("NT", subject, start=1:4)

  ## 2 matches if N is interpreted as an ambiguity (fixed=FALSE)
  nmismatchStartingAt("NT", subject, starting.at=1:4, fixed=FALSE)
  isMatching("NT", subject, start=1:4, fixed=FALSE)
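
  ## 'fixed' can also name the side(s) whose letters are taken literally.
  ## Here only the pattern contains an ambiguity (N), so fixed="subject"
  ## (N in the pattern acts as a wildcard) is expected to behave like
  ## fixed=FALSE, and fixed="pattern" (N is a literal letter) like
  ## fixed=TRUE
  nmismatchStartingAt("NT", subject, starting.at=1:4, fixed="subject")
  nmismatchStartingAt("NT", subject, starting.at=1:4, fixed="pattern")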

  ## max.mismatch != 0 and fixed=FALSE can be used together
  nmismatchStartingAt("NCA", subject, starting.at=0:5, fixed=FALSE)
  isMatching("NCA", subject, start=0:5, max.mismatch=1, fixed=FALSE)
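
  ## A sketch of how the two functions relate, assuming isMatching() does
  ## nothing more than compare the mismatch count with max.mismatch (this
  ## relationship is an assumption, it is not stated on this page)
  nmismatchStartingAt("NCA", subject, starting.at=0:5, fixed=FALSE) <= 1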

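  ## Select, among many candidate starting positions (some of them out of
  ## bounds, one NA), those where "CAT" matches with at most 1 mismatch
  some_starts <- c(10:-10, NA, 6)
  subject <- DNAString("ACGTGCA")
  is_matching <- isMatching("CAT", subject, start=some_starts, max.mismatch=1)
  some_starts[is_matching]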

  ## ---------------------------------------------------------------------
  ## mismatch()
  ## ---------------------------------------------------------------------
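  ## Positions of the mismatching letters of "NCA" relative to each of its
  ## matches in subject (which is now DNAString("ACGTGCA"))
  m <- matchPattern("NCA", subject, max.mismatch=1, fixed=FALSE)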
  mismatch("NCA", m)

  ## ---------------------------------------------------------------------
  ## coverage()
  ## ---------------------------------------------------------------------
  coverage(m)

  x <- IRanges(start=c(-2L, 6L, 9L, -4L, 1L, 0L, -6L, 10L),
               width=c( 5L, 0L, 6L,  1L, 4L, 3L,  2L,  3L))
  coverage(x, start=-6, end=20)  # 'start' and 'end' must be specified for
                                 # an IRanges object.
  coverage(shift(x, 2), start=-6, end=20)
  coverage(restrict(x, 1, 10), start=-6, end=20)
  coverage(reduce(x), start=-6, end=20)
  coverage(gaps(x, start=-6, end=20), start=-6, end=20)
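
  ## A base R sketch of the coverage definition, using the same starts and
  ## widths as x above: for each position in [-6, 20], count the ranges that
  ## contain it. The result is expected to agree with
  ## coverage(x, start=-6, end=20); the helper names are illustrative only.
  x_start <- c(-2L, 6L, 9L, -4L, 1L, 0L, -6L, 10L)
  x_end <- x_start + c(5L, 0L, 6L, 1L, 4L, 3L, 2L, 3L) - 1L
  manual_cvg <- sapply(-6:20, function(pos) sum(x_start <= pos & pos <= x_end))
  manual_cvg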

  ## coverage() on a MaskCollection object made of 3 masks of width 29
  mask1 <- Mask(mask.width=29, start=c(11, 25, 28), width=c(5, 2, 2))
  mask2 <- Mask(mask.width=29, start=c(3, 10, 27), width=c(5, 8, 1))
  mask3 <- Mask(mask.width=29, start=c(7, 12), width=c(2, 4))
  mymasks <- append(append(mask1, mask2), mask3)
  coverage(mymasks)

  ## See ?matchPDict for examples of using coverage() on an MIndex object...

[Package Biostrings version 2.8.18 Index]