matchPattern {Biostrings}    R Documentation

String searching functions

Description

Generic functions that find (matchPattern) or count (countPattern, vcountPattern) all matches of a pattern in a sequence (an XString object).

Usage

  matchPattern(pattern, subject, algorithm="auto", max.mismatch=0, fixed=TRUE)
  countPattern(pattern, subject, algorithm="auto", max.mismatch=0, fixed=TRUE)
  vcountPattern(pattern, subject, algorithm="auto", max.mismatch=0, fixed=TRUE)

Arguments

pattern The pattern string.
subject An XString object containing the subject string, an XStringViews object, a character vector (converted internally to an XString or XStringSet object) or, for vcountPattern only, an XStringSet object.
algorithm One of the following: "auto", "naive-exact", "naive-inexact", "boyer-moore" or "shift-or".
max.mismatch The maximum number of mismatching letters allowed (see isMatching for the details). If non-zero, an inexact matching algorithm is used.
fixed If FALSE, IUPAC extended letters are interpreted as ambiguity codes rather than as literal letters (see isMatching for the details).
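
The meaning of max.mismatch and fixed can be illustrated with a small base-R sketch. It is not the Biostrings implementation: naive_match, letters_match and the reduced iupac table are hypothetical names introduced here, and the rule that two letters are compatible when their sets of bases overlap is a simplifying assumption (see isMatching for the actual rules).

```r
# Hypothetical illustration in base R; none of these names exist in Biostrings.
# Reduced IUPAC table: each letter maps to the set of bases it may stand for.
iupac <- list(A = "A", C = "C", G = "G", T = "T",
              R = c("A", "G"), Y = c("C", "T"),
              N = c("A", "C", "G", "T"))

# Compare one pattern letter with one subject letter.
letters_match <- function(p, s, fixed = TRUE) {
  if (fixed) return(p == s)                       # letters taken literally
  length(intersect(iupac[[p]], iupac[[s]])) > 0   # assumption: overlap = match
}

# Slide the pattern along the subject and keep every start position whose
# window has at most max.mismatch mismatching letters.
naive_match <- function(pattern, subject, max.mismatch = 0, fixed = TRUE) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  n_starts <- length(s) - length(p) + 1
  if (n_starts < 1) return(integer(0))
  starts <- integer(0)
  for (i in seq_len(n_starts)) {
    window <- s[i:(i + length(p) - 1)]
    ok <- mapply(letters_match, p, window, MoreArgs = list(fixed = fixed))
    if (sum(!ok) <= max.mismatch) starts <- c(starts, i)
  }
  starts
}

naive_match("GCNNNAT", "AAGCGCGATATG")                  # integer(0)
naive_match("GCNNNAT", "AAGCGCGATATG", fixed = FALSE)   # 3 5
naive_match("GCGA", "AAGCGCGATATG")                     # 5
naive_match("GCGA", "AAGCGCGATATG", max.mismatch = 1)   # 3 5
```

With fixed=TRUE the N letters of the pattern would have to find literal Ns in the subject, so the first call finds nothing. With fixed=FALSE they act as wildcards, and the two hits overlap.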

Details

Available algorithms are: "naive exact", "naive inexact", "Boyer-Moore-like" and "shift-or". Not all of them can be used in all situations: restrictions depend on the length of the pattern, the class of the subject and the values of max.mismatch and fixed.

When two different algorithms can be used for a given task, the choice affects only the performance, not the result, so there is, strictly speaking, no "wrong choice". In short, it is best to use algorithm="auto" (the default), so that matchPattern chooses the algorithm best suited to the task.

Value

An XStringViews object for matchPattern.
A single integer for countPattern.
An integer vector for vcountPattern, whose i-th element is the number of matches in the i-th element of subject.
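
The relationship between the countPattern and vcountPattern return values can be sketched in base R. count_overlapping and vcount_overlapping are hypothetical helpers, not Biostrings functions. They assume exact matching of a plain A/C/G/T pattern and that overlapping matches are counted separately.

```r
# Hypothetical base-R helpers that mimic only the shape of the return values.
count_overlapping <- function(pattern, subject) {
  # A zero-width lookahead reports every start position, overlaps included.
  hits <- gregexpr(paste0("(?=", pattern, ")"), subject, perl = TRUE)[[1]]
  if (hits[1] == -1L) 0L else length(hits)
}

vcount_overlapping <- function(pattern, subjects) {
  # One count per subject, returned in the same order as the input.
  vapply(subjects, function(s) count_overlapping(pattern, s),
         integer(1), USE.NAMES = FALSE)
}

count_overlapping("GCG", "AAGCGCGATATG")                          # 2
vcount_overlapping("GCG", c("AAGCGCGATATG", "TTTT", "GCGCGCG"))   # 2 0 3
```

The single-subject helper returns one integer, and the vectorized one returns an integer vector of the same length as its input.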

See Also

isMatching, mismatch, matchPDict, matchLRPatterns, matchProbePair, maskMotif, alphabetFrequency, XStringViews-class, XString-class

Examples

  ## A simple example with a short subject, using "N" in the pattern
  library(Biostrings)
  x <- DNAString("AAGCGCGATATG")
  m1 <- matchPattern("GCNNNAT", x)  # fixed=TRUE: "N" is taken literally
  m1
  m2 <- matchPattern("GCNNNAT", x, fixed=FALSE)  # "N" is an ambiguity code
  m2
  as.matrix(m2)

  ## With DNA sequence of yeast chromosome number 1
  data(yeastSEQCHR1)
  yeast1 <- DNAString(yeastSEQCHR1)
  PpiI <- "GAACNNNNNCTC" # a restriction enzyme pattern
  match1.PpiI <- matchPattern(PpiI, yeast1, fixed=FALSE)
  match2.PpiI <- matchPattern(PpiI, yeast1, max.mismatch=1, fixed=FALSE)

  ## With a genome containing isolated Ns
  library(BSgenome.Celegans.UCSC.ce2)
  chrII <- Celegans[["chrII"]]
  alphabetFrequency(chrII)
  matchPattern("N", chrII)
  matchPattern("TGGGTGTCTTT", chrII) # no match
  matchPattern("TGGGTGTCTTT", chrII, fixed=FALSE) # 1 match

  ## Using wildcards ("N") in the pattern on a genome containing N-blocks
  library(BSgenome.Dmelanogaster.UCSC.dm3)
  chrX <- maskMotif(Dmelanogaster$chrX, "N")
  as(chrX, "XStringViews") # 4 non-masked regions
  matchPattern("TTTATGNTTGGTA", chrX, fixed=FALSE)
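  ## The same result can be obtained without a mask: with fixed="subject",
  ## only the "N" in the pattern is treated as a wildcard, while letters
  ## in the subject (including its Ns) are taken literally
  masks(chrX) <- NULL
  matchPattern("TTTATGNTTGGTA", chrX, fixed="subject")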

[Package Biostrings version 2.8.18 Index]