matchPDict-inexact {Biostrings}R Documentation

Inexact matching with matchPDict()

Description

The matchPDict function can be used for inexact matching.

For the moment, only one kind of inexact matching is provided where the user must specify the length of the Trusted Prefix for each pattern stored in the dictionary and the maximum number of mismatching letters allowed in each pattern suffix (or "tail"). Please refer to the PDict documentation for how to create a Trusted Prefix DNA dictionary (preprocessing).

The inexact matches are searched by using a 2-stage approach: (1) use the Aho-Corasick algorithm for fast lookup of all the prefixes in the subject, followed by (2) comparison of each tail with the short substrings in the subject located next to the positions found in (1). Because performance will mostly depend on the number of positions found in (1) it's important to minimize this number by keeping the length of the Trusted Prefix as large as possible. A simple empiric rule is: if TP is the length of the Trusted Prefix and L the length of the subject, then L / (4**TP) should be as small as possible, typically < 10 or 20.

Note that, when matchPDict is used for inexact matching, variable width dictionaries (i.e. containing patterns of variable length) are accepted. In addition, when fixed=FALSE, then IUPAC extended letters placed in the tail of the dictionary are interpreted as ambiguities.

However, like in exact matching mode, it only works with a dictionary of DNA patterns.

Author(s)

H. Pages

See Also

PDict, matchPDict-exact

Examples

  ## WARNING! THIS IS STILL WORK IN PROGRESS!!!
  ## A known issue is that the "startIndex", "[[" and "unlist" methods are
  ## BROKEN on the MIndex object returned by matchPDict() called on a
  ## TBdna_PDict object!

  ## ---------------------------------------------------------------------
  ## A. WITH UNNAMED PATTERNS
  ## ---------------------------------------------------------------------

  library(drosophila2probe)
  dict0 <- drosophila2probe$sequence  # The input dictionary.

  ## Specify the length of the Trusted Prefix via the 'tb.end' argument
  pdict9 <- PDict(dict0, tb.end=9)
  pdict9
  tail(pdict9)
  #sum(duplicated(pdict9)) # not ready yet!
  #table(patternFrequency(pdict9)) # not ready yet!

  library(BSgenome.Dmelanogaster.UCSC.dm3)
  chr3R <- Dmelanogaster$chr3R
  chr3R
  table(countPDict(pdict9, chr3R, max.mismatch=1))
  table(countPDict(pdict9, chr3R, max.mismatch=3))
  table(countPDict(pdict9, chr3R, max.mismatch=5))

  ## ---------------------------------------------------------------------
  ## B. COMPARISON WITH EXACT MATCHING
  ## ---------------------------------------------------------------------

  ## When the input dictionary is of constant width, using 'max.mismatch=0'
  ## on the Trusted Prefix dictionary is equivalent to (but much slower than)
  ## using the full dictionary.
  pdict0 <- PDict(dict0)
  count0 <- countPDict(pdict0, chr3R)
  count_tp9_mm0 <- countPDict(pdict9, chr3R, max.mismatch=0)
  identical(count_tp9_mm0, count0)    # TRUE
  
  ## ---------------------------------------------------------------------
  ## C. VARIABLE WIDTH DICTIONARY
  ## ---------------------------------------------------------------------

  pdict <- PDict(c("ACNG", "GT", "CGT", "AC"), tb.end=2)
  pdict
  tail(pdict) 
  endIndex(matchPDict(pdict, DNAString("ACGGACCG"), max.mismatch=0))
  endIndex(matchPDict(pdict, DNAString("ACGGACCG"), max.mismatch=1))
  mindex <- matchPDict(pdict, DNAString("ACGGACCG"), max.mismatch=2)
  endIndex(mindex)

  startIndex(mindex)                  # CURRENTLY BROKEN!
  mindex[[1]]                         # CURRENTLY BROKEN!
  unlist(mindex)                      # CURRENTLY BROKEN!

  ## -------------------------------------------------------------------------
  ## D. WITH AMBIGUITIES IN THE TAIL
  ## -------------------------------------------------------------------------

  ## Ambiguities (IUPAC extended letters) are supported in the tail only:
  endIndex(matchPDict(pdict, DNAString("ACGGACCG"), fixed=FALSE))

[Package Biostrings version 2.8.18 Index]