PDict-class {Biostrings}    R Documentation

PDict objects

Description

The PDict class is a container for storing a preprocessed set of patterns (aka the dictionary) that can be used with the matchPDict function to efficiently find the occurrences of all the input patterns (i.e. the patterns stored in the PDict object) in a text.

Converting a set of input sequences into a PDict object is called preprocessing. This operation is done by using the PDict constructor.

Usage

  PDict(dict, tb.start=1, tb.end=NA, drop.head=FALSE, drop.tail=FALSE, skip.invalid.patterns=FALSE)

Arguments

dict A character vector, a DNAStringSet object or an XStringViews object containing the input sequences.
tb.start [DOCUMENT ME]
tb.end [DOCUMENT ME]
drop.head [DOCUMENT ME]
drop.tail [DOCUMENT ME]
skip.invalid.patterns [DOCUMENT ME]

Details

This is a work in progress and only 2 types of dictionaries are supported at the moment: constant width DNA dictionaries and Trusted Band DNA dictionaries.

A constant width DNA dictionary is a dictionary where all the patterns are DNA sequences of the same length (i.e. all the patterns have the same number of nucleotides). For now the patterns can only contain As, Cs, Gs and Ts (no IUPAC extended letters). The container for this particular type of dictionary is the CWdna_PDict class (a subclass of the PDict class).
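
As a rough illustration of what looking up a constant width dictionary in a text means, the following base R sketch slides a window of the common pattern width along the text and compares each window with every pattern. It is a naive illustration only, not the algorithm used by matchPDict, and the dict, text, windows and hits names are made up for this example.

```r
# Naive base R illustration (not the algorithm used by matchPDict).
dict <- c("ACGT", "CGTA", "TTTT")   # constant width: every pattern has 4 letters
text <- "ACGTACGTTTTT"

w <- unique(nchar(dict))
stopifnot(length(w) == 1L)          # the dictionary must be constant width

# All substrings of width w in the text, one per start position.
starts  <- seq_len(nchar(text) - w + 1L)
windows <- substring(text, starts, starts + w - 1L)

# For each pattern, the start positions where it occurs (overlaps included).
hits <- lapply(dict, function(p) starts[windows == p])
names(hits) <- dict
hits
```

Because all the patterns share one width, the windows only need to be extracted once and can then be compared with the whole dictionary.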

A Trusted Band DNA dictionary is a dictionary where the patterns are DNA sequences with a trusted region, i.e. a region that will have to match exactly when the dictionary is used with the matchPDict function. This trusted region must have the same length for all patterns. It can be a prefix (Trusted Prefix), a suffix (Trusted Suffix) or more generally any substring (Trusted Band) of the input patterns. The container for this particular type of dictionary is the TBdna_PDict class (a subclass of the PDict class).

The dictionary stored in a TBdna_PDict object is split into 3 parts: the Trusted Band, the head and the tail of the dictionary. Each of them contains the same number of sequences (possibly empty), which is also the number of input patterns. The Trusted Band is the set of sequences obtained by extracting the trusted region from each input pattern. The head is the set of sequences obtained by extracting the region located before (i.e. to the left of) the trusted region from each input pattern. The tail is the set of sequences obtained by extracting the region located after (i.e. to the right of) the trusted region from each input pattern.
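
The three-way split described above can be sketched in base R on a plain character vector. This is only an illustration: split_trusted_band is a hypothetical helper that is not part of Biostrings, and it assumes that tb.start and tb.end are the 1-based start and end positions of the trusted region, which this page does not document.

```r
# Hypothetical helper, base R only; not part of Biostrings.
# Assumes tb.start and tb.end are 1-based positions of the trusted region.
split_trusted_band <- function(dict, tb.start, tb.end) {
  stopifnot(tb.start >= 1L, tb.start <= tb.end, all(nchar(dict) >= tb.end))
  list(
    head = substring(dict, 1L, tb.start - 1L),  # left of the trusted region
    tb   = substring(dict, tb.start, tb.end),   # the trusted region itself
    tail = substring(dict, tb.end + 1L)         # right of the trusted region
  )
}

dict  <- c("ACNG", "GT", "CGT", "AC")
parts <- split_trusted_band(dict, tb.start = 1, tb.end = 2)
parts
```

With these inputs every head is empty, so the trusted region is a prefix of each pattern. The tail holds whatever remains, including the non-ACGT letter N of the first pattern.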

As with a constant width DNA dictionary, the Trusted Band can only contain As, Cs, Gs and Ts (no IUPAC extended letters) for now. However, the head and the tail of a Trusted Band DNA dictionary can contain any valid DNA letter, including IUPAC extended letters (see DNA_ALPHABET for the set of valid DNA letters).

Note that a Trusted Band DNA dictionary with no head and no tail (i.e. with a head and a tail where all the sequences are empty) is in fact a constant width DNA dictionary. A Trusted Band DNA dictionary with no head is called a Trusted Prefix DNA dictionary. A Trusted Band DNA dictionary with no tail is called a Trusted Suffix DNA dictionary. Only Trusted Prefix and Trusted Suffix DNA dictionaries are currently supported.

Accessor methods

In the code snippets below, x is a PDict object.

length(x): The number of patterns in the PDict object.
width(x): For a constant width DNA dictionary, the number of nucleotides per pattern. For a Trusted Band DNA dictionary, the width of the Trusted Band (i.e. the number of nucleotides in the trusted region of each pattern).
names(x): The names of the patterns in the PDict object.
head(x): The head of the PDict object (a DNAStringSet object of length length(x) or NULL).
tail(x): The tail of the PDict object (a DNAStringSet object of length length(x) or NULL).

Other functions and generics

In the code snippets below, x is a PDict object.

duplicated(x): [DOCUMENT ME]
patternFrequency(x): [DOCUMENT ME]
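
These two functions are not documented here. Judging only from how they are used in the Examples section, a plausible reading is that duplicated(x) flags patterns that already appeared earlier in the dictionary and that patternFrequency(x) gives, for each pattern, the number of times it occurs in the dictionary. The base R sketch below shows that assumed behavior on a plain character vector. It does not call the PDict methods, and the patterns, dup and freq names are made up for this illustration.

```r
# Base R analogue on a character vector; the PDict semantics are an assumption.
patterns <- c("AC", "GT", "CG", "AC")

# TRUE for every pattern that repeats an earlier one.
dup <- duplicated(patterns)

# For each pattern, how many times it occurs in the whole vector.
freq <- as.integer(table(patterns)[patterns])

dup
freq
```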

Author(s)

H. Pages

See Also

matchPDict, DNA_ALPHABET, DNAStringSet-class, XStringViews-class

Examples

  ## Preprocessing a constant width DNA dictionary
  library(drosophila2probe)
  dict0 <- drosophila2probe$sequence   # The input sequences.
  length(dict0)                        # Hundreds of thousands of patterns.
  unique(nchar(dict0))                 # Patterns are 25-mers.
  dict0[1:5]
  pdict0 <- PDict(dict0)               # Store the input dictionary into a
                                       # PDict object (preprocessing).
  pdict0
  class(pdict0)
  length(pdict0)                       # Same as length(dict0).
  width(pdict0)                        # The number of chars per pattern.
  sum(duplicated(pdict0))
  table(patternFrequency(pdict0))      # 9 patterns are repeated 3 times.

  ## Creating a constant width DNA dictionary by truncating the input
  ## sequences
  dict1 <- c("ACNG", "GT", "CGT", "AC")
  pdict1a <- PDict(dict1, tb.end=2, drop.tail=TRUE)
  pdict1a
  class(pdict1a)
  length(pdict1a)
  width(pdict1a)
  duplicated(pdict1a)
  patternFrequency(pdict1a)

  ## Preprocessing a Trusted Prefix DNA dictionary
  pdict1b <- PDict(dict1, tb.end=2)
  pdict1b
  class(pdict1b)
  length(pdict1b)
  width(pdict1b)
  head(pdict1b)
  tail(pdict1b)

[Package Biostrings version 2.8.18 Index]