PDict-class {Biostrings} | R Documentation |
The PDict class is a container for storing a preprocessed set of patterns
(aka the dictionary) that can be used with the matchPDict
function in order to find the occurrences of all the input patterns (i.e.
the patterns stored in the PDict object) in a text in an efficient way.
Converting a set of input sequences into a PDict object is called
preprocessing. This operation is done by using the PDict
constructor.
PDict(dict, tb.start=1, tb.end=NA, drop.head=FALSE, drop.tail=FALSE, skip.invalid.patterns=FALSE)
dict | A character vector, a DNAStringSet object or an XStringViews object containing the input sequences. |
tb.start | [DOCUMENT ME] |
tb.end | [DOCUMENT ME] |
drop.head | [DOCUMENT ME] |
drop.tail | [DOCUMENT ME] |
skip.invalid.patterns | [DOCUMENT ME] |
This is a work in progress and only 2 types of dictionaries are supported at the moment: constant width DNA dictionaries and Trusted Band DNA dictionaries.
A constant width DNA dictionary is a dictionary where all the patterns are DNA sequences of the same length (i.e. all the patterns have the same number of nucleotides). For now the patterns can only contain As, Cs, Gs and Ts (no IUPAC extended letters). The container for this particular type of dictionary is the CWdna_PDict class (a subclass of the PDict class).
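The two conditions above (same width, only As, Cs, Gs and Ts) can be checked on a plain character vector before preprocessing. The following is a minimal base R sketch; is_constant_width_dna is a hypothetical helper written for illustration, not a Biostrings function, and it does not reproduce whatever validation the PDict constructor performs internally.

```r
# Hypothetical helper (not part of Biostrings): TRUE if every pattern has the
# same width and contains only the letters A, C, G and T.
is_constant_width_dna <- function(patterns) {
  same_width <- length(unique(nchar(patterns))) == 1L
  base_letters_only <- all(grepl("^[ACGT]+$", patterns))
  same_width && base_letters_only
}

is_constant_width_dna(c("ACGT", "GGTA", "TTTT"))  # same width, plain letters
is_constant_width_dna(c("ACNG", "GGTA"))          # contains the IUPAC letter N
is_constant_width_dna(c("ACG", "GGTA"))           # widths differ
```

Only the first call satisfies both conditions.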
A Trusted Band DNA dictionary is a dictionary where the patterns are DNA
sequences with a trusted region, i.e. a region that will have to match
exactly when the dictionary is used with the matchPDict
function.
This trusted region must have the same length for all patterns. It can be a
prefix (Trusted Prefix), a suffix (Trusted Suffix) or more generally any
substring (Trusted Band) of the input patterns.
The container for this particular type of dictionary is the TBdna_PDict
class (a subclass of the PDict class). The dictionary stored in a
TBdna_PDict object is split into 3 parts: the Trusted Band, the head
and the tail of the dictionary. Each of them contains the same number
of sequences (possibly empty), which is also the number of input patterns.
The Trusted Band is the set of sequences obtained by extracting
the trusted region from each input pattern.
The head is the set of sequences obtained by extracting the region located
before (i.e. to the left of) the trusted region from each input pattern.
The tail is the set of sequences obtained by extracting the region located
after (i.e. to the right of) the trusted region from each input pattern.
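The three parts described above can be pictured with plain string operations. The following is a minimal base R sketch; split_trusted_band and its band_start and band_end arguments are hypothetical names chosen for illustration, and the sketch makes no claim about how the PDict constructor interprets tb.start and tb.end or how it stores the parts.

```r
# Hypothetical helper (not part of Biostrings): split each pattern around a
# trusted region given by 1-based, inclusive positions band_start..band_end.
split_trusted_band <- function(patterns, band_start, band_end) {
  list(
    head = substring(patterns, 1L, band_start - 1L),
    band = substring(patterns, band_start, band_end),
    tail = substring(patterns, band_end + 1L, nchar(patterns))
  )
}

patterns <- c("ACNG", "GT", "CGT", "AC")
parts <- split_trusted_band(patterns, band_start = 1L, band_end = 2L)
parts$head  # four empty strings: nothing precedes the trusted region
parts$band  # "AC" "GT" "CG" "AC"
parts$tail  # "NG" ""   "T"  ""
```

Each part has one entry per input pattern, and every entry of the band has the same width, while entries of the head and the tail may be empty.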
As with a constant width DNA dictionary, the Trusted Band can only contain
As, Cs, Gs and Ts (no IUPAC extended letters) for now.
However, the head and the tail of a Trusted Band DNA dictionary can contain
any valid DNA letter including IUPAC extended letters (see DNA_ALPHABET
for the set of valid DNA letters).
Note that a Trusted Band DNA dictionary with no head and no tail (i.e. with a head and a tail where all the sequences are empty) is in fact a constant width DNA dictionary. A Trusted Band DNA dictionary with no head is called a Trusted Prefix DNA dictionary. A Trusted Band DNA dictionary with no tail is called a Trusted Suffix DNA dictionary. Only Trusted Prefix and Trusted Suffix DNA dictionaries are currently supported.
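The naming rules in the note above depend only on whether the head and the tail are empty. The following is a minimal base R sketch of that decision; dictionary_kind is a hypothetical helper written for illustration (not a Biostrings function), and it takes the head and the tail as plain character vectors with one entry per pattern.

```r
# Hypothetical helper (not part of Biostrings): name the kind of dictionary
# from its head and tail, given as character vectors (one entry per pattern).
dictionary_kind <- function(head, tail) {
  no_head <- all(nchar(head) == 0L)
  no_tail <- all(nchar(tail) == 0L)
  if (no_head && no_tail) {
    "constant width"
  } else if (no_head) {
    "Trusted Prefix"
  } else if (no_tail) {
    "Trusted Suffix"
  } else {
    "Trusted Band"
  }
}

dictionary_kind(head = c("", ""), tail = c("", ""))      # no head, no tail
dictionary_kind(head = c("", ""), tail = c("NG", ""))    # no head
dictionary_kind(head = c("AN", "T"), tail = c("", ""))   # no tail
dictionary_kind(head = c("A", ""), tail = c("", "G"))    # head and tail
```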
In the code snippets below, x is a PDict object.
length(x): The number of patterns in the PDict object.
width(x): The number of nucleotides per pattern for a constant width DNA dictionary.
The width of the Trusted Band (i.e. the number of nucleotides in the
trusted region of each pattern) for a Trusted Band DNA dictionary.
names(x): The names of the patterns in the PDict object.
head(x): The head of the PDict object (a DNAStringSet object of length length(x) or NULL).
tail(x): The tail of the PDict object (a DNAStringSet object of length length(x) or NULL).
In the code snippets below, x is a PDict object.
duplicated(x): [DOCUMENT ME]
patternFrequency(x): [DOCUMENT ME]
H. Pages
matchPDict, DNA_ALPHABET, DNAStringSet-class, XStringViews-class
## Preprocessing a constant width DNA dictionary
library(drosophila2probe)
dict0 <- drosophila2probe$sequence   # The input sequences.
length(dict0)                        # Hundreds of thousands of patterns.
unique(nchar(dict0))                 # Patterns are 25-mers.
dict0[1:5]
pdict0 <- PDict(dict0)               # Store the input dictionary into a
                                     # PDict object (preprocessing).
pdict0
class(pdict0)
length(pdict0)                       # Same as length(dict0).
width(pdict0)                        # The number of chars per pattern.
sum(duplicated(pdict0))
table(patternFrequency(pdict0))      # 9 patterns are repeated 3 times.

## Creating a constant width DNA dictionary by truncating the input
## sequences
dict1 <- c("ACNG", "GT", "CGT", "AC")
pdict1a <- PDict(dict1, tb.end=2, drop.tail=TRUE)
pdict1a
class(pdict1a)
length(pdict1a)
width(pdict1a)
duplicated(pdict1a)
patternFrequency(pdict1a)

## Preprocessing a Trusted Prefix DNA dictionary
pdict1b <- PDict(dict1, tb.end=2)
pdict1b
class(pdict1b)
length(pdict1b)
width(pdict1b)
head(pdict1b)
tail(pdict1b)