scoreSegments {tilingArray}    R Documentation
Score segments
Description
Score the segments found by a previous call to findSegments by
comparing them to the genome annotation.
Usage
scoreSegments(s, gff,
nrBasePerSeg = 1500,
probeLength = 25,
knownFeatures = c("CDS", "gene", "ncRNA", "nc_primary_transcript",
"rRNA", "snRNA", "snoRNA", "tRNA",
"transposable_element", "transposable_element_gene"),
params = c(minOverlapFractionSame = 0.8, minOverlapOppo = 40,
minIsolatedDistance=100, oppositeWindow = 100, utrScoreWidth=100),
verbose = TRUE)
Arguments
s: environment. See details.
gff: GFF dataframe.
nrBasePerSeg: Numeric of length 1. This parameter determines the
number of segments.
probeLength: Numeric of length 1.
knownFeatures: Character vector. Names of those features in
gff which should be considered known features.
params: vector of additional parameters, see details.
verbose: Logical.
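As an illustration of how knownFeatures selects rows of the annotation, here is a minimal base R sketch. The toy table and its column names (feature, start, end) are assumptions made for this example only. This page does not specify the layout of gff, and the code is not taken from the package.

```r
# Made-up annotation table; the column names are assumptions for this sketch
gff <- data.frame(
  feature = c("CDS", "gene", "intron", "tRNA", "ARS"),
  start   = c(100, 90, 150, 400, 700),
  end     = c(300, 320, 180, 470, 900),
  stringsAsFactors = FALSE
)

# A shortened version of the default knownFeatures vector
knownFeatures <- c("CDS", "gene", "ncRNA", "tRNA")

# Keep only the rows whose feature type counts as "known"
known <- gff[gff$feature %in% knownFeatures, ]
```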
Details
This function scores segments.
It is typically called after a segmentation.
For an example segmentation script, see the script segment.R
in the scripts directory of this package. For an example
scoring script, which loads the data and then calls this function, see
the script scoreSegments.R.
To compare segment coordinates with genomic coordinates, a segment's
start coordinate is defined as the coordinate of the middle (13th)
base of the first probe in the segment, and similarly its end coordinate
as the coordinate of the middle base of the last probe.
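The coordinate convention can be sketched in base R as follows. The helper probeMiddle and the probe positions are hypothetical, and the sketch assumes that a probe's position is given as the coordinate of its first base, which this page does not state.

```r
# Hypothetical helper, not part of tilingArray: coordinate of the middle base
# of each probe, assuming probeStart is the coordinate of the probe's first base
probeMiddle <- function(probeStart, probeLength = 25) {
  # For a 25-mer the middle base is the 13th, i.e. an offset of 12
  probeStart + (probeLength - 1) / 2
}

# Made-up start positions of the probes in one segment, in genomic order
probeStart <- c(1001, 1009, 1017, 1025)

mid <- probeMiddle(probeStart)
segStart <- mid[1]            # middle base of the first probe
segEnd   <- mid[length(mid)]  # middle base of the last probe
```

With these made-up positions, segStart is 1013 and segEnd is 1037.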
For each segment, we calculate and record its:
- chr, strand
- chromosome and strand
- start, end, length
- start position, end position, length (in bases)
- frac.dup
- fraction (0...1) of probes in this segment that also have
hits elsewhere
- level
- mean signal level
- geneInSegment
- list of gene identifiers (these can be zero, one,
or several identifiers). A gene is included if it is fully contained
within the segment, i.e. its start coordinate is >= the segment's
start and its end coordinate is <= the segment's end.
- overlappingFeature
- list of feature identifiers (these include
genes, CDSs (=exons), ncRNAs, ...: everything which has a line in the
GFF file). A feature is included if the overlap between it and the
segment is more than a fraction minOverlapFractionSame of
the length of the segment or of the feature, whichever is smaller. The
overlappingFeature list by definition contains the elements of the
geneInSegment list, but can be larger.
- oppositeFeature
- list of feature identifiers. A feature is
included if the overlap between it and the segment is >=
minOverlapOppo (see below) bases.
- oppositeExpression
- a number. The signal on the opposite strand is
filtered with a moving average smoother of width oppositeWindow (see
below) bases. oppositeExpression is the minimum of the result. It is
later used to eliminate potential reverse transcription artifacts from
the unannotated, potential antisense segments. If it is
sufficiently small, we can assume that there is no transcription at
least on parts of the strand opposite the segment, hence it cannot be
a reverse transcription artifact.
- isIsolatedSame, isIsolatedOppo
- logical. TRUE if the distance
between the segment and any annotated feature on the same or
opposite strand, respectively, is >= params["minIsolatedDistance"]
bases.
- utr3, utr5
- positive integer numbers. These are calculated only
for segments which have exactly one gene in the geneInSegment
list. They are calculated as the difference between start points of
segment and gene, and between end points of segment and gene,
respectively.
- distLeft, distRight
- positive integer numbers. distLeft is the
distance between the start of the segment and the closest end of any
annotated feature, and distRight is the distance between the end of
the segment and the closest start of any annotated feature.
- zLeft, zRight
- z-scores for the left (right) flank of the
segment. They are calculated as the difference between the mean of the
signal in the segment and the mean of the signal from a region of length
utrScoreWidth (see below) immediately to the left (right), divided by the
standard deviation of the signal in that region. Note that the standard
deviation of the signal within the segment is not considered here.
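Two of the quantities above, the overlappingFeature rule and the flank z-scores, can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper names are hypothetical, intervals are treated as closed so that length is end - start + 1, and all coordinates and signal values are made up.

```r
# Hypothetical helpers for illustration; they are not part of tilingArray.

# Number of bases shared by the closed intervals [s1, e1] and [s2, e2]
overlapBases <- function(s1, e1, s2, e2) {
  pmax(0, pmin(e1, e2) - pmax(s1, s2) + 1)
}

# overlappingFeature rule: the overlap must exceed minOverlapFractionSame
# times the length of the shorter of the two intervals
isOverlapping <- function(segStart, segEnd, featStart, featEnd,
                          minOverlapFractionSame = 0.8) {
  ov <- overlapBases(segStart, segEnd, featStart, featEnd)
  shorter <- pmin(segEnd - segStart + 1, featEnd - featStart + 1)
  ov > minOverlapFractionSame * shorter
}

# zLeft / zRight: difference of means, scaled by the flank's standard
# deviation only (the spread within the segment is ignored)
flankZ <- function(segSignal, flankSignal) {
  (mean(segSignal) - mean(flankSignal)) / sd(flankSignal)
}

# One made-up segment compared against two made-up features
ovl <- isOverlapping(1000, 2000,
                     featStart = c(1100, 1900),
                     featEnd   = c(1500, 2600))

# Made-up signal values for a segment and its flanking region
z <- flankZ(segSignal   = c(3, 3.2, 2.8, 3),
            flankSignal = c(0.9, 1.1, 1, 1))
```

Here the first feature lies entirely inside the segment and passes the test. The second shares only 101 bases with the segment, far below 80% of its own length, and fails.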
The meaning of the parameters in the parameter vector params is
as follows:
- minOverlapFractionSame
- see the definition of overlappingFeature above
- minOverlapOppo
- see the definition of oppositeFeature above
- minIsolatedDistance
- see the definition of isIsolatedSame, isIsolatedOppo above
- oppositeWindow
- see the definition of oppositeExpression above
- utrScoreWidth
- see the definition of zLeft, zRight above
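Because params is a named vector, a single setting can be changed by name while the others keep the defaults shown in the Usage section. The sketch below uses base R only. It builds the complete vector because this page does not say whether missing entries fall back to defaults, and the value 200 is an arbitrary example.

```r
# Defaults copied from the Usage section
params <- c(minOverlapFractionSame = 0.8, minOverlapOppo = 40,
            minIsolatedDistance = 100, oppositeWindow = 100,
            utrScoreWidth = 100)

# Example adjustment: require a larger gap before a segment counts as isolated
params["minIsolatedDistance"] <- 200

# Elements are looked up by name, as in the isIsolatedSame definition
isoDist <- params["minIsolatedDistance"]
```

The modified vector would then be passed as the params argument of scoreSegments.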
Value
A dataframe with columns as described in the details section.
Author(s)
W. Huber <huber@ebi.ac.uk>
Examples
[Package tilingArray version 1.0.2 Index]