scoreSegments {tilingArray}        R Documentation

Score segments

Description

Score the segments found by a previous call to findSegments by comparing them to the genome annotation.

Usage

scoreSegments(s, gff, 
  nrBasePerSeg = 1500, 
  probeLength  = 25,
  knownFeatures = c("CDS", "gene", "ncRNA", "nc_primary_transcript",
        "rRNA", "snRNA", "snoRNA", "tRNA",
        "transposable_element", "transposable_element_gene"),
  params = c(minOverlapFractionSame = 0.8, minOverlapOppo = 40,
    minIsolatedDistance=100, oppositeWindow = 100, utrScoreWidth=100),
  verbose = TRUE)

Arguments

s environment. See details.
gff GFF dataframe.
nrBasePerSeg Numeric of length 1. This parameter determines the number of segments.
probeLength Numeric of length 1.
knownFeatures Character vector. Names of those features in gff which should be considered known features.
params vector of additional parameters, see details.
verbose Logical.

Details

This function scores segments. It is typically called after a segmentation. For an example segmentation script, see the script segment.R in the scripts directory of this package. For an example scoring script, which loads the data and then calls this function, see the script scoreSegments.R.

To compare segment coordinates with genomic coordinates, a segment's start coordinate is defined as the coordinate of the middle (13th) base of the first probe in the segment, and its end coordinate as the coordinate of the middle base of the last probe.
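
The coordinate convention above can be sketched as follows. This is only an illustration, not the package's code: it assumes that each probe is described by the coordinate of its first base, and the names probeStart, midOffset and midBase are hypothetical.

```r
## Hypothetical input: coordinate of the first base of each probe in one segment
probeStart  <- c(1001, 1009, 1017, 1025)
probeLength <- 25

## Offset of the middle base from the first base of a probe:
## for 25-mers the middle base is the 13th, i.e. 12 bases after the first
midOffset <- (probeLength + 1) / 2 - 1

## Coordinate of the middle base of every probe
midBase <- probeStart + midOffset

segStart <- midBase[1]                 # middle base of the first probe
segEnd   <- midBase[length(midBase)]   # middle base of the last probe
```

With these inputs the segment runs from the middle base of the probe starting at 1001 to the middle base of the probe starting at 1025.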

For each segment, we calculate and record its:

chr, strand
chromosome and strand
start, end, length
start position, end position, length (in bases)
frac.dup
fraction (0...1) of probes in this segment that also have hits elsewhere
level
mean signal level
geneInSegment
list of gene identifiers (zero, one, or several). A gene is included if it is fully contained within the segment, i.e. its start coordinate is >= the segment's start and its end coordinate is <= the segment's end.
overlappingFeature
list of feature identifiers (these include genes, CDSs (=exons), ncRNAs, ..., everything which has a line in the GFF file). A feature is included if the overlap between it and the segment is more than a fraction minOverlapFractionSame of the length of the segment or of the feature, whichever is smaller. The overlappingFeature list by definition contains the elements of the geneInSegment list, but can be larger.
oppositeFeature
list of feature identifiers. A feature is included if the overlap between it and the segment is >= minOverlapOppo (see below) bases.
oppositeExpression
a number. The signal on the opposite strand is filtered with a moving average smoother of width oppositeWindow (see below) bases. oppositeExpression is the minimum of the result. It is later used to eliminate potential reverse transcription artifacts from the unannotated, potential antisense segments. If it is sufficiently small, we can assume that there is no transcription at least on parts of the strand opposite the segment, hence it cannot be a reverse transcription artifact.
isIsolatedSame, isIsolatedOppo
logical. TRUE if the distance between the segment and any annotated feature on the same or opposite strand, respectively, is >= params["minIsolatedDistance"] bases.
utr3, utr5
positive integer numbers. These are calculated only for segments which have exactly one gene in the geneInSegment list. They are calculated as the difference between start points of segment and gene, and between end points of segment and gene, respectively.
distLeft, distRight
positive integer numbers. distLeft is the distance between the start of the segment and the closest end of any annotated feature, and distRight is the distance between the end of the segment and the closest start of any annotated feature.
zLeft, zRight
z-scores for the left (right) flank of the segment. They are calculated as the difference between the mean of the signal within the segment and the mean of the signal from a region of length utrScoreWidth (see below) immediately to the left (right), divided by the standard deviation of that region. Note that the standard deviation of the signal within the segment is not considered here.
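
The inclusion rules for geneInSegment, overlappingFeature and oppositeFeature can be sketched as below. This is a minimal sketch, not the package's implementation: it assumes closed intervals with inclusive start and end coordinates, and the helper names overlapLength, isContained and isOverlapping are hypothetical.

```r
## Hypothetical helper: number of bases shared by two closed intervals
overlapLength <- function(start1, end1, start2, end2) {
  max(0, min(end1, end2) - max(start1, start2) + 1)
}

## geneInSegment rule: the feature lies entirely inside the segment
isContained <- function(segStart, segEnd, featStart, featEnd) {
  featStart >= segStart && featEnd <= segEnd
}

## overlappingFeature rule: the overlap exceeds the given fraction of the
## length of the segment or of the feature, whichever is smaller
isOverlapping <- function(segStart, segEnd, featStart, featEnd,
                          minOverlapFractionSame = 0.8) {
  ov      <- overlapLength(segStart, segEnd, featStart, featEnd)
  shorter <- min(segEnd - segStart + 1, featEnd - featStart + 1)
  ov > minOverlapFractionSame * shorter
}

## A feature inside the segment passes both rules
inside  <- isContained(1000, 2000, 1200, 1800)
overlap <- isOverlapping(1000, 2000, 1200, 1800)

## A feature that only touches the end of the segment fails the fraction rule,
## but its overlap in bases would still pass an oppositeFeature-style test
partial     <- isOverlapping(1000, 2000, 1900, 2400)
partialOppo <- overlapLength(1000, 2000, 1900, 2400) >= 40   # minOverlapOppo
```

Taking the shorter of the two lengths as the reference means that a short feature inside a long segment, and a short segment inside a long feature, are both treated as overlapping.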

The meaning of the parameters in the parameter vector params is as follows:

minOverlapFractionSame
see the definition of overlappingFeature above
minOverlapOppo
see the definition of oppositeFeature above
minIsolatedDistance
see the definition of isIsolatedSame, isIsolatedOppo above
oppositeWindow
see the definition of oppositeExpression above
utrScoreWidth
see the definition of zLeft, zRight above
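
To illustrate how oppositeWindow and utrScoreWidth enter the scores, the sketch below computes an oppositeExpression-like value and a zLeft-like value on made-up data. It assumes one signal value per base, which simplifies the probe-based data the package works with; the vectors oppSignal and signal and the index names are hypothetical.

```r
## Assumption: one signal value per base (a simplification of probe-level data)
oppositeWindow <- 100
utrScoreWidth  <- 100

## oppositeExpression: minimum of a moving average over the opposite-strand signal
oppSignal <- c(rep(2, 300), rep(0, 200))   # hypothetical signal opposite the segment
smoothed  <- stats::filter(oppSignal,
                           rep(1 / oppositeWindow, oppositeWindow),
                           sides = 2)
## Positions without a complete window are NA and are ignored
oppositeExpression <- min(smoothed, na.rm = TRUE)

## zLeft: (mean of segment - mean of left flank) / sd of left flank
signal  <- c(rep(c(0, 2), 50), rep(5, 400))   # 100 flank bases, then the segment
segIdx  <- 101:500
leftIdx <- (min(segIdx) - utrScoreWidth):(min(segIdx) - 1)
zLeft   <- (mean(signal[segIdx]) - mean(signal[leftIdx])) / sd(signal[leftIdx])
```

Because part of the made-up opposite-strand signal is zero over a stretch longer than the window, the minimum of the smoothed signal is zero, which is the situation described above as evidence against a reverse transcription artifact.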

Value

A dataframe with columns as described in the details section.

Author(s)

W. Huber <huber@ebi.ac.uk>

Examples
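
A minimal sketch of adjusting the scoring parameters before a call. The values other than the changed one are the defaults listed under Usage; the call itself is commented out because s and gff must come from a previous segmentation and from the genome annotation, which are not constructed here.

```r
## Start from the defaults shown under Usage
params <- c(minOverlapFractionSame = 0.8, minOverlapOppo = 40,
            minIsolatedDistance = 100, oppositeWindow = 100,
            utrScoreWidth = 100)

## Require a larger overlap before a feature is listed in oppositeFeature
params["minOverlapOppo"] <- 60

## Not run: 's' and 'gff' are assumed to exist already
## scored <- scoreSegments(s, gff, params = params)
```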






[Package tilingArray version 1.0.2 Index]