subsampleClustering {clusterExperiment} | R Documentation |
Given input data, this function will subsample the samples, cluster the subsamples, and return an n x n matrix with the probability of co-occurrence.
## S4 method for signature 'character'
subsampleClustering(clusterFunction, ...)

## S4 method for signature 'ClusterFunction'
subsampleClustering(clusterFunction, x = NULL, diss = NULL,
  distFunction = NA, clusterArgs = NULL,
  classifyMethod = c("All", "InSample", "OutOfSample"),
  resamp.num = 100, samp.p = 0.7, ncores = 1, checkArgs = TRUE,
  checkDiss = TRUE, largeDataset = FALSE, ...)
clusterFunction |
a |
... |
arguments passed to mclapply (if ncores>1). |
x |
the data on which to run the clustering (samples in columns). |
diss |
a dissimilarity matrix on which to run the clustering. |
distFunction |
a distance function to be applied to |
clusterArgs |
a list of parameter arguments to be passed to the function
defined in the |
classifyMethod |
method for determining which samples should be used in calculating the co-occurrence matrix. "All" = all samples, "OutOfSample" = those not subsampled, and "InSample" = those in the subsample. See details for explanation. |
resamp.num |
the number of subsamples to draw. |
samp.p |
the proportion of samples to sample for each subsample. |
ncores |
integer giving the number of cores. If ncores>1, mclapply will be called. |
checkArgs |
logical. Whether to give a warning if arguments are given that do not match the clustering choices given. Otherwise, inapplicable arguments will be ignored without warning. |
checkDiss |
logical. Whether to check whether the input |
largeDataset |
logical indicating whether a more memory-efficient version should be used because the dataset is large. This is a beta option, and is in the process of being tested before it becomes the default. |
subsampleClustering is not usually called directly by the user. It is exported only so that its arguments can be clearly documented; these arguments can be passed via the subsampleArgs argument in functions like clusterSingle and clusterMany.
classifyMethod:
The choice of "All" or "OutOfSample" for classifyMethod requires classifying arbitrary samples that were not in the original clustering into clusters; this is done via the classifyFUN provided in the ClusterFunction object. If the ClusterFunction object does not have such a function, which defines how to classify samples outside the subsample that created the clustering, then classifyMethod must be "InSample". Note that if "All" is chosen, all samples will be classified into clusters via the classifyFUN, not just those that are out-of-sample; depending on the classification function, this could give the in-sample samples different cluster assignments than the clustering originally gave them. If you do not choose "All", it is possible to get NAs in the resulting S matrix (particularly when not enough subsamples are taken), which can cause errors if you then pass the resulting D=1-S matrix to mainClustering. For this reason the default is "All".
Our subsampling algorithm is implemented in C++ and is fast and simple, but may be memory inefficient if samp.p is low. largeDataset = TRUE should be more efficient in such situations, possibly at the cost of speed. Note that this feature is experimental and should be used only by the developers.
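The classification step that "All" and "OutOfSample" rely on can be illustrated with a small base-R sketch. It is not the package's code: classifyNearest is a hypothetical stand-in for a classifyFUN that assigns each sample to its nearest k-means centroid, and the toy data and seed are made up for illustration.

```r
# Hypothetical stand-in for a classifyFUN (not part of clusterExperiment):
# assign every sample (column of x) to the nearest centroid.
classifyNearest <- function(x, centers) {
  # x: features x samples; centers: clusters x features (as returned by kmeans)
  d <- apply(centers, 1, function(ctr) colSums((x - ctr)^2))  # samples x clusters
  max.col(-d, ties.method = "first")  # index of the smallest distance per sample
}

set.seed(2)
# Toy data: 2 features x 20 samples, two well-separated groups of 10
x <- cbind(matrix(rnorm(20, 0, 0.1), 2), matrix(rnorm(20, 5, 0.1), 2))

idx <- sample(ncol(x), 14)                     # one subsample (samp.p = 0.7)
fit <- kmeans(t(x[, idx]), centers = 2, nstart = 5)

allLabels <- classifyNearest(x, fit$centers)   # "All": every sample is classified
outLabels <- allLabels[-idx]                   # "OutOfSample": held-out samples only
inLabels  <- fit$cluster                       # "InSample": labels from the clustering itself
```

With "InSample", only the 14 subsampled columns receive a label in this draw, which is why some pairs may never be scored together when few subsamples are taken.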
An n x n matrix of co-occurrences, i.e. a symmetric matrix whose [i,j] entry is the proportion of subsamples in which the ith and jth samples were clustered into the same cluster. The proportion is taken only over those subsamples where the ith and jth samples were both assigned to a cluster. If classifyMethod=="All", this is all subsamples for all i,j pairs. But if classifyMethod=="InSample" or classifyMethod=="OutOfSample", then the proportion is taken only over those subsamples where the ith and jth samples were both in or both out of the subsample, respectively.
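The definition of the returned matrix can be made concrete with a minimal base-R sketch of the "InSample" case. This is an illustration only, not the package's C++ implementation; coOccurInSample is a hypothetical helper, and the toy data, seed, and use of kmeans are assumptions.

```r
# Hypothetical helper (not part of clusterExperiment): "InSample"-style
# co-occurrence proportions from repeated k-means on subsamples.
coOccurInSample <- function(x, k, resamp.num = 100, samp.p = 0.7) {
  n <- ncol(x)                     # samples are in columns
  together <- matrix(0, n, n)      # times i and j fell in the same cluster
  both <- matrix(0, n, n)          # times i and j were both in the subsample
  size <- round(samp.p * n)
  for (b in seq_len(resamp.num)) {
    idx <- sample(n, size)
    cl <- kmeans(t(x[, idx, drop = FALSE]), centers = k, nstart = 5)$cluster
    both[idx, idx] <- both[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  together / both                  # 0/0 yields NaN for pairs never sampled together
}

set.seed(1)
# Toy data: 2 features x 30 samples, three well-separated groups of 10
x <- cbind(matrix(rnorm(20, 0, 0.1), 2),
           matrix(rnorm(20, 5, 0.1), 2),
           matrix(rnorm(20, 10, 0.1), 2))
S <- coOccurInSample(x, k = 3, resamp.num = 50)
D <- 1 - S                         # dissimilarity, as described for mainClustering
```

Pairs that never land in the same subsample get a missing value here, which mirrors the NAs the details section warns about when classifyMethod is not "All".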
## Not run:
# takes a bit of time, not run on checks:
data(simData)
coOccur <- subsampleClustering(clusterFunction = "kmeans", x = simData,
  clusterArgs = list(k = 3, nstart = 10), resamp.num = 100, samp.p = 0.7)

# visualize the resulting co-occurrence matrix
plotHeatmap(coOccur)
## End(Not run)