Help for package BioM2

Title:

Biologically Explainable Machine Learning Framework

Version:

1.1.3

Description:

Biologically Explainable Machine Learning Framework for Phenotype Prediction using omics data described in Chen and Schwarz (2017) <doi:10.48550/arXiv.1712.00336>. Identifying reproducible and interpretable biological patterns from high-dimensional omics data is a critical factor in understanding the risk mechanism of complex disease. As such, explainable machine learning can offer biological insight in addition to personalized risk scoring. In this process, a feature space of biological pathways will be generated, and the feature space can also be subsequently analyzed using WGCNA methods (described in Horvath and Zhang (2005) <doi:10.2202/1544-6115.1128> and Langfelder and Horvath (2008) <doi:10.1186/1471-2105-9-559>).

License:

MIT + file LICENSE

Encoding:

UTF-8

RoxygenNote:

7.2.3

Imports:

WGCNA, mlr3, CMplot, ggsci, ROCR, caret, ggplot2, ggpubr, viridis, ggthemes, ggstatsplot, htmlwidgets, mlr3verse, parallel, uwot, webshot, wordcloud2, ggforce, igraph, ggnetwork

Depends:

R (≥ 4.1.0)

LazyData:

true

NeedsCompilation:

Packaged:

2025-07-17 08:11:25 UTC; Shedom

Author:

Shunjie Zhang [aut, cre], Junfang Chen [aut]

Maintainer:

Shunjie Zhang <zhang.shunjie@qq.com>

Repository:

CRAN

Date/Publication:

2025-07-17 08:30:02 UTC

Add unmapped probe

Description

Add unmapped probe

Usage

AddUnmapped(
  train = NULL,
  test = NULL,
  Unmapped_num = NULL,
  Add_FeartureSelection_Method = "wilcox.test",
  anno = NULL,
  len = NULL,
  verbose = TRUE,
  cores = 1
)

Arguments

train

The input training dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate the class member.

test

The input test dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate the class member.

Unmapped_num

The number of unmapped probes.

Add_FeartureSelection_Method

Feature selection methods. Available options are c('cor', 'wilcox.test').

anno

The annotation data stored in a data.frame for probe mapping. It must have at least two columns named 'ID' and 'entrezID'. For details, please refer to data("MethylAnno").

len

The number of unmapped probes

verbose

Whether to print running process information to the console

cores

The number of cores used for computation.

Value

Matrix of unmapped probes
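The selection of unmapped probes can be sketched in base R. This is an illustration only, not BioM2's internal code: it assumes that probes missing from anno$ID count as unmapped and that they are ranked by Wilcoxon rank-sum p-value between the two classes. The probe names and values are made up.

```r
# Toy training data: label in the first column, probes in the rest.
label <- rep(c(0, 1), each = 10)
train <- data.frame(
  label = label,
  cg01 = 1:20,
  cg02 = c(1:10, 21:30),                             # classes fully separated
  cg03 = c(seq(1, 19, by = 2), seq(2, 20, by = 2)),  # classes interleaved
  cg04 = c(1:10, 8.5, 9.5, 11:18)                    # classes mostly separated
)

# Hypothetical annotation: only cg01 maps to a gene.
anno <- data.frame(ID = "cg01", entrezID = "1", stringsAsFactors = FALSE)

# Assumption: probes absent from the annotation are "unmapped".
unmapped <- setdiff(colnames(train)[-1], anno$ID)

# Wilcoxon rank-sum p-value of each unmapped probe between the classes.
pvals <- vapply(unmapped, function(p) {
  x <- train[[p]]
  wilcox.test(x[label == 0], x[label == 1])$p.value
}, numeric(1))

# Keep the Unmapped_num probes with the smallest p-values.
Unmapped_num <- 2
top_unmapped <- names(sort(pvals))[seq_len(min(Unmapped_num, length(pvals)))]
unmapped_matrix <- as.matrix(train[, top_unmapped, drop = FALSE])
```

Here the two probes that separate the classes best are retained, and the result is a samples-by-probes matrix.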

Biologically Explainable Machine Learning Framework

Description

Biologically Explainable Machine Learning Framework

Usage

BioM2(
  TrainData = NULL,
  TestData = NULL,
  pathlistDB = NULL,
  FeatureAnno = NULL,
  resampling = NULL,
  nfolds = 5,
  classifier = "liblinear",
  paramlist = NULL,
  predMode = "probability",
  PathwaySizeUp = 200,
  PathwaySizeDown = 20,
  MinfeatureNum_pathways = 10,
  Add_UnMapped = TRUE,
  Unmapped_num = 300,
  Add_FeartureSelection_Method = "wilcox.test",
  Inner_CV = TRUE,
  inner_folds = 10,
  Stage1_FeartureSelection_Method = "cor",
  cutoff = 0.3,
  Stage2_FeartureSelection_Method = "RemoveHighcor",
  cutoff2 = 0.95,
  classifier2 = NULL,
  target = "predict",
  p.adjust.method = "fdr",
  save_pathways_matrix = FALSE,
  cores = 1,
  seed = 10,
  verbose = TRUE
)

Arguments

TrainData

The input training dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate the class member.

TestData

The input test dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate the class member.

pathlistDB

A list of pathways with pathway IDs and their corresponding genes ('entrezID' is used). For details, please refer to data("GO2ALLEGS_BP").

FeatureAnno

The annotation data stored in a data.frame for probe mapping. It must have at least two columns named 'ID' and 'entrezID'. For details, please refer to data("MethylAnno").

resampling

Resampling in mlr3verse.

nfolds

Number of folds for k-fold cross-validation (only supported when TestData = NULL).

classifier

Learners in mlr3

paramlist

Learner parameters search spaces

predMode

The prediction mode. Currently only supports 'probability' for binary classification tasks.

PathwaySizeUp

The upper bound on the number of genes in each biological pathway.

PathwaySizeDown

The lower bound on the number of genes in each biological pathway.

MinfeatureNum_pathways

The minimal defined pathway size after mapping your own data to pathlistDB (KEGG database/GO database).

Add_UnMapped

Whether to add unmapped probes for prediction

Unmapped_num

The number of unmapped probes

Add_FeartureSelection_Method

Feature selection methods.

Inner_CV

Whether to perform k-fold cross-validation on the training set.

inner_folds

Number of folds for the k-fold cross-validation on the training set.

Stage1_FeartureSelection_Method

Feature selection methods.

cutoff

The cutoff used for feature selection threshold. It can be any value between 0 and 1.

Stage2_FeartureSelection_Method

Feature selection methods.

cutoff2

The cutoff used for feature selection threshold. It can be any value between 0 and 1.

classifier2

Learner for stage 2 prediction (if classifier2 is NULL, the stage 1 learner is used).

target

Is it used to predict or explore potential biological mechanisms? Available options are c('predict', 'pathways').

p.adjust.method

p-value adjustment method ("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none").

save_pathways_matrix

Whether to output the pathway matrix file

cores

The number of cores used for computation.

seed

Cross-Validation seed

verbose

Whether to print running process information to the console

Value

A list containing prediction results and prediction result evaluation

Examples




library(mlr3verse)
library(caret)
library(parallel)
library(BioM2)

# Split the bundled methylation data into training (80%) and test (20%) sets
data <- MethylData_Test
set.seed(1)
part <- unlist(createDataPartition(data$label, p = 0.8))
Train <- data[part, ]
Test <- data[-part, ]
pathlistDB <- GO2ALLEGS_BP
FeatureAnno <- MethylAnno

# Run BioM2 for prediction.
# To explore biological mechanisms instead, set target = 'pathways'.
pred <- BioM2(TrainData = Train, TestData = Test,
              pathlistDB = pathlistDB, FeatureAnno = FeatureAnno,
              classifier = 'svm', nfolds = 5,
              PathwaySizeUp = 25, PathwaySizeDown = 20, MinfeatureNum_pathways = 10,
              Add_UnMapped = 'Yes', Unmapped_num = 300,
              Inner_CV = 'None', inner_folds = 5,
              Stage1_FeartureSelection_Method = 'cor', cutoff = 0.3,
              Stage2_FeartureSelection_Method = 'None',
              target = 'predict', cores = 1
)
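To make the roles of pathlistDB, FeatureAnno and the pathway size arguments more concrete, the following base R sketch builds tiny stand-ins and filters them. It is an assumption-laden illustration, not BioM2's own code: all IDs are made up, and the size bounds are treated as inclusive, which the documentation does not state.

```r
# Toy stand-in for pathlistDB: a named list of entrez IDs per pathway.
pathlistDB <- list(
  "GO:A" = c("1", "2", "3"),
  "GO:B" = c("3", "4", "5", "6", "8"),
  "GO:C" = c("7")
)

# Toy stand-in for FeatureAnno: one row per probe, with 'ID' and 'entrezID'.
FeatureAnno <- data.frame(
  ID = c("p1", "p2", "p3", "p4", "p5", "p6"),
  entrezID = c("1", "2", "2", "3", "5", "9"),
  stringsAsFactors = FALSE
)

# Keep pathways whose gene count lies within [PathwaySizeDown, PathwaySizeUp].
PathwaySizeDown <- 2
PathwaySizeUp <- 5
sizes <- lengths(pathlistDB)
kept <- pathlistDB[sizes >= PathwaySizeDown & sizes <= PathwaySizeUp]

# Map each kept pathway to the probes annotated to its genes.
probes_by_pathway <- lapply(kept, function(genes) {
  FeatureAnno$ID[FeatureAnno$entrezID %in% genes]
})

# Drop pathways left with fewer than MinfeatureNum_pathways probes.
MinfeatureNum_pathways <- 3
enough <- lengths(probes_by_pathway) >= MinfeatureNum_pathways
probes_by_pathway <- probes_by_pathway[enough]
```

In this toy case "GO:C" is removed for being too small, "GO:B" is removed because only two probes map to it, and "GO:A" survives with four probes.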

Find suitable parameters for partitioning pathways modules

Description

Find suitable parameters for partitioning pathways modules

Usage

FindParaModule(
  pathways_matrix = NULL,
  control_label = 0,
  minModuleSize = seq(10, 20, 5),
  mergeCutHeight = seq(0, 0.3, 0.1),
  minModuleNum = 5,
  power = NULL,
  exact = TRUE,
  ancestor_anno = NULL
)

Arguments

pathways_matrix

A pathway matrix generated by BioM2(target = 'pathways').

control_label

The label of the control group (a single number, factor, or character).

minModuleSize

Minimum module size for module detection. See WGCNA::blockwiseModules() for details.

mergeCutHeight

Dendrogram cut height for module merging. See WGCNA::blockwiseModules() for details.

minModuleNum

Minimum total number of modules detected.

power

Soft-thresholding power for network construction. See WGCNA::blockwiseModules() for details.

exact

Whether to divide GO pathways more accurately (takes effect only when ancestor_anno = NULL).

ancestor_anno

Annotations for ancestral relationships (such as data('GO_Ancestor')).

Value

A list containing recommended parameters
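The default minModuleSize and mergeCutHeight values are vectors, which suggests that candidate combinations are compared. How FindParaModule() scores them is not described here, so the sketch below only enumerates the default candidates with base R.

```r
# Default candidate values from the Usage section.
minModuleSize <- seq(10, 20, 5)       # 10, 15, 20
mergeCutHeight <- seq(0, 0.3, 0.1)    # 0, 0.1, 0.2, 0.3

# All combinations of the two parameters, one row per candidate setting.
grid <- expand.grid(
  minModuleSize = minModuleSize,
  mergeCutHeight = mergeCutHeight
)
n_candidates <- nrow(grid)
```

With the defaults this gives 12 candidate settings, so widening either sequence increases the search cost multiplicatively.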

An example about pathlistDB

Description

An example about pathlistDB

Format

A list :

...

Details

A list of pathways with pathway IDs and their corresponding genes ('entrezID' is used).

Pathways in the GO database and their Ancestor

Description

Inclusion relationships between pathways

Format

A data frame :

...

Details

In the GO database, each pathway has its own ancestor pathways. This dataset maps pathways in the GO database to about 20 common ancestor pathways.

Source

From GO.db

Pathways in the GO database and their Ancestor

Description

Inclusion relationships between pathways

Format

A data frame :

...

Details

In the GO database, each pathway has its own ancestor pathways. This dataset maps pathways in the GO database to about 400 common ancestor pathways.

Source

From GO.db

BioM2 Hyperparameter Combination

Description

BioM2 Hyperparameter Combination

Usage

HyBioM2(
  TrainData = NULL,
  pathlistDB = NULL,
  FeatureAnno = NULL,
  resampling = NULL,
  nfolds = 5,
  classifier = "liblinear",
  predMode = "probability",
  PathwaySizeUp = 200,
  PathwaySizeDown = 20,
  MinfeatureNum_pathways = 10,
  Add_UnMapped = TRUE,
  Add_FeartureSelection_Method = "wilcox.test",
  Unmapped_num = 300,
  Inner_CV = TRUE,
  inner_folds = 10,
  Stage1_FeartureSelection_Method = "cor",
  stage1_cutoff = 0.3,
  Stage2_FeartureSelection_Method = "RemoveHighcor",
  stage2_cutoff = 0.8,
  classifier2 = NULL,
  cores = 1,
  verbose = TRUE
)

Arguments

TrainData

The input training dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate the class member.

pathlistDB

A list of pathways with pathway IDs and their corresponding genes ('entrezID' is used). For details, please refer to data("GO2ALLEGS_BP").

FeatureAnno

The annotation data stored in a data.frame for probe mapping. It must have at least two columns named 'ID' and 'entrezID'. For details, please refer to data("MethylAnno").

resampling

Resampling in mlr3verse.

nfolds

Number of folds for k-fold cross-validation.

classifier

Learners in mlr3

predMode

The prediction mode. Currently only supports 'probability' for binary classification tasks.

PathwaySizeUp

The upper bound on the number of genes in each biological pathway.

PathwaySizeDown

The lower bound on the number of genes in each biological pathway.

MinfeatureNum_pathways

The minimal defined pathway size after mapping your own data to pathlistDB (KEGG database/GO database).

Add_UnMapped

Whether to add unmapped probes for prediction

Add_FeartureSelection_Method

Feature selection methods.

Unmapped_num

The number of unmapped probes

Inner_CV

Whether to perform k-fold cross-validation on the training set.

inner_folds

Number of folds for the k-fold cross-validation on the training set.

Stage1_FeartureSelection_Method

Feature selection methods.

stage1_cutoff

The cutoff used for feature selection threshold. It can be any value between 0 and 1.

Stage2_FeartureSelection_Method

Feature selection methods.

stage2_cutoff

The cutoff used for feature selection threshold. It can be any value between 0 and 1.

classifier2

Learner for stage 2 prediction (if classifier2 is NULL, the stage 1 learner is used).

cores

The number of cores used for computation.

verbose

Whether to print running process information to the console

Value

A data frame containing hyperparameter results

An example about FeatureAnno for methylation data

Description

An example about FeatureAnno for methylation data

Format

A data frame :

...

Details

The annotation data stored in a data.frame for probe mapping. It must have at least two columns named 'ID' and 'entrezID'.

An example about TrainData/TestData for methylation data

Description

An example about TrainData/TestData for methylation data MethylData_Test.

Format

A data frame :

...

Details

The first column is the label or the output. For binary classes, 0 and 1 are used to indicate the class member.

Delineate differential pathway modules with high biological interpretability

Description

Delineate differential pathway modules with high biological interpretability

Usage

PathwaysModule(
  pathways_matrix = NULL,
  control_label = NULL,
  power = NULL,
  minModuleSize = NULL,
  mergeCutHeight = NULL,
  cutoff = 70,
  MinNumPathways = 5,
  p.adjust.method = "fdr",
  exact = TRUE,
  ancestor_anno = NULL
)

Arguments

pathways_matrix

A pathway matrix generated by BioM2(target = 'pathways').

control_label

The label of the control group (a single number, factor, or character).

power

Soft-thresholding power for network construction. See WGCNA::blockwiseModules() for details.

minModuleSize

Minimum module size for module detection. See WGCNA::blockwiseModules() for details.

mergeCutHeight

Dendrogram cut height for module merging. See WGCNA::blockwiseModules() for details.

cutoff

Thresholds for Biological Interpretability Difference Modules

MinNumPathways

Minimum number of pathways included in the biologically interpretable difference module

p.adjust.method

p-value adjustment method ("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none").

exact

Whether to divide GO pathways more accurately (takes effect only when ancestor_anno = NULL).

ancestor_anno

Annotations for ancestral relationships (such as data('GO_Ancestor')).

Value

A list containing differential module results that are highly biologically interpretable
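The p.adjust.method values listed above are the method names accepted by base R's stats::p.adjust(). The short sketch below, using made-up p-values, shows how the choice changes the adjusted values; it does not reproduce how PathwaysModule() applies the adjustment internally.

```r
# Made-up raw p-values for four hypothetical modules.
p_raw <- c(0.01, 0.02, 0.03, 0.04)

# Method names accepted by stats::p.adjust().
available <- p.adjust.methods

# "fdr" is an alias for the Benjamini-Hochberg ("BH") procedure.
p_fdr <- p.adjust(p_raw, method = "fdr")

# Bonferroni multiplies each p-value by the number of tests (capped at 1).
p_bonf <- p.adjust(p_raw, method = "bonferroni")
```

For these inputs the "fdr" adjustment yields 0.04 for every module, while Bonferroni yields 0.04, 0.08, 0.12 and 0.16, which illustrates how much more conservative the latter is.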

Correlogram for Biological Differences Modules

Description

Correlogram for Biological Differences Modules

Usage

PlotCorModule(
  PathwaysModule_obj = NULL,
  alpha = 0.7,
  begin = 0.2,
  end = 0.9,
  option = "C",
  family = "serif"
)

Arguments

PathwaysModule_obj

Results produced by PathwaysModule()

alpha

The alpha transparency, a number in (0,1). See scale_fill_viridis() for details.

begin

The (corrected) hue in (0,1) at which the color map begins. See scale_fill_viridis() for details.

end

The (corrected) hue in (0,1) at which the color map ends. See scale_fill_viridis() for details.

option

A character string indicating the color map option to use. See scale_fill_viridis() for details.

family

The font family used for text in the plot.

Value

a ggplot object

Visualisation of significant pathway-level features

Description

Visualisation of significant pathway-level features

Usage

PlotPathFearture(
  BioM2_pathways_obj = NULL,
  pathlistDB = NULL,
  top = 10,
  p.adjust.method = "none",
  begin = 0.1,
  end = 0.9,
  alpha = 0.9,
  option = "C",
  seq = 1
)

Arguments

BioM2_pathways_obj

Results produced by BioM2(..., target = 'pathways')

pathlistDB

A list of pathways with pathway IDs and their corresponding genes ('entrezID' is used). For details, please refer to data("GO2ALLEGS_BP").

top

Number of significant pathway-level features visualised

p.adjust.method

p-value adjustment method ("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none").

begin

The (corrected) hue in (0,1) at which the color map begins. See scale_fill_viridis() for details.

end

The (corrected) hue in (0,1) at which the color map ends. See scale_fill_viridis() for details.

alpha

The alpha transparency, a number in (0,1). See scale_fill_viridis() for details.

option

A character string indicating the color map option to use. See scale_fill_viridis() for details.

seq

Interval of x-coordinate

Value

a ggplot2 object

Visualisation of the original features that make up the pathway

Description

Visualisation of the original features that make up the pathway

Usage

PlotPathInner(
  data = NULL,
  pathlistDB = NULL,
  FeatureAnno = NULL,
  PathNames = NULL,
  p.adjust.method = "none",
  save_pdf = FALSE,
  alpha = 1,
  cols = NULL
)

Arguments

data

The input omics data

pathlistDB

A list of pathways with pathway IDs and their corresponding genes ('entrezID' is used). For details, please refer to data("GO2ALLEGS_BP").

FeatureAnno

The annotation data stored in a data.frame for probe mapping. It must have at least two columns named 'ID' and 'entrezID'. For details, please refer to data("MethylAnno").

PathNames

A vector containing the names of pathways.

p.adjust.method

p-value adjustment method ("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none").

save_pdf

Whether to save images in PDF format

alpha

The alpha transparency, a number in (0,1).

cols

palette (vector of colour names)

Value

a plot object

Network diagram of pathways-level features

Description

Network diagram of pathways-level features

Usage

PlotPathNet(
  data = NULL,
  BioM2_pathways_obj = NULL,
  FeatureAnno = NULL,
  pathlistDB = NULL,
  PathNames = NULL,
  cutoff = 0.2,
  num = 10
)

Arguments

data

The input omics data

BioM2_pathways_obj

Results produced by BioM2()

FeatureAnno

The annotation data stored in a data.frame for probe mapping. It must have at least two columns named 'ID' and 'entrezID'. For details, please refer to data("MethylAnno").

pathlistDB

A list of pathways with pathway IDs and their corresponding genes ('entrezID' is used). For details, please refer to data("GO2ALLEGS_BP").

PathNames

A vector containing the names of pathways.

cutoff

Threshold for correlation between features within a pathway

num

The first few internal features of each pathway that are most relevant to the phenotype

Value

a ggplot object

Display biological information within each pathway module

Description

Display biological information within each pathway module

Usage

ShowModule(obj = NULL, ID_Module = NULL, exact = TRUE, ancestor_anno = NULL)

Arguments

obj

Results produced by PathwaysModule()

ID_Module

ID of the diff module

exact

Whether to divide GO pathways more accurately (takes effect only when ancestor_anno = NULL).

ancestor_anno

Annotations for ancestral relationships (such as data('GO_Ancestor')).

Value

List containing biologically specific information within the module

Stage 1 Feature Selection

Description

Stage 1 Feature Selection

Usage

Stage1_FeartureSelection(
  Stage1_FeartureSelection_Method = "cor",
  data = NULL,
  cutoff = NULL,
  featureAnno = NULL,
  pathlistDB_sub = NULL,
  MinfeatureNum_pathways = 10,
  cores = 1,
  verbose = TRUE
)

Arguments

Stage1_FeartureSelection_Method

Feature selection methods. Available options are c(NULL, 'cor', 'wilcox.test', 'cor_rank', 'wilcox.test_rank').

data

The input training dataset. The first column is the label.

cutoff

The cutoff used for feature selection threshold. It can be any value between 0 and 1. Commonly used cutoffs are c(0.5, 0.1, 0.05, 0.01, etc.).

featureAnno

The annotation data stored in a data.frame for probe mapping. It must have at least two columns named 'ID' and 'entrezID'. For details, please refer to data("MethylAnno").

pathlistDB_sub

A list of pathways with pathway IDs and their corresponding genes ('entrezID' is used). For details, please refer to data("GO2ALLEGS_BP").

MinfeatureNum_pathways

The minimal defined pathway size after mapping your own data to pathlistDB (KEGG database/GO database).

cores

The number of cores used for computation.

verbose

Whether to print running process information to the console

Value

A list of matrices with pathway IDs as the associated list member names.

Author(s)

Shunjie Zhang

Examples


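library(parallel)
library(BioM2)

# Use the bundled methylation data, annotation and GO pathway list
data <- MethylData_Test
feature_pathways <- Stage1_FeartureSelection(
  Stage1_FeartureSelection_Method = 'cor',
  data = data, cutoff = 0,
  featureAnno = MethylAnno, pathlistDB_sub = GO2ALLEGS_BP, cores = 1
)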
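As a rough picture of what a 'cor' filter with a cutoff can look like, the base R sketch below keeps features whose absolute Pearson correlation with the label exceeds the cutoff. This criterion is an assumption made for illustration; the exact rule used by Stage1_FeartureSelection() is not specified in this page, and the feature names and values are made up.

```r
# Toy data: label in the first column, three candidate features.
label <- rep(c(0, 1), each = 5)
data <- data.frame(
  label = label,
  f1 = c(1, 2, 1, 2, 1, 5, 6, 5, 6, 5),  # rises with the label
  f2 = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),  # same distribution in both classes
  f3 = c(5, 6, 5, 6, 5, 1, 2, 1, 2, 1)   # falls with the label
)

# Pearson correlation of every feature with the label.
r <- vapply(data[-1], function(x) cor(x, data$label), numeric(1))

# Assumed rule: keep features with |r| above the cutoff.
cutoff <- 0.3
selected <- names(r)[abs(r) > cutoff]
```

Both the positively and the negatively associated feature are kept, while the uninformative one is dropped.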

Stage 2 Feature Selection

Description

Stage 2 Feature Selection

Usage

Stage2_FeartureSelection(
  Stage2_FeartureSelection_Method = "RemoveHighcor",
  data = NULL,
  label = NULL,
  cutoff = NULL,
  preMode = NULL,
  classifier = NULL,
  verbose = TRUE,
  cores = 1
)

Arguments

Stage2_FeartureSelection_Method

Feature selection methods. Available options are c(NULL, 'cor', 'wilcox.test','cor_rank','wilcox.test_rank','RemoveHighcor', 'RemoveLinear').

data

The input training dataset. The first column is the label.

label

The label of dataset

cutoff

The cutoff used for feature selection threshold. It can be any value between 0 and 1.

preMode

The prediction mode. Currently only supports 'probability' for binary classification tasks.

classifier

Learners in mlr3

verbose

Whether to print running process information to the console

cores

The number of cores used for computation.

Value

Column index of feature

Author(s)

Shunjie Zhang
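The idea behind a 'RemoveHighcor' style filter can be sketched in base R as a greedy pass over the feature correlation matrix that returns the column indices to keep. This is a simplified stand-in written for illustration, under the assumption that "high correlation" means an absolute pairwise correlation above the cutoff; it is not the package's implementation, and the data are made up.

```r
# Toy pathway-level features; 'b' is nearly a copy of 'a'.
x <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(1.1, 2.0, 3.2, 3.9, 5.1, 6.0),
  c = c(3, 1, 4, 1, 5, 9)
)

cutoff <- 0.95
cm <- abs(cor(x))

# Greedy pass: for each retained column, discard later columns that are
# correlated with it above the cutoff.
dropped <- integer(0)
for (j in seq_len(ncol(cm))) {
  if (j %in% dropped) next
  later <- setdiff(which(cm[j, ] > cutoff), c(seq_len(j), dropped))
  dropped <- c(dropped, later)
}

# Column indices of the features that remain.
keep_index <- setdiff(seq_len(ncol(x)), dropped)
```

Only one member of each highly correlated group is kept, so here columns 1 and 3 remain.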

An example about FeatureAnno for gene expression

Description

An example about FeatureAnno for gene expression

Format

A data frame :

...

Details

The annotation data stored in a data.frame for probe mapping. It must have at least two columns named 'ID' and 'entrezID'.

An example about TrainData/TestData for gene expression

Description

An example about TrainData/TestData for gene expression MethylData_Test.

Format

A data frame :

...

Details

The first column is the label or the output. For binary classes, 0 and 1 are used to indicate the class member.

Visualisation of the results of the analysis of the pathway modules

Description

Visualisation of the results of the analysis of the pathway modules

Usage

VisMultiModule(
  BioM2_pathways_obj = NULL,
  FindParaModule_obj = NULL,
  ShowModule_obj = NULL,
  PathwaysModule_obj = NULL,
  exact = TRUE,
  ancestor_anno = NULL,
  type_text_table = FALSE,
  text_table_theme = ttheme("mOrange"),
  volin = FALSE,
  control_label = 0,
  module = NULL,
  cols = NULL,
  n_neighbors = 8,
  spread = 1,
  min_dist = 2,
  target_weight = 0.5,
  size = 1.5,
  alpha = 1,
  ellipse = TRUE,
  ellipse.alpha = 0.2,
  theme = ggthemes::theme_base(base_family = "serif"),
  save_pdf = FALSE,
  width = 7,
  height = 7
)

Arguments

BioM2_pathways_obj

Results produced by BioM2(..., target = 'pathways')

FindParaModule_obj

Results produced by FindParaModule()

ShowModule_obj

Results produced by ShowModule()

PathwaysModule_obj

Results produced by PathwaysModule()

exact

Whether to divide GO pathways more accurately (takes effect only when ancestor_anno = NULL).

ancestor_anno

Annotations for ancestral relationships (such as data('GO_Ancestor')).

type_text_table

Whether to display the results in a table.

text_table_theme

The theme of the table. See ggtexttable() for details.

volin

Whether to draw a violin plot. Can only be used when PathwaysModule_obj is supplied.

control_label

The control group label. Can only be used when PathwaysModule_obj is supplied.

module

The PathwaysModule ID. Can only be used when PathwaysModule_obj is supplied.

cols

palette (vector of colour names)

n_neighbors

The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.

spread

The effective scale of embedded points. In combination with min_dist, this determines how clustered/clumped the embedded points are.

min_dist

The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result on a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.

target_weight

Weighting factor between data topology and target topology. A value of 0.0 weights entirely on data, a value of 1.0 weights entirely on target. The default of 0.5 balances the weighting equally between data and target. Only applies if y is non-NULL.

size

Scatter plot point size

alpha

Alpha for ellipse specifying the transparency level of fill color. Use alpha = 0 for no fill color.

ellipse

logical value. If TRUE, draws ellipses around points.

ellipse.alpha

Alpha for ellipse specifying the transparency level of fill color. Use alpha = 0 for no fill color.

theme

The ggplot2 theme applied to the plot. Default: ggthemes::theme_base(base_family = "serif")

save_pdf

Whether to save images in PDF format

width

image width

height

image height

Value

a ggplot2 object

Prediction by Machine Learning

Description

Prediction by Machine Learning with different learners ( From 'mlr3' )

Usage

baseModel(
  trainData,
  testData,
  predMode = "probability",
  classifier,
  paramlist = NULL,
  inner_folds = 10,
  seed = 10
)

Arguments

trainData

The input training dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate the class member.

testData

The input test dataset. The first column is the label or the output. For binary classes, 0 and 1 are used to indicate the class member.

predMode

The prediction mode. Currently only supports 'probability' for binary classification tasks.

classifier

Learners in mlr3

paramlist

Learner parameters search spaces

inner_folds

Number of folds for k-fold cross-validation (only supported when testData = NULL).

seed

Cross-Validation seed

Value

The predicted output for the test data.

Author(s)

Shunjie Zhang

Examples

library(mlr3verse)
library(caret)
library(BioM2)

# Split the bundled methylation data into training (80%) and test (20%) sets
data <- MethylData_Test
set.seed(1)
part <- unlist(createDataPartition(data$label, p = 0.8))

# Columns 1:10 hold the label plus the first nine features; the learner is an SVM
pred <- baseModel(trainData = data[part, 1:10],
                  testData = data[-part, 1:10],
                  classifier = 'svm')
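Predicted probabilities for a binary task are commonly summarised with the area under the ROC curve (AUC) and accuracy. The base R sketch below assumes the predictions are available as a numeric vector of class-1 probabilities aligned with the true 0/1 labels; the values are made up, and the exact structure returned by baseModel() should be checked before applying it.

```r
# Made-up true labels and predicted probabilities of class 1.
truth <- c(0, 0, 0, 1, 1, 1)
prob <- c(0.1, 0.4, 0.35, 0.8, 0.3, 0.9)

# AUC via the rank-sum (Mann-Whitney) identity: the fraction of
# positive/negative pairs in which the positive sample scores higher.
r <- rank(prob)
n1 <- sum(truth == 1)
n0 <- sum(truth == 0)
auc <- (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Accuracy at a 0.5 probability threshold.
pred_class <- as.integer(prob > 0.5)
acc <- mean(pred_class == truth)
```

In this toy case 7 of the 9 positive/negative pairs are ordered correctly (AUC = 7/9), and 5 of 6 samples are classified correctly at the 0.5 threshold.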