A new pipeline to explore structural similarity across metabolite modules

1 Introduction

MetChem is an R package used to perform structural and functional analysis of metabolites using a simple pipeline.

2 Installation

2.1 Installation via CRAN

The R package MetChem (current version 0.2) is part of the Comprehensive R Archive Network (CRAN)1. The simplest way to install the package is to enter the following command into your R session: install.packages("MetChem"). We suggest installing the following R packages: pheatmap and RColorBrewer to enable data visualization in heatmaps, readxl for the data reading of Excel files, and impute for the imputation of missing data.

# To install the pheatmap package
install.packages("pheatmap")

# To install the RColorBrewer package
install.packages("RColorBrewer")

# To install the readxl package
install.packages("readxl")

# To install the impute package
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("impute")

2.2 Manual installation from source

The package can be manually installed from source by opening the package’s page in CRAN and then proceeding as follows:

  • Download MetChem.tar.gz and save it to your hard disk
  • Open a shell/terminal/command prompt window and change to the desired directory for installation of MetChem.tar.gz. Enter R CMD INSTALL MetChem.tar.gz to install the package. Note that this may require additional software on some platforms. Windows requires Rtools2 to be installed and to be available in the default search path (environment variable PATH). MAC OS X requires installation of Xcode developers and command line tools.

2.3 Compatibility issues

The package downloadable from CRAN was built using R version, R.4.2.1. The package should work without major issues on R versions > 3.5.0 and KODAMA package >= 2.3.

3 Getting Started

The R package MetChem depends by the R package rcdk, which is partially implemented in Java. In Windows environment, we reported an issue related to the uploading of rcdk package. It can be solved running the following code before loading the package MetChem.

replacement <- function(category = "LC_ALL") {
  if (identical(category, "LC_MESSAGES"))
    return("")
  category <- match(category, .LC.categories)
  if (is.na(category)) 
    stop("invalid 'category' argument")
  .Internal(Sys.getlocale(category))

}
base <- asNamespace("base")
environment(replacement) <- base
unlockBinding("Sys.getlocale", base)
assign("Sys.getlocale", replacement, envir = base)
lockBinding("Sys.getlocale", base)

To load the package, enter the following command in your R session:

library("MetChem")

If this command terminates without any error messages, the package is installed successfully. The MetChem package is now ready for use.

The package includes both a user manual (this document) and a reference manual (help pages for each function). To view the user manual, enter vignette("MetChem"). Help pages can be viewed using the command help(package="MetChem").

4 Example 1: murine prostate tissues metabolic profile

Here, we introduce an example for the analysis of metabolic structural information using MetChem package. For this, we used a data set of mass spectrometry dataset obtained from murine prostate tissue samples reported by Labbé and Zadra (2019) (Supplementary Data 2). The metabolic data are obtained from ventral prostate tissues of mice that overexpress a human c-MYC transgene (MYC) in the prostate epithelium and wild-type littermates (WT). Mice were fed either a high fat diet (HFD; 60% kcal from fat; lard—rich in saturated fat) or a control diet (CTD; 10% kcal from fat). The data set includes six replicates for each group (, WT_CTD, MYC_CTD, WT_HFD, and MYC_HFD). To begin, download the data from the Labbé and Zadra (2019) study. Download it and save it to your hard disk. Metabolomic data is extracted using the instructions below. Data is then imputed using a k-nearest neighbour (kNN) algorithm using the function impute as described in the publication.

require("readxl")
require("impute")
d=as.data.frame(read_excel("41467_2019_12298_MOESM5_ESM.xlsx",skip = 3))
d=d[1:414,]
rownames(d)=d[,"Metabolite"]
met=d[,4:27]
label=rep(c("WT_CTD","MYC_CTD","WT_HFD","MYC_HFD"),each=6)
label_MYC=rep(c("WT","MYC","WT","MYC"),each=6)
colnames(met)=paste(label,1:6)
met=data.matrix(met)
met=impute.knn(met,k=5)$data

Heatmap visualization is generated using the function pheatmap. Metabolites are hierarchically clustered according to their relative concentration The hierarchical clustering is performed using the distance matrix based on the KODAMA dimensions. KODAMA is a learning algorithm for unsupervised feature extraction specifically designed for analyzing noisy and high-dimensional data sets (Cacciatore , 2014), implemented in the R package KODAMA (Cacciatore , 2017). Additional information can be found in the review of Zinga , 2023.

require("pheatmap")
require("RColorBrewer")

my_colour1 = list(genotype=c(MYC="#000000ff",WT="#eeeeeeff"),
                  group=c(MYC_CTD="#373898ff",MYC_HFD="#c11630ff",
                          WT_CTD="#00a4cfff",WT_HFD="#e40a81ff"))

set.seed(1)
kk1=KODAMA.matrix(t(met))
col=KODAMA.visualization(kk1)
hcol=hclust(dist(col),method="ward.D")

kk2=KODAMA.matrix(scale(met))
row=KODAMA.visualization(kk2)
hrow=hclust(dist(row),method="ward.D")
my_sample_col <- data.frame(group = label,genotype=label_MYC)
row.names(my_sample_col) <- colnames(met)

pheatmap(met,
         cluster_cols = hcol,
         cluster_rows = hrow,
         labels_row = rep("",nrow(met)),
         annotation_col = my_sample_col, 
         annotation_colors = my_colour1,
         color = colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(100))

To analyze the chemical similarities among metabolites, we need the Simplified Molecular-Input Line-Entry System (SMILES) of each metabolite is obtained. The SMILES of the previous data set is stored in the list HFD that can be loaded using the function data(HFD). MetChem package includes the modules.detection function based on KODAMA analysis. This function repeats the follwing steps (10 times as default): i) transformation of the chemical structure dissimilarity matrix in a multidimensional space (with 50 dimensions as defaults) using multidimensional scaling; ii) KODAMA features extraction; iii) hierarchical clustering based on the KODAMA output; iv) Calculation of the silhoutte index from different number of clusters (from 2 to 30 as default). The average of the siloutte index is calculated for each cluster numbers to identity the optimal cluster number.

data(HFD)
met=met[rownames(HFD),]
SMILES=HFD$SMILES
names(SMILES)=rownames(HFD)
clu=modules.detection(SMILES)
plot(clu$min_nc:clu$max_nc,clu$silhouette,type="l",
     ylab="Rousseeuw's Silhouette index",xlab="Number of clusters")
abline(v=5*(1:6),lty=2)

print(clu$main_cluster)

Based on the average Silhouette indeces, we have the optimal number of cluster for this data set is 8. Below is shown a graphical visualization of the final output of KODAMA. Each cluster is represented by a different color code. Each dot represents a different metabolite. Metabolites that are located near to each other share a similar chemical structure.

plot(clu$visualization,pch=21,bg=rainbow(8,alpha = 0.7)[clu$clusters[,"Clusters 8"]],cex=2)
legend(-30, 20, legend=paste("Cluster", unique(clu$clusters[,"Clusters 8"])),
   col= rainbow(8,alpha = 0.7), pch= 16, cex=1)
   

The following line code can be used to identify the points on the scatter plot. The cluster belonging and chemical name of the selected points will be displayed. ESC key to terminate the command.

data.frame(metabolite=rownames(met),
           cluster=clu$clusters[,"Clusters 8"])[identify(clu$visualization),]

A new heatmap is generated where metabolites are clustered according to their chemical similarity.

my_colour2 = list(cluster = c("1"=rainbow(8,alpha = 1)[1],
                              "2"=rainbow(8,alpha = 1)[2],
                              "3"=rainbow(8,alpha = 1)[3],
                              "4"=rainbow(8,alpha = 1)[4],
                              "5"=rainbow(8,alpha = 1)[5],
                              "6"=rainbow(8,alpha = 1)[6],
                              "7"=rainbow(8,alpha = 1)[7],
                              "8"=rainbow(8,alpha = 1)[8]),
                  genotype=c(MYC="#000000ff",WT="#eeeeeeff"),
                  group=c(MYC_CTD="#373898ff",MYC_HFD="#c11630ff",
                          WT_CTD="#00a4cfff",WT_HFD="#e40a81ff"))

clusters8=clu$clusters[,"Clusters 8"]
my_sample_row <- data.frame(cluster = as.character(clusters8))
row.names(my_sample_row) <- rownames(met)

set.seed(1)
met=met[rownames(HFD),] 
kk1=KODAMA.matrix(t(met))
col=KODAMA.visualization(kk1)
hcol=hclust(dist(col),method="ward.D")
hrow=clu$hclust

my_sample_col <- data.frame(group = label,genotype=label_MYC)
row.names(my_sample_col) <- colnames(met)

pheatmap(met,
         cluster_cols = hcol,
         cluster_rows = hrow,
         labels_row = rep("",nrow(met)), 
         annotation_colors = my_colour2,
         annotation_col = my_sample_col, 
         annotation_row = my_sample_row, 
         cutree_rows = 8, 
         color = colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(100))

We next build a heatmap of the metabolite belonging to the cluster 4.

sel=clu$clusters[,"Clusters 8"]==4
met.sel=met[sel,] 
my_sample_col <- data.frame(group = label,genotype=label_MYC)
row.names(my_sample_col) <- colnames(met)
oo=order(row.names(my_sample_col))
my_sample_col=my_sample_col[oo,]
met.sel=met.sel[,oo]
hrow.sel=hclust(dist(clu$visualization[sel,]),method="ward.D") 

pheatmap(met.sel,fontsize = 7,
         cluster_cols = FALSE,
         cluster_rows = hrow.sel,
         annotation_col = my_sample_col, 
         annotation_colors = my_colour1,
         color = colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(100))
         
         

In the next step, we apply the Weighted Metabolite Chemical Similarity Analysis (WMCSA) on all branches of the hierarchical clustering performed on the cluster 4. WMCSA is implemented in the function WMCSA. This function summarizes the relative concentration of metabolites within each module. Each module is defined according to the chemical similarity.

set.seed(1)
my_sample_col <- data.frame(group = label,genotype=label_MYC)
row.names(my_sample_col) <- colnames(met)
cl=allbranches(hrow.sel)
ww=WMCSA(met.sel,cl)

hrow=hclust(dist(ww),method="ward.D")

pheatmap(ww,
         cluster_cols = FALSE,
         cluster_rows = hrow,
         annotation_col = my_sample_col, 
         annotation_colors = my_colour1,
         color = colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(100))

Differential analysis of the relevant modules can be performed using the function multi_analysis present in the R package KODAMA. In the example below, we perform a differential analysis between MYC transgenic mice fed with high-fat diet or control diet named as MYC_HFD and MYC_CTD, respectively.

label=rep(c("MYC","WT"),each=12)
multi_analysis(t(ww),label)
Feature MYC WT p-value FDR
Mod1, median [IQR] -2.919 [-3.833 -1.536] 2.938 [2.195 3.563] 3.66e-05 7.84e-05
Mod2, median [IQR] -2.17 [-2.638 -1.706] 2.237 [1.98 2.478] 3.66e-05 7.84e-05
Mod3, median [IQR] -1.798 [-2.033 -1.215] 1.409 [1.015 2.228] 3.66e-05 7.84e-05
Mod4, median [IQR] -2.45 [-3.037 -1.381] 2.139 [1.776 2.827] 3.66e-05 7.84e-05
Mod5, median [IQR] -1.902 [-2.171 -1.425] 1.688 [1.563 1.941] 3.66e-05 7.84e-05
Mod6, median [IQR] -1.772 [-3.022 -1.119] 1.701 [0.742 2.204] 2.46e-04 4.10e-04
Mod7, median [IQR] 0.886 [-1.181 2.231] -0.329 [-1.224 0.379] 4.03e-01 4.64e-01
Mod8, median [IQR] -1.674 [-2.485 -0.656] 1.248 [0.609 1.995] 3.84e-04 5.76e-04
Mod9, median [IQR] 1.609 [0.961 2.06] -1.412 [-1.706 -1.203] 4.69e-05 8.80e-05
Mod10, median [IQR] -0.84 [-1.956 0.087] 0.8 [-0.148 1.161] 4.64e-02 6.33e-02
Mod11, median [IQR] -0.023 [-0.639 1.15] -0.583 [-1.229 0.027] 1.75e-01 2.19e-01
Mod12, median [IQR] -1.347 [-1.928 -0.681] 1.305 [0.711 1.736] 3.66e-05 7.84e-05
Mod13, median [IQR] 0.757 [-0.717 1.532] -0.004 [-1.003 0.6] 5.07e-01 5.43e-01
Mod14, median [IQR] 1.691 [0.59 1.952] -1.389 [-1.662 -0.889] 3.66e-05 7.84e-05
Mod15, median [IQR] 0.196 [-1.074 0.842] 0.388 [-0.853 0.638] 7.95e-01 7.95e-01

The function readMet connects R to the HMDB database 3. to retrieve chemical and functional information of each metabolite. This can be summarized using different functions: substituentsMet, diseaseMet, enzymeMet, pathwaysMet, taxonomyMet. The function features associates the most prominent features to each module.

In this example, we characterized the modules by functional group.

HMDB=HFD$HMDB
names(HMDB)=rownames(HFD)
doc=readMet(HMDB)
cla1=substituentsMet(doc)
substituents=features(doc,cla1,cl)
cla2=enzymesMet(doc)
enzymes=features(doc,cla2,cl)

To visualize the metabolite of , for example, module 12, we can type:

cl[[12]]

Fisher’s test was used to rank the association of each module to the metabolite information. Below are reported the p-value for associations of substituents with the module 12.

substituents[[12]]
Substituents p-value
Histidine or derivatives 1.44e-06
Imidazolyl carboxylic acid derivative 1.21e-05
Hybrid peptide 4.97e-05
N-acyl-alpha amino acid or derivatives 1.21e-03
Imidazole 1.31e-03
Azole 1.53e-03
Aromatic heteromonocyclic compound 1.77e-03
N-acyl-alpha-amino acid 2.33e-03
Beta amino acid or derivatives 8.62e-03
Organoheterocyclic compound 1.18e-02
Heteroaromatic compound 1.47e-02
Carboximidic acid 1.91e-02
Carboximidic acid derivative 2.04e-02
Azacycle 2.05e-02
Organic 1,3-dipolar compound 2.32e-02
Propargyl-type 1,3-dipolar organic compound 2.32e-02
Alpha-amino acid or derivatives 3.95e-02
Amino acid or derivatives 4.13e-02

Here are reported the p-value for associations of metabolite-related enzymes with the module 12.

enzymes[[12]]
Substituents p-value
CNDP1 2.97e-04
CARNS1 1.38e-03
VIM 8.62e-03
HSPA1A 8.62e-03
SLC15A2 8.62e-03
MPO 2.57e-02

4 How to Cite this Package

Ebtesam Abdel-Shafy, Tadele Melak, David A. MacIntyre, Giorgia Zadra, Luiz F. Zerbini, Silvano Piazza, and Stefano Cacciatore Publication in submission

To obtain BibTex entries of the two references, you can enter the following into your R session to Bibtex citation("MetChem").

5 References

Cacciatore S, Luchinat C, Tenori L. Knowledge discovery by accuracy maximization. 2014; 111: 5117-22.

Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA (2017) KODAMA: an R package for knowledge discovery and data mining. 2017; 33(4): 621-623.

Labbé DP, Zadra G, Yang M, Reyes JM, Lin CY, Cacciatore S, Ebot EM, Creech AL, Giunchi F, Fiorentino M, Elfandy H, Syamala S, Karoly ED, Alshalalfa M, Erho N, Ross A, Schaeffer EM, Gibb EA, Takhar M, Den RB, Lehrer J, Karnes RJ, Freedland SJ, Davicioni E, Spratt DE, Ellis L, Jaffe JD, D’Amico AV, Kantoff PW, Bradner JE, Mucci LA, Chavarro JE, Loda M, Brown M. High-fat diet fuels prostate cancer progression by rewiring the metabolome and amplifying the MYC program. 2019; 10: 4358.

Zinga MM, Abdel-Shafy E, Melak T, Vignoli A, Piazza S, Zerbini LF, Tenori L, Cacciatore S. KODAMA exploratory analysis in metabolic phenotyping. 2023; 9: 1436.