--- title: "A new pipeline to explore structural similarity across metabolite modules" author: | | Ebtesam Abdel-Shafy, Tadele Melak, David A. MacIntyre, | Giorgia Zadra, Luiz F. Zerbini, Silvano Piazza, and Stefano Cacciatore date: "`r Sys.Date()`" output: pdf_document: highlight: null number_sections: no fig_caption: yes word_document: default vignette: | %\VignetteIndexEntry{A simple pipeline to explore structural similarities between metabolites} %\VignetteKeywords{MetChem} %\VignettePackage{MetChem} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::knitr} --- ## 1 Introduction MetChem is an R package used to perform structural and functional analysis of metabolites using a simple pipeline. ## 2 Installation ### 2.1 Installation via CRAN The R package MetChem (current version 0.2) is part of the Comprehensive R Archive Network (CRAN)^[https://cran.r-project.org/]. The simplest way to install the package is to enter the following command into your R session: `install.packages("MetChem")`. We suggest installing the following R packages: `pheatmap` and `RColorBrewer` to enable data visualization in heatmaps, `readxl` for the data reading of Excel files, and `impute` for the imputation of missing data. # To install the pheatmap package install.packages("pheatmap") # To install the RColorBrewer package install.packages("RColorBrewer") # To install the readxl package install.packages("readxl") # To install the impute package if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("impute") ### 2.2 Manual installation from source The package can be manually installed from source by opening the package's page in CRAN and then proceeding as follows: - Download MetChem.tar.gz and save it to your hard disk - Open a shell/terminal/command prompt window and change to the desired directory for installation of MetChem.tar.gz. Enter `R CMD INSTALL MetChem.tar.gz` to install the package. Note that this may require additional software on some platforms. Windows requires Rtools^[https://developer.apple.com/xcode/] to be installed and to be available in the default search path (environment variable PATH). MAC OS X requires installation of Xcode developers and command line tools. ### 2.3 Compatibility issues The package downloadable from CRAN was built using R version, R.4.2.1. The package should work without major issues on R versions > 3.5.0 and KODAMA package >= 2.3. ## 3 Getting Started The R package MetChem depends by the R package rcdk, which is partially implemented in Java. In Windows environment, we reported an issue related to the uploading of rcdk package. It can be solved running the following code before loading the package MetChem. replacement <- function(category = "LC_ALL") { if (identical(category, "LC_MESSAGES")) return("") category <- match(category, .LC.categories) if (is.na(category)) stop("invalid 'category' argument") .Internal(Sys.getlocale(category)) } base <- asNamespace("base") environment(replacement) <- base unlockBinding("Sys.getlocale", base) assign("Sys.getlocale", replacement, envir = base) lockBinding("Sys.getlocale", base) To load the package, enter the following command in your R session: library("MetChem") If this command terminates without any error messages, the package is installed successfully. The MetChem package is now ready for use. The package includes both a user manual (this document) and a reference manual (help pages for each function). To view the user manual, enter `vignette("MetChem")`. Help pages can be viewed using the command `help(package="MetChem")`. \newpage ## 4 Example 1: murine prostate tissues metabolic profile Here, we introduce an example for the analysis of metabolic structural information using MetChem package. For this, we used a data set of mass spectrometry dataset obtained from murine prostate tissue samples reported by Labbé and Zadra \emph{et al.} (2019) (Supplementary Data 2). The metabolic data are obtained from ventral prostate tissues of mice that overexpress a human c-MYC transgene (MYC) in the prostate epithelium and wild-type littermates (WT). Mice were fed either a high fat diet (HFD; 60% kcal from fat; lard—rich in saturated fat) or a control diet (CTD; 10% kcal from fat). The data set includes six replicates for each group (\emph{i.e.}, WT_CTD, MYC_CTD, WT_HFD, and MYC_HFD). To begin, download the data from the Labbé and Zadra (2019) study. Download it and save it to your hard disk. Metabolomic data is extracted using the instructions below. Data is then imputed using a k-nearest neighbour (kNN) algorithm using the function `impute` as described in the publication. require("readxl") require("impute") d=as.data.frame(read_excel("41467_2019_12298_MOESM5_ESM.xlsx",skip = 3)) d=d[1:414,] rownames(d)=d[,"Metabolite"] met=d[,4:27] label=rep(c("WT_CTD","MYC_CTD","WT_HFD","MYC_HFD"),each=6) label_MYC=rep(c("WT","MYC","WT","MYC"),each=6) colnames(met)=paste(label,1:6) met=data.matrix(met) met=impute.knn(met,k=5)$data Heatmap visualization is generated using the function `pheatmap`. Metabolites are hierarchically clustered according to their relative concentration The hierarchical clustering is performed using the distance matrix based on the KODAMA dimensions. KODAMA is a learning algorithm for unsupervised feature extraction specifically designed for analyzing noisy and high-dimensional data sets (Cacciatore \emph{et al.}, 2014), implemented in the R package `KODAMA` (Cacciatore \emph{et al.}, 2017). Additional information can be found in the review of Zinga \emph{et al.}, 2023. require("pheatmap") require("RColorBrewer") my_colour1 = list(genotype=c(MYC="#000000ff",WT="#eeeeeeff"), group=c(MYC_CTD="#373898ff",MYC_HFD="#c11630ff", WT_CTD="#00a4cfff",WT_HFD="#e40a81ff")) set.seed(1) kk1=KODAMA.matrix(t(met)) col=KODAMA.visualization(kk1) hcol=hclust(dist(col),method="ward.D") kk2=KODAMA.matrix(scale(met)) row=KODAMA.visualization(kk2) hrow=hclust(dist(row),method="ward.D") my_sample_col <- data.frame(group = label,genotype=label_MYC) row.names(my_sample_col) <- colnames(met) pheatmap(met, cluster_cols = hcol, cluster_rows = hrow, labels_row = rep("",nrow(met)), annotation_col = my_sample_col, annotation_colors = my_colour1, color = colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(100)) \begin{figure}[h] \centerline{\includegraphics[width=12cm]{KODAMApaper.pdf}} \caption{Heatmap of metabolites with hierarchical clustering based on their concentratrion.} \label{fig:F1} \end{figure} \newpage To analyze the chemical similarities among metabolites, we need the Simplified Molecular-Input Line-Entry System (SMILES) of each metabolite is obtained. The SMILES of the previous data set is stored in the list `HFD` that can be loaded using the function `data(HFD)`. `MetChem` package includes the `modules.detection` function based on KODAMA analysis. This function repeats the follwing steps (10 times as default): i) transformation of the chemical structure dissimilarity matrix in a multidimensional space (with 50 dimensions as defaults) using multidimensional scaling; ii) KODAMA features extraction; iii) hierarchical clustering based on the KODAMA output; iv) Calculation of the silhoutte index from different number of clusters (from 2 to 30 as default). The average of the siloutte index is calculated for each cluster numbers to identity the optimal cluster number. data(HFD) met=met[rownames(HFD),] SMILES=HFD$SMILES names(SMILES)=rownames(HFD) clu=modules.detection(SMILES) plot(clu$min_nc:clu$max_nc,clu$silhouette,type="l", ylab="Rousseeuw's Silhouette index",xlab="Number of clusters") abline(v=5*(1:6),lty=2) print(clu$main_cluster) \begin{figure}[h] \centerline{\includegraphics[width=8cm]{FLindex.pdf}} \caption{Silhouette index.} \label{fig:F2} \end{figure} \newpage Based on the average Silhouette indeces, we have the optimal number of cluster for this data set is 8. Below is shown a graphical visualization of the final output of KODAMA. Each cluster is represented by a different color code. Each dot represents a different metabolite. Metabolites that are located near to each other share a similar chemical structure. plot(clu$visualization,pch=21,bg=rainbow(8,alpha = 0.7)[clu$clusters[,"Clusters 8"]],cex=2) legend(-30, 20, legend=paste("Cluster", unique(clu$clusters[,"Clusters 8"])), col= rainbow(8,alpha = 0.7), pch= 16, cex=1) The following line code can be used to identify the points on the scatter plot. The cluster belonging and chemical name of the selected points will be displayed. ESC key to terminate the command. data.frame(metabolite=rownames(met), cluster=clu$clusters[,"Clusters 8"])[identify(clu$visualization),] \begin{figure}[h] \centerline{\includegraphics[width=12cm]{KODAMA.pdf}} \caption{KODAMA plot.} \label{fig:F3} \end{figure} \newpage A new heatmap is generated where metabolites are clustered according to their chemical similarity. my_colour2 = list(cluster = c("1"=rainbow(8,alpha = 1)[1], "2"=rainbow(8,alpha = 1)[2], "3"=rainbow(8,alpha = 1)[3], "4"=rainbow(8,alpha = 1)[4], "5"=rainbow(8,alpha = 1)[5], "6"=rainbow(8,alpha = 1)[6], "7"=rainbow(8,alpha = 1)[7], "8"=rainbow(8,alpha = 1)[8]), genotype=c(MYC="#000000ff",WT="#eeeeeeff"), group=c(MYC_CTD="#373898ff",MYC_HFD="#c11630ff", WT_CTD="#00a4cfff",WT_HFD="#e40a81ff")) clusters8=clu$clusters[,"Clusters 8"] my_sample_row <- data.frame(cluster = as.character(clusters8)) row.names(my_sample_row) <- rownames(met) set.seed(1) met=met[rownames(HFD),] kk1=KODAMA.matrix(t(met)) col=KODAMA.visualization(kk1) hcol=hclust(dist(col),method="ward.D") hrow=clu$hclust my_sample_col <- data.frame(group = label,genotype=label_MYC) row.names(my_sample_col) <- colnames(met) pheatmap(met, cluster_cols = hcol, cluster_rows = hrow, labels_row = rep("",nrow(met)), annotation_colors = my_colour2, annotation_col = my_sample_col, annotation_row = my_sample_row, cutree_rows = 8, color = colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(100)) \begin{figure}[h] \centerline{\includegraphics[width=12cm]{heatmap.pdf}} \caption{Heatmap of metabolites with vertical hierarchical clustering based on their molecular structure.} \label{fig:F4} \end{figure} \newpage We next build a heatmap of the metabolite belonging to the cluster 4. sel=clu$clusters[,"Clusters 8"]==4 met.sel=met[sel,] my_sample_col <- data.frame(group = label,genotype=label_MYC) row.names(my_sample_col) <- colnames(met) oo=order(row.names(my_sample_col)) my_sample_col=my_sample_col[oo,] met.sel=met.sel[,oo] hrow.sel=hclust(dist(clu$visualization[sel,]),method="ward.D") pheatmap(met.sel,fontsize = 7, cluster_cols = FALSE, cluster_rows = hrow.sel, annotation_col = my_sample_col, annotation_colors = my_colour1, color = colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(100)) \begin{figure}[h] \centerline{\includegraphics[width=12cm]{KODAMAselection.pdf}} \caption{Heatmap} \label{fig:F6} \end{figure} \newpage In the next step, we apply the Weighted Metabolite Chemical Similarity Analysis (WMCSA) on all branches of the hierarchical clustering performed on the cluster 4. WMCSA is implemented in the function `WMCSA`. This function summarizes the relative concentration of metabolites within each module. Each module is defined according to the chemical similarity. set.seed(1) my_sample_col <- data.frame(group = label,genotype=label_MYC) row.names(my_sample_col) <- colnames(met) cl=allbranches(hrow.sel) ww=WMCSA(met.sel,cl) hrow=hclust(dist(ww),method="ward.D") pheatmap(ww, cluster_cols = FALSE, cluster_rows = hrow, annotation_col = my_sample_col, annotation_colors = my_colour1, color = colorRampPalette(rev(brewer.pal(n = 11, name ="RdBu")))(100)) \begin{figure}[h] \centerline{\includegraphics[width=12cm]{KODAMAall.pdf}} \caption{Heatmap of the output of WMCSA.} \label{fig:F5} \end{figure} \newpage Differential analysis of the relevant modules can be performed using the function `multi_analysis` present in the R package `KODAMA`. In the example below, we perform a differential analysis between MYC transgenic mice fed with high-fat diet or control diet named as MYC_HFD and MYC_CTD, respectively. label=rep(c("MYC","WT"),each=12) multi_analysis(t(ww),label) ```{r, echo=FALSE, results='asis'} Feature=c( "Mod1, median [IQR]", "Mod2, median [IQR]", "Mod3, median [IQR]", "Mod4, median [IQR]" , "Mod5, median [IQR]" , "Mod6, median [IQR]", "Mod7, median [IQR]", "Mod8, median [IQR]", "Mod9, median [IQR]", "Mod10, median [IQR]", "Mod11, median [IQR]", "Mod12, median [IQR]", "Mod13, median [IQR]", "Mod14, median [IQR]", "Mod15, median [IQR]") MYC=c( "-2.919 [-3.833 -1.536]", "-2.17 [-2.638 -1.706]","-1.798 [-2.033 -1.215]", "-2.45 [-3.037 -1.381]", "-1.902 [-2.171 -1.425]", "-1.772 [-3.022 -1.119]", "0.886 [-1.181 2.231]" , "-1.674 [-2.485 -0.656]", "1.609 [0.961 2.06]", "-0.84 [-1.956 0.087]", "-0.023 [-0.639 1.15]", "-1.347 [-1.928 -0.681]", "0.757 [-0.717 1.532]", "1.691 [0.59 1.952]" , "0.196 [-1.074 0.842]" ) WT=c("2.938 [2.195 3.563]", "2.237 [1.98 2.478]" , "1.409 [1.015 2.228]" , "2.139 [1.776 2.827]" , "1.688 [1.563 1.941]" , "1.701 [0.742 2.204]", "-0.329 [-1.224 0.379]", "1.248 [0.609 1.995]", "-1.412 [-1.706 -1.203]", "0.8 [-0.148 1.161]" , "-0.583 [-1.229 0.027]", "1.305 [0.711 1.736]" , "-0.004 [-1.003 0.6]" , "-1.389 [-1.662 -0.889]", "0.388 [-0.853 0.638]" ) pvalue=c( "3.66e-05" ,"3.66e-05", "3.66e-05" ,"3.66e-05", "3.66e-05", "2.46e-04", "4.03e-01", "3.84e-04", "4.69e-05", "4.64e-02", "1.75e-01" , "3.66e-05", "5.07e-01", "3.66e-05" ,"7.95e-01") FDR=c( "7.84e-05","7.84e-05", "7.84e-05", "7.84e-05", "7.84e-05", "4.10e-04", "4.64e-01", "5.76e-04", "8.80e-05", "6.33e-02" ,"2.19e-01", "7.84e-05", "5.43e-01", "7.84e-05", "7.95e-01") da=data.frame(Feature=Feature,MYC=MYC,WT=WT,'p-value'=pvalue,FDR=FDR,check.names = FALSE) knitr::kable(da, align = "lcccc") ``` \newpage \newpage The function `readMet` connects R to the HMDB database ^[https://www.hmdb.ca]. to retrieve chemical and functional information of each metabolite. This can be summarized using different functions: `substituentsMet`, `diseaseMet`, `enzymeMet`, `pathwaysMet`, `taxonomyMet`. The function `features` associates the most prominent features to each module. In this example, we characterized the modules by functional group. HMDB=HFD$HMDB names(HMDB)=rownames(HFD) doc=readMet(HMDB) cla1=substituentsMet(doc) substituents=features(doc,cla1,cl) cla2=enzymesMet(doc) enzymes=features(doc,cla2,cl) To visualize the metabolite of , for example, module 12, we can type: cl[[12]] Fisher's test was used to rank the association of each module to the metabolite information. Below are reported the p-value for associations of substituents with the module 12. substituents[[12]] ```{r, echo=FALSE, results='asis'} pval=c("1.44e-06","1.21e-05","4.97e-05","1.21e-03","1.31e-03","1.53e-03","1.77e-03", "2.33e-03", "8.62e-03", "1.18e-02", "1.47e-02", "1.91e-02", "2.04e-02", "2.05e-02", "2.32e-02", "2.32e-02", "3.95e-02", "4.13e-02") nam=c( "Histidine or derivatives","Imidazolyl carboxylic acid derivative","Hybrid peptide", "N-acyl-alpha amino acid or derivatives","Imidazole","Azole", "Aromatic heteromonocyclic compound","N-acyl-alpha-amino acid","Beta amino acid or derivatives", "Organoheterocyclic compound","Heteroaromatic compound","Carboximidic acid", "Carboximidic acid derivative","Azacycle","Organic 1,3-dipolar compound", "Propargyl-type 1,3-dipolar organic compound","Alpha-amino acid or derivatives","Amino acid or derivatives" ) da=data.frame(Substituents=nam,'p-value'=pval,check.names = FALSE) knitr::kable(da, align = "lc") ``` \newpage Here are reported the p-value for associations of metabolite-related enzymes with the module 12. enzymes[[12]] ```{r, echo=FALSE, results='asis'} pval=c("2.97e-04", "1.38e-03", "8.62e-03", "8.62e-03", "8.62e-03", "2.57e-02") nam=c( "CNDP1","CARNS1", "VIM" , "HSPA1A" , "SLC15A2", "MPO" ) da=data.frame(Substituents=nam,'p-value'=pval,check.names = FALSE) knitr::kable(da, align = "lc") ``` ## 4 How to Cite this Package Ebtesam Abdel-Shafy, Tadele Melak, David A. MacIntyre, Giorgia Zadra, Luiz F. Zerbini, Silvano Piazza, and Stefano Cacciatore Publication in submission To obtain BibTex entries of the two references, you can enter the following into your R session to Bibtex `citation("MetChem")`. ## 5 References Cacciatore S, Luchinat C, Tenori L. Knowledge discovery by accuracy maximization. \emph{Proc Natl Acad Sci USA} 2014; 111: 5117-22. Cacciatore S, Tenori L, Luchinat C, Bennett PR, MacIntyre DA (2017) KODAMA: an R package for knowledge discovery and data mining. \emph{Bioinformatics} 2017; 33(4): 621-623. Labbé DP, Zadra G, Yang M, Reyes JM, Lin CY, Cacciatore S, Ebot EM, Creech AL, Giunchi F, Fiorentino M, Elfandy H, Syamala S, Karoly ED, Alshalalfa M, Erho N, Ross A, Schaeffer EM, Gibb EA, Takhar M, Den RB, Lehrer J, Karnes RJ, Freedland SJ, Davicioni E, Spratt DE, Ellis L, Jaffe JD, D'Amico AV, Kantoff PW, Bradner JE, Mucci LA, Chavarro JE, Loda M, Brown M. High-fat diet fuels prostate cancer progression by rewiring the metabolome and amplifying the MYC program. \emph{Nat Commun} 2019; 10: 4358. Zinga MM, Abdel-Shafy E, Melak T, Vignoli A, Piazza S, Zerbini LF, Tenori L, Cacciatore S. KODAMA exploratory analysis in metabolic phenotyping. \emph{Front Mol Biosci} 2023; 9: 1436.