Cite this as: Ağraz M, Ağyüz U, Welch EC, Kuyumcu B, Burak MF (2022) Machine learning characterization of a novel panel for metastatic prediction in breast cancer. Glob J Perioperative Med 6(1): 005-011. DOI: 10.17352/gjpm.000011
Copyright License: © 2022 Ağraz M, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Metastasis is one of the most challenging problems in cancer diagnosis and treatment, as causal factors have yet to be fully disentangled. Prediction of the metastatic status of breast cancer is important for informing treatment protocols and reducing mortality. However, the systems biology behind metastasis is complex and driven by a variety of interacting factors. Furthermore, the prediction of cancer metastasis is a challenging task due to the variation in parameters and conditions specific to individual patients and mutation subtypes.
In this paper, we apply tree-based machine learning algorithms for gene expression data analysis in the estimation of metastatic potentials within a group of 490 breast cancer patients. Tree-based machine learning algorithms including decision trees, gradient boosting, and extremely randomized trees are used to assess the variable importance of different genes in breast cancer metastasis.
Highly accurate values were obtained from all three algorithms, with the gradient boosting method having the highest accuracy at 0.8901. The most significant ten genetic variables and fifteen gene functions in metastatic progression were identified. Respective importance scores and biological functions were also cataloged. Key genes in metastatic breast cancer progression include but are not limited to CD8, PB1, and THP-1.
Metastasis begins with the displacement of tumor cells from the primary tumor. Circulating tumor cells (CTCs) move through the vascular system to a distant organ. There, they colonize the new environment, forming a new tumor.
Metastasis is one of the most complex and challenging problems in the cancer field, as its main causes are multifaceted and not yet well understood. Additionally, it is strongly correlated with mortality, making it the most critical area in need of research within the field of cancer diagnostics.
Metastasis begins with the loss of cell-to-cell and cell-to-matrix adhesion. This facilitates local infiltration of tumor cells into adjacent tissues as well as transendothelial migration into vessels via the process of intravasation. Cancer cells must transform themselves from epithelial cells into mesenchymal cells, a process known as epithelial-to-mesenchymal transition (EMT). This process is characterized by the loss of cellular adhesive properties and polarity with a simultaneous gain of other properties that enable CTCs to migrate to distant organs, extravasate, proliferate, and colonize a discrete competent organ. The other major factors for metastasis are cell adhesion defects, angiogenesis, and disrupted cell signaling and metabolism.
Disrupted cell signaling interrupts foreign recognition responses, allowing cancer cells to pass through the circulation without being recognized by the immune system. In addition, CTCs can evade immune recognition by mimicking peripheral immune tolerance, as recently detailed by Gonzalez et al.
CTCs have abnormal gene expression characteristics that differ from those of the primary tumor and help improve their survival in circulation. Survivin is a major member of the inhibitor of apoptosis (IAP) family, and it facilitates the escape of tumor cells from immune recognition by blocking the cytotoxicity of NK cells and PD-L1. It can also mediate regulatory T cells (Tregs) to play a role in immunosuppression.
Metastasis has most frequently been investigated in late-stage metastatic tumors, the products of colonization of discrete regions. It is still unclear how metastatic mechanisms begin in the primary tumor in the early stages and how gene expression changes over time. This is important not only from the basic science perspective but from the diagnostic and predictive perspective as well. Cancer mortality can be reduced when appropriate anti-metastatic treatments are started earlier. Until that is possible, the inability to reliably characterize metastasis continues to drive cancer’s reputation as the most unpredictable and challenging illness to treat, resulting in lower-than-predicted survival times.
Machine learning is a combination of statistics and computer science that has become popular in recent years due to increases in computational power, data availability, and data quantity. Machine learning approaches have been used in different fields of bioscience, such as biological network representation, classification and diagnosis, medical status prediction, and more. This approach has recently become popular specifically in bioinformatics and cancer research.
As machine learning capabilities grow, predictive models have become increasingly accurate at determining cancer metastasis. For example, Huang, et al. used support vector machines (SVM) and SVM ensembles to predict breast cancer; Behravan, et al. predicted breast cancer risk using machine learning algorithms on genetic and demographic datasets; Xiaoa, et al. used deep learning in cancer prediction; Kadir and Gleeson implemented machine learning methods for the classification of lung cancer in images; and Azzawi conducted lung cancer prediction from microarray data. Decision trees are among the most popular non-parametric supervised classification algorithms. They classify the data in the form of an inverted tree that consists of a root node, internal nodes, and leaf nodes. The extremely randomized trees model is a tree-based ensemble model that was first introduced by Geurts, et al. in 2006. This algorithm is similar to the random forest model, which considers a random subset of K features when choosing the split at each node. However, unlike the random forest, the extremely randomized trees (ERT) model grows each tree from the whole learning sample rather than a bootstrap replica and draws the cut-point of each candidate split at random. The gradient boosting tree model is an ensemble technique thought to originate from the work of Breiman, which was later developed further by Friedman.
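The difference between a classic decision-tree split and an extremely randomized split can be illustrated with a minimal sketch in plain Python. The function names and the toy expression values below are hypothetical and chosen only for illustration; this is not the code used in the study, and a real ensemble would repeat this step over many features and many trees.

```python
import random

def gini(labels):
    """Gini impurity of a list of binary (0/1) class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity after splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)

def best_threshold(values, labels):
    """Classic decision tree: scan midpoints between sorted unique values."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))

def random_threshold(values, rng):
    """Extremely randomized trees: draw the cut-point uniformly at random."""
    return rng.uniform(min(values), max(values))

# Hypothetical expression values of one probe and metastasis labels.
expression = [2.1, 2.4, 2.9, 5.0, 5.3, 6.1]
metastatic = [0, 0, 0, 1, 1, 1]

t_best = best_threshold(expression, metastatic)
t_rand = random_threshold(expression, random.Random(0))
```

The optimized cut-point minimizes impurity for this one feature, whereas the random cut-point trades some per-split quality for lower variance once many such trees are averaged.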
Due to the success of this approach in predicting and classifying different forms of biological data, we have opted to apply this method herein to analyze the metastatic gene expression data from breast cancer patients using large, publicly available datasets. The dataset contains information on the expression of 23397 genes across 490 individuals. The full datasets also contain significant amounts of other information, including cancer type, tumor grade, and age. t-statistics and the Bayesian method were first applied to select important predictors. The differential expression of genes between the two groups, metastatic and non-metastatic, was subsequently analyzed and profiled. A Differential Gene Expression (DGE) analysis was performed between these two groups using R software. Using this analysis, 133 significant transcripts were detected with a >1.5-fold change. Significant dimensionality reduction was applied to simplify and better interpret the data. In the subsequent framework, the metastatic and non-metastatic expression profiles are further investigated using the previously mentioned machine learning models to determine significant metastatic predictors. Tree-based machine learning algorithms were first applied to the reduced candidate data following DGE. Variable importance was used to examine variable responses and thereby identify the variables that most influence breast cancer metastasis.
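As a rough illustration of the filtering step, the sketch below computes a Welch t-statistic and a fold-change check for a single probe. It assumes that expression values are on a log2 scale, so that a 1.5-fold change corresponds to an absolute difference of log2(1.5) between group means; that assumption, the function names, and the toy values are ours. It also uses an ordinary t-statistic and does not reproduce the Bayesian method or the R workflow used in the study.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

def log2_fold_change(a, b):
    """Difference of group means; inputs are assumed to be log2 intensities."""
    return mean(a) - mean(b)

def passes_fold_change(a, b, min_fold=1.5):
    """True when the absolute fold change between groups exceeds min_fold."""
    return abs(log2_fold_change(a, b)) > math.log2(min_fold)

# Hypothetical log2 expression values of one probe in each group.
met = [8.9, 9.1, 9.4, 9.0]
non_met = [8.0, 8.2, 7.9, 8.1]

t_stat = welch_t(met, non_met)
is_candidate = passes_fold_change(met, non_met)
```

In a full analysis this check would be applied to every transcript, with a multiple-testing correction on the resulting p-values.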
There are two main aims of this study. The first one is to show which of the tree-based algorithms is the most efficient in array analysis, and the second is to demonstrate which transcript outputs of these algorithms are the most significant both biologically and for future modeling approaches.
To address the first aim, data were processed by various machine learning methods to assess which method possesses the highest accuracy for this type of analysis. Decision trees, gradient boosting and extremely randomized trees were tested and compared. Each model was able to report separate variables with the highest value of metastatic predictive capability.
Two different cohort studies were merged to create the single dataset that was used in this study. Publicly available datasets GSE102484 and GSE20685 were downloaded from NCBI GEO Databank (https://www.ncbi.nlm.nih.gov/geo/). Both datasets were obtained from the same microarray chip platform, GPL570 [HG-U133 Plus2] Affymetrix Human Genome U133 Plus 2.0 Array. All cancer patients were diagnosed with breast cancer of clinical stages I-III. The first dataset, GSE102484, originates from a cohort study of invasive breast carcinoma patients who underwent surgery (n=683). Genomic data were obtained by whole RNA study from fresh frozen samples stored at a cancer center in Taiwan. These samples were obtained from total mastectomy and sentinel lymph node biopsy procedures. Any patient pretreated with chemotherapy or radiotherapy was excluded from this cohort.
A second dataset, GSE20685, was merged with the first. In this cohort study, genomic profiles were assessed from the whole RNA of fresh frozen samples obtained from patients diagnosed with and treated for breast cancer between 1991 and 2004. The samples were stored at the National Cancer Center Singapore. Centroid analysis was used to determine molecular subtypes of breast cancer (n=312).
These two datasets were combined in R. Mutual parameters and data points were selected for data alignment before the merge. A total of 44 transcripts with missing (NA) values in more than 30% of observations were excluded from the data. Custom R code, rather than built-in functions, was used for data manipulation and normalization.
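The alignment and missing-value filter described above can be sketched as follows. This is a plain-Python illustration, not the authors' R code: the dictionary layout (probe ID mapped to a list of per-sample values, with `None` standing in for NA), the function names, and the probe IDs are all assumptions made for the example.

```python
def merge_on_shared_probes(first, second):
    """Concatenate samples for probe IDs that are present in both cohorts."""
    shared = sorted(set(first) & set(second))
    return {probe_id: first[probe_id] + second[probe_id] for probe_id in shared}

def drop_sparse_transcripts(matrix, max_missing=0.30):
    """Keep transcripts whose fraction of missing values is at most max_missing."""
    kept = {}
    for probe_id, values in matrix.items():
        missing = sum(value is None for value in values)
        if missing / len(values) <= max_missing:
            kept[probe_id] = values
    return kept

# Hypothetical cohorts: probe ID -> expression values, None marks NA.
cohort_a = {"p1": [1.0, None], "p2": [2.0, 2.1], "p3": [3.0, 3.1]}
cohort_b = {"p2": [2.2], "p3": [None], "p4": [4.0]}

merged = merge_on_shared_probes(cohort_a, cohort_b)
filtered = drop_sparse_transcripts(merged)
```

Here `p1` and `p4` are dropped because they appear in only one cohort, and `p3` is dropped because one of its three values is missing, which exceeds the 30% limit.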
The Bioconductor RMA package and quantile normalization functions were applied for inter-array normalizations. After the combination, a consensus of 54643 transcripts was merged and preprocessed.
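For readers unfamiliar with quantile normalization, the following minimal sketch shows the core idea: each array is forced to share the same distribution by replacing its values with the mean value observed at each rank. The function name and the toy data are illustrative assumptions; ties are broken by position here, whereas the Bioconductor implementation used in the study handles ties and missing values more carefully.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equally long arrays (one list of values per sample)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean across samples of the value found at each rank.
    rank_means = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices from smallest to largest
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = rank_means[rank]
        normalized.append(out)
    return normalized

# Two hypothetical arrays measured on the same three probes.
raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(raw)
```

After normalization every array contains the same set of values, and only the assignment of those values to probes differs between arrays.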
Of the data, 80% was used to train the machine learning model, while 20% was used to test it. Five-fold cross-validation was applied, and the model was then trained with the decision tree, extremely randomized tree, and gradient boosting approaches. These machine learning approaches were selected because tree structures are powerful in modeling and these particular approaches are able to represent the variable importance of genes within the tree structures.
The precision, recall, F1-score, and accuracy were then calculated for each model as accuracy measures. All relevant equations, such as the formulas for the accuracy measures, can be seen in the supplementary document.
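The partitioning scheme can be sketched in a few lines of plain Python. The sketch assumes that the five folds are drawn from the 80% training portion, which the text does not state explicitly; the function names and the fixed random seed are likewise our own choices.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 490 patients, as in the dataset described in this paper.
train_idx, test_idx = train_test_split_indices(490)
cv_folds = list(k_fold_indices(train_idx, k=5))
```

Each model would be fitted on the `train` part of every fold and scored on the matching `validation` part, with the held-out test indices used only for the final evaluation.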
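For convenience, the standard definitions of these four measures are sketched below in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The function name and the toy labels are illustrative, with 1 assumed to denote a metastatic case; the supplementary document remains the reference for the exact formulas used in the study.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, F1-score, and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical true and predicted labels for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

In this toy case there are three true positives, three true negatives, one false positive, and one false negative, so all four measures equal 0.75.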
Gradient boosting produced the most accurate results of the three algorithms across all metrics tested: precision, recall, F1-score, and accuracy (Table 1). Table 1 shows that gradient boosting is best able to predict whether a patient has metastatic cancer from the input expression data. The precision, recall, F1-score, and accuracy were calculated as 0.8901, 0.8550, 0.8666, and 0.8780 for this model, respectively.
Machine learning tree models can be used to determine the variables with the most predictive importance, helping algorithms to assign greater weight to data that plays more of a role in the proper classification of metastasis.
In order to express the weighting of data in the aforementioned decision-making process, the outputs of each variable importance for each tested model are visually displayed in Figures 1-3. The figures demonstrate the most significant 10 array IDs as determined via the use of each respective algorithm. Additionally, the respective contributions of these arrays to the model can be visualized in Table 2, where the array IDs are presented in terms of their corresponding gene name.
Figures 1-3 represent the variable importance of particular arrays in reaching the decisions within the decision tree models. Figure 4 illustrates the most prevalent biological functions in which differentially expressed genes play a role in the metastatic process.
Table 3 illustrates each algorithm’s predicted top genetic candidates in respective order of priority for metastatic detection. In this table, each variable is listed from highest to lowest importance in metastatic prediction. Common important genes across all of the tested algorithms have been determined to be CD8, PB1, and THP-1, as shown in the table in bold lettering. The prevalence of these expression markers is also indicated. More specifically, differential expression in some of these markers is either present in all cancers, in a variety of different cancers, or is specific to breast cancer or a particular subset of cancers.
This analysis enabled a variety of metastatic biomarkers to be pinpointed, including some unknown genes that have yet to be identified by previous research (indicated by the “N/A” notation). The identifiable genes with the most significant differential expression are listed and explained in their biological context below.
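The way common candidates can be derived from per-model importance scores is sketched below: rank each model's features, keep the top k, and intersect the resulting sets. The model names, probe names, and scores are invented placeholders and do not correspond to the values reported in Table 3.

```python
def top_k(importances, k=10):
    """Return the k feature names with the highest importance, best first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def shared_top_features(models, k=10):
    """Features that appear in the top k of every model."""
    top_sets = [set(top_k(importances, k)) for importances in models.values()]
    return set.intersection(*top_sets)

# Hypothetical importance scores for four placeholder probes.
models = {
    "decision_tree": {"probe_a": 0.30, "probe_b": 0.25, "probe_c": 0.05, "probe_d": 0.40},
    "extra_trees": {"probe_a": 0.20, "probe_b": 0.10, "probe_c": 0.35, "probe_d": 0.35},
    "gradient_boosting": {"probe_a": 0.45, "probe_b": 0.05, "probe_c": 0.10, "probe_d": 0.40},
}

common = shared_top_features(models, k=2)
```

With k=2, only `probe_d` is ranked in the top two by all three toy models, so it is the single shared candidate.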
Lastly, we created a network analysis, as represented in Figure S2, of the output of the gradient boosting results, as this was found to be the most successful model tested within the machine learning analysis. The online GeneMANIA bioinformatics tool was used for this purpose. The GeneMANIA tool searches for information on particular genes and performs network analysis to determine key interactions in the results. When using the GeneMANIA tool, a link showing the interaction between each pair of genes within the target pool is created by analyzing the relationships within the data. The co-expression of transcripts was analyzed, and the interaction links were defined based on previously categorized relationships from data presented in the GeneMANIA Online Tool (https://genemania.org/).
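To give a sense of what a co-expression link means, the sketch below connects two genes when the Pearson correlation of their expression profiles is strong. This is only a generic illustration and is not GeneMANIA's algorithm, which draws on its own curated interaction data; the 0.8 threshold, the function names, and the toy profiles are assumptions.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

def coexpression_edges(profiles, min_abs_r=0.8):
    """Return (gene_1, gene_2, r) for every pair with |r| >= min_abs_r."""
    edges = []
    for gene_1, gene_2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[gene_1], profiles[gene_2])
        if abs(r) >= min_abs_r:
            edges.append((gene_1, gene_2, r))
    return edges

# Hypothetical expression profiles of three genes across four samples.
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],
    "gene_c": [3.0, 1.0, 4.0, 2.0],
}
edges = coexpression_edges(profiles)
```

In this toy example only `gene_a` and `gene_b` are linked, because their profiles rise together while `gene_c` is uncorrelated with both.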
These findings were used to refine a target network for downstream network analysis. Thus, in addition to characterizing an effective tree-based machine learning workflow for metastatic classification of array IDs and determining potential genes at play in early-stage breast cancer metastasis, we have also created a network by looking at the interactions of the differentially expressed genes that were found to play a role in metastasis as represented in Figure S2.
In this study, machine learning decision trees were used to process a clinical genetic expression dataset. In particular, basic decision trees, extremely randomized trees, and gradient boosting trees were compared and assessed in their ability to distinguish between gene expression patterns characteristic of metastatic and nonmetastatic breast cancer.
After model training, it was observed that the gradient boosting tree method was the most powerful algorithm for predicting metastatic potential within the breast cancer dataset. Feature importance analysis enabled array IDs to be narrowed down to a select pool of important arrays that play a significant role in classifying metastasis. Correlated genes and their functions were assessed to understand the broader biological context. Four arrays, 243850_at, 233053_at, 231644_at, and 231576_at, were common effective predictors of breast cancer metastasis, indicating that CD8, PB1, THP-1, and ETNK1 are amongst the most significant genes of interest.
In the future, we are planning to extend the study by adding more available next-generation sequencing (NGS) data and using causal inference methods. More research must be conducted to understand what genes correspond to unknown array ID hits that were strongly differentially expressed between metastatic and non-metastatic patients. All code is available on GitHub at http://github.com/melihagraz/ML_Metastatic_Prediction.
No funding was used for the execution of this research. M.F.B. is a consultant for Tersus Life Sciences, LLC. The study was conducted using data from the NCBI GEO database, and appropriate ethical approval and informed consent procedures were followed by the NIH for the collection of this data. No ethics approval was required by the authors for this study.
Availability of data and materials: In this study, two different publicly available datasets are used. The two datasets are publicly available on NCBI GEO Databank (https://www.ncbi.nlm.nih.gov/geo/). The datasets, called GSE102484 and GSE20685, are merged together to create a single dataset.