Towards precision oncology: a multi-level cancer classification system integrating liquid biopsy and machine learning

Eledkawy, Amr; Hamza, Taher; El-Metwally, Sara

doi:10.1186/s13040-025-00439-8

Research
Open access
Published: 11 April 2025

BioData Mining volume 18, Article number: 29 (2025)

Abstract

Background

Millions of people die from cancer every year. Early cancer detection is crucial for ensuring higher survival rates, as it provides an opportunity for timely medical interventions. This paper proposes a multi-level cancer classification system that uses plasma cfDNA/ctDNA mutations and protein biomarkers to identify seven distinct cancer types: colorectal, breast, upper gastrointestinal, lung, pancreas, ovarian, and liver.

Results

The proposed system employs a multi-stage binary classification framework in which each stage is customized for a specific cancer type. Features are selected by majority vote across six feature selectors: Information Value, Chi-Square, Random Forest Feature Importance, Extra Tree Feature Importance, Recursive Feature Elimination, and L1 Regularization. After feature selection, classifiers (eXtreme Gradient Boosting, Random Forest, Extra Tree, and Quadratic Discriminant Analysis) are customized for each cancer type, either individually or in an ensemble soft voting setup, to optimize predictive accuracy. The proposed system outperformed previously published results, achieving an AUC of 98.2% and an accuracy of 96.21%. To ensure reproducibility, the trained models and the dataset used in this study are publicly available in the GitHub repository (https://github.com/SaraEl-Metwally/Towards-Precision-Oncology).

Conclusion

The identified biomarkers enhance the interpretability of the diagnosis, facilitating more informed decision-making. The system's performance underscores its effectiveness in tissue localization, contributing to improved patient outcomes through timely medical interventions.

Introduction

Cancer is one of the leading causes of death globally, accounting for ten million deaths each year [1]. The number of cancer-related deaths is predicted to increase even in developed countries [2]. Surgery alone can often treat localized cancers effectively without systemic therapy [3], but surgical excision is seldom curative once distant metastases have occurred. Research therefore places significant emphasis on methods that detect cancer early, before it spreads to distant sites. Early detection enables timely medical interventions, leading to lower patient mortality and better outcomes across a wide range of cancer types [4].

In recent years, various methods have increased the accessibility of early cancer detection [5]. In particular, liquid biopsy, a blood-based test, has become a tool for early cancer identification. The test relies on mutations and genetic changes found in circulating tumor DNA (ctDNA), circulating cell-free DNA (cfDNA), or molecular biomarkers [6]. Liquid biopsy rests on the principle that mutant DNA templates in plasma originate from dying cancer cells, which makes them highly specific markers of neoplasia. Typical molecular biomarkers include RNA strands, DNA mutations, proteins, and protein fragments, which can provide valuable insights into disease prognosis [7].

A multi-analyte testing strategy that evaluates somatic mutations in cfDNA and ctDNA together with a range of protein biomarkers from blood plasma can improve early cancer detection when combined with Machine Learning (ML) techniques [8]. ML enables cancer detection from a range of data types, including liquid biopsy, clinical, and pathological data [6, 9, 10]. Combining this marker information with advances in artificial intelligence has made it possible to develop accurate tools for early cancer prediction.

This paper introduces a system designed to identify seven specific types of cancer: colorectal, breast, upper gastrointestinal (GI), lung, pancreas, ovarian, and liver cancer. The system begins by collecting liquid biopsy blood samples and analyzing protein biomarker concentrations and plasma cfDNA/ctDNA mutations obtained from healthy controls and cancer patients. The proposed approach employs a multi-level binary classification system, creating seven distinct datasets, each designed to target a specific type of cancer among other malignancies. This process involves a comprehensive feature selection procedure that integrates Information Value (IV), Chi-Square, Random Forest (RF) feature importance, Extra Tree (ET) feature importance, Recursive Feature Elimination (RFE), and L1 regularization techniques. Following reduction, the datasets are subjected to model training using algorithms tailored to the specific cancer types. These algorithms include eXtreme Gradient Boosting (XGBoost), RF, ET, and Quadratic Discriminant Analysis (QDA), which are applied individually or in ensemble soft voting configurations.
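The step of deriving seven binary datasets from one multi-class dataset can be sketched as a one-vs-rest relabeling. This is a minimal illustration under that assumption, not a reproduction of the paper's Algorithm 1; the function name, label strings, and toy feature values are hypothetical.

```python
# Labels follow the seven cancer types named in the text; the exact label
# strings are an assumption made for this sketch.
CANCER_TYPES = ["colorectal", "breast", "upper_gi", "lung",
                "pancreas", "ovarian", "liver"]


def split_into_binary_datasets(samples, labels, classes=CANCER_TYPES):
    """Return {cancer_type: (samples, binary_labels)}.

    In each binary dataset, 1 marks the target cancer type and 0 marks
    every other cancer type (one-vs-rest relabeling).
    """
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    datasets = {}
    for target in classes:
        binary = [1 if label == target else 0 for label in labels]
        datasets[target] = (list(samples), binary)
    return datasets


# Toy feature vectors (made-up values, not taken from the study's dataset).
samples = [[0.2, 1.1], [0.9, 0.3], [0.4, 0.8], [0.7, 0.5]]
labels = ["lung", "breast", "lung", "liver"]

binary_sets = split_into_binary_datasets(samples, labels)
print(binary_sets["lung"][1])  # [1, 0, 1, 0]
```

Each of the seven resulting datasets can then go through its own feature selection and model training, which is what allows a different feature subset and classifier per cancer type.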

The main contributions of the paper can be summarized as follows:

The proposed system employs liquid biopsy to offer a non-invasive method for detecting seven distinct types of cancer at an early stage.
Employing a voting technique, the system performs comprehensive feature selection by incorporating six methods: IV, Chi-Square, RF, ET, RFE, and L1 regularization. This approach aims to identify the most important features, thereby improving model interpretability.
The system uses various machine learning classifiers to customize the modeling process for each cancer type's specific characteristics. It leverages the strengths of classifiers such as XGBoost, RF, ET, and QDA, applied individually or combined through ensemble soft voting, to ensure accurate predictions.
The system's evaluation results reached an Area Under the Curve (AUC) of 98.2% and an average accuracy of 96.21%, highlighting the system's capacity to enhance clinical follow-up protocols.
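The two voting mechanisms named in the contributions above can be sketched in a few lines. This is a simplified illustration, assuming a strict-majority threshold for feature selection and an unweighted mean of positive-class probabilities for soft voting; the paper's exact thresholds and weights are not stated in this section. Feature names (`f1` to `f4`), selector outputs, and probability values are hypothetical.

```python
def majority_vote_select(selections, min_votes=None):
    """Keep features chosen by at least min_votes selectors.

    selections maps a selector name to the features it picked. By default
    a strict majority is required (assumption: 4 of 6 selectors).
    """
    if min_votes is None:
        min_votes = len(selections) // 2 + 1
    votes = {}
    for chosen in selections.values():
        for feature in set(chosen):  # each selector votes once per feature
            votes[feature] = votes.get(feature, 0) + 1
    return sorted(f for f, n in votes.items() if n >= min_votes)


def soft_vote(probabilities, threshold=0.5):
    """Average positive-class probabilities across classifiers, then threshold."""
    n_models = len(probabilities)
    n_samples = len(next(iter(probabilities.values())))
    mean = [sum(p[i] for p in probabilities.values()) / n_models
            for i in range(n_samples)]
    return mean, [1 if m >= threshold else 0 for m in mean]


# Hypothetical selector outputs for one binary (single cancer type) dataset.
selections = {
    "IV": ["f1", "f2", "f3"],
    "Chi-Square": ["f1", "f2"],
    "RF": ["f1", "f3", "f4"],
    "ET": ["f1", "f2", "f4"],
    "RFE": ["f1", "f2"],
    "L1": ["f1", "f3"],
}
selected = majority_vote_select(selections)
print(selected)  # ['f1', 'f2']

# Hypothetical positive-class probabilities for three samples.
probabilities = {
    "XGBoost": [0.9, 0.2, 0.6],
    "RF": [0.8, 0.4, 0.3],
    "ET": [0.7, 0.3, 0.5],
}
mean_probs, predictions = soft_vote(probabilities)
print(predictions)  # [1, 0, 0]
```

Averaging probabilities rather than counting hard votes lets a confident classifier outweigh two uncertain ones, which is the usual motivation for soft voting.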

The structure of the paper is as follows: The "Related Work" section provides an overview of previously reported approaches that have used liquid biopsy for the early detection of cancer. In the "Materials and Methods" section, we present the proposed framework for multi-cancer detection, integrating liquid biopsy and ML techniques, along with details about the dataset used in the training process. The "Experimental Results" section presents the outcomes obtained at various stages of the proposed methodology and includes comparative analyses with previously published approaches to establish benchmarking performance. The "Discussion" section comprehensively evaluates the proposed system, highlighting its robustness, offering clinically interpretable insights, and identifying its limitations. Finally, the "Conclusion" section summarizes the study's key findings and explores potential future research directions in multi-cancer diagnosis, focusing on integrating biological biomarkers and artificial intelligence techniques.

Related work

Machine learning has been increasingly applied to cancer diagnostics, covering a wide range of cancer types, including colorectal [11], thyroid [12], lung [13], breast [14], and brain [15] tumors, demonstrating its potential in improving early detection and classification accuracy. Many studies have explored cancer prediction and classification using machine learning techniques, such as pACP-HybDeep [16], iACP-GAEnsC [17], cACP-2LFS [18], and cACP-DeepGram [19], which leverage deep learning, ensemble learning, and feature selection methods to enhance predictive accuracy and robustness.

Researchers are now investigating cancer-associated somatic alterations in cfDNA as a non-invasive route to early cancer diagnosis [8, 20, 21], for cancers such as gastric [22], colorectal [23], lung [24], and breast [25]. Deep learning methods [26], conjunctive Bayesian networks [27], and network-based multi-task learning models [28], applied to liquid biopsy data for cancer prediction, are advancing the field of cancer research.

Cohen et al. [8] conducted a comprehensive data collection effort from patients diagnosed with nonmetastatic cancers affecting various organs, including the ovary, liver, stomach, pancreas, esophagus, colorectum, lung, and breast. The dataset includes protein biomarker concentrations, mutations detected in plasma cfDNA/ctDNA, and clinical characteristics such as ethnicity, sex, age, and histology. Specifically, it contains measurements of 39 distinct protein biomarkers in plasma samples and an omega score computed from the mutations identified in the samples. Their approach, CancerSEEK, was used to classify seven different cancer types, with esophageal and gastric cancers grouped together for analysis. Using a random forest classifier and tenfold cross-validation, they achieved a classification accuracy of 62.32% in their experiments.

Wong et al. [9] introduced a cancer localization framework called CancerA1DE, which relies on Aggregating One-Dependence Estimators (A1DE). They used the Cohen et al. dataset and applied the minimum description length principle to discretize continuous marker features. They classified seven distinct cancer types using the omega score, gender, and 39 protein biomarkers.

Rahaman et al. [6] introduced CancerEMC, a cancer localization system that applies a Bagging Ensemble Meta Classifier to the Cohen et al. dataset. They used the Synthetic Minority Oversampling Technique (SMOTE) to address the dataset imbalance problem and Random Forest (RF) as a feature selection technique, and they conducted multiple experiments with different sample sizes. With 626 cancer patients and seven cancer types, they achieved 74.12% accuracy using the omega score, gender, and 19 biomarker features selected through RF before applying SMOTE, and 91.5% accuracy after applying SMOTE. For cancer type detection in 1,817 individuals, accuracy was 83.49% before SMOTE and 95.98% after. For cancer localization in 1,005 cancer patients, accuracy was 74.22% before SMOTE and 93.98% after. All experiments used tenfold cross-validation.
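The core idea of SMOTE, generating synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbors, can be sketched as follows. This is a simplified illustration of the general technique and does not reproduce CancerEMC's implementation; the function name, parameters, and toy data are hypothetical.

```python
import math
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbors, excluding the base itself.
        neighbors = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbors)]
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority-class feature vectors (made-up values).
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]]
new_points = smote(minority, n_synthetic=4)
print(len(new_points))  # 4
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay within the region already occupied by the minority class.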

Halner et al. [10] proposed the DEcancer framework for early cancer detection on the Cohen et al. dataset. The framework first set aside 20% of the dataset as a test set and used the remaining data for training and validation in a 200-fold Monte Carlo cross-validation configuration. They applied various data augmentation techniques to the training data and optimized the classifier model through feature selection and hyperparameter tuning during cross-validation. Using the independent t-test, they compared the performance of the classifier models with different feature sets, ensuring no statistically significant performance difference. The best data processing framework, classifier models, and feature set were selected. They then retrained the models using all combined training and validation data and evaluated them on the test set. With 39 biomarker characteristics, omega score, age, sex, and ethnicity data, their framework obtained an average AUC of 91.88% for cancer localization with 1,005 cancer patients and 94.13% with 1,817 individuals.
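The evaluation protocol described above, a fixed 20% hold-out test set followed by repeated random train/validation splits of the remaining data, can be sketched as index bookkeeping. This is an illustration of Monte Carlo cross-validation in general, not DEcancer's code; the function name is hypothetical, and the 20% validation fraction inside each repeat is an assumption, since the source does not state it.

```python
import random


def holdout_then_monte_carlo(n_samples, test_fraction=0.2, n_repeats=200,
                             val_fraction=0.2, seed=0):
    """Return (test_indices, [(train_indices, val_indices), ...]).

    The test set is drawn once and never reused; each repeat reshuffles
    the remaining development indices into a new train/validation split.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    test_idx, dev_idx = indices[:n_test], indices[n_test:]
    n_val = round(len(dev_idx) * val_fraction)  # assumed validation share
    splits = []
    for _ in range(n_repeats):
        shuffled = dev_idx[:]
        rng.shuffle(shuffled)
        splits.append((shuffled[n_val:], shuffled[:n_val]))
    return test_idx, splits


# 1,005 cancer patients, as in the localization experiment cited above.
test_idx, splits = holdout_then_monte_carlo(1005)
train_idx, val_idx = splits[0]
print(len(test_idx), len(splits), len(train_idx), len(val_idx))  # 201 200 643 161
```

Unlike k-fold cross-validation, the validation sets of different repeats may overlap, so the number of repeats can be chosen independently of the validation set size.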

While CancerA1DE, CancerSEEK, CancerEMC, and DEcancer can help in early cancer detection by applying ML techniques on liquid biopsy data, CancerSEEK and CancerA1DE exhibit low accuracy in cancer-type localization. Furthermore, the applicability of the DEcancer framework is hindered by the lack of information regarding the classifiers and feature selectors utilized in their study. A summary of the related work can be found in Table 1.

Table 1 Comparing multi-cancer classification models using the Cohen et al. dataset
