
Regional Finalist, SARC 2025

Early Detection of Lung Cancer in India: A Machine Learning Approach

By Sarthak Gupta, India

Abstract:
Lung cancer remains a leading cause of cancer-related deaths globally [1], with India facing an even more acute crisis due to late-stage diagnoses [6] and limited access to diagnostic resources, especially in rural and low-income areas. Over 85% of lung cancer cases in India are diagnosed at advanced stages, where curative treatment is rarely possible [6]. Early detection, by contrast, can increase survival rates by four to five times. Building on my previously conducted primary research, where I independently designed and executed a study analyzing data from 2,512 Indian patients [11], published in The Journal of Student Research (Vol. 13 No. 1, 2024), this proposal seeks to expand the work nationally. The goal is to develop a simple, scalable, symptom-based machine learning model trained on diverse Indian populations, empowering frontline healthcare workers to identify high-risk individuals early. By addressing the critical gaps in healthcare access and adapting to India's genetic, environmental, and behavioral diversity [10][15], the research aims to create an equitable, low-cost tool capable of saving lives across underserved communities.

 

Introduction:

Lung cancer accounts for about one in five global cancer deaths [1] and is particularly lethal in India, where the disease is most often identified at a late stage [6]. Research indicates that more than 85% of Indian lung cancer patients have Stage III or IV cancer at the time of diagnosis [6], which greatly reduces their chances of survival. Access to advanced diagnostic techniques such as CT and PET scans is largely limited to urban centers [7]. Rural and semi-urban communities, which constitute the vast majority of India's population, are particularly at risk: they are frequently diagnosed only after severe symptoms have appeared and few treatment options remain. Early diagnosis is vital, and here artificial intelligence has the potential to be a game-changer. As part of my earlier research, I independently conducted a primary study using clinical data from 2,512 Indian patients collected at Neera Hospital, Lucknow. This effort included data collection, cleaning, model building, and validation, and was published in the Journal of Student Research [11]. Building on it, I now propose to extend the study to the pan-India level, since the country is home to enormous genetic, environmental, and behavioral heterogeneity [10][15]. By collaborating with hospitals across multiple regions and retraining machine learning models on broader data, the study seeks to develop a holistic, low-cost, scalable tool for early lung cancer risk detection.

 

Literature Review:

Machine learning has emerged as a valuable tool in oncology, demonstrating strong potential for early diagnosis [7][8]. Techniques such as Random Forest, Gradient Boosting, and CatBoost have achieved notable success in clinical prediction tasks [9][10]. Yet most healthcare machine learning models have been trained on Western datasets or on small, urban-centered Indian samples that represent the wider population poorly. India's regional genetic diversity [10], environmental differences (urban vs. rural) [6], varied tobacco use (cigarettes, bidis, smokeless tobacco) [13][15], and culture-based norms influencing healthcare-seeking behavior underscore the need for a geographically representative dataset to enable equitable, generalizable predictive models.

 

Methodology:

I conducted the initial study as primary research, collecting data directly from Neera Hospital, Lucknow. The work included designing the study, cleaning data, encoding categorical variables, training multiple machine learning models, and analyzing feature importances. It demonstrated that machine learning models, particularly CatBoost, could predict lung cancer risk with an accuracy of 97.2% [11]. Building on this result, I propose a pan-India extension of the study with the following aims:

 

(1) Data Collection: Partner with health institutions in different Indian states to collect a geographically and socioeconomically representative dataset.

 

(2) Feature Capture: Include standard features such as demographics, smoking status, occupational history, and respiratory symptoms, along with region-specific risk factors.

 

(3) Model Development: Re-train Logistic Regression, Decision Trees, Random Forest, Gradient Boosting, and CatBoost models on the new larger dataset.

(4) Data Preprocessing: Handle missing values, normalize numerical features, and encode categorical variables rigorously so that the data is ready for training.

 

(5) Evaluation Metrics: Assess model performance using accuracy, precision, recall, and F1-score, and break the results down by region to identify regional trends.

 

(6) Regional Analysis: Identify how socioeconomic and regional influences impact symptom patterns and predictive features.
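
As an illustration of the regional breakdown described in step (5), the following is a minimal sketch that computes accuracy, precision, recall, and F1-score separately for each region. It assumes binary labels (1 = high risk, 0 = low risk) and that each record carries a region tag alongside the true and predicted labels. The function names, region names, and toy records are hypothetical and invented for illustration; they are not data or code from the published study.

```python
from collections import defaultdict


def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = high risk)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"n": total, "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def metrics_by_region(records):
    """Group (region, true_label, predicted_label) records and score each region."""
    grouped = defaultdict(lambda: ([], []))
    for region, true_label, predicted_label in records:
        grouped[region][0].append(true_label)
        grouped[region][1].append(predicted_label)
    return {region: binary_metrics(y_true, y_pred)
            for region, (y_true, y_pred) in grouped.items()}


# Hypothetical toy predictions, invented purely for illustration.
toy_records = [
    ("North", 1, 1), ("North", 0, 0), ("North", 1, 0), ("North", 0, 0),
    ("South", 1, 1), ("South", 0, 1), ("South", 1, 1), ("South", 0, 0),
]

report = metrics_by_region(toy_records)
for region in sorted(report):
    print(region, report[region])
```

In this toy example both regions reach the same accuracy, yet recall is lower in one of them, meaning more high-risk patients are missed there. A per-region report of this kind is one way such gaps could be surfaced before a screening tool is deployed.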

 

Conclusion:

Given the high mortality associated with late-stage lung cancer diagnosis in India, and the severe disparity in diagnostic access between urban and rural populations, there is an urgent need for scalable, low-cost early detection tools. By expanding my original research into a pan-India study, this project will enable the development of a robust, AI-driven, symptom-based lung cancer screening tool, adaptable for use in primary care settings across diverse regions. This tool can empower frontline healthcare workers to identify at-risk patients earlier, prioritize interventions, and ultimately save lives in communities where traditional healthcare infrastructure is weak. By combining machine learning innovation with a commitment to healthcare equity, this research has the potential to meaningfully reduce the burden of lung cancer mortality across India.

​

References:

1. Cancer (2022). World Health Organization. https://www.who.int/news-room/fact-sheets/detail/cancer

 

2. Brennan, P., & Davey-Smith, G. (2021). Identifying novel causes of cancers. JNCI: Journal of the National Cancer Institute, 114(3), 353–360. https://doi.org/10.1093/jnci/djab204

 

3. Falzone, L., Salomone, S., & Libra, M. (2018). Evolution of cancer pharmacological treatments. Frontiers in Pharmacology, 9. https://doi.org/10.3389/fphar.2018.01300

 

4. de Groot, P.M., et al. (2018). The epidemiology of lung cancer. Translational Lung Cancer Research, 7(3), 220–233. https://doi.org/10.21037/tlcr.2018.05.06

 

5. Fares, J., et al. (2020). Molecular principles of metastasis. Nature News. https://www.nature.com/articles/s41392-020-0134-x

 

6. Mathur, P., et al. (2022). A clinicoepidemiological profile of lung cancers in India. Indian Journal of Medical Research, 155(2), 264. https://doi.org/10.4103/ijmr.ijmr_1364_21

 

7. Huang, S., Yang, J., Fong, S., & Zhao, Q. (2020). AI in cancer diagnosis. Cancer Letters, 471, 61–71. https://doi.org/10.1016/j.canlet.2019.12.007

 

8. Iorio, F., et al. (2016). Pharmacogenomic interactions in cancer. Cell, 166(3), 740–754. https://doi.org/10.1016/j.cell.2016.06.017

 

9. P.R., R., Nair, R.A.S., & G., V. (2019). Lung cancer detection via ML algorithms. 2019 IEEE ICECCT. https://doi.org/10.1109/icecct.2019.8869001

 

10. Singh, N., et al. (2021). Lung cancer in India. Journal of Thoracic Oncology, 16(8), 1250–1266. https://doi.org/10.1016/j.jtho.2021.02.004

 

11. Gupta, S. (2024). The Early Detection of Lung Cancer among Indian Patients using Machine Learning Algorithms. Journal of Student Research, 13(1). https://doi.org/10.47611/jsr.v13i1.2383

 

12. Damani, A., et al. (2019). Dyspnea in lung cancer. Indian J Palliat Care, 25(3), 403–406. https://doi.org/10.4103/IJPC.IJPC_64_19

 

13. Harle, A.S.M., et al. (2019). Cough symptoms in lung cancer. Chest, 155(1), 103–113. https://doi.org/10.1016/j.chest.2018.10.003

 

14. Gorlova, O.Y., et al. (2007). Cancer among relatives of non-smokers. Int J Cancer, 121(1), 111–118. https://doi.org/10.1002/ijc.22615

 

15. Singh, N., et al. (2021). Tobacco use and lung cancer in India. Journal of Thoracic Oncology, 16(8), 1250–1266. https://doi.org/10.1016/j.jtho.2021.02.004
