Evaluating the effectiveness of Machine Learning in diagnosing Pneumonia using clinical tabular data and raw chest X-ray images.
Pneumonia is a severe respiratory infection that affects a significant portion of the global population. This project investigates whether machine learning models can assist clinical and non-clinical staff in diagnosing pneumonia efficiently. The investigation is split into two main approaches:
- Analyzing tabular clinical data using standard classification algorithms and ensembles.
- Classifying raw chest X-ray images directly using a Support Vector Machine (SVM).
To ensure the models were trained on high-quality data, several preprocessing steps were applied to the raw pneumonia_raw.csv dataset:
- Data Cleaning: Removed duplicate records and filtered out numerical outliers (e.g., negative values for consolidation dimensions).
- Categorical Encoding: Converted the categorical target feature into a numerical format using label encoding.
- Feature Selection: Dropped irrelevant identifiers, such as the Patient ID, based on correlation matrix results.
- Scaling: Applied Standard Scaling via pipelines to ensure all features had a mean of 0 and a standard deviation of 1.
- Imbalance Handling: Utilized stratified sampling and the class_weight='balanced' hyperparameter to counteract the higher volume of positive pneumonia cases in the dataset.
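
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (patient_id, consolidation_size, temperature, diagnosis), the 80/20 split, and the choice of Logistic Regression as the class-weighted estimator are all assumptions, and the data is a synthetic stand-in for pneumonia_raw.csv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler


def clean_and_split(df, target="diagnosis", id_col="patient_id",
                    nonneg_cols=("consolidation_size",), seed=42):
    """Clean the raw table and return a stratified train/test split.

    All column names here are hypothetical placeholders.
    """
    df = df.drop_duplicates()
    # Filter out rows with physically impossible negative measurements.
    for col in nonneg_cols:
        df = df[df[col] >= 0]
    # Label-encode the categorical target (classes are sorted alphabetically).
    y = LabelEncoder().fit_transform(df[target])
    # Drop the identifier and the target from the feature matrix.
    X = df.drop(columns=[id_col, target])
    # stratify=y keeps the class ratio the same in both splits.
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=seed)


def make_model():
    """Scale features inside a pipeline, then fit a class-weighted classifier."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(solver="liblinear", max_iter=150,
                                   class_weight="balanced")),
    ])


# Synthetic stand-in for pneumonia_raw.csv (all values are made up).
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "patient_id": np.arange(n),
    "consolidation_size": rng.uniform(0.5, 6.0, n),
    "temperature": rng.normal(38.0, 0.8, n),
    "diagnosis": rng.choice(["Normal", "Pneumonia"], size=n, p=[0.35, 0.65]),
})
raw.loc[0, "consolidation_size"] = -1.0                   # one invalid value
raw = pd.concat([raw, raw.iloc[[5]]], ignore_index=True)  # one duplicate row

X_train, X_test, y_train, y_test = clean_and_split(raw)
model = make_model().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```

Keeping the scaler inside the pipeline means its mean and standard deviation are learned from the training split only, so no information from the test split leaks into training.
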
Five standalone models and four ensembles were evaluated using K-Fold Cross Validation and Confusion Matrices to determine the best approach for the tabular clinical data.
| Model | Hyperparameters Tuned | Accuracy |
|---|---|---|
| Support Vector Machine | kernel='rbf', gamma=5 | 71.6% |
| Random Forest (Ensemble 1) | n_estimators=150, max_depth=5 | 67.2% |
| Voting Ensemble 2 | estimators = DT, KNN, LR | 65.5% |
| K-Nearest Neighbors | n_neighbors=7 | 64.7% |
| Final Ensemble (RF, SVM, LR) | Diverse hyperparameter settings | 64.7% |
| Ensemble of Ensembles | estimators = Ensemble 1 & 2 | 64.7% |
| Logistic Regression | solver='liblinear', max_iter=150 | 62.9% |
| Decision Tree | max_depth=5, min_samples_split=2 | 60.3% |
| Gaussian Naive Bayes | var_smoothing=1e-8 | 59.5% |
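
A rough sketch of how such a comparison could be run with scikit-learn is shown below. The hyperparameters are taken from the table, but the rest is assumed for illustration: the data is synthetic, the number of folds (5), the stratified variant of K-Fold, hard voting, and where scaling and class weights are applied are not confirmed by the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     cross_val_score)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the cleaned clinical table.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           weights=[0.35, 0.65], random_state=0)


def scaled(estimator):
    """Wrap an estimator so scaling is refit inside every CV fold."""
    return make_pipeline(StandardScaler(), estimator)


models = {
    "Support Vector Machine": scaled(
        SVC(kernel="rbf", gamma=5, class_weight="balanced")),
    "Random Forest (Ensemble 1)": RandomForestClassifier(
        n_estimators=150, max_depth=5, class_weight="balanced",
        random_state=0),
    "Voting Ensemble 2": VotingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier(max_depth=5, min_samples_split=2,
                                          random_state=0)),
            ("knn", scaled(KNeighborsClassifier(n_neighbors=7))),
            ("lr", scaled(LogisticRegression(solver="liblinear",
                                             max_iter=150))),
        ],
        voting="hard",  # majority vote; an assumption
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    # Per-fold accuracy, averaged over the folds.
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    # Out-of-fold predictions give one confusion matrix over all samples.
    y_pred = cross_val_predict(model, X, y, cv=cv)
    results[name] = (scores.mean(), confusion_matrix(y, y_pred))
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
    print(results[name][1])
```

The accuracies printed by this sketch come from synthetic data and are not comparable to the figures in the table.
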
Instead of relying solely on clinical measurements, a second approach trained a classifier directly on chest X-ray images.
- Image Processing: Images were loaded, resized to 128x128 pixels, converted to grayscale, and flattened into 1D arrays to be compatible with standard machine learning classifiers.
- Model Used: Support Vector Machine (Linear Kernel)
- Result: The model achieved an accuracy of 75.0%. It showed strong precision and recall across both the "Pneumonia" and "Normal" classes, suggesting that supervised learning can extract diagnostic patterns directly from pixel data.
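
The image pipeline described above might look roughly like this. It is a sketch under stated assumptions: Pillow is used for image handling (it is not among the libraries this README lists), the helper names are hypothetical, and the images are synthetic arrays rather than real X-rays, so the printed scores say nothing about the actual dataset.

```python
import numpy as np
from PIL import Image
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

IMG_SIZE = (128, 128)


def image_to_vector(img, size=IMG_SIZE):
    """Resize, convert to grayscale, and flatten a PIL image to a 1D array."""
    gray = img.resize(size).convert("L")
    # Scale pixel values to [0, 1] and flatten 128x128 into 16384 features.
    return np.asarray(gray, dtype=np.float32).ravel() / 255.0


def load_image_vector(path, size=IMG_SIZE):
    """Same as image_to_vector, but reads the image from disk first."""
    with Image.open(path) as img:
        return image_to_vector(img, size)


def synthetic_xray(rng, opaque):
    """Made-up RGB image; a bright patch stands in for a pneumonia opacity."""
    pixels = rng.integers(0, 120, size=(200, 160, 3), dtype=np.uint8)
    if opaque:
        pixels[60:140, 40:120] += 100  # max value 219, so no uint8 overflow
    return Image.fromarray(pixels)


rng = np.random.default_rng(0)
labels = np.array([0, 1] * 30)  # 0 = Normal, 1 = Pneumonia
X = np.stack([image_to_vector(synthetic_xray(rng, bool(label)))
              for label in labels])

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0)

clf = SVC(kernel="linear").fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test),
                            target_names=["Normal", "Pneumonia"]))
```

For real data, `load_image_vector` would be called once per image file and the results stacked in the same way as `X` above.
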
Machine learning can assist in diagnosing pneumonia. On the clinical tabular data, the standalone Support Vector Machine performed best (71.6% accuracy), while the ensemble methods, which combine the strengths of diverse algorithms, gave consistent results between 64.7% and 67.2%. Furthermore, the 75.0% accuracy achieved on the raw X-ray dataset suggests that a model can bypass manual clinical measurements and analyze radiologic imaging directly.
To run this notebook, you will need:
- Python 3
- Libraries: pandas, numpy, scikit-learn
- Platform: Optimized for Google Colab.