Using KL divergence and Mutual Information as a hybrid feature selection method in breast cancer classification
Introduction:
Breast cancer is a major health issue, with about 2.3 million new cases reported in 2020. It’s the most common cancer among women, making up around 12.5% of all new cancer cases. In the U.S., roughly 1 in 8 women will be diagnosed with breast cancer in their lifetime. Despite treatment advancements, it remains a leading cause of cancer-related deaths, with an estimated 685,000 deaths worldwide in 2020.
Overview of KL Divergence and Mutual information:
KL Divergence:
Kullback-Leibler (KL) divergence stands as a fundamental concept in information theory, playing a pivotal role in the channel coding theorem. While its primary application lies in quantifying the discrepancy between input and output in channel encoding, KL divergence has found broader use in measuring the dissimilarity between any two probability distributions.
Consider two probability distributions, p(x) and q(x). The KL divergence between these distributions is formally defined as:
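D_KL(p ‖ q) = Σ_x p(x) · log( p(x) / q(x) )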
This formula calculates the average logarithmic difference between the probabilities p(x) and q(x), weighted by p(x). In essence, KL divergence quantifies the information lost when q(x) is used to approximate p(x).
Mutual information:
Alongside Kullback-Leibler Divergence, mutual information measures how much one variable depends on another. For two random variables, X and Y, mutual information represents the amount of shared information between their probability distributions, as shown in the figure below.
Formula:
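I(X; Y) = Σ_x Σ_y p(x, y) · log( p(x, y) / ( p(x) · p(y) ) )

Equivalently, I(X; Y) is the KL divergence between the joint distribution p(x, y) and the product of the marginals p(x)·p(y), which is what ties the two measures together.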
A larger area of overlap indicates a stronger relationship between these variables. In simpler terms, the more they share information, the more dependent they are on each other. This concept is widely used in fields like machine learning and data analysis to understand how different variables interact.
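As a small illustration (a toy example, not part of the breast cancer analysis), mutual information between two discrete variables can be estimated with scikit-learn; a dependent pair scores far above an independent one:

import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=1000)          # a binary variable
y_copy = x.copy()                          # fully dependent on x
y_noise = rng.integers(0, 2, size=1000)    # generated independently of x

print(mutual_info_score(x, y_copy))   # close to ln(2) ≈ 0.69 nats, the entropy of x
print(mutual_info_score(x, y_noise))  # close to 0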
Application to breast cancer prediction:
Dataset description:
Fine needle aspiration (FNA) is a diagnostic procedure used to examine breast masses. In this technique, a thin needle is used to extract cells from a suspicious breast lump, which are then analyzed microscopically.
The extracted cells are digitized into images, from which various features of the cell nuclei are computed. These features describe characteristics like size, shape, and texture of the nuclei. The resulting dataset, known as the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, is used in machine learning applications to distinguish between benign and malignant breast masses.
This dataset is publicly available through the University of Wisconsin CS ftp server and the UCI Machine Learning Repository, making it a valuable resource for researchers developing algorithms for breast cancer diagnosis.
Kaggle’s link: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data/data
Motivation:
The data was ready for analysis, with no missing values or duplicate entries. Through our descriptive analysis, we found that 19 of the 30 cell-nucleus attributes exhibited clear differences in their histograms when split by class. These differences in the shape and range of the histograms suggest that these attributes might be valuable for predicting whether cell nuclei are malignant or benign.
To quantify the extent of these differences between the two distributions, we are considering employing KL divergence. This statistical measure will help us assess how distinct the distributions are from one another.
From the perspective of KL divergence alone, we can assess how distinct one probability distribution is from another when comparing classes. However, this raises the question of whether KL divergence is a robust criterion for selecting predictors. In my view, while KL divergence effectively measures the difference in probability distributions, it does not directly indicate how this difference contributes to class differentiation. Essentially, it does not reveal the extent to which this difference impacts the predictive power of the attribute.
To address this, we need to evaluate the attribute’s dependence on the target variable to determine its utility as a predictor in the model. This consideration leads me to propose using mutual information as a complementary measure. Mutual information will help quantify the strength of the relationship between the attribute and the target variable, providing a clearer picture of the attribute’s predictive power and its potential effectiveness as a predictor.
Implementation of KL divergence and mutual information:
KL Divergence:
KL divergence is computed based on probability distributions, where the total sum of probabilities in each distribution must equal 1. To prepare the dataset for KL divergence calculation, several transformations are required:
- Create Bins: Define bins for each attribute to generate histograms. This step involves discretizing the continuous data into intervals.
- Regenerate Histograms: Create histograms using the new bins to represent the distribution of each attribute.
- Normalize Histograms: Convert the histograms into probability distributions by normalizing them, ensuring that the sum of probabilities equals 1.
- Compute KL Divergence: Calculate KL divergence to measure the difference between the probability distributions of each attribute across the two classes.
Detailed code:
import numpy as np
from scipy.stats import entropy

def kl_divergence(predictor):
    #Split the predictor values by class
    p_malignant = df[df['diagnosis']=='M'][predictor]
    q_benign = df[df['diagnosis']=='B'][predictor]
    #Shared data range so both histograms use identical bins
    data_min = min(p_malignant.min(), q_benign.min())
    data_max = max(p_malignant.max(), q_benign.max())
    #Recreate bins
    bins = np.linspace(data_min-1, data_max+1, 200)
    #Compute histograms over the shared bins
    hist1, _ = np.histogram(p_malignant, bins=bins, density=True)
    hist2, _ = np.histogram(q_benign, bins=bins, density=True)
    #Normalize to probabilities; epsilon avoids division by zero and log(0)
    eps = 1e-10
    hist1 = hist1/hist1.sum() + eps
    hist2 = hist2/hist2.sum() + eps
    #Compute KL divergence D_KL(P_malignant || Q_benign)
    kl_div = entropy(hist1, hist2)
    return kl_div
*Note: KL divergence is asymmetric, meaning that the divergence of P(x) from Q(x) is not necessarily the same as that of Q(x) from P(x).
To make the concept and interpretation clearer, I choose P(x) to represent the probability distribution of the malignant class and Q(x) to represent the distribution of the benign class. In this framework, KL divergence can be understood as a measure of how “surprised” we would be to see the distribution of the malignant class if we already know the distribution of the benign class. This interpretation helps to contextualize the difference between the two distributions in terms of unexpectedness or divergence.
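To make the asymmetry concrete, here is a toy check with two made-up three-bin distributions (illustrative values only, not taken from the dataset):

import numpy as np
from scipy.stats import entropy

p = np.array([0.7, 0.2, 0.1])  # stand-in for P(x), the malignant-class distribution
q = np.array([0.3, 0.4, 0.3])  # stand-in for Q(x), the benign-class distribution

print(entropy(p, q))  # D_KL(P || Q)
print(entropy(q, p))  # D_KL(Q || P), generally a different value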
Mutual information:
Before computing mutual information, the target variable must be encoded as a numeric variable.
Detailed code:
from sklearn.feature_selection import mutual_info_classif
df_spare = df[list(set(potential)|{'diagnosis'})].copy()  #keep the candidate predictors plus the target
df_spare['diagnosis'] = df_spare['diagnosis'].map({'M':1,'B':0})  #encode the target numerically
# Compute mutual information between each predictor and the target
mi_scores = mutual_info_classif(df_spare.drop(columns=['diagnosis']), df_spare['diagnosis'])
# Store results as {attribute: score}
result = dict(zip(df_spare.drop(columns=['diagnosis']).columns, mi_scores))
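As a quick sanity check, the scores can be ranked and printed, reusing the result dictionary from the snippet above:

# Attributes ranked by mutual information, highest first
for name, score in sorted(result.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")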
Strategy for predictor’s selection:
From the scatter plot, attributes with high mutual information and high KL divergence scores are the strongest candidates for the model, as they indicate a strong relationship with the target variable and a clear separation between classes. Conversely, attributes with low values on both scores are excluded, as they contribute little to distinguishing the classes. In practice, we kept the attributes whose KL divergence or mutual information exceeds the corresponding average score (see the sketch below), ensuring that only the most informative features enter the model.
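Concretely, assuming the kl_mi dataframe assembled in the full code below (one row per attribute with its KL and MI scores), the selection rule is:

# Keep attributes whose KL divergence or mutual information exceeds the respective average
choices = kl_mi[(kl_mi["KL"] > kl_mi["KL"].mean()) | (kl_mi["MI"] > kl_mi["MI"].mean())]
predictors = list(choices["Attribute"])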
Modeling process:
- Dataset Splitting: Divided the dataset into 80% for training and 20% for testing.
- Cross-Validation: Applied 5-fold cross-validation on the training dataset to validate model performance.
- Scaling: Standardized all input attributes using a standard scaler.
- Model Selection: Chose the model with the best performance based on the average cross-validated recall for the malignant class.
- Training: Fitted the selected model on the entire training dataset.
- Prediction: Used the trained model to predict outcomes on the testing dataset.
- Evaluation: Assessed the final model’s performance using the unseen testing data (20% of the dataset).
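The full code at the end implements these steps end to end; in brief, the model-selection core looks like this (a minimal sketch reusing the models, X_valid, and y_valid names defined in the full code):

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2024)
result_cv = {}
for name, model in models.items():
    # Mean recall for the malignant class (label 1) over the 5 folds
    result_cv[name] = np.mean(cross_val_score(model, X_valid, y_valid, cv=kf, scoring='recall'))
best_model_name = max(result_cv, key=result_cv.get)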
Model’s performance:
Models Considered:
Decision Tree
Random Forest
XGBoost
Cross-Validation Results:
Random Forest: Delivered the best performance, with an average cross-validated recall of 94% for the malignant class, outperforming the other models.
Testing Set Performance:
Accuracy Score:
The Random Forest model achieved an accuracy score of 97% on the testing set, indicating a high level of overall correctness in predictions.
Recall Rate for Malignant Class:
The model achieved a recall rate of 93% for the malignant class, reflecting its effectiveness in correctly identifying malignant instances among the test samples.
Conclusion:
Feature selection is an important step in building a high-performance model. The idea of combining KL divergence and mutual information came from two observations: a number of attributes showed markedly different probability distributions between the two classes, and we wanted a way to quantify how much each attribute contributes to classifying a tumor.
Judging by the model's performance, the combination of KL divergence and mutual information worked well: the final model correctly identified malignant tumors with a recall of 93% and an overall accuracy of 97%.
For further improvement, I recommend experimenting with more advanced models such as an MLP. Fine-tuning the model's hyperparameters is also recommended to push the recall for the malignant class higher.
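As a pointer only (not part of the experiments above), such an extension could start from scikit-learn's MLPClassifier with a small hyperparameter grid; the grid values below are illustrative placeholders, and X_valid/y_valid refer to the training split from the full code:

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical starting grid; values are illustrative, not tuned results
param_grid = {
    'hidden_layer_sizes': [(32,), (64, 32)],
    'alpha': [1e-4, 1e-3],
}
search = GridSearchCV(MLPClassifier(max_iter=2000, random_state=2024),
                      param_grid, cv=5, scoring='recall')
search.fit(X_valid, y_valid)
print(search.best_params_, search.best_score_)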
Full code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
#load data
df = pd.read_csv('/content/data.csv')
df.head()
#inspect dataset
print(f"Number of columns: {df.shape[0]}")
print(f"Number of rows: {df.shape[1]}")
print('Attribute format')
print(df.info())
#Drop the empty 'Unnamed: 32' column.
df = df.drop(axis=1, columns='Unnamed: 32')
print(f"The number of null records: {df.isnull().sum().sum()}")
print(f"The number of duplicated records: {df.duplicated().sum()}")
plt.figure(figsize=(7,7))
sns.countplot(data=df,x='diagnosis')
plt.title('Target analysis')
plt.show()
attributes = df.columns.difference(['id','diagnosis'])
fig, ax = plt.subplots(len(attributes),2,figsize=(14,len(attributes)*3))
for i, var in enumerate(attributes):
    sns.histplot(data=df, x=var, hue='diagnosis', ax=ax[i,0])
    ax[i,0].set_title(f"The histogram of {var} by diagnosis")
    sns.boxplot(data=df, x='diagnosis', y=var, ax=ax[i,1])
    ax[i,1].set_title(f"The boxplot of {var} by diagnosis")
plt.tight_layout()
plt.show()
not_predictors = ['texture_se', 'symmetry_worst', 'symmetry_se', 'symmetry_mean', 'smoothness_se', 'smoothness_mean',
                  'fractal_dimension_worst', 'fractal_dimension_se', 'fractal_dimension_mean',
                  'concave points_se', 'compactness_se']
potential = list(set(attributes).difference(set(not_predictors)))
from scipy.stats import entropy
def kl_divergence(predictor):
    #Adjust the bins
    p_malignant = df[df['diagnosis']=='M'][predictor]
    q_benign = df[df['diagnosis']=='B'][predictor]
    #Data range
    data_min = min(p_malignant.min(), q_benign.min())
    data_max = max(p_malignant.max(), q_benign.max())
    #Recreate bins
    bins = np.linspace(data_min-1, data_max+1, 200)
    #Compute histogram and normalize
    hist1, _ = np.histogram(p_malignant, bins=bins, density=True)
    hist2, _ = np.histogram(q_benign, bins=bins, density=True)
    #normalize
    eps = 1e-10
    hist1 = hist1/hist1.sum() + eps
    hist2 = hist2/hist2.sum() + eps
    #Compute KL divergence
    kl_div = entropy(hist1, hist2)
    return kl_div
#Create dictionary to store
kl_div = {}
#Compute divergence
for predictor in potential:
    kl_div[predictor] = kl_divergence(predictor)
#Sort attributes by KL divergence, largest first
sorted_kl_divergence = dict(sorted(kl_div.items(), key=lambda item: item[1], reverse=True))
from sklearn.feature_selection import mutual_info_classif
df_spare = df[list(set(potential)|{'diagnosis'})].copy() #add diagnosis
df_spare['diagnosis'] = df_spare['diagnosis'].map({'M':1,'B':0})
# Compute mutual information
mi_scores = mutual_info_classif(df_spare.drop(columns=['diagnosis']),df_spare['diagnosis'])
# Store result
result = dict(zip(df_spare.drop(columns=['diagnosis']).columns,mi_scores))
#Map KL divergence and mutual information into a dataframe
kl_mi = pd.DataFrame(columns=['Attribute','KL','MI'])
kl_mi['Attribute'] = potential
# Add value to KL, MI
for attribute in potential:
    kl_mi.loc[kl_mi['Attribute']==attribute,'KL'] = sorted_kl_divergence[attribute]
    kl_mi.loc[kl_mi['Attribute']==attribute,'MI'] = result[attribute]
print(kl_mi)
#Plot scatter plot
sns.scatterplot(data=kl_mi, x='MI', y='KL')
plt.title('KL divergence versus mutual information')
for i in range(len(kl_mi)):
    plt.annotate(kl_mi['Attribute'][i], (kl_mi['MI'][i], kl_mi['KL'][i]), textcoords="offset points", xytext=(0,10), ha='center', fontsize=7)
plt.xlabel('Mutual information')
plt.ylabel('KL divergence')
plt.tight_layout()
plt.show()
#Predictor choice: keep attributes above the average of either score
choices = kl_mi[(kl_mi["KL"]>kl_mi["KL"].mean())|(kl_mi["MI"]>kl_mi['MI'].mean())]
predictors = list(choices['Attribute'])
#Create a new dataset containing only the selected predictors
df_ready = df[list(set(predictors)|{'diagnosis'})].copy()
#Encode diagnosis
df_ready['diagnosis'] = df_ready['diagnosis'].map({'M':1,'B':0})
#Import essential library
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
#Seed numpy's global RNG, which scikit-learn falls back on when no random_state is given
np.random.seed(2024)
X = df_ready.drop(columns=['diagnosis'])
y = df_ready['diagnosis']
#Split dataset: X_valid/y_valid hold the 80% used for training and cross-validation
X_valid, X_test, y_valid, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2024)
#Scaling data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_valid = scaler.fit_transform(X_valid)
#Initialize the candidate models
models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "XGB": XGBClassifier(eval_metric='logloss'),
}
#Set up stratified 5-fold cross-validation
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2024)
#Result dict:
result_cv = {}
model_scores = {}
#Perform cross-validation, scoring by recall for the malignant class (label 1)
for name, model in models.items():
    cv_scores = cross_val_score(model, X_valid, y_valid, cv=kf, scoring='recall')
    result_cv[name] = np.mean(cv_scores)
    model_scores[name] = model
# Print result
print(result_cv)
#The best model
best_model_name = max(result_cv, key=result_cv.get)
best_model = model_scores[best_model_name]
best_model.fit(X_valid,y_valid)
#Scale X_test
X_test = scaler.transform(X_test)
y_pred = best_model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test,y_pred)}")
cr = classification_report(y_test,y_pred)
print(cr)
cm = confusion_matrix(y_test,y_pred)
plt.figure()
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Benign (0)', 'Malignant (1)'],
            yticklabels=['Benign (0)', 'Malignant (1)'])
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()