Cervical cancer is a cancer that develops in the cells of the cervix. When it is detected early, it can be cured or its effects can be reduced to a great extent. This article discusses how to detect cervical cancer using artificial intelligence and machine learning techniques.

This article was originally published by AI Technology & Systems.


The first step is to find a reliable platform with efficient GPUs to run our code. For our project, we used the Cainvas AITS Platform.

Next, we import the packages required for the following tasks:
1. Importing the data
2. Visualizing the data
3. Pre-processing the data
4. Training on the data
5. Evaluating the performance of the trained model

# Import all the necessary libraries

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt

import seaborn as sns
import numpy as np
import os
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

Now, we will unzip the data and load the file into a DataFrame using the pandas library. After loading the data, we display the first 5 rows to inspect it.
The most important task in pre-processing such data is to check for NA/null values and handle them. On inspecting our data, we find several null values in various columns. To deal with them, we replace each null value with the mean value of its column.
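
The loading and mean-imputation steps described above can be sketched as follows. This is a minimal sketch, not the article's original code: the helper name `load_and_impute` is hypothetical, the assumption that missing entries are marked with `?` is ours, and the in-memory sample stands in for the real file with made-up values.

```python
import io

import pandas as pd


def load_and_impute(source):
    """Read a CSV and fill missing values with the mean of each column."""
    # na_values='?' is an assumption about how the file marks missing entries
    data = pd.read_csv(source, na_values='?')
    # Force every column to numeric so that column means can be computed
    data = data.apply(pd.to_numeric, errors='coerce')
    # Replace each missing value with the mean of its own column
    return data.fillna(data.mean())


# Tiny in-memory stand-in for the real file (values are made up for illustration)
sample_csv = io.StringIO(
    "Age,Number of sexual partners,Smokes\n"
    "18,4,0\n"
    "15,?,0\n"
    "34,1,?\n"
    "52,5,1\n"
)

data = load_and_impute(sample_csv)
print(data.head())          # first 5 rows (here, all 4 sample rows)
print(data.isnull().sum())  # number of missing values left in each column
```

With the real dataset, the path of the unzipped CSV file would be passed to the helper in place of `sample_csv`.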

# Convert the integer columns to float64
float_columns = ['Age', 'STDs: Number of diagnosis', 'Dx:Cancer', 'Dx:CIN', 'Dx:HPV',
                 'Dx', 'Hinselmann', 'Schiller', 'Citology', 'Biopsy']
data = data.astype({column: float for column in float_columns})

The conversion above is needed because, on inspecting the data types of our data, we find several columns, such as Age, Citology, and Biopsy, that are stored as integers and need to be converted to float.
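
As an alternative to listing the columns by hand, the integer columns can be found programmatically. The sketch below assumes nothing about the real file beyond the column names mentioned in the article; the sample values are made up.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Made-up frame standing in for the real data
data = pd.DataFrame({
    'Age': [18, 15, 34],
    'Smokes': [0.0, 1.0, 0.0],
    'Biopsy': [0, 0, 1],
})

print(data.dtypes)  # inspect the data type of each column

# Collect the integer columns and convert them all to float in one step
int_columns = [col for col in data.columns if is_integer_dtype(data[col])]
data = data.astype({col: float for col in int_columns})

print(data.dtypes)
```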

# Visualise the distribution of each column
data.hist(figsize=(18, 18))
plt.show()

# Correlation heatmap across all columns, with the colormap centred on zero correlation
plt.figure(figsize=(15, 12))
sns.heatmap(data.corr(), annot=True, linewidths=2, center=0)
plt.show()

Since the objective of our project is to detect cervical cancer, we create a column titled ‘result’ which contains binary values 0 and 1 as labels for ‘No Cancer’ and ‘Cancer’.
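
The article does not show how the 'result' column is built. One plausible construction, sketched below, marks a patient as positive when any of the four screening columns (Hinselmann, Schiller, Citology, Biopsy) is positive. This rule is an assumption on our part; the intermediate 'count' column is inferred only from the fact that a column with that name is dropped later on. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up screening outcomes standing in for the real data
data = pd.DataFrame({
    'Hinselmann': [0.0, 0.0, 1.0],
    'Schiller':   [0.0, 1.0, 1.0],
    'Citology':   [0.0, 0.0, 0.0],
    'Biopsy':     [0.0, 0.0, 1.0],
})

screening_columns = ['Hinselmann', 'Schiller', 'Citology', 'Biopsy']

# Assumed rule: count the positive screening tests per patient ...
data['count'] = data[screening_columns].sum(axis=1)
# ... and label the patient 1 ('Cancer') if at least one test is positive, else 0 ('No Cancer')
data['result'] = np.where(data['count'] > 0, 1, 0)

print(data)
```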

In this step, we will use several plotting techniques to check for correlation between various columns of the data. Let us visualize the relationship between Age, Number of Sexual Partners, and other columns such as Biopsy and Schiller as hue (colour) parameters.
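
A plot of this kind could look like the sketch below, which uses seaborn's `scatterplot` with a `hue` column. The helper name `plot_age_vs_partners` is hypothetical, the column name 'Number of sexual partners' is assumed, and the sample values are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_age_vs_partners(data, hue):
    """Scatter Age against number of sexual partners, coloured by the `hue` column."""
    fig, ax = plt.subplots(figsize=(8, 6))
    # 'Number of sexual partners' is the assumed name of the column
    sns.scatterplot(data=data, x='Age', y='Number of sexual partners', hue=hue, ax=ax)
    ax.set_title('Age vs. number of sexual partners, coloured by ' + hue)
    return ax


# Made-up sample standing in for the real data
sample = pd.DataFrame({
    'Age': [18, 25, 34, 41, 52],
    'Number of sexual partners': [4, 2, 1, 3, 5],
    'Biopsy': [0, 0, 1, 0, 1],
    'Schiller': [0, 1, 1, 0, 1],
})

# One figure per hue column
for hue_column in ['Biopsy', 'Schiller']:
    ax = plot_age_vs_partners(sample, hue_column)
plt.show()
```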

Next, we drop the irrelevant columns, keeping only those that are useful for training the model, and visualize the relationships among the remaining columns with a heatmap.

data_final = data.drop(columns = ['Hinselmann', 
                                  'Schiller', 
                                  'Citology', 
                                  'Biopsy', 
                                  'count', 
                                  'STDs:condylomatosis',
                                  'STDs:cervical condylomatosis',
                                  'STDs:vulvo-perineal condylomatosis',
                                  'STDs:syphilis',
                                  'STDs:pelvic inflammatory disease', 
                                  'STDs:genital herpes',
                                  'STDs:molluscum contagiosum',
                                  'STDs:AIDS', 'STDs:HIV',
                                  'STDs:Hepatitis B', 'STDs:HPV', 
                                  'STDs: Number of diagnosis',
                                  'Dx:Cancer', 'Dx:CIN', 'Dx:HPV'                                  
                                 ])

y = data_final['result']
X = data_final.drop(columns = ['result'])



# Plotting a heatmap/correlation plot to see how different values are related to each other
plt.figure(figsize=(27,24))
sns.heatmap(data_final.corr(),annot=True,linewidths=2)
plt.show()

After visualizing the data, we move on to the next step of our classification task: pre-processing the data further so that it can be fed into the neural network model. We split the data into training and testing sets with a test size of 30%. We will use the test data to validate our model.

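After splitting the data, we use StandardScaler to scale it. StandardScaler standardizes each feature to zero mean and unit variance. We fit it on the training set only and reuse the same statistics to transform the test set.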

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, stratify = y)

#Feature Scaling

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


# Getting the Final Data Shapes


print("Shape of Training Data")
print ("X = ",X_train.shape)
print ("Y = ",y_train.shape, "\n")


print("Shape of Testing Data")
print ("X = ",X_test.shape)
print ("Y = ",y_test.shape)

Now, we get to the exciting part: designing the architecture of our neural network. We start by initializing a Sequential() model and adding Dense() layers to it. Our training data has 15 columns, so we set the input dimension to 15. Since there are 2 target classes, the final Dense layer has 2 units.
We compile the model using the Adam optimizer with a learning rate of 0.0000001 and set the loss function to ‘categorical crossentropy’.
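
The article does not list the exact layers, so the following is only a sketch of a model matching the description: 15 inputs, a 2-unit output, and Adam with a learning rate of 0.0000001. The hidden layer sizes, the dropout rate, and the activations are illustrative assumptions. The article names categorical crossentropy, which expects one-hot labels; because the labels in the code shown are plain integers (0 and 1), this sketch swaps in the sparse variant, `sparse_categorical_crossentropy`, which accepts integer labels directly.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.optimizers import Adam

model = Sequential([
    Input(shape=(15,)),              # 15 feature columns, as stated in the article
    Dense(64, activation='relu'),    # hidden layer size is an illustrative assumption
    Dropout(0.2),                    # dropout rate is an illustrative assumption
    Dense(32, activation='relu'),    # hidden layer size is an illustrative assumption
    Dense(2, activation='softmax'),  # one output unit per class
])

model.compile(
    optimizer=Adam(learning_rate=0.0000001),  # learning rate stated in the article
    loss='sparse_categorical_crossentropy',   # swapped in for integer labels (see above)
    metrics=['accuracy'],                     # needed so 'val_accuracy' can be monitored
)

model.summary()
```

Tracking the `accuracy` metric at compile time is what makes `val_accuracy` available to a callback such as EarlyStopping during training.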

After compilation, we train the model on the training data and use the testing data for validation. With a batch size of 5 and up to 100 epochs, we achieve 88% accuracy on the training data and a peak validation accuracy of about 86%. We also use an EarlyStopping callback that stops training when the validation accuracy has not improved for 5 epochs. Plotting the metrics against the number of epochs lets us visualize the training performance of our model.

es = EarlyStopping(monitor = 'val_accuracy', patience = 5)

# Run the model for a batch size of 5 for 100 epochs
history = model.fit(X_train, 
                    y_train, 
                    validation_data = (X_test, y_test),
                    batch_size = 5,
                    epochs = 100,
                    callbacks = [es]
                   )

# Function to plot "accuracy vs epoch" and "loss vs epoch" graphs for training and validation data
def plot_metrics(history, metric = 'accuracy'):
    # Any value other than 'loss' plots accuracy
    key = 'loss' if metric == 'loss' else 'accuracy'
    plt.title("Loss Values" if key == 'loss' else "Accuracy Values")
    plt.plot(history.history[key], label = 'train')
    plt.plot(history.history['val_' + key], label = 'test')
    plt.legend()
    plt.show()


plot_metrics(history, 'accuracy')
plot_metrics(history, 'loss')

The last step is to save our model and evaluate its performance further on the testing data by making predictions on it and evaluating using a classification report.
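
The evaluation step can be sketched as below. The helper name `report_from_probabilities` and the model file name are hypothetical, and the probabilities are made up so that the sketch runs without a trained model. It assumes the model outputs one probability per class, so the predicted label is the index of the largest value.

```python
import numpy as np
from sklearn.metrics import classification_report


def report_from_probabilities(y_true, probabilities):
    """Turn per-class probabilities into labels and build a classification report."""
    # The predicted class is the index of the highest probability in each row
    y_pred = np.argmax(probabilities, axis=1)
    report = classification_report(
        y_true, y_pred,
        labels=[0, 1],
        target_names=['No Cancer', 'Cancer'],
        zero_division=0,
    )
    return y_pred, report


# With the trained model, the calls would look like this
# (the file name is a hypothetical placeholder):
# model.save('cervical_cancer_model.h5')
# y_pred, report = report_from_probabilities(y_test, model.predict(X_test))

# Made-up labels and probabilities so the sketch runs on its own
y_true = np.array([0, 0, 1, 1])
probabilities = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]])

y_pred, report = report_from_probabilities(y_true, probabilities)
print(report)
```

The report lists precision, recall, and F1-score per class, which is more informative than accuracy alone when the two classes are imbalanced.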

On checking the performance of our model, we observe that it performs well on this data, reaching about 88% training accuracy and a peak validation accuracy of about 86%. Using machine learning techniques to contribute to the healthcare infrastructure of our society is a valuable use of this technology.

About the author 

Radiostud.io Staff

Showcasing and curating a knowledge base of tech use cases from across the web.
