Cervical Cancer Behavioral Risk

Cervical cancer (Ca Cervix) is a serious public health problem for women worldwide. Fortunately, the disease can be prevented. However, current prevention methods still perform poorly, both in outcomes and in participation, so methods for prevention and early detection remain an open challenge. Behavior and its determinants are promising predictors for Ca Cervix screening and early detection.

On this occasion, we will try to classify respondents' cervical cancer risk using Logistic Regression and a Support Vector Machine classifier.

About the Dataset

The dataset we use comes from https://archive.ics.uci.edu/ml/datasets/Cervical+Cancer+Behavior+Risk.

The dataset contains 19 attributes concerning cervical cancer behavioral risk. The class label is ca_cervix, with values 1 and 0 indicating respondents with and without cervical cancer, respectively.

Import Library

We start by importing the libraries we’ll use.
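Since the original import cell isn't shown, here is a plausible set of imports for this walkthrough: pandas and NumPy for data handling, and scikit-learn for scaling, splitting, modelling, and evaluation.

```python
# Assumed imports for this walkthrough (the original notebook cell was not shown).
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import f1_score
```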

Read the Data

First, we load the data.

Then we get an overview and the data information.
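The loading and inspection cells aren't shown, so here is a minimal sketch. The filename is an assumption (the UCI repository distributes this dataset as `sobar-72.csv`); `df.head()` gives the overview and `df.info()` the data information.

```python
import pandas as pd

def load_data(path):
    """Load the cervical cancer behavior risk CSV into a DataFrame."""
    return pd.read_csv(path)

# Assumed filename, as distributed on the UCI repository:
# df = load_data("sobar-72.csv")
# df.head()   # overview of the first rows
# df.info()   # column dtypes and non-null counts
```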

From the data information, we can see that there are no missing values and every column is already in integer form, so the data preprocessing is essentially complete.

Specify features and targets

On this occasion, we will use all existing attributes as features. For the target, we will use the ca_cervix column, which consists of two classes: 0 (without cervical cancer) and 1 (with cervical cancer).
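This step can be sketched as dropping the label column to get the features and keeping it as the target (assuming the dataset has been loaded into a DataFrame `df`):

```python
import pandas as pd

def split_features_target(df, target_col="ca_cervix"):
    # All attributes except the label become the feature matrix X;
    # the ca_cervix column becomes the target vector y.
    X = df.drop(columns=[target_col])
    y = df[target_col]
    return X, y
```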

Normalizing the dataset

Let’s normalize the dataset
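The exact scaler used isn't shown; a common choice for "normalizing" features is scikit-learn's MinMaxScaler, which rescales every feature into the [0, 1] range, so that is what this sketch assumes.

```python
from sklearn.preprocessing import MinMaxScaler

def normalize(X):
    # Rescale each feature column to the [0, 1] range.
    # (MinMaxScaler is an assumption; the article does not name the scaler.)
    scaler = MinMaxScaler()
    return scaler.fit_transform(X)
```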

Split the Dataset

We split the dataset
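A typical split uses scikit-learn's train_test_split; the 80/20 ratio and random seed below are assumptions, and the random placeholder data merely mimics the dataset's shape (72 rows, 19 features).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data shaped like the dataset (72 respondents, 19 attributes).
X = np.random.rand(72, 19)
y = np.random.randint(0, 2, 72)

# Assumed 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)
```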

Modelling

1. Logistic Regression

Let’s build our model using LogisticRegression from the scikit-learn package. This class implements logistic regression and can use different numerical optimizers to find the parameters, including the solvers ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, and ‘saga’. You can find complete information about the pros and cons of these solvers in the scikit-learn documentation.

The scikit-learn version of Logistic Regression supports regularization, a technique used to combat overfitting in machine learning models. The parameter C represents the inverse of the regularization strength and must be a positive float.

We estimate a good value of C and leave the other parameters at their defaults.
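One way to estimate C is a cross-validated grid search; the grid, the CV folds, and the synthetic stand-in data below are assumptions that only illustrate how a value such as 0.3 could be selected.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data shaped like the dataset (72 rows, 19 features).
X_demo, y_demo = make_classification(n_samples=72, n_features=19, random_state=0)

# Assumed candidate values for C; search with 5-fold CV on the F1 score.
param_grid = {"C": [0.01, 0.1, 0.3, 1, 3, 10]}
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)
print(search.best_params_)
```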

Let’s use C=0.3 for the logistic regression model.

We use the F1 score to evaluate the model.
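Fitting with C=0.3 and scoring with F1 might look like the sketch below; the synthetic data and split settings are placeholders for the real preprocessed train/test sets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Placeholder data and split standing in for the real preprocessed dataset.
X, y = make_classification(n_samples=72, n_features=19, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Fit logistic regression with C = 0.3, then evaluate with the F1 score.
lr = LogisticRegression(C=0.3, max_iter=1000).fit(X_train, y_train)
score = f1_score(y_test, lr.predict(X_test))
print(score)
```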

2. Support Vector Machine

The SVM algorithm offers a choice of kernel functions for performing its processing. Basically, mapping data into a higher-dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and it can be of different types, such as:

1. Linear
2. Polynomial
3. Radial basis function (RBF)
4. Sigmoid

Each of these functions has its own characteristics, pros and cons, and equation.

The other parameters that are important for an SVM are gamma, which corresponds to the inverse of the width of the Gaussian kernel, and C, which is a regularization parameter.

We estimate the best C and gamma, and leave the other parameters at their defaults.
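A cross-validated grid search over C and gamma for an RBF-kernel SVM could look like this; the grids and the synthetic stand-in data are assumptions, since the article's exact ranges are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data shaped like the dataset (72 rows, 19 features).
X_demo, y_demo = make_classification(n_samples=72, n_features=19, random_state=0)

# Assumed candidate grids for C and gamma; 5-fold CV on the F1 score.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)
print(search.best_params_)
```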

Let’s use the best parameters for the SVM model.

Again, we use the F1 score to evaluate the model.
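Fitting the SVM with the chosen parameters and scoring with F1 can be sketched as follows; the values of C and gamma here, like the data, are illustrative placeholders rather than the article's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data and split standing in for the real preprocessed dataset.
X, y = make_classification(n_samples=72, n_features=19, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Illustrative C and gamma; substitute the values found by the search.
svm = SVC(kernel="rbf", C=10, gamma=0.01).fit(X_train, y_train)
score = f1_score(y_test, svm.predict(X_test))
print(score)
```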

From both calculations and the modelling above, we get the same F1 score. So, which model should we choose?

Let’s calculate the score on the training and test sets for each model.
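Comparing training-set and test-set accuracy reveals which model generalizes better: a large gap between the two suggests overfitting, while similar scores suggest good generalization. In this sketch the data and hyperparameters are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data and split standing in for the real preprocessed dataset.
X, y = make_classification(n_samples=72, n_features=19, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Score each model on both splits; a big train/test gap indicates overfitting.
results = {}
for name, model in [("LogReg", LogisticRegression(C=0.3, max_iter=1000)),
                    ("SVM", SVC(kernel="rbf", C=10, gamma=0.01))]:
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train),
                     model.score(X_test, y_test))
    print(name, results[name])
```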

So, in this case, we prefer Logistic Regression as the model because it generalizes better than the SVM.

It would be better to also try other classifiers, such as KNN, Decision Tree, and Random Forest, to find the most suitable classifier for this dataset.

Thank you.
