AI in Detecting Diseases

Diagnosing breast cancer is a complex process that is unpleasant for the patient. This example presents a possible improvement of that process using machine learning (ML).

The patient goes through several process steps. In one of them, a digitized image of a breast mass is created and analyzed by computer, and the so-called cell nucleus characteristics are measured and recorded. By collecting these characteristics for many patients, both with and without cancer, and feeding the data to an ML model, the model can learn which characteristics are associated with cancer. The attributes needed for ML training are chosen by specialists and computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.

Using an ML model in the decision process not only decreases process time significantly, it can also make the process more reliable by reducing the chance of human error (how much depends, of course, on the accuracy of the ML model). Another advantage is that the input data can be fed to the ML model automatically, which eliminates a very time-consuming manual process step.

The input data

Each record contains a series of attributes and the final diagnosis: whether the patient with these attributes has cancer (malignant tumor) or not. The aim is to collect enough combinations of attribute values that the ML model can be trained well and can then decide very accurately whether a patient has breast cancer.

Two digitized images with the cell nucleus present are shown below.

The different attributes in the data are as follows:
  • Column 1
    • Diagnosis (Malignant=1, Benign=2)
  • Columns 2-31
    • Ten real-valued features are computed for each cell nucleus. The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features:
      • Radius (mean of distances from center to points on the perimeter)
      • Texture (standard deviation of gray-scale values)
      • Perimeter
      • Area
      • Smoothness (local variation in radius lengths)
      • Compactness (perimeter^2 / area - 1.0)
      • Concavity (severity of concave portions of the contour)
      • Concave points (number of concave portions of the contour)
      • Symmetry
      • Fractal dimension ("coastline approximation" - 1)
Figure: digitized images with the cell nucleus present (source: [2]).
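
The three summary statistics described in the attribute list (mean, standard error, and "worst") can be illustrated with a minimal sketch. The function names and the sample values are hypothetical, and the use of the sample standard deviation for the standard error is an assumption, since the article does not specify it. The compactness helper simply restates the formula from the list.

```python
import math
import statistics


def summarize_feature(values):
    """Return the mean, standard error, and "worst" value of one feature.

    `values` holds one measurement per cell nucleus in a single image.
    """
    n = len(values)
    mean = statistics.fmean(values)
    # Standard error of the mean. Using the sample standard deviation
    # (n - 1 denominator) is an assumption; the article does not say.
    standard_error = statistics.stdev(values) / math.sqrt(n)
    # "Worst" = mean of the three largest values, as in the attribute list.
    worst = statistics.fmean(sorted(values)[-3:])
    return {"mean": mean, "se": standard_error, "worst": worst}


def compactness(perimeter, area):
    """Compactness as given in the attribute list: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0


# Hypothetical radius measurements for five nuclei in one image.
radii = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = summarize_feature(radii)
print(summary)
```

Applying `summarize_feature` to each of the ten features yields the 30 numerical columns of one data record.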

The data file (it can be downloaded at the end of the article) has a simple tab-separated format. To use the data in the AI-TOOLKIT, change the extension of the data file to ‘.TSV’ (the AI-TOOLKIT expects this extension for tab-delimited data files).

A fully numerical ML model requires all attributes to be numerical values. In our case there is only one non-numerical attribute: the decision variable, which is the diagnosis of whether the patient has breast cancer or not. Its two possible values can simply be converted to Malignant=1, Benign=2. The AI-TOOLKIT can do this conversion automatically while importing the data (select the ‘Automatically Convert Categorical or Text values’ option), or you can do a text replace in a text editor.
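
For readers who prefer to script the conversion instead of using a text editor, the following is a minimal sketch. It assumes the diagnosis is stored in the first column as the text labels "M" and "B"; these labels, the function name, and the two sample rows are illustrative assumptions, not something the article specifies.

```python
import csv
import io

# Hypothetical text labels; adjust to whatever the data file actually uses.
LABEL_TO_CODE = {"M": "1", "B": "2"}  # Malignant=1, Benign=2


def encode_diagnosis(tsv_text, column=0):
    """Replace the text diagnosis in `column` with its numerical code."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        if not row:
            continue  # skip empty lines
        # Unknown labels are left unchanged so they can be inspected later.
        row[column] = LABEL_TO_CODE.get(row[column], row[column])
        writer.writerow(row)
    return out.getvalue()


# Two illustrative rows with only two of the 30 feature columns.
sample = "M\t17.99\t10.38\nB\t13.54\t14.36\n"
encoded = encode_diagnosis(sample)
print(encoded)
```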

After preparing the input data in the appropriate format (tab separated values) the type of the ML model must be chosen. Let us choose an SVM model for this example.

With an SVM model, the model parameters first need to be optimized. The AI-TOOLKIT can do this automatically with its built-in SVM Parameter Optimization module. It reports the best parameter combination for the input data, which can then be filled in as follows:
    id: 'ID-WFcqHlreYm'
    type: SVM
    path: 'wdbc.sl3'
        - svm_type: C_SVC 
        - kernel_type: RBF 
        - gamma: 15.0 
        - C: 1.779 
        - data_id: 'wdbc' 
        - dec_id: 'decision' 
        - data_id: 'wdbc_t' 
        - dec_id: 'decision'
        - data_id: 'input_data' 
        - dec_id: 'decision'
        - data_id: 'output_data'
        - col_id: 'decision'
After importing the data, defining the data table names (wdbc and wdbc_t) and entering the optimal model parameters the ML model can be trained. 
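
To give a feeling for what the `kernel_type: RBF` and `gamma` settings mean, here is a minimal sketch of the standard radial basis function kernel, K(x, y) = exp(-gamma * ||x - y||^2), as it is commonly defined in SVM libraries. Whether the AI-TOOLKIT uses exactly this form internally is an assumption, and the input vectors are made up for illustration.

```python
import math


def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


GAMMA = 15.0  # the gamma value from the configuration above

# Made-up two-dimensional points, only to show how similarity decays.
same = rbf_kernel([0.2, 0.4], [0.2, 0.4], GAMMA)
near = rbf_kernel([0.2, 0.4], [0.3, 0.4], GAMMA)
far = rbf_kernel([0.2, 0.4], [0.9, 0.4], GAMMA)
print(same, near, far)
```

The kernel value is 1 for identical points and falls toward 0 as the points move apart. A larger gamma makes this decay faster, so each training record influences a smaller neighborhood.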

When training has finished, the AI-TOOLKIT reports the accuracy of the model on the training dataset:

Performance Evaluation Results: TRAINING

  Confusion Matrix [predicted x original] (number of classes: 2):

           (0)    (1)
    (0)    199      0
    (1)      0    342

  Accuracy    100.00%
  Error         0.00%
  C.Kappa     100.00%

                 (0)        (1)
  Precision   100.00%    100.00%
  Recall      100.00%    100.00%
  FNR           0.00%      0.00%
  F1          100.00%    100.00%
  TNR         100.00%    100.00%
  FPR           0.00%      0.00%
The ML model predicts correctly whether the patient has breast cancer or not in all of the training cases. Do not forget, however, that the model still needs to be tested with an appropriate number of data records (attribute sets) that were not seen during training, in order to make sure that the ML model has learned enough about the phenomenon and that it generalizes well!
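
Holding back unseen records for testing can be sketched as a simple random hold-out split. This is only an illustration of the idea, not the AI-TOOLKIT's own procedure; the function name, the fixed seed, and the placeholder records are assumptions. The 569 placeholder records correspond to the 541 training and 28 test records reported in this article.

```python
import random


def holdout_split(records, test_fraction=0.05, seed=0):
    """Shuffle the records and set aside a fraction of them for testing."""
    shuffled = list(records)
    # A fixed seed keeps the split reproducible (illustrative choice).
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder record IDs standing in for the real attribute sets.
records = list(range(569))
train, test = holdout_split(records)
print(len(train), len(test))
```

The test records must not be used in any way during training, otherwise the test accuracy no longer says anything about unseen data.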

If we use 5% of the input data for testing (removing it from the training data) and let the AI-TOOLKIT test the trained ML model with this test data, we get the following results:

Performance Evaluation Results: TEST

  Confusion Matrix [predicted x original] (number of classes: 2):

           (0)    (1)
    (0)     12      1
    (1)      1     14

  Accuracy     92.86%
  Error         7.14%
  C.Kappa      85.64%

                 (0)        (1)
  Precision    92.31%     93.33%
  Recall       92.31%     93.33%
  FNR           7.69%      6.67%
  F1           92.31%     93.33%
  TNR          93.33%     92.31%
  FPR           6.67%      7.69%
The test results are not as good as the training results, but the trained ML model still predicts 26 of the 28 cases correctly, which is a very good result, especially considering that this is new data! The ML model makes 1 mistake by predicting cancer when there is none and 1 mistake by predicting no cancer when there is cancer. Incorrectly predicting cancer is the less severe mistake because the diagnosis can still be checked by a medical doctor, but the mistake of predicting no cancer when there is cancer should be eliminated! This is an important aspect of ML model evaluation in the healthcare sector: not all mistakes have the same weight!
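
The reported test metrics can be recomputed from the confusion matrix with a short sketch. The function names are hypothetical. The cost weights are purely illustrative, and treating class (0) as the malignant class is an assumption, because the article does not state how the output classes map to the diagnosis.

```python
def evaluate(matrix):
    """Compute metrics from a confusion matrix indexed [predicted][original]."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(k)) / total
    predicted_totals = [sum(matrix[i]) for i in range(k)]
    original_totals = [sum(matrix[p][i] for p in range(k)) for i in range(k)]
    precision = [matrix[i][i] / predicted_totals[i] for i in range(k)]
    recall = [matrix[i][i] / original_totals[i] for i in range(k)]
    # Cohen's kappa compares the accuracy with the agreement expected by chance.
    expected = sum(p * o for p, o in zip(predicted_totals, original_totals)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "kappa": kappa}


def weighted_cost(matrix, cost):
    """Sum the mistakes, each multiplied by cost[predicted][original]."""
    k = len(matrix)
    return sum(matrix[p][o] * cost[p][o] for p in range(k) for o in range(k))


test_matrix = [[12, 1], [1, 14]]  # the TEST confusion matrix above
result = evaluate(test_matrix)

# Illustrative weights, assuming class 0 is malignant: a missed cancer
# (predicted 1, original 0) counts ten times as much as a false alarm.
cost = [[0, 1], [10, 0]]
total_cost = weighted_cost(test_matrix, cost)
print(result, total_cost)
```

Comparing models on such a weighted cost instead of plain accuracy is one way to express that a missed cancer is worse than a false alarm.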

The above SVM model can still be improved by adding more data and/or changing the input features. It is of course also possible to choose another ML model, e.g., a neural network model.

The extended performance evaluation results of the AI-TOOLKIT allow us to make a thorough analysis of the performance of the ML model, but this is left as an exercise for the reader.

The trained ML model can be used to support important decisions. The input data could be fed to the ML model automatically, and the results could also be collected automatically. The ML algorithm could even be integrated into different digital devices to make a fully automatic analysis possible.


As we have seen above, an ML model can be very useful in the improvement of business processes. The techniques explained in this article can be used not only in the healthcare sector but in many other sectors too! There are two important considerations when using an ML model:
  1. The attributes and the data records (attribute sets) used to train the ML model are very important. The capabilities of the ML model will depend on the data it gets for learning a specific phenomenon. You can of course add more data and/or attributes and re-train the model. Not only the amount of input data but the selection of the right attributes (features) is also very important.
  2. Extensively testing the ML model is very important in order to make sure that it is trained well in all aspects of the studied phenomenon and that the model generalizes well (performs well on input data not seen during training).


  1. The Application of Artificial Intelligence, Zoltan Somogyi.
  2. Breast Cancer Wisconsin (Diagnostic) Data Set: Dr. William H. Wolberg, General Surgery Dept. University of Wisconsin, Clinical Sciences Center Madison, WI 53792. You can download the dataset here: Breast Cancer Diagnosis data set.

