Cite this asNegahbani M, Joulazadeh S, Marateb HR, Mansourian M (2015) Coronary Artery Disease Diagnosis Using Supervised Fuzzy C-Means with Differential Search Algorithm-based Generalized Minkowski Metrics. Biomed Sci Eng 1(1): 006-014. DOI: 10.17352/abse.000002
Introduction: Coronary Artery Disease (CAD), one of the leading causes of death, is narrowing the walls of the coronary arteries. Angiography is the most accurate but invasive and costly CAD diagnosis method associated with mortality. The aim of this study was to design a computer-based non-invasive CAD diagnosis system.
Methods: In this work, a dataset from Cleveland clinic foundation, containing 303 patients and 20 features, was used. Supervised Fuzzy C-means (SFCM) classification was used to design a classifier for CAD diagnosis. The Generalized Minkowski Metrics (GMM) was used to handle objects containing different measurement scale features. The performance of the SFCM was assessed with/without Statistical Feature Selection (SFS). The weights of the GMM, i.e. the significance of different features, beside other classifier parameters were tuned using Differential Search Algorithm (DSA), and the validity of the proposed classifier was further investigated. The hold-out and 10-fold cross validation were used for the performance assessment.
Result: The average accuracy of the base classifier (SFCM + GMM) was 79% (hold-out validation). It increased to 82% when using SFS. The average accuracy, sensitivity and specificity of the DSA-based classifier were 88%, 86% and 88%, respectively (cross-validation).
Conclusion: The most important features were the number of major vessels colored by fluoroscopy, the family history of CAD, peak exercise systolic blood pressure, maximum exercise heart rate achieved, chest pain type, resting heart rate, Fasting Blood Sugar and gender. This classifier showed substantial agreement with the angiographic results. The hybrid diagnosis system is thus promising. However, it is necessary to improve its reliability.
CAD: Coronary Artery Disease; DSA: Differential Search Algorithm; GMM: Generalized Minkowski Metrics; HDL: High-density lipoprotein; LDL: Low-density lipoprotein; MLR: Multiple Logistic Regression; PSO: Particle Swarm Optimization; SFCM: Supervised Fuzzy C-Means
Coronary Artery disease (CAD), the most common type of heart disease, is one of the leading causes of death in industrialized countries and is rapidly achieving the same dubious distinction in developing nations as well .
CAD is the result of the accumulation of plaques within the walls of the coronary arteries supplying blood to the myocardium . Blockage of one or more coronary arteries interrupts the flow of the blood to the heart, which causes heart attack . The CAD is considered when narrowing of at least one of the coronary arteries is more than 50% .
The CAD risk factors have been identified over the past several decades include abnormal levels of circulating cholesterol, hypertension, cigarette smoking, diabetes, male gender, postmenopausal state, advancing age, sedentary lifestyle, obesity, and a positive family history of premature vascular disease. Moreover, new risk factors have been emerged as elevated blood levels of homocysteine, fibrinogen, inflammation and infection, atherogenic lipoprotein phenotype, elevated levels of lipoprotein, insulin resistance syndrome, psychosocial factors and a number of genetic polymorphisms .
There are several diagnostic tools for CAD [6-8]. Some of the general diagnostic tests include physical examination, lab tests, Electrocardiogram (ECG), echocardiogram, stress test, electron beam computed tomography, coronary angiography and cardiac catheterization. One of the major limitations of ECG is the undiagnosed symptoms of CAD. On the other hand, another alternative invasive methodology, angiogram, is painful and discomfort to the patients. Furthermore, the above mentioned procedures take a lot of cost, time and effort .
Computer aided diagnostic methods which extract relevant features and use them in classifiers for automated detection of diseases, can overcome such difficulties. Such techniques are noninvasive and provide reproducible and objective diagnosis, and hence, can prove to be valuable adjunct tools in clinical practice .
Yan et al. used an improved back propagation algorithm to train the CAD medical diagnosis system . A novel inference engine named fuzzy-evidential hybrid inference engine proposed by Khatibi et al., used Demister–Shafer theory of evidence and fuzzy sets theory to diagnose CAD. This hybrid engine precisely modeled the information’s vagueness and decision making’s uncertainty and through information fusion, provided accurate results . A fuzzy expert system based on particle swarm optimization (PSO) was developed by Muthukaruppan et al., in order to classify heart disease and healthy condition. In the proposed method, the significant attributes and fuzzy rules were extracted using the decision tree algorithm . Giri et al., proposed a methodology for the automatic detection of normal and CAD using heart rate signals. It was shown that Gaussian Mixture Model classifier had the best results among the three other classifiers Support Vector Machine, Probabilistic Neural Network and K-Nearest Neighbor . Using feature selection and extraction algorithm, Alizadehsani et al., enriched the dataset. Then, Information Gain and confidence were used to determine the effectiveness of features on CAD .
In this study, we proposed an automated medical diagnosis system based on the statistical feature selection, supervised fuzzy c-means (SFCM) and Generalized Minkowski Metrics (GMM). Since the features used for CA diagnosis have different measurement scales (nominal, ordinal or interval), a mixed-type data distance metric was used. A statistical feature selection method was used to reduce the feature space. Alternatively, the weights of the input features were tuned on the GMM using a novel stochastic optimization method called Differential Search Algorithm (DSA) and the important features were selected. The data-set, methodologies and the validation procedure will be studied at the following sections.
In this work, the CAD dataset from the University of California (UCI, Irvine), available online, taken from the Cleveland Clinic Foundation datasets [14-17], was used. This database consisted of 303 records with 76 attributes (features), among which 13 to 20 features have been widely used in the literature . The experimental protocol of recording the dataset was mentioned elsewhere in details [14,18]. A number of 303 consecutive patients referred for coronary angiography at the Cleveland Clinic between May 1981 and September 1984, without the history of prior myocardial infarction or known volvuli or cardiomyopathy diseases, participated in the experiment. Different demographic and clinical attributes (some of which were listed in (Table 1)) were recorded from the CAD (case) and healthy (control) subjects . When at least one of the coronary arteries narrowed more than 50%, shown by angiography, the CAD was considered in the subjects . The aim of the study was to design a computer-based CAD diagnosis system using 30 recorded features whose outcome had acceptable agreement with that of coronary angiography. In the next section, the Fuzzy C-means (FCM) data mining techniques are introduced.
Risk factors are the smallest units indicating the existence of a disease. A syndrome, on the other hand, is a collection, a set, or a cluster of concurrent risk factors, which together indicate the presence and the nature of the disease. Here the main question is that what the relation between these risk factors and a specific syndrome is. Classification and clustering are therefore basic concerns in medicine. Classification depends on the definition of the classes and on the required degree of affiliation of their elements .
Clustering algorithms are generally divided into two groups. First, hard partitioning algorithms which are based on classical set theory; they require that an object either does or does not belong to a cluster. Soft clustering methods however, allow the objects to belong to several clusters simultaneously with different degrees of membership . Fuzzy clustering methods are one of the well-known methods of soft clustering which are vastly used in solving medical diagnosis problems. In medicine, there are usually imprecise conditions and highly overlapping classes and therefore fuzzy methods seem to be more suitable than crisp ones .
FCM algorithm was first introduced by Bezdek as the enhancement to the classical K-means clustering . This algorithm estimates the membership function of the object k to the clusters i ( ) to minimize the following cost function :
Where dik is the distance measure of kth data point from ith cluster center and parameters c, n and m ≥ 1 are the number of clusters and objects in the dataset and the fuzzy coefficient, respectively. In the probabilistic FCM, the following constrain must be met when optimizing the above cost function:
The FCM algorithm iteratively estimates the cluster centers and the membership functions to minimize the cost function (Jr) which could be found elsewhere in details .
Despite all the benefits of using FCM as the clustering core algorithm, it is still a blind method and may misclassify the input data. Thus it is necessary to train the algorithm in a way that induces a meaningful convergence. As a result, the classification will be even more accurate.
In the original FCM the data distance to the cluster centers are normally calculated using standard Euclidean distance. In our case, each object is a vector of multiple risk factors with various measurement scales types in different ranges, hence using the classic Euclidean distance is not appropriate . Basically, there are three major data types in clinical data sets: nominal, discrete ordinal, and Interval. Nominal scales are only used for non-ranked qualitative classification e.g. gender, blood type, and health condition. A discrete–ordinal scale is a nominal variable, but the different states are ordered in a meaningful sequence e.g. the slope of the peak exercise ST segment. Interval scales are measured on a linear scale e.g. BMI (Body Mass Index) and age. It is important to define a distance measure to balance all these differences in a way that no feature lessens the other features’ effect or vice versa.
The class labels provide a useful guidance during training procedure. Hence, it is necessary to use the labeled samples in training phase and unlabeled samples in testing phase to improve the performance of FCM. This idea led to the development of a new algorithm called Supervised Fuzzy C-Means (SFCM) algorithm, a slight modification of FCM .
The main goal of SFCM is to use the labeled data samples to guide the iterative optimization procedure. In this method, a known fixed set of categories and category-labeled training data are used to induce a classification function. The determination of fuzzy partition matrix U (dividing N data sets into C classes) using Supervised Fuzzy C-Means clustering is an iterative optimization procedure. The objective function of SFCM classification is defined as:
Where U is the fuzzy partition matrix, V is the cluster center, fik is the membership degree of kth labeled sample belonging to the ith cluster (value is either 0 or 1).
The coefficient ‘a’ denotes the scaling factor. The role of ‘a’ is to maintain a balance between supervised and unsupervised component within the optimization procedure and parameter ‘m’ controls the amount of fuzziness in the classification. The typical value of m is 2 and a=L/n, L denotes the size of labeled samples . However, it is better to tune these two parameters based on the properties of the dataset. Function Jm can take a large number of values, the smallest one being associated with the best clustering.
An effective algorithm for supervised fuzzy classification is discussed herein. The steps of algorithm are as below [20,25]:
1) Initiate fuzzy partition matrix, U(0), with random values between 0 and 1 and fix the number of cluster centers as the number of outcome classes.
2) Start the iterative procedure and set the iteration counter to one.
3) Calculate the cluster centers using the following equation:
Where, Vij(t) represents the ith cluster center of jth feature which j changes from 1 to m (number of features), and Zkj(train) is the kth data instance corresponding to the mth selected feature variable.
4) Calculate the distance between ith cluster center and kth dataset, distance measured with Euclidean Distance as follows:
5) Update the fuzzy partition matrix for the next iteration given by the following equation:
For test set samples, whose class labels are unknown, the fuzzy partition matrix is calculated as follows:
6) When ‖U(t+1) - U(t)‖ ≤ ε (ε is the iterative accuracy) has achieved, Stop the iteration; In this case outputs will be v (cluster center) and U (fuzzy matrix), otherwise return to step 3.
In many classification problems, especially in the biomedical domain, high dimensional data with few observations are used . This can lead to lower classification accuracy and clusters of poor quality. High dimensional data is also a serious problem for many classification algorithms due to its high computational cost, memory usage andthe curse of dimensionality .
Since most of the features are redundant or irrelevant, feature selection method (FS) is used to pick a subset of features that are relevant to the target concept . In this work, a statistical FS method entitled as Multiple Logistic Regression (MLR) was used, which is widely used to identify relevant risk factors in epidemiological studies. MLR, known as feature vector machine in machine learning, can be used to select statistically significant features. It not only considers significant features that provide acceptable discrimination between two classes, but it also takes into account the correlation between features. After running MLR on the input features (excluding the intercept point in the analysis), the selected features were used in the tested classifier [2,29,30].
It is impossible to perform any arithmetic operation on nominal data because it has no order. The only Operation defined here is the equality. The distance of two nominal instances A and B is 1, if A equals B, and 0 otherwise.
For interval scales, it is possible to calculate the distance with standard norm definitions. The distance between two data samples A and B from a given interval I, is defined as |A-B|. As the interval size could be different between multiple features, it is important to normalize the distances relative to the interval size given as |A-B| / |I|.
Discrete–ordinal scale is a nominal variable, but the different states are ordered in a meaningful sequence. Ordinal data has order, but the intervals between scale points may be uneven. But still, the distance of two samples lying in the same interval is computed similar to that of interval distance i.e. |A-B| / |I|. Now, each feature is normalized to a value between 0 and 1.
Note that |I| is calculated by subtracting the maximum and minimum values herein. The l1-norm was then used to combine the distance between different transformed features, simply known as GMM in the literature [31,32]. Thus, the GMM distance definition between two feature vectors A and B, could be given as below:
Where d is the number of dimensions, Ψ is the distance function for each feature which varies according to its measurement scale, and Ck are weights. The weights could be either set to the value of unity or tuned using an optimization algorithm. Accordingly, the features’ weights were set to unity when using GMM+SFCM with/without MLR. Alternatively, instead of using SFS, all of the features were used and their weights were calculated and the features with small value of weights were neglected. For the later approach, it is necessary to use an efficient optimization algorithm, discussed at the next section.
DSA is an optimization algorithm developed by P. Civicioglu simulating the Brownian-like random-walk movement used by an organism to migrate . The motivation of DSA, like many population-based stochastic optimization algorithms, was taken from the nature. Many living organisms show annual migration. In this migration, super organism is constituted containing large number of individuals. The movement of a super organism could be illustrated by a Brownian-like random-walk model [33,34]. In DSA, the population contains random solutions. The migration is performed to the global optimum of the cost function. At each iteration, some of the populations are selected. They move based on a Brownian-like random walk model . DSA is simple to implement and was shown to have acceptable performance in variety of the optimization problems in comparison with that of other traditional optimization algorithms while it is not too sensitive to the initialization of its parameters .
We have used DSA to optimize the parameters of GMM and SFCM. The cost function was the absolute error rate of the classifier on the training set. The initial setting of the DSA used in our study was similar to that of P. Civicioglu . Briefly, the size of the population was set to 30, and the maximum number of function evaluation value (i.e. the number of times that the cost function was called in the program) as the only stopping criterion was 2,000,000.
The performance of a classifier could be evaluated by computing the number of correctly recognized CAD subjects (TP: True Positives), the number of correctly recognized healthy subjects (TN: True Negatives), and examples that either were incorrectly assigned to the CAD class (FP: False Positives) or that were missed as class examples (FN: False Negatives). These four counts constitute the information-theory formulas to accurately measure the performance of the classification [35,36].
The demographic information of the Cleveland CAD data was shown in (Table 1) for the case (CAD) and control (healthy) groups. To assess the performance of the base classifier, the main dataset was randomly divided into two roughly equal size datasets, namely dataset 1 and dataset 2 (hold-out validation method ). The best Accuracy (Acc) achieved when tuning on the dataset 1 and testing on the dataset 2 and vice versa in ten runs of the SFCM algorithm with GMM distance measure (unity weights) was shown in (Table 2). The maximum number of iterations was set to 100 in all the classifiers. The scaling factor (a) was tuned in the training set using exhaustive grid search (a=0.8) while the value of the fuzziness parameter (m) was set to 2 in the base classifier. The overall percentage accuracy (the average Acc of the classifier on dataset 2 when it was tuned on dataset 1 and vice versa) in the base classifier was 79%. The average Sensitivity and Specificity of the base classifier were 71% and 84%, respectively.
Table 3 shows the results of running the algorithm with SFS. MLR revealed that following significant features: gender, cp (chest pain type), trestbps (resting systolic blood pressure), thalach (maximum exercise heart rate achieved), slope (the slope of the peak exercise ST segment), ca (number of major vessels (0-3) colored by fluoroscopy), and thal (thallium-201 stress scintigraphy). The overall accuracy for SFCM-SFS was 82%. The average Sensitivity and Specificity of this classifier were 85% and 82%, respectively.
Finally, DSA optimization method was used to tune the features’ weights, the fuzziness parameter and the scaling factor. The cost function was set as the absolute error rate of the classifier (SFCM+GMM) (i.e. 1-Acc) on the training set. Guarding against Type III error , 10-fold cross-validation  was used to assess the performance of the proposed hybrid classifier (Table 4). The average Accuracy, Sensitivity and Specificity of this classifier were 88%, 86%, and 88% respectively. The McNemar’s test  revealed that the performance of the hybrid classifier was significantly better than the base classifier (p_value<0.05) but comparable with that of SFS+SFCM. Meanwhile, Multi-fold cross validation was used instead of leave-one-out, since it is proven to have better performance in terms of accuracy and efficiency .
The values of the parameters of the hybrid classifier tuned using DSA were shown in Table 5. The most important features (feature weight w>0.5) were listed in the descending order: the number of major vessels colored by fluoroscopy (w=1), the family history of CAD (w=1), peak exercise systolic blood pressure (w=0.89), maximum exercise heart rate achieved (w=0.87), chest pain type (w=0.82), resting heart rate (w=0.67), Fasting Blood Sugar (w=0.64) and gender (w=0.54). However, the list significant attributes (w<0.1) were age, resting blood pressure, number of cigarettes per day, resting electrocardiographic results, peak exercise diastolic blood pressure, and the slope of the peak exercise ST segment. For the definition, and number of categories of the above attributes the reader is referred to (Table 1).
The overall performance of the hybrid classifier was shown in Table 6. It includes the contingency table (confusion matrix) on the total of 303 subjects. The agreement rate between the results of this classifier and those of the gold standard (i.e. CAD diagnosis using angiography) was assessed based on the Cohen’s kappa coefficient . Substantial agreement was shown between the outcomes of the proposed hybrid classifier and angiography (kappa=0.73) .
In this paper, three classification systems were designed for non-invasive CAD diagnosis; among which the hybrid classifier showed better performance (Tables 2-4). This diagnosis system was based on the SFCM classifier in which the distance between objects were calculated using GMM and the parameters of the system (SFCM parameters and GMM weights) were estimated using DSA optimization. Other approaches such as PSO  were used for optimization, but DSA showed more accurate results. The Type I error and the power of the hybrid classifier were 0.1 and 86%, respectively. Since the data-set was not totally balanced (i.e. the number of cases and controls were not identical), F1-score measure might be more accurate than the accuracy. The average F1-score during 10-fold cross-validation was 85±10 (%), indicating that the proposed system is accurate. The comparison between the performance of the postposed system and some of the other systems designed on the CAD dataset was shown (Table 7). Some methods had higher accuracy that n that of the proposed system. We compared the result of the method proposed by Muthukaruppan et al. . Although its accuracy was 93%, McNamara’s test showed that it was not significantly higher than our hybrid system (p_value>0.05). Another issue is that among the methods listed in (Table 7), those in which Fuzzy classification was used, showed higher accuracies. Since most or all classificatory concepts in medicine are fuzzy, fuzzy taxonomy was used in our study. Meanwhile, it is very difficult to define sharp borders between various symptoms in the set of all symptoms and between various diseases in the set of diseases . Thus, the framework of fuzzy systems is very useful to deal with the absence of sharp boundaries of the sets of symptoms, diagnoses, and phenomena of diseases [19,42].
The significant features selected by the DSA, were known to be directly involved in CAD. Fluoroscopy is one of the most popular non-invasive CAD diagnosis methods whose accuracy ranges between 35% and 75% in comparison with that of the gold standard (i.e. angiography) in the literature [43,44]. We permed a univariate (i.e. the number of major vessels (0-3) colored by fluoroscopy) classification based on the Receiver Operating Characteristic (ROC) plot. Its accuracy was 75% (Area under Curve: AUC=0.75; cut-off=0.5). The average number of vessels colures were statistically different in the CAD and normal group (independent-samples t-test; p_value 0.05). High value of GMM weights are in agreement with the statistical test. Thus, it was a suitable feature but not enough for accurate classification. The other traditional non-invasive CAD diagnosis method is thallium-201 stress scintigraphy. The prevalence of CAD in three groups of scintigraphy was statistically different (Chi-square test; p_value<0.05). Having designed a decision-tree classifier with scintigraphy feature, the accuracy was 76%. However, due to the directional correlation between fluoroscopy and scintigraphy (Eta=0.3), DSA estimated the fluoroscopy and the scintigraphy wegihts as 1.00 and 0.28. The other clinical variable is ST segment depression used in cardiography. Its eight was zero, indicating that no further information could be extracted by adding this variable. In the lietrature, the ranked order of CAD predictive were cardiac fluoroscopy score, thallium score and extent of Electrocardiography (e.g. ST segment depression) [45,46] which is in agreement with our findings.
CAD is associated with higher morbidity and mortality in women than in men . It was also shown that the incidence of CAD in women aged less than 70 years is lower than their male counterparts . In our study, the percentage of men and women having CAD were 84% and 16%, respectively. Considering that women in the CAD group had the age of 66 years old or lower, this is in agreement with our study. However, women usually have CAD 7 to 10 years later than men . In our data-set, the average age of women and men who had CAD was 60±5 and 55±8 years, respectively. Moreover, gender was a significant feature (w=0.539). Meanwhile, the age by itself was not a significant feature in our study (Table 5; w=0.085). This is in agreement with the fact that the average age of people in the CAD and normal groups was 56±8 and 53±9 years, respectively (Table 1). This is related to the stratified age sampling used in our study.
Although, it is proved that high blood pressure increases the risk of CAD , it was not significant in our study. Meanwhile elevated resting heart rate is known as a CAD risk factor, which is in agreement with our findings (w=0.674) . In the literature, the family history of CAD is a major CAD risk factor in adults . This is in agreement with our findings where it had the highest GMM weight (w=1; Table 5). Meanwhile, fasting blood sugar was known as an important determinant of CAD , in agreement with our findings where its GMM weight was estimated as 0.641. A high total cholesterol level can increase your risk of cardiovascular disease. However, decisions about when to treat high cholesterol are usually based upon the level of LDL or HDL cholesterol, rather than the level of total cholesterol. This might explain the fact that the weight of the cholesterol was 0.289 in our study. Moreover, total cholesterol/ high-density lipoprotein cholesterol ratio were shown to be associated to CAD rather than cholesterol, by itself [54,55].
Chest pain type (GMM weight of 0.818) was divided into the following categories: Typical angina pectoris, atypical angina, non-anginal pain, and no pain. Typical angina (pain that occurs in the anterior thorax, neck, shoulders, jaw, or arms is precipitated by exertion and relieved within 20 min by rest) was the most common symptom of CAD . It occurs when blood flow to an area of heart is decreased, impairing the delivery of oxygen and vital nutrients to the heart muscle cells. The byproduct of using this inefficient fuel is producing lactic acid that builds up in the muscle and causes pain .
The key to a good classification is a dataset containing all the possibly relevant features (i.e. risk factors mentioned in the literature) with enough cases (i.e. suitable sample size). Although the sample size of the Cleveland dataset is rather high, some important features such as BMI, LDL and HDL are missing. Meanwhile, we are going to design an automated CAD risk assessment program, based on the findings of this study, in collaboration with Isfahan Healthy Heart Program . Such a large database, could allow us to investigate the accuracy of the proposed diagnosis system in a broader sense. Another issue is that the performance of the base classifier with/without FS (Tables 2,3) was so different in the first and second scenarios. Having calculated the cluster representatives for the healthy and CAD groups in the first and second datasets, the dataset 1 showed better discrimination in comparison with dataset2 on the whole 20 features and also those selected by the MLR. This is why that performance of the base classifier with/without FS was higher on the data set 1 in the entire training and test procedure. Also, the discrimination with/without FS was not that different. This, in fact, shows that the FS could have selected features with significant discrimination power.
Another step would be developing a web-based online system with which patients/ medical doctors could assess their risk of having CAD at home. These Web-based diagnostic decision support systems have been recently focused in Medicine and are proven to be valuable in identifying the correct diagnosis in complicated cases . There might be two possible approaches to improve the performance of the proposed diagnosis system. First, further features could be defined by considering the interactions between input risk factors/predictors  e.g. simply multiplication of the predictors. Second, multiple clusters could be formed for each healthy and CAD class by using mixed-type data clustering methods . Then, supervised FCM could be used with multiple clusters corresponding with two healthy and CAD classes. Extracting supervised classification rules on groups of similar objects could potentially reduce the misclassification rate especially close to the class borderlines. These two approaches will be the focus of our future work.
The hybrid classifier showed the average accuracy of 87%. The power of the designed diagnosis system was 86%. Type I error (α) was 0.1 and the F-score was 85%. Although the power of the method is acceptable, type I error must be reduced down to 0.05, to introduce a reliable and accurate clinical test which is the focus of the future work. One possible strategy to improve the accuracy of the proposed diagnosis system is using classifier fusion. Combining different reliable classifiers, might improve the accuracy though the fusion procedure. In conclusion, we designed an automated non-invasive CAD diagnosis system based on the Fuzzy theory. The results showed that the proposed system is promising. However, further improvements are needed to be able to use it in clinical laboratories.
This work was supported by the University of Isfahan (MN, SJ, HM) and Isfahan University of medical Sciences (MM).
Subscribe to our articles alerts and stay tuned.