Abstract
Introduction: Osteoporosis is a disease that reduces bone density and degrades the quality of bone microstructure, leading to an increased risk of fractures. It is one of the major causes of disability and death in elderly people. The current study aims to determine the factors influencing the incidence of osteoporosis and to provide a predictive model for diagnosing the disease, in order to increase diagnostic speed and reduce diagnostic costs.
Methods: Individuals' data, including personal information, lifestyle, and disease information, were reviewed. A new model is presented based on the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology. Support Vector Machine (SVM) and a Bayesian method (Tree Augmented Naïve Bayes, TAN) were used for modeling, with Clementine 12 as the data mining tool.
Results: Several features were found to affect the disease. Rules were extracted that can be used as a pattern for predicting patients' status. Classification precision was 88.39% for SVM and 91.29% for TAN; the precision of TAN was higher than that of the other methods.
Conclusion: In this study, lactation duration, history of osteoporosis, calcium intake, immune-suppressor drugs, hyperlipidemia drugs, autoimmune diseases, number of pregnancies, hyperlipidemia, vitamin D, hyperparathyroidism, exercising during the week, anti-inflammatory drugs, thalassemia, lumbar disc, anticoagulant drugs, hypothyroidism, hypertension drugs, history of surgery, diabetes, and diabetes-related drugs were identified as important factors in relation to osteoporosis. These factors can be used to predict the possibility of osteoporosis for a new individual with known characteristics.
Keywords: Osteoporosis, Data mining, Support vector machine, Bayesian network
Introduction
Osteoporosis is a disease that reduces bone density and degrades the quality of bone microstructure, increasing the risk of bone fractures (1,2,3). The disease has few symptoms and often becomes apparent only when bones break easily. The World Health Organization (WHO) has defined osteoporosis as a bone density that lies 2.5 standard deviations or more below the average value for young healthy people (4,5). According to WHO, more than 9.8 million fractures occur annually due to osteoporosis. It is estimated that osteoporosis will affect 200 million women worldwide (almost 1 out of 10 women aged over 60 years, 1 out of 5 women aged over 70 years, 2 out of 5 women aged over 80 years, and 2 out of 3 women aged over 90 years) (6,7,8). A recent study carried out by the Ministry of Health and Medical Education (MOHME) in Iran showed that osteoporosis affected 1 out of 4 Iranian women aged over 50 years (9).
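The WHO criterion above is usually expressed as a T-score: the number of standard deviations by which a measured bone mineral density (BMD) differs from the young healthy reference mean. The following is a minimal sketch of that calculation. The function names and the reference mean and standard deviation in the usage lines are hypothetical placeholders, not values from this study.

```python
def t_score(bmd, young_mean, young_sd):
    """Standard deviations between a BMD value and the young-adult reference mean."""
    if young_sd <= 0:
        raise ValueError("young_sd must be positive")
    return (bmd - young_mean) / young_sd


def meets_who_osteoporosis_criterion(bmd, young_mean, young_sd, threshold=-2.5):
    """True when the T-score lies at or below the threshold (2.5 SD below the mean)."""
    return t_score(bmd, young_mean, young_sd) <= threshold


# Hypothetical reference values in g/cm^2, for illustration only.
print(meets_who_osteoporosis_criterion(0.60, young_mean=1.00, young_sd=0.12))
print(meets_who_osteoporosis_criterion(0.90, young_mean=1.00, young_sd=0.12))
```

In practice the reference mean and standard deviation depend on the measurement site and the reference population, so they would have to come from the densitometry protocol in use.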
Genetic (10,11) and racial factors are among the most significant factors affecting bone density (12,13,14). Physiological and environmental factors and lifestyle, such as a diet low in vitamin D and calcium, alcohol consumption and smoking, excess protein intake, caffeine, salt, and physical inactivity, also affect osteoporosis (15,16,17) and can significantly influence whether maximum bone density is achieved and maintained throughout life (18,19). Some of these factors include the effect of sex hormones in adolescence (20), suitable nutrition, and total body weight (21,22), as well as physical activity (23,24). Determining the risk factors for osteoporosis in different societies and countries and at various socioeconomic levels can help in planning osteoporosis prevention programs. Osteoporosis is one of the major causes of disability and mortality in older people (25). The mortality rate of hip fracture (femoral head) in the first year after the fracture is about 20% in older people, and half of these people will have some degree of disability for the rest of their lives. It is anticipated that more than 75% of osteoporotic fractures may occur in developing countries over the next 50 years (26). Iran, like other developing countries, will have a significant elderly population in the next 50 years. Planning prevention programs for osteoporosis should therefore be one of the health priorities in Iran. Recently, data mining methods have shown promise in detecting the disease.
In the present study, the risk of osteoporosis is predicted from clinical records using data mining algorithms. With these methods, a person's risk of developing osteoporosis can be estimated without any additional diagnostic procedure. In addition, the factors that influence the disease can be identified. A predictive model to support the prevention of osteoporosis is therefore developed using Support Vector Machine (SVM) and Tree Augmented Naïve Bayes (TAN).
Several previous studies have predicted osteoporosis. The predictive model introduced by Wang et al., one of the best hybrid models combining a decision tree and an artificial neural network, reached an accuracy of 70.5% with 33 features (27). Gao et al. investigated a feature extracted from the fractal spectrum of micro-CT scans to predict osteoporosis in individuals. They used the C4.5 decision tree as the predictive model and reported an accuracy of 92.9% (28). Moudani et al. predicted the risk of osteoporosis using the random forest method. Their model assigned individuals to risk classes (no risk, low risk, moderate risk, high risk, and severe risk of osteoporosis). The study analyzed 15 features and obtained an accuracy of 99.92% (29). In another study, Alizadeh et al. showed that an artificial neural network model with multiple methods had the highest accuracy for predicting osteoporosis in patients. The study was conducted on 670 patients with 60 features and had an accuracy of 95.70% (30).
Methods
For effective data mining, not only is relevant information needed, but an appropriate data mining method should also be used. A method covering all data mining steps, namely data collection, data preparation, modeling, and evaluation, is required.
Data collection and description
The clinical records used in this study belonged to 4083 women referred to the osteoporosis research center of the Endocrinology and Metabolism Institute of Tehran University of Medical Sciences between 2006 and 2010. The records comprised 425 attributes covering personal information, lifestyle, and disease information. After eliminating uninformative attributes such as name, father's name, and phone number, 400 attributes remained. Each referral was labeled with her osteoporosis status; of the 4083 women, 365 had osteoporosis.
Data preparation
This step aimed to improve the quality of the data so that appropriate data could be provided for the next phase, data modeling.
Data cleaning: This step included smoothing noise, detecting and removing outliers (isolated values), resolving inconsistencies, and filling in gaps and missing data.
Homogenizing formats: Inconsistent formats are easy to overlook and matter greatly when data are aggregated, so data miners address this issue as a separate step. For instance, weight is sometimes recorded in grams and sometimes in kilograms, and such values must be edited so that all data share the same format.
Data normalization: Attributes such as age, weight, height, BMI, lactation duration, and the number of pregnancies can take values on very different scales and may therefore have a disproportionate influence on the analysis. All attributes of this kind were normalized in this step.
Combination of some data: To facilitate modeling and to produce results that can be referred to in medical practice, continuous values were converted to discrete categories that have medical definitions.
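The preparation steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the study does not say which normalization or which category boundaries were used, so min-max scaling and the standard WHO adult BMI categories are assumed here, and all function names and sample values are hypothetical.

```python
def weight_to_kg(value, unit):
    """Convert a weight recorded in grams or kilograms to kilograms."""
    if unit == "kg":
        return float(value)
    if unit == "g":
        return value / 1000.0
    raise ValueError(f"unknown unit: {unit!r}")


def min_max_normalize(values):
    """Rescale a list of numbers linearly to the range [0, 1] (assumed method)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def bmi_category(bmi):
    """Map a continuous BMI (kg/m^2) to the standard WHO adult categories."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    return "obese"


# Hypothetical records: one weight entered in grams, two in kilograms.
weights_kg = [weight_to_kg(62000, "g"), weight_to_kg(70, "kg"), weight_to_kg(78, "kg")]
print(weights_kg)
print(min_max_normalize(weights_kg))
print(bmi_category(27.3))
```

Converting units before normalizing matters: scaling a column that still mixes grams and kilograms would compress almost all values toward zero.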
To reduce the initial 400 features, 45 attributes were selected according to the opinion of an expert at the Institute of Endocrinology and Metabolism and with the help of scientific papers in the field of osteoporosis (31,32,33,34,35,36,37,38,39).
The features were: osteoporosis, staying around the day, arrhythmia, autoimmune disease, pregnancy rate, high blood lipids, anti-inflammatory drugs, gastrointestinal drugs, diabetes-related drugs, osteoporosis drug, blood lipids drug, cancer drugs, immunosuppressive drugs, hypertension, chloroquine drug, diabetes, arthritis, familial osteoporosis, age, menopausal age, body mass index, surgery, duration of breastfeeding, contraceptive pills, calcium intake, multivitamin intake, home type, menstruation age, cancer, anemia, weekly exercise, antiepileptic and seizure drug, gastrointestinal surgery, coffee and tea consumption, hyperadrenalism, lumbar disc, falling, hyperparathyroidism, hyperthyroidism, hypothyroidism, smoking, vitamin D, thalassemia, history of fracture, and anticoagulation drug.
Data modeling
Support Vector Machine (SVM) is a classification method that is particularly suited to large data sets, meaning here data sets with many predictors, such as those used in bioinformatics. The method is commonly used when a large amount of data is available, as in medical data sets (40,41).
A simple SVM model was used in this study. First, the data were divided into a training set (for model building) and a test set (for evaluation). The split was performed completely at random using the Partition node in the Clementine software, with 70% of the data assigned to training and the remaining 30% to testing (42) (Figure 1).
Figure 1. Support Vector Machine (SVM)
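A random 70/30 split of the kind described above can be sketched as follows. This is a stand-in written in plain Python, not a reproduction of the Clementine Partition node; the function name and the fixed seed are assumptions made for reproducibility of the example.

```python
import random


def partition(records, train_fraction=0.7, seed=0):
    """Randomly split records into a training list and a test list."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Record indices stand in for the 4083 clinical records.
train, test = partition(range(4083))
print(len(train), len(test))
```

Fixing the seed keeps the same records in the test set across runs, which makes results from different models directly comparable.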
The SVM node was selected in the Clementine software. The 45 studied features were first entered in the initial configuration of the model. These 45 features had been selected using dimensionality reduction methods (the k-means algorithm and the opinion of specialists). Osteoporosis status, with two groups (healthy people and patients), was the target feature, and the other features were input features. The results of this step were computed for the SVM.
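To make the idea of an SVM classifier with a binary target concrete, the sketch below trains a linear SVM by subgradient descent on the L2-regularized hinge loss. It is an illustrative assumption only: the study used the SVM node in Clementine, whose kernel and solver settings are not reported here, and the function names, hyperparameters, and toy data are hypothetical.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=100):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    labels must be +1 (patient) or -1 (healthy).
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: the hinge term is active.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Only the regularizer contributes: shrink the weights slightly.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data (hypothetical; not taken from the study).
points = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -1.0)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(points, labels)
print([predict(w, b, x) for x in points])
```

For a linear model of this kind, the absolute values of the learned weights give one simple notion of feature importance, provided the inputs have been normalized to comparable scales.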
Subsequently, to examine the effect of dimensionality reduction, the model was rerun with 35 features. Features were selected according to their importance in the previous model: those that were most important in the first step were retained, and the SVM model was run on these 35 features to obtain the second set of results.
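The selection rule described above, keeping the features ranked highest by the previous model, can be sketched as follows. The importance scores in the example are hypothetical and the function name is an assumption; in the study the scores came from the Clementine model output.

```python
def select_top_features(importance, k):
    """Return the names of the k features with the largest importance scores."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:k]


# Hypothetical importance scores, for illustration only.
importance = {"calcium intake": 0.12, "diabetes": 0.09, "home type": 0.01, "age": 0.05}
print(select_top_features(importance, k=3))
```

With the study's numbers, the same call would use the 45 first-step scores and k=35.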
Bayesian Network
A Bayesian network makes it possible to build a probabilistic model by combining observed and recorded evidence from the real world, in order to calculate the probability of events from a set of apparently unrelated features. Bayesian networks are used to predict a variety of outcomes. A Bayesian network is a graphical model that shows the variables of a data set (often called nodes) together with their probabilistic and conditional relationships. Causal relationships between nodes can be represented using a Bayesian network (43).
However, the connections of a network (arcs) do not necessarily represent direct cause and effect. For instance, a Bayesian network can be used to calculate the probability that a disease is present in a specific person. In this case, the presence or absence of a specific symptom can be treated as a causal connection between the symptom and the disease in the network. These networks are powerful and can give strong predictions on many types of data (43).
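The disease and symptom example above corresponds to the smallest possible Bayesian network, with one arc from disease to symptom. A minimal sketch of the resulting calculation with Bayes' rule is given below; the function name and all probabilities are hypothetical and chosen only to show the arithmetic.

```python
def posterior_given_symptom(prior, p_symptom_if_disease, p_symptom_if_healthy):
    """P(disease | symptom) for a two-node network: disease -> symptom."""
    evidence = prior * p_symptom_if_disease + (1.0 - prior) * p_symptom_if_healthy
    return prior * p_symptom_if_disease / evidence


# Hypothetical probabilities: 10% prior, symptom seen in 80% of patients
# and in 10% of healthy people.
print(round(posterior_given_symptom(0.10, 0.80, 0.10), 4))
```

Even with a symptom that is eight times more common among patients, the posterior stays below one half in this example because the prior is low.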
To build the Bayes model (TAN), the training data were supplied as input to the Bayes Net node, and the parameters of the model were set as follows (Figure 2):
Use partitioned data
Structure type: TAN
Parameter learning method: Maximum likelihood
Figure 2. Bayesian networks model
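In the usual TAN construction, the class node is a parent of every feature, and the features are additionally linked by a tree chosen to maximize the conditional mutual information between connected features given the class. The sketch below illustrates only this structure search on discrete data. It is an assumption-laden illustration, not the Clementine implementation: the function names and the toy data are hypothetical, and parameter learning is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def conditional_mutual_information(rows, labels, i, j):
    """Estimate I(X_i; X_j | C) in nats from relative frequencies."""
    n = len(rows)
    n_c = Counter(labels)
    n_ic = Counter((r[i], c) for r, c in zip(rows, labels))
    n_jc = Counter((r[j], c) for r, c in zip(rows, labels))
    n_ijc = Counter((r[i], r[j], c) for r, c in zip(rows, labels))
    total = 0.0
    for (xi, xj, c), count in n_ijc.items():
        # p(xi, xj | c) / (p(xi | c) * p(xj | c)) written in terms of counts.
        ratio = count * n_c[c] / (n_ic[(xi, c)] * n_jc[(xj, c)])
        total += (count / n) * math.log(ratio)
    return total


def tan_feature_tree(rows, labels, root=0):
    """Return (parent, child) edges of a maximum-weight spanning tree (Prim)."""
    n_features = len(rows[0])
    weight = {}
    for i, j in combinations(range(n_features), 2):
        weight[(i, j)] = weight[(j, i)] = conditional_mutual_information(rows, labels, i, j)
    in_tree = [root]
    edges = []
    while len(in_tree) < n_features:
        candidates = [(p, c) for p in in_tree for c in range(n_features) if c not in in_tree]
        best = max(candidates, key=lambda edge: weight[edge])
        edges.append(best)
        in_tree.append(best[1])
    return edges


# Toy binary data (hypothetical): feature 1 copies feature 0, feature 2 is unrelated.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 2
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(tan_feature_tree(rows, labels))
```

Once the tree is fixed, maximum likelihood parameter learning amounts to estimating each feature's conditional probability table, given its tree parent and the class, from relative frequencies in the training data.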
The Bayes Network node was selected in the Clementine software. First, the 45 features were entered into the initial configuration. Osteoporosis status was defined as the target feature, and the other features were selected as input features. Results were then obtained from the Bayes network.
In the second step, 35 features were selected from 45 features of the previous step. The selection of 35 features was based on their importance degree in the previous step. Afterward, the Bayes network model was run using 35 features and the results were obtained.
The 35 features were: calcium intake, weekly exercise, anemia, antiepileptic and seizure drugs, gastrointestinal surgery, coffee and tea consumption, hyperadrenalism, lumbar disc, falling, hyperparathyroidism, hypothyroidism, hyperthyroidism, smoking, vitamin D intake, thalassemia, fracture history, anticoagulant drug, osteoporosis, autoimmune disease, pregnancy rate, high blood lipids, anti-inflammatory drugs, diabetes-related drugs, osteoporosis drug, blood lipids drug, cancer drugs, immunosuppressive drug, hypertension, diabetes, arthritis, familial osteoporosis, age, menopausal age, body mass index, and duration of breastfeeding.
Evaluation of the model prepared using SVM
The classification precision was evaluated separately for the test and training data by passing them to an Analysis node. The precision of the developed model was also evaluated, and the factors with the highest importance were identified.
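The evaluation described above can be sketched as a confusion-matrix calculation. The sketch assumes that "classification precision" denotes the overall proportion of correctly classified records (accuracy), and it also computes precision and recall for the positive class. The function names are hypothetical; the only figures taken from the study are the class counts (365 patients among 4083 records).

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Proportion of all records classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Proportion of predicted positives that are true positives."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Proportion of actual positives that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0


# Reference point: a trivial rule that labels every record as healthy.
actual = [1] * 365 + [0] * (4083 - 365)
predicted = [0] * 4083
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(round(accuracy(tp, fp, fn, tn), 4), recall(tp, fn))
```

Because patients make up a small share of the records, overall accuracy is informative mainly when read alongside the positive-class recall and precision.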
Results
The clinical records used in this study covered 4083 women, 365 of whom had osteoporosis. The following table provides a brief description of the data.
The results of the evaluation were obtained by entering the 45 features in the configuration section of the SVM node in the Clementine software.
Figure 3 shows the ten features that matter most to the model output.
Figure 3. Important features of SVM in the first stage with 45 features
The SVM-produced model (with 45 features in the first step) detected calcium intake, hypothyroidism, diabetes, anticoagulants, immune-suppressor drugs, diabetes-related drugs, hyperparathyroidism, history of surgery, hypertension drugs, and lumbar disc as the factors affecting osteoporosis the most.
The results of the SVM model with 35 features are shown in Figure 4.
Figure 4. Important features of SVM in the second stage with 35 features
In the second step, the SVM-produced model detected hyperlipidemia drugs, exercising during the week, history of osteoporosis, immune-suppressor drugs, autoimmune diseases, lactation duration, thalassemia, number of pregnancies, hyperlipidemia, and osteoporosis drugs as factors affecting osteoporosis the most.
Evaluation of the model prepared using Bayes network
The classification precision was evaluated separately for the test and training data by passing them to an Analysis node. The precision of the developed model was also evaluated, and the factors affecting osteoporosis the most were identified.
The results of the evaluation were obtained by entering the 45 features in the configuration section of the Bayes network node in the Clementine software. The results are shown in Figures 5 and 6.
Figure 5. The network designed in Bayesian network model
Figure 6. Important features of the Bayesian network in the first step with 45 features
The Bayes network-produced model (with 45 features) detected immune-suppressor drugs, hyperlipidemia, osteoporosis drugs, history of osteoporosis in the family, calcium intake, lactation duration, and hyperlipidemia drugs as factors affecting osteoporosis the most.
The results of the Bayes network model with 35 features are as follows: (Figure 7)
Figure 7. Important features of the Bayes network in the second step with 35 features
In the second step, the Bayes network-produced model detected immune-suppressor drugs, anti-inflammatory drugs, vitamin D, exercising during the week, lactation duration, and autoimmune diseases as the factors affecting osteoporosis the most.
Discussion
SVM model: The classification precision of this model was lower than that of the other models studied. Its highest classification precision was 86.18%, obtained using 35 features. Moreover, calcium intake, hypothyroidism, diabetes, anticoagulant drugs, immune-suppressor drugs, diabetes-related drugs, hyperparathyroidism, surgery, blood pressure drugs, and lumbar disc were found to be the factors with the greatest effect on osteoporosis.
Bayes network model: This model had the highest classification precision (91.29%), obtained using 35 features, which appears to be an adequate precision. The model identified immune-suppressor drugs, hyperlipidemia, osteoporosis drugs, history of osteoporosis in the family, calcium intake, lactation duration, and hyperlipidemia drugs as the factors with the greatest effect on osteoporosis.
As shown, the Bayes method has a higher precision than the SVM method. Furthermore, evaluation of the factors detected by each method indicated that age, vitamin D intake, calcium intake, hypothyroidism, immune-suppressor drugs, blood pressure drugs, weekly exercising, hypoparathyroidism and hyperparathyroidism, probability and tendency of people to fall, anemia, lactation duration, and hyperlipidemia are the most important factors affecting osteoporosis.
Moreover, features including history of osteoporosis in the family, diabetes, number of pregnancies, anti-seizure and epilepsy drugs, anti-cancer drugs, lumbar disc, and anticoagulant drugs also affect osteoporosis. Notably, medical and clinical studies have shown that osteoporosis risk factors include age (over 65 years), early menopause (before 45 years), calcium intake (or calcium loss), tea and coffee consumption, lack of exercise and activity, low weight (body mass index), and thyroid and parathyroid disorders, among others (44).
The osteoporosis prediction model introduced by Wang et al., a hybrid of a decision tree and an artificial neural network, had an accuracy of 70.5% with 33 features in 2934 women (27). The results of the present study show that the Bayes network model has an accuracy of 91.29% for 4083 women.
Dreiher et al. studied the association between osteoporosis and psoriasis using multivariable logistic regression. The data studied included age, gender, race, socioeconomic status, and chronic diagnoses including osteoporosis, as well as suspected or known risk factors for osteoporosis (for example, smoking, thyroid disorders, inflammatory bowel disease, chronic hepatitis, rheumatoid arthritis, and chronic obstructive pulmonary disease), conditions related to decreased physical activity (blindness and depression), and obesity (a protective factor against osteoporosis) (45). In comparison, the current study is considerably more complete in terms of sample size and number of features studied.
Gao et al. investigated a feature extracted from the fractal spectrum of micro-CT images to predict osteoporosis in humans. Their study covered 14 healthy individuals and 14 people with osteoporosis, used the C4.5 decision tree as the predictive model, and reported an accuracy of 92.9% (28). The present study obtained a similar accuracy with a larger sample size and a higher number of features.
Moudani et al. (2011) predicted osteoporosis using the random forest method. The study was conducted on 2845 individuals with 15 features. The model assigned individuals to risk classes (risk-free, low-risk, moderate-risk, high-risk, and severe-risk for osteoporosis) and achieved an accuracy of 99.92% (29).
Zhou et al. (2012) developed a model for risk factors for osteoporosis based on traditional Chinese medicine and modern western medicine using data mining. To achieve this purpose, two methods of support vector machine and random forest were used. The results of their studies showed that the symptoms and factors mentioned in traditional medicine play a more important role in assessing the risk of osteoporosis. Among the variables mentioned, postmenopausal years were among the significant risk factors for osteoporosis (46).
In another study, Alizadeh et al. (2014) showed that the multiple-method neural network model is the most accurate model for predicting osteoporosis. This study was performed on 60 features from the questionnaire information of 670 people and had an accuracy of 95.70% (30).
Conclusion
Reviewing the conditions and results of previous research on data mining and osteoporosis, it can be concluded that, among the studies reviewed, the present study is the largest in terms of sample size and number of features studied. The use of decision tree methods and neural network feature selection are further strengths of this study. Multiple models were examined on several different feature sets, and the results were compared to find the best predictive model in terms of accuracy. Given the above, the data mining results on the clinical records identify predictive features that are in line with both the clinical results obtained from medical studies and the findings of previous research.
Finally, it is important to note that health care organizations always collect large amounts of information while this information and data are not used properly. This study indicates that these data can be properly utilized to improve the quality of diagnostic and therapeutic services by uncovering patterns and relationships hidden in these data.
Acknowledgments
This research was a part of MSc project number 21441007942006 of Islamic Azad University of Qazvin. The authors would like to express their gratitude to Endocrinology and Metabolism Research Institute of Tehran University of Medical Sciences for providing the data used in this study.
Authors' contribution
A.A., E.J., and A.K. contributed to the design and implementation of the research, to the analysis of the results, and to the writing of the manuscript.
Conflict of Interests
The authors declare no conflict of interest.