Diabetes Mellitus is a disease caused by a lack of insulin production in the pancreas or by a lack of cell response to the insulin that is produced. If diabetes is not treated early, it may lead to severe complications. According to a recent report by the Centers for Disease Control and Prevention, out of 30.3 million people with diabetes, 23.1 million have been diagnosed with either Type 1 or Type 2 diabetes.
Diagnosing diabetes manually is a complicated task. Hence, data mining techniques are used to predict the prevalence of the disease.

Data mining plays a vital role in analysing huge amounts of data and extracting valuable information from them. This information is converted into useful knowledge, which should be understandable in nature. Data mining includes different techniques such as clustering, classification, prediction, association and sequential patterns. Clustering is the process of grouping data with similar features, so the data within a cluster are similar to each other and dissimilar to the data in another cluster. Classification is used to analyse a new set of data and predict the group to which the data belong. The purpose of prediction is to identify the relationship between dependent and independent variables.
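To make the idea of classification concrete, the following is a minimal sketch of k-nearest-neighbour voting with Euclidean distance, one of the classifiers reviewed later in this document. The feature pairs (glucose, BMI), the labels and the function names are hypothetical values chosen for illustration, not data from any of the cited papers.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote of its k nearest records.

    `train` is a list of (features, label) pairs. Ties fall to the label
    seen first among the nearest neighbours.
    """
    neighbours = sorted(train, key=lambda row: euclidean(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy records: (glucose in mg/dL, BMI) -> label
train = [
    ((85.0, 22.0), "non-diabetic"),
    ((90.0, 24.5), "non-diabetic"),
    ((95.0, 21.0), "non-diabetic"),
    ((160.0, 31.0), "diabetic"),
    ((175.0, 33.5), "diabetic"),
    ((150.0, 29.0), "diabetic"),
]

prediction = knn_predict(train, (155.0, 30.0), k=3)
print(prediction)
```

In practice the features would normally be scaled first, since an unscaled glucose value dominates the distance over BMI.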
Sequential pattern analysis is used to uncover similar patterns that exist in transactions over a period. Those patterns are useful in business decision making. Association rule mining is used to discover a set of rules from the frequently occurring factors in a transaction. Association rules generated from diabetes risk factors also provide justifications, which may serve as a guide for diabetes care. Data mining is used in various domains such as healthcare, bioinformatics, finance and business in order to improve future performance, reduce cost,
and enhance efficiency and accuracy.

2 KDD PROCESS
Knowledge Discovery from Databases is known in short as KDD. The main objective of the KDD process is to explore useful knowledge from large databases and predict the interesting patterns among them. KDD is an iterative process in which the following steps are repeated until an interesting, understandable pattern is obtained.

Fig 2.1.1 Steps involved in KDD

The steps involved in the KDD process are as follows. Initially, develop an understanding of the domain and create or select a dataset. Then follow the steps given below:
- Data pre-processing and cleansing: noisy data, missing data and outliers are handled in this step.
- Data integration: various data sources are combined.
- Data selection: the data relevant to the analysis task are gleaned from the database.
- Data transformation: the obtained data are transformed into forms suitable for mining by performing different summary or aggregation operations.
- Data mining: effective methods are applied to extract data patterns.
- Pattern evaluation: the truly interesting patterns representing knowledge are identified, based on some interestingness measures.
- Knowledge presentation: the extracted knowledge is visualised for users using this knowledge representation
technique.

2.2 MACHINE LEARNING
Machine learning is a kind of artificial intelligence which enables software to predict outcomes accurately based on the relations learnt from previous datasets. Machine learning algorithms receive inputs and perform statistical analysis on them to predict the output. ML algorithms are classified into three categories, namely supervised learning, unsupervised learning and reinforcement learning.

2.2.1 Classification of machine learning
A supervised learning algorithm is carried out on a labelled dataset. In supervised learning, an algorithm is trained with predefined inputs. In most cases it is unable to figure out by itself a function that always makes the correct predictions. Hence the computer is fed the valid output along with the input, from which the system should be able to learn the patterns. Based on the relationships learned between the input values and the target output, the output for a new dataset is predicted.

In unsupervised learning, the computer is trained with unlabelled data. There are no predefined input or output relationships. Here the algorithm uses different techniques to analyse the data and discover interesting patterns in it. This approach is useful when the experts do not know what is to be found in the given dataset.

In reinforcement learning, the agent learns from its interactions with the environment in order to take actions that maximize the reward. Learning from the environment is done in an iterative fashion. The agent must understand the current state to make valid

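As a hedged illustration of supervised learning as described in this section, the sketch below trains a decision stump (a one-level decision tree, reviewed later in this document) on labelled examples and then predicts the label of unseen inputs. The glucose readings, the 0/1 labels and the function names are hypothetical. The sketch also assumes a single numeric feature and a fixed rule direction (predict 1 when the value exceeds the threshold).

```python
def train_stump(samples):
    """Learn one threshold from labelled (value, label) pairs, labels 0 or 1.

    Candidate thresholds are the midpoints between consecutive distinct
    values; the one with the fewest training errors is kept.
    Returns (threshold, training_errors).
    """
    values = sorted({v for v, _ in samples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(samples) + 1
    for t in candidates:
        # An error is any sample whose predicted class (value > t) differs
        # from its label.
        errors = sum((v > t) != bool(label) for v, label in samples)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold, best_errors


def predict(threshold, value):
    """Apply the learned rule to a new, unlabelled value."""
    return 1 if value > threshold else 0


# Hypothetical labelled data: (fasting glucose in mg/dL, 1 = diabetic)
labelled = [(85, 0), (92, 0), (99, 0), (110, 0), (130, 1), (145, 1), (160, 1)]

threshold, errors = train_stump(labelled)
print(threshold, errors, predict(threshold, 150), predict(threshold, 100))
```

The labelled outputs are what make this supervised: the threshold is chosen only by comparing predictions against the known labels.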
2.2.2 Reinforcement Learning

2.3 ASSOCIATION RULE MINING
Association rule mining is defined as finding the frequent patterns and correlations among relational databases, transactional databases and other data storage repositories. Association rule mining was initially introduced as a tool for market basket analysis. The technique concentrates mainly on analysing unsupervised data. It plays a vital role in biology and bioinformatics, where it reveals gene information from a huge amount of raw data. Association rules are applied to database transactions. Each data element in a transaction is known as an item. By applying rules, the frequent patterns are revealed, and basket analysis is one example of frequent pattern occurrence in association rule mining.

Classification is the process of analysing a given dataset and identifying the predefined group or class to which the analysed data belong. Some of the classification algorithms are decision trees, linear classifiers, quadratic classifiers, kernel estimation, support vector machines, neural networks,
learning vector quantization.

The ID3 calculation on a dataset starts at the root. The values in the dataset are classified based on the entropy value, and the data are separated into classes and leaves. Data belonging to the same class are placed under one class. When there is no more
data of the same class, a separate class is made.

ANN (Artificial Neural Network) is more or less similar to biological neural systems. Because of its adaptive nature, it can process a large volume of input data and is capable of machine learning.

3 DIABETES MELLITUS
Diabetes Mellitus (DM) is a metabolic disorder characterised by a high blood glucose level in the body over a prolonged period. The disease mainly arises when insulin secretion by the body is irregular or when the body's cells do not respond properly to the secreted insulin. Common symptoms of diabetes are polyuria, polyphagia and polydipsia. Untreated diabetes may lead to severe complications ranging from cardiovascular disease and macrovascular disease to death.

3.1 Consequences of diabetes
In 2012, 2.2 million deaths were recorded for which high blood sugar was identified as the root cause. In 2014, 8.5% of adults aged 18 and over had diabetes. In 2015, diabetes caused 1.6 million deaths. The WHO (World Health Organization) extends its services for the prevention and control of diabetes and its effects to low and middle income countries, to reduce the loss of life. WHO provides the “Global Report on Diabetes”, which serves as a guideline for governments, civil society and the private sector. It also covers the burden and complications of diabetes and its side effects. It advises people to follow a healthy diet and take physical exercise to reduce the
effect of this harmful disease.

3.1 TYPES OF DIABETES
There are a few types of diabetes: Type 1 DM, Type 2 DM and gestational diabetes.

TYPE 1 DIABETES:
Type 1 diabetes mellitus is caused by a lack of insulin production in the pancreas. Insulin production falls as the beta cells are depleted. This type of diabetes is also referred to as IDDM (insulin-dependent diabetes mellitus). Type 1 diabetes mostly occurs in younger age groups. The symptoms are constant hunger, vision changes, fatigue, vomiting etc. This type is treated with insulin injections, which compensate for the lack of insulin in the body. Type 1 diabetes cannot be prevented in its early stages, but it can be controlled with insulin injections.

TYPE 2 DIABETES:
Type 2 diabetes mellitus is caused by a lack of cell response to the insulin produced. It is also referred to as NIDDM (non-insulin-dependent diabetes mellitus). Type 2 diabetes occurs in the adult age group, specifically above the age of 35. The symptoms are similar to those of type 1, but they develop very slowly and may be absent in some cases; hence this type is considered more menacing than the other types. Type 2 diabetes is treated with a proper diet, medication, weight loss surgery and, in severe cases, insulin injections.

GESTATIONAL DIABETES:
Gestational diabetes occurs in women during pregnancy. Usually it resolves naturally, but it may result in type 2 diabetes after pregnancy. Gestational diabetes requires careful medical monitoring to cure it completely. If it is not treated in its early stages, it may affect the health of either the mother or the baby.

3.2 COMPARISON OF DIABETES TYPES




Age of onset
  Type 1: Diagnosed during childhood (age < 20)
  Type 2: Diagnosed in adults (age > 30)

Body weight
  Type 1: Thin or normal

Symptom growth
  Type 1: Develop rapidly
  Type 2: Develop slowly and may be subtle or absent

  Type 1: Beta cells are
  Type 2: resistance; other defects

  Type 1: Insulin injections or insulin pump devices are used
  Type 2: Tablets and a proper diet are followed. Insulin injections are also used in severe cases.




ADABOOST: [2]
The objective of this paper is to classify diabetes patients into three different age groups using the identified risk factors. The algorithms involved are the Adaboost and Bagging ensemble techniques, using the J48 decision tree as base learner. In this paper, a chi-square test is carried out to determine the presence of diabetes; the result was that adults had a higher likelihood of developing diabetes than the other age groups. To assess the efficiency of the classifiers, the dataset from CPCSSN is divided into three cohorts and an AROC curve test is performed. Finally, the performance of Adaboost
is found to be efficient when dealing with small classes of data.

4.2 BAGGING: [2]
In this paper the efficiency of each ensemble technique, namely Adaboost and Bagging, applied with the J48 decision tree as base learner, is determined. In bagging, each bootstrap replicate contains on average about 63.2% of the distinct records of the original data. Hence, by processing repeatedly, the results obtained from the weak learners are combined to form a strong learner and improve accuracy. Once the presence of diabetes is predicted, the efficiency of the ensemble techniques with the J48 decision tree is assessed using the AROC curve test. By doing so, the Bagging algorithm proves to be more efficient when processing large datasets, with 0.98% perception capability of the
classifiers.

4.3 K-NEAREST NEIGHBORS: [3]
The objective of this paper is to group patients with Diabetes Mellitus using classification algorithms, namely J48 Decision Tree, KNN, Support Vector Machine and Random Forest, and to identify the algorithm with the best performance in terms of accuracy, sensitivity and specificity. The KNN algorithm classifies the available data based on the votes of the neighbours, and the Euclidean function is used to measure the distance. It was suggested that the larger the k-value, the greater the accuracy. To estimate the performance of the algorithms on the above criteria, a confusion matrix is constructed.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

Here TP, TN, FP and FN are the counts of true positives, true negatives, false positives and false negatives. The above equations are used in this paper to calculate accuracy, sensitivity and specificity from the constructed matrix. Based on these criteria, the performance of the algorithms is first assessed before pre-processing; the result was that the J48 decision tree is optimal, with 73.82% accuracy, 59.7% sensitivity and 81.4% specificity. After pre-processing, the performance comparison is repeated, and KNN at K=1 is found to be optimal
with 100% accuracy, sensitivity and specificity.

4.4 RANDOM FOREST: [3]
In this paper the performance of classification algorithms is compared in two stages, namely before pre-processing and after pre-processing. Random Forest is a machine learning algorithm which uses the bagging approach to achieve better prediction. When the algorithms' performance was analysed before cleaning the data, Random Forest showed only 71.74% accuracy, 53.81% sensitivity and 80.4% specificity. The performance of Random Forest reached 100% after cleaning the data. This paper shows the importance of pre-processing in
identifying the efficient algorithm.

4.5 DECISION STUMP: [4]
In this paper the Adaboost algorithm is used for predicting the prevalence of diabetes. The process involves global and local dataset collection, followed by training on the global data with the Adaboost algorithm under various base learners, namely Support Vector Machine, Naïve Bayes, Decision Tree and Decision Stump. The local dataset is then validated on these classifiers. Finally, the classifiers are compared in terms of accuracy, sensitivity, specificity and error rate. A decision stump is a kind of decision tree which contains only one level under the root. The performance of the classifiers without the Adaboost algorithm is estimated first: SVM shows the highest accuracy at 79.6%, while the decision stump has the lowest at 74.4%. After Adaboost is applied to the classifiers, the decision stump's accuracy of 80.72% becomes the highest of all the classifiers, and its error rate of 19.27% is the lowest. The performance of SVM remains the same even after boosting, but the decision stump proves to be efficient in predicting the prevalence of diabetes when working as a base classifier for the
Adaboost algorithm.

4.7 APRIORI: [8]
In this paper the Apriori algorithm is used for predicting diabetes medications. The process carried out is the collection of raw data, pre-processing, and conversion of the data into binary matrix form. This representation is analysed in two runs with distinct minimum support values (30% and 50%). With a minimum support of 30%, the result contains all symptoms, and all common medications are preferred. With a minimum support of 50%, the result was the Type 1 diabetes symptoms, based on which appropriate insulin medications are

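The effect of the minimum support threshold on a binary matrix can be sketched as follows. This is a minimal, level-wise frequent-itemset search in the spirit of Apriori, not the cited paper's implementation. The item names, the ten-row matrix and the function name `frequent_itemsets` are hypothetical, chosen only to show how raising the threshold from 30% to 50% shrinks the result.

```python
from itertools import combinations


def frequent_itemsets(matrix, items, min_support):
    """Return {itemset: support} for every itemset meeting min_support.

    `matrix` is a binary matrix (one row per record, one column per item).
    Candidates of size k+1 are built only from frequent itemsets of size k.
    """
    n = len(matrix)
    transactions = [
        frozenset(item for item, flag in zip(items, row) if flag)
        for row in matrix
    ]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    frequent = {}
    current = [frozenset([item]) for item in items]
    while current:
        kept = {c: s for c in current if (s := support(c)) >= min_support}
        frequent.update(kept)
        keys = list(kept)
        size = len(keys[0]) + 1 if keys else 0
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
        # Prune step: every k-item subset of a candidate must be frequent.
        current = [
            c for c in candidates
            if all(frozenset(sub) in kept for sub in combinations(c, size - 1))
        ]
    return frequent


# Hypothetical items and binary matrix (1 = present in the record)
items = ["polyuria", "polydipsia", "insulin", "metformin"]
matrix = [
    [1, 1, 1, 0], [1, 1, 1, 0], [1, 0, 0, 1], [1, 1, 0, 1], [0, 1, 1, 0],
    [1, 1, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1], [1, 1, 1, 0], [1, 0, 0, 0],
]

at_30 = frequent_itemsets(matrix, items, 0.30)
at_50 = frequent_itemsets(matrix, items, 0.50)
print(len(at_30), len(at_50))
```

On this toy data the lower threshold keeps every single item, while the higher one drops the rarer medication and the larger combinations, which mirrors the narrowing the paper reports between its two runs.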
In this paper a study of various existing data mining techniques is performed. The study covers the performance of the algorithms in different scenarios. From the study, it is understood that the bagging algorithm can be used for reducing variance and is capable of handling large datasets. The Adaboost algorithm is efficient in classifying smaller groups. The decision stump shows higher prediction accuracy when boosted with the Adaboost algorithm. KNN and Random Forest both render 100% accuracy when working with pre-processed data. This study can be used as a guideline for future enhancement of these techniques.

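The accuracy, sensitivity and specificity figures compared throughout this study are all derived from confusion-matrix counts. The sketch below shows one way to compute them, taking sensitivity as TP/(TP+FN), the standard true-positive rate. The label vectors and function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # share of actual positives found
        "specificity": tn / (tn + fp),   # share of actual negatives found
    }


# Hypothetical ground truth and classifier output (1 = diabetic)
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(actual, predicted)
scores = metrics(*counts)
print(counts, scores)
```

A classifier can score well on accuracy while missing many positives, which is why the reviewed papers report sensitivity and specificity alongside it.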