The performance of the k-nearest neighbour classifier on data sets
Abstract:
Distance-based methods are frequently used for data classification problems. k-nearest neighbour (k-NN) classification is one of the most widely used distance-based techniques. The distances between the test sample and the training samples determine the final classification result. The standard k-NN classifier performs well with numerical data. The main objective of this research is to examine how well k-NN works on data sets with a mixture of numerical and categorical features. For simplicity, this study focuses on only one type of categorical data: binary data.
Introduction:
Classification is a supervised machine learning technique that maps incoming data into predetermined groups or classes. The most important requirement for using a classification approach is that all data items be allocated to classes, with each data object belonging to exactly one class.
Distance-based classification methods employ a distance function to compute the distance between the test sample and all training samples in order to categorise data items. However, distance-based algorithms were originally developed to deal with a single type of data, determining the similarity of data items using distance-based measurements.
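The distance-based classification scheme described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's code; the function names and toy data are assumptions. A test sample is classified by a majority vote among the k training samples nearest to it under a chosen distance function:

```python
# Minimal k-NN classifier sketch (illustrative; names and data are assumed).
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, test_x, k=3, dist=euclidean):
    # Sort training samples by their distance to the test sample.
    neighbours = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], test_x))
    # Majority vote among the labels of the k nearest neighbours.
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

train_X = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, (1.1, 0.9), k=3))  # → A
```

Swapping `dist` for another function (e.g. a Manhattan distance) changes the neighbourhood structure and hence, potentially, the classification result.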
The goal of this study is to examine the effectiveness of k-NN classification on heterogeneous data sets using two types of measures: the well-known Euclidean and Manhattan distances, and mixtures of similarity measures developed by combining existing numerical distances with binary data distances. It also offers a first attempt at recommending an appropriate similarity function to use with k-NN for classifying heterogeneous data (of numerical and binary features). The remainder of this document is organised as follows: it covers the study topic's concepts, background, and literature review.
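The idea of combining a numerical distance with a binary similarity can be sketched as below. This is a hedged illustration under assumptions: the 1/(1+d) transform to a bounded similarity and the weighted-average combination are common choices, not necessarily the paper's exact formulation, and the feature split and weight `w` are hypothetical:

```python
# Sketch: combine a numerical distance with a binary similarity (assumed scheme).
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming_similarity(a, b):
    # Fraction of binary features on which the two samples agree.
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def combined_similarity(num_a, num_b, bin_a, bin_b, w=0.5):
    # Map the numerical distance into [0, 1]; 1/(1+d) is one common choice.
    s_num = 1.0 / (1.0 + manhattan(num_a, num_b))
    s_bin = hamming_similarity(bin_a, bin_b)
    # Weighted combination of the two per-type similarities.
    return w * s_num + (1 - w) * s_bin

print(combined_similarity((0.5, 1.0), (0.5, 1.0), (1, 0, 1), (1, 0, 0)))  # → 0.8333...
```

A k-NN classifier would then rank training samples by descending combined similarity instead of ascending distance.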
The notion of data item similarity is widely utilised in a number of fields to tackle pattern recognition problems such as classification, clustering, and forecasting. Various methods for comparing data items have been presented in the literature. The concepts of distance and similarity are explained in this part, followed by a discussion of the k-NN method and its performance evaluation.
Definition 1:
A distance measure d : X × X → R is termed a metric if it meets the following criteria for all x, y, z ∈ X:
1. d(x, y) ≥ 0 (non-negativity);
2. d(x, y) = 0 if and only if x = y (identity);
3. d(x, y) = d(y, x) (symmetry);
4. d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).
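As a concrete illustration, the four axioms can be spot-checked for the Manhattan distance on a few sample points (a sanity check over finitely many points, not a proof; the points chosen are arbitrary):

```python
# Spot-check the four metric axioms for the Manhattan distance.
from itertools import product

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

points = [(0.0, 0.0), (1.0, 2.0), (-3.0, 4.0)]
for x, y, z in product(points, repeat=3):
    assert manhattan(x, y) >= 0                                   # non-negativity
    assert (manhattan(x, y) == 0) == (x == y)                     # identity
    assert manhattan(x, y) == manhattan(y, x)                     # symmetry
    assert manhattan(x, z) <= manhattan(x, y) + manhattan(y, z)   # triangle inequality
print("all four metric axioms hold on the sample points")
```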
Similarity measurement, on the other hand, generates more discussion, since it allows some freedom in determining how near two data elements are. A similarity measure is often thought of as the complement of a distance measure.
Definition 2:
A similarity measure S : X × X → R is a function that meets the following criteria for all x, y ∈ X:
1. 0 ≤ S(x, y) ≤ 1 (boundedness);
2. S(x, y) = 1 if and only if x = y (identity);
3. S(x, y) = S(y, x) (symmetry).
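The complementary relationship between distance and similarity mentioned above can be made concrete by converting a distance into a bounded similarity. The transform S(x, y) = 1 / (1 + d(x, y)) used here is one common, assumed choice among several standard ones, not necessarily the paper's:

```python
# Derive a bounded similarity from a distance via S = 1 / (1 + d) (assumed transform).
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def similarity(a, b):
    return 1.0 / (1.0 + euclidean(a, b))

x, y = (0.0, 0.0), (3.0, 4.0)
print(similarity(x, x))  # identical points → 1.0
print(similarity(x, y))  # distance 5 → 1/6 ≈ 0.1667
assert similarity(x, y) == similarity(y, x)  # symmetry carries over from d
```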
Experimental analysis:
This section compares the performance of standard k-NN and k-NN combined with similarity measures on six heterogeneous data sets from various fields. The data sets are characterised by numerical and binary features only. Table 4 lists the features of the data sets. Two data sets, Hypothyroid and Hepatitis, were obtained from the UCI Machine Learning Repository [52], and four data sets, Treatment, Labour training assessment, Catsup, and Azpro, were obtained from R packages; a more detailed explanation of the data sets is provided in [53]. The data sets were selected according to the following criteria:
1. Only numerical and binary features should be included in the data set.
2. There should be no more than 3% missing values in the data.
3. The number of features of each type should be sufficient for computing similarity (not less than 2).
4. The number of classes should be kept to a minimum.
Both types of data sets (benchmark and real-world) were chosen to cover small to medium-sized data sets.
Conclusions and future work:
The selected distance function is critical in deciding the final classification result, since k-NN classification depends on calculating the distance between the test sample and each of the training samples. The main goal of this work was to evaluate the performance of k-NN for estimating the similarity of data items represented by numerical and binary features, using a variety of measures: single measures (Euclidean and Manhattan) and a number of combinations of similarity measures. Experimental findings were generated on six heterogeneous data sets from various fields.
Our tests revealed that Euclidean distance, when combined with k-NN, is not a suitable metric for classifying a heterogeneous data set comprising numerical and binary features.
Furthermore, our findings revealed that combining the results of numerical and binary similarity measures is a promising technique for achieving better outcomes than using only one. We also found no significant variation in the results reported by the three examples of the supplied weights with k-NN, which might indicate that the algorithm is robust to the impact of compact heterogeneous features on classification performance.