The performance of the k-nearest neighbour classifier on data sets

Abstract:

Distance-based methods are frequently used for data classification problems. k-nearest neighbour (k-NN) classification is one of the most widely used distance-based techniques: the distances between the test sample and the training samples determine the final classification result. The standard k-NN classifier performs well with numerical data. The main objective of this research is to examine how well k-NN works on data sets with a mixture of numerical and categorical features. For simplicity, this study focuses on only one type of categorical data: binary data.

Introduction:

Classification is a supervised machine learning technique that maps incoming data into predetermined groups or classes. The most important requirement for applying a classification approach is that every data item be assigned to a class, with each data item belonging to exactly one class.

Distance-based classification methods employ a distance function to compute the distance between the test sample and all training samples in order to categorise data items. However, distance-based algorithms were originally developed to handle a single type of data, determining the similarity of data items with measurements designed for that type.
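
To make the procedure concrete, a minimal sketch is given below (Python with NumPy). The distance function is passed in as a parameter, which is precisely the design choice this study examines. The function names are illustrative, not taken from the paper, and this is not the exact implementation used in the experiments.

```python
# Minimal sketch of distance-based k-NN classification with a pluggable
# distance function (illustrative only).
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_test, k, distance):
    """Classify x_test by majority vote among its k nearest training samples."""
    # Distance from the test sample to every training sample.
    dists = [distance(x, x_test) for x in X_train]
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbours' class labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

With Euclidean distance plugged in, this is the standard k-NN classifier; the measures discussed in the next section supply alternatives for heterogeneous data.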

The goal of this study is to examine the effectiveness of k-NN classification on heterogeneous data sets using two types of measures: well-known distances (Euclidean and Manhattan) and combined similarity measures built by pairing existing numerical distances with binary similarity measures. It also offers a first attempt at advising which similarity function to employ with k-NN for classifying heterogeneous data (of numerical and binary features). The remainder of this document is organised as follows: the next section covers the study topic's concepts, background, and literature review, followed by the experimental analysis and the conclusions.

Measures of distance and similarity:

The notion of similarity between data items is widely utilised in a number of fields to tackle pattern recognition problems such as classification, clustering, and forecasting. Various methods for comparing data items have been presented in the literature. This section explains the concepts of distance and similarity, followed by a discussion of the k-NN method and its performance evaluation.

Definition 1:

A distance measure $d: X \times X \to \mathbb{R}$ is called a metric if it satisfies the following criteria for all $x, y, z \in X$:

1. $d(x, y) \geq 0$ (non-negativity);

2. $d(x, y) = 0$ if and only if $x = y$ (identity);

3. $d(x, y) = d(y, x)$ (symmetry);

4. $d(x, z) \leq d(x, y) + d(y, z)$ (triangle inequality).
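
The two single measures evaluated later in this study, Euclidean and Manhattan distance, both satisfy the four metric properties above. A straightforward NumPy rendering follows; the formulas are standard, and only the function names are ours:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean (L2) distance between two numeric feature vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    """Manhattan (L1) distance between two numeric feature vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y))
```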

Similarity measurement, on the other hand, generates more discussion, since it allows some freedom in deciding how near two data items are. A similarity measure is often regarded as the complement of a distance measure.

Definition 2:

A similarity measure $S: X \times X \to \mathbb{R}$ is a function that satisfies the following requirements for all $x, y \in X$:

1. $0 \leq S(x, y) \leq 1$ (non-negativity and boundedness);

2. $S(x, y) = 1$ if and only if $x = y$ (identity);

3. $S(x, y) = S(y, x)$ (symmetry).
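
Two standard ways to obtain such a similarity are to transform a distance, e.g. $S = 1/(1 + d)$, or to compare binary vectors directly. The simple matching coefficient below is one common binary similarity; the paper's own choice of binary measures is not reproduced here, so treat both functions as representative examples only:

```python
import numpy as np

def similarity_from_distance(d):
    """Map a non-negative distance into a similarity in (0, 1]; equals 1 iff d == 0."""
    return 1.0 / (1.0 + d)

def simple_matching(x, y):
    """Simple matching coefficient for binary vectors: the fraction of
    positions where the two vectors agree (equals 1 iff x == y elementwise)."""
    x, y = np.asarray(x), np.asarray(y)
    return float(np.mean(x == y))
```

Both functions satisfy the three requirements of Definition 2 (provided the underlying distance is a metric).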

Experimental analysis:

This section compares the performance of standard k-NN and k-NN combined with similarity measures on six heterogeneous data sets from various fields. The data sets are described by numerical and binary features only. Table 4 lists the characteristics of the data sets. Two data sets, Hypothyroid and Hepatitis, were obtained from the UCI Machine Learning Repository [52], and four data sets, Treatment, Labour training assessment, Catsup, and Azpro, were obtained from R packages. A more detailed explanation of the data sets is given in [53]. The data sets were selected according to the following criteria:

1. The data set should contain only numerical and binary features.

2. No more than 3% of the values should be missing.

3. Each type of data should have enough features for computing similarity (no fewer than 2).

4. The number of classes should be kept to a minimum.

Both benchmark and real-world data sets were chosen so as to cover small to medium-sized data sets.
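
As a sketch of how a combined measure of the kind evaluated here can be assembled, the example below mixes a numerical similarity (Euclidean distance mapped into (0, 1]) with a binary similarity (simple matching) under a weight w. The weight value and the particular component measures are illustrative assumptions; the study evaluates several weightings and measure combinations, none of which is reproduced exactly here.

```python
import numpy as np

def combined_similarity(x_num, y_num, x_bin, y_bin, w=0.5):
    """Weighted mix of a numerical and a binary similarity.

    w is an illustrative mixing weight, not a value taken from the study.
    """
    x_num = np.asarray(x_num, dtype=float)
    y_num = np.asarray(y_num, dtype=float)
    d_num = np.sqrt(np.sum((x_num - y_num) ** 2))            # Euclidean on numeric features
    s_num = 1.0 / (1.0 + d_num)                              # map distance into (0, 1]
    s_bin = np.mean(np.asarray(x_bin) == np.asarray(y_bin))  # simple matching on binary features
    return w * s_num + (1.0 - w) * s_bin
```

Used with k-NN, the neighbours are then the k training samples with the largest combined similarity (equivalently, one can pass 1 - combined_similarity to a distance-based implementation).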

Conclusions and future work:

Since k-NN classification depends on calculating the distance between the test sample and each of the training samples, the chosen distance function is critical in deciding the final classification result. The main goal of this work was to evaluate the performance of k-NN when the similarity of data items represented by numerical and binary features is estimated with a variety of measures, including single measures (Euclidean and Manhattan) and a number of combinations of similarity measures. Experimental findings were generated on six heterogeneous data sets from various fields.

Our experiments revealed that Euclidean distance, when combined with k-NN, is not an appropriate metric for classifying a heterogeneous data set comprising numerical and binary features.

Furthermore, our findings showed that combining the results of numerical and binary similarity measures is a promising technique for achieving better outcomes than using only one. We also found no significant variation among the results reported for the three weighting schemes applied with k-NN, which may indicate that the approach is robust to how the heterogeneous features are weighted in the combined measure.
