k-Nearest Neighbor (k-NN) is a non-parametric algorithm widely used for the estimation and classification of data points, especially when the dataset is distributed over several classes. It is considered a lazy machine learning algorithm, as most of the computation is done during the testing phase rather than during training. Hence it is practically inefficient, infeasible, and inapplicable for processing huge datasets, i.e. Big Data. Clustering techniques (unsupervised learning), on the other hand, are strongly affected by normalization or standardization, and the value of "k" is difficult to determine. In this paper, some novel techniques are proposed to be used as a pre-processing stage for the state-of-the-art k-NN classification algorithm. Our proposed mechanism applies an unsupervised clustering algorithm to a large dataset before applying k-NN to the resulting clusters, which may run on a single machine, on multiple machines, or on different nodes of a compute cluster in a distributed environment. Initially the dataset, possibly multi-dimensional, is passed through a clustering technique (k-Means) at the master or controller node to find a number of clusters equal to the number of nodes in the distributed system or the number of cores in the machine. Each cluster is then assigned to exactly one node or core, which applies k-NN locally; each core or node sends its best result, and a selector chooses the best and nearest possible class from all options. We will use one of the gold-standard distributed frameworks. We believe that our proposed mechanism can be applied to big data, and that the architecture can also be implemented on multiple GPUs or FPGAs to benefit from k-NN on large or huge datasets where traditional k-NN is very slow.

The k-Nearest Neighbor (k-NN) classification algorithm, a supervised learning technique, is widely used in machine learning, with applications in many areas from recommendation systems to image processing. However, it is usually described as lazy learning because most of its time is spent during query matching (testing), and it acts as a brute-force technique: every record is matched against the query to compute similarity, which makes it difficult to apply to huge data, i.e. big data. Although GPUs and multi-core systems can carry out this task, memory constraints remain. We propose a technique that uses many systems in a cluster (a distributed system), with each node containing many cores. We take advantage of both classification and clustering here: k-Means and k-NN are used together to process big data.

We apply the k-Means clustering technique to a dataset "S" such that the dataset is divided into chunks equal in number to the nodes in a cluster. The number of nodes in a cluster is the same as the number of processing machines connected in our distributed environment. Each chunk of data is then transferred to a node in the cluster; the transfer can be carried out manually or with some technique that moves chunks to nodes over the network. Data distribution is the task of the controller or coordinator node. Each node in the cluster executes k-NN classification on its respective chunk and returns its "k" best results after matching the query against its chunk. The selector node then selects the best result among all results received from the nodes; a weighted scheme can also be used for this selection. This technique can be applied in various areas such as real-time transaction data, Twitter data, real-time streaming, stock exchange data, and even image processing. The proposed technique is theoretically shown to be more efficient and more accurate on large datasets than the straightforward k-NN classification algorithm, and it can also be adapted to multi-GPU or FPGA-based systems.
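The chunk-then-select flow described above can be sketched in a single process, with each "node" simulated by a function call. This is a minimal illustrative sketch, not the paper's implementation: the function names, the two-node toy chunks, and the minimum-distance merge in the selector are assumptions for demonstration.

```python
import math
from collections import Counter

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def local_knn(chunk, query, k):
    """Run k-NN on one node's chunk; return its k best (distance, label) pairs."""
    scored = sorted((euclidean(x, query), label) for x, label in chunk)
    return scored[:k]

def selector(per_node_results, k):
    """Selector node: merge each node's k best results, then vote
    among the globally k nearest neighbors."""
    merged = sorted(pair for node in per_node_results for pair in node)
    votes = Counter(label for _, label in merged[:k])
    return votes.most_common(1)[0][0]

# Controller: in the real pipeline these chunks come from k-Means clustering;
# here they are given directly (toy data, one chunk per simulated node).
chunks = [
    [((0.0, 0.0), "A"), ((0.2, 0.1), "A")],   # node 1's cluster
    [((5.0, 5.0), "B"), ((5.1, 4.9), "B")],   # node 2's cluster
]
query = (0.1, 0.1)
per_node = [local_knn(chunk, query, k=2) for chunk in chunks]
print(selector(per_node, k=2))  # prints "A": the nearest cluster dominates
```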

The rest of the paper discusses k-Means clustering, k-NN classification, the proposed methodology, and results. Section 2 reviews current techniques for using the k-NN algorithm as related work. Sections 3 and 4 describe the k-Means clustering technique and the k-NN classification algorithm, respectively. Section 5 presents our proposed methodology, while results and conclusion/future work are discussed in sections 6 and 7, respectively.

For finding accurate estimates of measures for high-dimensional data, in (

When there is a large amount of data, or possibly high-dimensional data, k-NN fails because most of its time is spent testing the query; to avoid this issue, a technique is proposed in (

and threads. The proposed algorithm outperforms others on a big data analytics platform.

k-NN joins, which find the nearest neighbors of points in a dataset "S", are a primitive operation widely used in data mining applications. This operation becomes costly in the case of big data; an investigation has been carried out in (Lu

In (

A multi-GPU implementation of the k-Nearest Neighbor classification algorithm has been proposed in (Masek

Similarity matching akin to the k-NN technique has been used in recommendation systems, to find the closest match, in (

k-Means is one of the simplest and most widely used unsupervised learning algorithms for solving the well-known clustering problem. It partitions data into a certain number of clusters that is fixed a priori. The main idea behind k-Means is to define "k" centroids, one for each cluster. These centroids are randomly chosen at the start and should be placed at different locations for better results. Each point in the dataset is then assigned to the cluster it belongs to, i.e. the cluster with the nearest centroid. This process is repeated iteratively until proper clusters are formed. The objective function of k-Means is the squared-error function; k-Means tries to minimize this function's value, as given in eq. 1.

Suppose that we have "n" data points or sample feature vectors [x_1, x_2, x_3, …, x_n], all from the same class, and we know that they fall into "k" clusters; let μ_i be the mean of the vectors in cluster i.
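The squared-error objective referenced as eq. 1 appears to have been lost during text extraction. With the definitions above (n points in k clusters, with μ_i the mean of cluster S_i), the standard k-Means objective it describes reads:

```latex
J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \left\lVert x_j - \mu_i \right\rVert^{2} \tag{1}
```

Minimizing J drives each point toward its own cluster's mean, which is exactly the squared-error criterion the text describes.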

In Algorithm
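The algorithm listing referenced here seems to be missing from the extracted text. As a stand-in, the following is a minimal pure-Python sketch of the standard k-Means (Lloyd's) iteration just described: random initial centroids, then alternating assignment and mean-update steps. The function name and defaults are illustrative assumptions, not the paper's code.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's k-Means: pick k random centroids, then alternate
    assignment and mean-update steps until assignments stabilize."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:
            break  # assignments stabilized
        centroids = new
    return centroids, clusters

# Two well-separated blobs should yield two clusters of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```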

In classification algorithms, the k-Nearest Neighbor, also known as a lazy learning technique, essentially boils down to a majority vote among the "k" examples most similar to a given unseen example. Similarity is defined by a distance metric between two examples. Distance can be computed using the popular Euclidean distance formula, as given in eqs. 2, 3,

$$d(x, y) = \lVert x - y \rVert_2 = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

or the Manhattan distance formula

$$d(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert$$

and others such as Jaccard similarity, cosine similarity, Chebyshev distance, and Hamming distance. Each has pros and cons and suits specific problems, depending on the nature of the problem.
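A few of the distance metrics just listed can be sketched directly; this is an illustrative aside (function names are our own), showing how interchangeable the metric is in a k-NN setting:

```python
import math

def euclidean(x, y):
    # L2 (straight-line) distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    # number of positions at which two equal-length sequences differ
    return sum(a != b for a, b in zip(x, y))

print(euclidean((0, 0), (3, 4)))      # 5.0
print(manhattan((0, 0), (3, 4)))      # 7
print(hamming("karolin", "kathrin"))  # 3
```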

Formally, given an integer "k", an unseen example "x", and a distance "d", the k-Nearest Neighbor classifier performs two steps.

In Algorithm
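The two steps just named can be sketched compactly; this is a minimal illustrative version (names and toy data are our assumptions), not the paper's listing:

```python
import math
from collections import Counter

def knn_classify(train, x, k):
    """Step 1: find the k training examples nearest to x under
    Euclidean distance. Step 2: majority vote over their labels."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify(train, (1, 0), k=3))  # prints "A" (two A votes vs one B)
```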

Fig

Coordinator Node or Controller Node

Intermediate Nodes

Selector Node

Algorithm

Algorithm

In our framework, the selector node is the node where no supervised or unsupervised algorithm is executed; this node only serves to retrieve the best results from each of the nodes in the cluster using their nodeID. Algorithm

The global best result is selected by a function that we call the selection function. Three different types of selection functions are introduced in the proposed technique:

Voting selection function: selects the class variable on the basis of votes by the different clusters of data. Suppose the final results are [(C1, class 1), (C2, class 2), …, (CN, class N)], and three of the clusters report that the query belongs to class 1; then the final global result is selected on this basis. If all cluster results are unique, the class with the minimum distance is selected.

Accuracy selection function: takes the average of the accuracies across the different clusters and then selects the maximum.

Weighted clusters selection function: a weight is allocated to each node according to the amount of data it holds; a higher weight is given to a node with a larger dataset. Clusters are created according to the value of k and the weights of the nodes.
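The voting selection function, including its minimum-distance fallback for all-unique votes, can be sketched as follows. This is an illustrative reading of the description above: the per-cluster result is assumed to carry a (cluster_id, predicted_class, distance) triple, which is our assumption about the data shape, not the paper's definition.

```python
from collections import Counter

def voting_select(results):
    """results: one (cluster_id, predicted_class, distance) triple per
    cluster. Pick the majority class; if every cluster voted for a
    different class, fall back to the minimum-distance result."""
    votes = Counter(cls for _, cls, _ in results)
    winner, count = votes.most_common(1)[0]
    if count > 1:
        return winner                              # majority exists
    return min(results, key=lambda r: r[2])[1]     # all unique: nearest wins

results = [("C1", "class1", 0.4), ("C2", "class1", 0.9),
           ("C3", "class2", 0.7), ("C4", "class1", 1.2)]
print(voting_select(results))  # prints "class1" (3 votes out of 4)
```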

We have tested our proposed technique on different datasets having different numbers of records and types of attributes. Basic data preprocessing is required if the data is not numeric. The two parameters under consideration were execution time (training time in machine learning terminology) and accuracy. Execution time was observed to decrease substantially, while accuracy varies with the number of clusters.

All the datasets described in this paper contain 20% noisy data.

The general intuitions that can be drawn from the above experiments are as follows:

The proposed technique can be very useful for processing huge amounts of data in terms of execution time, because it does not degrade accuracy. We can therefore significantly reduce execution time while maintaining the same accuracy.

In most cases it has been observed that a number of clusters equal to 2 times or 6 times the number of features gives significant results.

For large values of the instance-to-feature ratio, clustered k-NN shows no significant gains, but for small values of this ratio the proposed technique gives significant results. It can therefore be concluded that the proposed technique works best for high-dimensional data, where other techniques perform poorly, so there is no need for dimensionality reduction. We can make better decisions from high-dimensional data even with a smaller number of records.

The proposed technique also outperformed on small datasets. It mitigates the requirement of large datasets for training a better model while avoiding over-fitting and under-fitting.

In the case of noisy data, it always outperformed the straightforward application of k-NN.

The proposed technique works best and most efficiently for high-dimensional big data, and it especially outperforms in terms of accuracy and execution time when the datasets are noisy. The time required for creating the clusters has been considered in the statistics shown in the paper; note that clustering is a one-time process. The technique is compared against the straightforward k-Nearest Neighbor classification algorithm applied to large and noisy data. It also works well in the presence of outliers: as discussed, all datasets contain 20% noisy data, i.e. outliers, and the accuracy of the proposed technique rises above the baseline in most such cases. A clear distinction between traditional k-NN and clustered k-NN has been observed on the datasets mentioned, except for the pen-based dataset, where we gained no accuracy advantage, since plain k-NN achieved higher accuracy than our proposed technique; even there, however, the proposed technique retains an advantage in execution time.

Several extensions are possible. Parallel k-Nearest Neighbor could be tried for applying the proposed technique to large-scale data processing; the technique could be modified to execute on GPUs; and clustering techniques other than k-Means could be tried to observe their effect on accuracy and execution time. The technique has been shown to be effective and efficient for large and noisy datasets, i.e. datasets with a high tendency toward outliers. Once the system has been deployed, the arrival of new records does not require re-clustering or retraining: we simply calculate the distance between the new record(s) and all clusters and place each record in the most similar cluster. If a cluster grows too large, it may be split. In future work, the proposed technique will be evaluated on a multi-node compute cluster (distributed system) to assess its scalability. The proposed system may also prove very successful on other kinds of datasets, such as textual and image data.
Further work can examine the relation between the number of clusters and accuracy, since in some cases accuracy increases as the number of clusters increases, and vice versa.
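The incremental insertion of new records described in the conclusion, where a new record is compared against all clusters and kept in the most similar one without re-clustering, can be sketched as follows. This is a minimal sketch under the assumption that each cluster is summarized by its centroid; the function name is ours.

```python
import math

def assign_to_cluster(record, centroids):
    """Compare a new record against each cluster centroid and return
    the index of the nearest cluster, so no re-clustering is needed."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(record, centroids[i]))

centroids = [(0.0, 0.0), (10.0, 10.0)]
print(assign_to_cluster((1.0, 2.0), centroids))  # 0: nearer the first centroid
```

A splitting policy, as mentioned above, would then be triggered only when the chosen cluster's size exceeds a threshold.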