Research on data outlier detection method based on sample parameter selection LOF

Abstract. The LOF (local outlier factor) anomaly detection method has some shortcomings: the value of k strongly affects the accuracy of the detection results, and k is usually selected by trial and error, which consumes considerable computation time. This paper therefore proposes an LOF anomaly detection method based on sample parameter selection. Using a sample data set in which normal points and abnormal points are labeled, the method selects the value of k adaptively and then detects outliers, which improves both the accuracy and the computation speed of outlier detection. An example of outlier detection on meteorological data shows that the LOF method based on sample parameter selection significantly improves detection accuracy and reliability.


Introduction
In general prediction problems, the model is usually an expression of the data structure of the whole sample and captures its general properties. Points that are completely inconsistent with the whole sample in these properties are called outliers [1]. Such anomalies often lead to inaccurate prediction results and serious adverse consequences. At present, outlier detection methods are mainly distance-based, clustering-based, distribution statistics-based, support vector machine-based and density-based [2][3][4]. Density-based and distance-based detection methods are widely used in various scenarios. The set of outliers produced by these methods, and the scores assigned to them, may depend strongly on the number of clusters used and on the outliers present in the data, and are sensitive to the choice of parameters. The LOF outlier detection algorithm [5][6][7] analyzes a data point q together with its k surrounding points, which makes the final outlier factor more reasonable and reduces the influence of density maxima and minima on the whole data set. It expresses the outlier degree of each data point numerically, which is easier to interpret, and only one parameter, k, needs to be set, so it is easy to operate and implement [8]. However, the LOF method raises the problem of selecting the parameter k: whether k is chosen properly directly affects the performance of the outlier detection algorithm.

Research on LOF anomaly detection method based on sample parameter selection
Considering the advantages and disadvantages of the various methods and the characteristics of meteorological data, the local outlier factor (LOF) detection algorithm is adopted here: among conventional density-based outlier detection algorithms it is relatively mature, and it is defined in terms of distance. The method defines the k-distance neighborhood of an object and, from the local density of the object, calculates the local outlier factor LOF, which reflects its degree of anomaly. All objects whose LOF value is greater than a specified threshold are judged to be outliers [8]. However, the LOF method raises the problem of selecting the parameter k: whether k is chosen properly directly affects the performance of the outlier detection algorithm. To eliminate the problem caused by the selection of k, this paper uses sample parameter selection to select an appropriate k adaptively.

LOF anomaly detection method
The LOF algorithm is a density-based unsupervised anomaly detection algorithm. It determines whether a point p is an isolated point by comparing the density at p with the density of the points in its neighborhood [9]. Suppose D is a noisy data set, and each point in the region is an object in the data set, as shown in Figure 1. The calculation steps of LOF are as follows:

(1) For any natural number k, calculate the k-th distance of point p, recorded as $k\text{-}dist(p)$. It is the distance $d(p, o)$ between p and an object $o \in \{s, t, v, q, h\}$ such that at least k objects $o' \in D \setminus \{p\}$ satisfy $d(p, o') \le k\text{-}dist(p)$, and at most k-1 objects $o' \in D \setminus \{p\}$ satisfy $d(p, o') < k\text{-}dist(p)$.

(2) Calculate the k-th distance neighborhood of p, expressed as $N_k(p)$, that is, the set of all points within the k-th distance of p:

$$N_k(p) = \{\, q \in D \setminus \{p\} \mid d(p, q) \le k\text{-}dist(p) \,\} \tag{1}$$

(3) Calculate the reachable distance of p with respect to o, expressed as $reach\text{-}dist(p, o)$. If the real distance between the two objects is less than or equal to the k-th distance of o, the reachable distance is that k-th distance; otherwise the reachable distance is the real distance between the two objects:

$$reach\text{-}dist(p, o) = \max\{\, k\text{-}dist(o),\ d(p, o) \,\} \tag{2}$$

(4) Let $dist$ denote the set of mutual distances $d(p, o)$ between the objects in data set D. The sum of the k-nearest-neighbor distances of p with respect to D is expressed as $d_k(p, D)$ and is calculated as:

$$d_k(p, D) = \sum_{o \in N_k(p)} d(p, o) \tag{3}$$

(5) Calculate the local reachable density of p, recorded as $lrd_k(p)$. It is the reciprocal of the average reachable distance from p to the points in its k-th distance neighborhood:

$$lrd_k(p) = \left( \frac{\sum_{o \in N_k(p)} reach\text{-}dist(p, o)}{|N_k(p)|} \right)^{-1} \tag{4}$$

The local reachable density represents the degree of dispersion of the objects around the point. Objects with a small local reachable density are more likely to be judged as isolated points, while objects with a large local reachable density are more likely to be normal points [10].

(6) Calculate the local isolation coefficient of p, $LOF_k(p)$. The value of k is given arbitrarily, and the local isolation coefficient depends on it: for different values of k, the same data point may have different local isolation coefficients. The LOF algorithm measures the isolation of a data point by its density relative to adjacent data points rather than by its absolute local density. It is therefore expressed as the average ratio of the local reachable density of the points in the neighborhood of p to that of p:

$$LOF_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{lrd_k(o)}{lrd_k(p)} \tag{5}$$

If the local isolation coefficient of object p is large, there are few objects in the local range of p, that is, its local density is small, so p has a high probability of being an isolated point, and vice versa.
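The six steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used in the experiments: the function name `lof_scores`, the Euclidean distance, the inclusion of ties at the k-th distance, and the toy readings are all assumptions made for the example. The points are assumed to be distinct, since duplicate points can make a reachable distance zero.

```python
import math


def lof_scores(points, k):
    """Return the local isolation coefficient (LOF) of every point for a given k.

    points: list of equal-length numeric tuples, assumed to be distinct.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    # Pairwise Euclidean distances d(p, o).
    dist = [[math.dist(a, b) for b in points] for a in points]

    k_dist = []     # k-th distance of each point
    neighbors = []  # k-th distance neighborhood, ties at the k-th distance included
    for i in range(n):
        ranked = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = ranked[k - 1][0]
        k_dist.append(kd)
        neighbors.append([j for d, j in ranked if d <= kd])

    def reach_dist(i, j):
        # Reachable distance: the larger of the k-th distance of o and d(p, o).
        return max(k_dist[j], dist[i][j])

    # Local reachable density: reciprocal of the mean reachable distance.
    lrd = [
        len(neighbors[i]) / sum(reach_dist(i, j) for j in neighbors[i])
        for i in range(n)
    ]

    # Local isolation coefficient: mean ratio of neighbor density to own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical one-dimensional readings; the last one lies far from the rest.
readings = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
for value, score in zip(readings, lof_scores(readings, k=2)):
    print(value[0], round(score, 3))
```

Points inside the evenly spaced group receive a coefficient close to 1, while the distant reading receives a much larger one, which is the behavior described in step (6).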

A LOF method based on sample parameter selection
When outliers are detected with the LOF anomaly detection algorithm, the performance of the algorithm is directly related to the selection of the parameter k, and the value of k directly affects the detection results. If k is too small, outliers will be lost and the recall rate will decrease. If k is too large, the number of abnormal points will increase to a certain extent, which raises the misjudgment rate. A reasonable selection of k is therefore very important for LOF anomaly detection. In current research, k is usually selected by trial: starting from 1, successive values are tried until the most effective k is found. Although this method is simple, it is blind: it evaluates many useless k values and consumes more time. To eliminate the problems caused by the selection of k, this paper adopts a method based on sample parameter selection, which calculates and analyzes the labeled normal points and abnormal points in the sample data set and adaptively selects an appropriate k.
To prevent the misjudgment of outliers due to a small value of k, or the omission of outliers due to a large value of k, this paper selects k by calculating a dispersion measure that maximizes the difference between outliers and normal points.

According to Formula (3), a smaller value of $d_k(p, D)$ for an object p means that the neighborhood of p is dense. Conversely, the larger $d_k(p, D)$ is, the sparser the neighborhood of p and the more likely p is an outlier. That is, $d_k(p, D)$ can distinguish outliers from normal points. Since $d_k(p, D)$ is strongly affected by k, the ratio of the k-nearest-neighbor distance to the number of edges $k(k-1)/2$ between the k nearest neighbors is recorded as the dispersion $D_k(p)$, which is used to maximize the difference between normal points and abnormal points. The calculation formula is:

$$D_k(p) = \frac{d_k(p, D)}{k(k-1)/2} = \frac{2\, d_k(p, D)}{k(k-1)} \tag{6}$$

Therefore, the dispersion $D_k(p)$ is used to select the value of k.

To sum up, the flow of the LOF anomaly detection algorithm based on sample parameter selection is shown in Figure 3.

Fig. 3. LOF anomaly detection algorithm flow based on sample parameter selection.
Step 1: Given sample data in which normal points and isolated points are labeled, calculate, for the normal points and the isolated points respectively, how the ratio of the k-nearest-neighbor distance to the number of edges $k(k-1)/2$ between the k nearest neighbors changes with k. Take the k value at which the difference between the two results is largest. This is the k value selected by the sample method.
Step 2: Use this k value as the parameter k of LOF-based anomaly detection.
Step 3: Calculate the k-th distance neighborhood of each point according to Formula (1), and the reachable distance of each point according to Formula (2).
Step 4: Calculate the local reachable density according to Formula (4) and the local isolation coefficient according to Formula (5). Output the outliers according to the local isolation coefficient.
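Step 1 can be sketched as follows. The text does not say how the per-point dispersion values are aggregated or how the "difference" is measured, so this sketch assumes the mean dispersion of the labeled isolated points minus the mean dispersion of the labeled normal points. The names `dispersion` and `select_k`, the candidate range of k, and the toy data are hypothetical. The sum uses exactly the k smallest distances, and k starts at 2 because $k(k-1)/2$ is zero for k = 1.

```python
import math


def dispersion(points, index, k):
    """Sum of the k nearest-neighbor distances of one point, divided by k(k-1)/2."""
    if k < 2:
        raise ValueError("k must be at least 2, otherwise k(k-1)/2 is zero")
    dists = sorted(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )
    return sum(dists[:k]) / (k * (k - 1) / 2)


def select_k(points, is_outlier, k_values):
    """Pick the k with the largest gap between mean outlier and mean normal dispersion.

    The use of the mean and of a plain subtraction is an assumption of this sketch.
    """
    gaps = {}
    for k in k_values:
        outlier_vals = [
            dispersion(points, i, k) for i, flag in enumerate(is_outlier) if flag
        ]
        normal_vals = [
            dispersion(points, i, k) for i, flag in enumerate(is_outlier) if not flag
        ]
        gaps[k] = (
            sum(outlier_vals) / len(outlier_vals)
            - sum(normal_vals) / len(normal_vals)
        )
    best_k = max(gaps, key=gaps.get)
    return best_k, gaps


# Hypothetical labeled sample: six normal readings and a small group of outliers.
sample = [(float(v),) for v in range(6)] + [(20.0,), (20.5,), (21.0,)]
labels = [False] * 6 + [True] * 3
best_k, gaps = select_k(sample, labels, range(2, 7))
print("selected k:", best_k)
for k, gap in gaps.items():
    print(k, round(gap, 3))
```

The selected k would then be passed to the LOF computation in Steps 2 to 4. Because each candidate k only needs the sorted neighbor distances of the labeled sample, this avoids running the full LOF detection once per candidate.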

Experimental verification
The accuracy of wind speed measurement has an important influence on numerical prediction and calculation results in wind-sand environments. This paper therefore takes the wind speed data measured by a meteorological station in Northwest China as the verification object and uses wind speed as the data object for outlier detection. The value range of the test set is limited to a certain extent, and some data points outside this range are generated with a certain probability and regarded as isolated points. When the LOF anomaly detection algorithm is run, a sample test set composed of normal data points and outliers is extracted from the experimental test set in a certain proportion. Table 1 lists part of the experimental test set. With the LOF anomaly detection algorithm, the degree of isolation is quantified by calculating the local reachable density and the isolation coefficient, so the degree of isolation of each data object can be seen intuitively. The tester can then identify the real anomalies according to the actual situation, which is more practical than a direct yes-or-no decision on whether a point is an anomaly.
Generally, the performance of an outlier detection algorithm is evaluated by its recall rate and its misjudgment (false positive) rate [17]. The recall rate is defined as follows:

$$\eta = \frac{m_n}{m_f} \tag{7}$$

where $\eta$ is the recall rate, and $m_f$ and $m_n$ are the number of outliers actually contained in the data and the number of outliers detected, respectively. The recall rate indicates the detection capability of the algorithm: the higher the recall rate, the more outliers the algorithm detects, that is, the stronger its ability to detect outliers.

The false positive rate is defined as follows:

$$\xi = \frac{m_f}{m_n} \tag{8}$$

where $\xi$ is the false positive rate, and $m_f$ and $m_n$ here are the number of misjudged normal points and the number of actual normal points, respectively. The false positive rate indicates the detection accuracy of the method: the smaller the false positive rate, the fewer normal points are misjudged, that is, the higher the detection accuracy of the algorithm.

As shown in Figure 4, the difference between isolated points and normal points is largest when k = 4. For normal points, the LOF value is small and changes steadily. For outliers, the LOF value is much larger than that of normal points. This confirms the accuracy of selecting k from the sample data. Therefore, in this paper k = 4 is used for LOF-based outlier detection on the test data. Figure 5, Figure 6 and Figure 7 show the curves of the original wind speed data and the LOF values when k = 4, 6 and 8, respectively. According to these figures, some abnormal points are detected whether k is 4, 6 or 8. The original wind speed data contain 5 outliers, and when k = 4 the LOF values calculated for them are significantly higher than those of the other points. According to Formulas (7) and (8), the recall rate is 100% when k is 4 and 80% when k is 6 or 8, and the misjudgment rate is 0. It can be seen that the LOF anomaly detection algorithm detects outliers with high accuracy and stability.
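The two evaluation measures can be computed from index sets as in the sketch below. It assumes that a detected point counts toward the recall rate only if it is an actual outlier, and that the false positive rate is taken over the actual normal points. The function names and the index values are hypothetical and are not taken from the experimental data.

```python
def recall_rate(true_outliers, detected):
    """Share of the actual outliers that were detected."""
    true_outliers, detected = set(true_outliers), set(detected)
    return len(true_outliers & detected) / len(true_outliers)


def false_positive_rate(true_outliers, detected, n_points):
    """Share of the actual normal points that were flagged as outliers."""
    true_outliers, detected = set(true_outliers), set(detected)
    n_normal = n_points - len(true_outliers)
    return len(detected - true_outliers) / n_normal


# Hypothetical indices: five labeled outliers among 100 points, four of them detected.
actual = {3, 17, 42, 58, 91}
flagged = {3, 17, 42, 58}
print(recall_rate(actual, flagged))               # 0.8
print(false_positive_rate(actual, flagged, 100))  # 0.0
```

With four of five outliers found and no normal point flagged, the sketch gives a recall rate of 80% and a false positive rate of 0, the same pattern as reported for k = 6 and k = 8.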
The experiments and result analysis above verify the accuracy and reliability of selecting k from sample data and of the LOF anomaly detection algorithm. The method can therefore effectively detect abnormal values in climate data and improve the quality of the data used in environmental prediction calculations.