
As cash register systems have become prevalent in shopping malls, detecting abnormal states of these systems has become a topic of growing interest. This paper analyzes the transaction data of a shopping mall. The coefficient of variation is used as the attribute weight, the weighted Euclidean distance is used to measure the degree of difference between data points, and *k*-means clustering is used to classify different time periods. The LOF algorithm is then applied to measure the outlier degree of the transaction data in each time period. An initial threshold is set to detect outliers, the outliers are deleted, and a SAX test is performed on the remaining data. If the data do not pass the test, the outlying domain is gradually expanded and the process is repeated, optimizing the outlier threshold so as to improve the sensitivity of the detection algorithm and reduce false positives.

With rising living standards, the purchasing power of residents is also increasing. Domestic shopping malls, as the markets with the widest range of goods on sale, handle large numbers of commodities and customers and generate huge amounts of cash register information every day. Detecting anomalies in this information and maintaining the normal operation of the cash register system are critical [

At present, anomaly detection based on the LOF algorithm has produced many research results. For example, Chen Wei [

This paper applies the LOF algorithm to calculate the local outlier degree of the data. A loose threshold is then set to screen outliers. After the outliers are deleted, the similarity between the screened data and reasonable data is measured by a SAX test. If the screened data do not pass the test, the abnormal limit is expanded, the screening power is increased, and the test is repeated in a loop. Through this gradual adjustment and optimization of the outlier threshold, false alarms can be reduced as far as possible.

First, we analyze the transaction system data of a shopping mall from late January, a total of 12,954 transaction records. Each record includes the transaction date, time, volume, success rate, and response time. Some of the transaction data are as follows

| Date | Time | Volume | Success Rate | Response Time |
|---|---|---|---|---|
| 1.23 | 0:00 | 178 | 94.94% | 105 |
| 1.23 | 0:01 | 158 | 98.73% | 87 |
| 1.23 | 0:02 | 129 | 98.45% | 97 |
| 1.23 | 0:03 | 111 | 99.1% | 93 |
| 1.23 | 0:04 | 124 | 95.16% | 95 |
| 1.23 | 0:05 | 105 | 95.24% | 96 |

The k-means clustering algorithm first randomly selects k objects as the initial cluster centers. The distance between each object and each cluster center is then calculated, and each object is assigned to the nearest cluster center. A cluster center together with the objects assigned to it represents a cluster. After all objects have been assigned, the center of each cluster is recalculated from the objects currently in that cluster. This process is repeated until a termination condition is met, for example that no object is reassigned to a different cluster, at which point the sum of squared errors is at a local minimum.

The basic operation is as follows:

1) Take k elements randomly from the element set d as the respective centers of the k clusters.

2) Calculate the dissimilarity between each remaining element and the centers of the k clusters, and assign each element to the cluster with the lowest dissimilarity. The dissimilarity is calculated as follows:

$$dis_k = \sqrt{\sum_{i=1}^{n} \omega_i \left(x_i - x_{ki}\right)^2}$$

where $x_i$ is the i-th attribute value of the element, $x_{ki}$ is the i-th attribute value of the k-th cluster center, and $\omega_i$ is the weight of the i-th attribute. To avoid the influence of differing dimensions, the coefficient of variation of each attribute is used as its weight:

$$\omega_i = S_i / \bar{x}_i$$

where $S_i$ is the standard deviation of the i-th attribute and $\bar{x}_i$ is its mean.

3) According to the clustering result, recalculate the centers of the k clusters by taking the arithmetic mean of each attribute over all elements in the cluster:

$$x_{ki} = \frac{1}{n_k} \sum x_i$$

where the sum runs over the i-th attribute values $x_i$ of all elements in the k-th cluster and $n_k$ is the number of elements in the k-th cluster.

4) Regroup all elements in d according to the new centers.

5) Repeat steps 3 and 4 until the clustering result no longer changes.

6) Output the result.
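The steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`cv_weights`, `weighted_distance`, `kmeans`) are hypothetical, the weight is assumed to enter the distance as $\omega_i (x_i - x_{ki})^2$ under a square root, $S_i$ is taken as the population standard deviation, and all attribute means are assumed to be non-zero.

```python
import math
import random
from statistics import mean, pstdev


def cv_weights(rows):
    """Coefficient of variation (std / mean) of each attribute column."""
    return [pstdev(col) / abs(mean(col)) for col in zip(*rows)]


def weighted_distance(x, center, weights):
    """Weighted Euclidean distance between an element and a cluster center."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, center)))


def kmeans(rows, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    weights = cv_weights(rows)
    centers = rng.sample(rows, k)          # step 1: random initial centers
    labels = None
    for _ in range(max_iter):
        # steps 2 and 4: assign every element to the least dissimilar center
        new_labels = [
            min(range(k), key=lambda j: weighted_distance(x, centers[j], weights))
            for x in rows
        ]
        if new_labels == labels:           # step 5: stop when nothing changes
            break
        labels = new_labels
        # step 3: recompute each center as the attribute-wise mean of its members
        for j in range(k):
            members = [x for x, lab in zip(rows, labels) if lab == j]
            if members:                    # an emptied cluster keeps its old center
                centers[j] = tuple(mean(col) for col in zip(*members))
    return labels, centers


# First three rows come from the sample table; the last three are made-up
# high-load records added only to give the sketch a second group.
rows = [(178, 94.94, 105), (158, 98.73, 87), (129, 98.45, 97),
        (900, 95.0, 300), (950, 96.0, 310), (920, 97.0, 305)]
labels, centers = kmeans(rows, k=2)
```

Because the initial centers are random, different seeds can lead to different local optima on less clearly separated data, which is why the clustering is usually repeated several times.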

The CH (Calinski-Harabasz) indicator describes compactness through the intra-class dispersion matrix and the degree of separation through the between-class dispersion matrix. It is defined as:

$$CH(k) = \frac{\operatorname{tr}B(k)/(k-1)}{\operatorname{tr}W(k)/(n-k)}$$

where $n$ is the number of samples, $k$ is the number of clusters, $\operatorname{tr}B(k)$ is the trace of the between-class dispersion matrix, and $\operatorname{tr}W(k)$ is the trace of the intra-class dispersion matrix.

From the definition, a larger CH value means that each class is more compact and the classes are more separated from one another, that is, the clustering result is better. The CH indicator was therefore used to measure the effectiveness of the clustering. The value of k ranged from 2 to 8, with cluster centers selected randomly. For each k, clustering was run 10 times, and the average CH values are as follows

According to the result graph, when the k value is 4, the clustering result is optimal.
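As a rough illustration of how the CH value of one clustering result can be computed, the following sketch evaluates the two traces directly from their definitions. The function name `ch_index` is hypothetical, plain (unweighted) squared Euclidean distances are assumed, and the toy points are made up for the example. The sketch also assumes that the within-class scatter is not zero.

```python
def ch_index(rows, labels):
    """CH value: [trB / (k - 1)] / [trW / (n - k)] for one clustering result."""
    n = len(rows)
    clusters = sorted(set(labels))
    k = len(clusters)
    dim = len(rows[0])
    overall = [sum(x[d] for x in rows) / n for d in range(dim)]
    tr_b = 0.0
    tr_w = 0.0
    for c in clusters:
        members = [x for x, lab in zip(rows, labels) if lab == c]
        center = [sum(x[d] for x in members) / len(members) for d in range(dim)]
        # between-class part: cluster size times squared distance to the overall mean
        tr_b += len(members) * sum((center[d] - overall[d]) ** 2 for d in range(dim))
        # within-class part: squared distances of the members to their own center
        tr_w += sum((x[d] - center[d]) ** 2 for x in members for d in range(dim))
    return (tr_b / (k - 1)) / (tr_w / (n - k))


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = ch_index(points, [0, 0, 1, 1])   # compact, well-separated clusters
bad = ch_index(points, [0, 1, 0, 1])    # clusters that mix the two groups
```

In this toy case the sensible partition gives a far larger CH value than the mixed one, which is the behavior used above to select k.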

With k = 4 as the number of clusters and the transaction time, volume, success rate, and response time as the attributes, the transaction dates are clustered to obtain the clustering result (

According to the classification results, the transaction dates fall into four categories: before the Spring Festival, after the Spring Festival, workdays, and non-workdays.

K-means clustering is then performed on the time of day within each of the above date categories. For each of the four date categories, all dates are selected, the average value of each attribute is computed, and the transaction data are plotted against time (

The line chart shows that the daily transaction volume follows basically the same trend in all four date categories. The transaction volume gradually increases from 0:00, dips slightly at midday, rises again, and finally decreases. The trough and peak periods are clearly distinguishable. Therefore, k-means clustering analysis is also performed on the time of day, with k set to 2, to obtain the time clustering result as shown in the following

| Date category | First trough period | Peak period | Second trough period |
|---|---|---|---|
| Before the Spring Festival | 0:00 - 8:19 | 8:19 - 20:23 | 20:23 - 23:59 |
| After the Spring Festival | 0:00 - 8:35 | 8:35 - 19:51 | 19:51 - 23:59 |
| Workdays | 0:00 - 7:54 | 7:54 - 20:36 | 20:36 - 23:59 |
| Non-workdays | 0:00 - 7:58 | 7:58 - 20:30 | 20:30 - 23:59 |

The trough and peak periods are basically the same in all four date categories, so the clustering results are good.

Basic variable definition:

1) The distance between two points p and o is the absolute difference between their data values. The distance in transaction volume is:

$$d_1(p, o) = |x_1(p) - x_1(o)|$$

The distance in transaction success rate is:

$$d_2(p, o) = |x_2(p) - x_2(o)|$$

The distance in response time is:

$$d_3(p, o) = |x_3(p) - x_3(o)|$$

2) The k-th distance $d_k(p)$: the distance from p to its k-th nearest point, excluding p itself.

3) The k-th distance neighborhood $N_k(p)$: the set of all points whose distance from p does not exceed the k-th distance, including points exactly at the k-th distance. Therefore, the number of k-th neighbors of p satisfies $|N_k(p)| \ge k$.

4) Reachable distance:

$$reach\text{-}dist(p, o) = \max\{d_k(o),\ d(p, o)\}$$

5) Local reachable density: the higher the density, the more likely the point belongs to the same cluster as its neighbors; the lower the density, the more likely it is an outlier. The local reachable density of point p is expressed as:

$$lrd_k(p) = 1 \Big/ \frac{\sum_{o \in N_k(p)} reach\text{-}dist(p, o)}{|N_k(p)|}$$

6) Local outlier factor: indicates the degree of abnormality of a data object; its size reflects how isolated the object is relative to the points in its neighborhood. It is defined as:

$$LOF_k(p) = \frac{\sum_{o \in N_k(p)} \dfrac{lrd_k(o)}{lrd_k(p)}}{|N_k(p)|}$$

The basic operation is as follows [

1) For each data object p in the overall data set d, calculate the distances from p to the other objects.

2) Sort the distances, and determine the k-th distance and the k-th distance neighborhood $N_k(p)$.

3) Calculate the local reachable density of each transaction record.

4) Calculate the local outlier factor of each transaction record.

5) Sort and output the local outlier factors of the transaction records.
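The definitions and steps above can be sketched as follows for a single attribute, using the absolute difference as the distance. This is an illustrative sketch rather than the authors' code: the function name `lof_scores` and the sample values are hypothetical, and duplicate values (which would give a zero reachable distance sum) are not handled specially beyond returning an infinite density.

```python
def lof_scores(points, k):
    """Local outlier factor of every point of a one-dimensional data set."""
    n = len(points)
    dist = [[abs(p - o) for o in points] for p in points]

    k_dist = []   # k-th distance of each point
    neigh = []    # indices in the k-th distance neighborhood of each point
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_dist.append(kd)
        # ties at the k-th distance are included, so len(neigh[i]) >= k
        neigh.append([j for d, j in others if d <= kd])

    def reach_dist(i, j):
        """Reachable distance of point i with respect to point j."""
        return max(k_dist[j], dist[i][j])

    lrd = []      # local reachable density
    for i in range(n):
        total = sum(reach_dist(i, j) for j in neigh[i])
        lrd.append(len(neigh[i]) / total if total > 0 else float("inf"))

    # LOF: mean ratio of the neighbors' densities to the point's own density
    return [sum(lrd[j] / lrd[i] for j in neigh[i]) / len(neigh[i])
            for i in range(n)]


# Made-up per-minute transaction volumes; the last value is an obvious outlier.
volumes = [100, 102, 101, 99, 103, 180]
scores = lof_scores(volumes, k=2)
ranked = sorted(range(len(volumes)), key=lambda i: scores[i], reverse=True)
```

Points inside the dense group obtain scores close to 1, while the isolated value obtains a much larger score, so comparing the scores with a threshold separates the two.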

Based on the periods obtained above, the average transaction volume, success rate, and response time in each of the eight periods were used to calculate the outlier degree at each time point. Taking the maximum outlier degree of the mean data as the boundary of the abnormal condition, the abnormal regions in each period are set as follows

| Period | Abnormal region of transaction volume | Abnormal region of transaction success rate | Abnormal region of transaction response time |
|---|---|---|---|
| Peak period on the workdays | (1.6179, ∞) | (1.2724, ∞) | (2.5193, ∞) |
| Trough period on the workdays | (1.3588, ∞) | (2.3952, ∞) | (7.6093, ∞) |
| Peak period on the non-workdays | (1.8366, ∞) | (1.3604, ∞) | (4.3227, ∞) |
| Trough period on the non-workdays | (1.3994, ∞) | (7.2853, ∞) | (8.6643, ∞) |
| Peak period before the Spring Festival | (2.0111, ∞) | (1.3745, ∞) | (3.2204, ∞) |
| Trough period before the Spring Festival | (1.7611, ∞) | (2.9421, ∞) | (5.5650, ∞) |
| Peak period after the Spring Festival | (2.0286, ∞) | (1.6719, ∞) | (6.8982, ∞) |
| Trough period after the Spring Festival | (2.0914, ∞) | (2.7498, ∞) | (8.8315, ∞) |

From these regions, the local outlier degree of the transaction data at any time of day can be evaluated, and abnormality can be judged against the abnormality threshold. The following are the outlier degree and abnormal point discrimination charts for each time on January 23 (

Symbolic Aggregate approXimation (SAX) is a data compression and information extraction method that discretizes data sequences and converts them into symbol sequences according to the characteristics of data density [

1) All data are standardized by Z-score so that they have zero mean and unit variance, matching the standard normal distribution. The conversion function is:

$$x^* = \frac{x - \mu}{\sigma}$$

2) A piecewise aggregate approximation (PAA) is applied to the normalized time series. The series, of total length n, is divided into w groups in chronological order. The arithmetic mean m of each group is computed and used to replace all values in that group, which reduces the dimension of the data from n to w and turns the fluctuating time series into a staircase sequence.

3) Divide the probability density curve of N(0, 1) into *a* intervals of equal probability, replace each PAA segment with the discrete letter of the interval it falls in, and complete the symbolization of the sequence.

4) Measure and compare the similarity of symbol sequences. Let P and Q be two symbol sequences, and let $p_i$ and $q_i$ denote the i-th elements of the respective sequences. The distance between the symbol sequences is defined as:

$$D(P, Q) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left(dist(p_i, q_i)\right)^2}$$

where

$$dist(p_i, q_i) = \begin{cases} 0, & |p_i - q_i| \le 1 \\ b_{\max(p_i, q_i) - 1} - b_{\min(p_i, q_i)}, & |p_i - q_i| > 1 \end{cases}$$

and b denotes the split points that divide the area under the normal distribution curve.
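The four SAX steps can be sketched with the Python standard library as follows. This is a minimal sketch under several assumptions: the function names are hypothetical, symbols are represented by integers 0 to a − 1 rather than letters, the series length is assumed to be divisible by w, and the symbol distance uses the usual SAX lookup in which adjacent symbols have distance 0 and other pairs have the gap between the nearest split points of their intervals.

```python
import math
from statistics import NormalDist, mean, pstdev


def znorm(series):
    """Step 1: Z-score standardization."""
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma for x in series]


def paa(series, w):
    """Step 2: mean of each of w equal-length chronological groups."""
    size = len(series) // w
    return [mean(series[i * size:(i + 1) * size]) for i in range(w)]


def breakpoints(a):
    """Split points that cut N(0, 1) into a intervals of equal probability."""
    return [NormalDist().inv_cdf(j / a) for j in range(1, a)]


def symbolize(segments, b):
    """Step 3: map each PAA value to the index of the interval it falls in."""
    return [sum(m > cut for cut in b) for m in segments]


def sax_distance(p, q, b, n):
    """Step 4: distance between two symbol sequences built from series of length n."""
    def cell(r, c):
        if abs(r - c) <= 1:
            return 0.0
        return b[max(r, c) - 1] - b[min(r, c)]
    w = len(p)
    return math.sqrt(n / w) * math.sqrt(sum(cell(r, c) ** 2 for r, c in zip(p, q)))


# Made-up rising series of length n = 8, reduced to w = 4 segments, a = 4 symbols.
series = [1, 2, 3, 4, 5, 6, 7, 8]
b = breakpoints(4)
symbols = symbolize(paa(znorm(series), 4), b)
same = sax_distance(symbols, symbols, b, len(series))
flipped = sax_distance(symbols, symbols[::-1], b, len(series))
```

A distance of zero means the two sequences never differ by more than one symbol in any segment, which is the pass condition used in the test described below.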

After the outliers in the original sequence are eliminated, symbolic aggregate approximation is applied to the transaction data. Taking the transaction volume on January 23 as an example, the PAA conversion graph is as follows

After each segment is replaced with a discrete letter, the symbolized data are as follows

The symbolic aggregate approximation of the average data of each time period is also calculated, the distance between the symbolized transaction data and the symbolized mean is computed for each period, and the following test is performed:

1) If $dist(p_i, q_i) = 0$, keep the original abnormality threshold;

2) If $dist(p_i, q_i) > 0$, increase the abnormality threshold.

After the abnormality threshold is expanded, the new abnormal points are deleted, and the data are checked and corrected again until all time periods pass the test. Correcting the threshold in this way can increase the sensitivity of the anomaly detection model and minimize false alarms and omissions of abnormal data. The SAX test results based on the initial abnormality threshold are as follows

| Period | Test result for transaction volume | Test result for transaction success rate | Test result for transaction response time |
|---|---|---|---|
| Peak period on the workdays | 0 | 105.2836 | 148.3655 |
| Trough period on the workdays | 0 | 78.3514 | 284.9316 |
| Peak period on the non-workdays | 0 | 109.3652 | 372.7653 |
| Trough period on the non-workdays | 0 | 239.5185 | 109.8498 |
| Peak period before the Spring Festival | 0 | 0 | 38.5632 |
| Trough period before the Spring Festival | 0 | 35.9762 | 0 |
| Peak period after the Spring Festival | 0 | 0 | 67.3628 |
| Trough period after the Spring Festival | 0 | 0 | 79.2295 |

From the test results, the transaction volume passes the test in every period, while the success rate and the response time do not pass in several periods, so their thresholds are increased. The optimized abnormal regions are as follows

| Period | Abnormal region of transaction volume | Abnormal region of transaction success rate | Abnormal region of transaction response time |
|---|---|---|---|
| Peak period on the workdays | (1.6179, ∞) | (1.4524, ∞) | (2.9893, ∞) |
| Trough period on the workdays | (1.3588, ∞) | (2.6052, ∞) | (7.7593, ∞) |
| Peak period on the non-workdays | (1.8366, ∞) | (1.8204, ∞) | (4.4827, ∞) |
| Trough period on the non-workdays | (1.3994, ∞) | (7.9753, ∞) | (8.7343, ∞) |
| Peak period before the Spring Festival | (2.0111, ∞) | (1.3745, ∞) | (3.2704, ∞) |
| Trough period before the Spring Festival | (1.7611, ∞) | (2.9621, ∞) | (5.5650, ∞) |
| Peak period after the Spring Festival | (2.0286, ∞) | (1.6719, ∞) | (6.9082, ∞) |
| Trough period after the Spring Festival | (2.0914, ∞) | (2.7498, ∞) | (8.9415, ∞) |

Anomaly detection is performed on the transaction data with both the initial and the optimized abnormality thresholds. The detection accuracy is as follows

| Threshold | Accuracy for transaction volume | Accuracy for transaction success rate | Accuracy for response time |
|---|---|---|---|
| Initial abnormality threshold | 97.2% | 93.1% | 96.4% |
| Optimized abnormality threshold | 97.2% | 98.5% | 99.1% |

From the accuracy results, optimizing the threshold leaves the accuracy for transaction volume unchanged at 97.2%, and raises the accuracy for the success rate from 93.1% to 98.5% and for the response time from 96.4% to 99.1%.

These results show that the LOF algorithm measures the local outlier degree of data points well and can therefore be used for anomaly detection. However, the abnormality threshold is often set subjectively. Testing the data that remain after outlier deletion with the SAX algorithm effectively reveals deficiencies in a manually set threshold, so that the threshold can be adjusted and improved. Applying SAX to test and adjust the abnormality threshold can therefore substantially improve the accuracy of anomaly detection.

Long, K., Wu, Y.H. and Gui, Y.F. (2018) Anomaly Detection of Store Cash Register Data Based on Improved LOF Algorithm. Applied Mathematics, 9, 719-729. https://doi.org/10.4236/am.2018.96049