Outlier detection in a set of data is an active area of research in data processing. Many modeling techniques are resistant to outliers or can reduce their impact. Detecting and understanding outliers can lead to interesting discoveries.
Definition of Outlier Detection
Outliers are generally defined as observations that lie exceptionally far from the main body of the data. There is no strict mathematical definition of what makes an observation an outlier; determining whether an observation is an outlier is ultimately a subjective exercise.
An outlier can be interpreted as a data point or observation that deviates markedly from the rest of a given sample or dataset. An outlier may occur by chance, but it may also indicate a measurement error, or the dataset may simply have a heavy-tailed distribution.
Therefore, outlier detection can be defined as the process of detecting, and then excluding, outliers from a given dataset. There are no standardized outlier-identification methods, because these are largely dataset-dependent. Outlier detection, as a branch of data processing, has many applications in data-stream analysis.
This paper focuses on the problem of outlier detection in data streams and on the specific techniques used to detect outliers in streaming data in data mining. We also review recent research on outlier detection methods and outlier analysis.
Our discussion covers standard application areas of outlier detection, such as fraud detection, public health, and sports, and touches on different approaches, such as proximity-based and angle-based approaches.
Outlier Detection Techniques
To identify outliers in a dataset, it is important to keep the context in mind and find the answer to the most basic and relevant question: “Why should I find outliers?” The context determines the meaning of your findings.
Keep two important questions about your dataset in mind during outlier identification:
(i) Which and how many features do I consider for outlier detection? (univariate / multivariate)
(ii) Can I assume a distribution of values for the features I have selected? (parametric / non-parametric)
In general, there are four outlier detection techniques.
1. Numeric Outlier
A numeric outlier is the simplest, non-parametric outlier detection technique for a one-dimensional feature space. Outliers are determined using the IQR (interquartile range): the first and third quartiles (Q1, Q3) are calculated, and an outlier is a data point xi that falls outside the range [Q1 − k(Q3 − Q1), Q3 + k(Q3 − Q1)].
Using the interquartile multiplier value k = 1.5, these limits correspond to the typical upper and lower whiskers of a box plot.
This technique can be easily implemented on the KNIME Analytics platform using the Numeric Outliers node.
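As a sketch, the IQR rule can also be implemented in a few lines of plain Python; the function names below are illustrative and not part of any library mentioned here.

```python
# Numeric outlier detection with the IQR rule (illustrative sketch).

def quartiles(data):
    """Return (Q1, Q3) using linear interpolation between sorted values."""
    s = sorted(data)
    def q(p):
        idx = p * (len(s) - 1)
        lo, hi = int(idx), min(int(idx) + 1, len(s) - 1)
        frac = idx - lo
        return s[lo] * (1 - frac) + s[hi] * frac
    return q(0.25), q(0.75)

def iqr_outliers(data, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

values = [10, 12, 11, 13, 12, 11, 14, 95]  # 95 is an obvious outlier
print(iqr_outliers(values))  # [95]
```

With k = 1.5, the bounds computed here match the box-plot whiskers described above.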
2. Z-score
The Z-score technique assumes a Gaussian distribution of the data. Outliers are the data points that lie in the tails of the distribution and are therefore far from the mean.
The z-score of any data point can be calculated by the expression zi = (xi − μ) / σ, where μ and σ are the mean and standard deviation of the selected feature, after applying the appropriate transformation to the feature space of the dataset.
When calculating the z-score for each sample, a threshold must be specified. Some good rule-of-thumb thresholds are 2.5, 3, or 3.5 standard deviations, or more.
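A minimal sketch of z-score-based flagging in plain Python, using the population standard deviation and the rule-of-thumb threshold of 3 standard deviations mentioned above:

```python
# Z-score outlier detection: flag points more than `threshold`
# standard deviations from the mean (illustrative sketch).
import math

def z_scores(data):
    mean = sum(data) / len(data)
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))
    return [(x - mean) / std for x in data]

def z_score_outliers(data, threshold=3.0):
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]

print(z_score_outliers([10.0] * 19 + [100.0]))  # [100.0]
```

Note that the outlier itself inflates the estimated mean and standard deviation, which is one reason robust variants (e.g. using the median) are often preferred in practice.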
3. DBSCAN
This outlier detection technique is based on the DBSCAN clustering method. DBSCAN is a non-parametric, density-based outlier detection method. Here, all data points are classified as core points, border points, or noise points.
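The classification into core, border, and noise points can be sketched as follows. This is a toy example with brute-force neighborhood search on one-dimensional data; real implementations (e.g. scikit-learn's DBSCAN) use spatial indexes, and the parameter names eps and min_pts here mirror DBSCAN's ε and MinPts.

```python
# Classify points as core / border / noise, as in DBSCAN (toy sketch).

def classify_points(points, eps, min_pts):
    """A point is 'core' if it has >= min_pts neighbors within eps
    (counting itself); 'border' if it is not core but lies within
    eps of a core point; 'noise' otherwise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if abs(points[i] - points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = []
    for i in range(n):
        if core[i]:
            labels.append("core")
        elif any(core[j] for j in neighbors[i]):
            labels.append("border")
        else:
            labels.append("noise")
    return labels

points = [1.0, 1.1, 1.2, 1.3, 1.75, 9.0]
print(classify_points(points, eps=0.5, min_pts=3))
```

The noise points produced by this classification are the outliers reported by the technique.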
4. Isolation Forest
This non-parametric technique is suitable for large datasets with one- or multi-dimensional features. The isolation number is central to this outlier detection technique: it is the number of splits required to isolate a data point. Outliers typically require fewer splits to isolate than points inside a dense region.
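The idea of the isolation number can be illustrated with a toy one-dimensional sketch, assuming random split points between the current minimum and maximum, as in an isolation tree. This is not a full Isolation Forest implementation, and the function names are illustrative only.

```python
# Toy sketch of the isolation number: count random splits needed to
# isolate one data point. Outliers tend to need fewer splits.
import random

def isolation_number(data, target, rng):
    """Random splits until `target` sits alone in its interval."""
    current = list(data)
    splits = 0
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # duplicate values cannot be separated by a split
        cut = rng.uniform(lo, hi)
        if target < cut:
            current = [x for x in current if x < cut]
        else:
            current = [x for x in current if x >= cut]
        splits += 1
    return splits

def average_isolation_number(data, target, trials=200, seed=0):
    rng = random.Random(seed)
    return sum(isolation_number(data, target, rng) for _ in range(trials)) / trials

data = [10, 11, 12, 11, 10, 12, 11, 50]
# The outlier (50) is isolated in fewer splits, on average, than an inlier (11):
print(average_isolation_number(data, 50), average_isolation_number(data, 11))
```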
Outlier Detection Methods
Models for Outlier Detection Analysis
There are many approaches to detecting anomalies. Outlier detection models can be classified into the following groups:
1. Extreme value analysis
Extreme value analysis is the most basic form of outlier detection and is suitable for one-dimensional data. In this approach to outlier analysis, the largest or smallest values are considered outliers. The Z-test and Student's t-test are excellent examples.
These are good heuristics for an initial analysis of the data, but they are not of much value in multivariate settings. Extreme value analysis is often used as a final step in interpreting the outputs of other outlier detection methods.
2. Linear Models
In this approach, the data is modeled as lying in a lower-dimensional subspace using linear correlations. The distance of each data point to a plane that fits this subspace is calculated, and this distance is used to detect outliers. PCA (principal component analysis) is an example of a linear model for anomaly detection.
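A minimal sketch of this idea for two-dimensional data: fit the principal axis with a closed-form 2×2 PCA and score each point by its distance to that line. This is illustrative only; a real pipeline would use a library PCA on centered, possibly scaled data.

```python
# Linear-model outlier scoring: residual distance to the principal
# axis of 2-D data (closed-form 2x2 PCA; illustrative sketch).
import math

def pca_residuals(points):
    """Distance of each 2-D point to the best-fit line through the mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # 2x2 covariance matrix [[a, b], [b, c]]
    a = sum((p[0] - mx) ** 2 for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Largest eigenvalue; its eigenvector is the principal direction
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    if abs(b) > 1e-12:
        vx, vy = b, lam - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    # Residual = component of the centered point orthogonal to (vx, vy)
    res = []
    for px, py in points:
        dx, dy = px - mx, py - my
        proj = dx * vx + dy * vy
        res.append(math.hypot(dx - proj * vx, dy - proj * vy))
    return res

pts = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (3, 0)]
print(pca_residuals(pts))  # the off-diagonal point (3, 0) scores highest
```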
3. Probabilistic and Statistical Models
In this approach, probabilistic and statistical models assume specific distributions of the data. Expectation-maximization (EM) methods are used to estimate the parameters of the model. Finally, the membership probability of each data point under the fitted distribution is calculated. Points with the lowest membership probability are marked as outliers.
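The simplest instance of this idea fits a single Gaussian by maximum likelihood (the one-component special case of EM) and flags the point with the lowest density under the fitted model. The sketch below assumes one-dimensional data; a realistic model would use EM to fit a mixture of several components.

```python
# Probabilistic outlier scoring sketch: density under a single
# Gaussian fitted by maximum likelihood.
import math

def gaussian_density_scores(data):
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return [
        math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        for x in data
    ]

def lowest_probability_point(data):
    """The point with the lowest density is the outlier candidate."""
    scores = gaussian_density_scores(data)
    return data[scores.index(min(scores))]

print(lowest_probability_point([10, 11, 12, 11, 10, 12, 11, 50]))  # 50
```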
4. Proximity-based Models
In this approach, outliers are modeled as points that are isolated from the rest of the observations. Cluster analysis, density-based analysis, and nearest-neighbor methods are the key approaches of this type.
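A common nearest-neighbor scoring scheme of this type takes the distance to a point's k-th nearest neighbor as its outlier score: isolated points get large scores. A brute-force sketch on one-dimensional data, with illustrative function names:

```python
# Proximity-based (kNN-distance) outlier scoring sketch.

def knn_outlier_scores(data, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, x in enumerate(data):
        dists = sorted(abs(x - y) for j, y in enumerate(data) if j != i)
        scores.append(dists[k - 1])
    return scores

data = [1.0, 1.2, 1.1, 0.9, 8.0]
print(knn_outlier_scores(data, k=2))  # the isolated point 8.0 scores highest
```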
5. Information-theoretical models
In this approach, outliers are the points that increase the minimum code length required to describe the dataset.