Anomaly Detection


Anomaly detection determines “whether a new observation belongs to the same distribution as existing observations (it is an inlier), or should be considered as different (it is an outlier).”

=> Often, this ability is used to clean real data sets.

Two important distinctions must be made:

  • Novelty detection: the training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.

  • Outlier detection: the training data contains outliers, and we need to fit the central mode of the training data, ignoring the deviant observations.
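To make the distinction concrete, here is a minimal scikit-learn sketch (the synthetic data and parameter values are illustrative assumptions, not from the original text): outlier detection labels the contaminated training set directly, whereas novelty detection fits on clean data and then scores genuinely new observations.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_clean = rng.normal(0, 1, size=(200, 2))                          # inliers only
X_mixed = np.vstack([X_clean, rng.uniform(-6, 6, size=(10, 2))])   # inliers + a few outliers
X_new = rng.normal(0, 1, size=(20, 2))                             # unseen observations

# Outlier detection: the training data itself is polluted; label it directly.
iso = IsolationForest(contamination=0.05, random_state=42)
outlier_labels = iso.fit_predict(X_mixed)        # +1 = inlier, -1 = outlier

# Novelty detection: fit on clean data, then score new observations.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_clean)
novelty_labels = lof.predict(X_new)              # +1 = same distribution, -1 = novelty
```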

  1. - good

  2. Index for

  3. A great tutorial about AD using 20 algos in a single python package.

  4. A comparison of One-Class SVM versus Elliptic Envelope versus Isolation Forest versus LOF in sklearn. (The examples below illustrate how the performance of covariance.EllipticEnvelope degrades as the data is less and less unimodal. svm.OneClassSVM works better on data with multiple modes, and ensemble.IsolationForest and neighbors.LocalOutlierFactor perform well in every case.)

  5. - the information is there, but it's all over the place.

  6. Twitter anomaly -

  7. Microsoft anomaly - a well-documented black box; I can't find a description of the algorithm, just hints at what they did.

  8. STL and by Microsoft

OUTLIER DETECTION

ISOLATION FOREST

The algorithm isolates observations by:

  • randomly selecting a feature, and then

  • randomly selecting a split value between the maximum and minimum values of the selected feature.

Recursive partitioning can be represented by a tree structure, so the number of splits required to isolate a sample is equivalent to the path length from the root node to the terminating node.

This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.

Random partitioning produces noticeably shorter paths for anomalies.

=> when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.

Note that the tree height limit l is automatically set by the sub-sampling size ψ: l = ceiling(log2 ψ), which is approximately the average tree height [7].

The rationale of growing trees up to the average tree height is that we are only interested in data points that have shorter-than-average path lengths, as those points are more likely to be anomalies.
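A minimal sketch of the above in scikit-learn (dataset, contamination rate, and sub-sampling size are assumed for illustration): max_samples plays the role of ψ, and samples with short average path lengths are labeled -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),     # dense "normal" cluster
               rng.uniform(-8, 8, size=(15, 2))])   # scattered anomalies

clf = IsolationForest(n_estimators=100,    # number of random iTrees in the forest
                      max_samples=256,     # sub-sampling size psi (height limit ~ log2(256) = 8)
                      contamination=0.05,  # assumed fraction of anomalies
                      random_state=0)
labels = clf.fit_predict(X)                # -1 = anomaly (short average path), +1 = normal
scores = clf.decision_function(X)          # lower score = more anomalous
```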

LOCAL OUTLIER FACTOR

  • It measures the local density deviation of a given data point with respect to its neighbors. The idea is to detect the samples that have a substantially lower density than their neighbors.

  • In practice the local density is obtained from the k-nearest neighbors.

  • The LOF score of an observation is equal to the ratio of the average local density of its k-nearest neighbors to its own local density:

    • a normal instance is expected to have a local density similar to that of its neighbors,

    • while abnormal data are expected to have much smaller local density.
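A small scikit-learn sketch of the same idea (the synthetic data and n_neighbors value are assumptions): the LOF score is roughly the ratio described above, so values close to 1 indicate inliers and much larger values indicate outliers.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, size=(200, 2)),     # dense region
               np.array([[4.0, 4.0], [-5.0, 3.5]])])  # isolated, low-density points

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 = outlier, +1 = inlier
lof_scores = -lof.negative_outlier_factor_   # ~1 for inliers, noticeably larger for outliers
```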

ELLIPTIC ENVELOPE

  1. We assume that the regular data come from a known distribution (e.g. data are Gaussian distributed).

  2. From this assumption, we generally try to define the “shape” of the data,

  3. And can define outlying observations as observations that stand far enough from the fitted shape.
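For example (a sketch with assumed synthetic data), scikit-learn's covariance.EllipticEnvelope fits a robust Gaussian estimate of the data's shape and flags observations that lie too far from it, e.g. by Mahalanobis distance.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)
X = np.vstack([rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=300),
               rng.uniform(-8, 8, size=(10, 2))])    # points far from the Gaussian shape

env = EllipticEnvelope(contamination=0.03, random_state=0)
labels = env.fit_predict(X)       # -1 = outlying w.r.t. the fitted ellipse, +1 = regular
dists = env.mahalanobis(X)        # squared Mahalanobis distance to the robust location
```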

ONE CLASS SVM

The resulting hypersphere is characterized by a center and a radius R > 0 (the distance from the center to any support vector on the boundary), and the volume of the hypersphere is minimized by minimizing R².
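Note that scikit-learn's svm.OneClassSVM implements the related Schölkopf formulation (a maximum-margin separation of the data from the origin in feature space) rather than the hypersphere/SVDD description above; the following is a minimal, assumed-parameter sketch of how it is typically used.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, size=(300, 2))              # assumed "normal" training data
X_test = np.vstack([rng.normal(0, 1, size=(20, 2)),    # new inliers
                    rng.uniform(-6, 6, size=(5, 2))])  # new outliers

ocsvm = OneClassSVM(kernel='rbf',
                    nu=0.05,        # upper bound on the fraction of training outliers
                    gamma='scale')  # RBF kernel width heuristic
ocsvm.fit(X_train)
labels = ocsvm.predict(X_test)             # -1 = outside the learned boundary, +1 = inside
scores = ocsvm.decision_function(X_test)   # signed distance to the boundary
```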

CLUSTERING METRICS

For community detection, text clusters, etc.

Silhouette:

Alibi Detect is an open source Python library focused on outlier, adversarial and drift detection. The package aims to cover both online and offline detectors for tabular data, text, images and time series. The outlier detection methods should allow the user to identify global, contextual and collective outliers.


SUOD (Scalable Unsupervised Outlier Detection) is an acceleration framework for large-scale unsupervised outlier detector training and prediction. Notably, anomaly detection is often formulated as an unsupervised problem since the ground truth is expensive to acquire. To compensate for the unstable nature of unsupervised algorithms, practitioners often build a large number of models for further combination and analysis, e.g., taking the average or majority vote. However, this poses scalability challenges in high-dimensional, large datasets, especially for proximity-based models operating in Euclidean space.

SUOD is therefore proposed to address the challenge at three complementary levels: random projection (data level), pseudo-supervised approximation (model level), and balanced parallel scheduling (system level). As mentioned, the key focus is to accelerate the training and prediction when a large number of anomaly detectors are present, while preserving the prediction capacity. Since its inception in Jan 2019, SUOD has been successfully used in various academic research projects and industry applications, including PyOD [2] and IQVIA medical claim analysis. It could be especially useful for outlier ensembles that rely on a large number of base estimators.
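A hedged sketch of how SUOD is typically used through PyOD's wrapper (pyod.models.suod.SUOD); the exact constructor arguments can differ between versions, and the base detectors and data chosen here are arbitrary examples.

```python
import numpy as np
from pyod.models.suod import SUOD
from pyod.models.lof import LOF
from pyod.models.iforest import IForest
from pyod.models.copod import COPOD

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(500, 20)),
               rng.uniform(-6, 6, size=(25, 20))])

# A pool of unsupervised base detectors whose training/prediction SUOD accelerates.
detectors = [LOF(n_neighbors=15), LOF(n_neighbors=35), IForest(), COPOD()]
clf = SUOD(base_estimators=detectors, combination='average', n_jobs=2, verbose=False)
clf.fit(X)
labels = clf.labels_           # 0 = inlier, 1 = outlier on the training data
scores = clf.decision_scores_  # aggregated outlier scores from the ensemble
```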

- the basic idea is that for an anomaly (in the example) only 4 partitions are needed, whereas for a regular point in the middle of a distribution you need many more.

- Isolating observations: in the training stage, iTrees are constructed by recursively partitioning the given training set until instances are isolated or a specific tree height is reached, which results in a partial model.

LOF computes a score (called the local outlier factor) reflecting the degree of abnormality of the observations.

It looks like there are two such methods. The second one: the algorithm obtains a spherical boundary, in feature space, around the data. The volume of this hypersphere is minimized, to minimize the effect of incorporating outliers in the solution.

For deciding how many clusters to use: the knee/elbow method.

Embedding-based silhouette community detection, using the SuperHeat package: clustering a w2v cosine-similarity matrix and measuring quality with the silhouette score.
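A small sketch of both ideas (assumed synthetic blobs): sweep k, look for the knee in the k-means inertia curve, and/or pick the k where the silhouette score peaks.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)    # closer to 1 = well-separated, compact clusters
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={sil:.3f}")
# Choose k at the knee of the inertia curve and/or where the silhouette score peaks.
```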

  • Using IQR for AD and why IQR difference is 2.7 sigma
  • Medium
  • kdnuggets
  • Z-score and other moving averages
  • A survey
  • A great tutorial
  • single python package
  • Mastery on classifying rare events using LSTM-autoencoder
  • comparison
  • covariance.EllipticEnvelope
  • svm.OneClassSVM
  • ensemble.IsolationForest
  • neighbors.LocalOutlierFactor
  • Using Autoencoders
  • up/down trend, dynamic range, tips and dips
  • Api here
  • LSTM for anomaly prediction
  • Medium on AD
  • Medium on AD using mahalanobis, AE and
  • Pyod
  • Anomaly detection resources
  • Novelty and outlier detection in sklearn
  • SUOD
  • [2]
  • IQVIA
  • Skyline
  • Scikit-lego
  • The best resource to explain isolation forest
  • Isolation Forest
  • the paper is pretty good too -
  • LOF
  • A nice article about OCSVM, with GitHub code; two methods are described
  • Resources for OCSVM
  • two such methods
  • Google search for convenience
  • TFIDF, PCA, SILHOUETTE
  • Embedding-based silhouette community detection
  • A notebook
  • Topic modelling clustering (can't access this document on GitHub)
  • Alibi Detect