Linear Separator Algorithms

SEQUENTIAL MINIMAL OPTIMIZATION (SMO)

Sequential Minimal Optimization (SMO): training a support vector machine requires solving a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of the smallest possible QP problems, which are solved analytically, avoiding a time-consuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while the standard chunking SVM algorithm scales somewhere between linear and cubic. SMO's computation time is dominated by SVM evaluation, hence SMO is fastest for linear SVMs and sparse data sets; on real-world sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm.

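In practice you rarely implement SMO yourself; as a concrete anchor, here is a minimal sketch assuming scikit-learn, whose SVC class wraps libsvm's SMO-based solver. The synthetic dataset and parameter values are illustrative only.

```python
# Minimal sketch: scikit-learn's SVC wraps libsvm, whose solver is SMO-based.
# tol and cache_size are the knobs closest to the SMO inner loop described above;
# the values and the synthetic dataset below are illustrative.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = SVC(
    kernel="linear",   # SMO is fastest for linear SVMs and sparse data sets
    C=1.0,
    tol=1e-3,          # stopping tolerance of the optimization
    cache_size=200,    # kernel cache size in MB
)
clf.fit(X, y)
print(clf.n_support_)  # number of support vectors per class
```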

SUPPORT VECTOR MACHINES (SVM)


  • An optimal two-class classifier.

  • Extended for regression and clustering (one-class) problems.

  • Kernel-based:

    • maps feature vectors into a higher-dimensional space using a kernel function;

    • builds an optimal linear discriminating function, or an optimal hyper-plane, in this space that fits the training data.

  • In the case of SVM, the mapping to the higher-dimensional space is not computed explicitly.

  • Instead, a distance (the kernel) needs to be defined between any two points in that space.

  • The solution is optimal: the margin between the separating hyper-plane and the nearest feature vectors is maximal.

  • The feature vectors that are closest to the hyper-plane are called support vectors; the position of the other vectors does not affect the hyper-plane (the decision function). See the sketch after this list.

  • The model produced by support vector classification therefore depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin.
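A small sketch of the "only the support vectors matter" point above, assuming scikit-learn and a made-up separable dataset: refitting on the support vectors alone should reproduce (numerically) the same hyper-plane.

```python
# Sketch: the fitted hyper-plane depends only on the support vectors.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

full = SVC(kernel="linear", C=1.0).fit(X, y)
sv_idx = full.support_                                   # indices of the support vectors
reduced = SVC(kernel="linear", C=1.0).fit(X[sv_idx], y[sv_idx])

print("full   :", full.coef_, full.intercept_)
print("reduced:", reduced.coef_, reduced.intercept_)     # matches up to solver tolerance
```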

Math of SVM on YouTube:

    • Linear - maximize the margin; only a few close points really matter for the optimal solution, the rest are zeroed out by their alphas (an alpha says "pay attention to this variable") in the quadratic programming formulation. x_i^T x_j acts as a similarity function (how much x_i and x_j point in the same direction), and y_i y_j relates their output labels; therefore the far-away points are not needed, but the similarity term is important.

    • Non-linear - e.g., a circle inside a circle needs a mapping to a higher-dimensional space, and the similarity idea behind x^T y is what carries over. We choose the higher-dimensional map so that the final function behaves like a known function, such as (a + b)^2. It turns out that (q1^2, q2^2, root(2)*q1*q2) is engineered with that root(2) factor precisely so that the dot product of the mapped vectors equals (x^T y)^2. We can then substitute (x^T y)^2 for x^T x in the quadratic equation, and this is the kernel trick that maps the inner class to one side and the outer circle class to the other and passes a plane between them (see the numeric check after this list).

    • Similarity is defined, intuitively, by how the points in one class relate to the points in the other.

    • A general kernel K = (x^T y + c)^p is a polynomial kernel that can express the above function and others.

    • The quadratic equation works with any of these possible kernels, including the polynomial one.

  • Most importantly, the kernel function is where our domain knowledge goes: we should choose a kernel that fits our feature data.

  • The output of K is a single number (a scalar).

  • Infinite dimensions are possible as well.

  • Mercer's condition - the kernel must act like a distance/similarity measure; that is the "rule" a kernel needs to follow.
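A quick numpy check of the root(2) trick described above; the two input points are made up.

```python
# With phi(q) = (q1^2, q2^2, sqrt(2)*q1*q2), the plain dot product of the mapped
# vectors equals (x . y)^2, i.e. the degree-2 polynomial kernel without any explicit mapping.
import numpy as np

def phi(q):
    return np.array([q[0] ** 2, q[1] ** 2, np.sqrt(2) * q[0] * q[1]])

x = np.array([1.0, 3.0])   # made-up 2-d points
y = np.array([2.0, 5.0])

print(phi(x) @ phi(y))     # 289.0, via the explicit mapping
print((x @ y) ** 2)        # 289.0, via the kernel trick
```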

Regularization and influence

Basically a punishment for overfitting: points of the non-linearly-separable class can be raised above or lowered below the margin, and the regularization term controls how much this is penalized.
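One way to see that punishment in action is a minimal sketch assuming scikit-learn and a noisy toy dataset: a small C tolerates margin violations (many support vectors), a large C punishes them harder (fewer support vectors, with a higher risk of overfitting).

```python
# Sketch: effect of the regularization parameter C on a noisy problem.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=C, gamma="scale").fit(X, y)
    print(f"C={C:>6}: {clf.support_vectors_.shape[0]} support vectors")
```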

SUPPORT VECTOR REGRESSION (SVR)

  • The method of SVM can be extended to solve regression problems.

  • Similar to SVM, the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction (i.e., within the ε-insensitive margin).
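A minimal sketch of that behaviour, assuming scikit-learn's SVR on made-up 1-d data: widening the ε-insensitive tube leaves fewer support vectors, because more training points fall close enough to the prediction to be ignored.

```python
# Sketch: SVR ignores training points that lie within epsilon of the prediction.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

for eps in (0.01, 0.1, 0.5):
    model = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(model.support_)} support vectors")
```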

LibSVM vs LibLinear

libsvm works by using kernel transforms to turn a non-linear problem into a linear problem beforehand. liblinear appears to be very much the same thing without those kernel transforms, so in cases where the kernel transforms are not needed (e.g., document classification) it will be faster.

  • libsvm (SMO) implementation:

    • kernel SVM: O(n^2)

    • linear SVM: O(n^3)

  • liblinear - optimized for linear classification without kernels:

    • complexity O(n)

    • does not support kernel SVMs

    • scores higher

n is the number of samples in the training dataset.
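A rough sketch of the comparison, assuming scikit-learn (SVC wraps libsvm, LinearSVC wraps liblinear); the dataset size and the resulting timings are illustrative only.

```python
# Sketch: the same linear classification via the libsvm wrapper (SVC) and the
# liblinear wrapper (LinearSVC). Without kernels, liblinear is typically much faster.
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=3000, n_features=200, random_state=0)

for name, clf in [("libsvm / SVC(kernel='linear')", SVC(kernel="linear")),
                  ("liblinear / LinearSVC", LinearSVC(max_iter=5000))]:
    t0 = time.time()
    clf.fit(X, y)
    print(f"{name}: {time.time() - t0:.2f}s, train accuracy={clf.score(X, y):.3f}")
```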

Support vector clustering (SVC)

KERNELS

  • Kernels allow us to do certain calculations faster that would otherwise involve computations in a higher-dimensional space.

  • K(x, y) = <f(x), f(y)>, where K is the kernel function, x and y are n-dimensional inputs, f is a map from n-dimensional to m-dimensional space, and <x, y> denotes the dot product; usually m is much larger than n.

  • Normally, calculating <f(x), f(y)> requires us to compute f(x) and f(y) first and then take their dot product. These two steps can be quite expensive, as they involve manipulations in the m-dimensional space, where m can be a large number.

  • The result is only a scalar, i.e., a value in 1-dimensional space.

  • We don't need to do that expensive calculation if we use a clever kernel.

Example:

Simple Example: x = (x1, x2, x3); y = (y1, y2, y3). Then for the function f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the kernel is K(x, y ) = (<x, y>)^2.

Let's plug in some numbers to make this more intuitive: suppose x = (1, 2, 3); y = (4, 5, 6). Then:

f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9) and f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)

<f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024, i.e., 1*16 + 2*20 + 3*24 + ...

A lot of algebra, mainly because f is a mapping from 3-dimensional to 9-dimensional space.

With a kernel it's faster:

K(x, y) = (4 + 10 + 18)^2 = 32^2 = 1024

A kernel is a magical shortcut to calculate even infinite dimensions!
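The worked example above, checked with a few lines of numpy:

```python
# Numeric check of the example above: explicit 9-d mapping vs. the kernel shortcut.
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

def f(v):
    # f(v) = (v1*v1, v1*v2, v1*v3, v2*v1, ..., v3*v3)
    return np.outer(v, v).ravel()

print(f(x) @ f(y))     # 1024, via the explicit 9-dimensional mapping
print((x @ y) ** 2)    # 1024, via K(x, y) = (<x, y>)^2
```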

  • The idea of SVM is that y = w * phi(x) + b, where w is the weight vector, phi is the feature map, and b is the bias.

  • If y > 0 we classify the datum as class 1, otherwise as class 0.

  • We want to find the weights and bias such that the margin is maximized.

  • It is often said that the kernel makes the data linearly separable for SVM; a more precise way to put it is that kernels do not make the data linearly separable.

  • The feature map phi(x) is what makes the data linearly separable. The kernel makes the calculation faster and easier, especially when the feature map phi is of very high dimension (for example, x_1, x_2, ..., x_D, x_1^2, x_2^2, ..., x_D^2, ...).

  • Why the kernel can also be understood as a measure of similarity: plugging the definition of the kernel above, <f(x), f(y)>, into the context of SVM and feature maps gives <phi(x), phi(y)>. The inner product is the projection of phi(x) onto phi(y), or colloquially, how much overlap x and y have in their feature space - in other words, how similar they are.
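To tie these ideas together, here is a hedged sketch, assuming scikit-learn: the fitted decision function f(x) = sum_i alpha_i * y_i * K(sv_i, x) + b is rebuilt by hand from the support vectors and dual coefficients, without ever computing phi explicitly.

```python
# Sketch: rebuild SVC's decision function from its support vectors and dual
# coefficients (which already contain alpha_i * y_i), using only kernel evaluations.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

K = rbf_kernel(X, clf.support_vectors_, gamma=0.5)       # K(x, sv_i) for every x
manual = K @ clf.dual_coef_.ravel() + clf.intercept_     # sum_i alpha_i*y_i*K(sv_i, x) + b
print(np.allclose(manual, clf.decision_function(X)))     # True
```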

Grid search for SVM hyper-parameters:

  • There are heuristic methods that skip some of the search options.

  • However, heuristics are often unnecessary: the computation time is small, the grid can be parallelized, and no parameter combinations are skipped.

  • Search complexity can be controlled with a two-tier grid: a coarse grid first, then a fine-grained grid around the best region (see the sketch after this list).

  • With non-linear kernels, two more choices appear:

    • the kernel itself;

    • the kernel parameters.

  • The RBF kernel is a reasonable first choice:

    • useful when the relation between class labels and attributes is non-linear;

    • with certain C and gamma it behaves like the linear kernel (only after finding C and gamma), and with certain parameters it behaves like the sigmoid kernel;

    • it has fewer hyper-parameters than the polynomial kernel;

    • 0 < K_ij <= 1, unlike the polynomial kernel, whose values can go to infinity as the degree grows;

    • the sigmoid kernel is not valid under some parameters;

    • DON'T use RBF when the number of features is very large - use the linear kernel instead.

  • Number of instances << number of features (e.g., 38 instances over 7,000 features): when the number of features is large, we may not need RBF over linear - they perform similarly (after finding C and gamma).

  • Number of instances and number of features are both very large (e.g., 20K samples x 20K features): libsvm and liblinear give similar performance, but liblinear is faster by about 150 times; the rule of thumb is to use it for document classification.

  • Number of instances >> number of features: usually a high-dimensional mapping with a non-linear kernel is used; if we insist on liblinear, the -s 2 option leads to faster training.
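A minimal sketch of the coarse stage of such a grid search, assuming scikit-learn; it uses the exponentially spaced grid quoted further down this page (C = 2^-5 ... 2^15, γ = 2^-15 ... 2^3) on an illustrative dataset, and a finer grid can then be centered on the best values.

```python
# Sketch: coarse grid search over C and gamma in log (base-2) space for an RBF SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

param_grid = {
    "C": 2.0 ** np.arange(-5, 16, 2),       # 2^-5, 2^-3, ..., 2^15
    "gamma": 2.0 ** np.arange(-15, 4, 2),   # 2^-15, 2^-13, ..., 2^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, n_jobs=-1)  # easy to parallelize
search.fit(X, y)
print(search.best_params_)   # center a finer grid around these values next
```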

Multi-class SVM strategies: one-against-all, one-against-one, and Directed Acyclic Graph SVM (one-against-one with a DAG); bottom line, LIBSVM uses one-against-one.
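A minimal sketch of the two common strategies, assuming scikit-learn: SVC follows libsvm and trains one-against-one internally, while a one-against-all scheme can be built explicitly; the 4-class dataset is made up.

```python
# Sketch: multi-class SVM strategies. libsvm (and therefore sklearn's SVC) uses
# one-against-one; OneVsRestClassifier builds one-against-all explicitly.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
print(ovo.decision_function(X[:1]).shape)   # (1, 6): one score per pair of classes

ova = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
print(ova.decision_function(X[:1]).shape)   # (1, 4): one score per class
```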

  • Udacity - expands on the quadratic equations that were introduced in the previous course above.

  • Controlling 'C' (the regularization parameter).

Conclusion: in practice libsvm becomes painfully slow at around 10k samples. Hence, for medium to large scale datasets use liblinear and forget about libsvm (or have a look at approximate kernel SVM solvers such as LaSVM, which save training time and memory usage for large scale datasets).


Kernel types in OpenCV:

  • SVM::LINEAR - linear kernel. No mapping is done; linear discrimination (or regression) is done in the original feature space. It is the fastest option.

  • SVM::RBF - radial basis function (RBF) kernel; a good choice in most cases.
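A minimal sketch of the same two kernel options through OpenCV's cv2.ml module (assuming the opencv-python bindings); the random training data is made up.

```python
# Sketch: an SVM in OpenCV with the RBF kernel (swap in cv2.ml.SVM_LINEAR for the
# fastest, no-mapping option). OpenCV expects float32 samples and int32 labels.
import cv2
import numpy as np

samples = np.random.rand(100, 2).astype(np.float32)
labels = (samples[:, 0] > samples[:, 1]).astype(np.int32).reshape(-1, 1)

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_RBF)
svm.setC(1.0)
svm.setGamma(0.5)
svm.train(samples, cv2.ml.ROW_SAMPLE, labels)

_, predictions = svm.predict(samples[:5])
print(predictions.ravel())
```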

Example grid in log space: C = 2^-5, 2^-3, ..., 2^15; γ = 2^-15, 2^-13, ..., 2^3.

  • Regularization parameter C - the penalty for misclassified points.

  • Gamma - low values mean far influence and high values mean near influence (how far the effect of a single training example reaches).
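A tiny numpy check of that intuition, using the RBF kernel definition given at the end of this page and two made-up points three units apart: with a low gamma the distant point still looks similar (far influence), with a high gamma it does not (near influence).

```python
# Sketch: gamma controls how far the influence of a single training example reaches.
import numpy as np

def rbf(x, y, gamma):
    return np.exp(-gamma * np.sum((x - y) ** 2))   # K(x, y) = exp(-gamma * ||x - y||^2)

x, y = np.array([0.0, 0.0]), np.array([3.0, 0.0])  # made-up points, 3 units apart
for gamma in (0.01, 0.1, 1.0, 10.0):
    print(f"gamma={gamma:>5}: K(x, y) = {rbf(x, y, gamma):.6f}")
```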

RBF kernel use cases:

  • Kdnuggets: when to use deep learning over SVM and other algorithms - deep learning is computationally expensive for a very small boost in accuracy.
  • Linear kernel: K(x_i, x_j) = x_i^T x_j

  • RBF kernel: K(x_i, x_j) = exp(-γ ||x_i - x_j||^2), γ > 0