Deep Learning Models
- using keras’ functional API
- regular, deep, sparse, regularized, cnn, variational
A keras.io post, but it explains AE quite nicely.
On PCA vs AE: basically what PCA does (maximizing variance and projecting) versus what AE does and can do to achieve similar, but non-linear, dense representations.
Summarized in the KPCA section of this notebook.
Sequence-to-sequence pre-training for natural language generation, translation, and comprehension.
Git
NEAT
NEAT implements the idea that it is most effective to start evolution with small, simple networks and allow them to become increasingly complex over generations.
That way, just as organisms in nature increased in complexity since the first cell, so do neural networks in NEAT.
This process of continual elaboration allows finding highly sophisticated and complex neural networks.
HYPER-NEAT
HyperNEAT is based on a theory of representation that hypothesizes that a good representation for an artificial neural network should be able to describe its pattern of connectivity compactly.
An RBFN performs classification by measuring the input’s similarity to examples from the training set.
Each RBFN neuron stores a “prototype”, which is just one of the examples from the training set.
When we want to classify a new input, each neuron computes the Euclidean distance between the input and its prototype.
Roughly speaking, if the input more closely resembles the class A prototypes than the class B prototypes, it is classified as class A.
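As a rough sketch of this idea (NumPy only; the prototypes, the `beta` width parameter and the toy points below are made-up illustrations, not from any specific RBFN library):

```python
import numpy as np

def rbf_activations(x, prototypes, beta=1.0):
    """Activation of each RBF neuron: exp(-beta * ||x - prototype||^2)."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    return np.exp(-beta * dists ** 2)

# Hypothetical toy data: two stored prototypes per class
protos_a = np.array([[0.0, 0.0], [0.2, 0.1]])
protos_b = np.array([[1.0, 1.0], [0.9, 1.1]])
x = np.array([0.1, 0.05])

# The input is assigned to the class whose prototypes it resembles more
score_a = rbf_activations(x, protos_a).sum()
score_b = rbf_activations(x, protos_b).sum()
print("class A" if score_a > score_b else "class B")
```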
Under the BNN framework, prediction uncertainty can be categorized into three types:
Model uncertainty captures our ignorance of the model parameters and can be reduced as more samples are collected.
Model misspecification captures the uncertainty that arises when the model structure, or the data seen at test time, does not match what was seen during training.
Inherent noise captures the uncertainty in the data generation process and is irreducible.
Note: in a series of articles, Uber explains time series forecasting and leads up to a BNN architecture.
A vanilla LSTM did not work properly, therefore an encoder-decoder architecture was used (described below).
Regarding point 1: 'run prediction with dropout 100 times'.
Is it applicable for time series? In the figure below he tried to predict the missing signal between each two dotted lines; A is a bad estimation, but with a dropout layer we can see that in most cases the signal is better predicted.
Going back to Uber, they are actually using this idea to predict time series with LSTMs, using an encoder-decoder framework.
Note: this is probably applicable in other types of networks.
```python
import keras

# Functional model in which dropout stays active at prediction time (training=True),
# so repeated forward passes give stochastic, MC-dropout-style outputs.
inputs = keras.Input(shape=(10,))
x = keras.layers.Dense(3)(inputs)
outputs = keras.layers.Dropout(0.5)(x, training=True)
model = keras.Model(inputs, outputs)
```
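A possible way to use this model for MC-dropout-style uncertainty, assuming the functional model above (the batch shape and the 100-pass count are illustrative):

```python
import numpy as np

# Run many stochastic forward passes; dropout stays on because training=True was hard-coded.
x = np.random.rand(32, 10).astype("float32")   # hypothetical input batch
preds = np.stack([model.predict(x, verbose=0) for _ in range(100)])
mean_pred = preds.mean(axis=0)    # point estimate
uncertainty = preds.std(axis=0)   # spread across passes ~ prediction confidence
```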
The convolution layer's primary purpose is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data.
ReLU (more in the activation chapter) - The purpose of ReLU is to introduce non-linearity in our ConvNet
Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types: Max, Average, Sum etc.
Dense / Fully Connected - a traditional Multi Layer Perceptron that uses a softmax activation function in the output layer to classify. The output from the convolutional and pooling layers represent high-level features of the input image. The purpose of the Fully Connected layer is to use these features for classifying the input image into various classes based on the training dataset.
The overall training process of the Convolutional Network may be summarized as below:
Step 1: We initialize all filters and parameters / weights with random values.
Step 2: The network takes a single training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.
Let's say the output probabilities for the boat image above are [0.2, 0.4, 0.1, 0.3]
Since weights are randomly assigned for the first training example, output probabilities are also random.
Step 3: Calculate the total error at the output layer (summation over all 4 classes).
(L2) Total Error = ∑ ½ (target probability – output probability) ²
Step 4: Use Backpropagation to calculate the gradients of the error with respect to all weights in the network and use gradient descent to update all filter values / weights and parameter values to minimize the output error.
The weights are adjusted in proportion to their contribution to the total error.
When the same image is input again, output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is closer to the target vector [0, 0, 1, 0].
This means that the network has learnt to classify this particular image correctly by adjusting its weights / filters such that the output error is reduced.
Parameters like the number of filters, filter sizes, and the architecture of the network have all been fixed before Step 1 and do not change during the training process – only the values of the filter matrices and connection weights get updated.
Step 5: Repeat steps 2–4 with all images in the training set.
The above steps train the ConvNet – this essentially means that all the weights and parameters of the ConvNet have now been optimized to correctly classify images from the training set.
When a new (unseen) image is input into the ConvNet, the network would go through the forward propagation step and output a probability for each class (for a new image, the output probabilities are calculated using the weights which have been optimized to correctly classify all the previous training examples). If our training set is large enough, the network will (hopefully) generalize well to new images and classify them into correct categories.
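For illustration, a minimal Keras sketch of the pipeline described above (conv → ReLU → pool → fully connected softmax); the input shape, layer sizes and optimizer are assumptions, and the 4 output classes mirror the boat example:

```python
import keras

model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),
    keras.layers.Conv2D(32, (3, 3), activation="relu"),   # convolution + ReLU
    keras.layers.MaxPooling2D((2, 2)),                    # spatial pooling
    keras.layers.Conv2D(64, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),            # fully connected
    keras.layers.Dense(4, activation="softmax"),          # 4 classes as in the example
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=10)  # steps 2-5: forward pass, error, backprop, repeat
```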
Oversampling
Undersampling
Thresholding probabilities (ROC?)
Cost-sensitive classification - different costs for misclassification.
One class - novelty detection. This is a concept-learning technique that recognizes positive instances rather than discriminating between two classes.
The results indicate (loosely) that oversampling is usually better in most cases and doesn't cause overfitting in CNNs. A sketch of the favoured option follows below.
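A toy sketch of random oversampling, the option these results favour (labels and counts are made up):

```python
import numpy as np

y = np.array([0] * 900 + [1] * 100)          # 9:1 class imbalance
minority_idx = np.where(y == 1)[0]
extra = np.random.choice(minority_idx, size=800, replace=True)  # resample minority with replacement
resampled_idx = np.concatenate([np.arange(len(y)), extra])      # now roughly 900 per class
# x_resampled, y_resampled = x[resampled_idx], y[resampled_idx]
```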
CONV-1D
1x1 CNN
“This is the most common application of this type of filter and in this way, the layer is often called a feature map pooling layer.”
“In the paper, the authors propose the need for an MLP convolutional layer and the need for cross-channel pooling to promote learning across channels.”
“the 1×1 filter was used explicitly for dimensionality reduction and for increasing the dimensionality of feature maps after pooling in the design of the inception module, used in the GoogLeNet model”
“The 1×1 filter was used as a projection technique to match the number of filters of input to the output of residual modules in the design of the residual network “
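A minimal sketch of the dimensionality-reduction use of a 1×1 filter (shapes are assumptions):

```python
import keras

# A 1x1 convolution projecting 256 feature maps down to 64 channels,
# i.e. pooling across channels while keeping the spatial dimensions.
inputs = keras.Input(shape=(28, 28, 256))
x = keras.layers.Conv2D(64, kernel_size=1, activation="relu")(inputs)  # -> (28, 28, 64)
model = keras.Model(inputs, x)
```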
MASK R-CNN
Invariance in CNN
MAX AVERAGE POOLING
A max-pool layer compresses by taking the maximum activation in a block. If you have a block with mostly small activations but a small patch of large activation, you will lose the information on the low activations. I think of this as saying "this type of feature was detected in this general area".
A mean-pool layer compresses by taking the mean activation in a block. If large activations are balanced by negative activations, the overall compressed activation will look like no activation at all. On the other hand, you retain some information about low activations from the previous example.
Max pooling, in other words, roughly means that only those features that most strongly trigger outputs are used in the subsequent layers. You can look at it a little like focusing the network's attention on what's most characteristic of the image at hand (toy comparison below).
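A toy comparison, assuming Keras pooling layers and a made-up 4×4 feature map:

```python
import numpy as np
import keras

# One strong positive activation and one negative activation: max pooling keeps the
# strong spike, while average pooling dilutes it and lets +/- values cancel out.
fmap = np.zeros((1, 4, 4, 1), dtype="float32")
fmap[0, 0, 3, 0] = 9.0
fmap[0, 2, 2, 0] = -2.0

max_out = keras.layers.MaxPooling2D(pool_size=2)(fmap)      # keeps the 9.0
avg_out = keras.layers.AveragePooling2D(pool_size=2)(fmap)  # 9.0 becomes 2.25, -2.0 becomes -0.5
```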
Dilated CNN
To add: Keras book, chapter 5 (I think).
Classifier: The pre-trained model is used directly to classify new images.
Standalone Feature Extractor: The pre-trained model, or some portion of the model, is used to pre-process images and extract relevant features.
Integrated Feature Extractor: The pre-trained model, or some portion of the model, is integrated into a new model, but layers of the pre-trained model are frozen during training.
Weight Initialization: The pre-trained model, or some portion of the model, is integrated into a new model, and the layers of the pre-trained model are trained in concert with the new model. (See the sketch after this list.)
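A sketch of the "Integrated Feature Extractor" pattern using Keras' VGG16 application (image size, head sizes and the 10 classes are assumptions); flipping `base.trainable` to True turns it into the weight-initialization pattern:

```python
import keras

# Pre-trained convolutional base, frozen, with a new classifier head trained on top.
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained layers

inputs = keras.Input(shape=(224, 224, 3))
x = base(inputs)
x = keras.layers.GlobalAveragePooling2D()(x)
x = keras.layers.Dense(256, activation="relu")(x)
outputs = keras.layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```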
A basic NN node with a loop: the previous output is merged with the current input (using tanh?) for the purpose of remembering history; for time series, to predict the next X based on the previous Y.
N to 1 = classification
N to N = predict frames in a movie
N/2 with time delay to N/2 = predict supply and demand
Vanishing gradient is 100 times worse.
Gated networks like LSTM solve the vanishing gradient problem.
Experimental improvements:
Masking for RNNs - the idea is simple: we want to use variable-length inputs. Although RNNs can handle them, Keras requires a fixed-size input, so a mask of 1's and 0's helps the network understand the real length, i.e., where the information actually is in the input. Motivation: padded inputs would otherwise contribute to the loss, and we don't want that. A sketch follows below.
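A minimal sketch, assuming zero-padding and a Keras `Masking` layer (shapes are made up):

```python
import keras

# Sequences are zero-padded to 50 timesteps of 8 features; the Masking layer tells
# downstream layers which timesteps are padding so they are skipped.
inputs = keras.Input(shape=(50, 8))
x = keras.layers.Masking(mask_value=0.0)(inputs)  # all-zero timesteps are ignored
x = keras.layers.LSTM(32)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
```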
return_sequences returns the hidden state output for each input time step.
return_state returns the hidden state output and cell state for the last input time step.
return_sequences and return_state can be used at the same time.
TimeDistributed Layer - used to connect 3D outputs from LSTMs to Dense layers, in order to utilize the time element. Otherwise the output gets flattened when the connection is direct, nulling the LSTM's purpose. Note: a nice trick that avoids blowing up the dense layer's size across time steps - it loops over each time step! I.e., TimeDistributed achieves this trick by applying the same Dense layer (same weights) to the LSTM's outputs, one time step at a time. In this way, the output layer only needs one connection to each LSTM unit (plus one bias). See the sketch below.
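A sketch of the three behaviours above (shapes and unit counts are assumptions):

```python
import keras

inputs = keras.Input(shape=(10, 4))

# return_sequences=True -> one hidden state per timestep, shape (batch, 10, 16)
seq = keras.layers.LSTM(16, return_sequences=True)(inputs)

# return_state=True -> last output plus the final hidden and cell states
last_out, state_h, state_c = keras.layers.LSTM(16, return_state=True)(inputs)

# TimeDistributed applies the same Dense weights to every one of the 10 timesteps
per_step = keras.layers.TimeDistributed(keras.layers.Dense(1))(seq)
model = keras.Model(inputs, [per_step, last_out])
```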
For this reason, the number of training epochs needs to be increased to account for the smaller network capacity. I doubled it from 500 to 1000 to match the first one-to-one example.
Sequence Learning Problem
One-to-One LSTM for Sequence Prediction
Many-to-One LSTM for Sequence Prediction (without TimeDistributed)
Many-to-Many LSTM for Sequence Prediction (with TimeDistributed)
Stateful vs Stateless: crucial for understanding how to leverage LSTM networks:
Machine Learning mastery:
1. Scale to [-1, 1], because the internal activation of the LSTM cell is tanh.
return_sequences is needed for stacked LSTM layers.
This is a nice helper add-on in Keras. In most other Keras examples you have seen, the training and test sets were passed into the fit method after you manually made the split. Having a validation set is significant and is a vital step for understanding how well your model is training. Ideally you want your training accuracy curve to be close to your validation curve; the moment your validation curve falls below your training curve, alarm bells should go off - your model is probably busy over-fitting.
Keras is a wonderful framework for deep learning, and there are many different ways of doing things with plenty of helpers.
This tutorial clearly shows how to manipulate input construction, lstm output neurons and the target layer for the purpose of those three problems (1:1, 1:m, m:m).
BIDIRECTIONAL LSTM
(what is?) Wiki - The basic idea of BRNNs is to connect two hidden layers of opposite directions to the same output. By this structure, the output layer can get information from past and future states.
BRNN are especially useful when the context of the input is needed. For example, in handwriting recognition, the performance can be enhanced by knowledge of the letters located before and after the current letter.
It allows you to specify the merge mode, that is, how the forward and backward outputs should be combined before being passed on to the next layer. The options are (see the sketch after this list):
‘sum‘: The outputs are added together.
‘mul‘: The outputs are multiplied together.
‘concat‘: The outputs are concatenated together (the default), providing double the number of outputs to the next layer.
‘ave‘: The average of the outputs is taken.
The default mode is to concatenate, and this is the method often used in studies of bidirectional LSTMs.
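A sketch of the Keras `Bidirectional` wrapper with an explicit merge mode (shapes are assumptions):

```python
import keras

inputs = keras.Input(shape=(20, 8))
# merge_mode controls how forward and backward outputs are combined:
# 'concat' (default, doubles the width), 'sum', 'mul' or 'ave'
x = keras.layers.Bidirectional(keras.layers.LSTM(32, return_sequences=True),
                               merge_mode="concat")(inputs)
outputs = keras.layers.TimeDistributed(keras.layers.Dense(1))(x)
model = keras.Model(inputs, outputs)
```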
The update gate helps the model determine how much of the past information (from previous time steps) needs to be passed along to the future.
The reset gate, essentially, is used by the model to decide how much of the past information to forget (equations below).
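For reference, one common formulation of the two GRU gates (bias terms omitted; the exact convention varies slightly between papers):

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{update gate}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{reset gate}\\
\tilde{h}_t &= \tanh\!\big(W x_t + U (r_t \odot h_{t-1})\big) && \text{candidate state}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{new hidden state}
\end{aligned}
```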
RECURRENT WEIGHTED AVERAGE (RNN-WA)
What is? A type of cell that converges to higher accuracy faster than LSTM;
it implements attention inside the recurrent neural network:
Recent interest in graph embedding methods has focused on learning a single representation for each node in the graph. But can nodes really be best described by a single vector representation? In this work, we propose a method for learning multiple representations of the nodes in a graph (e.g., the users of a social network). Based on a principled decomposition of the ego-network, each representation encodes the role of the node in a different local community in which the nodes participate. These representations allow for improved reconstruction of the nuanced relationships that occur in the graph, a phenomenon that we illustrate through state-of-the-art results on link prediction tasks on a variety of graphs, reducing the error by up to 90%. In addition, we show that these embeddings allow for effective visual analysis of the learned community structure.
Nodevectors
Analyse signal variability and correlation
MULTI NETWORKS
Unread -
- improves VAE
Optimus
Intuition towards each node and what it represents visually, i.e., each face resembles one of K clusters.
Explains inference by averaging, and the cons of the method.
NEAT stands for NeuroEvolution of Augmenting Topologies. It is a method for evolving artificial neural networks with a genetic algorithm.
HyperNEAT computes the connectivity of its neural networks as a function of their geometry.
The encoding in HyperNEAT, called compositional pattern-producing networks, is designed to represent patterns with regularities such as symmetry, repetition, and repetition with variation.
(WIKI) Compositional pattern-producing networks (CPPNs) are a variation of artificial neural networks (ANNs) that have an architecture whose evolution is guided by genetic algorithms.
The approach is more intuitive than the MLP.
- (What is?) According to Uber - an architecture that more accurately forecasts time series predictions and uncertainty estimations at scale: "how Uber has successfully applied this model to large-scale time series anomaly detection, enabling [us to] better accommodate rider demand during high-traffic intervals."
- Training on multi-signal raw data; the training X and Y are window-based and the window size (lag) is determined in advance.
The blog post explains, for example, that with a CNN trained on apples, oranges, cats and dogs, an unrelated example such as a frog image may lead the network to decide it's an apple; therefore we can't rely on the softmax probability as a confidence measure. The 'run prediction with dropout 100 times' approach should give us a confidence measure, because it draws each weight from a Bernoulli distribution.
"By applying dropout to all the weight layers in a neural network, we are essentially drawing each weight from a Bernoulli distribution. In practice, this means that we can sample from the distribution by running several forward passes through the network. This is referred to as Monte Carlo dropout."
Taken from Yarin Gal's blog post. In this figure we see how sporadic the signal from a single forward pass is (black line), compared to a much cleaner signal from 100 dropout passes.
He talks about uncertainty in neural networks and using BNNs. He may have proved this thesis, but I did not read it. This blog post links to his full PhD.
Old note: in order to trust your network's classification, you drop some of the neurons during prediction, do this ~100 times and average the results. Intuitively this will give you confidence in your classification and increase your classification accuracy, because only a random part of your network participated in each classification, 100 times. Please note that softmax doesn't give you certainty.
The comment says to add training=True for every dropout layer and to add another dropout at the end of the model. Thanks Sam.
- We systematically investigate the impact of class imbalance on the classification performance of convolutional neural networks (CNNs) and compare frequently used methods to address the issue.
Using several imbalance scenarios, on several known data sets such as MNIST.
On 1x1 CNNs, for dimensionality reduction, decreasing feature maps and other usages.
- "Small shifts -- even by a single pixel -- can drastically change the output of a deep network (bars on left). We identify the cause: aliasing during downsampling. We anti-alias modern deep networks with classic signal processing, stabilizing output classifications (bars on right). We even observe accuracy increases (see plot below)."
In the last few years, experts have turned to global average pooling (GAP) layers to minimize overfitting by reducing the total number of parameters in the model. Similar to max pooling layers, GAP layers are used to reduce the spatial dimensions of a three-dimensional tensor. However, GAP layers perform a more extreme type of dimensionality reduction. A sketch follows below.
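A minimal sketch of GAP replacing the usual Flatten + Dense block (shapes and the 10 classes are assumptions):

```python
import keras

# GAP collapses each (H, W) feature map to a single number, so a (7, 7, 512)
# tensor becomes a 512-vector feeding the classifier directly.
inputs = keras.Input(shape=(7, 7, 512))
x = keras.layers.GlobalAveragePooling2D()(inputs)          # -> (batch, 512)
outputs = keras.layers.Dense(10, activation="softmax")(x)  # no large Flatten + Dense block
model = keras.Model(inputs, outputs)
```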
- The trick behind them: combining both f(x) and x.
In CNNs, features can be identified without relations to each other in an image, i.e., changing the location of body parts will not affect the classification, while changing the orientation of the image will. The promise of capsule nets is that these two issues are solved.
There are more parts to the series.
On transfer learning using CNNs.
(What is an RNN?) by Andrej Karpathy - basically a lot of information about RNNs and their use cases: 1 to N = frame captioning.
(how to initialize?) - don't worry about initialization, use normalization and GRU for big networks.
- "Simplified RNN, with pytorch implementation" - changing the underlying mechanism in RNNs for the purpose of parallelizing calculation; seems to work nicely in terms of speed, not sure about state-of-the-art results. The author claims he already mentioned these ideas (QRNN) a year before; however, it seems his ideas have also been reviewed as (PixelRNN). It's probably best to read all 3 papers in chronological order and use the most optimal solution.
Enables you to build complex RNNs with Keras. Details on their significance are inside the link.
Visual attention RNNs - same idea as masking, but on a window-based CNN.
LSTM - the first reference for LSTM on the web, but you should know the background before reading.
- You have to understand this concept before you dive in, i.e., the hidden state is the overall state of what we have seen so far, while the cell state is selective memory of the past. The hidden state (h) carries the information about what an RNN cell has seen over time and supplies it to the present time step, such that the loss function is not just dependent upon the data it is seeing at this time instant, but also on data it has seen historically.
- A comparison of many LSTM variants; they are pretty much the same performance-wise.
- A comparison of LSTM variants; vanilla is mostly the best, and the forget and output gates are the most important in terms of performance. Other conclusions are in the paper.
Mastery on
Mastery on - but makes sense for all types of networks
Mastery on r
Mastery on ,
Mastery on and seq2seq
Mastery on , as a whole model wrap, or on every layer in the model which is equivalent and preferred.
Mastery on for sequence prediction
Unread - sentiment classification of IMDB movies using
- (jakob) single point prediction, sequence prediction and shifted-sequence prediction with code.
**
On stateful vs stateless: intuition mostly with code, but not 100% clear.
Important notes:
2. stateful=True needs manual resets of internal states; False = stateless. Great info & results, with seeding, with training resets (and not) and prediction resets (and not) - note: empirically matching the shampoo input, network config, etc.
3. , and how to use each one and both at the same time.
4. - each layer represents a higher level of abstraction in TIME!
- A good explanation of the differences between input_shape and input_dim and what each is. Additionally, about layer calculation of inputs and outputs based on input shape, and the Sequential model vs the functional API model.
A comparison of LSTM/GRU/MGU with batch normalization and various initializations; GRU/Xavier/Batch are the best and recommended for RNNs.
It looks like LSTM and GRU are competitive with mutation (I believe it's only in pytorch); adding a bias to the LSTM works (a bias of 1, as recommended in the paper), but generally speaking there is no conclusive empirical evidence that says one type of network is better than the other for all tests, though the mutated networks tend to win over LSTM/GRU variants.
- unit_forget_bias: Boolean. If True, add 1 to the bias of the forget gate at initialization. Setting it to True will also force bias_initializer="zeros". This is recommended in Jozefowicz et al.
- The validation_split variable in Keras is a value between [0..1]. Keras proportionally splits your training set by the value of the variable. The first part is used for training and the second part for validation after each epoch.
Unclear.
- Using maxlen, it will either pad with zeros if the sequence is shorter, or truncate it if longer. A sketch follows below.
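A small illustration (the import path is `keras.utils.pad_sequences` in newer Keras versions):

```python
from keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11]]
padded = pad_sequences(seqs, maxlen=4)   # default: pad and truncate at the start ('pre')
# [[ 0  1  2  3]
#  [ 0  0  4  5]
#  [ 8  9 10 11]]
```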
Imbalanced classes? Use class_weights; another explanation about class_weights and sample_weights.
Sklearn's formula for balanced class weights, and why it works (see the sketch below).
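A sketch of that formula via scikit-learn (labels are made up):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights = n_samples / (n_classes * bincount(y))
y = np.array([0] * 900 + [1] * 100)
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
# -> [1000/(2*900), 1000/(2*100)] ≈ [0.56, 5.0]; pass as a dict to model.fit(class_weight=...)
class_weight = dict(zip(np.unique(y), weights))
```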
But with focus on LSTM one-to-one, one-to-many and many-to-many - here the TimeDistributed layer applies a Dense layer to each output step from the LSTM, which uses return_sequences=True for that purpose.
Explanation - it involves duplicating the first recurrent layer in the network so that there are now two layers side-by-side, then providing the input sequence as-is as input to the first layer and providing a reversed copy of the input sequence to the second.
, ,
- To solve the vanishing gradient problem of a standard RNN, GRU uses the so-called update gate and reset gate. Basically, these are two vectors which decide what information should be passed to the output. The special thing about them is that they can be trained to keep information from long ago, without washing it out through time, and to remove information which is irrelevant to the prediction.
1. The Keras implementation is available at
2. The whitepaper is at
(amazing) - really good insight into what they do (compressing data, vs adjacency graphs, vs graphs, high-dimensional relations, etc.)
(amazing)
Octavian on Medium on graphs: clever, mcgraph, regression, classification, embedding on graphs.
w2v, pytorch w2v, networkx, sparse matrices, matrix factorization, dictionary optimization; part 1 here.
, original:
Really good -
Michael Bronstein’s (worth reading)
Paper, examples - The graph attentional layer utilised throughout these networks is computationally efficient (does not require costly matrix operations, and is parallelizable across all nodes in the graph), allows for (implicitly) assigning different importances to different nodes within a neighborhood while dealing with different sized neighborhoods, and does not depend on knowing the entire graph structure upfront - thus addressing many of the theoretical issues with previous spectral-based approaches.
Medium on
struc2vec: Learning Node Representations from Structural Identity - the struc2vec algorithm learns continuous representations for nodes in any graph. struc2vec captures structural equivalence between nodes.
From ML to GNN.
- graphs, sets, groups, GNNs.
and medium on
"Is a Single Embedding Enough? Learning Node Representations that Capture Multiple Social Contexts"
Similar to DeepWalk with node skips - lots of improvements; works at scale due to lower-size representations, improves results, etc.
The fastest network node embeddings in the west.
- decomposing frequencies
Compression, detecting edges, detecting features with various orientations, analysing signal power, detecting and localizing transients and change points in time series data, and detecting the optimal signal representation (peaks etc.) via time-frequency analysis of images and data.
Can also be used to analyse images in space, frequency, and orientation, and to identify coherent time oscillations in time series.
(did not read) - can this be applied to other time series prediction?