Thursday, November 12, 2015

Why to use deep networks?

http://datascience.stackexchange.com/questions/1253/why-are-nlp-and-machine-learning-communities-interested-in-deep-learning/4888#4888


Why to use deep networks?


Let's first try to solve very simple classification task. Say, you moderate a web forum which is sometimes flooded with spam messages. These messages are easily identifiable - most often they contain specific words like "buy", "porn", etc. and a URL to outer resources. You want to create filter that will alert you about such suspecious messages. It turns to be pretty easy - you get list of features (e.g. list of suspicious words and presence of a URL) and train simple logistic regression (a.k.a. perceptron), i.e. model like:

g(w0 + w1*x1 + w2*x2 + ... + wnxn)

where x1..xn are your features (either presence of specific word or a URL), w0..wn - learned coefficients and g() is a logistic function to make result be between 0 and 1. It's very simple classifier, but for this simple task it may give very good results, creating linear decision boundary. Assuming you used only 2 features, this boundary may look something like this:

linear boundary

Here 2 axes represent features (e.g. number of occurrences of specific word in a message, normalized around zero), red points stay for spam and blue points - for normal messages, while black line shows separation line.

But soon you notice that some good messages contain a lot of occurrences of word "buy", but no URLs, or extended discussion of porn detection, not actually refferring to porn movies. Linear decision boundary simply cannot handle such situations. Instead you need something like this:

non-linear boundary

This new non-linear decision boundary is much more flexible, i.e. it can fit the data much closer. There are many ways to achieve this non-linearity - you can use polynomial features (e.g. x1^2) or their combination (e.g. x1*x2) or project them out to a higher dimension like in kernel methods. But in neural networks it's common to solve it by combining perceptrons or, in other words, by building multilayer perceptron. Non-linearity here comes from logistic function between layers. The more layers, the more sophisticated patterns may be covered by MLP. Single layer (perceptron) can handle simple spam detection, network with 2-3 layers can catch tricky combinations of features, and networks of 5-9 layers, used by large research labs and companies like Google, may model the whole language or detect cats on images.

This is essential reason to have deep architectures - they can model more sophisticated patterns.

Why deep networks are hard to train?


With only one feature and linear decision boundary it's in fact enough to have only 2 training examples - one positive and one negative. With several features and/or non-linear decision boundary you need several orders more examples to cover all possible cases (e.g. you need not only find examples with word1, word2 and word3, but also with all possible their combinations). And in real life you need to deal with hundreds and thousands of features (e.g. words in a language or pixels in an image) and at least several layers to have enough non-linearity. Size of a data set, needed to fully train such networks, easily exceeds 10^30 examples, making it totally impossible to get enough data. In other words, with many features and many layers our decision function becomes too flexible to be able to learn it precisely.

There are, however, ways to learn it approximately. For example, if we were working in probabilistic settings, then instead of learning frequencies of all combinations of all features we could assume that they are independent and learn only individual frequencies, reducing full and unconstrained Bayes classifier to a Naive Bayes and thus requiring much, much less data to learn.

In neural networks there were several attempts to (meaningfully) reduce complexity (flexibility) of decision function. For example, convolutional networks, extensively used in image classification, assume only local connections between nearby pixels and thus try only learn combinations of pixels inside small "windows" (say, 16x16 pixels = 256 input neurons) as opposed to full images (say, 100x100 pixels = 10000 input neurons). Other approaches include feature engineering, i.e. searching for specific, human-discovered descriptors of input data.

Manually discovered features are very promising actually. In natural language processing, for example, it's sometimes helpful to use special dictionaries (like those containing spam-specific words) or catch negation (e.g. "not good"). And in computer vision things like SURF descriptors or Haar-like features are almost irreplaceable.

But the problem with manual feature engineering is that it takes literally years to come up with good descriptors. Moreover, these features are often specific

Unsupervised pretraining


But it turns out that we can obtain good features automatically right from the data using such algorithms as autoencoders and restricted Boltzmann machines. I described them in detail in my other answer, but in short they allow to find repeated patterns in the input data and transform it into higher-level features. For example, given only row pixel values as an input, these algorithms may identify and pass higher whole edges, then from these edges construct figures and so on, until you get really high-level descriptors like variations in faces.

deep learning

After such (unsupervised) pretraining network is usually converted into MLP and used for normal supervised training. Note, that pretraining is done layer-wise. This significantly reduces solution space for learning algorithm (and thus number of training examples needed) as it only needs to learn parameters inside each layer without taking into account other layers.

And beyond...


Unsupervised pretraining have been here for some time now, but recently other algorithms were found to improve learning both - together with pretraining and without it. One notable example of such algorithms is dropout - simple technique, that randomly "drops out" some neurons during training, creatig some distortion and preventing networks of following data too closely. This is still a hot research topic, so I leave this to a reader. 

No comments:

Post a Comment