How can an (AI) model learn such complex and abstract tasks as distinguishing an American tourist from a Persian cat?
This is done by training the model on training samples (e.g. example images annotated/labelled with a class, e.g. “American tourist”). Initially, the model contains arbitrary parameters and is thus useless. However, by feeding it the training samples and comparing the model’s output with the actually correct output (i.e. the annotation), the parameters can be successively optimized until the mathematical model has “learnt” the task.
The cool thing: you do not need to tune the parameters manually; the model does it automatically during training (more than a nice-to-have if you use a model with several billion parameters).
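To make this more concrete, here is a minimal training-loop sketch in Python using PyTorch, with a tiny toy model and randomly generated toy data (none of it taken from a real project): the optimizer adjusts the parameters automatically, we never set them by hand.

```python
import torch
import torch.nn as nn

# toy data: 100 random "images" (flattened to vectors) with random 0/1 labels
images = torch.randn(100, 784)
labels = torch.randint(0, 2, (100,))

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 2))  # tiny classifier
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):                      # a few passes over the data
    predictions = model(images)             # forward pass: what does the model currently say?
    loss = loss_fn(predictions, labels)     # how far off is it from the annotations?
    optimizer.zero_grad()
    loss.backward()                         # compute how each parameter should change
    optimizer.step()                        # nudge the parameters to reduce the loss
```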
The model is typically trained on a large dataset and thereby automatically learns features from the data in each layer. As a result, the model learns to generalize on the task given in the dataset. The term “generalization” means that once the model is trained, it will be able to perform the task on unseen input data. For example, the trained model can classify an unseen image (i.e. one which was not present in the training dataset). This makes deep learning very different from e.g. rule-based or knowledge-based algorithms, which cannot handle any input data that does not correspond to their internal rules or knowledge. Think of a state-of-the-art machine translator (e.g. DeepL). Thanks to generalization, the translator is able to translate almost any unseen text, e.g. from German into English. Previous rule-based approaches failed dramatically at this task (mostly you were not even able to understand the output text).
The figure above depicts so-called “supervised learning”. This means that the training samples comprise labels, i.e. the true result the model is expected to produce (these labels may also be referred to as “ground truth”). For example, a training sample may be an image and the label may be a class, for example “cat”.
What do training labels look like?
The training labels usually have the same format as the model output. For example, if the model is a binary classifier outputting two possible classes (e.g. dog and cat), the training labels will contain these two classes. If the model outputs a translated English text (e.g. the translation of a German input text), the training labels will also contain an English translation. Likewise, the input samples of the training set usually represent the same type of information and have the same format as the input to the trained model (i.e. in our examples above, images or German texts).
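As a purely hypothetical illustration (the file names and classes below are made up), labelled training samples for the two examples could be represented like this:

```python
# binary classifier: each sample is an input image plus its class label
classification_samples = [
    ("images/cat_001.jpg", "cat"),
    ("images/dog_001.jpg", "dog"),
    ("images/cat_002.jpg", "cat"),
]

# translator: each sample is a German input text plus its English label
translation_samples = [
    ("Die Katze sitzt auf der Matte.", "The cat sits on the mat."),
]
```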
How can a neural network, i.e. the mathematical model, actually handle text?
The words are typically transformed into numbers in vector form. Then, the mathematical model can process these vectors, i.e. perform tons of calculations, and output one or more numbers or vectors. Depending on the task (e.g. in the case of a translator), these numbers are then transformed back into words.
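A minimal sketch of this transformation, using a toy vocabulary and random embedding vectors (real models learn these vectors during training), could look like this:

```python
import numpy as np

# toy vocabulary: every word gets an index ...
vocab = {"the": 0, "cat": 1, "sits": 2, "on": 3, "mat": 4}

# ... and every index gets a vector the model can calculate with
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))   # 8-dimensional vectors, randomly chosen here

sentence = "the cat sits on the mat"
vectors = np.array([embeddings[vocab[word]] for word in sentence.split()])
print(vectors.shape)   # (6, 8): six words, each represented by an 8-dimensional vector
```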
What is an error function / loss function?
The actual training, i.e. the network optimization, is done using an error/loss function. In simple terms, the loss function is a method of evaluating how well your algorithm is modelling your dataset. The loss function serves as the basis for model training in an optimization method, i.e. the parameters are adjusted in a direction that minimizes the loss and improves predictive accuracy.
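As a simple illustration, here is a sketch of one common loss function, the mean squared error; the numbers are made up purely for demonstration:

```python
import numpy as np

def mse_loss(predictions, targets):
    """Mean squared error: the average squared distance between model output and label."""
    return np.mean((predictions - targets) ** 2)

# the closer the predictions are to the labels, the smaller the loss
labels      = np.array([1.0, 0.0, 1.0])
good_output = np.array([0.9, 0.1, 0.8])
bad_output  = np.array([0.2, 0.9, 0.3])
print(mse_loss(good_output, labels))  # 0.02
print(mse_loss(bad_output, labels))   # ~0.65

# an optimizer (e.g. gradient descent) adjusts the parameters to make this number shrink
```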
How large must a training dataset be?
We mentioned above that a “large dataset” is necessary for training. How large? 100, 10,000 or 1 million training samples? Well, it depends on the task. The more complex the task (so-called “dimensionality”), the more tunable parameters are needed (i.e. a larger model is required), and the larger the dataset must be in order to sufficiently optimize the model parameters. Unfortunately, the required data volume increases exponentially as the complexity of the task increases. This is known as the curse of dimensionality. So why is it possible at all to successfully train deep learning models? Because the data space we are actually interested in is very small compared to the mathematically possible space. Think of the following example:
Let’s take an image of white noise, i.e. a random selection of white and black pixels.
If you try all possible combinations of black and white pixels, you will, at some point in the almost infinite future, also arrive at a meaningful image, let’s say of the Empire State Building or the Mona Lisa:
However, since the total number of possible combinations of black and white pixel distributions is probably higher than the number of sand grains in the universe, the current universe will probably collapse before you arrive at this particular distribution. Hence, compared to the mathematically possible data space, our data space of interest (meaningful images) is very small. In other words, the probability distribution over images that occur in real life is highly concentrated. By the way, the same is true for other data types, like text strings or sounds. Due to this high concentration of the probability distribution of the training data, the model will also learn in this particular space, which is the only relevant one. Or in simple terms: statistically, real-life images are all very similar to each other. Only humans feel that the Empire State Building looks different from the Mona Lisa.
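A quick back-of-the-envelope calculation illustrates how absurdly large this mathematically possible space is, even for a tiny 100 x 100 pixel black-and-white image:

```python
# each pixel is either black or white, so there are 2 ** (number of pixels) possible images
pixels = 100 * 100
combinations = 2 ** pixels
print(len(str(combinations)))   # a number with about 3,011 digits

# for comparison: the number of atoms in the observable universe is estimated
# at "only" around 10 ** 80, i.e. a number with 81 digits
```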
Hence, deep learning models need less training data than would be necessary in view of the complexity of the tasks to be learnt. The sizes can still be impressive: ImageNet-21k contains 14,197,122 images divided into 21,841 classes. That requires tons of crowdsourced work.
Fortunately, there are techniques to significantly reduce the amount of required training data. For example, models can be “pre-trained” on a large dataset, e.g. ImageNet-21k, and then specialized for a specific task, e.g. the recognition of different cat species. This technique is called transfer learning or fine-tuning.
Is there a difference between transfer learning and fine-tuning? Yes!
Fine-tuning means that the parameters of the pretrained model are slightly adapted during further training, i.e. they are fine-tuned. In the case of a large model with, let’s say, 1 billion parameters, this can still be computationally quite expensive. In contrast, transfer learning means that the pre-trained model (i.e. its parameters) is not changed, i.e. the parameters are “frozen”. Additional layers, e.g. a new classifier layer, are put on top of the pretrained model and trained on the specific task. Hence, only the parameters of the additional layers need to be trained. Besides possibly reduced computational costs, transfer learning can be advantageous in case the new task is different from the original training task of the pretrained model. The pretrained model is merely used as a feature extractor in this scenario. Besides computer vision applications (e.g. image recognition), popular applications comprise NLP (natural language processing): typically, a language model (e.g. BERT) is used to extract information from an input text (i.e. to “understand” the language), and e.g. an added classifier layer is trained to classify the text into different text types (let’s say patent literature and non-patent literature).
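As a rough sketch of the difference (assuming PyTorch/torchvision; the exact weight-loading API varies between library versions, and the head with 5 cat species is just an example):

```python
import torch.nn as nn
from torchvision import models

# transfer learning: freeze a pretrained backbone and train only a new head
backbone = models.resnet18(weights="IMAGENET1K_V1")   # pretrained on ImageNet
for param in backbone.parameters():
    param.requires_grad = False                       # "freeze" the pretrained parameters

# replace the original classifier layer with a new one for, say, 5 cat species;
# only this new layer will be trained
backbone.fc = nn.Linear(backbone.fc.in_features, 5)

# fine-tuning would instead leave requires_grad = True (often with a small learning
# rate), so that all parameters are slightly adapted to the new task
```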
In suitable cases only a few hundred training samples are needed for fine-tuning or transfer learning.
So, we now know what the core of an AI model (a neural network) looks like and how it can learn from data. But there is much more to know than can be covered in this brief overview. There are still many AI buzzwords you would like to understand, and you would like to dive deeper into AI in general? Have a look at AI Basics part 4, where I explain some more concepts and give some book tips.
author: Christoph Hewel
email: hewel@paustian.de
(photo: Can AI models learn by experience, like kids do?)