Consider a grayscale input image of size 28x28 pixels. For a single such image, the dimensions would then be 1x28x28, where the “1x” part refers to the single grayscale channel.
For a typical color image of size 28x28, the dimensions would be 3x28x28, as color images have three channels: red, green, and blue.
Inside the CNN, we can choose any number N of channels (N values per pixel), resulting in Nx28x28 values per image.
Think of this as having "images with more color channels than humans can detect." While humans have three types of color detectors (R,G,B), imagine an alien with a much richer vision system—say, 32 color channels. This would give us 32x28x28 values per "image."
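To make these shapes concrete, here is a minimal sketch (assuming PyTorch, which the text itself doesn't specify):

```python
import torch

# One grayscale image: 1 channel, 28x28 pixels.
gray = torch.zeros(1, 28, 28)

# One RGB image: 3 channels (red, green, blue), 28x28 pixels.
rgb = torch.zeros(3, 28, 28)

# One internal feature map: 32 "alien" channels, 28x28 pixels.
features = torch.zeros(32, 28, 28)

print(gray.shape, rgb.shape, features.shape)
# torch.Size([1, 28, 28]) torch.Size([3, 28, 28]) torch.Size([32, 28, 28])
```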
When using multiple convolution kernels, we end up with much more output data compared to the input to the layer. Stacking many such layers results in too many parameters in subsequent layers, since the number of channels produced by layer X directly determines the number of weights in every kernel of layer X+1.
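To see why, here is a rough parameter count under assumed layer sizes (3x3 kernels and 32/64/128 channels, none of which come from the text): each kernel of layer X+1 must span every channel that layer X emits.

```python
import torch.nn as nn

# Layer X: 32 input channels -> 64 output channels, 3x3 kernels.
layer_x = nn.Conv2d(32, 64, kernel_size=3, padding=1)

# Layer X+1 must accept all 64 channels produced by layer X,
# so each of its 128 kernels carries 64 * 3 * 3 weights.
layer_x1 = nn.Conv2d(64, 128, kernel_size=3, padding=1)

print(sum(p.numel() for p in layer_x.parameters()))   # 64*32*3*3 + 64   = 18,496
print(sum(p.numel() for p in layer_x1.parameters()))  # 128*64*3*3 + 128 = 73,856
```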
There are two common ways to reduce the dimensionality of the data after such an expansion:
The first is a 1x1 convolution. Imagine we have a 32x28x28 input feeding into a 1x1 convolution layer with 16 kernels. Using our alien vision analogy, this is like converting a rich 32-channel "alien color" image into a more focused 16-channel version. The network learns the combinations of the original 32 channels that are most valuable for the specific task, creating an efficient yet meaningful representation. Think of it like converting an RGB image (3 channels) to grayscale (1 channel), except the channel combination is learned rather than fixed.
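A minimal sketch of this 32-to-16 reduction (again assuming PyTorch; the leading batch dimension is added because `nn.Conv2d` expects one):

```python
import torch
import torch.nn as nn

# 1x1 convolution: each output channel is a learned weighted
# combination of the 32 input channels at the same pixel.
reduce = nn.Conv2d(in_channels=32, out_channels=16, kernel_size=1)

x = torch.randn(1, 32, 28, 28)   # batch of one 32-channel "alien" image
y = reduce(x)

print(y.shape)  # torch.Size([1, 16, 28, 28]) -- spatial size unchanged, channels halved
```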
The second is max pooling, which reduces dimensions by taking 2x2 pixel groups (or larger) and keeping only the highest value from each group. This discards some data, but that is acceptable: because the pooling happens inside the back-propagation training, the network comes under optimization pressure to push the important information into the values that survive the reduction.
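And a matching sketch for 2x2 max pooling:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)  # keep the max of each 2x2 block

x = torch.randn(1, 16, 28, 28)
y = pool(x)

print(y.shape)  # torch.Size([1, 16, 14, 14]) -- channels kept, width/height halved
```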
<aside> 💡
Max Pooling is also explained here:
Tensor dimensions, dual conv2d
</aside>
