To limit the size of the data sent to the next layer, it is often necessary to squeeze it a bit.
Example:
Assume you have 10,000 tomatoes delivered in a 100x100-slot grid-like container from your “tomato-processing machine #1”.
Further, assume you also have a “tomato-processing machine #2” that takes its input from the first machine. But machine #2 requires its input tomatoes to be supplied in a 50x50-slot grid, not a 100x100 grid.
Now you’ve got a problem: you’ll have to press 2x2 = 4 tomatoes into each slot:

<aside> 💡
Note: I have no intention of drawing 100x100 + 50x50 tomatoes, so the picture instead shows 4x4 → 2x2, but the idea is exactly the same.
</aside>
That’s going to be slightly messy, and some finer details of each tomato may get lost in the process, but all 4 tomatoes will contribute to the tomato-like mush in the corresponding slot of the 50x50 grid.
The mushiness is simply the “price” we have to pay for insisting on a machine that expects a 50x50 grid as input. And there may actually be benefits to the mushiness as well.
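In CNN terms, this squeezing is what a pooling layer does. Here is a minimal NumPy sketch of 2x2 average pooling on a 4x4 grid (matching the 4x4 → 2x2 picture); average pooling is just one choice of "mushing", max pooling being another common one:

```python
import numpy as np

# A 4x4 "grid of tomatoes" (the values stand in for tomatoes).
grid = np.arange(16, dtype=float).reshape(4, 4)

# 2x2 average pooling: reshape so each 2x2 block of slots becomes
# its own pair of axes, then average over those axes. Each block of
# 4 values is "mushed" into one slot of the smaller grid.
pooled = grid.reshape(2, 2, 2, 2).mean(axis=(1, 3))

print(grid.shape)    # (4, 4)
print(pooled.shape)  # (2, 2)
```

Every value in `grid` contributes to exactly one slot of `pooled`, which is the analogy's point: detail is lost, but nothing is thrown away entirely.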
This is an explanation of how to understand the exact tensor dimensions involved when data flows through two consecutive conv2d (convolutional) layers. Let’s use the MNIST dataset dimensions in this example: it consists of grayscale (i.e. 1-channel) images, 28x28 pixels each. For simplicity, here are some limitations of this example:
After applying this layer:
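The dimension arithmetic for a conv2d layer follows a standard formula. Here is a small sketch; the kernel size, stride, and padding values below are assumptions for illustration, not taken from the text:

```python
def conv2d_out_size(in_size, kernel, stride=1, padding=0):
    """Standard conv2d output-size formula, per spatial dimension:
    out = floor((in + 2*padding - kernel) / stride) + 1
    """
    return (in_size + 2 * padding - kernel) // stride + 1

# MNIST image: 28x28 pixels, 1 channel.
h = w = 28

# Assumed 3x3 kernel, no padding, stride 1 (illustrative values):
print(conv2d_out_size(h, kernel=3))             # 26 -> 26x26 feature map

# With padding=1 the spatial size is preserved:
print(conv2d_out_size(h, kernel=3, padding=1))  # 28 -> 28x28 feature map
```

The same formula applies independently to height and width, so chaining two conv2d layers just means applying it twice with each layer's own kernel/stride/padding.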