Lab's weekly topic - week 50 2023 - Detailed explanation for encoder of U-Net
In U-Net, each encoder block comprises two operations: 2D convolution and max pooling. 2D convolution is an image-processing technique that generates a feature map, while max pooling reduces the spatial size of that feature map. The 2D convolution equation is as follows:
Here, 𝑲 is the kernel (or filter) that is passed over the image 𝑰, and m and n index the rows and columns of the resulting feature map, respectively. Figure 1 shows the operations in the encoder block, including 2D convolution, max pooling, and the related techniques.
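One common way to write it, given here as a sketch rather than a definitive form, uses the convention of most deep learning libraries, in which the kernel is not flipped (strictly speaking, a cross-correlation). The summation indices $i$ and $j$, which run over the rows and columns of the kernel, are an assumption introduced for this illustration:
\begin{equation} (\mathbf{I} * \mathbf{K})(m, n) = \sum_{i} \sum_{j} \mathbf{I}(m + i, n + j) \, \mathbf{K}(i, j) \end{equation}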
Figure 1. Operations in the encoder block
In Figure 1, we surround the input with a border of zeros, called padding, so that the feature map keeps the size of the original input. The filter K, a 3 × 3 matrix, is multiplied element-wise with the patch of the input image 𝑰 that it covers, and the products are summed to give one value of the feature map. Sliding the filter over the input image repeats this at every position; the step size of each move is known as the stride. Finally, max pooling takes the maximum value within each patch of the feature map, which reduces the spatial size of the feature map.
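The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation behind Figure 1: the function names `conv2d` and `max_pool2d`, the 4 × 4 input, the all-ones kernel, and the 2 × 2 pooling window are all assumptions chosen to keep the numbers easy to check by hand.

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=1):
    """Slide `kernel` over a zero-padded `image` (kernel is not flipped)."""
    # Zero padding: a border of zeros keeps the output the same size as the input
    padded = np.pad(image, pad, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    out_h = (padded.shape[0] - kh) // stride + 1
    out_w = (padded.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for m in range(out_h):          # row index of the feature map
        for n in range(out_w):      # column index of the feature map
            patch = padded[m * stride:m * stride + kh, n * stride:n * stride + kw]
            # Element-wise multiplication followed by a sum
            out[m, n] = np.sum(patch * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Take the maximum of each non-overlapping size x size patch."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size   # drop rows/columns that do not fill a patch
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Hypothetical toy input: a 4 x 4 image and a 3 x 3 kernel of ones
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))

feature_map = conv2d(image, kernel)   # same 4 x 4 size thanks to padding
pooled = max_pool2d(feature_map)      # halved to 2 x 2
print(feature_map.shape, pooled.shape)
```

With a padding of 1 and a stride of 1, the 3 × 3 filter leaves the 4 × 4 size unchanged, and the 2 × 2 pooling then halves each side.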
Therefore, each encoder block has two outputs: the feature map and the max-pooling map. The max-pooling map $\mathbf{S}^l_{map}$ is the input of the next encoder block, and the feature map $\mathbf{FC}^l_{map}$ is passed to the decoder side through a skip connection. We describe the $l$-th encoder block as follows:
\begin{equation} \mathbf{FC}^l_{map} = \mathrm{Conv2D}(\mathbf{S}^{l-1}_{map}, \mathbf{Filters}^l) \end{equation}
\begin{equation} \mathbf{S}^l_{map} = \mathrm{MaxPooling}(\mathbf{FC}^l_{map}) \end{equation}
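The two equations can be read as a function that takes $\mathbf{S}^{l-1}_{map}$ and returns both outputs. The following NumPy sketch chains two such blocks under several assumptions that are not from the post: the names `encoder_block`, `conv2d`, and `max_pooling`, the random filters, the channel counts `[1, 4, 8]`, and the 8 × 8 input are all hypothetical, and activation functions are left out because the equations above do not include them.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d(s_prev, filters, pad=1):
    """Multi-channel convolution. s_prev: (C, H, W); filters: (F, C, kh, kw)."""
    # Zero-pad only the two spatial axes
    padded = np.pad(s_prev, ((0, 0), (pad, pad), (pad, pad)))
    # windows has shape (C, H, W, kh, kw): one kh x kw patch per output position
    windows = sliding_window_view(padded, filters.shape[2:], axis=(1, 2))
    # Multiply each patch by each filter and sum over channels and kernel entries
    return np.einsum("chwij,fcij->fhw", windows, filters)

def max_pooling(fc_map, size=2):
    """Non-overlapping max pooling; H and W must be divisible by `size`."""
    f, h, w = fc_map.shape
    return fc_map.reshape(f, h // size, size, w // size, size).max(axis=(2, 4))

def encoder_block(s_prev, filters):
    """Return (FC^l_map, S^l_map) for one encoder level."""
    fc_map = conv2d(s_prev, filters)   # FC^l_map = Conv2D(S^{l-1}_map, Filters^l)
    s_map = max_pooling(fc_map)        # S^l_map  = MaxPooling(FC^l_map)
    return fc_map, s_map

rng = np.random.default_rng(0)
s_map = rng.standard_normal((1, 8, 8))   # S^0_map: hypothetical 1-channel 8 x 8 input
channels = [1, 4, 8]                     # hypothetical channel counts per level
skips = []                               # feature maps kept for the skip connections

for l in range(1, len(channels)):
    filters = rng.standard_normal((channels[l], channels[l - 1], 3, 3))
    fc_map, s_map = encoder_block(s_map, filters)
    skips.append(fc_map)
    print(f"level {l}: FC {fc_map.shape}, S {s_map.shape}")
```

Each level keeps its feature map in `skips` for the skip connection and hands only the pooled map to the next level, so the spatial size halves from one block to the next.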