Layer Normalization

This post uses a simple numerical example to illustrate what LayerNorm does. In the formula below (PyTorch's definition), \(\mathrm{E}[x]\) and \(\mathrm{Var}[x]\) (the biased variance) are computed over the dimensions given by normalized_shape, and \(\gamma\) and \(\beta\) are learnable elementwise parameters.

\[ y = \frac{x - \mathrm{E}[x]}{ \sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta \]
Python
import pytorch_lightning as pl
import torch
import torch.nn as nn

pl.seed_everything(0)
Text Only
Seed set to 0

0
Python
input_tensor = torch.randint(0, 10, (2, 3, 4)).float()
input_tensor
Text Only
tensor([[[4., 9., 3., 0.],
         [3., 9., 7., 3.],
         [7., 3., 1., 6.]],

        [[6., 9., 8., 6.],
         [6., 8., 4., 3.],
         [6., 9., 1., 4.]]])

Examining the effect of LayerNorm

normalized_shape set to the last dimension

Below we use normalized_shape=4 as an example to verify what LayerNorm computes.

With normalized_shape=4, LayerNorm standardizes the 4 elements along the last dimension.

For the input tensor above, this means each row such as [4., 9., 3., 0.] is normalized on its own.
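As a quick hand check (using the biased variance, which is what LayerNorm uses), the arithmetic for this row is:

\[ \mu = \frac{4 + 9 + 3 + 0}{4} = 4, \qquad \sigma^2 = \frac{0^2 + 5^2 + (-1)^2 + (-4)^2}{4} = 10.5 \]

\[ y = \frac{[4, 9, 3, 0] - 4}{\sqrt{10.5 + 10^{-5}}} \approx [0.0000,\ 1.5430,\ -0.3086,\ -1.2344] \]

which matches the first row of the output below.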

Python
layer_norm = nn.LayerNorm(4)
Python
output_tensor = layer_norm(input_tensor)
output_tensor
Text Only
tensor([[[ 0.0000,  1.5430, -0.3086, -1.2344],
         [-0.9622,  1.3471,  0.5773, -0.9622],
         [ 1.1531, -0.5241, -1.3628,  0.7338]],

        [[-0.9622,  1.3471,  0.5773, -0.9622],
         [ 0.3906,  1.4321, -0.6509, -1.1717],
         [ 0.3430,  1.3720, -1.3720, -0.3430]]],
       grad_fn=<NativeLayerNormBackward0>)

Verifying the result by hand

Python
input_tensor.mean(dim=2, keepdim=True)
Text Only
tensor([[[4.0000],
         [5.5000],
         [4.2500]],

        [[7.2500],
         [5.2500],
         [5.0000]]])
Python
input_tensor.std(dim=2, unbiased=False, keepdim=True)
Text Only
tensor([[[3.2404],
         [2.5981],
         [2.3848]],

        [[1.2990],
         [1.9203],
         [2.9155]]])
Python
# Note: LayerNorm adds eps inside the square root, i.e. sqrt(Var[x] + eps);
# adding it to the std instead is a close approximation at eps = 1e-5.
(input_tensor - input_tensor.mean(dim=2, keepdim=True)) / (
    input_tensor.std(dim=2, unbiased=False, keepdim=True) + 1e-5
)
Text Only
tensor([[[ 0.0000,  1.5430, -0.3086, -1.2344],
         [-0.9622,  1.3471,  0.5773, -0.9622],
         [ 1.1531, -0.5241, -1.3628,  0.7338]],

        [[-0.9622,  1.3471,  0.5773, -0.9622],
         [ 0.3906,  1.4321, -0.6509, -1.1717],
         [ 0.3430,  1.3720, -1.3720, -0.3430]]])
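To confirm that the two results agree numerically, we can compare them with torch.allclose. Strictly speaking, LayerNorm adds \(\epsilon\) inside the square root (\(\sqrt{\mathrm{Var}[x] + \epsilon}\)), while the manual version above adds it to the standard deviation; at \(\epsilon = 10^{-5}\) the discrepancy is far below the tolerance used here.

Python
manual = (input_tensor - input_tensor.mean(dim=2, keepdim=True)) / (
    input_tensor.std(dim=2, unbiased=False, keepdim=True) + 1e-5
)
# The tolerance absorbs the slightly different placement of eps.
torch.allclose(output_tensor, manual, atol=1e-4)  # expected: True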

normalized_shape set to the last two dimensions

Below we use normalized_shape=[3, 4] as an example to verify what LayerNorm computes.

With normalized_shape=[3, 4], LayerNorm standardizes the 3×4 block of elements spanning the last two dimensions together.

For the input tensor above, this means the block

Text Only
[[4., 9., 3., 0.],
 [3., 9., 7., 3.],
 [7., 3., 1., 6.]]

is normalized as a whole.
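A hand check for this first 3×4 block (again with the biased variance):

\[ \mu = \frac{4+9+3+0+3+9+7+3+7+3+1+6}{12} = \frac{55}{12} \approx 4.5833, \qquad \sigma = \sqrt{\tfrac{1}{12}\textstyle\sum_i (x_i - \mu)^2} \approx 2.8419 \]

which matches the per-sample mean and standard deviation computed below.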

Python
layer_norm = nn.LayerNorm([3, 4])
Python
output_tensor = layer_norm(input_tensor)
output_tensor
Text Only
tensor([[[-0.2053,  1.5541, -0.5571, -1.6128],
         [-0.5571,  1.5541,  0.8504, -0.5571],
         [ 0.8504, -0.5571, -1.2609,  0.4985]],

        [[ 0.0702,  1.3335,  0.9124,  0.0702],
         [ 0.0702,  0.9124, -0.7720, -1.1932],
         [ 0.0702,  1.3335, -2.0354, -0.7720]]],
       grad_fn=<NativeLayerNormBackward0>)

Verifying the result by hand

Python
input_tensor.mean(dim=(1, 2), keepdim=True)
Text Only
tensor([[[4.5833]],

        [[5.8333]]])
Python
input_tensor.std(dim=(1, 2), unbiased=False, keepdim=True)
Text Only
tensor([[[2.8419]],

        [[2.3746]]])
Python
(input_tensor - input_tensor.mean(dim=(1, 2), keepdim=True)) / (
    input_tensor.std(dim=(1, 2), unbiased=False, keepdim=True) + 1e-5
)
Text Only
tensor([[[-0.2053,  1.5541, -0.5571, -1.6128],
         [-0.5571,  1.5541,  0.8504, -0.5571],
         [ 0.8504, -0.5571, -1.2609,  0.4985]],

        [[ 0.0702,  1.3335,  0.9124,  0.0702],
         [ 0.0702,  0.9124, -0.7720, -1.1932],
         [ 0.0702,  1.3335, -2.0354, -0.7720]]])
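The same result can also be obtained with the functional API, torch.nn.functional.layer_norm, which skips the affine transform when weight and bias are omitted. A minimal sketch:

Python
import torch.nn.functional as F

# Functional form of LayerNorm over the last two dimensions;
# weight and bias default to None, so no gamma/beta is applied.
F.layer_norm(input_tensor, [3, 4])  # expected to match output_tensor above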

\(\gamma\) and \(\beta\)

In LayerNorm, \(\gamma\) and \(\beta\) are learnable parameters that scale and shift the normalized values.

\(\gamma\) is initialized to ones and \(\beta\) to zeros, and their shapes match normalized_shape.

Python
layer_norm = nn.LayerNorm(4)
print(layer_norm.weight)
print(layer_norm.bias)
Text Only
Parameter containing:
tensor([1., 1., 1., 1.], requires_grad=True)
Parameter containing:
tensor([0., 0., 0., 0.], requires_grad=True)
Python
layer_norm = nn.LayerNorm([3, 4])
print(layer_norm.weight)
print(layer_norm.bias)
Text Only
Parameter containing:
tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]], requires_grad=True)
Parameter containing:
tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]], requires_grad=True)
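If the affine transform is not wanted, nn.LayerNorm can be constructed with elementwise_affine=False, in which case weight and bias are simply None. A short sketch:

Python
layer_norm = nn.LayerNorm(4, elementwise_affine=False)
print(layer_norm.weight)  # None: no learnable scale
print(layer_norm.bias)    # None: no learnable shift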
