
Numerical imprecision in pytorch-lightning's Gradient Accumulation?

Asked by Astrid Hofmann · Asked 10/7/2023 · Updated 10/7/2023 · Viewed 15 times

Q:

From my understanding of gradient accumulation (e.g. this post and this post), training with batch_size = x should be equivalent to training with accumulate_gradient = h and batch_size = x/h.

Therefore, these three examples should compute the same thing in pytorch-lightning:

import pytorch_lightning as pl

pl.seed_everything(42)

batch_size = 8
trainer = pl.Trainer(max_epochs=1)

batch_size = 8
trainer = pl.Trainer(max_epochs=1, accumulate_grad_batches=1)

batch_size = 4
trainer = pl.Trainer(max_epochs=1, accumulate_grad_batches=2)

However, they don't. After a couple of hundred batches, the model's weights differ slightly (~1e-5) across the three cases.
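
To illustrate the kind of effect I suspect (plain float32 rounding from a different summation order), here is a toy example outside Lightning. It assumes the accumulated micro-batch losses are each scaled by 1/accumulate_grad_batches, which I believe Lightning's automatic optimization does; the model and numbers are my own made-up example:

import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
x = torch.randn(8, 10)
y = torch.randn(8, 1)

# gradient of one full batch of 8
model.zero_grad()
loss_fn(model(x), y).backward()
g_full = model.weight.grad.clone()

# gradient accumulated over two half-batches of 4, each loss scaled by 1/2
# so the accumulated gradient corresponds to the same batch mean
model.zero_grad()
(loss_fn(model(x[:4]), y[:4]) / 2).backward()
(loss_fn(model(x[4:]), y[4:]) / 2).backward()
g_acc = model.weight.grad.clone()

print(torch.allclose(g_full, g_acc))   # typically True: equal within tolerance
print((g_full - g_acc).abs().max())    # but usually not exactly 0.0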


Here is a fully reproducible example:

[Python 3.11.5, pytorch-lightning 2.0.9]

import numpy as np
import pandas
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

#seed everything
pl.seed_everything(42)

# happens with this dataset and also my original working dataset, MasakhaNER
training_data = datasets.CIFAR10(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.CIFAR10(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

# happens with this CNN, but also with an AutoEncoder and a huggingface BERT model
class Network(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(3, 32, 3, padding=1)
        self.pool = torch.nn.MaxPool2d(2, 2)
        self.fc1 = torch.nn.Linear(32 * 16 * 16, 128)
        self.fc2 = torch.nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = x.view(-1, 32 * 16 * 16)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        outputs = self(inputs)
        loss = torch.nn.CrossEntropyLoss()(outputs, labels)

        self.writeWeights(inputs, batch_idx)

        return loss

    def configure_optimizers(self):
        # also happens with SGD
        return torch.optim.Adam(self.parameters(), lr=0.001)

  
    # the weight after the last batch should be the same across all versions, but is not!
    def writeWeights(self, x, batch_idx):
        if ACC:
            if (accumulate_gradient == 1):
                path = "weight_tracker_a1.csv"
            else:
                path = "weight_tracker_a.csv"
        else:
            path = "weight_tracker.csv"

        np.set_printoptions(formatter={'float': '{: e}'.format})
        df = pandas.read_csv(path)
        params = list(self.parameters())[0]
        params = params.detach().cpu().numpy()
        df = df._append({"shape": x.shape, "batch_idx": batch_idx, "weights_1": params}, ignore_index=True)
        df.to_csv(path, float_format='{:e}'.format, index=False)



# Set the 3 different test options here:
ACC = True
# Set accumulate_gradient to 1 or 2 if ACC = True for logging
accumulate_gradient = 2

if ACC:
    if accumulate_gradient == 1:
        batch_size = 16
        limit_train_batches = 100
        df = pandas.DataFrame(columns=["shape", "batch_idx", "weights_1"])
        df.to_csv("weight_tracker_a1.csv", float_format='{:e}'.format, index=False)
    else:
        batch_size = 8
        limit_train_batches = 200
        df = pandas.DataFrame(columns=["shape", "batch_idx", "weights_1"])
        df.to_csv("weight_tracker_a.csv", float_format='{:e}'.format, index=False)
else:
    batch_size = 16
    limit_train_batches = 100
    df = pandas.DataFrame(columns=["shape", "batch_idx", "weights_1"])
    df.to_csv("weight_tracker.csv", float_format='{:e}'.format, index=False)

# set accumulate_grad_batches here manually for training!
# also happens without limit_train_batches
trainer = pl.Trainer(limit_train_batches=limit_train_batches, max_epochs=1, accumulate_grad_batches=2)


# init stuff
train_dataloader = DataLoader(training_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

model = Network()

#train
trainer.fit(model=model, train_dataloaders=train_dataloader)


I tried switching the learning rate, the optimizer, the dataset, and the model; the difference between the 3 versions persists. I also tried it on CPU and on 2 different GPUs, and the difference is still there.
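
To put a number on the gap, instead of diffing the CSV dumps the final weights of two runs can also be saved and compared directly; a minimal sketch (the file names are placeholders for two configurations):

import torch

# at the end of each training run (placeholder file name per configuration):
# torch.save(model.conv1.weight.detach().cpu(), "conv1_run_a.pt")

w_a = torch.load("conv1_run_a.pt")
w_b = torch.load("conv1_run_b.pt")
print((w_a - w_b).abs().max())     # largest absolute difference, ~1e-5 here
print(torch.allclose(w_a, w_b))    # comparison under the default tolerances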

I would like to know where exactly this difference comes from, and whether it can be eliminated. Any ideas?
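
A related question: would stricter determinism settings even be expected to matter here, or do they only remove run-to-run nondeterminism within one configuration? Just a sketch of what I mean (standard Lightning/PyTorch switches):

import torch
import pytorch_lightning as pl

pl.seed_everything(42, workers=True)                       # also seeds DataLoader workers
torch.use_deterministic_algorithms(True, warn_only=True)   # warn instead of error on non-deterministic ops

trainer = pl.Trainer(max_epochs=1, deterministic=True)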

floating-point precision pytorch-lightning



A: No answers yet