Asked by: Astrid Hofmann  Asked: 10/7/2023  Updated: 10/7/2023  Views: 15
Numerical imprecision in pytorch-lightning's Gradient Accumulation?
Q:
According to my understanding of gradient accumulation (e.g. this post and this post), training with batch_size = x should give the same result as training with accumulate_gradient = h and batch_size = x/h.
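In plain PyTorch terms, the equivalence I have in mind looks roughly like this (a minimal toy sketch with a made-up linear model, assuming a mean-reduced loss; it is not the Lightning code further down):

import torch

# toy model and data, just to illustrate the idea
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)  # full batch of size x = 8
h = 2                                        # accumulation steps

opt.zero_grad()
for micro_x, micro_y in zip(x.chunk(h), y.chunk(h)):  # micro-batches of size x/h
    loss = torch.nn.functional.mse_loss(model(micro_x), micro_y) / h
    loss.backward()                                    # gradients add up across micro-batches
opt.step()  # one optimizer step, as if the whole batch of size x had been used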
Therefore, these three examples should compute the same thing in pytorch-lightning:
import pytorch_lightning as pl
pl.seed_everything(42)

batch_size = 8
trainer = pl.Trainer(max_epochs=1)

batch_size = 8
trainer = pl.Trainer(max_epochs=1, accumulate_grad_batches=1)

batch_size = 4
trainer = pl.Trainer(max_epochs=1, accumulate_grad_batches=2)
However, they do not. After a few hundred batches, the model weights differ slightly (~1e-5) across all three cases.
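The kind of effect I have in mind (a purely toy illustration, not my actual setup): even when accumulation is mathematically equivalent, the float32 reductions happen in a different order, and that alone can leave a tiny gap between the two gradients:

import torch

torch.manual_seed(0)
model = torch.nn.Linear(100, 1)
x, y = torch.randn(64, 100), torch.randn(64, 1)

# gradient from the full batch in one backward pass
model.zero_grad()
torch.nn.functional.mse_loss(model(x), y).backward()
g_full = model.weight.grad.clone()

# the same gradient accumulated over 2 micro-batches
model.zero_grad()
for xs, ys in zip(x.chunk(2), y.chunk(2)):
    (torch.nn.functional.mse_loss(model(xs), ys) / 2).backward()
g_acc = model.weight.grad.clone()

# mathematically identical, but float32 rounding usually leaves a tiny gap
print((g_full - g_acc).abs().max().item())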
Below is a fully reproducible example:
[Python 3.11.5, pytorch-lightning 2.0.9]
import numpy as np
import pandas
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# seed everything
pl.seed_everything(42)

# happens with this dataset and also my original working dataset, MasakhaNER
training_data = datasets.CIFAR10(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)
test_data = datasets.CIFAR10(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

# happens with this CNN, but also with an AutoEncoder and a huggingface BERT model
class Network(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(3, 32, 3, padding=1)
        self.pool = torch.nn.MaxPool2d(2, 2)
        self.fc1 = torch.nn.Linear(32 * 16 * 16, 128)
        self.fc2 = torch.nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = x.view(-1, 32 * 16 * 16)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

    def training_step(self, batch, batch_idx):
        inputs, labels = batch
        outputs = self(inputs)
        loss = torch.nn.CrossEntropyLoss()(outputs, labels)
        self.writeWeights(inputs, batch_idx)
        return loss

    def configure_optimizers(self):
        # also happens with SGD
        return torch.optim.Adam(self.parameters(), lr=0.001)

    # the weights after the last batch should be the same across all versions, but are not!
    def writeWeights(self, x, batch_idx):
        if ACC:
            if accumulate_gradient == 1:
                path = "weight_tracker_a1.csv"
            else:
                path = "weight_tracker_a.csv"
        else:
            path = "weight_tracker.csv"
        np.set_printoptions(formatter={'float': '{: e}'.format})
        df = pandas.read_csv(path)
        params = list(self.parameters())[0]
        params = params.detach().cpu().numpy()
        df = df._append({"shape": x.shape, "batch_idx": batch_idx, "weights_1": params}, ignore_index=True)
        df.to_csv(path, float_format='{:e}'.format, index=False)
# Set the 3 different test options here:
ACC = True
# Set accumulate_gradient to 1 or 2 if ACC = True for logging
accumulate_gradient = 2

if ACC:
    if accumulate_gradient == 1:
        batch_size = 16
        limit_train_batches = 100
        df = pandas.DataFrame(columns=["shape", "batch_idx", "weights_1"])
        df.to_csv("weight_tracker_a1.csv", float_format='{:e}'.format, index=False)
    else:
        batch_size = 8
        limit_train_batches = 200
        df = pandas.DataFrame(columns=["shape", "batch_idx", "weights_1"])
        df.to_csv("weight_tracker_a.csv", float_format='{:e}'.format, index=False)
else:
    batch_size = 16
    limit_train_batches = 100
    df = pandas.DataFrame(columns=["shape", "batch_idx", "weights_1"])
    df.to_csv("weight_tracker.csv", float_format='{:e}'.format, index=False)

# set accumulate_grad_batches here manually for training!
# also happens without limit_train_batches
trainer = pl.Trainer(limit_train_batches=limit_train_batches, max_epochs=1, accumulate_grad_batches=2)

# init stuff
train_dataloader = DataLoader(training_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=True)
model = Network()

# train
trainer.fit(model=model, train_dataloaders=train_dataloader)
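For reference, instead of going through the CSV logs, the final weights of two trained models could also be compared directly; max_weight_diff below is a hypothetical helper, not part of the script above:

import torch

def max_weight_diff(model_a, model_b):
    # largest absolute difference of the first parameter tensor (the conv weights)
    a = list(model_a.parameters())[0].detach().cpu()
    b = list(model_b.parameters())[0].detach().cpu()
    return (a - b).abs().max().item()

# e.g. after fitting two Network() instances with different accumulation settings:
# print(max_weight_diff(model_v1, model_v2))   # ~1e-5 in my runs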
I have tried switching the learning rate, optimizer, dataset, and model; the difference between the 3 versions persists. I have also tried it on a CPU and on 2 different GPUs, and the difference remains.
I would like to know where exactly this difference comes from and whether it can be eliminated. Any ideas?
A: No answers yet