Asked by: Steven Sagona · Asked: 9/1/2023 · Last edited by: Steven Sagona · Updated: 9/10/2023 · Views: 139
Difficulty speeding up pytorch code: training a MLP using a complicated many-to-one nonlinear function
Q:

In summary:

My goal is to figure out whether a particular complicated nonlinear function can be used to replace the individual neurons in a neural network. Ideally, I'd like to show that I can train it on MNIST digit images. I've tried this with pytorch, but it is far too slow, mainly because I can't work out how to parallelize over the batch and over the neurons, and I'm looking for ideas or approaches that would speed the process up significantly.

A typical neuron in a neural network performs a dot product and then applies a nonlinear function to the output of that dot product, f(x dot w).

Instead of f(x dot w), I'm considering a many-to-one nonlinear function that is a more general nonlinear function of x and w, i.e. f(x, w). The nonlinear function f(x, w) takes a 1D array X and a 1D array W and returns a single output. I have numpy code that performs this calculation; it simulates a real physical system and requires a series of recursive integrals to compute. In a previous question I learned that I can translate my numpy code into a pytorch function, and pytorch should then be able to perform backpropagation through it automatically.
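As a toy illustration of that shape contract (toy_f below is a made-up placeholder, not my physics code): a pytorch function that takes two 1D tensors and returns a single scalar, which autograd can differentiate without any hand-written gradients:

import torch

def toy_f(x, w):
    # hypothetical stand-in for f(x, w): 1D x, 1D w -> single scalar output
    return torch.tanh(x * w).sum()

x = torch.randn(100)
w = torch.randn(100, requires_grad=True)
y = toy_f(x, w)      # single output value
y.backward()         # autograd supplies the gradient with respect to w
print(w.grad.shape)  # torch.Size([100])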
I now have pytorch code describing the nonlinear function f(x, w). I want to show that I can use it to learn the digit images, so I downsampled the MNIST digits to 10x10 pixel images and built an MLP-inspired network with 100 inputs, a hidden size of 100, and 10 outputs.

To explain this MLP-inspired network in more detail:

The first layer consists of 100 "neurons", where the typical neuron is replaced by my nonlinear function f(x, w). Each of the 100 "neurons" takes the input X and has its own distinct set of weights w. The outputs of these 100 neurons are then passed to the next layer, which is just 10 neurons whose outputs are used to identify each of the 10 digits.

Here is a code snippet of the network's forward pass:
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MLP, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        self.weights1 = nn.Parameter(torch.randn(input_size, hidden_size))  # weights of size (input_size, hidden_size)
        self.weights2 = nn.Parameter(torch.randn(hidden_size, num_classes))  # weights of size (hidden_size, num_classes)

    def forward(self, x):
        hidden_outputs = []
        for neuron_weights in self.weights1.T:  # loop over each neuron's weights in the first layer
            output_value = input_output_nonlinearity_torch(x.squeeze(), neuron_weights, C_total=C_total, readoutStrength=readoutStrength)
            output_value = F.relu(output_value)  # apply a relu activation function
            hidden_outputs.append(output_value)
        final_outputs = []
        for neuron_weights in self.weights2.T:  # loop over each neuron's weights in the second layer
            output_value = input_output_nonlinearity_torch(hidden_outputs, neuron_weights, C_total=C_total, readoutStrength=readoutStrength)
            final_outputs.append(output_value)
        final_outputs = [output.unsqueeze(0) for output in final_outputs]
        final_output = torch.stack(final_outputs, dim=0)
        final_output = final_output.t()
        return final_output
The problem is that each training iteration of this network takes about 20 minutes, and that is just a single feed-forward pass on a single digit. So I really need to figure out how to make it faster.

input_output_nonlinearity() is my nonlinear function f(x, w). As you can see from the code, I compute the output of each "neuron" in the network separately by looping over each set of weights. In principle, though, every neuron is completely independent of the others and could be run in parallel.

So one approach would be to vectorize my code further. However, I haven't been able to find a straightforward way to vectorize it such that I can pass a matrix X and a matrix W to f(x, w) and get back the outputs for a set of different neurons and a set of input data (I give the full code at the end). To me this seems hard to pull off (though I'm sure it's possible).

Another thought is that maybe there is some other way to tell pytorch that these computations are completely independent, so that it can do some parallel processing under the hood. Any ideas whether that is possible, or do I have to brute-force my way through with a fully vectorized solution?
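To illustrate the kind of mechanism I mean, here is a rough sketch using torch.func.vmap (PyTorch 2.x; older releases expose the same thing via functorch) applied to a made-up stand-in function toy_f rather than my real f(x, w). As I understand it, vmap only helps if the mapped function avoids in-place writes and Python-level control flow that depends on tensor values, so the real function would presumably need rewriting in that style first:

import torch
from torch.func import vmap

def toy_f(x, w):
    # stand-in for input_output_nonlinearity_torch: (1D x, 1D w) -> scalar
    return torch.tanh(x * w).sum()

x = torch.randn(100)       # one input sample
W = torch.randn(100, 100)  # like weights1.T: one row of weights per hidden "neuron"

# evaluate toy_f for every row of W at once, reusing the same x
hidden = vmap(toy_f, in_dims=(None, 0))(x, W)
print(hidden.shape)        # torch.Size([100])

# nesting the same trick batches over input samples as well
X = torch.randn(32, 100)
batched = vmap(vmap(toy_f, in_dims=(None, 0)), in_dims=(0, None))(X, W)
print(batched.shape)       # torch.Size([32, 100])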
Here is the code for the whole thing. Apologies for the length, but I wanted to provide the complete code so that any speed inefficiencies can be identified properly.

Here is the code that trains the network:
from neural_network_pytorch_dot_product import *
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import torch.utils.data as data
import numpy
import torch.nn.functional as F
numpoints_for_greens_integrals = 100
C_total = .1
readoutStrength = 1/C_total
def reduce_dataset(dataloader, fraction):
    num_samples = int(len(dataloader.dataset) * fraction)
    indices = torch.randperm(len(dataloader.dataset))[:num_samples]
    new_dataset = data.Subset(dataloader.dataset, indices)
    new_dataloader = data.DataLoader(new_dataset, batch_size=dataloader.batch_size, shuffle=True)
    return new_dataloader
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MLP, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        self.weights1 = nn.Parameter(torch.randn(input_size, hidden_size))  # weights of size (input_size, hidden_size)
        self.weights2 = nn.Parameter(torch.randn(hidden_size, num_classes))  # weights of size (hidden_size, num_classes)

    def forward(self, x):
        hidden_outputs = []
        for neuron_weights in self.weights1.T:  # loop over each neuron's weights in the first layer
            output_value = input_output_nonlinearity_torch(x.squeeze(), neuron_weights, C_total=C_total, readoutStrength=readoutStrength)
            output_value = F.relu(output_value)  # apply a relu activation function
            hidden_outputs.append(output_value)
        final_outputs = []
        for neuron_weights in self.weights2.T:  # loop over each neuron's weights in the second layer
            output_value = input_output_nonlinearity_torch(hidden_outputs, neuron_weights, C_total=C_total, readoutStrength=readoutStrength)
            final_outputs.append(output_value)
        final_outputs = [output.unsqueeze(0) for output in final_outputs]
        final_output = torch.stack(final_outputs, dim=0)
        final_output = final_output.t()
        return final_output
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# hyperparameters
input_size = 100
hidden_size = 100
num_classes = 10
num_epochs = 10
batch_size = 1
learning_rate = 0.001
pixelX = 10
# MNIST dataset (28x28 images!)
# train_dataset = datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
# test_dataset = datasets.MNIST(root='./data', train=False, transform=transforms.ToTensor())
# Compressed images to (10x10 images!)
# Define a new transform to resize the images
resize_transform = transforms.Resize((10, 10))
# MNIST dataset with resize transform
train_dataset = datasets.MNIST(root='./data', train=True, transform=transforms.Compose([
    transforms.ToTensor(),
    resize_transform
]), download=True)
test_dataset = datasets.MNIST(root='./data', train=False, transform=transforms.Compose([
    transforms.ToTensor(),
    resize_transform
]))
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)
# Update train loader and test loader
train_loader = reduce_dataset(train_loader, 0.01) # reduce to 1% of original size
test_loader = reduce_dataset(test_loader, 0.01) # reduce to 1% of original size
# instantiate the MLP
model = MLP(input_size, hidden_size, num_classes).to(device)
# loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# train the model
for epoch in range(num_epochs):
    running_loss = 0.0
    for i, (images, labels) in enumerate(train_loader):
        images = images.reshape(-1, pixelX*pixelX).to(device)
        labels = labels.to(device)
        # forward pass
        print('beginning forward pass')
        outputs = model(images)
        loss = criterion(outputs, labels)
        # backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {running_loss/len(train_loader)}")
# test the model
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, pixelX*pixelX).to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    print('Accuracy of the network on test images: {} %'.format(100 * correct / total))
And here is the code describing the nonlinear function f(x, w):
import torch
from torch import nn, optim
import numpy as np
from scipy import special
def readinKernel_torch(wdummy, z, Ec, Ep, kval=1, ic = 10**-9*torch.sqrt(3.14/(8*torch.log(torch.tensor([2.]))))*2*3.14*3*10**6, od = 10000, gamma = 2*3.14*18*10**9/(2*3.14*3*10**6), extra = 1, Np = (8*torch.log(torch.tensor([2.]))/(torch.pow(torch.tensor([10.])**-9*(2*3.14*3*10**6),2)*torch.tensor(torch.pi))).pow(0.25)):
    return Ec * kval * torch.special.bessel_j0(2* Ec * kval * torch.sqrt(torch.ger(z, (1 - wdummy))))*torch.exp(-1*1j*Ec**2*extra*repmat_torch(wdummy, len(z))*kval**2*gamma/od)* torch.sqrt(ic)*Ep*Np

def readoutKernel_torch(zdummy, z, B_in, Ec, kval=1):
    return (Ec * kval *
            steep_sigmoid_torch(torch.sub(z.repeat(zdummy.size(0), 1).T, zdummy), 50) *
            1 / torch.sqrt(torch.clamp(torch.sub(z.repeat(zdummy.size(0), 1).T, zdummy), min=1e-10)) *
            torch.special.bessel_j1(2 * Ec * kval * torch.sqrt(torch.clamp(torch.sub(z.repeat(zdummy.size(0), 1).T, zdummy), min=1e-10))) *
            repmat_torch(B_in, len(z)))

def final_readoutKernel_torch(zdummy, w, Ec, B_in, kval=1):
    # This is the same kernel as the readin kernel, but with K(z, w) switched to K(w, z).
    out = Ec * kval * torch.special.bessel_j0(2* Ec * kval * torch.sqrt(torch.ger(w, (1 - zdummy))))*repmat_torch(B_in, len(w))
    return out

def repmat_torch(arr, num_reps):
    return arr.view(1, -1).repeat(num_reps, 1)

def steep_sigmoid_torch(x, k=50):
    return 1.0 / (1.0 + torch.exp(-k*x))

def complex_trap_torch(z, xvals, axisdim):
    real_values = torch.real(z)
    imaginary_values = torch.imag(z)
    real_integral = torch.trapz(real_values, x=xvals, dim=axisdim)
    imaginary_integral = torch.trapz(imaginary_values, x=xvals, dim=axisdim)
    complexout = real_integral + 1j * imaginary_integral
    return complexout

def spinwave_recursive_calculation_torch(B_in, z_values, w_values, Ec, Ep, c_per_mode = 1):
    readin_values = readinKernel_torch(w_values, z_values, Ec, Ep, kval = c_per_mode)
    readout_values = readoutKernel_torch(z_values, z_values, B_in, Ec, kval = c_per_mode)
    readin_integrals = complex_trap_torch(readin_values, xvals=w_values, axisdim=1)
    readout_integrals = complex_trap_torch(readout_values, xvals=z_values, axisdim=1)
    spinwave = readin_integrals - readout_integrals + B_in
    return spinwave

def input_output_nonlinearity_torch(x, w, numpoints = 100, C_total = 1, readoutStrength = 1):
    z_values = torch.linspace(1e-10, 1-1e-10, numpoints)
    w_values = torch.linspace(1e-10, 1-1e-10, numpoints)
    Bin = torch.zeros(len(z_values), dtype=torch.complex128)
    BoutMatrix = repmat_torch(Bin, len(w))
    c_per_mode = C_total/len(w)
    for i in range(len(w)):
        E_c_val = w[i]
        E_p_val = x[i]
        # print('E_p_val', E_p_val)
        # print('x', x)
        BoutMatrix[i, :] = spinwave_recursive_calculation_torch(Bin, z_values, w_values, E_c_val, E_p_val, c_per_mode)
        Bin = BoutMatrix[i, :]
    Bout = BoutMatrix[-1, :]
    output_Efield_w_z = final_readoutKernel_torch(z_values, w_values, readoutStrength, Bout, kval=1)
    output_Efield_w = torch.trapz(torch.real(output_Efield_w_z), x=z_values, dim=1)
    output_Efield = torch.trapz(torch.real(output_Efield_w), x=w_values, dim=0)
    return output_Efield
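As a side note, here is a sketch of an out-of-place variant of input_output_nonlinearity_torch. It assumes (as in the code above) that only the last row of BoutMatrix is ever read, so the loop can simply carry Bin forward instead of writing matrix rows in place, which I believe is a prerequisite for function transforms like torch.func.vmap. I haven't verified this beyond matching it against the version above:

def input_output_nonlinearity_torch_v2(x, w, numpoints=100, C_total=1, readoutStrength=1):
    # Sketch: same recursion, but without in-place assignment into BoutMatrix;
    # only the final spin wave is kept, since only the last row is read above.
    z_values = torch.linspace(1e-10, 1 - 1e-10, numpoints)
    w_values = torch.linspace(1e-10, 1 - 1e-10, numpoints)
    Bin = torch.zeros(numpoints, dtype=torch.complex128)
    c_per_mode = C_total / len(w)
    for i in range(len(w)):
        Bin = spinwave_recursive_calculation_torch(Bin, z_values, w_values, w[i], x[i], c_per_mode)
    output_Efield_w_z = final_readoutKernel_torch(z_values, w_values, readoutStrength, Bin, kval=1)
    output_Efield_w = torch.trapz(torch.real(output_Efield_w_z), x=z_values, dim=1)
    return torch.trapz(output_Efield_w, x=w_values, dim=0)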
(Again, apologies for the lengthy code, but the central difficulty of my question is precisely that something this complicated is hard to vectorize. If I wrote a simpler example it might be more obvious how to vectorize it, but answers to that simplified problem would not help me.)
A: No answers yet

Comments