Asked by: Steven Sagona · Asked: 9/1/2023 · Last edited by: Steven Sagona · Updated: 9/10/2023 · Views: 139
Difficulty speeding up pytorch code: training a MLP using a complicated many-to-one nonlinear function
Q:

In summary:

My goal is to figure out whether a particular complicated nonlinear function can be used to replace the individual neurons in a neural network. Ideally, I'd like to show that I can train it on MNIST digit images. I've tried this with pytorch, but it is far too slow, mainly because I can't work out how to parallelize over the batch and over the neurons, and I'm looking for ideas or approaches that would speed the process up significantly.

A typical neuron in a neural network performs a dot product and then applies a nonlinear function to the output of that dot product, f(x dot w).

Instead of f(x dot w), I'm considering a many-to-one nonlinear function that is a more general nonlinear function of x and w, i.e. f(x, w). The nonlinear function f(x, w) takes a 1D array X and a 1D array W and returns a single output. I have numpy code that performs this calculation; it simulates a real physical system and requires a series of recursive integrals to compute. In a previous question I learned that I can translate my numpy code into a pytorch function, and pytorch should then be able to perform backpropagation through it automatically.
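As a toy illustration of that shape contract (toy_f below is a made-up placeholder, not my physics code): a pytorch function that takes two 1D tensors and returns a single scalar, which autograd can differentiate without any hand-written gradients:

import torch

def toy_f(x, w):
    # hypothetical stand-in for f(x, w): 1D x, 1D w -> single scalar output
    return torch.tanh(x * w).sum()

x = torch.randn(100)
w = torch.randn(100, requires_grad=True)
y = toy_f(x, w)      # single output value
y.backward()         # autograd supplies the gradient with respect to w
print(w.grad.shape)  # torch.Size([100])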
I now have pytorch code describing the nonlinear function f(x, w). I want to show that I can use it to learn the digit images, so I downsampled the MNIST digits to 10x10 pixel images and built an MLP-inspired network with 100 inputs, a hidden size of 100, and 10 outputs.

To explain this MLP-inspired network in more detail:

The first layer consists of 100 "neurons", where the typical neuron is replaced by my nonlinear function f(x, w). Each of the 100 "neurons" takes the input X and has its own distinct set of weights w. The outputs of these 100 neurons are then passed to the next layer, which is just 10 neurons whose outputs are used to identify each of the 10 digits.

Here is a code snippet of the network's forward pass:
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MLP, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        self.weights1 = nn.Parameter(torch.randn(input_size, hidden_size))  # weights of size (input_size, hidden_size)
        self.weights2 = nn.Parameter(torch.randn(hidden_size, num_classes))  # weights of size (hidden_size, num_classes)

    def forward(self, x):
        hidden_outputs = []
        for neuron_weights in self.weights1.T:  # loop over each neuron's weights in the first layer
            output_value = input_output_nonlinearity_torch(x.squeeze(), neuron_weights, C_total=C_total, readoutStrength=readoutStrength)
            output_value = F.relu(output_value)  # apply a relu activation function
            hidden_outputs.append(output_value)
        final_outputs = []
        for neuron_weights in self.weights2.T:  # loop over each neuron's weights in the second layer
            output_value = input_output_nonlinearity_torch(hidden_outputs, neuron_weights, C_total=C_total, readoutStrength=readoutStrength)
            final_outputs.append(output_value)
        final_outputs = [output.unsqueeze(0) for output in final_outputs]
        final_output = torch.stack(final_outputs, dim=0)
        final_output = final_output.t()
        return final_output
The problem is that each training iteration of this network takes about 20 minutes, and that is just a single feed-forward pass on a single digit. So I really need to figure out how to make it faster.

input_output_nonlinearity() is my nonlinear function f(x, w). As you can see from the code, I compute the output of each "neuron" in the network separately by looping over each set of weights. In principle, though, every neuron is completely independent of the others and could be run in parallel.

So one approach would be to vectorize my code further. However, I haven't been able to find a straightforward way to vectorize it such that I can pass a matrix X and a matrix W to f(x, w) and get back the outputs for a set of different neurons and a set of input data (I give the full code at the end). To me this seems hard to pull off (though I'm sure it's possible).

Another thought is that maybe there is some other way to tell pytorch that these computations are completely independent, so that it can do some parallel processing under the hood. Any ideas whether that is possible, or do I have to brute-force my way through with a fully vectorized solution?
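To illustrate the kind of mechanism I mean, here is a rough sketch using torch.func.vmap (PyTorch 2.x; older releases expose the same thing via functorch) applied to a made-up stand-in function toy_f rather than my real f(x, w). As I understand it, vmap only helps if the mapped function avoids in-place writes and Python-level control flow that depends on tensor values, so the real function would presumably need rewriting in that style first:

import torch
from torch.func import vmap

def toy_f(x, w):
    # stand-in for input_output_nonlinearity_torch: (1D x, 1D w) -> scalar
    return torch.tanh(x * w).sum()

x = torch.randn(100)       # one input sample
W = torch.randn(100, 100)  # like weights1.T: one row of weights per hidden "neuron"

# evaluate toy_f for every row of W at once, reusing the same x
hidden = vmap(toy_f, in_dims=(None, 0))(x, W)
print(hidden.shape)        # torch.Size([100])

# nesting the same trick batches over input samples as well
X = torch.randn(32, 100)
batched = vmap(vmap(toy_f, in_dims=(None, 0)), in_dims=(0, None))(X, W)
print(batched.shape)       # torch.Size([32, 100])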
Here is the code for the whole thing. Apologies for the length, but I wanted to provide the complete code so that any speed inefficiencies can be identified properly.

Here is the code that trains the network:
from neural_network_pytorch_dot_product import *
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import torch.utils.data as data
import numpy
import torch.nn.functional as F
numpoints_for_greens_integrals = 100
C_total = .1
readoutStrength = 1/C_total
def reduce_dataset(dataloader, fraction):
    num_samples = int(len(dataloader.dataset) * fraction)
    indices = torch.randperm(len(dataloader.dataset))[:num_samples]
    new_dataset = data.Subset(dataloader.dataset, indices)
    new_dataloader = data.DataLoader(new_dataset, batch_size=dataloader.batch_size, shuffle=True)
    return new_dataloader
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MLP, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        self.weights1 = nn.Parameter(torch.randn(input_size, hidden_size))  # weights of size (input_size, hidden_size)
        self.weights2 = nn.Parameter(torch.randn(hidden_size, num_classes))  # weights of size (hidden_size, num_classes)

    def forward(self, x):
        hidden_outputs = []
        for neuron_weights in self.weights1.T:  # loop over each neuron's weights in the first layer
            output_value = input_output_nonlinearity_torch(x.squeeze(), neuron_weights, C_total=C_total, readoutStrength=readoutStrength)
            output_value = F.relu(output_value)  # apply a relu activation function
            hidden_outputs.append(output_value)
        final_outputs = []
        for neuron_weights in self.weights2.T:  # loop over each neuron's weights in the second layer
            output_value = input_output_nonlinearity_torch(hidden_outputs, neuron_weights, C_total=C_total, readoutStrength=readoutStrength)
            final_outputs.append(output_value)
        final_outputs = [output.unsqueeze(0) for output in final_outputs]
        final_output = torch.stack(final_outputs, dim=0)
        final_output = final_output.t()
        return final_output
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# hyperparameters
input_size = 100
hidden_size = 100
num_classes = 10
num_epochs = 10
batch_size = 1
learning_rate = 0.001
pixelX = 10
# MNIST dataset (28x28 images!)
# train_dataset = datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
# test_dataset = datasets.MNIST(root='./data', train=False, transform=transforms.ToTensor())
# Compressed images to (10x10 images!)
# Define a new transform to resize the images
resize_transform = transforms.Resize((10, 10))
# MNIST dataset with resize transform
train_dataset = datasets.MNIST(root='./data', train=True, transform=transforms.Compose([
    transforms.ToTensor(),
    resize_transform
]), download=True)
test_dataset = datasets.MNIST(root='./data', train=False, transform=transforms.Compose([
    transforms.ToTensor(),
    resize_transform
]))
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)
# Update train loader and test loader
train_loader = reduce_dataset(train_loader, 0.01) # reduce to 1% of original size
test_loader = reduce_dataset(test_loader, 0.01) # reduce to 1% of original size
# instantiate the MLP
model = MLP(input_size, hidden_size, num_classes).to(device)
# loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# train the model
for epoch in range(num_epochs):
    running_loss = 0.0
    for i, (images, labels) in enumerate(train_loader):
        images = images.reshape(-1, pixelX*pixelX).to(device)
        labels = labels.to(device)
        # forward pass
        print('beginning forward pass')
        outputs = model(images)
        loss = criterion(outputs, labels)
        # backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {running_loss/len(train_loader)}")
# test the model
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, pixelX*pixelX).to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    print('Accuracy of the network on test images: {} %'.format(100 * correct / total))
And here is the code describing the nonlinear function f(x, w):
import torch
from torch import nn, optim
import numpy as np
from scipy import special
def readinKernel_torch(wdummy, z, Ec, Ep, kval=1, ic = 10**-9*torch.sqrt(3.14/(8*torch.log(torch.tensor([2.]))))*2*3.14*3*10**6, od = 10000, gamma = 2*3.14*18*10**9/(2*3.14*3*10**6), extra = 1, Np = (8*torch.log(torch.tensor([2.]))/(torch.pow(torch.tensor([10.])**-9*(2*3.14*3*10**6),2)*torch.tensor(torch.pi))).pow(0.25)):
    return Ec * kval * torch.special.bessel_j0(2* Ec * kval * torch.sqrt(torch.ger(z, (1 - wdummy))))*torch.exp(-1*1j*Ec**2*extra*repmat_torch(wdummy, len(z))*kval**2*gamma/od)* torch.sqrt(ic)*Ep*Np

def readoutKernel_torch(zdummy, z, B_in, Ec, kval=1):
    return (Ec * kval *
            steep_sigmoid_torch(torch.sub(z.repeat(zdummy.size(0), 1).T, zdummy), 50) *
            1 / torch.sqrt(torch.clamp(torch.sub(z.repeat(zdummy.size(0), 1).T, zdummy), min=1e-10)) *
            torch.special.bessel_j1(2 * Ec * kval * torch.sqrt(torch.clamp(torch.sub(z.repeat(zdummy.size(0), 1).T, zdummy), min=1e-10))) *
            repmat_torch(B_in, len(z)))

def final_readoutKernel_torch(zdummy, w, Ec, B_in, kval=1):
    # This is the same kernel as the readin kernel, but with K(z, w) switched to K(w, z).
    out = Ec * kval * torch.special.bessel_j0(2* Ec * kval * torch.sqrt(torch.ger(w, (1 - zdummy))))*repmat_torch(B_in, len(w))
    return out

def repmat_torch(arr, num_reps):
    return arr.view(1, -1).repeat(num_reps, 1)

def steep_sigmoid_torch(x, k=50):
    return 1.0 / (1.0 + torch.exp(-k*x))

def complex_trap_torch(z, xvals, axisdim):
    real_values = torch.real(z)
    imaginary_values = torch.imag(z)
    real_integral = torch.trapz(real_values, x=xvals, dim=axisdim)
    imaginary_integral = torch.trapz(imaginary_values, x=xvals, dim=axisdim)
    complexout = real_integral + 1j * imaginary_integral
    return complexout

def spinwave_recursive_calculation_torch(B_in, z_values, w_values, Ec, Ep, c_per_mode = 1):
    readin_values = readinKernel_torch(w_values, z_values, Ec, Ep, kval = c_per_mode)
    readout_values = readoutKernel_torch(z_values, z_values, B_in, Ec, kval = c_per_mode)
    readin_integrals = complex_trap_torch(readin_values, xvals=w_values, axisdim=1)
    readout_integrals = complex_trap_torch(readout_values, xvals=z_values, axisdim=1)
    spinwave = readin_integrals - readout_integrals + B_in
    return spinwave

def input_output_nonlinearity_torch(x, w, numpoints = 100, C_total = 1, readoutStrength = 1):
    z_values = torch.linspace(1e-10, 1-1e-10, numpoints)
    w_values = torch.linspace(1e-10, 1-1e-10, numpoints)
    Bin = torch.zeros(len(z_values), dtype=torch.complex128)
    BoutMatrix = repmat_torch(Bin, len(w))
    c_per_mode = C_total/len(w)
    for i in range(len(w)):
        E_c_val = w[i]
        E_p_val = x[i]
        # print('E_p_val', E_p_val)
        # print('x', x)
        BoutMatrix[i, :] = spinwave_recursive_calculation_torch(Bin, z_values, w_values, E_c_val, E_p_val, c_per_mode)
        Bin = BoutMatrix[i, :]
    Bout = BoutMatrix[-1, :]
    output_Efield_w_z = final_readoutKernel_torch(z_values, w_values, readoutStrength, Bout, kval=1)
    output_Efield_w = torch.trapz(torch.real(output_Efield_w_z), x=z_values, dim=1)
    output_Efield = torch.trapz(torch.real(output_Efield_w), x=w_values, dim=0)
    return output_Efield
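As a side note, here is a sketch of an out-of-place variant of input_output_nonlinearity_torch. It assumes (as in the code above) that only the last row of BoutMatrix is ever read, so the loop can simply carry Bin forward instead of writing matrix rows in place, which I believe is a prerequisite for function transforms like torch.func.vmap. I haven't verified this beyond matching it against the version above:

def input_output_nonlinearity_torch_v2(x, w, numpoints=100, C_total=1, readoutStrength=1):
    # Sketch: same recursion, but without in-place assignment into BoutMatrix;
    # only the final spin wave is kept, since only the last row is read above.
    z_values = torch.linspace(1e-10, 1 - 1e-10, numpoints)
    w_values = torch.linspace(1e-10, 1 - 1e-10, numpoints)
    Bin = torch.zeros(numpoints, dtype=torch.complex128)
    c_per_mode = C_total / len(w)
    for i in range(len(w)):
        Bin = spinwave_recursive_calculation_torch(Bin, z_values, w_values, w[i], x[i], c_per_mode)
    output_Efield_w_z = final_readoutKernel_torch(z_values, w_values, readoutStrength, Bin, kval=1)
    output_Efield_w = torch.trapz(torch.real(output_Efield_w_z), x=z_values, dim=1)
    return torch.trapz(output_Efield_w, x=w_values, dim=0)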
(Again, apologies for the lengthy code, but the central difficulty of my question is precisely that something this complicated is hard to vectorize. If I wrote a simpler example it might be more obvious how to vectorize it, but answers to that simplified problem would not help me.)
A: No answers yet

Comments