How do I make my data generator use less RAM when training my AI?

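Question:

I'm currently using Kaggle's notebooks and environment, so I'm limited to 13 GB of RAM. The code previously used a small dataset, but the dataset is now over a gigabyte. I do have a data generator, but it no longer helps: RAM maxes out whenever I run anything. I'm using Python with Keras, and the dataset is a text corpus.

I tried turning batch_size, steps, and hidden_size all the way down (except for steps, which I raised, since that lowers RAM usage). I tried looking for solutions on Google and even asked ChatGPT for help. None of it worked. I would appreciate any help. Code: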

from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import tensorflow as tf
from keras.utils import Sequence

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.callbacks import LambdaCallback, ModelCheckpoint, ReduceLROnPlateau
import random
import sys

class TextDataGenerator(Sequence):
    def __init__(self, text, vocabulary, char_to_indices, indices_to_char, max_length, batch_size):
        self.text = text
        self.vocabulary = vocabulary
        self.char_to_indices = char_to_indices
        self.indices_to_char = indices_to_char
        self.max_length = max_length
        self.batch_size = batch_size
        # `text` is the list of windows here, so the number of full batches is
        # its length divided by the batch size
        self.steps = len(text) // batch_size
        
    def __len__(self):
        return self.steps
    
    def __getitem__(self, idx):
        batch_start = idx * self.batch_size
        batch_end = (idx + 1) * self.batch_size
        batches = self.text[batch_start:batch_end]
        X = np.zeros((self.batch_size, self.max_length, len(self.vocabulary)), dtype=bool)
        y = np.zeros((self.batch_size, len(self.vocabulary)), dtype=bool)
        for i, batch in enumerate(batches):
            for t, char in enumerate(batch[:-1]):
                X[i, t, self.char_to_indices[char]] = 1
            y[i, self.char_to_indices[batch[-1]]] = 1
        return X, y
    
    def on_epoch_end(self):
        random.shuffle(self.text)

with open('/kaggle/input/crptic-python/python.txt', 'r') as file:
    text = file.read()

# Build the character vocabulary and the lookup tables in both directions
vocabulary = sorted(list(set(text)))

char_to_indices = dict((c, i) for i, c in enumerate(vocabulary))
indices_to_char = dict((i, c) for i, c in enumerate(vocabulary))

# Split the text into overlapping windows of max_length + 1 characters,
# one every `steps` characters. The first max_length characters of a window
# are the network input and the last character is the prediction target.
max_length = 100
batch_size = 32
steps = 10
sentences = []
next_chars = []
for i in range(0, len(text) - max_length, steps):
    sentences.append(text[i: i + max_length + 1])
    # The target is text[i + max_length]; indexing one past it could run off
    # the end of the text
    next_chars.append(text[i + max_length])

# Building the LSTM network for the task
model = Sequential()
model.add(LSTM(128, input_shape=(max_length, len(vocabulary))))
model.add(Dense(len(vocabulary)))
model.add(Activation('softmax'))
# `lr` is the deprecated spelling of this argument; `learning_rate` is the current name
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


# Helper function to sample an index from a probability array
def sample_index(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


# Helper function to generate text after the end of each epoch
def on_epoch_end(epoch, logs):
    if epoch % 1 == 0:
        print()
        print('----- Generating text after Epoch: % d' % epoch)

        start_index = random.randint(0, len(text) - max_length - 1)
        for diversity in [0.1, 0.3, 0.5]:
            print('----- diversity:', diversity)

            generated = ''
            sentence = text[start_index: start_index + max_length]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            sys.stdout.write(generated)

            for i in range(400):
                x_pred = np.zeros((1, max_length, len(vocabulary)))
                for t, char in enumerate(sentence):
                    x_pred[0, t, char_to_indices[char]] = 1.

                preds = model.predict(x_pred, verbose=0)[0]
                next_index = sample_index(preds, diversity)
                next_char = indices_to_char[next_index]

                generated += next_char
                sentence = sentence[1:] + next_char

                sys.stdout.write(next_char)
                sys.stdout.flush()
            print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

# Callback that saves the model after each epoch
# in which the loss decreases
filepath = "weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss',
                             verbose=1, save_best_only=True,
                             mode='min')

# Callback that reduces the learning rate each time
# the loss plateaus
reduce_alpha = ReduceLROnPlateau(monitor='loss', factor=0.2,
                                 patience=1, min_lr=0.001)
callbacks = [print_callback, checkpoint, reduce_alpha]

# Training the LSTM model
data_generator = TextDataGenerator(sentences, vocabulary, char_to_indices, indices_to_char, max_length, batch_size)
model.fit(data_generator, epochs=2, callbacks=callbacks)

def generate_text(length, diversity):
    # Get random starting text
    start_index = random.randint(0, len(text) - max_length - 1)
    generated = ''
    sentence = text[start_index: start_index + max_length]
    generated += sentence
    for i in range(length):
        x_pred = np.zeros((1, max_length, len(vocabulary)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_to_indices[char]] = 1.

        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample_index(preds, diversity)
        next_char = indices_to_char[next_index]

        generated += next_char
        sentence = sentence[1:] + next_char
    return generated


print(generate_text(500, 0.5))
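
One possible direction, offered as a sketch under assumptions rather than something tested against the dataset above: with steps = 10 and max_length = 100, the sentences list holds one 101-character string for every 10 characters of text, so it is roughly ten times the size of the text itself before any per-string overhead. A generator that keeps only the raw text plus the window start offsets avoids building that list. The class name LazyTextWindows and its step parameter are hypothetical, not part of Keras; for use with model.fit the class would presumably subclass keras.utils.Sequence, which is left out here so the example runs with NumPy alone.

```python
import numpy as np


class LazyTextWindows:
    """Builds one-hot batches on demand from a single text string.

    Hypothetical stand-in for the question's TextDataGenerator. It follows the
    __len__ / __getitem__ / on_epoch_end protocol of keras.utils.Sequence but
    does not inherit from it, so this sketch needs only NumPy.
    """

    def __init__(self, text, char_to_indices, max_length, batch_size, step):
        self.text = text
        self.char_to_indices = char_to_indices
        self.max_length = max_length
        self.batch_size = batch_size
        # Store only the start offset of each window (8 bytes per window),
        # never the window strings themselves.
        self.starts = np.arange(0, len(text) - max_length, step, dtype=np.int64)

    def __len__(self):
        # Number of full batches per epoch.
        return len(self.starts) // self.batch_size

    def __getitem__(self, idx):
        starts = self.starts[idx * self.batch_size:(idx + 1) * self.batch_size]
        vocab_size = len(self.char_to_indices)
        X = np.zeros((len(starts), self.max_length, vocab_size), dtype=bool)
        y = np.zeros((len(starts), vocab_size), dtype=bool)
        for i, start in enumerate(starts):
            # Slice the window out of the text only when the batch is requested.
            window = self.text[start:start + self.max_length]
            for t, char in enumerate(window):
                X[i, t, self.char_to_indices[char]] = True
            # The target is the character right after the window.
            y[i, self.char_to_indices[self.text[start + self.max_length]]] = True
        return X, y

    def on_epoch_end(self):
        # Shuffle the offsets in place; the text is never copied or reordered.
        np.random.shuffle(self.starts)


# Small demonstration on a toy string instead of the real corpus.
demo_text = "abcdefghij" * 5
demo_vocab = sorted(set(demo_text))
demo_c2i = {c: i for i, c in enumerate(demo_vocab)}
gen = LazyTextWindows(demo_text, demo_c2i, max_length=4, batch_size=3, step=2)
X, y = gen[0]
print(len(gen), X.shape, y.shape)
```

The offsets array still grows with the corpus, so for a very large text it may be worth computing each offset arithmetically from the batch index instead and letting Keras shuffle the batch order.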

Python TensorFlow Keras artificial-intelligence data-generation

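0 upvotes Mike 'Pomax' Kamermans 6/22/2023

Complete side note, but: why are you using a __future__ statement that looks like Python 2.x, when 3.8 is pretty much the absolute oldest Python version that is still supported? (Sure, 3.7 has 6 more days, but 6 days is nothing.) Are you writing new code in a dead version of Python?

0 upvotes ProgrammerGuy 6/22/2023

@Mike'Pomax'Kamermans Yes. I don't even know why I need it, but when I removed it, something came up in on_epoch_end.

Answers: none yet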
