Error says my rows are inconsistent, but checking shows otherwise

Asked by: Anon_name  Asked: 10/14/2023  Last edited by: Anon_name  Updated: 10/15/2023  Views: 48

Q:

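I am trying to train an SVM. My data has 3 columns: the text (str), the text length (int), and a "humor" label, where 1 means the text is humorous and 0 means it is not. Fitting raises an error that I read as a length mismatch between my X_train and y_train, but checking shows that their lengths do not differ. Here is the code so far: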


import numpy as np
import pandas as pd

data = '/kaggle/input/200k-short-texts-for-humor-detection/dataset.csv'
# https://www.kaggle.com/datasets/deepcontractor/200k-short-texts-for-humor-detection/data
df = pd.read_csv(data, nrows=10000)  # read_csv already returns a DataFrame

# preprocessing {True:1, False:0} Label Encoding
# True (i.e Funny) becomes 1, & vice versa
df['humor'] = np.where(df['humor'],1,0) 

import nltk

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
# nltk.download()

from nltk.corpus import stopwords 
nltk.download('stopwords')

sw = set (stopwords.words('english'))
stemmer = PorterStemmer()

lemmatizer = WordNetLemmatizer()

def stemfilter(tokens,stemmer):
    # list argument
    stemmed = []
    for token in tokens:
        stemmed.append(stemmer.stem(token))
    return stemmed
    

def preproc(sentence):
    # sentence is given as a list of tokenized words
    sentence = [w.lower() for w in sentence if w.isalpha()] # removing uppercasing

# testing to see what works best
#     sentence = [word for word in sentence if word not in sw] #stopword removal
    sentence = [stemmer.stem(word) for word in sentence] # reduce words to their stems, e.g. 'achieving' and 'achievement' both become 'achiev'
#     sentence = [lemmatizer.lemmatize(word) for word in sentence]
    return ' '.join(sentence)    
    
new_data = []
for sentence in df['text']:
    sentence = word_tokenize(sentence)
    sentence = preproc(sentence)
    new_data.append(sentence)

new_data = pd.DataFrame(new_data)
new_data = new_data.rename(columns = {0:"text"})
new_data['humor'] = [a for a in df['humor']]

new_data['length'] = new_data['text'].apply(len)

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X = new_data[['text','length']]

X_train, X_test, y_train, y_test = train_test_split(X,new_data['humor'],test_size=0.2, random_state=42)

# by the way, I do not think this is caused by the preprocessing, because the code runs when X = new_data['text'], i.e. without the 'length' column

pipe_new = make_pipeline(TfidfVectorizer(), SVC())
pipe_new.fit(X_train , y_train)


But I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[42], line 10
      7 # print(X_train.info, y_train.info)
      9 pipe = make_pipeline(TfidfVectorizer(), SVC())
---> 10 pipe.fit(X_train , y_train)
     11 # pipe.score(X_test , y_test)

File /opt/conda/lib/python3.10/site-packages/sklearn/pipeline.py:405, in Pipeline.fit(self, X, y, **fit_params)
    403     if self._final_estimator != "passthrough":
    404         fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 405         self._final_estimator.fit(Xt, y, **fit_params_last_step)
    407 return self

File /opt/conda/lib/python3.10/site-packages/sklearn/svm/_base.py:192, in BaseLibSVM.fit(self, X, y, sample_weight)
    190     check_consistent_length(X, y)
    191 else:
--> 192     X, y = self._validate_data(
    193         X,
    194         y,
    195         dtype=np.float64,
    196         order="C",
    197         accept_sparse="csr",
    198         accept_large_sparse=False,
    199     )
    201 y = self._validate_targets(y)
    203 sample_weight = np.asarray(
    204     [] if sample_weight is None else sample_weight, dtype=np.float64
    205 )

File /opt/conda/lib/python3.10/site-packages/sklearn/base.py:584, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    582         y = check_array(y, input_name="y", **check_y_params)
    583     else:
--> 584         X, y = check_X_y(X, y, **check_params)
    585     out = X, y
    587 if not no_val_X and check_params.get("ensure_2d", True):

File /opt/conda/lib/python3.10/site-packages/sklearn/utils/validation.py:1124, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1106 X = check_array(
   1107     X,
   1108     accept_sparse=accept_sparse,
   (...)
   1119     input_name="X",
   1120 )
   1122 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
-> 1124 check_consistent_length(X, y)
   1126 return X, y

File /opt/conda/lib/python3.10/site-packages/sklearn/utils/validation.py:397, in check_consistent_length(*arrays)
    395 uniques = np.unique(lengths)
    396 if len(uniques) > 1:
--> 397     raise ValueError(
    398         "Found input variables with inconsistent numbers of samples: %r"
    399         % [int(l) for l in lengths]
    400     )

ValueError: Found input variables with inconsistent numbers of samples: [2, 8000]

I took this to mean that my X_train and y_train have different numbers of rows, so I checked. Output of print(X_train.info, y_train.info): https://i.stack.imgur.com/hkqE6.png
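A quick way to probe where the `2` in `[2, 8000]` might come from, sketched on made-up toy data rather than the Kaggle file: `TfidfVectorizer` iterates over whatever it is given, and iterating a pandas DataFrame yields its column labels, not its rows. The names `X_toy`, `docs_seen`, `Xt_frame`, and `Xt_series` are illustrative only and do not appear in the original code.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for X_train (hypothetical data, not the Kaggle dataset)
X_toy = pd.DataFrame({
    "text": [
        "why did the chicken cross the road",
        "stock prices fell today",
        "knock knock who is there",
    ],
})
X_toy["length"] = X_toy["text"].str.len()

# Iterating a DataFrame yields the column labels, not the rows
docs_seen = list(X_toy)

# Given the whole frame, the vectorizer treats each column label as a document
Xt_frame = TfidfVectorizer().fit_transform(X_toy)

# Given the 'text' Series, it sees one document per row
Xt_series = TfidfVectorizer().fit_transform(X_toy["text"])

print(docs_seen)
print(Xt_frame.shape[0], Xt_series.shape[0])
```

If the same thing happens in the pipeline above, the vectorizer would hand `SVC` a matrix with 2 rows (one per column of `X_train`) while `y_train` still has 8000 entries, which would match the reported error.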

The only difference is the number of columns, which shouldn't matter. Also, the code works if I remove the 'length' column.
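The column count can matter here, because `TfidfVectorizer` expects a 1-D sequence of strings rather than a multi-column DataFrame. One possible way to use both columns is scikit-learn's `ColumnTransformer`, which routes each column to its own transformer. The following is a minimal sketch on made-up toy data; the names `toy`, `features`, and `pipe_toy`, and the choice of `StandardScaler` for the length column, are assumptions for illustration and not part of the original post.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data with the same three columns as the question (hypothetical values)
toy = pd.DataFrame({
    "text": [
        "why did the chicken cross the road",
        "stock prices fell today",
        "knock knock who is there",
        "the council meets on monday",
    ],
    "humor": [1, 0, 1, 0],
})
toy["length"] = toy["text"].str.len()

features = ColumnTransformer([
    # A plain string selects the column as 1-D input, which text vectorizers need
    ("tfidf", TfidfVectorizer(), "text"),
    # A list selects 2-D input, which numeric scalers need
    ("len", StandardScaler(), ["length"]),
])

pipe_toy = make_pipeline(features, SVC())
pipe_toy.fit(toy[["text", "length"]], toy["humor"])

preds = pipe_toy.predict(toy[["text", "length"]])
print(len(preds))
```

With this layout the transformed matrix keeps one row per sample, so the sample counts of X and y agree when they reach `SVC`.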

pandas machine-learning scikit-learn svm

Comments

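0 votes Nosa aikodon 10/15/2023
From the image you shared, the humor column in the dataframe appears to consist of another series with two rows. I suspect a preprocessing problem somewhere before this code excerpt. Could you share the dataset and the full code so that we can reproduce the issue? Thanks.
0 votes Anon_name 10/15/2023
Hey, thanks for the reply! I updated the code to reflect everything else and added a link to where I got the dataset.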

A: No answers yet