Asked by: Anon_name Asked: 10/14/2023 Last edited by: Anon_name Updated: 10/15/2023 Views: 48
error says my rows are inconsistent but checking it shows otherwise
Q:
I am trying to train an SVM. My data consists of 3 columns: the text (str), the text length (int), and a 'humor' label, where 1 means the text is humorous and 0 means it is not. An error is raised which I read as a length mismatch between my X_train and y_train, but checking them shows they do not differ. Here is the code snippet so far:
import pandas as pd
import numpy as np

data = '/kaggle/input/200k-short-texts-for-humor-detection/dataset.csv'
# https://www.kaggle.com/datasets/deepcontractor/200k-short-texts-for-humor-detection/data
df = pd.read_csv(data, nrows=10000)  # read_csv already returns a DataFrame
# Preprocessing: label-encode {True: 1, False: 0}
# True (i.e. funny) becomes 1, False becomes 0
df['humor'] = np.where(df['humor'], 1, 0)
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')    # needed by word_tokenize
nltk.download('wordnet')  # needed by WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')
sw = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
def stemfilter(tokens, stemmer):
    # takes a list of tokens and returns the list of their stems
    stemmed = []
    for token in tokens:
        stemmed.append(stemmer.stem(token))
    return stemmed
def preproc(sentence):
    # sentence is given as a list of tokenized words
    sentence = [w.lower() for w in sentence if w.isalpha()]  # lowercase, keep alphabetic tokens only
    # testing to see what works best:
    # sentence = [word for word in sentence if word not in sw]  # stopword removal
    sentence = [stemmer.stem(word) for word in sentence]  # reduce words to their stems, e.g. 'achieving' and 'achievement' both become 'achiev'
    # sentence = [lemmatizer.lemmatize(word) for word in sentence]  # lemmatization
    return ' '.join(sentence)
new_data = []
for sentence in df['text']:
    sentence = word_tokenize(sentence)
    sentence = preproc(sentence)
    new_data.append(sentence)
new_data = pd.DataFrame(new_data)
new_data = new_data.rename(columns={0: "text"})
new_data['humor'] = df['humor'].to_list()
new_data['length'] = new_data['text'].apply(len)
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X = new_data[['text', 'length']]
X_train, X_test, y_train, y_test = train_test_split(X, new_data['humor'], test_size=0.2, random_state=42)
# By the way, I do not think it is due to the preprocessing, because the code runs when X = new_data['text'], i.e. without the 'length' column
pipe_new = make_pipeline(TfidfVectorizer(), SVC())
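# Sanity check (a sketch): the split above leaves matching row counts,
# so the mismatch is presumably not introduced by train_test_split
# print(len(X_train), len(y_train))  # expected: 8000 8000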
pipe_new.fit(X_train , y_train)
But I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[42], line 10
7 # print(X_train.info, y_train.info)
9 pipe = make_pipeline(TfidfVectorizer(), SVC())
---> 10 pipe.fit(X_train , y_train)
11 # pipe.score(X_test , y_test)
File /opt/conda/lib/python3.10/site-packages/sklearn/pipeline.py:405, in Pipeline.fit(self, X, y, **fit_params)
403 if self._final_estimator != "passthrough":
404 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 405 self._final_estimator.fit(Xt, y, **fit_params_last_step)
407 return self
File /opt/conda/lib/python3.10/site-packages/sklearn/svm/_base.py:192, in BaseLibSVM.fit(self, X, y, sample_weight)
190 check_consistent_length(X, y)
191 else:
--> 192 X, y = self._validate_data(
193 X,
194 y,
195 dtype=np.float64,
196 order="C",
197 accept_sparse="csr",
198 accept_large_sparse=False,
199 )
201 y = self._validate_targets(y)
203 sample_weight = np.asarray(
204 [] if sample_weight is None else sample_weight, dtype=np.float64
205 )
File /opt/conda/lib/python3.10/site-packages/sklearn/base.py:584, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
582 y = check_array(y, input_name="y", **check_y_params)
583 else:
--> 584 X, y = check_X_y(X, y, **check_params)
585 out = X, y
587 if not no_val_X and check_params.get("ensure_2d", True):
File /opt/conda/lib/python3.10/site-packages/sklearn/utils/validation.py:1124, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1106 X = check_array(
1107 X,
1108 accept_sparse=accept_sparse,
(...)
1119 input_name="X",
1120 )
1122 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
-> 1124 check_consistent_length(X, y)
1126 return X, y
File /opt/conda/lib/python3.10/site-packages/sklearn/utils/validation.py:397, in check_consistent_length(*arrays)
395 uniques = np.unique(lengths)
396 if len(uniques) > 1:
--> 397 raise ValueError(
398 "Found input variables with inconsistent numbers of samples: %r"
399 % [int(l) for l in lengths]
400 )
ValueError: Found input variables with inconsistent numbers of samples: [2, 8000]
I think this means that my X_train and y_train datasets have different numbers of rows, so I checked: [print(X_train.info, y_train.info)](https://i.stack.imgur.com/hkqE6.png)
The only difference is the number of columns, which should not matter. Also, the code works if I remove the 'length' column.
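For reference, here is a minimal check (a sketch, reusing the X_train, y_train, and TfidfVectorizer from the snippet above) that shows where the 2 in the error message seems to come from: iterating over a two-column DataFrame yields its column labels rather than its rows, so the vectorizer sees only 2 "documents":

print(list(iter(X_train)))                     # ['text', 'length'] -- iterating a DataFrame yields column names
Xt = TfidfVectorizer().fit_transform(X_train)  # the vectorizer treats those 2 labels as the documents
print(Xt.shape[0], len(y_train))               # 2 vs. 8000, matching the error above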
A: No answers yet