
Error in XGBoost training on a large dataset in python jupyter

I have a pickle file with all the features extracted from the raw dataset, and now I am trying to train an XGBoost model on it.

In [4]: df.shape
Out[4]:(8474661, 70)
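For reference, a minimal sketch of how a feature table like this could be loaded from the pickle file (the file name and the label column name are hypothetical, matching the 69-features-plus-label layout above):

import pandas as pd

# Hypothetical file name for the pre-extracted feature table
df = pd.read_pickle("extracted_features.pkl")

# Assumed layout: 69 feature columns plus one label column
label = "label"
x_cols = [c for c in df.columns if c != label]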

from sklearn.model_selection import train_test_split
import xgboost as xgb

X = df[x_cols]
y = df[label]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.25)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
Out[8]:(6355995, 69) (6355995,)
       (2118666, 69) (2118666,)

y_train.value_counts()
Out[9]:0.0    5734377
       1.0     621618
       dtype: int64

y_test.value_counts()
Out[10]:0.0    1911460
       1.0     207206
       dtype: int64

D_train = xgb.DMatrix(X_train, label=y_train)
D_test = xgb.DMatrix(X_test, label=y_test)

Here, I am getting a memory error:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-11-0a198674af60> in <module>
----> 1 D_train = xgb.DMatrix(X_train, label=y_train)
      2 D_test = xgb.DMatrix(X_test, label=y_test)

~\anaconda3\envs\Py-37\lib\site-packages\xgboost\core.py in __init__(self, data, label, missing, weight, silent, feature_names, feature_types, nthread)
    378         data, feature_names, feature_types = _maybe_pandas_data(data,
    379                                                                 feature_names,
--> 380                                                                 feature_types)
    381 
    382         data, feature_names, feature_types = _maybe_dt_data(data,

~\anaconda3\envs\Py-37\lib\site-packages\xgboost\core.py in _maybe_pandas_data(data, feature_names, feature_types)
    251         feature_types = [PANDAS_DTYPE_MAPPER[dtype.name] for dtype in data_dtypes]
    252 
--> 253     data = data.values.astype('float')
    254 
    255     return data, feature_names, feature_types

MemoryError: Unable to allocate 3.27 GiB for an array with shape (6355995, 69) and data type float64
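The requested allocation is a dense float64 copy of X_train; the quick arithmetic below reproduces the 3.27 GiB figure from the traceback:

# rows x columns x 8 bytes per float64 value, expressed in GiB
6355995 * 69 * 8 / 2**30   # ≈ 3.27 GiB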

How should I train on this data?

Using xgboost version 0.90.
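Not a verified fix, but one direction often suggested for this error: the pandas code path shown in the traceback casts the whole frame to float64 (data.values.astype('float')), so building the DMatrix from float32 NumPy arrays roughly halves that allocation. A minimal sketch:

import numpy as np
import xgboost as xgb

# Convert to float32 up front so the DMatrix constructor does not
# allocate an intermediate float64 copy of the whole training set.
X_train_np = X_train.values.astype(np.float32)
X_test_np = X_test.values.astype(np.float32)

D_train = xgb.DMatrix(X_train_np, label=y_train.values,
                      feature_names=list(X_train.columns))
D_test = xgb.DMatrix(X_test_np, label=y_test.values,
                     feature_names=list(X_test.columns))

If even the float32 copies do not fit in memory, common fallbacks are xgboost's external-memory mode (a libsvm file path with a #cache suffix) or downsampling the training rows.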



1 Answer

Waiting for an expert to reply.
