python - Memory efficient way to split large numpy array into train and test
I have a large numpy array, and when I run scikit-learn's train_test_split to split the array into training and test data, I run into memory errors. What is a more memory efficient method of splitting into train and test, and why does train_test_split cause this?
The following code results in a memory error and causes a crash:
import numpy as np
from sklearn.cross_validation import train_test_split

x = np.random.random((10000, 70000))
y = np.random.random((10000,))
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
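For a rough sense of scale: x alone is 10000 × 70000 float64 values, about 5.2 GiB, and as far as I understand train_test_split shuffles by indexing with a permuted index array, which in numpy produces copies rather than views, so the split needs several more gigabytes on top of x itself. A quick back-of-the-envelope check:

# rough footprint of x, computed without allocating anything large
rows, cols, bytes_per_float64 = 10000, 70000, 8
size_gib = rows * cols * bytes_per_float64 / 1024.0**3
print("x alone: about %.1f GiB" % size_gib)                    # ~5.2 GiB
print("a shuffled train/test copy: about %.1f GiB more" % size_gib)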
One method I've tried that works is to store x in a pandas DataFrame and shuffle it with
x = x.reindex(np.random.permutation(x.index))
since I arrive at the same memory error when I try
np.random.shuffle(x)
Then, I convert the pandas DataFrame back to a numpy array, and using this function, I can obtain a train test split:
# test_proportion of 3 means 1/3 (33%) test and 2/3 (67%) train
def shuffle(matrix, target, test_proportion):
    ratio = matrix.shape[0] // test_proportion   # number of rows in the test set
    x_train = matrix[ratio:, :]
    x_test = matrix[:ratio, :]
    y_train = target[ratio:]
    y_test = target[:ratio]
    return x_train, x_test, y_train, y_test

x_train, x_test, y_train, y_test = shuffle(x, y, 3)
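Part of why this version stays within memory, as far as I can tell, is that basic slices like matrix[ratio:, :] are views onto the same buffer, while indexing with an index array (which a shuffled split needs) allocates a copy. A tiny sketch to illustrate the difference:

import numpy as np

a = np.arange(12).reshape(4, 3)

view = a[1:, :]                    # basic slice: a view, no new row data
copy = a[np.array([1, 2, 3]), :]   # fancy indexing: allocates a copy

print(np.shares_memory(a, view))   # True  -> no extra memory for the rows
print(np.shares_memory(a, copy))   # False -> full copy of the selected rows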
This works for now, and when I want to do k-fold cross-validation, I can iteratively loop k times and shuffle the pandas DataFrame. While this suffices for now, why do numpy's and scikit-learn's implementations of shuffle and train_test_split result in memory errors for big arrays?
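For reference, the k-round loop I have in mind looks roughly like this (a sketch only, reusing x, y and the shuffle() function above, and applying the same permutation to y so the targets stay aligned):

import numpy as np
import pandas as pd

k = 3
df = pd.DataFrame(x)                      # x, y as defined above
for _ in range(k):
    # fresh permutation of the original row labels (0..n-1)
    perm = np.random.permutation(df.index)
    df = df.reindex(perm)                 # reshuffle the frame's rows
    y_perm = y[perm]                      # same order for the targets (small copy)
    x_train, x_test, y_train, y_test = shuffle(df.values, y_perm, 3)
    # ... train and evaluate on this split ...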