python - Memory efficient way to split large numpy array into train and test


I have a large numpy array, and when I run scikit-learn's train_test_split to split the array into training and test data, I always run into memory errors. What would be a more memory efficient method of splitting into train and test, and why does train_test_split cause this?

The following code results in a memory error and causes a crash:

import numpy as np
# sklearn.cross_validation in scikit-learn releases before 0.18
from sklearn.model_selection import train_test_split

x = np.random.random((10000, 70000))
y = np.random.random((10000,))
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
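For a sense of scale, the size of x can be worked out without allocating it. This is only a rough sketch: it assumes the float64 dtype that np.random.random returns, and the helper name array_nbytes is made up for illustration.

```python
import numpy as np

def array_nbytes(shape, dtype=np.float64):
    """Bytes needed for a dense array of this shape, without allocating it."""
    n_items = 1
    for dim in shape:
        n_items *= dim
    return n_items * np.dtype(dtype).itemsize

x_bytes = array_nbytes((10000, 70000))
print(x_bytes / 1e9)  # size of x in GB (10**9 bytes)
```

That comes to about 5.6 GB for x alone, so any step that copies its rows into new train and test arrays would need roughly that much again on top.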

One method I've tried that works is to store x in a pandas DataFrame and shuffle it:

x = x.reindex(np.random.permutation(x.index)) 

since I arrive at the same memory error when I try

np.random.shuffle(x) 

Then I convert the pandas DataFrame back to a numpy array, and using this function I can obtain a train/test split:

# test_proportion of 3 means 1/3, i.e. 33% test and 67% train
def shuffle(matrix, target, test_proportion):
    # integer division so the result can be used as a slice index
    ratio = matrix.shape[0] // test_proportion
    x_train = matrix[ratio:, :]
    x_test = matrix[:ratio, :]
    # target is 1-D here, so it takes a single slice
    y_train = target[ratio:]
    y_test = target[:ratio]
    return x_train, x_test, y_train, y_test

x_train, x_test, y_train, y_test = shuffle(x, y, 3)
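One likely reason the slicing function is light on memory is that basic slices are views onto the original array, whereas indexing with an array of row numbers builds a new array. Here is a minimal sketch on a small stand-in array (the 9x4 shape is an assumption chosen so it runs anywhere):

```python
import numpy as np

# Small stand-in for the large matrix (assumption: same layout, far fewer rows/columns).
x = np.random.random((9, 4))

# Basic slicing, as in the function above: no data is copied.
x_test = x[:3, :]
x_train = x[3:, :]
slices_share = np.shares_memory(x, x_train) and np.shares_memory(x, x_test)

# Indexing with an array of row numbers: NumPy allocates a new array.
perm = np.random.permutation(x.shape[0])
x_shuffled = x[perm]
fancy_shares = np.shares_memory(x, x_shuffled)

print(slices_share, fancy_shares)  # True False
```

The slices share memory with x, while the permuted array does not.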

This works for now, and when I want to do k-fold cross-validation, I can iteratively loop k times and shuffle the pandas DataFrame. While this suffices for now, why do numpy's and scikit-learn's implementations of shuffle and train_test_split result in memory errors for big arrays?
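For the k-fold case, one possible alternative to reshuffling the whole DataFrame k times is to shuffle a list of row indices once and cut it into folds. The sketch below uses only numpy; kfold_indices is a hypothetical helper, not a scikit-learn function, and it assumes k is at least 2. It does not avoid copying altogether, since each fold still materializes its own rows.

```python
import numpy as np

def kfold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs from one shuffled range of row numbers."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(n_rows)       # shuffles indices, not the data
    folds = np.array_split(perm, k)      # list of k index arrays
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

# Small stand-in data (assumption: 12 rows so the folds divide evenly).
x = np.random.random((12, 5))
y = np.random.random((12,))

fold_sizes = []
for train_idx, test_idx in kfold_indices(x.shape[0], k=3):
    x_train, y_train = x[train_idx], y[train_idx]  # copies only this fold's rows
    x_test, y_test = x[test_idx], y[test_idx]
    fold_sizes.append((x_train.shape[0], x_test.shape[0]))
```

Peak memory here is x plus the rows copied for the current fold, provided the previous fold's arrays are released before the next iteration.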

