Saturday, 15 September 2012

python - Efficient combined in-place adding/removing of rows of a huge 2D numpy array




I have a 2D numpy array, and it is huge. I have a computer with memory, and it is not huge. A single copy of the array fits snugly in the computer's memory. A second copy of the array brings the computer to its knees, crying.

Before I can start cutting the matrix into smaller, more manageable chunks, I need to add a few rows and remove some. Luckily, I need to remove more rows than I add, so in theory this can be done in-place. I'm working on a function to accomplish this, but I'm curious about any advice you can give me.

The plan so far:

1. Make a list of rows to remove
2. Make a matrix of rows to add
3. Replace rows to remove with rows to add (one by one, cannot use fancy indexing here?)
4. Move any rows that still need to be removed to the end of the matrix
5. Call .resize() on the matrix to resize it in memory

Step 4 especially is hard to implement efficiently.

Code so far:

    import numpy as np

    n_rows = 100
    n_columns = 1000000
    n_rows_to_drop = 20
    n_rows_to_add = 10

    # init huge array
    data = np.random.rand(n_rows, n_columns)

    # rows to drop
    to_drop = np.arange(n_rows)
    np.random.shuffle(to_drop)
    to_drop = to_drop[:n_rows_to_drop]

    # rows to add
    new_data = np.random.rand(n_rows_to_add, n_columns)

    # start replacing rows with new rows
    for new_data_idx, to_drop_idx in enumerate(to_drop):
        if new_data_idx >= n_rows_to_add:
            break  # no more new data to add

        # replace a row to drop with a new row
        data[to_drop_idx] = new_data[new_data_idx]

    # these should still be dropped
    to_drop = to_drop[n_rows_to_add:]
    to_drop.sort()

    # create a list of row indices to keep, last rows first
    to_keep = set(range(n_rows)) - set(to_drop)
    to_keep = list(to_keep)
    to_keep.sort()
    to_keep = to_keep[::-1]

    # replace rows to drop with rows at the end of the matrix
    for to_drop_idx, to_keep_idx in zip(to_drop, to_keep):
        if to_drop_idx > to_keep_idx:
            # all remaining rows to drop are at the end of the matrix
            break
        data[to_drop_idx] = data[to_keep_idx]

    # resize matrix in memory
    data.resize(n_rows - n_rows_to_drop + n_rows_to_add, n_columns)

This seems to work, but is there a way to make it more elegant/efficient? And is there a way to check whether a copy of the huge array is made at some point?
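On the question of checking for copies, one possible approach (a rough sketch, not a definitive test) is to watch peak allocations with the standard-library tracemalloc module while the update runs, and to use np.shares_memory to tell views from copies. This assumes a NumPy version that reports its buffer allocations to tracemalloc, which reasonably recent releases do as far as I know. The helper name peak_bytes_during and the small stand-in array are made up for illustration.

```python
import tracemalloc

import numpy as np


def peak_bytes_during(operation):
    """Run operation() and return the peak number of bytes allocated meanwhile."""
    tracemalloc.start()
    try:
        operation()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


data = np.random.rand(100, 1000)   # small stand-in for the huge array
new_rows = np.random.rand(3, 1000)


def in_place_update():
    data[[5, 17, 42]] = new_rows   # fancy-index assignment writes into data
    data[7] = data[99]             # single-row copy, no large temporary


def full_copy():
    return data.copy()             # deliberately duplicates the whole buffer


peak_in_place = peak_bytes_during(in_place_update)
peak_copy = peak_bytes_during(full_copy)

print("array size:    ", data.nbytes)
print("in-place peak: ", peak_in_place)
print("full-copy peak:", peak_copy)

# A basic slice is a view on the same buffer; a fancy-indexed read is a copy.
slice_is_view = np.shares_memory(data, data[:2])
fancy_read_is_view = np.shares_memory(data, data[[0, 1]])
print(slice_is_view, fancy_read_is_view)
```

If the in-place peak stays far below the array size while the deliberate copy shows up at roughly the full size, that is decent evidence that no hidden duplicate was created. On the real array the same measurement can be wrapped around each step of the plan separately.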

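This seems to do the same as your code, but is a little more brief. I'm relatively sure no copies of the big array are made here: fancy indexing on the left-hand side of an assignment writes into the existing array rather than producing a new one.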

    import numpy as np

    n_rows = 100
    n_columns = 100000
    n_rows_to_drop = 20
    n_rows_to_add = 10

    # init huge array
    data = np.random.rand(n_rows, n_columns)

    # rows to drop: exactly n_rows_to_drop unique indices, sorted
    to_drop = np.sort(np.random.permutation(n_rows)[:n_rows_to_drop])

    # rows to add
    new_data = np.random.rand(n_rows_to_add, n_columns)

    # start replacing rows with new rows
    data[to_drop[:n_rows_to_add]] = new_data

    # these should still be dropped
    to_drop = to_drop[n_rows_to_add:]

    # row indices to keep, last rows first
    to_keep = np.setdiff1d(np.arange(n_rows), to_drop, assume_unique=True)[::-1]

    # replace rows to drop with rows at the end of the matrix
    for to_drop_i, to_keep_i in zip(to_drop, to_keep):
        if to_drop_i > to_keep_i:
            # all remaining rows to drop are at the end of the matrix
            break
        data[to_drop_i] = data[to_keep_i]

    # resize matrix in memory
    data.resize(n_rows - n_rows_to_drop + n_rows_to_add, n_columns)
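For reuse, the same plan could be wrapped in a function. The following is only a sketch under stated assumptions: the name replace_and_drop_rows and the "holes"/"movers" wording are invented here, the indices in to_drop are assumed unique, and there must be no more new rows than dropped rows. The order of the surviving rows is not preserved.

```python
import numpy as np


def replace_and_drop_rows(data, to_drop, new_rows):
    """Overwrite dropped rows with new_rows, compact the rest, return the final row count.

    Assumes to_drop holds unique row indices and len(new_rows) <= len(to_drop).
    The caller still has to shrink the array (step 5 of the plan).
    """
    to_drop = np.sort(np.asarray(to_drop))
    n_rows = data.shape[0]
    n_add = len(new_rows)
    n_final = n_rows - len(to_drop) + n_add

    # Steps 1-3: reuse the first dropped slots for the new rows.
    data[to_drop[:n_add]] = new_rows
    still_to_drop = to_drop[n_add:]

    # Step 4: dropped slots inside the final region are "holes"; kept rows
    # beyond the final region are "movers". There are equally many of each.
    holes = still_to_drop[still_to_drop < n_final]
    keep_mask = np.ones(n_rows, dtype=bool)
    keep_mask[still_to_drop] = False
    movers = np.flatnonzero(keep_mask[n_final:]) + n_final

    # Copy row by row so each step only touches a single row.
    for hole, mover in zip(holes, movers):
        data[hole] = data[mover]

    return n_final


# Toy example: every row is filled with its own row index, new rows with -1.
data = np.zeros((10, 4))
data[:] = np.arange(10)[:, None]
new_rows = np.full((2, 4), -1.0)

n_final = replace_and_drop_rows(data, [1, 4, 6, 8], new_rows)

# Step 5. refcheck=False skips the reference-count check, which interactive
# sessions tend to trip; it is only safe if no other views of data exist.
data.resize((n_final, data.shape[1]), refcheck=False)

# First column is now 0, -1, 2, 3, -1, 5, 9, 7: rows 1 and 4 were replaced,
# row 9 moved into the hole at row 6, and row 8 fell off the end.
print(data[:, 0])
```

Picking holes and movers relative to the final row count means only the rows that actually have to move get copied, which is the part of step 4 that the loops above handle with the early break.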

python numpy
