Monday, 15 February 2010

python - Pandas Series: faster way of computing periods back to preceding high




I have a time series of price data stored as open, high, low, close values in a DataFrame.

I want to create a new column in which each element records how many days back you need to go to find a higher high in the source array.

So for a series like this:

import pandas as pd
import numpy as np

my_vals = pd.Series([10.1, 9.0, 2.4, 8.2, 7.0, 6.1, 5.4, 9.4, 8.7, 11.8,
                     3.5, 4.7, 5.4, 6.4, 7.8, 8.0, 9.1, 10.2, 11.0, 2.0])

we would get these values: [nan, 1, 1, 2, 1, 1, 1, 7, 1, nan, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1]
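To make the definition concrete, here is a minimal brute-force sketch in plain Python. The name periods_since_higher is hypothetical (it does not appear in the code in this post), and it assumes "higher" means strictly greater, with NaN where no earlier value is higher.

```python
import math


def periods_since_higher(values):
    """For each position, count the steps back to the nearest strictly higher value."""
    out = []
    for i, current in enumerate(values):
        distance = math.nan  # stays NaN when no earlier value is higher
        for j in range(i - 1, -1, -1):  # walk backwards from i - 1 to 0
            if values[j] > current:
                distance = i - j
                break
        out.append(distance)
    return out


print(periods_since_higher([10.1, 9.0, 2.4, 8.2]))  # [nan, 1, 1, 2]
```

This is quadratic in the worst case, so it is only useful as a reference to check faster versions against.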

I wrote some code using rolling_apply, which works, but it is really slow, and I'm convinced there's a far better way to do this.

def countdayssincehigherhigh(x):
    aaa = pd.Series(x)
    zzz = x[-1]  # looking for values higher than zzz
    bbb = aaa[:-1]  # array without the last element
    ccc = bbb[bbb > zzz]  # elements higher than zzz, selected with a boolean mask
    ddd = ccc.last_valid_index()
    if ddd is None:
        return np.nan  # or return 10000 to match the window length
    else:
        return aaa.last_valid_index() - ddd

And to compute the new column I do:

new_col = pd.rolling_apply(my_vals, 10000, countdayssincehigherhigh, min_periods = 0 )

Any advice appreciated :)

You can do this with two nested loops, but the worst-case time complexity is probably O(n**2). Here is a method that can do it in O(n*log(n)):

The algorithm:

1. argsort() the array to get an index array.
2. For every element idx in index, look at the elements that come after it in index and find the largest one that is less than idx. To do this quickly, we can use a SortedList.

Here are two libraries that implement a sorted list:

http://www.grantjenks.com/docs/sortedcontainers/sortedlist.html

http://stutzbachenterprises.com/blist/sortedlist.html

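Here is the code:

import numpy as np
from sortedcontainers import SortedList

def nearest_hi_value(my_vals):
    index = np.argsort(my_vals)  # positions ordered from lowest to highest value
    # load is a tuning parameter from older sortedcontainers releases;
    # newer releases may not accept it
    sl = SortedList(range(len(index)), load=100)
    res = []
    for idx in index.tolist():
        sl.remove(idx)  # sl now holds only positions with higher values
        idx2 = sl.bisect_left(idx)  # number of remaining positions before idx
        if idx2 > 0:
            res.append(idx - sl[idx2-1])  # distance to the nearest earlier one
        else:
            res.append(0)  # no earlier higher value
    result = np.zeros_like(index)
    result[index] = res
    return result

If two consecutive elements in the array are the same, nearest_hi_value() may return 1, but that can be fixed easily.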
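One possible way to handle equal values, sketched under assumptions: if the argsort is stable, equal values keep their original order, so the earlier of two equal elements is always removed first and is never mistaken for a higher one. The name nearest_hi_value_stable is hypothetical, and a plain sorted list with the stdlib bisect module stands in for SortedList here so the snippet runs without extra packages (at the cost of O(n) removals).

```python
from bisect import bisect_left

import numpy as np


def nearest_hi_value_stable(my_vals):
    """Distance back to the previous strictly higher value (0 if there is none)."""
    vals = np.asarray(my_vals)
    # A stable sort keeps equal values in their original order, so of two
    # equal elements the earlier index is always processed (removed) first.
    index = np.argsort(vals, kind="stable")
    remaining = list(range(len(vals)))  # kept sorted; stand-in for SortedList
    result = np.zeros(len(vals), dtype=int)
    for idx in index.tolist():
        pos = bisect_left(remaining, idx)
        del remaining[pos]  # O(n) on a plain list; SortedList.remove avoids this
        if pos > 0:
            # remaining[pos - 1] is the closest earlier position that still
            # holds a strictly higher value
            result[idx] = idx - remaining[pos - 1]
    return result


print(nearest_hi_value_stable([5.0, 3.0, 3.0, 4.0, 6.0]))
```

In this example the second 3.0 should look past the first 3.0 and report the distance back to the 5.0.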

Here is the result check:

my_vals = np.random.rand(1000)
res1 = pd.rolling_apply(my_vals, 10000, countdayssincehigherhigh, min_periods = 0 )
res2 = nearest_hi_value(my_vals)
np.allclose(res1, res2)
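One thing to watch when comparing the two results: as written, countdayssincehigherhigh returns NaN when there is no earlier higher value, while nearest_hi_value returns 0 in that case. A small sketch of normalizing before the comparison, using made-up arrays rather than output from the run above:

```python
import numpy as np

# Illustrative values only: NaN and 0 both mean "no earlier higher value"
res1 = np.array([np.nan, 1.0, 1.0, 2.0])
res2 = np.array([0, 1, 1, 2])

raw_match = np.allclose(res1, res2)  # False: NaN is never close to 0
normalized_match = np.allclose(np.nan_to_num(res1), res2)  # True: NaN mapped to 0

print(raw_match, normalized_match)
```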

Here is the timeit result:

%timeit pd.rolling_apply(my_vals, 10000, countdayssincehigherhigh, min_periods = 0 )
%timeit nearest_hi_value(my_vals)

Output:

1 loops, best of 3: 489 ms per loop
100 loops, best of 3: 10.4 ms per loop

python numpy pandas
