python - Pandas Series: faster way of computing periods back to preceding high
I have a time series of price information stored as open, high, low, close values in a DataFrame.
I want to create a new column in which each element records how many days back you need to go to find a higher high in the source array.
So for a series like this:
    import pandas as pd
    import numpy as np

    my_vals = pd.Series([10.1, 9.0, 2.4, 8.2, 7.0, 6.1, 5.4, 9.4, 8.7, 11.8,
                         3.5, 4.7, 5.4, 6.4, 7.8, 8.0, 9.1, 10.2, 11.0, 2.0])
we would get these values: [nan, 1, 1, 2, 1, 1, 1, 7, 1, nan, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1]
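To pin down the definition, here is a minimal pure-Python sketch of the same calculation as a naive double loop. The function name `periods_since_higher` is made up for illustration, and it assumes "higher" means strictly greater. It is O(n**2) in the worst case, so it is only meant as a reference to check faster versions against.

```python
import math

def periods_since_higher(values):
    """Naive reference: distance back to the nearest strictly higher value."""
    result = []
    for i, current in enumerate(values):
        distance = math.nan  # stays NaN when no earlier value is higher
        # Walk backwards from the previous element until a higher value is found.
        for j in range(i - 1, -1, -1):
            if values[j] > current:
                distance = i - j
                break
        result.append(distance)
    return result

my_vals = [10.1, 9.0, 2.4, 8.2, 7.0, 6.1, 5.4, 9.4, 8.7, 11.8,
           3.5, 4.7, 5.4, 6.4, 7.8, 8.0, 9.1, 10.2, 11.0, 2.0]
print(periods_since_higher(my_vals))
```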
I wrote some code using rolling_apply. It works, but it is really slow, and I'm convinced there's a far better way to do this.
    def countdayssincehigherhigh(x):
        aaa = pd.Series(x)
        zzz = x[-1]             # looking for values higher than zzz
        bbb = aaa[:-1]          # array without the last element
        ccc = bbb[bbb > zzz]    # elements higher than zzz
        ddd = ccc.last_valid_index()
        if ddd is None:
            return np.nan       # or return 10000 to match the window length
        else:
            return aaa.last_valid_index() - ddd
And to compute the new column I do:

    new_col = pd.rolling_apply(my_vals, 10000, countdayssincehigherhigh, min_periods=0)
Any advice appreciated :)
You can do this with two loops, but the worst-case time complexity may be O(n**2). Here is a method that can do it in O(n*log(n)):
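The algorithm:
Use argsort() to sort the array and get the sorted positions as the array index.
For every element idx in index, look at the elements that come after idx in index (the positions of the larger values) and find the largest one that is less than idx.
To do this quickly, you can use a SortedList.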
Here are two libraries that implement a sorted list: http://www.grantjenks.com/docs/sortedcontainers/sortedlist.html
http://stutzbachenterprises.com/blist/sortedlist.html
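Here is the code:

    import numpy as np
    from sortedcontainers import SortedList

    def nearest_hi_value(my_vals):
        # positions ordered from the smallest value to the largest
        index = np.argsort(my_vals)
        # positions not yet processed, i.e. those holding the larger values
        sl = SortedList(range(len(index)))
        res = []
        for idx in index.tolist():
            sl.remove(idx)
            idx2 = sl.bisect_left(idx)
            if idx2 > 0:
                # nearest remaining position to the left of idx
                res.append(idx - sl[idx2 - 1])
            else:
                # no higher value before idx
                res.append(0)
        result = np.zeros_like(index)
        result[index] = res
        return result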
If two consecutive elements in the array are equal, nearest_hi_value() may return 1, but this can be fixed easily.
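One possible way to handle the tie case, sketched here under a few assumptions: visit equal values in ascending position order, so that an earlier equal value is already removed by the time a later one is processed. Python's `sorted()` is stable and gives that order directly; in the NumPy version, `np.argsort(my_vals, kind="stable")` should play the same role. The function name `nearest_higher_stable` is hypothetical, and a plain list with `bisect` stands in for the sorted list, so removals are O(n) here rather than O(log n). It also returns NaN instead of 0 when there is no higher value, to match the question.

```python
import bisect
import math

def nearest_higher_stable(values):
    """Distance back to the nearest strictly higher value, NaN if none."""
    n = len(values)
    # sorted() is stable, so equal values are visited in ascending position.
    order = sorted(range(n), key=values.__getitem__)
    # Positions not yet visited, kept in ascending order.
    # Assumption: a plain list stands in for a real sorted container.
    remaining = list(range(n))
    result = [math.nan] * n
    for idx in order:
        pos = bisect.bisect_left(remaining, idx)
        del remaining[pos]  # remove idx itself
        # Earlier equal values were removed before idx was visited, so any
        # remaining position to the left of idx holds a strictly higher value.
        if pos > 0:
            result[idx] = idx - remaining[pos - 1]
    return result

print(nearest_higher_stable([3, 1, 1, 2]))
```

With equal neighbours such as the two 1s above, the second 1 looks past the first and measures its distance to the 3.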
Here is the result check:

    my_vals = np.random.rand(1000)
    res1 = pd.rolling_apply(my_vals, 10000, countdayssincehigherhigh, min_periods=0)
    res2 = nearest_hi_value(my_vals)
    np.allclose(res1, res2)
Here is the timeit result:

    %timeit pd.rolling_apply(my_vals, 10000, countdayssincehigherhigh, min_periods=0)
    %timeit nearest_hi_value(my_vals)

Output:

    1 loops, best of 3: 489 ms per loop
    100 loops, best of 3: 10.4 ms per loop
python numpy pandas