Saturday, 15 June 2013

python - Why is pandas DataFrame more expensive than numpy ndarray?



I was benchmarking pandas DataFrame creation and found that it is more expensive than numpy ndarray creation.

Benchmark code:

    from timeit import Timer

    setup = """
    import numpy as np
    import pandas as pd
    """

    numpy_code = """
    data = np.zeros(shape=(360,), dtype=[('a', 'f4'), ('b', 'f4'), ('c', 'f4')])
    """

    pandas_code = """
    df = pd.DataFrame(np.zeros(shape=(360,), dtype=[('a', 'f4'), ('b', 'f4'), ('c', 'f4')]))
    """

    # Each repeat runs the statement 10 times; the best total is reported in microseconds.
    # Python 2 print statements, matching the Python 2.7.6 setup listed under system details.
    print "numpy", min(Timer(numpy_code, setup=setup).repeat(10, 10)) * 10**6, "micro-seconds"
    print "pandas", min(Timer(pandas_code, setup=setup).repeat(10, 10)) * 10**6, "micro-seconds"

The output:

    numpy 17.5073728315 micro-seconds
    pandas 1757.9817013 micro-seconds

I was wondering if someone could help me understand why pandas DataFrame creation is more expensive than ndarray construction. And if I am doing something wrong, could you please help me improve the performance?

System details:

    pandas version: 0.12.0
    numpy version: 1.9.0
    python 2.7.6 (32-bit) running on windows 7

For a homogeneous-dtyped numpy array, the performance difference between the two creations is quite minuscule: no copying is done, and the array is simply passed through.

However, for heterogeneous-dtyped numpy arrays, the data are segregated by dtype (which may involve copying, especially if the input has non-contiguous dtypes) into separate blocks, each holding a single dtype (as a numpy array).
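A minimal sketch of what the dtype segregation looks like from the outside, assuming a made-up record layout with two float32 fields and one int32 field (the field names and sizes are only illustrative). It uses public attributes only, so it shows the per-column dtypes rather than the internal blocks themselves:

```python
import numpy as np
import pandas as pd

# Hypothetical record layout: two float32 fields and one int32 field.
records = np.zeros(360, dtype=[("a", "f4"), ("b", "i4"), ("c", "f4")])

# The constructor has to pull each field out of the interleaved records.
df = pd.DataFrame(records)

# Each column keeps the dtype of the field it came from.
print(df.dtypes)

# Count how many columns share each dtype.
dtype_counts = df.dtypes.astype(str).value_counts().to_dict()
print(dtype_counts)
```

The fields of a structured array are interleaved in memory, which is one reason extracting them into per-dtype storage can require copying.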

Other types of data trigger different amounts of checks (e.g. lists are scrutinized to see if they are 1-d, 2-d, etc.), and various checks relating to coercions of datetime-likes occur.
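As a small illustration of the results of those checks (not of the checks themselves), the sketch below passes a flat list, a nested list, and a column of `datetime` objects to the constructor. The variable names are arbitrary and the inputs are made up:

```python
from datetime import datetime

import pandas as pd

# A flat list is treated as a single column.
flat = pd.DataFrame([1, 2, 3])

# A list of lists is treated as rows of a 2-d table.
nested = pd.DataFrame([[1, 2], [3, 4]])

# Python datetime objects are coerced to a datetime64 column.
stamps = pd.DataFrame({"t": [datetime(2013, 6, 15), datetime(2013, 6, 16)]})

print(flat.shape, nested.shape)
print(stamps.dtypes)

is_datetime_column = pd.api.types.is_datetime64_any_dtype(stamps["t"])
print(is_datetime_column)
```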

The reasons for the upfront dtype separation are simple. You can then perform operations that operate differently on different dtypes without run-time separation (and the corresponding slicing performance issues).

To be honest, this is a very slight performance hit to take for all of the attendant advantages of using a DataFrame, namely a consistent, intuitive API that handles null data and different dtypes intelligently.
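To make the null-handling point concrete, here is a short sketch with made-up values comparing a plain numpy mean with the pandas equivalent on data that contains a missing entry. The column names are hypothetical:

```python
import numpy as np
import pandas as pd

raw = np.array([1.0, np.nan, 3.0])

# Mixed dtypes in one frame: a float column and an object/string column.
df = pd.DataFrame({"x": raw, "label": ["a", None, "c"]})

# numpy propagates NaN through the reduction.
numpy_mean = raw.mean()

# pandas skips missing values by default (skipna=True).
pandas_mean = df["x"].mean()

# Missing entries are counted the same way for both dtypes.
missing_per_column = df.isna().sum().to_dict()

print(numpy_mean, pandas_mean, missing_per_column)
```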

Homogeneous case, which involves no copying:

    In [41]: %timeit np.ones((10000,100))
    1000 loops, best of 3: 399 µs per loop

    In [42]: arr = np.ones((10000,100))

    In [43]: %timeit DataFrame(arr)
    10000 loops, best of 3: 65.9 µs per loop
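Since all three fields in the question share the same dtype, one thing worth trying is to build the frame from a plain 2-d float32 array instead of a structured array. The sketch below times both constructions. The helper names `from_structured` and `from_plain` are hypothetical, and the timings will vary with the pandas and numpy versions in use, so no particular speedup is claimed:

```python
import timeit

import numpy as np
import pandas as pd

# Same data as in the question, in two layouts.
structured = np.zeros(360, dtype=[("a", "f4"), ("b", "f4"), ("c", "f4")])
plain = np.zeros((360, 3), dtype="f4")


def from_structured():
    # Fields must be extracted from the interleaved records.
    return pd.DataFrame(structured)


def from_plain():
    # Homogeneous 2-d input with explicit column names.
    return pd.DataFrame(plain, columns=["a", "b", "c"])


runs = 100
t_structured = min(timeit.repeat(from_structured, number=runs, repeat=5)) / runs
t_plain = min(timeit.repeat(from_plain, number=runs, repeat=5)) / runs

print("structured: %.1f us per call" % (t_structured * 1e6))
print("plain 2-d:  %.1f us per call" % (t_plain * 1e6))
```

Both calls produce a frame with columns `a`, `b`, `c` of dtype float32, so the two layouts are interchangeable for this data.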

python numpy pandas
