Python Pandas speed

phansi.work phansi.work at gmail.com
Thu Jul 9 02:09:50 PDT 2020


Hello,

I wonder whether I am missing some optimisation libraries on DragonFly?


When I calculate a covariance matrix using pandas, it is far slower on DragonFly than on Linux.
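
For reference, a minimal sketch of what the covariance step boils down to, assuming the data contains no NaN values: centre each column and take one matrix product. The helper name cov_by_matmul is made up for illustration and is not part of pandas. With 250 values per series and N series, the product needs roughly 250 * N * N multiply-adds, so the speed of the underlying matrix multiplication matters a lot.

```python
import numpy as np
import pandas as pd

def cov_by_matmul(df):
    """Sample covariance of the columns as a single matrix product.

    Hypothetical helper for illustration; assumes no missing values.
    """
    x = df.to_numpy(dtype=float)
    centered = x - x.mean(axis=0)            # subtract each column's mean
    n = x.shape[0]                           # number of values per series
    return centered.T @ centered / (n - 1)   # (nseries x nseries) matrix

# Small check against pandas: 250 values, 20 series.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((250, 20)))
assert np.allclose(cov_by_matmul(df), df.cov().to_numpy())
```
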


Results: each series always has 250 values; the number of series (num) varies. Times are in seconds.

        <------ DFLY ------>    <------ LINUX ----->
num     init       cov          init       cov
1000    0.609434   0.335977     0.392807   0.012132
3000    2.248877   3.412862     1.375551   0.062324
5000    4.797861   9.287197     2.690005   0.161746
7000    8.190682   18.66528     4.382373   0.29853
10000   14.64084   38.76979     7.367079   0.604834

Hope the formatting lasts. 

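1. The first number (init) is the time to create the data. DragonFly is slower here, but this is not something I am worried about.
2. The second (cov) is the covariance calculation, and this does not look good: for 10000 series DragonFly takes about 38.8 s against about 0.6 s on Linux.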

I use a virtual environment, pandas 1.0.5 and numpy 1.19.0 in both cases. 

The Linux machine has 16 GB RAM and the DragonFly machine has 8 GB, but top did not show any swap space being used.

Both CPUs are i5s, and both are around 4 to 5 years old.
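
With the same pandas and numpy versions on both machines, one thing that could still differ is the BLAS/LAPACK library numpy was built against. The sketch below prints numpy's build configuration with np.show_config() and times a plain matrix product. The helper time_matmul and the matrix size are arbitrary choices for illustration. A large gap between the two machines on this timing alone would suggest looking at the BLAS rather than at pandas, though it does not prove it.

```python
import time
import numpy as np

def time_matmul(n=1000, repeats=3):
    """Best-of-`repeats` wall-clock time (seconds) for an n x n matrix product.

    Hypothetical helper for illustration.
    """
    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ a                                  # dispatched to numpy's BLAS
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    np.show_config()                           # shows which BLAS/LAPACK numpy uses
    print("matmul seconds:", time_matmul())
```

Running this on both machines and comparing the printed configuration and timing is one way to narrow things down.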



Code below:


import numpy as np
import pandas as pd
import datetime
import pickle

def timeme(nvals, nseries):
    t1 = datetime.datetime.now()
    # initialise data
    df = pd.DataFrame()
    for i in range(nseries):
        df[str(i)] = np.random.random_sample(size=nvals)
    t2 = datetime.datetime.now()
    # calculate covariance
    s = df.cov()
    t3 = datetime.datetime.now()
    return (t2 - t1).total_seconds(), (t3 - t2).total_seconds()

def main():
    nvals = 250
    x = {}
    for nseries in [ 1000, 3000, 5000, 7000, 10000 ]:
        init_time, calc_time = timeme(nvals, nseries)
        x[(nvals, nseries)] = (init_time, calc_time)
    return x
                                                            
x = main()
with open("data.pickle", "wb") as fpw:
    pickle.dump(x, fpw)
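
A variant of the timing function, as a sketch only: it builds the frame in a single call rather than column by column, uses time.perf_counter() for the intervals, and times np.cov on the raw array alongside df.cov(). The name timeme_split and the seed parameter are made up for illustration. If the two covariance timings come out similar on a given machine, that would hint that the time is going into numpy rather than into pandas itself.

```python
import time
import numpy as np
import pandas as pd

def timeme_split(nvals, nseries, seed=0):
    """Return (init, pandas_cov, numpy_cov) wall-clock times in seconds.

    Hypothetical variant of timeme() for illustration.
    """
    rng = np.random.default_rng(seed)
    t1 = time.perf_counter()
    # Build all columns at once instead of inserting them one at a time.
    df = pd.DataFrame(rng.random((nvals, nseries)),
                      columns=[str(i) for i in range(nseries)])
    t2 = time.perf_counter()
    df.cov()                                   # covariance through pandas
    t3 = time.perf_counter()
    np.cov(df.to_numpy(), rowvar=False)        # same data, numpy directly
    t4 = time.perf_counter()
    return t2 - t1, t3 - t2, t4 - t3

if __name__ == "__main__":
    for nseries in [1000, 3000]:
        print(nseries, timeme_split(250, nseries))
```
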

cheers
phansi
<phansi.work at gmail.com>


