Python Pandas speed

karu.pruun karu.pruun at gmail.com
Thu Jul 9 06:05:38 PDT 2020


Hello

I do not know what causes the difference, but one likely issue is
that the Linux libraries have been built with optimizations that
DragonFly is missing. I do not mean optimizations as in compiler flags
(although these count as well); I mean different code paths. The
challenge is to identify which part of the whole stack is responsible
for the optimization and then find the relevant locations in the
source code; typically you are looking for #ifdef blocks in the
sources and in the build machinery of the software.
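
One cheap thing to compare between the two machines, for example, is
which BLAS/LAPACK implementation numpy was built against; I am only
assuming that matters for this particular workload, but a plain
reference BLAS versus an optimized one (OpenBLAS, MKL, ...) is exactly
the kind of different code path I mean:

# Sketch: show which BLAS/LAPACK numpy was built against on each system.
import numpy as np
np.show_config()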

It would be very desirable to fix issues like these. Pandas and numpy
are both fundamental to the Python scientific computing stack.

Could you fix your example code so that it prints out the exact table
you list above? That would make it easier to test if anyone is willing
to pick this up. Also, can you specify whether larger or smaller
numbers are better? In the current case smaller is better, but please
put that in the description so it is clear.
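
For what it is worth, here is a minimal sketch of what that printout
could look like, assuming the main() function from your code below and
that all times are in seconds (this would simply replace the pickle
dump at the end):

# Sketch only: print the results dict from main() as a table.
# Times are in seconds; smaller is better.
x = main()
print("num        init(s)     cov(s)")
for (nvals, nseries), (init_time, cov_time) in sorted(x.items()):
    print(f"{nseries:<8d} {init_time:>10.6f} {cov_time:>10.6f}")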

So as a first step you could try to narrow down which part of the
software is responsible for the slow performance, perhaps by writing
more test cases that exercise, say, only numpy (see the sketch below).
As a second step, you can dig further and try to optimize that
dependency.
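
As a rough illustration, something like the following (only a sketch;
the shapes mirror your benchmark, 250 observations per series) would
show whether plain numpy is already slow on DragonFly or whether the
overhead comes from the pandas side:

# Sketch: time the same covariance computation with plain numpy only,
# to separate pandas overhead from the underlying numpy performance.
import time
import numpy as np

for nseries in (1000, 3000, 5000, 7000, 10000):
    data = np.random.random_sample(size=(250, nseries))
    t0 = time.perf_counter()
    np.cov(data, rowvar=False)   # each column is one series
    t1 = time.perf_counter()
    print(f"{nseries:6d}  numpy cov: {t1 - t0:.3f} s")

If the numpy-only numbers are already bad, the problem is below
pandas; if they look fine, the pandas side is the more likely suspect.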

Cheers

Peeter

--


On Thu, Jul 9, 2020 at 12:10 PM phansi.work <phansi.work at gmail.com> wrote:
>
> Hello,
>
> I wonder whether I am missing some optimisation libs?
>
>
> When I calculate a covariance matrix using pandas, the performance is not great.
>
>
> Results: each series always has 250 values; the number of series varies. All times are in seconds; smaller is better.
>
>          <------- DFLY ------->     <------ LINUX ------->
> num       init (s)     cov (s)      init (s)     cov (s)
> 1000      0.609434     0.335977     0.392807     0.012132
> 3000      2.248877     3.412862     1.375551     0.062324
> 5000      4.797861     9.287197     2.690005     0.161746
> 7000      8.190682     18.66528     4.382373     0.29853
> 10000     14.64084     38.76979     7.367079     0.604834
>
> Hope the formatting lasts.
>
> 1. The first number is the time to create the data; DragonFly is slower, but this is not something I am worried about.
> 2. The second is the covariance calculation; this does not look good.
>
> I use a virtual environment, pandas 1.0.5 and numpy 1.19.0 in both cases.
>
> The Linux machine has 16 GB RAM and the DragonFly one has 8 GB, but top did not show any swap space being used.
>
> Both CPUs are i5s, and both are around 4 to 5 years old.
>
>
>
> Code below:
>
>
> import numpy as np
> import pandas as pd
> import datetime
> import pickle
>
> def timeme(nvals, nseries):
>     t1 = datetime.datetime.now()
>     # initialise data
>     df = pd.DataFrame()
>     for i in range(nseries):
>         df[str(i)] = np.random.random_sample(size=nvals)
>     t2 = datetime.datetime.now()
>     # calculate covariance
>     s = df.cov()
>     t3 = datetime.datetime.now()
>     return (t2 - t1).total_seconds(), (t3 - t2).total_seconds()
>
> def main():
>     nvals = 250
>     x = {}
>     for nseries in [ 1000, 3000, 5000, 7000, 10000 ]:
>         init_time, calc_time = timeme(nvals, nseries)
>         x[(nvals, nseries)] = (init_time, calc_time)
>     return x
>
> x = main()
> with open("data.pickle", "wb") as fpw:
>     pickle.dump(x, fpw)
>
> cheers
> phansi
> <phansi.work at gmail.com>


