python - faster way to calculate a rolling sum in a dataframe

Question

Welcome To Ask or Share your Answers For Others

python - faster way to calculate a rolling sum in a dataframe

asked Jan 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - faster way to calculate a rolling sum in a dataframe

To calculate a volume weighted moving average (VWMA) I am collecting a sum(price*volume) and dividing it by the sum(volume).

I need a faster way to get a value from the previous row and add it to a value on the current row.

I have the following dataframe:

import pandas as pd
from itertools import repeat

df = pd.DataFrame({'dtime': ['16:00', '15:00', '14:00', '13:00', '12:00', '11:00', '10:00', '09:00', '08:00', '07:00', '06:00', '05:00', '04:00', '03:00', '02:00', '01:00'],
'time': [1800, 1740, 1680, 1620, 1560, 1500, 1440, 1380, 1320, 1260, 1200, 1140, 1080, 1020, 960, 900],
'price': [100.1, 102.7, 108.5, 105.3, 107.1, 103.4, 101.8, 102.7, 101.6, 99.8, 100.2, 97.7, 99.3, 100.1, 102.5, 103.9],
'volume': [6.0, 6.5, 5.4, 6.3, 6.4, 7.1, 6.7, 6.2, 5.7, 1.2, 2.4, 3.9, 5.2, 8.9, 7.2, 6.5]
}, columns = ['dtime', 'time', 'price', 'volume']).set_index('dtime')

df.insert(df.shape[1], "PV", df['price']*df['volume'])
df.insert(df.shape[1], "flag", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "PVsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "Vsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "VWMA_2", list(repeat(0.0,len(df))))

Which is

df = 
       time  price  volume      PV   flag  PVsum_2  Vsum_2  VWMA_2
dtime                                                             
16:00  1800  100.1     6.0  600.60    0.0      0.0     0.0     0.0
15:00  1740  102.7     6.5  667.55    0.0      0.0     0.0     0.0
14:00  1680  108.5     5.4  585.90    0.0      0.0     0.0     0.0
13:00  1620  105.3     6.3  663.39    0.0      0.0     0.0     0.0
12:00  1560  107.1     6.4  685.44    0.0      0.0     0.0     0.0
11:00  1500  103.4     7.1  734.14    0.0      0.0     0.0     0.0
10:00  1440  101.8     6.7  682.06    0.0      0.0     0.0     0.0
09:00  1380  102.7     6.2  636.74    0.0      0.0     0.0     0.0
08:00  1320  101.6     5.7  579.12    0.0      0.0     0.0     0.0
07:00  1260   99.8     1.2  119.76    0.0      0.0     0.0     0.0
06:00  1200  100.2     2.4  240.48    0.0      0.0     0.0     0.0
05:00  1140   97.7     3.9  381.03    0.0      0.0     0.0     0.0
04:00  1080   99.3     5.2  516.36    0.0      0.0     0.0     0.0
03:00  1020  100.1     8.9  890.89    0.0      0.0     0.0     0.0
02:00   960  102.5     7.2  738.00    0.0      0.0     0.0     0.0
01:00   900  103.9     6.5  675.35    0.0      0.0     0.0     0.0

Right now I am using a for loop to check each row if 'flag' is set.

#----pseudo code----
#for each row in df (from bottom to top, excluding the very bottom row)
#   if flag[row] is not set:
#      PVsum_2[row] = PV[row] + PV[row + 1]
#      Vsum_2[row]  = volume[row] + volume[row + 1]
#      VWMA_2[row]  = PVsum_2[row] / Vsum_2[row]
#      flag[row] = 1.0
#----pseudo code----

my_dict = {'dtime'  :   0,
           'time'   :   1,
           'price'  :   2,
           "volume" :   3,
           'PV'     :   4,
           'check'  :   5,
           'PVsum_2':   6,
           'Vsum_2' :   7,
           'VWMA_2' :   8}

for row in reversed(range(len(df)-1)):
    # if flag value is not set (i.e. flag == 0)
    if not df['flag'][row]:
        # sum of current and previous PV (price*volume) values
        a = df['PV'][row] + df['PV'][row+1]
        df.iloc[row, my_dict['PVsum_2']-1] = a
        
        # sum of current and previous volumes
        b = df['volume'][row] + df['volume'][row+1]
        df.iloc[row, my_dict['Vsum_2']-1] = b
        
        # PVsum_2 / Vsum_2
        c = (a / b) if b != 0.0 else 0.0                                                 
        df.iloc[row, my_dict['VWMA_2']-1] = c
        
        # set check value to 1.0
        df.iloc[row, my_dict['flag']-1] = 1.0

but this takes too long on large sets of data (500+ rows)

I'm looking for something faster and more elegant.

The dataframe should look like this when it is done (notice the bottom row has not been altered):

df = 

       time  price  volume      PV  flag  PVsum_2  Vsum_2      VWMA_2
dtime                                                                
16:00  1800  100.1     6.0  600.60   1.0  1268.15    12.5  101.452000
15:00  1740  102.7     6.5  667.55   1.0  1253.45    11.9  105.331933
14:00  1680  108.5     5.4  585.90   1.0  1249.29    11.7  106.776923
13:00  1620  105.3     6.3  663.39   1.0  1348.83    12.7  106.207087
12:00  1560  107.1     6.4  685.44   1.0  1419.58    13.5  105.154074
11:00  1500  103.4     7.1  734.14   1.0  1416.20    13.8  102.623188
10:00  1440  101.8     6.7  682.06   1.0  1318.80    12.9  102.232558
09:00  1380  102.7     6.2  636.74   1.0  1215.86    11.9  102.173109
08:00  1320  101.6     5.7  579.12   1.0   698.88     6.9  101.286957
07:00  1260   99.8     1.2  119.76   1.0   360.24     3.6  100.066667
06:00  1200  100.2     2.4  240.48   1.0   621.51     6.3   98.652381
05:00  1140   97.7     3.9  381.03   1.0   897.39     9.1   98.614286
04:00  1080   99.3     5.2  516.36   1.0  1407.25    14.1   99.804965
03:00  1020  100.1     8.9  890.89   1.0  1628.89    16.1  101.173292
02:00   960  102.5     7.2  738.00   1.0  1413.35    13.7  103.164234
01:00   900  103.9     6.5  675.35   0.0     0.00     0.0    0.000000

Eventually new data will be added to the top of the data frame as seen below, and will need to be updated again.

df = 
           time  price  volume      PV  flag  PVsum_2  Vsum_2  VWMA_2
    dtime                                                             
    19:00  1980  100.1     6.0  600.60   0.0     0.0      0.0    0.0
    18:00  1920  102.7     6.5  667.55   0.0     0.0      0.0    0.0
    17:00  1860  108.5     5.4  585.90   0.0     0.0      0.0    0.0
    16:00  1800  100.1     6.0  600.60   1.0  1268.15    12.5  101.452000
    15:00  1740  102.7     6.5  667.55   1.0  1253.45    11.9  105.331933
    14:00  1680  108.5     5.4  585.90   1.0  1249.29    11.7  106.776923
    13:00  1620  105.3     6.3  663.39   1.0  1348.83    12.7  106.207087
    12:00  1560  107.1     6.4  685.44   1.0  1419.58    13.5  105.154074
    11:00  1500  103.4     7.1  734.14   1.0  1416.20    13.8  102.623188
    10:00  1440  101.8     6.7  682.06   1.0  1318.80    12.9  102.232558
    09:00  1380  102.7     6.2  636.74   1.0  1215.86    11.9  102.173109
    08:00  1320  101.6     5.7  579.12   1.0   698.88     6.9  101.286957
    07:00  1260   99.8     1.2  119.76   1.0   360.24     3.6  100.066667
    06:00  1200  100.2     2.4  240.48   1.0   621.51     6.3   98.652381
    05:00  1140   97.7     3.9  381.03   1.0   897.39     9.1   98.614286
    04:00  1080   99.3     5.2  516.36   1.0  1407.25    14.1   99.804965
    03:00  1020  100.1     8.9  890.89   1.0  1628.89    16.1  101.173292
    02:00   960  102.5     7.2  738.00   1.0  1413.35    13.7  103.164234
    01:00   900  103.9     6.5  675.35   0.0     0.00     0.0    0.000000

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-01-24T00:22:15+0000

It looks like you're not using pandas in the right way. I'd recommend taking a quick look at a tutorial.

For starters, the following lines

df.insert(df.shape[1], "flag", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "PVsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "Vsum_2", list(repeat(0.0,len(df))))
df.insert(df.shape[1], "VWMA_2", list(repeat(0.0,len(df))))

can be much easier written as:

df['flag'] = 0
df['PVsum_2'] = 0
df['Vsum_2'] = 0
df['VWMA_2'] = 0

But it seems you don't even need to initialise those columns really.

You also don't need the for loop because you can align 2 dataframes, one being your original and another one is one where you've shifted all rows. For example:

df_shift = df.shift(-1)

You can then use normal vectorised calculations to achieve what you want, e.g.:

df['PVsum_2'] = df['PV'] + df_shift['PV']
df['Vsum_2'] = df['volume'] + df_shift['volume']
idx = df['Vsum_2'] != 0   # this is your check whether that value is different from 0 
df.loc[idx, 'VWMA_2'] = df.loc[idx, 'PVsum_2'] / df.loc[idx, 'VSum_2']  # and now use that index to only calculate VWMA_2 where the Vsum_2 was 0

Hopefully you get the idea and can make small adjustments to make it work exactly as you want.

Categories

python - faster way to calculate a rolling sum in a dataframe

python - faster way to calculate a rolling sum in a dataframe

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags