I have a pivot table of approximately 2 million rows, built from a DataFrame with the same structure as below:
```python
import pandas as pd
from datetime import datetime

raw = pd.DataFrame(
    [[123456, datetime(2020, 7, 1), 'A', 10],
     [123456, datetime(2020, 7, 1), 'B', 25],
     [123456, datetime(2020, 7, 1), 'C', 0],
     [123456, datetime(2020, 7, 2), 'A', 17],
     [123456, datetime(2020, 7, 2), 'B', 23],
     [123456, datetime(2020, 7, 2), 'C', float('NaN')],
     [789012, datetime(2020, 7, 2), 'A', 11],
     [789012, datetime(2020, 7, 2), 'B', 19],
     [789012, datetime(2020, 7, 3), 'A', 8],
     [789012, datetime(2020, 7, 3), 'B', 21]],
    columns=['GROUP_ID', 'DATE', 'NAME', 'VALUE'])
```

```
   GROUP_ID       DATE NAME  VALUE
0    123456 2020-07-01    A   10.0
1    123456 2020-07-01    B   25.0
2    123456 2020-07-01    C    0.0
3    123456 2020-07-02    A   17.0
4    123456 2020-07-02    B   23.0
5    123456 2020-07-02    C    NaN
6    789012 2020-07-02    A   11.0
7    789012 2020-07-02    B   19.0
8    789012 2020-07-03    A    8.0
9    789012 2020-07-03    B   21.0
```
As you can see, the `VALUE` column can be `NaN`. The pivot table is created like this:
```python
pt = raw.pivot_table(index=['GROUP_ID', 'DATE'], columns=['NAME'], values=['VALUE'])
```

```
                    VALUE            
NAME                    A     B    C
GROUP_ID DATE                       
123456   2020-07-01  10.0  25.0  0.0
         2020-07-02  17.0  23.0  NaN
789012   2020-07-02  11.0  19.0  NaN
         2020-07-03   8.0  21.0  NaN
```
The idea is to create a level-0 column `VALUE_PREV` that holds the value of `C` from the day before. I first did this, and it took 10 seconds:
```python
dfA = pt.stack().unstack(level='DATE').shift(1, axis=1).stack(level='DATE')
dfA = dfA[dfA.index.get_level_values('NAME') == 'C']
dfA = dfA.unstack(level='NAME').rename(columns={'VALUE': 'VALUE_PREV'})
ptA = pt.merge(dfA, how='outer', on=['GROUP_ID', 'DATE'])
```

```
                    VALUE             VALUE_PREV
NAME                    A     B    C           C
GROUP_ID DATE                                   
123456   2020-07-01  10.0  25.0  0.0         NaN
         2020-07-02  17.0  23.0  NaN         0.0
789012   2020-07-02  11.0  19.0  NaN         NaN
         2020-07-03   8.0  21.0  NaN         NaN
```
So I was wondering: is there a quicker way to do this, or at least something less heavy to write and understand?
Edit: if the `VALUE` of `C` is `NaN` at day t, then `VALUE_PREV` of `C` at day t+1 MUST be `NaN`, not 0.
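For reference, that NaN-propagation requirement is satisfied by any `shift`-based approach, since shifting simply carries the `NaN` along. As an illustration only (not benchmarked on 2 million rows, and assuming each group's dates sit on consecutive rows, so a positional per-group shift coincides with a one-day calendar shift), a minimal per-group sketch on the toy data:

```python
import pandas as pd
from datetime import datetime

raw = pd.DataFrame(
    [[123456, datetime(2020, 7, 1), 'A', 10],
     [123456, datetime(2020, 7, 1), 'B', 25],
     [123456, datetime(2020, 7, 1), 'C', 0],
     [123456, datetime(2020, 7, 2), 'A', 17],
     [123456, datetime(2020, 7, 2), 'B', 23],
     [123456, datetime(2020, 7, 2), 'C', float('NaN')],
     [789012, datetime(2020, 7, 2), 'A', 11],
     [789012, datetime(2020, 7, 2), 'B', 19],
     [789012, datetime(2020, 7, 3), 'A', 8],
     [789012, datetime(2020, 7, 3), 'B', 21]],
    columns=['GROUP_ID', 'DATE', 'NAME', 'VALUE'])

pt = raw.pivot_table(index=['GROUP_ID', 'DATE'], columns=['NAME'], values=['VALUE'])

# Shift the C column down by one row within each GROUP_ID; a NaN at day t
# is carried to day t+1 unchanged, and the first row of each group gets NaN.
pt[('VALUE_PREV', 'C')] = pt[('VALUE', 'C')].groupby(level='GROUP_ID').shift(1)
```

On this sample, the resulting `VALUE_PREV` column is `NaN, 0.0, NaN, NaN`, matching the merge-based result above; the two approaches can diverge when a group has a gap in its dates.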