I need your help.
I have the following code fragment where the delay occurs, and I suspect the groupBy step is the bottleneck on large DataFrames:
from pyspark.sql.functions import col, sum  # Spark's sum, which shadows Python's builtin

variable = (
    table1
    # drop table1's copies of these columns, presumably so the groupBy below uses the versions that table2 contributes after the join
    .drop('codcc', 'codsubcategory_homo', 'dessubcategorysolo', 'destiposolo', 'codclass', 'codunidadnegocio', 'dessubtiposolo', 'codmarc', 'codcategory_homo')
    .join(table2, ['codsap'], 'left')
    .groupBy('codcountry', 'codebelista', 'aniocampana', 'codcc', 'codsap', 'codsubcategory_homo', 'dessubcategorysolo', 'destiposolo', 'codclass', 'codunidadnegocio', 'dessubtiposolo', 'codmarc', 'codcategory_homo')
    .agg(sum(col('realuusales') + col('realuumissing')).alias('units'))  # per-group total of sales plus missing units
    .where(col('units') > 0)
    .cache()
)
It runs for more than an hour without returning any results, and it consumes the server's memory.
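To narrow down where the time goes, I was planning to run the diagnostics below (a sketch on my part: "variable" is the DataFrame above and "spark" is my SparkSession; explain() and the shuffle-partition setting are standard Spark):

# Look for Exchange (shuffle) steps around the join and the groupBy in the physical plan
variable.explain(True)

# Number of shuffle partitions; the default of 200 is often too low for very large aggregations
print(spark.conf.get("spark.sql.shuffle.partitions"))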
I have only basic knowledge of PySpark, and I was asked to try to optimize this piece of code.
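One change I am considering, assuming table2 is a small lookup table (I have not verified its size, so this is only a sketch; trimmed_table1 is my placeholder for table1 after the drop above), is a broadcast join so the large table is not shuffled for the join:

from pyspark.sql.functions import broadcast

# Ship table2 to every executor instead of shuffling the big table across the network.
# Only worthwhile if table2 comfortably fits in executor memory.
joined = trimmed_table1.join(broadcast(table2), ['codsap'], 'left')

Would this be a sensible direction, or is the groupBy itself the problem?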
Thanks