I need your help.
I have the following code fragment where the delay occurs, and I suspect the groupBy step is the bottleneck on large DataFrames:
from pyspark.sql.functions import col, sum  # Spark's sum, which shadows Python's builtin

variable = (
    table1
    # drop table1's copies of these columns, presumably so the groupBy below uses the versions that table2 contributes after the join
    .drop('codcc', 'codsubcategory_homo', 'dessubcategorysolo', 'destiposolo', 'codclass', 'codunidadnegocio', 'dessubtiposolo', 'codmarc', 'codcategory_homo')
    .join(table2, ['codsap'], 'left')
    .groupBy('codcountry', 'codebelista', 'aniocampana', 'codcc', 'codsap', 'codsubcategory_homo', 'dessubcategorysolo', 'destiposolo', 'codclass', 'codunidadnegocio', 'dessubtiposolo', 'codmarc', 'codcategory_homo')
    .agg(sum(col('realuusales') + col('realuumissing')).alias('units'))  # per-group total of sales plus missing units
    .where(col('units') > 0)
    .cache()
)
It runs for more than an hour without returning any results, and it consumes the server's memory.
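To narrow down where the time goes, I was planning to run the diagnostics below (a sketch on my part: "variable" is the DataFrame above and "spark" is my SparkSession; explain() and the shuffle-partition setting are standard Spark):

# Look for Exchange (shuffle) steps around the join and the groupBy in the physical plan
variable.explain(True)

# Number of shuffle partitions; the default of 200 is often too low for very large aggregations
print(spark.conf.get("spark.sql.shuffle.partitions"))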
I have only basic knowledge of PySpark, and I was asked to try to optimize this piece of code.
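One change I am considering, assuming table2 is a small lookup table (I have not verified its size, so this is only a sketch; trimmed_table1 is my placeholder for table1 after the drop above), is a broadcast join so the large table is not shuffled for the join:

from pyspark.sql.functions import broadcast

# Ship table2 to every executor instead of shuffling the big table across the network.
# Only worthwhile if table2 comfortably fits in executor memory.
joined = trimmed_table1.join(broadcast(table2), ['codsap'], 'left')

Would this be a sensible direction, or is the groupBy itself the problem?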
Thanks