
python - groupBy takes too long with Spark and PySpark

I need your support. The delay occurs in the following code fragment, and I suspect the groupBy does not scale to large DataFrames:

from pyspark.sql.functions import col, sum

variable = (
    table1.drop('codcc', 'codsubcategory_homo', 'dessubcategorysolo', 'destiposolo', 'codclass', 'codunidadnegocio', 'dessubtiposolo', 'codmarc', 'codcategory_homo')
    .join(table2, ['codsap'], 'left')
    .groupBy('codcountry', 'codebelista', 'aniocampana', 'codcc', 'codsap', 'codsubcategory_homo', 'dessubcategorysolo', 'destiposolo', 'codclass', 'codunidadnegocio', 'dessubtiposolo', 'codmarc', 'codcategory_homo')
    .agg(sum(col('realuusales') + col('realuumissing')).alias('units'))
    .where(col('units') > 0)
    .cache()
)

It runs for more than an hour without producing results and consumes the server's memory. I have only basic knowledge of PySpark and was asked to try to optimize this piece of code.

Thanks



1 Answer

Waiting for an expert to answer.
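In the meantime, one common direction for a slow groupBy like this is to reduce what gets shuffled: select only the columns the join and aggregation actually need, broadcast the smaller table in the join, and skip cache() unless the result is reused. Below is a minimal sketch of that idea, reusing the column names from the question; it assumes realuusales and realuumissing live in table1, the category columns come from table2, and table2 is small enough to broadcast, none of which is confirmed in the question.

from pyspark.sql import functions as F

# Keep only the columns the join and aggregation need (assumed to live in table1).
slim1 = table1.select('codcountry', 'codebelista', 'aniocampana', 'codsap', 'realuusales', 'realuumissing')

variable = (
    slim1
    # broadcast() is only a hint and is appropriate only if table2 fits in executor memory
    .join(F.broadcast(table2), ['codsap'], 'left')
    .groupBy('codcountry', 'codebelista', 'aniocampana', 'codcc', 'codsap', 'codsubcategory_homo',
             'dessubcategorysolo', 'destiposolo', 'codclass', 'codunidadnegocio', 'dessubtiposolo',
             'codmarc', 'codcategory_homo')
    .agg(F.sum(F.col('realuusales') + F.col('realuumissing')).alias('units'))
    .where(F.col('units') > 0)
    # no cache() here: caching the full result was likely part of the memory pressure
)

If table2 is actually large, drop the broadcast hint; the column pruning alone usually helps, and it is also worth checking whether the left join duplicates rows per codsap, which would inflate the groupBy input.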
