Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.1k views
in Technique[技术] by (71.8m points)

scala - Why Apache Spark take function not parallel?

Reading Apache Spark guide at http://spark.apache.org/docs/latest/programming-guide.html it states :

enter image description here

Why is take function not run in parallel? What are the difficulties in implementing this type of function in parallel ? Is it something to do with fact that in order to take first n elements of RDD it is required to traverse entire RDD ?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Actually, while take is not entirely parallel, it's not entirely sequential either.

For example let's say you take(200), and each partition has 10 elements. take will first fetch partition 0 and see that it has 10 elements. It assumes that it would need 20 such partitions to get 200 elements. But it's better to ask for a bit more in a parallel request. So it wants 30 partitions, and it already has 1. So it fetches partitions 1 to 29 next, in parallel. This will likely be the last step. If it's very unlucky, and does not find a total of 200 elements, it will again make an estimate and request another batch in parallel.

Check out the code, it's well documented: https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1049

I think the documentation is wrong. Local calculation only happens when a single partition is required. This is the case in the first pass (fetching partition 0), but typically not the case in later passes.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...