Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

scala - Spark broadcast error: exceeds spark.akka.frameSize Consider using broadcast

I have a large RDD called "edges":

org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[(String, Int)]] = MappedRDD[27] at map at <console>:52

When I was working in standalone mode, I was able to collect, count and save this RDD. Now, on a cluster, I'm getting this error:

edges.count
...
Serialized task 28:0 was 12519797 bytes which exceeds spark.akka.frameSize
  (10485760 bytes). Consider using broadcast variables for large values.

Same with .saveAsTextFile("edges")

This is from the spark-shell. I have tried using the option
--driver-java-options "-Dspark.akka.frameSize=15"

But when I do that, it just hangs indefinitely. Any help would be appreciated.

** EDIT **

My standalone mode was on Spark 1.1.0 and my cluster is Spark 1.0.1.

Also, the hang occurs when I count, collect or saveAs* the RDD, but defining it or filtering it works just fine.

1 Answer

The "Consider using broadcast variables for large values" error message usually indicates that you've captured some large variables in function closures. For example, you might have written something like

val someBigObject = ...
rdd.mapPartitions { x => doSomething(someBigObject, x) }.count()

which causes someBigObject to be captured and serialized with your task. If you're doing something like that, you can use a broadcast variable instead, which will cause only a reference to the object to be stored in the task itself, while the actual object data will be sent separately.
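
As a rough sketch of that change (not necessarily how your job is structured), the closure below reads the object through a broadcast handle instead of capturing it directly. The names BroadcastSketch, buildLookup and countKnown, the sample data, and the local[2] master are all made up for illustration; only sc.broadcast, .value and mapPartitions are Spark API.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Hypothetical stand-in for the large object that the closure used to capture.
  def buildLookup(): Map[Int, String] =
    (0 until 1000).map(i => i -> s"v$i").toMap

  def countKnown(sc: SparkContext): Long = {
    val someBigObject = buildLookup()
    // Register the object as a broadcast variable; the task closure now
    // carries only the small broadcast handle, not the object itself.
    val bigObjectBc = sc.broadcast(someBigObject)

    val rdd = sc.parallelize(0 until 2000, 4) // illustrative input data
    rdd.mapPartitions { iter =>
      val lookup = bigObjectBc.value // resolved on the executor
      iter.filter(lookup.contains)   // keep only keys present in the lookup
    }.count()
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(countKnown(sc)) finally sc.stop()
  }
}
```

The important detail is that the closure refers to bigObjectBc and calls .value inside the task; referring to someBigObject inside the closure would capture it again.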

In Spark 1.1.0+, it isn't strictly necessary to use broadcast variables for this, since tasks will automatically be broadcast (see SPARK-2521 for more details). There are still reasons to use broadcast variables (such as sharing a big object across multiple actions / jobs), but you won't need them to avoid frame size errors.

Another option is to increase the Akka frame size. In any Spark 1.x version, you should be able to set spark.akka.frameSize in SparkConf prior to creating your SparkContext. The value is in MB, so the 10485760 bytes in your error message corresponds to a setting of 10. As you may have noticed, though, this is a little harder in spark-shell, where the context is created for you. In newer versions of Spark (1.1.0 and higher), you can pass --conf spark.akka.frameSize=16 when launching spark-shell. In Spark 1.0.1 or 1.0.2, you should be able to pass --driver-java-options "-Dspark.akka.frameSize=16" instead.
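
For a standalone application, the SparkConf route might look roughly like the sketch below. It assumes a Spark 1.x build where spark.akka.frameSize is still honored. FrameSizeSketch, minFrameSizeMb, buildConf, the app name and the 4 MB of headroom are illustrative choices, not anything Spark prescribes; the byte count is the one from the error message in the question.

```scala
import org.apache.spark.SparkConf

object FrameSizeSketch {
  private val BytesPerMb = 1024L * 1024L

  // Smallest whole number of MB that is at least taskBytes (ceiling division).
  def minFrameSizeMb(taskBytes: Long): Int =
    ((taskBytes + BytesPerMb - 1) / BytesPerMb).toInt

  // Builds a SparkConf with the frame size set. Pass the result to
  // new SparkContext(conf) before any jobs run.
  def buildConf(frameSizeMb: Int): SparkConf =
    new SparkConf()
      .setAppName("frame-size-sketch") // hypothetical app name
      .set("spark.akka.frameSize", frameSizeMb.toString) // value is in MB

  def main(args: Array[String]): Unit = {
    val needed = minFrameSizeMb(12519797L) // task size reported in the error
    val conf = buildConf(needed + 4)       // 4 MB of headroom is an arbitrary assumption
    println(s"minimum = $needed MB, configured = ${conf.get("spark.akka.frameSize")} MB")
  }
}
```

Running main prints "minimum = 12 MB, configured = 16 MB": the 12519797-byte task needs at least 12 MB, so a setting of 15 or 16 leaves some margin.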

