Spark Streaming jobs, when submitted to YARN, often fail to start because the YARN cluster manager can’t find the Spark application JARs.
Common Causes and Fixes
- Spark Application JAR not accessible by YARN NodeManagers:
  - Diagnosis: Check the YARN ResourceManager logs (`yarn-site.xml` points to the log directory) for a `ContainerLaunchException` or `FileNotFoundException` related to your application JAR.
  - Cause: The `spark.yarn.jars` configuration is missing or incorrect, so YARN doesn't know where to find the JARs containing your Spark code.
  - Fix: Upload your application JAR (e.g., `my-spark-app.jar`) to a location accessible by all YARN nodes, typically HDFS. Then set `spark.yarn.jars` to point to this location.

    ```shell
    # Upload to HDFS
    hdfs dfs -put my-spark-app.jar /user/spark/jars/

    # Submit with the JAR location configured
    spark-submit \
      --class com.example.MyStreamingApp \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.jars=hdfs:///user/spark/jars/my-spark-app.jar \
      my-spark-app.jar
    ```

  - Why it works: YARN's NodeManagers download these JARs to the containers where your application code will run. If they can't find them, the containers can't start.
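To pull the actual container logs for a failed run, `yarn logs -applicationId <appId>` aggregates them on the client. A minimal sketch of scanning them for the errors above; the helper name and the sample log lines are illustrative, not real output:

```shell
# Hypothetical helper: scan a log stream for the launch errors described
# above. In practice, pipe `yarn logs -applicationId <appId>` into it.
find_launch_errors() {
  grep -E 'ContainerLaunchException|FileNotFoundException|ClassNotFoundException'
}

# Demo on a fabricated log excerpt:
printf '%s\n' \
  'INFO  Container launch started' \
  'ERROR java.io.FileNotFoundException: hdfs:/user/spark/jars/my-spark-app.jar' \
  | find_launch_errors
```

With log aggregation enabled, the same `yarn logs` command also retrieves each container's stderr, which is typically where the missing-JAR stack trace lands.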
- Missing Spark Assembly JAR:
  - Diagnosis: Similar to the above, YARN logs will show missing Spark core, SQL, or streaming dependencies.
  - Cause: You're submitting a fat JAR that doesn't include Spark's own libraries, and `spark.yarn.jars` is not configured to point to the Spark installation on the cluster.
  - Fix:
    - Option A (Recommended): Build a fat JAR that includes all Spark dependencies, then set `spark.yarn.jars` to point to this fat JAR on HDFS (as in Cause 1).
    - Option B: Configure `spark.yarn.jars` to point to the Spark assembly JARs on the cluster's local filesystem or HDFS. This is often `hdfs:///user/spark/spark-assembly-*.jar` or `/usr/lib/spark/jars/*.jar`.

      ```shell
      # Example for Option B if Spark is installed locally on the nodes.
      # Quote the value so the shell doesn't expand the glob client-side.
      spark-submit \
        --class com.example.MyStreamingApp \
        --master yarn \
        --deploy-mode cluster \
        --conf "spark.yarn.jars=/usr/lib/spark/jars/*.jar" \
        my-spark-app.jar
      ```

  - Why it works: Spark applications depend on Spark's own libraries, and YARN needs to ensure these are available to your application's containers.
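Before choosing between Option A and Option B, it can help to confirm what the fat JAR actually contains. A minimal sketch, assuming the output of `jar tf` is piped in; the helper name and messages are made up:

```shell
# Hypothetical helper: pipe a JAR listing into this to check whether
# Spark's own classes are bundled in the fat JAR.
has_spark_classes() {
  if grep -q '^org/apache/spark/'; then
    echo "Spark classes bundled"
  else
    echo "Spark classes missing; set spark.yarn.jars to the Spark JARs on the cluster"
  fi
}

# Demo on a fabricated listing (real input: jar tf my-spark-app.jar):
printf '%s\n' 'com/example/MyStreamingApp.class' 'org/apache/spark/SparkConf.class' \
  | has_spark_classes
```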
- Incorrect archive configuration for Dependencies:
  - Diagnosis: Errors in YARN logs indicating missing Python files, `.so` libraries, or other non-JAR dependencies.
  - Cause: If your application has Python or other non-JAR dependencies, they need to be packaged as a ZIP or TAR.GZ archive and distributed via `spark.yarn.dist.archives` (or the equivalent `--archives` flag), and that configuration is not set or points to the wrong archive. Note that `spark.yarn.archive` is a different setting: it holds an archive of Spark's own JARs, not your application's dependencies.
  - Fix: Package your non-JAR dependencies (e.g., Python modules, native libraries) into a ZIP or TAR.GZ file, upload it to HDFS, and pass it with `--archives`.

    ```shell
    # Assuming you have a dependency archive named dependencies.zip
    hdfs dfs -put dependencies.zip /user/spark/archives/

    spark-submit \
      --class com.example.MyStreamingApp \
      --master yarn \
      --deploy-mode cluster \
      --archives hdfs:///user/spark/archives/dependencies.zip \
      my-spark-app.jar
    ```

  - Why it works: YARN unpacks this archive into the container's working directory, making the dependencies available at runtime.
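Archives passed with `--archives` also accept a `#alias` suffix that names the directory the archive is unpacked into inside each container (e.g., `dependencies.zip#deps`, referenced as `./deps` at runtime). A sketch of how the alias resolves; the helper function is illustrative:

```shell
# Illustrative helper: given an --archives entry, report the directory
# name the archive will be unpacked under in the container.
archive_alias() {
  case "$1" in
    *#*) echo "${1##*#}" ;;                 # explicit alias after '#'
    *)   echo "${1##*/}" ;;                 # default: the archive file name
  esac
}

archive_alias "hdfs:///user/spark/archives/dependencies.zip#deps"   # → deps
archive_alias "hdfs:///user/spark/archives/dependencies.zip"        # → dependencies.zip
```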
- YARN ResourceManager/NodeManager Permissions Issue:
  - Diagnosis: YARN logs showing `Permission denied` errors when trying to access the HDFS paths specified in `spark.yarn.jars` or `spark.yarn.archive`.
  - Cause: The YARN user (often `yarn`, or the user running `spark-submit`) does not have read permissions on the HDFS directory or files.
  - Fix: Grant read permissions to the YARN user on the HDFS paths.

    ```shell
    # Example: make the staged files world-readable (assuming the YARN user is 'yarn')
    hdfs dfs -chmod -R a+r /user/spark/jars
    hdfs dfs -chmod -R a+r /user/spark/archives
    ```

  - Why it works: YARN services need to read these files to distribute them to application containers.
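To verify permissions before resubmitting, `hdfs dfs -ls /user/spark/jars` prints a POSIX-style mode string for each file. A small sketch for reading that string; the function name and messages are made up:

```shell
# Illustrative helper: given a mode string as printed by `hdfs dfs -ls`
# (e.g. -rw-r-----), report whether "other" users can read the file.
other_can_read() {
  case "$1" in
    ???????r?*) echo "readable by all" ;;
    *)          echo "not world-readable; YARN may fail to localize it" ;;
  esac
}

other_can_read "-rw-r--r--"   # → readable by all
other_can_read "-rw-r-----"   # → not world-readable; YARN may fail to localize it
```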
- Incorrect `--deploy-mode cluster` Configuration:
  - Diagnosis: The application JAR is not found on the client machine, but YARN logs show it being fetched from HDFS successfully. The error might be a `ClassNotFoundException` during application startup on the cluster.
  - Cause: When using `cluster` deploy mode, the `spark.yarn.jars` path must be accessible from the YARN cluster nodes (usually via HDFS). If you specify a local path on the client machine for `spark.yarn.jars` (e.g., `file:///path/to/my-spark-app.jar`), the cluster nodes won't be able to find it.
  - Fix: Always use HDFS or a distributed filesystem path for `spark.yarn.jars` when in `cluster` deploy mode.

    ```shell
    # Note: my-spark-app.jar itself MUST be on HDFS for cluster mode
    spark-submit \
      --class com.example.MyStreamingApp \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.jars=hdfs:///user/spark/jars/my-spark-app.jar \
      hdfs:///user/spark/jars/my-spark-app.jar
    ```

  - Why it works: In cluster mode, the driver runs on a YARN worker node, not the client, so it needs access to the application JARs from the cluster's perspective.
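The client-local vs. cluster-visible distinction can be checked mechanically before submitting. A minimal sketch; the function name and the scheme list are illustrative, not exhaustive:

```shell
# Illustrative helper: classify a spark.yarn.jars value by path scheme.
check_yarn_jars_path() {
  case "$1" in
    hdfs://*|viewfs://*) echo "ok: cluster-visible path" ;;
    file://*)            echo "warn: client-local path; use HDFS in cluster mode" ;;
    local:*|/*)          echo "note: node-local path; must exist on every node" ;;
    *)                   echo "warn: unrecognized path" ;;
  esac
}

check_yarn_jars_path "hdfs:///user/spark/jars/my-spark-app.jar"
check_yarn_jars_path "file:///tmp/my-spark-app.jar"
```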
- Network Issues or Firewall Blocking HDFS Access:
  - Diagnosis: YARN logs show timeouts or connection-refused errors when trying to connect to the HDFS NameNode or DataNodes.
  - Cause: Network configuration or firewalls prevent YARN NodeManagers from accessing the HDFS cluster.
  - Fix: Ensure that the network paths between YARN NodeManagers and HDFS NameNodes/DataNodes are open and that DNS resolution is working correctly. This is an infrastructure-level fix.
  - Why it works: YARN's ability to fetch JARs and application code from HDFS is fundamental.
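As a first triage step, you can extract the NameNode endpoint from your `fs.defaultFS` or `spark.yarn.jars` URI and probe it with `nc` from a NodeManager host. A sketch, where `namenode.example.com:8020` is a placeholder for your actual NameNode address:

```shell
# Illustrative helper: pull the host:port out of an HDFS URI so it can
# be probed from a NodeManager host.
hdfs_endpoint() {
  hostport="${1#hdfs://}"   # strip the hdfs:// scheme
  echo "${hostport%%/*}"    # drop everything after the first /
}

hdfs_endpoint "hdfs://namenode.example.com:8020/user/spark/jars"   # → namenode.example.com:8020

# Then, from a NodeManager host:
#   nc -z -w 5 namenode.example.com 8020    # is the NameNode RPC port reachable?
#   getent hosts namenode.example.com       # does DNS resolve?
```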
The next error you’ll likely encounter after fixing these is related to the YARN application submission itself, such as the application master container failing to launch due to insufficient memory or CPU.