Spark Streaming jobs, when submitted to YARN, often fail to start because the YARN cluster manager can’t find the Spark application JARs.

Common Causes and Fixes

  1. Spark Application JAR not accessible by YARN NodeManagers:

    • Diagnosis: Check the YARN application logs (yarn logs -applicationId <appId>) or the NodeManager logs (their location is set by yarn.nodemanager.log-dirs in yarn-site.xml) for a ContainerLaunchException or FileNotFoundException referencing your application JAR.
    • Cause: The application JAR exists only on the client machine, so the YARN nodes that launch your containers cannot read it. (Note: spark.yarn.jars is for Spark’s own runtime JARs, not your application code; pointing it at your application JAR is a common misconfiguration.)
    • Fix: Upload your application JAR (e.g., my-spark-app.jar) to a location accessible by all YARN nodes, typically HDFS, and reference that path when submitting.
      # Upload to HDFS
      hdfs dfs -put my-spark-app.jar /user/spark/jars/
      
      # Submit, referencing the HDFS copy of the application JAR
      spark-submit \
        --class com.example.MyStreamingApp \
        --master yarn \
        --deploy-mode cluster \
        hdfs:///user/spark/jars/my-spark-app.jar
      
    • Why it works: YARN’s NodeManagers download these JARs to the containers where your application code will run. If they can’t find them, the containers can’t start.
  2. Missing Spark Runtime JARs:

    • Diagnosis: Similar to above, YARN logs will show missing Spark core, SQL, or streaming classes (e.g., NoClassDefFoundError for org.apache.spark classes).
    • Cause: Your application JAR (correctly) does not bundle Spark’s own libraries, but the cluster cannot supply them because spark.yarn.jars / spark.yarn.archive is missing or points to the wrong location.
    • Fix:
      • Option A (Recommended): Keep Spark’s libraries out of your fat JAR (mark them as provided in your build) and point spark.yarn.jars or spark.yarn.archive at the Spark runtime JARs on HDFS (as in Cause 1’s upload step).
      • Option B: Point spark.yarn.jars at the Spark JARs installed at the same path on every node, using the local: scheme, e.g. local:///usr/lib/spark/jars/*.jar. (On pre-2.0 clusters this was a single assembly JAR such as hdfs:///user/spark/spark-assembly-*.jar.)
      # Example for Option B if Spark is installed at the same path on every node
      spark-submit \
        --class com.example.MyStreamingApp \
        --master yarn \
        --deploy-mode cluster \
        --conf spark.yarn.jars=local:///usr/lib/spark/jars/*.jar \
        my-spark-app.jar
      
    • Why it works: Spark applications depend on Spark’s own libraries. YARN needs to ensure these are available to your application’s containers.
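    A documented pattern for supplying Spark’s runtime JARs is to publish them to HDFS once and reference that archive in every submit. A minimal sketch, assuming the JARs live under /usr/lib/spark/jars on this host (adjust for your distribution); the HDFS destination is a placeholder:

    ```shell
    # Assumed location of Spark's runtime JARs on this host (placeholder)
    SPARK_JARS_DIR=${SPARK_JARS_DIR:-/usr/lib/spark/jars}
    mkdir -p spark-libs-staging
    cp "$SPARK_JARS_DIR"/*.jar spark-libs-staging/ 2>/dev/null || true
    # Build one archive with the JARs at its root, as spark.yarn.archive expects
    (cd spark-libs-staging && python3 -m zipfile -c ../spark-libs.zip .)
    # On a cluster host, upload once and reference it in every submit:
    #   hdfs dfs -put spark-libs.zip /user/spark/
    #   spark-submit ... --conf spark.yarn.archive=hdfs:///user/spark/spark-libs.zip ...
    ```

    Caching one archive on HDFS avoids re-uploading hundreds of individual JARs on every submission.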
  3. Non-JAR Dependencies Not Distributed:

    • Diagnosis: Errors in YARN logs indicating missing Python files, .so libraries, or other non-JAR dependencies.
    • Cause: Python modules, native libraries, and other non-JAR dependencies must be packaged as an archive and shipped with the job via --archives (spark.yarn.dist.archives), which is unset or points to the wrong archive. (Don’t confuse this with spark.yarn.archive, which holds Spark’s own runtime JARs.)
    • Fix: Package your non-JAR dependencies (e.g., Python modules, native libraries) into a ZIP or TAR.GZ file, upload it to HDFS, and pass it with --archives.
      # Assuming you have a dependency archive named dependencies.zip
      hdfs dfs -put dependencies.zip /user/spark/archives/
      
      spark-submit \
        --class com.example.MyStreamingApp \
        --master yarn \
        --deploy-mode cluster \
        --archives hdfs:///user/spark/archives/dependencies.zip#deps \
        my-spark-app.jar
      
    • Why it works: YARN unpacks the archive into each container’s working directory (under the alias given after #, here deps), making the dependencies available at runtime.
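    Building the dependency archive itself needs no external zip tool; Python’s standard-library zipfile module can do it. A sketch using a hypothetical mylib module (directory and module names are placeholders):

    ```shell
    # Hypothetical layout: a deps/ directory holding the Python modules to ship
    mkdir -p deps/mylib
    printf 'VERSION = "1.0"\n' > deps/mylib/__init__.py
    # Create the archive with Python's stdlib zipfile module
    (cd deps && python3 -m zipfile -c ../dependencies.zip mylib)
    # Inspect the archive contents before uploading it to HDFS
    python3 -m zipfile -l dependencies.zip
    ```

    Listing the archive before uploading catches a frequent mistake: zipping the parent directory so the modules end up one level too deep to import.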
  4. YARN ResourceManager/NodeManager Permissions Issue:

    • Diagnosis: YARN logs showing Permission denied errors when trying to access HDFS paths specified in spark.yarn.jars or spark.yarn.archive.
    • Cause: The YARN user (often yarn or the user running spark-submit) does not have read permissions on the HDFS directory or files.
    • Fix: Grant read access on the files and traversal (execute) access on the directories to the YARN user.
      # a+rX adds read everywhere and the execute bit on directories,
      # which is required to traverse into them
      hdfs dfs -chmod -R a+rX /user/spark/jars
      hdfs dfs -chmod -R a+rX /user/spark/archives
      
    • Why it works: YARN services need to read these files to distribute them to application containers.
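    The execute bit on directories is easy to overlook: a directory that is readable but not traversable still blocks access to the files inside it. A local-filesystem illustration (HDFS applies the same semantics):

    ```shell
    mkdir -p permdemo/sub
    echo data > permdemo/sub/file.txt
    chmod 644 permdemo/sub   # r without x: non-root users cannot open files inside
    cat permdemo/sub/file.txt 2>/dev/null || echo "blocked: directory lacks x"
    chmod 755 permdemo/sub   # add x: the directory can now be traversed
    cat permdemo/sub/file.txt
    ```

    This is why a blanket a+r on a directory tree can still leave YARN unable to localize the files.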
  5. Incorrect --deploy-mode cluster Configuration:

    • Diagnosis: The submission appears to succeed on the client, but the driver fails on the cluster with a ClassNotFoundException at startup, or container logs show that the configured JAR path could not be resolved from the cluster nodes.
    • Cause: In cluster deploy mode the driver runs on a YARN node, not on the client, so the paths you configure (the application JAR, spark.yarn.jars, extra files) must be resolvable from the cluster, usually via HDFS. A client-local path such as file:///path/to/my-spark-app.jar will not exist on the worker nodes.
    • Fix: Use an HDFS (or other distributed filesystem) path for cluster-side resources when in cluster deploy mode.
      # The application JAR is given as an HDFS URI so the driver,
      # which runs on a cluster node, can fetch it
      spark-submit \
        --class com.example.MyStreamingApp \
        --master yarn \
        --deploy-mode cluster \
        hdfs:///user/spark/jars/my-spark-app.jar
      
    • Why it works: In cluster mode, the driver runs on a YARN worker node, not the client. It needs access to the application JARs from the cluster’s perspective.
  6. Network Issues or Firewall Blocking HDFS Access:

    • Diagnosis: YARN logs show timeouts or connection refused errors when trying to connect to HDFS NameNode or DataNodes.
    • Cause: Network configuration or firewalls prevent YARN NodeManagers from accessing the HDFS cluster.
    • Fix: Ensure that the network paths between YARN NodeManagers and HDFS NameNodes/DataNodes are open and that DNS resolution is working correctly. This is an infrastructure-level fix.
    • Why it works: Every container launch begins with localizing JARs and application files from HDFS; if NodeManagers cannot reach HDFS, nothing downstream can run.
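    A quick triage sequence, run from a NodeManager host; namenode.example.com and port 8020 are placeholders for your NameNode’s address and RPC port:

    ```shell
    NN_HOST=${NN_HOST:-namenode.example.com}   # placeholder NameNode hostname
    NN_PORT=${NN_PORT:-8020}                   # common default NameNode RPC port
    # 1. Can this host resolve the NameNode's hostname?
    getent hosts "$NN_HOST" || echo "DNS lookup failed for $NN_HOST"
    # 2. Is the NameNode RPC port reachable through the firewall?
    timeout 5 bash -c "echo > /dev/tcp/$NN_HOST/$NN_PORT" 2>/dev/null \
      && echo "NameNode port reachable" \
      || echo "cannot reach $NN_HOST:$NN_PORT"
    ```

    If DNS resolves but the port check fails, look at firewall rules; if DNS itself fails, fix resolution (e.g., /etc/hosts or the cluster’s DNS) first.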

The next error you’ll likely encounter after fixing these is related to the YARN application submission itself, such as the application master container failing to launch due to insufficient memory or CPU.
