SparkSession vs SparkContext

The SparkSession is created when you build your session, as in the following:
import org.apache.spark.sql.SparkSession;

// Build (or reuse) a session; local[*] uses all the cores of the local machine
SparkSession spark = SparkSession.builder()
    .appName("An app")
    .master("local[*]")
    .getOrCreate();
As part of your session, you will also get a context: SparkContext. The context was your only way to deal with Spark prior to v2. You mostly do not need to interact with the SparkContext, but when you do (accessing infrastructure information, creating accumulators, and more, described in chapter 17), this is how you access it:
import org.apache.spark.SparkContext;

// The session wraps the underlying (Scala) SparkContext and exposes it directly
SparkContext sc = spark.sparkContext();
System.out.println("Running Spark v" + sc.version());
Executors are JVM processes that run computations and store data for your application. The SparkSession, which resides in the driver (often on the master node), sends tasks to the available worker nodes.
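To see this in configuration terms, here is a sketch of a session pointed at a standalone cluster, with executor sizing set at build time; the master URL, memory, and core counts are hypothetical example values, not recommendations:

import org.apache.spark.sql.SparkSession;

// A sketch: connect to a (hypothetical) standalone master and size the
// executors that will run your tasks. All values are example placeholders.
SparkSession spark = SparkSession.builder()
    .appName("An app")
    .master("spark://un.oplo.io:7077")      // example standalone master URL
    .config("spark.executor.memory", "4g")  // memory per executor JVM
    .config("spark.executor.cores", "2")    // cores per executor
    .getOrCreate();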
As with the shared drive, I highly recommend that the mount point be the same on every worker.
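As an illustration of why, consider a read through such a mount point: each executor resolves the same path on its own file system, so the path must exist identically everywhere. This sketch assumes the session built earlier; the mount point and file name are hypothetical:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Each executor opens this path locally, so the shared drive must be
// mounted at /mnt/shared on every worker (hypothetical path and file).
Dataset<Row> df = spark.read()
    .option("header", "true")
    .csv("/mnt/shared/data/books.csv");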
Exclusions are key; you do not want to carry all your dependencies. Without exclusions, all your dependent classes are transferred in your uber JAR, including all the Spark classes and artifacts. They are indeed needed, but because they ship with Spark, they are already available on your target system. If you bundle them in your uber JAR anyway, the JAR becomes really big, and you may encounter conflicts between libraries.
Because Spark comes with more than 220 libraries, you do not need to bundle in your uber JAR the dependencies that are already available on the target system.
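As a sketch of what such exclusions can look like with the maven-shade-plugin (assuming that is the plugin building your uber JAR), you can exclude whole groups of artifacts that the target system already provides; the exact list depends on your project:

<configuration>
  <artifactSet>
    <excludes>
      <!-- Already provided by the Spark runtime on the target system -->
      <exclude>org.apache.spark:*</exclude>
      <exclude>org.scala-lang:*</exclude>
    </excludes>
  </artifactSet>
</configuration>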
minimizeJar Removes all classes that are not used by the project, reducing the size of the JAR.
<configuration>
  <minimizeJar>true</minimizeJar>
  ...
</configuration>
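For context, here is roughly where that element sits in a full maven-shade-plugin declaration; the plugin version is only an example:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version> <!-- example version -->
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <minimizeJar>true</minimizeJar>
      </configuration>
    </execution>
  </executions>
</plugin>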
✅ Answer: The Spark master makes the JAR available for download by the workers. Consider that high-volume files combined with a large number of workers may push the master to the edge. Here is an example of a spark-submit log showing this:
2018-08-20 11:52:14 INFO SparkContext:54 - Added JAR file:/.../app.jar at
spark://un.oplo.io:42805/jars/app.jar with timestamp 1534780334746
stdout is always empty?
✅ Answer: Most likely because the printing code runs on the executors, not on the driver: output produced inside a transformation or action goes to each executor's stdout (visible in the worker logs or the Spark web UI), not to the console where you launched the application.