Spark

Spark is a tool for managing and coordinating the execution of tasks on large datasets across a cluster of computers. It is written in Scala.

Spark libraries exist for other languages, such as Python. Spark translates code written in those languages into code that can run on the executor JVMs.

Concepts

The cluster that Spark tasks execute on is managed by a cluster manager such as Spark’s standalone cluster manager, YARN, or Mesos. These cluster managers provision resources to applications so that they can do their work.

The driver process runs the main() function on a node in the cluster. It maintains information about the Spark Application, responds to the user’s program, and analyzes, distributes, and schedules the work across executors.

An executor process actually executes the code assigned to it by the driver and reports the state of its computation back to the driver node. The number of executors per node can be configured.

Spark also has a local mode where the driver and executors are simply processes on the same machine.

The SparkSession object is the entry point to running Spark code. Standalone applications must create the SparkSession themselves, while interactive consoles create one implicitly (available as the variable spark).
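
As a minimal sketch of what a standalone application might look like (the object and application names are placeholders), the SparkSession is built explicitly; "local[*]" runs Spark in the local mode described above:

    import org.apache.spark.sql.SparkSession

    object ExampleApp {
      def main(args: Array[String]): Unit = {
        // Build the SparkSession explicitly; in spark-shell a session named
        // `spark` already exists. "local[*]" runs in local mode on all cores.
        val spark = SparkSession.builder()
          .appName("example-app")   // placeholder name
          .master("local[*]")       // on a real cluster this is usually set by spark-submit
          .getOrCreate()

        println(spark.version)
        spark.stop()
      }
    }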

A distributed collection is one that is partitioned across the various executors. The core Spark abstractions are Datasets, DataFrames, SQL tables, and Resilient Distributed Datasets (RDDs). All of these are distributed collections.

A DataFrame represents a table of data with rows and columns. A DataFrame’s schema is a list defining the columns and their types. The data itself is split into partitions across the executors, and parallelism depends on both: with one partition and many executors, or many partitions and only one executor, the parallelism is still one. DataFrames are high-level and don’t expose manual partition manipulation, but lower-level APIs such as RDDs do.
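
As a small sketch of inspecting these properties (assuming a SparkSession named spark; the column name number is arbitrary):

    // A DataFrame of 1,000 rows with a single long column.
    val df = spark.range(1000).toDF("number")

    // The schema: the column names and their types.
    df.printSchema()

    // Partitions aren't manipulated through the DataFrame itself, but the
    // underlying RDD exposes them.
    println(df.rdd.getNumPartitions)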

Spark’s core data structures are immutable and are manipulated through specifying transformations to be applied to them. Spark doesn’t act on transformations until an action is called.
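
A short sketch of this distinction, again assuming a SparkSession named spark:

    val numbers = spark.range(1000)

    // A transformation: Spark only records the plan, nothing runs yet.
    val even = numbers.where("id % 2 = 0")

    // An action: this forces Spark to execute the plan and return a result.
    println(even.count())   // 500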

Transformations have either narrow or wide dependencies.

Transformations with narrow dependencies (narrow transformations) are those where each input partition contributes to only one output partition. Spark automatically pipelines narrow transformations, so that multiple operations are performed in memory.

Transformations with wide dependencies (wide transformations) have each input partition contributing to many output partitions. This is often referred to as a shuffle, in which Spark exchanges partitions across the cluster. Shuffles cause Spark to write results to disk.
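
A sketch contrasting the two kinds of transformation, with the same assumed spark session: where and selectExpr are narrow and get pipelined, while groupBy forces a shuffle.

    val df = spark.range(1000)

    // Narrow: each input partition contributes to only one output partition,
    // so these steps are pipelined in memory.
    val narrow = df.where("id % 2 = 0").selectExpr("id * 2 AS doubled")

    // Wide: grouping exchanges rows between partitions across the cluster.
    val wide = narrow.groupBy("doubled").count()

    wide.show()   // the action that triggers the whole computation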

Spark SQL

There is a Spark SQL console, launched via the spark-sql command.
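
The same SQL can also be run from application code; as a hedged sketch (the view name numbers is arbitrary), a DataFrame is registered as a temporary view and queried through the SparkSession:

    val df = spark.range(1000).toDF("number")
    df.createOrReplaceTempView("numbers")

    // spark.sql returns another DataFrame, evaluated lazily like any other.
    val evens = spark.sql("SELECT number FROM numbers WHERE number % 2 = 0")
    evens.show()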

April 18, 2019
57fed1c — March 15, 2024