Principles of Overriding RDD

What should a subclass of RDD override? The relevant excerpt from RDD.scala:

  /*
   RDD.scala
  */
  
  // =======================================================================
  // Methods that should be implemented by subclasses of RDD
  // =======================================================================

  /**
   * :: DeveloperApi ::
   * Implemented by subclasses to compute a given partition.
   */
  @DeveloperApi
  def compute(split: Partition, context: TaskContext): Iterator[T]

  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   *
   * The partitions in this array must satisfy the following property:
   *   `rdd.partitions.zipWithIndex.forall { case (partition, index) => partition.index == index }`
   */
  protected def getPartitions: Array[Partition]

  /**
   * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getDependencies: Seq[Dependency[_]] = deps

  /**
   * Optionally overridden by subclasses to specify placement preferences.
   */
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  /** Optionally overridden by subclasses to specify how they are partitioned. */
  @transient val partitioner: Option[Partitioner] = None

The first three are the core of any custom RDD: compute and getPartitions are abstract and must be implemented, while getDependencies has a default that simply returns the deps passed to the constructor, so single-parent RDDs often leave it alone.

Depending on the RDD, compute either consumes the parent's iterator or produces records from scratch; both cases are sketched below.

The last two (getPreferredLocations and the partitioner val) are optional.

Beyond these, subclasses also override other hooks as needed, such as clearDependencies().
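First, the consume-the-parent's-iterator case, in the spirit of Spark's MapPartitionsRDD. This is a minimal sketch, not the actual Spark source; the class name is made up:

import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative map-like RDD: one-to-one with its parent, so it reuses the
// parent's Partition objects and pulls records through the parent's iterator.
class MapSketchRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) { // this constructor registers a OneToOneDependency on prev

  // Same partitioning as the parent: just reuse its Partition objects.
  override protected def getPartitions: Array[Partition] = firstParent[T].partitions

  // Delegate to the parent's iterator and transform lazily.
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    firstParent[T].iterator(split, context).map(f)
}

Note that getDependencies is not overridden here: the single-parent RDD constructor already supplies the OneToOneDependency that the default implementation returns.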

For a real, end-to-end example of these overrides, see ShuffledRDD.scala.
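Second, the produce-records-from-scratch case: a self-contained sketch of a source RDD with no parent, which therefore has to define its own Partition class and build the partition array itself. All names are hypothetical:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical Partition: carries the half-open range this slice generates.
class RangeSketchPartition(override val index: Int, val start: Long, val end: Long)
  extends Partition

// Hypothetical source RDD: deps = Nil because there is no parent RDD.
class RangeSketchRDD(sc: SparkContext, total: Long, numSlices: Int)
  extends RDD[Long](sc, Nil) {

  // Called once; note that partitions(i).index == i must hold.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices) { i =>
      new RangeSketchPartition(i, i * total / numSlices, (i + 1) * total / numSlices)
    }

  // Generate this slice's records from scratch; no parent iterator involved.
  override def compute(split: Partition, context: TaskContext): Iterator[Long] = {
    val p = split.asInstanceOf[RangeSketchPartition]
    (p.start until p.end).iterator
  }
}

Something like new RangeSketchRDD(sc, 1000L, 4).collect() would then yield 0 to 999, computed 250 numbers per task.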

Partition definition

An introduction to the Partition trait:

trait Partition extends Serializable {
  /**
   * Get the partition's index within its parent RDD
   */
  def index: Int

  // A better default implementation of HashCode
  override def hashCode(): Int = index

  override def equals(other: Any): Boolean = super.equals(other)
}

Normally, each RDD has its own Partition class (e.g. ShuffledRDD pairs with ShuffledRDDPartition), as sketched below.
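A sketch of that pairing, close to the real ShuffledRDDPartition but quoted from memory, so treat it as illustrative. The partition class is typically just a small Serializable holder for the index plus whatever per-partition state compute needs:

import org.apache.spark.Partition

// A ShuffledRDD-style partition carries nothing but its index: compute() only
// needs to know which shuffle bucket to fetch.
class ShuffleLikePartition(val idx: Int) extends Partition {
  override val index: Int = idx
}

// A file-reading RDD's partition, by contrast, would carry per-split state
// for compute() to use (the path and byte range here are illustrative).
class FileSlicePartition(
    override val index: Int,
    val path: String,
    val offset: Long,
    val length: Long)
  extends Partition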
