Number of tasks not equal to number of partitions in Spark -


i have spark application following

  1. read files s3
  2. group above data 'key' , generate counts per-key
  3. persist key-value pairs db

i have modeled problem follows

  1. obtain list of files in driver program , use sc.parallelize generate rdd of filenames. trying control numberofpartitions here using sc.parallelize(filenamearray, sizeoffilenamearray) - let's call filenamesrdd
  2. download contents each file s3 in parallel , map user defined objects - let's call rdd objectsrdd
  3. generate pairrdd objectsrdd
  4. use reducebykey obtain counts per key - let's call rdd countsrdd. currently due bug, have numberofpartitions countsrdd set 1
  5. use foreachpartition persist countsrdd db

i have 2 environments running application

  • test - 1 machine 2 cpus. value spark.default.parallelism = 4
  • prod - 2 machines 16 cpus each. value spark.default.parallelism = 32

as expected, job executes in 2 stages

  • stage 1 : filenamesrdd -> objectsrdd -> pairrdd
  • stage 2 : pairrdd -> countsrdd -> persisttodb

i observing in prod environment, numberoftasks generated both stages 1 , 2 doesn't equal numberofparitions in corresponding rdds. confirmed value numberofpartitions printing out. here example

numberoffiles = 100

test environment

  • stage1

    • expectation : numberoftasks = 100, numberofparitions = 100 objectsrdd , pairrdd
    • observation : matches expectation
  • stage2

    • expectation : numberoftasks = 1, numberofpartitions = 1 countsrdd
    • reality : matches expectation

prod environment

  • stage1

    • expectation : numberoftasks = 100, numberofpartitions = 100 objectsrdd , pairrdd
    • observation : numberoftasks = 16, numberofpartitions = 100 objectsrdd , pairrdd
  • stage2

    • expectation : numberoftasks = 1, numberofpartitions = 1 countsrdd
    • observation : numberoftasks = 16, numberofpartitions = 1 countsrdd

i have read through lot of material , have seen instances , explanations numberofpartitions != numberoftasks. figure out going on.

it possible 2 environments have different configuration values. can view configurations in history page "environment" tab. suggest comparing test , prod environment settings.


Comments