i have spark application following
- read files s3
- group above data 'key' , generate counts per-key
- persist key-value pairs db
i have modeled problem follows
- obtain list of files in driver program , use
sc.parallelize
generate rdd of filenames. trying controlnumberofpartitions
here usingsc.parallelize(filenamearray, sizeoffilenamearray)
- let's callfilenamesrdd
- download contents each file s3 in parallel , map user defined objects - let's call rdd
objectsrdd
- generate
pairrdd
objectsrdd
- use
reducebykey
obtain counts per key - let's call rddcountsrdd
. currently due bug, havenumberofpartitions
countsrdd
set 1 - use
foreachpartition
persistcountsrdd
db
i have 2 environments running application
- test - 1 machine 2 cpus. value
spark.default.parallelism
= 4 - prod - 2 machines 16 cpus each. value
spark.default.parallelism
= 32
as expected, job executes in 2 stages
- stage 1 :
filenamesrdd
->objectsrdd
->pairrdd
- stage 2 :
pairrdd
->countsrdd
-> persisttodb
i observing in prod environment, numberoftasks
generated both stages 1 , 2 doesn't equal numberofparitions
in corresponding rdds. confirmed value numberofpartitions
printing out. here example
numberoffiles
= 100
test environment
stage1
- expectation :
numberoftasks
= 100,numberofparitions
= 100objectsrdd
,pairrdd
- observation : matches expectation
- expectation :
stage2
- expectation :
numberoftasks
= 1,numberofpartitions
= 1countsrdd
- reality : matches expectation
- expectation :
prod environment
stage1
- expectation :
numberoftasks
= 100,numberofpartitions
= 100objectsrdd
,pairrdd
- observation :
numberoftasks
= 16,numberofpartitions
= 100objectsrdd
,pairrdd
- expectation :
stage2
- expectation :
numberoftasks
= 1,numberofpartitions
= 1countsrdd
- observation :
numberoftasks
= 16,numberofpartitions
= 1countsrdd
- expectation :
i have read through lot of material , have seen instances , explanations numberofpartitions != numberoftasks
. figure out going on.
it possible 2 environments have different configuration values. can view configurations in history page "environment" tab. suggest comparing test , prod environment settings.
Comments
Post a Comment