I have a Spark application that does the following:
- Reads files from S3
- Groups the above data by 'key' and generates counts per key
- Persists the key-value pairs to a DB
I have modeled the problem as follows:
- Obtain the list of files in the driver program and use sc.parallelize to generate an RDD of file names. I am trying to control numberOfPartitions here using sc.parallelize(fileNameArray, sizeOfFileNameArray) - let's call it fileNamesRDD
- Download the contents of each file from S3 in parallel and map them to user-defined objects - let's call this RDD objectsRDD
- Generate a pairRDD from objectsRDD
- Use reduceByKey to obtain counts per key - let's call this RDD countsRDD. Currently, due to a bug, I have numberOfPartitions of countsRDD set to 1
- Use foreachPartition to persist countsRDD to the DB
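For reference, the steps above could be sketched roughly as follows. This is a minimal PySpark-style sketch, not the actual application code: the choice of Python is an assumption, `load_records` is a hypothetical function that downloads and parses one S3 file, the `"key"` field name is made up for illustration, and `flatMap` assumes each file yields several records. The persistence step with foreachPartition is left out.

```python
from operator import add


def to_pair(record):
    # Turn one parsed record into a (key, 1) pair; "key" is a hypothetical field name.
    return (record["key"], 1)


def build_counts(sc, file_names, load_records):
    # sc: an existing SparkContext.
    # load_records: hypothetical function that downloads one S3 file and
    # returns an iterable of parsed records (dicts with a "key" field).
    num_files = max(len(file_names), 1)

    # One partition per file name, mirroring
    # sc.parallelize(fileNameArray, sizeOfFileNameArray).
    file_names_rdd = sc.parallelize(file_names, num_files)
    objects_rdd = file_names_rdd.flatMap(load_records)
    pair_rdd = objects_rdd.map(to_pair)

    # numPartitions=1 reproduces the single-partition countsRDD described above.
    counts_rdd = pair_rdd.reduceByKey(add, numPartitions=1)

    # Print partition counts so they can be compared with the task counts in the UI.
    for name, rdd in [("objectsRDD", objects_rdd),
                      ("pairRDD", pair_rdd),
                      ("countsRDD", counts_rdd)]:
        print(name, "partitions:", rdd.getNumPartitions())
    return counts_rdd
```

Printing getNumPartitions() for each RDD is how the partition counts quoted below were confirmed.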
I have 2 environments in which I am running this application:
- Test - 1 machine with 2 CPUs. The value of spark.default.parallelism = 4
- Prod - 2 machines with 16 CPUs each. The value of spark.default.parallelism = 32
As expected, the job executes in 2 stages:
- Stage 1: fileNamesRDD -> objectsRDD -> pairRDD
- Stage 2: pairRDD -> countsRDD -> persistToDB
I am observing that in the prod environment, the numberOfTasks generated for both stages 1 and 2 doesn't equal the numberOfPartitions in the corresponding RDDs. I have confirmed the value of numberOfPartitions by printing it out. Here is an example:
numberOfFiles = 100
Test environment
Stage 1
- Expectation: numberOfTasks = 100, numberOfPartitions = 100 for objectsRDD, pairRDD
- Observation: matches expectation
Stage 2
- Expectation: numberOfTasks = 1, numberOfPartitions = 1 for countsRDD
- Observation: matches expectation
Prod environment
Stage 1
- Expectation: numberOfTasks = 100, numberOfPartitions = 100 for objectsRDD, pairRDD
- Observation: numberOfTasks = 16, numberOfPartitions = 100 for objectsRDD, pairRDD
Stage 2
- Expectation: numberOfTasks = 1, numberOfPartitions = 1 for countsRDD
- Observation: numberOfTasks = 16, numberOfPartitions = 1 for countsRDD
I have read through a lot of material looking for instances and explanations where numberOfPartitions != numberOfTasks. Please help me figure out what is going on.
It is possible that the 2 environments have different configuration values. You can view the configurations in the history page under the "Environment" tab. I suggest comparing the test and prod environment settings.
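One way to do that comparison, once the settings from each environment have been copied out, is a small diff over two dictionaries. This is only a sketch: the helper name `diff_settings` is made up, the two spark.default.parallelism values come from the question, and the spark.app.name entry is a hypothetical placeholder.

```python
def diff_settings(test_conf, prod_conf):
    # Return {key: (test_value, prod_value)} for every setting that differs
    # between the two environments or is missing from one of them.
    keys = sorted(set(test_conf) | set(prod_conf))
    return {
        k: (test_conf.get(k), prod_conf.get(k))
        for k in keys
        if test_conf.get(k) != prod_conf.get(k)
    }


# Values for spark.default.parallelism are the ones given in the question;
# spark.app.name is a hypothetical entry that happens to match in both.
test_conf = {"spark.default.parallelism": "4", "spark.app.name": "counts-job"}
prod_conf = {"spark.default.parallelism": "32", "spark.app.name": "counts-job"}

for key, (test_value, prod_value) in diff_settings(test_conf, prod_conf).items():
    print(key, "test:", test_value, "prod:", prod_value)
```

In a running PySpark application, the same kind of dictionary could be built with dict(sc.getConf().getAll()) in each environment instead of copying values by hand from the "Environment" tab.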