During a load test of the application (which uses dynamic load reporting services), the whole application stopped working because one replica of a stateful partition raises the following warning:
    Warning System.RAP IStatefulServiceReplica.ChangeRole(S)Duration Thu, 21 Jul 2016 3:36:03 GMT Infinity 131135817636324745 false false Start Time (UTC): 2016-07-21 13:35:43.632
This happens after load balancing of the replica: a 4th replica of the partition was created even though the target replica set size is 3. If Service Fabric killed that replica the application would be fine (the primary and the two other secondaries are up), but instead the whole thing jams. (From the logging I can see that at least 10k events still need to be processed, yet everything stops.)
In the images above you can see the details of this particular replica. The differences between this replica and the other secondary replicas are in the following values:
- Read status
- Write status
- Current service operation
- Queue memory size (in the replication queue)
- First sequence number (in the replication queue)
- Last replication operation received time UTC
- Last copy operation received time UTC
- Last acknowledgement sent time UTC
I find it odd that the replica status says Ready and not Reconfiguring anymore, while the read/write status says it is still reconfiguring. I'm running the newest SDK (2.1.163, released 18-07-2016). I thought a bugfix for this was included there; although it became harder to reproduce, it still occurred. Does anyone know what might be causing this or how to fix it?
Edit: screenshot of the failing partition

Edit: results of debugging, based on the answer from Vaclav (22-7-2016)
After Vaclav's response I started adding logging in RunAsync to determine what was causing the problem. As Vaclav pointed out, part of the code did not exit when cancellation was requested. The code section it gets stuck in seems to be native Service Fabric code:
    using (ITransaction tx = StateManager.CreateTransaction())
    {
        await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
        await tx.CommitAsync();
    }
The queue is a ReliableQueue, the time-out is set to the default of 4 seconds, and the cancellationToken is the one passed to RunAsync. After adding logging between each line I got the following logging pattern:
    // Pre transaction
    using (ITransaction tx = StateManager.CreateTransaction())
    {
        // Pre dequeue
        await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
        // Dequeued
        await tx.CommitAsync();
        // Committed
    }
    // Post transaction
At each of these lines I also logged the value of IsCancellationRequested, and a background task logs when the cancellation request fires. As a result I got, for example, this:
    Pre transaction: false
    Pre dequeue: false
    Dequeued: false
    CancellationTokenFired: true
The precise location varies, but the last log entry before CancellationTokenFired is always one of:
- Pre transaction
- Pre dequeue
- Dequeued
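For reference, here is a minimal sketch of how this instrumentation could be wired up. The Log helper is hypothetical; queue, StateManager, _queueTimeout and cancellationToken are the same as in the snippet above, and the Register callback stands in for the background task that reports when cancellation fires:

    // Sketch only: Log() is a hypothetical helper; everything else is as in the snippet above.
    using (CancellationTokenRegistration registration =
        cancellationToken.Register(() => Log("CancellationTokenFired: true")))
    {
        Log($"Pre transaction: {cancellationToken.IsCancellationRequested}");
        using (ITransaction tx = StateManager.CreateTransaction())
        {
            Log($"Pre dequeue: {cancellationToken.IsCancellationRequested}");
            await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
            Log($"Dequeued: {cancellationToken.IsCancellationRequested}");
            await tx.CommitAsync();
            Log($"Committed: {cancellationToken.IsCancellationRequested}");
        }
        Log($"Post transaction: {cancellationToken.IsCancellationRequested}");
    }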
As stated before, this is on the most recent SDK (18-07-2016), which supposedly contained a bug-fix for a similar problem. The problem also occurred on an older SDK, and more frequently there. On the new version it is still reproducible on every run.
Answer (Vaclav):

This warning means your service isn't exiting RunAsync when the primary replica of your service is changing role during a reconfiguration (look at the health warning in your last screenshot). Make sure you honor that cancellation token in every possible code path. The same applies to your communication listeners - make sure you respond to CloseAsync().
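As an illustration only (not your actual service), a minimal sketch of a RunAsync loop that honors the token might look like this; the queue name, item type and _queueTimeout are placeholders:

    // Sketch, assuming the Microsoft.ServiceFabric.Services / .Data packages.
    using System;
    using System.Fabric;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.ServiceFabric.Data;
    using Microsoft.ServiceFabric.Data.Collections;
    using Microsoft.ServiceFabric.Services.Runtime;

    internal sealed class Worker : StatefulService
    {
        private static readonly TimeSpan _queueTimeout = TimeSpan.FromSeconds(4);

        public Worker(StatefulServiceContext context) : base(context) { }

        protected override async Task RunAsync(CancellationToken cancellationToken)
        {
            var queue = await this.StateManager
                .GetOrAddAsync<IReliableQueue<string>>("events");

            while (true)
            {
                // Throws OperationCanceledException as soon as Service Fabric asks this
                // replica to change role or close, which lets RunAsync exit promptly.
                cancellationToken.ThrowIfCancellationRequested();

                using (ITransaction tx = this.StateManager.CreateTransaction())
                {
                    ConditionalValue<string> item =
                        await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);

                    if (item.HasValue)
                    {
                        // Process item.Value here, then commit the dequeue.
                        await tx.CommitAsync();
                    }
                }
            }
        }
    }

The same idea applies to a communication listener: its CloseAsync() should tear the listener down promptly so the role change can complete.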
Given what you're saying, here's what happened:
- We built a new replica on a new node (likely due to load balancing). At this point you temporarily have 4 replicas until the reconfiguration completes.
- We attempt to swap primaries to the new replica.
- Your current primary is told to change role, which means cancelling RunAsync and closing the communication listeners.
- Your current primary isn't completing the role change - either RunAsync isn't exiting or the communication listeners aren't closing.
- The reconfiguration is stuck waiting for the current primary to finish changing role.
- Health warnings are issued.
Once the reconfiguration completes, the replica set size is reduced back to the target of 3.

We won't kill the slow replica, because we don't know whether your application is fine - maybe it's taking a long time to safely process valuable data - we just don't know. Service Fabric is paranoid about safety and won't do anything that could possibly cause your service to lose data.

Service Fabric Explorer unfortunately doesn't show the reconfiguring state; it shows the expected end result. If you run Get-ServiceFabricPartition in PowerShell, it will show the Reconfiguring state of the partition.
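If you prefer to check this from code instead of PowerShell, roughly the same information is available through FabricClient; this is a sketch against the local cluster, and the service URI is a placeholder:

    // Sketch: query the partition status programmatically, roughly equivalent to
    // Get-ServiceFabricPartition. The service URI below is a placeholder.
    using System;
    using System.Fabric;
    using System.Fabric.Query;
    using System.Threading.Tasks;

    internal static class PartitionStatusCheck
    {
        public static async Task PrintAsync()
        {
            var fabricClient = new FabricClient();
            ServicePartitionList partitions = await fabricClient.QueryManager
                .GetPartitionListAsync(new Uri("fabric:/MyApp/MyStatefulService"));

            foreach (Partition partition in partitions)
            {
                // PartitionStatus reads "Reconfiguring" while a reconfiguration
                // like the one described above is still in progress.
                Console.WriteLine($"{partition.PartitionInformation.Id}: {partition.PartitionStatus}");
            }
        }
    }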