During a load test of the application (which uses dynamic load reporting services), the whole application stopped working because one replica of a stateful partition raises the following warning:
    Warning System.RAP IStatefulServiceReplica.ChangeRole(S)Duration Thu, 21 Jul 2016 3:36:03 GMT Infinity 131135817636324745 false false Start Time (UTC): 2016-07-21 13:35:43.632
This happens after load balancing of the replica: a 4th replica of the partition was created even though the target replica set size is 3. If Service Fabric killed that replica the application would be fine (the primary and the two other secondaries are up), but instead the whole thing jams. (From the logging I can see that at least 10k events still need to be processed, yet everything stops.)
In the images above you can see the details of this particular replica. The differences between this replica and the other secondary replicas are in the following values:
- Read status
- Write status
- Current service operation
- Queue memory size (in the replication queue)
- First sequence number (in the replication queue)
- Last replication operation received time UTC
- Last copy operation received time UTC
- Last acknowledgement sent time UTC
I find it odd that the replica status says Ready and not Reconfiguring anymore, while the read/write status says it is still reconfiguring. I'm running the newest SDK (2.1.163, released 18-07-2016). I thought a bugfix for this was included there; although it became harder to reproduce, it still occurred. Does anyone know what might be causing this or how to fix it?
Edit: screenshot of the failing partition

Edit: results of debugging, based on the answer from Vaclav (22-7-2016)
After Vaclav's response I started adding logging in RunAsync to determine what was causing the problem. As Vaclav pointed out, part of the code did not exit when cancellation was requested. The code section it gets stuck in seems to be native Service Fabric code:
    using (ITransaction tx = StateManager.CreateTransaction())
    {
        await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
        await tx.CommitAsync();
    }
The queue is a ReliableQueue, the time-out is set to the default of 4 seconds, and the cancellationToken is the one passed to RunAsync. After adding logging between each line I got the following logging pattern:
    // Pre transaction
    using (ITransaction tx = StateManager.CreateTransaction())
    {
        // Pre dequeue
        await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
        // Dequeued
        await tx.CommitAsync();
        // Committed
    }
    // Post transaction
At each of these lines I also logged the value of IsCancellationRequested, and a background task logs when the cancellation request fires. As a result I got, for example, this:
    Pre transaction: false
    Pre dequeue: false
    Dequeued: false
    CancellationTokenFired: true
The precise location varies, but the last log entry before CancellationTokenFired is always one of:
- Pre transaction
- Pre dequeue
- Dequeued
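For reference, here is a minimal sketch of how this instrumentation could be wired up. The Log helper is hypothetical; queue, StateManager, _queueTimeout and cancellationToken are the same as in the snippet above, and the Register callback stands in for the background task that reports when cancellation fires:

    // Sketch only: Log() is a hypothetical helper; everything else is as in the snippet above.
    using (CancellationTokenRegistration registration =
        cancellationToken.Register(() => Log("CancellationTokenFired: true")))
    {
        Log($"Pre transaction: {cancellationToken.IsCancellationRequested}");
        using (ITransaction tx = StateManager.CreateTransaction())
        {
            Log($"Pre dequeue: {cancellationToken.IsCancellationRequested}");
            await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
            Log($"Dequeued: {cancellationToken.IsCancellationRequested}");
            await tx.CommitAsync();
            Log($"Committed: {cancellationToken.IsCancellationRequested}");
        }
        Log($"Post transaction: {cancellationToken.IsCancellationRequested}");
    }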
As stated before, this is on the most recent SDK (18-07-2016), which supposedly contained a bug-fix for a similar problem. The problem also occurred on an older SDK, and more frequently there. On the new version it is still reproducible on every run.
Answer (Vaclav):

This warning means your service isn't exiting RunAsync when the primary replica of your service is changing role during a reconfiguration (look at the health warning in your last screenshot). Make sure you honor that cancellation token in every possible code path. The same applies to your communication listeners - make sure you respond to CloseAsync().
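As an illustration only (not your actual service), a minimal sketch of a RunAsync loop that honors the token might look like this; the queue name, item type and _queueTimeout are placeholders:

    // Sketch, assuming the Microsoft.ServiceFabric.Services / .Data packages.
    using System;
    using System.Fabric;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.ServiceFabric.Data;
    using Microsoft.ServiceFabric.Data.Collections;
    using Microsoft.ServiceFabric.Services.Runtime;

    internal sealed class Worker : StatefulService
    {
        private static readonly TimeSpan _queueTimeout = TimeSpan.FromSeconds(4);

        public Worker(StatefulServiceContext context) : base(context) { }

        protected override async Task RunAsync(CancellationToken cancellationToken)
        {
            var queue = await this.StateManager
                .GetOrAddAsync<IReliableQueue<string>>("events");

            while (true)
            {
                // Throws OperationCanceledException as soon as Service Fabric asks this
                // replica to change role or close, which lets RunAsync exit promptly.
                cancellationToken.ThrowIfCancellationRequested();

                using (ITransaction tx = this.StateManager.CreateTransaction())
                {
                    ConditionalValue<string> item =
                        await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);

                    if (item.HasValue)
                    {
                        // Process item.Value here, then commit the dequeue.
                        await tx.CommitAsync();
                    }
                }
            }
        }
    }

The same idea applies to a communication listener: its CloseAsync() should tear the listener down promptly so the role change can complete.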
Given what you're saying, here's what happened:
- We built a new replica on a new node (likely due to load balancing). At this point you temporarily have 4 replicas until the reconfiguration completes.
- We attempt to swap primaries to the new replica.
- Your current primary is told to change role, which means cancelling RunAsync and closing the communication listeners.
- Your current primary isn't completing the role change - either RunAsync isn't exiting or the communication listeners aren't closing.
- The reconfiguration is stuck waiting for the current primary to finish changing role.
- Health warnings are issued.
Once the reconfiguration completes, the replica set size is reduced back to the target of 3.

We won't kill the slow replica, because we don't know whether your application is fine - maybe it's taking a long time to safely process valuable data - we just don't know. Service Fabric is paranoid about safety and won't do anything that could possibly cause your service to lose data.

Service Fabric Explorer unfortunately doesn't show the reconfiguring state; it shows the expected end result. If you run Get-ServiceFabricPartition in PowerShell, it will show the Reconfiguring state of the partition.
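If you prefer to check this from code instead of PowerShell, roughly the same information is available through FabricClient; this is a sketch against the local cluster, and the service URI is a placeholder:

    // Sketch: query the partition status programmatically, roughly equivalent to
    // Get-ServiceFabricPartition. The service URI below is a placeholder.
    using System;
    using System.Fabric;
    using System.Fabric.Query;
    using System.Threading.Tasks;

    internal static class PartitionStatusCheck
    {
        public static async Task PrintAsync()
        {
            var fabricClient = new FabricClient();
            ServicePartitionList partitions = await fabricClient.QueryManager
                .GetPartitionListAsync(new Uri("fabric:/MyApp/MyStatefulService"));

            foreach (Partition partition in partitions)
            {
                // PartitionStatus reads "Reconfiguring" while a reconfiguration
                // like the one described above is still in progress.
                Console.WriteLine($"{partition.PartitionInformation.Id}: {partition.PartitionStatus}");
            }
        }
    }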