[Css-csts] CSTS FW abort and termination handling

Mon Jan 31 10:10:28 EST 2011

CSTSWG colleagues ---

In the process of responding to my action item from the London meeting
regarding Started and Acknowledged procedures, I discovered some
ambiguities, inconsistencies, and (I believe) errors with the way that
aborts and terminations are handled in the CSTS FW, in particular in the
way that they are communicated between the Association Control (AC)
procedure and the other procedures. These problems are more or less the
same across all of the procedures, so the solutions are mostly
consistent across all procedures. Unfortunately, it is not a simple
matter of "this is right, and that is wrong", but more of N things being
mutually inconsistent, and we have to decide which subset is "right" and
make the rest consistent. In my note below I try to identify as many of
the specific problems as I can. In several cases I suggest some
solutions. Whether the CSTSWG agrees with those solutions or decides on
different ones, there will be a considerable amount of rewriting to
bring the affected sections of the book into alignment. I will be happy
to help with the specific rewriting, but the WG needs to decide on the
general direction of the corrections.

Best regards,

John

ABORT HANDLING

There are a lot of little errors regarding abort handling which
cumulatively break the communication between the AC procedure and the
other procedures. It's hard to know where to start, but the approach
that I'll use is to start with the detection of an invalid PDU by a
procedure, and walk through the chain of events and actions that are
triggered. 

All of the stateful procedures (that is, all procedures other than
Information Query) have an Incoming Event 'invalid PDU'. For all of
these stateful procedures, the reference for the 'invalid PDU' event in
the corresponding Event Description Reference table is to section
3.2.3.6, which states "A PDU shall be considered invalid if ...". So
far, so good.

For all of the stateful procedures except Throw Event (which I'll
address in a bit), the response in state 1 is "peer abort 'protocol
error'" and in state 2 is "peer abort 'protocol error' -> 1". None of the
stateful procedures has an explicit "peer abort 'protocol error'"
action, but rather all of them define only a generic 'peer abort 'xxx''
action, which is defined as "raise 'peer abort invocation' event with
diagnostic set to 'xxx' to the Association Control procedure 4.2.2.5".
However, the requirement that is referenced, 4.2.2.5, is not a generic
requirement for aborts in general, but one specific to invalid PDUs:
"Any procedure other than the Association Control procedure shall issue
an 'abort' event to the Association Control procedure if it receives an
invalid PDU (see 3.2.3.6)". 

Problem/Issue #1 is that the diagnostic for an invalid PDU is set to
'protocol error' in the procedure state tables. I seem to recall that at
some point we had both 'invalid PDU' and 'protocol error' PEER-ABORT
diagnostics, but decided to collapse both events into the single
'protocol error' diagnostic. However, revisiting the definitions of both
of these, there are some problems. The definition of 'protocol error' is
given in 3.6.2.2.(d) as "the local application detected an error in the
sequencing of service operations." The reference for the definition of
'invalid PDU' in the procedure state tables is to section 3.2.3.6, which
states:

"A PDU shall be considered invalid if:

a)      it contains an unrecognized operation type or a parameter of the
wrong type, or

b)      it is otherwise not decodable, or

c)       it is invalid in the current state of the procedure, or

d)      the procedure type is not one of the specified service
procedures."

According to these two definitions (invalid PDU and protocol error),
'protocol error' is only a subset of  'invalid PDU'. E.g., a
correctly-sequenced PDU that contains a parameter of the wrong type is
an invalid PDU according to condition (a), but it's not a "protocol
error" because it's not out of sequence. At first I thought the simple
solution would be to redefine protocol error as invalid PDU is currently
defined (i.e., cases (a) - (d) above) and use "protocol error"  instead
of "invalid PDU" throughout the procedure state tables. However, I then
realized that case (d) can never be detected by an individual procedure
instance - it can only be detected by the AC procedure, in its role as
"traffic cop" for the service. So I think that it would be useful to
split case (d) off as its own error event (say, 'unknown procedure
type') that's only detected by, and used in the state table for, the
Association Control procedure.
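
A minimal Python sketch of the split suggested above may help. The names (InvalidReason, detected_by, is_sequencing_error) are hypothetical and not taken from the FW; the sketch only encodes the reading that condition (d) would be caught by the AC procedure, and that only condition (c) fits the 3.6.2.2 (d) definition of 'protocol error' as a sequencing error.

```python
from enum import Enum


class InvalidReason(Enum):
    """The four 'invalid PDU' conditions of 3.2.3.6 (labels are illustrative)."""
    BAD_OPERATION_OR_PARAM_TYPE = "a"
    NOT_DECODABLE = "b"
    INVALID_IN_CURRENT_STATE = "c"
    UNKNOWN_PROCEDURE_TYPE = "d"


def detected_by(reason):
    """Who detects the condition under the proposed split (an assumption)."""
    if reason is InvalidReason.UNKNOWN_PROCEDURE_TYPE:
        return "Association Control"
    return "individual procedure"


def is_sequencing_error(reason):
    """True only for the condition that matches 'protocol error' as defined."""
    return reason is InvalidReason.INVALID_IN_CURRENT_STATE
```

Under this reading, a wrong-type parameter (condition a) is an invalid PDU but not a sequencing error, which is exactly the mismatch described in Problem/Issue #1.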

Problem #2: As mentioned earlier, for the Throw Event procedure the
response in the state table to 'invalid PDU' is not "peer abort
'protocol error'" like the other stateful procedures, but rather in
state 1 is "peer abort 'xxx'" and in state 2 is "peer abort 'xxx' -> 1".
The abstract form of the peer abort should not be in the Throw Event
state table. Assuming that we continue to use 'protocol error' as the
diagnostic that's raised by procedures in response to an invalid PDU
event, the 'xxx' should be changed to 'protocol error' (giving "peer
abort 'protocol error'") for consistency with the other procedures.

Problem #3 is that the reference for the generic 'peer abort 'xxx''
action is currently specific to the invalid PDU case, but clearly the
"xxx" form intends for it to be general. Another valid reference is
paragraph 4.2.2.4, which states "If any procedure other than the
Association Control procedure initiates an abort, it shall issue an
'abort' event to the Association Control procedure (see 4.2.2.2)." 

Problem #4 is the inconsistency and ambiguity between the requirement in
4.2.2.5 and the way that the procedure state tables implement it.
4.2.2.5 calls it an 'abort' event, whereas the procedure state tables
call it a 'peer abort invocation' event. The term 'peer abort' is used
as the incoming event in the AC state table, and so that term should be
used in 4.2.2.5 and the descriptions of the 'peer abort xxx' action for
the various procedures ('peer abort invocation' event is particularly
misleading, because it implies that the procedure itself is somehow
invoking the PEER-ABORT invocation rather than just signaling a
peer-abortable event to the AC procedure).

-- Problem #4a: 4.2.2.5 doesn't state what kind of abort/peer abort
invocation event is to be raised. The same incoming event should
uniformly result in the same peer abort diagnostic. All of the stateful
procedures (except Throw Event) use the diagnostic 'protocol error', so
this seems to be the value, and it should be so stated in 4.2.2.5, e.g.,
"Any procedure other than the Association Control procedure shall raise
a 'peer abort' event with diagnostic set to 'protocol error' to the
Association Control procedure if it receives an invalid PDU (see
3.2.3.6)" and set to this value in the Throw Event state table.

Now let's ignore for the moment how the procedure has generated the
'peer abort 'protocol error'', and look at how the AC procedure receives
that event. The state table for the AC procedure (table 4-2) has the
Incoming Event 'peer abort xxx', and so the event 'peer abort 'protocol
error'' would presumably enter the AC state table here. When the AC
procedure is in state 2 (bound), the response is defined as:

{peer abort xxx}

'terminate'

'delete service instance'

-> 1

The compound action {peer abort} in the Compound Action Definitions
table (4-6) for the AC procedure is defined as:

'terminate'

'abort'

Problem #5 is that the compound action in table 4-6 has no argument
("xxx"), which doesn't match the state table.

Problem #6 is that the AC procedure terminates the other procedures
twice - first as part of the sequence of actions in the state table
(4-2) and again as part of the compound action {peer abort}. Note that
this dual termination also exists for rows 1, 2, 3, and 6 of table 4-2.
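
The double termination can be seen by mechanically expanding the compound action. The following Python sketch uses only the action names quoted above from tables 4-2 and 4-6; the dictionary layout and the expand helper are illustrative assumptions, and the 'xxx' argument is dropped so that the state-table entry matches the table 4-6 name.

```python
# Compound action as quoted from table 4-6 (AC procedure).
COMPOUND_ACTIONS = {"{peer abort}": ["'terminate'", "'abort'"]}

# State 2 (bound) response to the 'peer abort xxx' incoming event,
# as quoted from table 4-2 (argument omitted for the lookup).
STATE_2_RESPONSE = ["{peer abort}", "'terminate'", "'delete service instance'"]


def expand(actions):
    """Replace each compound action with its component simple actions."""
    flat = []
    for action in actions:
        flat.extend(COMPOUND_ACTIONS.get(action, [action]))
    return flat


expanded = expand(STATE_2_RESPONSE)
# Simple actions that occur more than once after expansion.
duplicates = sorted({a for a in expanded if expanded.count(a) > 1})
```

Here expanded contains 'terminate' twice, once from the compound action and once from the state table row itself.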

Problem #7: the simple action 'abort' references (in table 4-5)
paragraph 4.2.2.5, which states "Any procedure other than the
Association Control procedure shall issue an 'abort' event to the
Association Control procedure if it receives an invalid PDU." This is
obviously an incorrect reference - instead of pointing to something that
specifies the action that the AC procedure is supposed to perform when
it aborts the service instance, it points to the action of the other
procedures.

Given that the procedures are terminated and the service instance itself
will be deleted, instead of the ill-defined 'abort', the appropriate
action would seem to be (PeerAbortInvocation  xxx). 

- Problem #8: there is no normative text for the AC procedure that
explicitly states that to "peer abort" includes invoking the PEER-ABORT
operation! As the state table currently stands, when the AC procedure
gets a peer abort event from another procedure, it (the AC procedure)
will shut down the service instance but does not notify the User (via
the PEER-ABORT operation). Paragraph 4.3.3.1.4.4 states "The Association
Control procedure shall peer-abort the association upon receipt of a
peer abort event from any of the other procedures that constitute the
service", but it doesn't define what is included in a peer abort.
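
As a rough illustration of what a fully specified "peer abort" might include, here is a Python sketch. It assumes that 'terminate' is raised only once, that the ill-defined 'abort' is replaced by (PeerAbortInvocation xxx) as suggested above, and that the actions occur in the order shown; the function name and the ordering are assumptions, not FW text.

```python
def ac_handle_peer_abort(state, diagnostic):
    """Hypothetical AC response to a 'peer abort' event while in state 2 (bound).

    Returns the list of actions performed and the next state.
    """
    if state != 2:
        raise ValueError("this sketch only covers state 2 (bound)")
    actions = [
        "terminate",                          # raised once to all other procedures
        f"PeerAbortInvocation {diagnostic}",  # notify the peer via PEER-ABORT
        "delete service instance",
    ]
    return actions, 1  # '-> 1' in the state table


actions, next_state = ac_handle_peer_abort(2, "protocol error")
```

The point of the sketch is only that invoking PEER-ABORT appears explicitly in the action list, which is what the current normative text lacks.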

The AC procedure state table has a separate Incoming Event 'invalid
protocol data unit'. The two references for this event in the Event
Description References table (4-3) are paragraphs 4.2.2.5 ("Any
procedure other than the Association Control procedure shall issue an
'abort' event to the Association Control procedure if it receives an
invalid PDU.") and 4.2.3 ("On reception of a 'terminate' event all
procedures shall terminate all their activities, shall release their
resources and shall cease to exist (e.g., commit suicide), unless
otherwise specified by the procedures.")

- Problem #9: The reference to paragraph 4.2.2.5 is, again, to peer
aborts caused by other procedures receiving an invalid PDU. But as
discussed above, such events would be raised to the AC procedure via the
'peer abort 'protocol error'' incoming event - none of the other
procedure state tables has an action that would map into this Incoming
Event for the AC procedure. However, paragraph 4.3.3.1.4.5 does state
"The Association Control procedure of the Service Provider shall invoke
a PEERABORT[sic] if it receives an invalid PDU." If the reference for
the 'invalid protocol data unit' event points to 4.3.3.1.4.5 instead of
4.2.2.5, this incoming event would be specific to invalid PDUs that are
specifically detected by the AC procedure and not any of the other
procedures.

--Problem #9a: 4.3.3.1.4.5 should state that the Service Provider shall
do more than just invoke the PEER-ABORT operation; i.e., all of the
other things involved in a peer abort (terminate the other procedures,
delete the service instance). 

--Problem/Issue #9b: What is the distinction between invalid PDUs that
are received by the AC procedure and ones received by the other
procedures? It is pretty obvious that the BIND, UNBIND, and PEER-ABORT
are handled by the AC procedure, but what about a PDU with an invalid
procedure instance identifier? As described under Problem/Issue #1,
this can only be caught by the AC procedure. I think that it would be
useful to distinguish between the invalid PDU conditions that are to be
detected by the AC procedure and the invalid conditions that can be
delegated to the other procedures to detect, and have those separate
definitions referenced by the 'invalid protocol data unit/PDU' event
definitions for the AC and other procedures, respectively. 

Problem #10: The reference to 4.2.3, which specifies what procedures are
supposed to do on reception of a 'terminate' event, doesn't belong here
and should be deleted.

Problem #11: The reference for 'delete service instance' simple action
in table 4-5 is to paragraph 4.3.3.1.3.5, which states "If the UNBIND
invocation is accepted, then the Association Control procedure shall
release the association by issuing a positive UNBIND return." Not only
does the referenced paragraph say nothing about releasing resources, the
'delete service instance' is used not only as part of performing the
UNBIND operation but also (and more numerously) as part of various abort
situations. 

Problem #12: Table 4-5 includes 3 simple actions that are never used in
the state table or by any compound action: 'release resources', 'unbind',
and 'peer abort' (this last one is different from the compound action
{peer abort}).  

Problem #13: The AC state table has a 'protocol abort' Incoming Event
(row 8 of table 4-2). The reference for 'protocol abort' in table 4-3 is
paragraph 4.3.3.1.4.5, which states "The Association Control procedure
of the Service Provider shall invoke a PEERABORT if it receives an
invalid PDU." The correct reference should be to 4.3.3.1.4.6: "The
occurrence of an underlying communication problem may trigger a protocol
abort (see 1.6.1.4.14)."

-- Problem #13a: The reference to 1.6.1.4.14 in 4.3.3.1.4.6 is incorrect
- it should be to 1.6.1.4.15.

-- Problem #13b: References to 'protocol abort' all point to section
1.6.1.4.15, which is simply the definition of the term and not a
normative specification of what happens. A normative specification of
the conditions that cause a protocol abort should be included. Paragraph
4.3.3.1.1.4 comes the closest ("In case of protocol abort the
association shall be aborted") but the distinction between "aborting"
and "peer aborting" is ambiguous at best. This is in contrast to the SLE
service specifications, each of which has a section 4.1.5,
Communications Failure, that spells out exactly what constitutes a
protocol abort and the exact actions to be performed in response to it.

Problem #14: The reference for the event 'not authenticated PDU' in
table 4-3 is given as "a)". If one follows the hot link, it goes to
3.2.3.6 a): 

"A PDU shall be considered invalid if:

a) it contains an unrecognized operation type or a parameter of the
wrong type, or  ..."

This would seem to define an unauthenticated PDU as an invalid PDU.
However, the behavior is different - an unauthenticated PDU should be
ignored (according to the state table) whereas an invalid PDU should
trigger a peer abort. Instead, the references for 'not authenticated
PDU' should be to 3.2.5.1.1 : "An incoming invocation or return shall be
ignored if the credentials parameter cannot be authenticated when, by
management arrangement, credentials are required."

-- Problem #14a: The "action" for 'not authenticated PDU' in table 4-2
is "[ignore]". This is not formally defined in the action table, and has
an undefined syntax (i.e., [square brackets]). I suggest changing these
to 'ignore' and adding 'ignore' to the simple action table with references
3.2.5.1.2 ("If an invocation is ignored, the operation shall not be
performed, and a report of the outcome shall not be returned to the
invoker.") and 3.2.5.1.3 ("If a return is ignored, it shall be as if no
report of the outcome of the operation has been received.")
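
The behavioral difference between an unauthenticated PDU and an invalid PDU can be sketched as follows. The function and parameter names are hypothetical, and checking authentication before validity is an assumption made for the sketch; the FW text quoted here does not state an order.

```python
def respond_to_pdu(credentials_required, authenticated, valid):
    """Illustrative response selection for an incoming PDU.

    'ignore' follows 3.2.5.1.1; the peer abort follows the state tables.
    """
    if credentials_required and not authenticated:
        return "ignore"  # no operation performed, no return sent
    if not valid:
        return "peer abort 'protocol error'"
    return "process"
```

For example, respond_to_pdu(True, False, False) yields "ignore", whereas respond_to_pdu(True, True, False) yields the peer abort, so the two events cannot share the 3.2.3.6 a) reference.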

Before leaving the topic of abort handling, there are a few more issues
unique to the Buffered Data Delivery (BDD) procedure.

In addition to the simple action 'peer abort xxx' that all stateful
procedures have, the BDD procedure has the complex action {peer abort
'xxx'} that has the following component actions:

'stop release timer '

'stop all return timers'

'reinitialize transfer buffer'

'raise 'peer abort 'xxx''

This compound action is cited only once in the BDD state table, as {peer
abort 'protocol error'} when STOP is invoked when the procedure instance
is in the inactive state.

Problem #15: Stopping the release timer and all return timers will occur
as part of the 'terminate' action that the BDD procedure will perform in
response to the 'terminate' event that the AC procedure will issue in
response to the peer abort event raised by the BDD procedure. In other
words, there is no need to perform these actions as part of the {peer
abort 'xxx'} compound action.

Problem #16: There is no 'reinitialize transfer buffer' action in table
4-15. There is the 'initialize transfer buffer' simple action, but the
reference for that action is specific to initialization following a
START invocation. However, it doesn't seem that this is necessary in any
case - once the procedure is terminated as a result of the peer abort,
the only way to make it active again is through a subsequent BIND and
START. That START will initialize the transfer buffer. So 'reinitialize'
should be deleted from the compound action {peer abort xxx}.

Problem #17: 'raise 'peer abort xxx'' should be simply 'peer abort xxx'
to match the name of the simple action (the action itself is to raise
the event).

In summary, there appears to be no reason for the compound action {peer
abort 'xxx'}. It should be deleted from the Compound Action table and
{peer abort 'protocol error'} should be replaced by 'peer abort
'protocol error'' in the BDD state table.

TERMINATING

Now we turn our attention to the "terminating" threads. 

Most of the state changes in the AC procedure result in the AC procedure
raising the 'terminate' event to all other procedures of the service
instance.

Problem #18: The AC procedure raises the 'terminate' event (see tables
4-2 and 4-5), but all of the other procedures show the Incoming Event
and resulting action both as 'terminate xxx' (that is, with an
argument/diagnostic). All of the other procedure state tables, predicate
description tables, and simple action tables should use simply
'terminate'.

Problem #19: The Event Description References for the 'terminate'
incoming event have errors:

a)    For the BDD procedure (table 4-12), in addition to the correct
reference to section 4.2.3, there is also a reference to section
4.5.3.5. However, that section defines the 'terminate' action, not the
'terminate' event, and should be removed as a reference for the
'terminate' event.

b)    For the Unbuffered Data Delivery procedure (table 4-19), there is
a reference to section 4.5.3.5, which defines the termination behavior
for the BDD procedure and therefore should be removed. There is also a
reference to section 4.6.3.4, but that  section defines the 'terminate'
action for the Unbuffered Data Delivery procedure, not the 'terminate'
event, and should be removed as a reference for the 'terminate' event.
The correct reference to include is one to section 4.2.3.

c)    For the Cyclic Report procedure (table 4-26), in addition to the
correct reference to section 4.2.3, there is also a reference to section
4.7.3.5. However, that section defines the 'terminate' action, not the
'terminate' event, and should be removed as a reference for the
'terminate' event.

d)    For the Throw Event procedure (table 4-33), in addition to the
correct reference to section 4.2.3, there is also a reference to section
4.8.3.4. However, that section defines the 'terminate' action, not the
'terminate' event, and should be removed as a reference for the
'terminate' event.

e)    For the Process Data procedure (table 4-42), in addition to the
correct reference to section 4.2.3, there is also a reference to section
4.9.3.8. However, that section defines the 'terminate' action, not the
'terminate' event, and should be removed as a reference for the
'terminate' event.

f)     For the Notification procedure (table 4-50), there is a reference
to section 4.5.3.5, which defines the termination behavior for the BDD
procedure and therefore should be removed. There is also a reference to
section 4.10.3.5, but that  section defines the 'terminate' action for
the Notification procedure, not the 'terminate' event, and should be
removed as a reference for the 'terminate' event. The correct reference
to include is one to section 4.2.3.

Problem #20: The Simple Action References for the 'terminate' action are
mangled for most of the procedures:

a)    For the Unbuffered Data Delivery procedure (table 4-21), there is
a reference to section 4.5.3.5, which defines the termination behavior
for the BDD procedure and therefore should be removed. (However, the
reference to 4.6.3.4 is correct.)

b)    For the Cyclic Report procedure (table 4-28), there is a reference
to section 4.5.3.5, which defines the termination behavior for the BDD
procedure and therefore should be removed. (However, the reference to
4.7.3.5 is correct.)

c)    For the Throw Event procedure (table 4-35), there is a reference
to section 4.5.3.5, which defines the termination behavior for the BDD
procedure and therefore should be removed. (However, the reference to
4.8.3.4 is correct.)

d)    For the Throw Event procedure (table 4-35), there is a reference
to section 4.2.3, which defines the 'terminate' event raised by the AC
procedure and not the 'terminate' action to be performed,  and therefore
should be removed. (However, the reference to 4.9.3.8 is correct.)

e)    For the Notification procedure (table 4-52), there is a reference
to section 4.5.3.5, which defines the termination behavior for the BDD
procedure and therefore should be removed. (However, the reference to
4.10.3.5 is correct.)

The Terminating behavior sections for the stateful procedures contain
some inaccuracies. Most of these sections begin with words along the
lines of "If the association is aborted or unbound or the procedure
raised a peer-abort, the Service Provider shall terminate with the
following actions ...". 

Problem #21: The phrase "If the association is aborted or unbound"
implies that the procedure somehow "knows" the reason that the AC
procedure has told it to terminate. This wording might be okay in an
informative section, but the only thing that the procedure knows in a
normative sense is that the AC procedure has raised the 'terminate'
event.

Problem #22: The phrase " or the procedure raised a peer-abort" is
incorrect. It implies that the procedure is to self-terminate when it
raises a peer abort event, but of course that is not the case. The
procedure must wait until the AC procedure tells it to terminate before
it terminates.
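
The intended chain of events can be sketched in Python as below. The class and method names are hypothetical, and the synchronous call structure is only a convenience of the sketch; what it illustrates is that a procedure raising a peer abort event does not terminate itself, and that every procedure terminates only when the AC procedure raises 'terminate'.

```python
class AssociationControl:
    """Illustrative stand-in for the AC procedure."""

    def __init__(self):
        self.procedures = []
        self.log = []

    def on_peer_abort(self, source, diagnostic):
        self.log.append(f"AC received peer abort '{diagnostic}' from {source.name}")
        # AC raises 'terminate' to every procedure of the service instance.
        for procedure in self.procedures:
            procedure.on_terminate()


class Procedure:
    """Illustrative stand-in for any stateful procedure instance."""

    def __init__(self, name, ac):
        self.name = name
        self.ac = ac
        self.terminated = False
        ac.procedures.append(self)

    def on_invalid_pdu(self):
        # Only signal the event; do NOT self-terminate here.
        self.ac.on_peer_abort(self, "protocol error")

    def on_terminate(self):
        # Termination happens solely in response to AC's 'terminate' event.
        self.terminated = True
        self.ac.log.append(f"{self.name} terminated")


ac = AssociationControl()
bdd = Procedure("BDD", ac)
cr = Procedure("CR", ac)
bdd.on_invalid_pdu()
```

After the call, both procedures are terminated, including the one that never saw the invalid PDU, and in each case the termination was driven by the AC procedure.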

Problem #23: The phrase "the Service Provider shall terminate ... " is
troubling here, because it is the procedure instance that terminates,
not the overall Service Provider. I realize that the CSTS FW uses
"Service Provider" throughout the procedure specifications to mean
"provider instance", but it seems particularly inappropriate here. I
think the right answer would be to change "Service Provider" to
"procedure instance" throughout the FW, but that would be a big job.

INFORMATION QUERY

The Information Query (IC) procedure is stateless, and so far its
behavior is "defined" by a simple reference to the GET operation.
However, this approach is silent on the responsibilities (if any) of the
IC procedure with respect to raising peer abort events and terminating.

Can the IC procedure raise a peer abort event to the AC procedure? For
example, if an incoming GET invocation is properly identified as such
and has a valid procedure-instance-identifier but contains a parameter
that has a type that is invalid for the GET operation (a peer-abortable
condition), does the IC procedure raise a peer-abort 'protocol error'
event? Or is this a PDU that will never get "routed" to the IC procedure
in the first place - i.e., it is caught as an invalid PDU by the AC
procedure? With reference to the discussion above for Problems #1 and
#9b, this would imply that the invalid PDU conditions  (a) ("it contains
an unrecognized operation type or a parameter of the wrong type") and
(b) ("it is otherwise not decodable") should be enforced by the AC
procedure and not the individual other procedure instances.

Is the IC procedure expected to terminate when the AC procedure raises
that event, and if so, what does terminating the IC procedure mean? 

If the IC procedure cannot raise 'peer abort' events and/or does nothing
in response to a 'terminate' event from AC, section 4.4 should have some
informative NOTES that say so. 
