[Sis-ams] Re: AMS

Wed May 7 11:15:10 EDT 2008

Scott (& Pat),

This is one case where our APL-AMS is a bit non-standard, in that we are
using POSIX message queues as our primary/only transport service.  Using
this method, we have no data from the transport service to indicate a
message's origin.  

An alternative may be to designate the unused supplementary data in
these query messages as "reserved for optional transport service
extensions," indicating that the PTS may optionally include additional
routing/configuration information (i.e., a MAMS endpoint) if required.  The
field would otherwise be ignored by other transport services, which
would accordingly set the supplementary data length to 0 when populating
these messages. This would free the protocol from additional limitations
on the transport service, without placing any unnecessary burden on the
standard transport services.
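
A minimal sketch of that idea, under stated assumptions: the 2-byte
length prefix, the function names, and the queue name "/apl_ams_node7"
are all hypothetical and are not the MAMS wire format.  It only shows
that a zero-length supplement leaves the transport-provided origin in
charge, while a populated supplement overrides it.

```python
import struct

def pack_supplement(reply_endpoint=None):
    """Build the supplementary data: a length prefix plus an optional endpoint.

    Transports that already report a message's origin pass nothing,
    which yields a supplementary data length of 0.
    """
    data = reply_endpoint.encode("utf-8") if reply_endpoint else b""
    return struct.pack("!H", len(data)) + data

def reply_target(supplement, transport_origin=None):
    """Pick where the response goes: the supplement if present, else the origin."""
    (length,) = struct.unpack_from("!H", supplement)
    if length:
        return supplement[2:2 + length].decode("utf-8")
    return transport_origin
```

With a message-queue transport the sender would call
pack_supplement("/apl_ams_node7"); a UDP transport would call
pack_supplement() and rely on the datagram's source address.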

The registrar transmits the you_are_dead message in response to a node's
heartbeat timeout, which is what I was referring to.  Presumably the
echo should be the terminated node's ID, but that is not explicitly
stated.  This is only marginally useful, but more so than populating the
field with a garbage/0 value.

	From that point of view, what would probably be more useful
would be to deliver a Fault.indication when the absence of the registrar
is first noticed, and then deliver some other sort of indication --
something new -- when the reconnection succeeds, ignoring the timeouts
(since they don't change the registrar connection state).  But that
isn't quite perfect either, since the registrar could actually have been
dead for 30 seconds before AMS detected the third missing heartbeat and
notified the application.

That level of feedback could be a useful addition, but I agree that it's
not necessary for the current version.  During the
reconnection/registration period, any subscription attempts should
return an error message, so the purpose of the added fault may be more
to support external error reporting and/or recovery in an
application-specific manner if a fault condition repeats.  

For now, generating the fault.indication on registration and optionally
on reconnection timeout seems a reasonable approach, but perhaps one
left as an implementation decision.  The parameter given to the
fault.indication call (defined as implementation-specific) may specify
the nature of the error condition.  In my implementation it is currently
a single enum value, AMS_FAULT_TIMEOUT, but it could be subdivided into
registration and reconnection timeouts.
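
A rough sketch of how that subdivision might look, written in Python
for brevity rather than in the language of the actual implementation.
Only AMS_FAULT_TIMEOUT exists today; the two fault codes, the state
names, and on_response_timeout() below are hypothetical.

```python
from enum import Enum

class FaultCode(Enum):
    # Hypothetical split of the single AMS_FAULT_TIMEOUT value.
    REGISTRATION_TIMEOUT = "registration_timeout"
    RECONNECT_TIMEOUT = "reconnect_timeout"

class NodeState(Enum):
    REGISTERING = "registering"
    RECONNECTING = "reconnecting"

def on_response_timeout(state, fault_indication, report_reconnect=True):
    """Handle the absence of a registrar reply within N2 seconds.

    Delivers a fault indication (always for registration, optionally for
    reconnection) and returns the name of the message to resend.
    """
    if state is NodeState.REGISTERING:
        fault_indication(FaultCode.REGISTRATION_TIMEOUT)
        return "node_registration"
    if report_reconnect:
        fault_indication(FaultCode.RECONNECT_TIMEOUT)
    return "reconnect"
```

The report_reconnect flag stands in for the implementation decision
mentioned above: with it off, reconnection retries happen silently.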

- David

________________________________

From: Donahue, Pat [mailto:pat.donahue at nasa.gov] 
Sent: Tuesday, April 29, 2008 12:16 PM
To: Scott Burleigh; Edell, David J.
Subject: RE: [Sis-ams] Re: AMS

David,

My UDP implementation does as Scott suggests.  When I get the
REGISTRAR_QUERY, I can read from the Packet what the remoteAddress and
remotePort are.  I just return the CELL_SPEC to that
remoteAddress/remotePort.
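
The reply-to-sender pattern described here can be sketched as follows.
This is an illustration in Python, not the implementation discussed
above: the payloads are placeholders rather than real MAMS encodings,
and answer_query() and demo() are hypothetical names.

```python
import socket

# Placeholder payloads; the real MAMS encoding is not shown here.
REGISTRAR_QUERY = b"registrar_query"
CELL_SPEC = b"cell_spec"

def answer_query(sock):
    """Receive one datagram and, if it is a query, reply to its sender."""
    data, origin = sock.recvfrom(2048)  # origin is (remoteAddress, remotePort)
    if data == REGISTRAR_QUERY:
        sock.sendto(CELL_SPEC, origin)
    return origin

def demo():
    """Loopback round trip: a node queries, the configuration server replies."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as node:
        server.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
        node.bind(("127.0.0.1", 0))
        server.settimeout(2.0)
        node.settimeout(2.0)
        node.sendto(REGISTRAR_QUERY, server.getsockname())
        origin = answer_query(server)
        reply, _ = node.recvfrom(2048)
        return reply, origin == node.getsockname()
```

The query itself carries no return address; the server learns it from
the datagram, which is exactly what a message-queue transport cannot do.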

Pat

Patrick Donahue
(256) 544-5943 office
(256) 721-0726 home
(256) 682-9753 cell 

________________________________

	From: sis-ams-bounces at mailman.ccsds.org
[mailto:sis-ams-bounces at mailman.ccsds.org] On Behalf Of Scott Burleigh
	Sent: Tuesday, April 29, 2008 11:01 AM
	To: Edell, David J.
	Cc: sis-ams at mailman.ccsds.org
	Subject: [Sis-ams] Re: AMS

	Edell, David J. wrote: 

		Scott,

		I'm finishing up our AMS implementation and assembled a
few more miscellaneous comments/questions that I have come across.
Funding on the project is about at an end though (and I'm leaving on
vacation for CA tonight), therefore our first version of AMS can be
considered completed for now.  The APL-AMS implementation is largely,
with the exception of invitations and related functions, compliant with
conformance class 4 as described.  

		I was taking another look at the configuration server
interrogation process.  My current implementation utilizes a shared
memory location for announcing the registrar's location, however I'm
looking into an alternative MAMS-based method to avoid some of the
synchronization and related issues with the shared memory.  

		If I modify the process to use the MAMS registrar_query,
and await a cell_spec/registrar_unknown response though, how would the
CS know where to direct the response as defined?  It seems that the
registrar_query message would require a MAMS endpoint as its
supplemental data, indicating the origin of, and return destination for,
the given message.

	(David, I hope you don't mind, but I think the points you raise
here are important enough to bring forward to the whole working group,
so I'm cc:ing sis-ams on my reply.)

	That's one of the big reasons TCP isn't a really good choice as
a primary transport protocol (which is what is used for MAMS traffic):
it's a bootstrapping problem.  If you use UDP or some other
non-connection-oriented protocol, the transport protocol itself will
give you the origin of the registrar_query; that "echo" location
(host/port, for example) can be opaquely passed up to the
registrar_query handler and then passed back down when the cell_spec or
registrar_unknown is returned.

	You're right, though, that including the response MAMS endpoint
in every MAMS message that is of a query nature is a reasonable
alternative: it removes one constraint on what makes a good primary
transport protocol, at the cost of a little extra transmission.  I
dunno.  My inclination is not to change something that currently seems
to work okay unless there's a really compelling reason, and I'm not sure
we've got one yet.   Anybody have any thoughts on this?  

		It's specified that a Fault.Indication is generated if
no response is received from a node_registration request within N2
seconds.  Shouldn't there be a similar condition for a reconnect message?
I've implemented the same timeout for both states, with the node
generating a fault.indication and then automatically resending the
registration or reconnect message as appropriate.

	I think I see your point.  The node_registration
Fault.indication gives the application an opportunity to decide whether
it wants to keep trying or bail.  Once the node is connected, it is
likely engaged in ongoing application message exchange with its peers;
if the registrar dies while this is going on, the node probably doesn't
want to interrupt its application message exchange activity just because
there's no registrar, so my inclination has been not to bother the node
with reconnection status.  On the other hand, it might be helpful to
tell the node that any new subscriptions it posts aren't going to have
any effect because the registrar is dead.

	From that point of view, what would probably be more useful
would be to deliver a Fault.indication when the absence of the registrar
is first noticed, and then deliver some other sort of indication --
something new -- when the reconnection succeeds, ignoring the timeouts
(since they don't change the registrar connection state).  But that
isn't quite perfect either, since the registrar could actually have been
dead for 30 seconds before AMS detected the third missing heartbeat and
notified the application.
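
A small sketch of the notification scheme described in this paragraph,
assuming three consecutive missed heartbeats mark the registrar as
absent.  The class, its method names, and the two event strings are
hypothetical; "registrar_restored" stands in for the "something new"
indication, which has no defined name.

```python
MISSED_LIMIT = 3  # the third missing heartbeat triggers the fault

class RegistrarMonitor:
    """Tracks registrar liveness and notifies only on state changes."""

    def __init__(self, notify):
        self.notify = notify
        self.missed = 0
        self.connected = True

    def heartbeat_received(self):
        self.missed = 0

    def heartbeat_missed(self):
        self.missed += 1
        # Notify once, when absence is first noticed; later timeouts
        # do not change the connection state and are ignored.
        if self.connected and self.missed >= MISSED_LIMIT:
            self.connected = False
            self.notify("registrar_lost")

    def reconnect_succeeded(self):
        self.missed = 0
        if not self.connected:
            self.connected = True
            self.notify("registrar_restored")
```

The detection lag noted above is visible here: nothing is reported
until the third miss, however long the registrar has been gone.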

	I like the idea of giving the application better information,
but I don't know how really substantial the benefit would be.  Because
we're trying to get a Blue Book out relatively soon, so missions can
have something stable to work from, my inclination is to defer making
significant changes until after we've got some operational experience
and then incorporate the lessons learned into a second version. 

		The reference field of a 'you_are_dead' message is noted
as an echo, however technically there is no originating message when the
message is transmitted in response to a heartbeat timeout.  I'm
assuming the target node's ID as the reference value in this case
(although the field is ignored either way by the node)

	Actually the node shouldn't send you_are_dead other than in
response to a message (either reconnect or heartbeat); a heartbeat
timeout just causes an "imputed termination".  This has various effects
depending on what terminated, but it doesn't produce you_are_dead in any
of those cases.

	You're right, though, that the reference value in a you_are_dead
is only meaningful when the message it's responding to is reconnect, in
which case it's what enables the (nominally blocking) reconnect query to
be unblocked and terminated.  A you_are_dead that is sent in response to
a heartbeat isn't turning off anything that's blocking; its reference
value is the heartbeat source, copied from the heartbeat message, but
that isn't especially useful to the now-officially-dead sender of the
heartbeat.

	Scott
