[Sis-ams] Predictability concerns with AMS, and other questions too

Tue Jan 24 08:39:36 EST 2006

Hello Scott,

I've finished reading the AMS white book, liked some of its features and
have some questions or notes. First of all, I have to say I'm a newbie
so maybe I missed some past discussions on the same topic.

Here are my notes:

1) Predictability issues. There are a number of places where I'm
concerned about predictability. Most of them are related to
"immediate propagation" of certain information to "all" nodes,
registrars or configuration servers. I understand that in most cases the
selected protocol follows the general publish/subscribe communication
model of AMS. But the AMS white book gives me the impression that it is
all so dynamic and so costly that it can hardly be used in hard
real-time (HRT) applications.

Maybe the answer is as follows: An HRT application should avoid dynamic
changes such as subscriptions, terminations, etc., and thus avoid
time-costly and unpredictable operations happening at runtime. An HRT
application should set up the message space(s) in an initialization phase
and then only use regular messaging.

I have two notes on this: First, if what I have written is how AMS is
meant to be used by HRT applications, then perhaps there could be a
section explaining which parts are better done in the
initialization phase and not during the "main computation" phase.
Second, some fault handling might still happen during the computation
phase, and hence the application's schedulability analysis must take it
into account anyway.

Here are parts of the protocol which make it (in my opinion) highly
unpredictable with respect to response time:
- Registrar registration (Section 2.3.2, also 2.3.3, 3.1.5.4, 3.1.6.4,
4.2.3, etc.): After each configuration change (subscription, invitation,
termination, etc.), the registrar propagates it immediately to all nodes
and all other zones. Also, given Node registration (4.2.5), it seems
that all registrars have information on all nodes in all zones and that
propagation is always performed immediately. Is that necessary? Isn't it
inefficient and unpredictable, given limited bandwidth and many
messages being sent at the same moment?
- Configuration service fail-over (2.3.6): The registrar cycles through
all well-known network locations to find a new config server. In the
meantime, as the registrar is not sending heartbeats, all nodes start
cycling too. That might imply a huge number of messages being sent at
the same moment, making the predictability of such a fail-over very poor.

2) Priority of a message: It is mentioned a number of times, but there is
no clear statement on how the priority is used, what the dispatch and
delivery mechanism is for messages with the same and different
priorities, what "higher urgency" exactly means for the protocol and how
the AMS entities participate in this.

3) Hardcoded intervals. The document says that AMS is for communication
both between modules of a ground system and flight system, as well as
between modules located on different spacecraft or between ground
system and spacecraft system. I wonder whether then some of the
following hardcoded numbers are right for all those cases:
- 20 seconds between heartbeats for registrar <-> node; 3 successive
missing heartbeats imply a failure,
- 10 seconds between heartbeats for registrar <-> config server; 3
successive missing heartbeats imply a failure,
- Configuration server location (4.2.2): config_msg_ack should be
received within 5 seconds, otherwise Fault.indication is sent. Is 5
seconds enough for, e.g., space-to-ground communication? Maybe you want
to add some suggestions for the distribution of AMS entities, e.g. one
config server per spacecraft and one for ground, so that communication
within 5 seconds makes more sense.
- Registrar location (4.2.4): The same as the previous item.
4) Configuration service fail-over (2.3.6, 4.2.1): After a new config
server starts, it sends an I_am_running message, and if it receives such
a message it immediately terminates. This can't work, as a scenario with
mutually terminating config servers is likely to happen. I think that a
kind of timestamp ordering or perhaps another negotiation protocol has
to be added to reason about "who was the first" and avoid unnecessary
termination of config servers.
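
As a rough illustration of the timestamp ordering I have in mind, here
is a minimal Python sketch. It assumes that each I_am_running message
carries the sender's start time and a unique server identifier, and that
the clocks are comparable; both are my assumptions, not something the
white book specifies, and all names are hypothetical.

```python
from typing import NamedTuple


class ServerIdentity(NamedTuple):
    # Assumed to be carried in the I_am_running message (hypothetical).
    start_time: float   # when the config server started
    server_id: str      # unique identifier, used only to break ties


def should_terminate(me: ServerIdentity, peer: ServerIdentity) -> bool:
    """Return True if `me` must yield to `peer`.

    The server that started first wins; ties on start_time are broken
    by server_id. Tuples compare field by field, so exactly one of two
    distinct servers yields.
    """
    return peer < me


older = ServerIdentity(start_time=100.0, server_id="cs-a")
newer = ServerIdentity(start_time=105.0, server_id="cs-b")

print(should_terminate(older, newer))  # False: the older server stays
print(should_terminate(newer, older))  # True: the newer server terminates
```

Because both servers evaluate the same total order, they reach opposite
conclusions and only one terminates, which avoids the mutual-termination
scenario.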

5) Subject catalog (2.3.9): In the last two paragraphs, I'd remove the
suggestion about potentially sparse large arrays and keeping the subject
numbers small. There are a number of application-dependent solutions for
this, such as a fixed number of subjects, a hash function, etc.
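
For example, a hash-based catalog does not care how large or sparse the
subject numbers are. The sketch below is only an illustration in Python;
the class and method names are hypothetical and not part of AMS.

```python
# Hypothetical sketch: a subject catalog backed by a hash table (dict),
# so storage grows with the number of subjects, not with the largest
# subject number.

class SubjectCatalog:
    def __init__(self):
        self._entries = {}  # subject number -> subject name

    def add(self, number: int, name: str) -> None:
        self._entries[number] = name

    def lookup(self, number: int):
        """Return the subject name, or None if the number is unknown."""
        return self._entries.get(number)

    def __len__(self) -> int:
        return len(self._entries)


catalog = SubjectCatalog()
catalog.add(7, "telemetry")
catalog.add(4_000_000, "housekeeping")

print(len(catalog))                # 2
print(catalog.lookup(4_000_000))   # housekeeping
```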

6) Node registration (4.2.5): Very strict: there is no option to detect
why registration was rejected and to try to re-register. Neither
Register.request nor register.indication mention how the node is
eventually informed about the reason for rejection (maybe this is the
code in 5.1.5.16?).

7) Node registration (4.2.5): Forwarding an I_am_starting message by a
registrar: some kind of "marking" has to be done to avoid forwarding a
message which was previously forwarded by another registrar (looping of
messages). The same for other forwarding similar to this one (e.g.
2.3.10 - Remote AMS message exchange).
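
One possible form of such "marking" is a unique message identifier that
each registrar remembers once it has forwarded the message. The Python
sketch below is only an illustration under that assumption; the class,
the identifier format and the fully connected topology are hypothetical.

```python
# Hypothetical sketch: duplicate suppression by remembering message ids.

class Registrar:
    def __init__(self, name: str):
        self.name = name
        self.peers = []        # other registrars this one forwards to
        self.seen = set()      # ids of messages already forwarded
        self.forwarded = 0     # number of copies this registrar sent

    def receive(self, msg_id: str) -> None:
        if msg_id in self.seen:
            return             # already handled once: drop, do not loop
        self.seen.add(msg_id)
        for peer in self.peers:
            self.forwarded += 1
            peer.receive(msg_id)


# Three registrars, each forwarding to the other two.
a, b, c = Registrar("a"), Registrar("b"), Registrar("c")
a.peers = [b, c]
b.peers = [a, c]
c.peers = [a, b]

a.receive("I_am_starting#1")
print(a.forwarded + b.forwarded + c.forwarded)  # 6
```

Without the `seen` check, the same message would circulate among the
three registrars indefinitely; with it, each registrar forwards the
message exactly once.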

8) Heartbeats (4.2.7): After it receives "reconnect" from a node, a
registrar should return you_are_dead if it has been operating for more
than 60 seconds. This is not right. First, imagine that a registrar
crashes and a new one is started within x seconds. A node connected
to the original registrar will need 60 seconds to notice that the
registrar is dead, so it can't contact the new registrar sooner
than 60 seconds after the old one crashed (in the worst case). That
means that effectively the node has in the worst case x seconds to i)
locate the new registrar and ii) contact it.
Second, the communication between node and registrar could be
spacecraft-to-spacecraft or even ground-to-spacecraft (unless you
specify or suggest otherwise, as I suggested above) and hence 60 seconds
for a message round trip is not realistic at all.
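
The arithmetic behind the first point can be written down as a small
Python sketch. The function and parameter names are mine (hypothetical);
the 20-second interval, the limit of 3 missed heartbeats and the
60-second limit are the values from the white book.

```python
# Hypothetical sketch of the worst-case reconnect window.

def reconnect_window_s(restart_delay_s: float,
                       heartbeat_interval_s: float = 20.0,
                       missed_limit: int = 3,
                       reconnect_limit_s: float = 60.0) -> float:
    """Worst-case time a node has to locate and contact a new registrar.

    All times are measured from the moment the old registrar crashes.
    """
    # The node notices the failure only after the missed heartbeats.
    detection_s = heartbeat_interval_s * missed_limit
    # The new registrar starts after restart_delay_s and accepts
    # "reconnect" only during its first reconnect_limit_s of operation.
    deadline_s = restart_delay_s + reconnect_limit_s
    return max(0.0, deadline_s - detection_s)


print(reconnect_window_s(5.0))   # 5.0
print(reconnect_window_s(0.0))   # 0.0
```

So the faster the replacement registrar is started, the less time the
node has, and with an immediate restart the window shrinks to zero.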

9) Minor typographic issues: in 4.2.8 you reference 4.2.10, 4.2.11,
4.2.12 and 4.2.13 as "above".

That's it for now. I'm sorry if some answers are quite straightforward
and I missed something in the white book. I look forward to any
comments.

Best regards,
Marek