[Sis-ams] Predictability concerns with AMS, and other questions too

Marek Prochazka Marek.Prochazka at scisys.co.uk
Fri Jan 27 10:07:32 EST 2006


Hi Scott,

thanks for your reply. Before further commenting on some of the points
discussed, I have yet another question: How would you summarize delivery
guarantees/fault model of AMS?

I guess that

1) If a node publishes a message, it doesn't have any guarantees of
delivery to the destination nodes, as it has no knowledge of which
nodes are subscribed. The AMS has that knowledge.

2) If a node announces or queries a message, or if it replies to a
message, it doesn't have any guarantees of delivery either. Is that
right? My feeling is that all fault notifications relate only to wrong
authorization for announce and publish, no "best fit" delivery
point for send, wrong context for reply, etc.

3) However, using heartbeats, each node could figure out which nodes in
the same zone are down.

4) There are certain types of AMS faults reported to nodes.
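To make point 3 concrete, the "3 successive missed heartbeats" fault model (the 20-second interval appears later in this thread) could be sketched as below. This is a hypothetical illustration, not the AMS wire protocol: the node names and the PeerMonitor class are invented, and real AMS heartbeats run between a node and its registrar rather than directly between nodes.

```python
import time

HEARTBEAT_INTERVAL = 20   # seconds between heartbeats (per the spec)
MAX_MISSED = 3            # successive misses that imply a failure

class PeerMonitor:
    """Flags a peer as down once 3 successive heartbeats are missed.

    Hypothetical illustration only; peer names are invented.
    """
    def __init__(self):
        self.last_heard = {}

    def heartbeat(self, peer, now=None):
        """Record that a heartbeat arrived from `peer`."""
        self.last_heard[peer] = time.time() if now is None else now

    def down_peers(self, now=None):
        """Peers silent for longer than 3 heartbeat intervals."""
        now = time.time() if now is None else now
        deadline = MAX_MISSED * HEARTBEAT_INTERVAL
        return sorted(p for p, t in self.last_heard.items()
                      if now - t > deadline)

mon = PeerMonitor()
mon.heartbeat("node-a", now=0)
mon.heartbeat("node-b", now=55)
print(mon.down_peers(now=65))   # node-a has been silent for > 60 s
```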

Is there anything else related to this topic? Could you please correct
me if I'm wrong or have misunderstood the spec?

Thanks,
Marek


> -----Original Message-----
> From: sis-ams-bounces at mailman.ccsds.org 
> [mailto:sis-ams-bounces at mailman.ccsds.org] On Behalf Of Scott Burleigh
> Sent: 24 January 2006 18:23
> To: sis-ams at mailman.ccsds.org
> Subject: Re: [Sis-ams] Predictability concerns with AMS, and 
> other questions too
> 
> Marek Prochazka wrote:
> 
> >Hello Scott,
> >
> >I've finished reading the AMS white book, liked some of its features 
> >and have some questions or notes. First of all, I have to say I'm a 
> >newbie so maybe I missed some past discussions on the same topic.
> >
> Hi, Marek.  Thanks for taking the time to read through the 
> book and develop some good questions.  Some answers in-line below.
> 
> >Here are my notes:
> >
> >1) Predictability issues. There are a number of places where I'm 
> >concerned about predictability issues. Most of them are related to 
> >"immediate propagation" of certain information to "all" nodes, 
> >registrars or configuration servers. I understand that in most cases 
> >the selected protocol follows the general publish/subscribe 
> >communication model of AMS. But the AMS white book makes me feel 
> >that it all is so dynamic and so costly that it can hardly be used 
> >in hard real-time applications.
> >
> >Maybe the answer is as follows: An HRT application should avoid 
> >dynamic changes such as subscriptions, terminations, etc., and thus 
> >avoid time-costly and unpredictable operations happening at runtime. 
> >An HRT application should set up the message space(s) in an 
> >initialization phase and then only use regular messaging.
> >
> >I have two notes on this: First, if what I have written is how AMS 
> >is meant to be used by HRT applications, then perhaps there could be 
> >a section explaining which parts are better done in the 
> >initialization phase rather than during the "main computation" phase.
> >Second, some fault handling might happen during the computation 
> >phase anyway, and hence the application schedulability analysis 
> >must take it into account.
> >  
> >
> This is an important topic.  The AMS design isn't principally 
> aimed at hard real-time applications; the main intent is to 
> reduce the cost of developing and operating distributed 
> systems over networks, including the future interplanetary 
> internet when and if we get it built, and you normally don't 
> expect hard real-time performance over Ethernet, for example. 
>  That said, in JPL's Flight Systems Testbed we successfully 
> used Tramel, the lineal antecedent of AMS, to convey data 
> among the threads of a real-time attitude control system; the 
> control laws were able to function without much difficulty.
> 
> As you say, it's all a question of exactly how you use the 
> system: once the communication configuration of the real-time 
> elements of the message space has stabilized (other bits of 
> configuration can continue to change without noticeable 
> effect) - and provided your real-time nodes are using a 
> real-time-suitable transport system (such as message queues) 
> underneath AMS - I believe you can get bounded maximum 
> latency in AMS message exchange among those nodes.  This 
> remains to be demonstrated, of course, and a lot does depend 
> on careful implementation, but my experience with Tramel 
> makes me hopeful.
> 
> I think the explanatory section you propose is a great idea, 
> but I think it belongs in an AMS Green Book (yet to be 
> written) rather than in the specification itself, as it is 
> informative and advisory rather than normative.  And 
> certainly nothing about the design of AMS obviates the need 
> for schedulability analysis in any case.
> 
> >Here are parts of the protocol which make it (in my opinion) highly 
> >unpredictable with respect to response time:
> >- Registrar registration (Section 2.3.2, also 2.3.3, 3.1.5.4, 
> >3.1.6.4, 4.2.3, etc.): After each configuration change 
> >(subscription, invitation, termination, etc.), the registrar 
> >propagates it immediately to all nodes and all other zones. Also, 
> >given Node registration (4.2.5), it seems that all registrars have 
> >information on all nodes in all zones and the propagation is always 
> >performed immediately. Is it necessary? Isn't it ineffective and 
> >unpredictable if you consider limited bandwidth and a number of 
> >messages being sent at the same moment?
> >  
> >
> Registrars aren't required to retain information on the nodes in 
> remote zones; they receive it and are required to pass it on to all 
> the nodes in their own zone. Of course nothing prevents an 
> implementation of registrar functionality from retaining all this 
> information, but no required registrar functions depend on it (the 
> registrar is not a message broker).  In the JPL implementation, 
> registrars know nothing about other zones' nodes.
> 
> Each node, on the other hand, is required to know about all 
> other nodes in the message space.  This tends to increase 
> nodes' memory requirements, but it makes it possible for all 
> AMS message traffic to be exchanged directly between nodes 
> rather than through a message broker; this reduces bandwidth 
> consumption (the number of messages is cut in
> half) and increases robustness (there is no single point of failure).
> 
> The trade-off here is between increasing the number of 
> configuration messages (propagating configuration information 
> to the nodes) versus doubling the number of application 
> messages (which is necessary if you retain configuration 
> information only at message brokers).  On the assumption that 
> application message traffic will normally be vastly heavier 
> than configuration traffic, this seems like the right design approach.
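A back-of-envelope sketch of the trade-off described above, with invented numbers: routing every application message through a broker costs one hop to the broker plus one relay per subscriber, while direct node-to-node exchange costs one transmission per subscriber plus a one-time spray of configuration messages to every node.

```python
def broker_msgs(published, subscribers_per_msg):
    # each publication: 1 send to the broker + 1 relay per subscriber
    return published * (1 + subscribers_per_msg)

def direct_msgs(published, subscribers_per_msg, config_updates, nodes):
    # each publication goes straight to every subscriber; every
    # configuration change is propagated once to every node
    return published * subscribers_per_msg + config_updates * nodes

# 10,000 application messages, one subscriber each, 50 configuration
# changes propagated to 20 nodes (illustrative figures only):
print(broker_msgs(10_000, 1))          # 20000 -- traffic is doubled
print(direct_msgs(10_000, 1, 50, 20))  # 11000 -- config spray is cheaper
```

With application traffic vastly heavier than configuration traffic, the direct-exchange total stays close to the theoretical minimum of one transmission per delivery, which is the assumption behind the design choice.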
> 
> >- Configuration service fail-over (2.3.6): The registrar cycles 
> >through all well-known network locations to find a new config 
> >server. In the meantime, as the registrar is not sending 
> >heartbeats, all nodes start cycling too. That might imply a huge 
> >number of messages being sent at the same moment, and the 
> >predictability of such fail-over being very poor.
> >  
> >
> I think this merits some real quantitative analysis; I don't 
> believe the actual traffic load would be particularly 
> substantial, as these messages are quite short and aren't 
> issued frequently.  But I certainly agree that the 
> predictability of fail-over will not be good.  The point of 
> the fail-over design isn't preservation of real-time 
> performance (which would be unaffected by failure of a 
> configuration server anyway, since the real-time application 
> messages are exchanged directly between nodes) but the 
> overall robustness and survivability of the distributed 
> application.  When a configuration server fails, you can 
> either try to recover automatically (the current fail-over 
> design) or do it manually; in neither case is the moment of 
> recovery very predictable.
> 
> >2) Priority of a message: It is mentioned a number of times, but 
> >there is no clear statement on how the priority is used, what the 
> >dispatch and delivery mechanism is for messages with the same and 
> >different priorities, what "higher urgency" exactly means for the 
> >protocol, and how the AMS entities participate in this.
> >  
> >
> Good point, there should be some clarifying language somewhere.  
> Priority and flow label are both merely passed through to the 
> underlying transport layer adapter, to be used (or not) as 
> makes sense for that protocol; the AMS protocol itself 
> doesn't use them at all.  The JPL implementation does use 
> priority to order arriving messages in the queue of messages 
> awaiting delivery to the application, but this is strictly an 
> implementation choice; interoperability is not affected.
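The implementation choice described above, ordering arriving messages by priority before delivery to the application, might look like the following sketch. This is hypothetical Python, not the JPL code; it assumes a smaller number means higher urgency, and uses a monotonic counter so that messages of equal priority keep their arrival order.

```python
import heapq
import itertools

class DeliveryQueue:
    """Orders arriving messages by priority ahead of delivery.

    Hypothetical sketch: lower number = more urgent; a monotonic
    counter preserves FIFO order among equal priorities.
    """
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def arrive(self, priority, message):
        heapq.heappush(self._heap, (priority, next(self._seq), message))

    def deliver(self):
        # hand the most urgent (then oldest) message to the application
        return heapq.heappop(self._heap)[2]

q = DeliveryQueue()
q.arrive(2, "telemetry")
q.arrive(0, "safing alert")
q.arrive(2, "housekeeping")
print(q.deliver())   # the priority-0 "safing alert" comes out first
```

Since AMS itself only passes priority through to the transport adapter, none of this affects interoperability; it is purely a local delivery-ordering choice.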
> 
> >3) Hardcoded intervals. The document says that AMS is for 
> >communication both between modules of a ground system and a flight 
> >system, as well as between modules located on different spacecraft 
> >or between a ground system and a spacecraft system. I wonder 
> >whether some of the following hardcoded numbers are right for all 
> >those cases:
> >- 20 seconds between heartbeats for registrar <-> node; 3 missing 
> >successive heartbeats imply a failure,
> >- 10 seconds between heartbeats for registrar <-> config server; 3 
> >missing successive heartbeats imply a failure,
> >- Configuration server location (4.2.2): config_msg_ack should be 
> >received within 5 seconds, otherwise Fault.indication is sent (is 5 
> >seconds enough for e.g. space-to-ground communication? Maybe you 
> >want to offer some suggestions for the distribution of AMS 
> >entities, e.g. one config server per spacecraft and one for ground, 
> >so that communication within 5 seconds makes more sense?)
> >- Registrar location (4.2.4): The same as the previous.
> >  
> >
> These intervals could be configuration options rather than 
> fixed values, but in my experience that introduces a lot of 
> operational and implementation complexity for little if any 
> benefit.  Certainly wide variations in signal propagation 
> delay could make the fixed values in the spec less than 
> useful, but I would argue that in this case you should 
> partition your system into multiple closed continua and use 
> remote AMS for message exchange across the long-delay links; 
> that's really what RAMS was designed for.
> 
> >4) Configuration service fail-over (2.3.6, 4.2.1): After a new 
> >config server starts, it sends an I_am_running message, and if it 
> >receives such a message it immediately terminates. This can't work, 
> >as a scenario with mutually terminating config servers is likely to 
> >happen. I think that a kind of timestamp ordering, or perhaps 
> >another negotiation protocol, has to be added to reason about "who 
> >was the first" and avoid unnecessary termination of config servers.
> >  
> >
> No, it should work fine, because each configuration server 
> sends I_am_running only to other configuration servers that 
> rank lower than itself in the well-known list of 
> configuration server network locations (see 4.2.1); no 
> configuration server will ever receive I_am_running from any 
> other configuration server to which it sent such a message.
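The rank rule of 4.2.1 can be checked mechanically: because I_am_running flows only "downhill" in the well-known list, no two servers can ever tell each other to terminate. The server names below are invented for illustration; only the ordering matters.

```python
# Hypothetical well-known list of config server locations;
# index 0 is the highest-ranked server.
WELL_KNOWN = ["cs-primary", "cs-backup-1", "cs-backup-2"]

def i_am_running_targets(server):
    """Per 4.2.1, I_am_running goes only to lower-ranked servers."""
    rank = WELL_KNOWN.index(server)
    return WELL_KNOWN[rank + 1:]

# No pair of servers ever sends I_am_running in both directions,
# so mutual termination is impossible:
for a in WELL_KNOWN:
    for b in i_am_running_targets(a):
        assert a not in i_am_running_targets(b)

print(i_am_running_targets("cs-primary"))   # both backups
print(i_am_running_targets("cs-backup-2"))  # nobody -- lowest rank
```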
> 
> >5) Subject catalog (2.3.9): Last two paragraphs: I'd remove the 
> >suggestion about potentially sparse large arrays and keeping the 
> >subject numbers small. There are a number of application-dependent 
> >solutions for this, such as a fixed number of subjects, a hash 
> >function, etc.
> >  
> >
> I suppose you're right, this is the sort of implementation 
> hint that really belongs in a Green Book.  But I think it's 
> correct nonetheless: I would suggest that every one of the 
> alternative solutions you allude to is either less general or 
> more time-consuming than simply using subject number as an 
> index into an array.
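The hint being defended here, using the subject number directly as an array index, is a one-step constant-time lookup with no hashing or searching, at the cost of a possibly sparse table. A minimal sketch (subject names invented):

```python
SUBJECT_COUNT = 8
subject_table = [None] * SUBJECT_COUNT   # possibly sparse array

def register_subject(number, name):
    subject_table[number] = name

def lookup(number):
    return subject_table[number]         # one index; no hash, no search

register_subject(3, "attitude")
register_subject(5, "thermal")
print(lookup(3))   # a registered subject
print(lookup(4))   # None -- a sparse, unused slot
```

A hash table would tolerate large, scattered subject numbers, but every lookup then pays for hashing and collision handling, which is the "more time-consuming" alternative alluded to above.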
> 
> >6) Node registration (4.2.5): Very strict, no option to detect why 
> >registration was rejected and try to re-register. Neither 
> >Register.request nor Register.indication mentions how the node is 
> >eventually informed about the reason for rejection (maybe this is 
> >the code in 5.1.5.16?).
> >  
> >
> The bullets at the top of page 41 say that the reason for 
> rejection is noted in the rejection MPDU; conveying this 
> information to the application is an implementation matter, 
> which doesn't affect interoperability.
> 
> >7) Node registration (4.2.5): Forwarding an I_am_starting message 
> >by a registrar: some kind of "marking" has to be done to avoid 
> >forwarding a message which was previously forwarded by another 
> >registrar (looping of messages). The same applies to other 
> >forwarding similar to this one (e.g. 2.3.10 - Remote AMS message 
> >exchange).
> >  
> >
> I don't understand how this would happen: I don't think 
> there's any clause in the spec that talks about a registrar 
> forwarding a MAMS message to another registrar, except when 
> the source of that MAMS message is a node in its own zone 
> (i.e., NOT a registrar).  So there can't be any looping of 
> messages through registrars.
> 
> >8) Heartbeats (4.2.7): After it receives "reconnect" from a node, a 
> >registrar should return you_are_dead if it has been operating for 
> >more than 60 seconds. This is not right. First, imagine that a 
> >registrar crashes and a new one is started within x seconds. As a 
> >node connected to the original registrar will need 60 seconds to 
> >notice that the registrar is dead, it can't contact the new 
> >registrar sooner than 60 seconds after the old one crashed (in the 
> >worst case). That means that effectively the node has, in the worst 
> >case, x seconds to i) locate the new registrar and ii) contact it.
> >
> I think we're okay here.  Suppose the registrar crashes at 
> time T and the new replacement registrar starts at time T+x.  
> By time T+60 every node in the zone will have noticed the 
> death of the original registrar and will have started asking 
> the configuration server where the new one is.  All of the 
> nodes will be querying the configuration server and trying to 
> reconnect, every 20 seconds, so by T+x+20 every node in the 
> zone will have learned about the new location of the 
> registrar and will have sent a reconnect message to it.  
> Since the registrar doesn't shut off reconnects until T+x+60, 
> there's no problem, no matter what the value of x is.
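The timeline above can be checked with a little arithmetic. This sketch assumes, as the argument does, that a node queries the configuration server as soon as it notices the registrar's death and retries every 20 seconds; the function names are invented.

```python
HEARTBEAT = 20      # node <-> registrar heartbeat / retry interval, seconds
DEATH_DETECT = 60   # 3 missed heartbeats => registrar presumed dead

def worst_case_reconnect(T, x):
    """Old registrar dies at T; the replacement starts at T + x.

    Returns the latest time any node reconnects and the time at
    which the new registrar stops accepting reconnects.
    """
    notice = T + DEATH_DETECT               # all nodes have noticed by now
    # a node queries on noticing and retries every HEARTBEAT seconds,
    # so it reconnects within one retry of the new registrar being up
    reconnect_by = max(notice, T + x + HEARTBEAT)
    cutoff = T + x + DEATH_DETECT           # reconnects accepted until here
    return reconnect_by, cutoff

for x in (0, 5, 60, 600):
    reconnect_by, cutoff = worst_case_reconnect(0, x)
    assert reconnect_by <= cutoff           # holds no matter what x is
print("reconnect always precedes the you_are_dead cutoff")
```

Algebraically, max(60, x + 20) <= x + 60 for every x >= 0, which is why the value of x doesn't matter.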
> 
> >Second, the communication between node and registrar could be 
> >spacecraft-to-spacecraft or even ground-to-spacecraft (unless you 
> >specify or suggest otherwise, as I suggested above), and hence 60 
> >seconds for a message round trip is completely unrealistic.
> >  
> >
> Again, for communication over long-delay links it's important 
> to deploy multiple continua and use Remote AMS; ordinary AMS 
> and MAMS functioning just doesn't make sense over that sort 
> of distance, and there's no need for it to.
> 
> >9) Minor typographic issues: in 4.2.8 you reference 4.2.10, 4.2.11,
> >4.2.12 and 4.2.13 as "above".
> >  
> >
> Good catch, I'll fix that.  Thanks, Marek.
> 
> Scott
> 
> 
> 
> _______________________________________________
> Sis-ams mailing list
> Sis-ams at mailman.ccsds.org
> http://mailman.ccsds.org/cgi-bin/mailman/listinfo/sis-ams
> 




