Scalable Group Composition with End-to-end Delivery Semantics

Scott Johnson, Farnam Jahanian, and Jigney Shah ¹
Dept. of Electrical Engineering and Computer Science
University of Michigan
Ann Arbor, MI 48109-2122

Group communication is a widely studied paradigm for building fault-tolerant distributed systems. A number of group multicast protocols have been developed which allow a set of processors to communicate using various semantics such as FIFO, causal, or total ordering, while providing atomicity and timeliness guarantees. In order to improve scalability and flexibility, it is often desirable to be able to compose a system from multiple cooperating process groups, each of which may provide specific service guarantees appropriate for a subset of the processes in the system. Our research focuses on the problem of building scalable, fault-tolerant distributed systems from collections of communicating process groups while maintaining well-defined end-to-end delivery semantics.

Our approach to inter-group communication takes advantage of modular group composition. We have developed a generic architecture which distinguishes between intra-group and inter-group communication. Our model allows for independent composition of communication protocols and improves scalability by reducing the need for shared state between groups. Using this model, we have formally analyzed the end-to-end semantic behavior of messages in a number of composed systems, and determined exactly which delivery semantics a set of receivers will observe for a given set of messages in various topologies. Using our analysis, a system designer can compose groups into a larger distributed system such that desired end-to-end delivery semantics on messages sent between groups are enforced. More information on our analysis is available in [1].

Our model relies on an abstraction which we call an intergroup router. Intergroup routers provide a mechanism for the exchange of messages between groups. They allow process groups to communicate in a fault-tolerant manner with minimal group or message state information and without modification of the underlying communication protocols. A sample group composition using inter-group routers is shown in Figure 1. The architecture of the inter-group routers is shown in Figure 2a. Each router consists of a routing protocol that runs on top of the attached communication protocols. This routing protocol is seen as a normal application by each communication protocol; it is a member of each process group and a regular sender/receiver in other types of communication protocols. It forwards messages between protocol stacks in FIFO order, ensuring that any message orderings enforced by a sending protocol are maintained as messages are forwarded across the distributed system.

To ensure that messages destined for other groups are forwarded to the intergroup router, a simple routing filter is inserted between the application and the group communication protocol on each node (Figure 2b). This filter uses the local group communication protocol to send intergroup messages to the intergroup router for forwarding, while allowing messaged destined for the local group to be sent and delivered normally. This enables the application to address messages to other groups without the local group communication protocol having to be aware of the rest of the composed system.

There are several benefits to this approach:

Each group can reside on a separate physical network, or multiple groups can reside on the same physical network.
Group communication protocols do not need to be modified for applications to be able to use the intergroup routers.
Messages can be sent to/from closed process groups by non-members.

To test the scalability of our architecture, we conducted a number of simulations to compare the performance of a distributed system composed of multiple process groups using intergroup routers to that of a single process group containing the same number of processors as the composed system. The simulation was performed using OpNet, and was built using the actual implementation code from RTCAST, an atomic, totally ordered real-time group multicast protocol developed at the University of Michigan. Using data from our published performance tests of the RTCAST protocol [2], we tuned the simulation so that it yielded the same performance characteristics for a single group as the actual implementation of RTCAST. Since the performance of RTCAST compares very favorably with the published performance data of other group multicast protocols (most notably Totem [3] and Horus [4]), we expect that our simulation results will be applicable to systems built using other protocols besides RTCAST.

This figure shows the results of one of our tests, comparing the average message latency of a single group to a system in which four RTCAST groups are composed around a single intergroup protocol with four intergroup routers. To mimic the behavior of a real distributed system, we varied the proportion of messages that were sent to all processors in the multiple group case from 10% to 100%. Messages that were not sent to all groups were delivered only in the sender's group. Note that the latency is highest for the single group case. Even when every message is sent to all processors, the multigroup system yields lower message latency than the single group case. This is likely because message latency is proportional to the group size, and once the group reaches a certain critical size it is faster to send the message through several smaller groups than one large one. Other simulations showed similar performance benefits from using multiple process groups.

In conclusion, we have shown how our abstraction of inter-group routers can support group composition while maintaining scalable performance and well-defined message delivery semantics. We are now developing a test application to utilize the intergroup routers and examine their behavior in a real distributed system. We are in the process of adding priority-based communication to our architecture, to provide better application control of end-to-end message latency.

References

[1] T. Abdelzaher, A. Shaikh, S. Johnson, F. Jahanian, and K. Shin. "RTCAST: Lightweight Multicast for Real-Time Process Groups." Tech Report of Dept. of Electrical Engineering and Computer Science, University of Michigan. 1997. To appear in IEEE Transactions on Software Engineering.

[2] S. Johnson, F. Jahanian, and J. Shah. "Scalable Group Composition with End-to-end Delivery Semantics." Tech Report of Dept. of Electrical Engineering and Computer Science, University of Michigan. 1998. Available at http://www.eecs.umich.edu/~scottdj/papers/group_composition.ps.

[3] Y. Amir, L. Moser, P. Melliar-Smith, D. Agarwal, and P. Ciarfella. "The Totem Single-Ring Ordering and Membership Protocol." ACM Transactions on Computer Systems. v 13, n 4. November, 1995. pp. 311-342.

[4] R. van Renesse, K. Birman, and S. Maffeis. "Horus: A Flexible Group Communication System." Communications of the ACM. v 39, n 4. April, 1996. pp. 76-83.

1. Author contact: {scottdj,farnam,jiggey}@eecs.umich.edu.