Network Working Group                                        Vipin Jain
Internet-Draft                                      Riverstone Networks
Category: Standards Track                                        Editor
Expires March 2005                                             Sep 2004


               Fail Over extensions for L2TP "failover"
                  draft-ietf-l2tpext-failover-04.txt

Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress".

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2002). All Rights Reserved.

Abstract

   L2TP is a connection-oriented protocol that has shared state between
   active endpoints. Some of this shared state is vital for operation
   but may be rather volatile in nature, such as packet sequence
   numbers used on the L2TP Control Connection. When failure of one
   side of a control connection occurs, a new control connection is
   created and associated with the old connection by exchanging
   information about the old connection. Such a mechanism is not
   intended as a replacement for an active fail over with some mirrored
   connection states, but as an aid just for those parameters that are
   particularly difficult to have immediately available. Protocol
   extensions to L2TP defined in this document are intended to
   facilitate state recovery, providing additional resiliency in an
   L2TP network and improving a remote system's layer 2 connectivity.
Jain                       Standards Track                     [Page 1]

INTERNET DRAFT                 FAILOVER                      March 2005

Contributors

   Following is the complete list of contributors to this document.

      Paul Howard        Juniper Networks
      Mark Townsley      Cisco Systems
      Vipin Jain         Riverstone Networks
      Sam Henderson      Cisco Systems
      Ly Loi             Tahoe Networks
      Leo Huber          Extreme Networks
      Keyur Parikh       Sentito Networks

Table of Contents

   Status of this Memo
   1.0 Introduction
   2.0 Protocol Operation
   2.1 Pre Failover Operation
   2.2 Failover Recovery Procedure
   2.2.1 Recovery Tunnel Establishment
   2.2.2 Control and Data Channel Reset
   2.3 Session State Synchronization
   3.0 IANA Considerations
   4.0 Security Considerations
   5.0 Acknowledgements
   6.0 Author Information
   7.0 References
   Appendix A
   Appendix B
   Appendix C
   Appendix D

Terminology

   Endpoint: L2TP control connection endpoint, i.e. either LAC or LNS.
   Also known as LCCE in RFC-TBA [L2TPv3].

   Active Endpoint: An endpoint that is currently providing service.

   Backup Endpoint: A redundant endpoint standing by for the active
   endpoint.

   Failover: The action of a Backup Endpoint taking over the service of
   an active endpoint. This could be due to administrative action or
   failure of the active endpoint.

   Old Tunnel: A tunnel that existed before failure and is subjected to
   recovery upon failover.
   Recovered Tunnel: After an Old Tunnel is recovered (i.e. the tunnel
   and its sessions are restored) using the mechanism described in this
   document, it is referred to as the Recovered Tunnel.

   Recovery Tunnel: A new control connection established only to
   recover an old tunnel.

1.0 Introduction

   The goal of this draft is to aid the overall resiliency of an L2TP
   endpoint by introducing extensions to RFC 2661 [L2TPv2] and RFC-TBA
   [L2TPv3] that will minimize the recovery time of the L2TP layer
   after a failover, while minimizing the impact on its performance.
   Therefore it is assumed that the endpoint's overall architecture is
   also supportive in the resiliency effort.

   To ensure proper operation of an L2TP endpoint after a failover, the
   associated information of the control connection and sessions
   between them must be correct and consistent. This includes both the
   configured and dynamic information. The configured information is
   assumed to be correct and consistent after a failover, otherwise the
   tunnels and sessions would not have been set up in the first place.

   The dynamic information, which is also referred to as stateful
   information, changes with the processing of the tunnel's control and
   data packets. Currently, the only such information that is essential
   to the tunnel's operation is its sequence numbers. For the tunnel
   control channel, inconsistencies in its sequence numbers can result
   in the termination of the entire tunnel. For tunnel sessions,
   inconsistency in their sequence numbers, when used, can cause
   significant data loss, thus giving the perception of "service loss"
   to the end user.

   Thus, an optimal resilient architecture that aims to minimize
   "service loss" after a failover must make provision for the tunnel's
   essential stateful information - i.e. its sequence numbers.
   Currently, there are two options available: the first option is to
   ensure that the backup endpoint is completely synchronized with the
   active endpoint with respect to the control and data session
   sequence numbers. The other option is to re-establish all the
   tunnels and their sessions after a failover. The drawback of the
   first option is that it adds significant performance and complexity
   impact to the endpoint's architecture, especially as tunnel and
   session aggregation increases. The drawback of the second option is
   that it increases the "service loss" time, especially as the
   architecture scales.

   To alleviate the above-mentioned drawbacks of the current options,
   this draft introduces a mechanism to bring the dynamic stateful
   information of a tunnel to a correct and consistent state after a
   failure. The proposed mechanism defines the recovery of tunnels and
   sessions that were in established state prior to the failure.

2.0 Protocol Operation

   The failover protocol consists of three phases - pre failover,
   failover recovery, and session state synchronization.

   Pre failover operation allows an endpoint to specify its failover
   capabilities and timer values, attributes that are used when
   failover occurs.

   Failover recovery is started at the failed endpoint when it
   initiates a new L2TP control connection (called recovery tunnel),
   for every old tunnel that needs recovery. The recovery tunnel serves
   four purposes:

   1) It provides a means of authentication and a three-way handshake
      to ensure both ends agree on the failover for a given tunnel.
   2) It identifies the old tunnel that needs recovery.
   3) It tells whether the failed endpoint would like to recover the
      control and/or data channel.
   4) It exchanges the Ns and Nr values to be used in the recovered
      tunnel on both ends.

   Upon establishing the recovery tunnel, the two endpoints reset their
   control and/or data channel, after which the recovery tunnel could
   be torn down.
   The sessions that were in established state resume traffic.

   The session state synchronization process allows two endpoints to
   agree on the state of various sessions in the recovered tunnel.
   Inconsistency could arise due to failure on one of the endpoints.
   To synchronize, both endpoints first silently clear the sessions
   that were not in established state. At this point they can allow
   new sessions to establish on the recovered tunnel. Then, they
   utilize FSQ/FSR messages (over the recovered tunnel) to obtain the
   state of sessions on the peer, in order to clear stale sessions.

2.1 Pre Failover Operation

   An endpoint that supports the failover protocol defined in this
   document MUST include the Failover Capability AVP in SCCRQ or SCCRP
   during control connection establishment.

   Failover Capability AVP

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |M|H| rsvd  |      Length       |         Vendor Id [IETF]      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Attribute Type [TBD]     |         Reserved          |D|C|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                Recovery Time (in milliseconds)                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The AVP MAY be hidden (the H-bit set to 0 or 1). The AVP is not
   mandatory (the M-bit MUST be set to 0).

   The D bit, when set, indicates that an endpoint is capable of
   supporting its peer's data channel failure.

   The C bit, when set, indicates that an endpoint is capable of
   supporting its peer's control channel failure.

   Recovery Time is the time in milliseconds an endpoint asks its peer
   to wait before assuming the recovery process has failed. This timer
   starts when an endpoint's control channel timeout ([L2TPv2] section
   5.8, [L2TPv3] section 4.2) is started, and is not terminated (before
   expiry) until an endpoint successfully authenticates its peer during
   recovery.
   A value of zero indicates that the sender cannot preserve the state
   of sessions within the tunnel, but it is able to support its peer's
   failure.

2.2 Failover Recovery Procedure

   The failover recovery procedure consists of two steps:

   1) Recovery tunnel establishment
   2) Control and/or data channel reset

2.2.1 Recovery Tunnel Establishment

   The failed endpoint establishes a new control connection, called the
   recovery tunnel, for every old tunnel it wishes to recover. The
   purpose of the recovery tunnel is solely to recover the
   corresponding old tunnel. An endpoint SHOULD NOT send any control
   message on this tunnel, other than the messages to establish the
   tunnel itself.

   The recovery tunnel MUST use the same L2TP version and establishment
   procedures that were used for the control connection being
   recovered. It MUST follow the procedures described in [L2TPv2] or
   [L2TPv3] to establish the recovery tunnel. To identify the old
   control connection, the SCCRQ message for the recovery tunnel MUST
   include the Tunnel Recovery AVP.

   Tunnel Recovery AVP for L2TPv3 tunnels:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |M|H| rsvd  |      Length       |         Vendor Id [IETF]      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Attribute Type [TBD]     |         Reserved          |D|C|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Recover Tunnel Id                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Recover Remote Tunnel Id                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Tunnel Recovery AVP for L2TPv2 tunnels:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |M|H| rsvd  |      Length       |         Vendor Id [IETF]      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Attribute Type [TBD]     |         Reserved          |D|C|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Reserved           |       Recover Tunnel Id       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Reserved           |    Recover Remote Tunnel Id   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   This AVP MUST NOT be hidden (the H-bit is set to 0). The AVP is
   mandatory (the M-bit is set to 1).

   The D bit is set when the failed endpoint would like to recover the
   data channel.

   The C bit is set when the failed endpoint would like to recover the
   control channel.

   Recover Tunnel Id encodes the local tunnel id that the failed
   endpoint wants recovered. Similarly, Recover Remote Tunnel Id
   encodes the remote tunnel id corresponding to the old tunnel.

   Upon getting an SCCRQ with the Tunnel Recovery AVP, the non failed
   endpoint validates Recover Tunnel Id and Recover Remote Tunnel Id
   and responds with an SCCRP. It MUST terminate the tunnel if:

   - Recover Tunnel Id or Recover Remote Tunnel Id is unknown.
   - Non failed endpoint did not indicate it was failover capable.
   - The L2TP version of the recovery tunnel is different from the
     version used in the old tunnel.

   If the non failed endpoint accepts the SCCRQ, it MAY include the
   Suggested Control Sequence AVP in the SCCRP.

   Suggested Control Sequence AVP

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |M|H| rsvd  |      Length       |         Vendor Id [IETF]      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Attribute Type [TBD]     |            Reserved           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Suggested Ns         |          Suggested Nr         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   This AVP MAY be hidden (the H-bit set to 0 or 1). The AVP is not
   mandatory (the M-bit is set to 0).

   This is an optional AVP, suggesting Ns and Nr values to be used by
   the failed endpoint. If this AVP is present in an SCCRP message, the
   failed endpoint MUST set the Ns and Nr values of the recovered
   tunnel to the respective suggested values. When this AVP is not sent
   in SCCRP or not present in an incoming SCCRP, the Ns and Nr values
   for the recovered tunnel are set to zero. It is RECOMMENDED that the
   non failed endpoint suggest the Ns and Nr values, to help avoid
   interference in the recovered tunnel's control channel from old
   control packets.

   In case of L2TPv3, the recovery tunnel MUST use Control Message
   authentication (i.e. exchange the nonce values) as described in
   [L2TPv3] section 4.3, if the old tunnel was configured to do Control
   Message authentication. An L2TP Version 3 recovered tunnel MUST
   reset its nonce values (local and remote) to the nonce values
   exchanged in the recovery tunnel.

   To authenticate an endpoint during recovery, an endpoint MUST follow
   the procedure described in either [L2TPv2] section 5.1.1 or [L2TPv3]
   section 4.3. It SHOULD use the same secret that was used to
   authenticate the old tunnel.
   Not being able to authenticate could be a reason to terminate the
   recovery tunnel.

   If, for any reason, the failed endpoint could not establish the
   recovery tunnel, then it MUST silently clear the recovered tunnel
   and the sessions within, assuming the recovery process has failed.

   Any control packet received on the recovered tunnel, before control
   channel reset, MUST be silently discarded.

   If both endpoints fail simultaneously:

   - Each endpoint SHOULD follow the procedure described for a failed
     endpoint to recover the tunnel and its sessions.

   - To avoid teardown of either one of the recovery tunnels, an
     endpoint SHOULD NOT use the tie breaker AVP ([L2TPv2] section
     4.4.3, [L2TPv3] section 5.4.3) during recovery tunnel
     establishment.

   Appendix C illustrates the double failover scenario.

2.2.2 Control and Data Channel Reset

   The failed endpoint indicates in the Tunnel Recovery AVP (SCCRQ)
   whether it would like to reset the control channel and/or the data
   channel.

   Control channel reset on the recovered tunnel SHOULD flush the
   transmit and receive windows, and reset the control channel sequence
   numbers (i.e. Ns and Nr values). The control channel on the failed
   endpoint is reset upon getting a valid SCCRP, whereas the control
   channel on the non failed endpoint is reset upon getting a valid
   SCCCN. If the failed endpoint does not receive the Suggested Control
   Sequence AVP in SCCRP, then it MUST reset Ns and Nr values to zero.
   Similarly, if the non failed endpoint opts not to send the Suggested
   Control Sequence AVP, then it MUST reset Ns and Nr values to zero.

   Data channel reset requires no action if the data channel doesn't
   use sequence numbers. If the data channel was using sequence
   numbers, then the sequence numbers are reset as follows:

   - Before sending SCCRQ on the recovery tunnel, the failed endpoint
     MUST stop receiving and transmitting data packets on all sessions.

   - Failed endpoint resets Ns to zero.
     It also sets Nr from the Ns received in the first data packet
     after sending SCCCN on the recovery tunnel.

   - After resetting Ns and Nr values, the failed endpoint can transmit
     and receive data.

   - The non failed endpoint resets the Nr to zero upon receipt of a
     valid SCCCN. It doesn't reset the Ns value.

   An endpoint MUST prevent establishment of new sessions until it has
   cleared (or marked for clearance) the sessions that were not in
   established state, i.e. until after Step 1, section 2.3 is complete.

2.3 Session State Synchronization

   If failover happens while a session is being established or being
   torn down, it is possible for an endpoint to consider a session in
   established state, when its peer considers the same session non
   existent. Two such situations occur when an endpoint fails after
   sending:

   - A CDN message that never made it to the peer.
   - An ICCN message that never made it to the peer.

   The following mechanism MUST be used to identify and clear the
   sessions that exist on an endpoint but not on its peer:

   Step1: After the recovery tunnel is established, the sessions that
   were not in established state MUST be silently cleared (i.e. without
   sending a CDN message) by each endpoint.

   Step2: Both endpoints SHOULD identify the sessions that might have
   been in inconsistent states, perhaps based on data channel
   inactivity.

   Step3: An endpoint sends a Failover Session Query (FSQ) message,
   message type [TBD], to query the state of stale sessions on its
   peer. An FSQ message MUST include at least one Failover Session
   State (FSS) AVP. An endpoint MAY send another FSQ message on the
   recovered tunnel before getting a response for its previous FSQs.
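
   As a rough illustration of Steps 1 to 3 above, the sketch below
   clears non-established sessions without signalling, picks stale
   candidates by data channel inactivity, and groups them into FSQ
   batches. The Session record, the function name, and the
   idle_threshold and max_fss_per_fsq parameters are assumptions made
   for this example only; this document merely suggests inactivity as
   one possible criterion and does not define a batching limit.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Session:
    local_id: int
    remote_id: int
    established: bool
    last_data_rx: float  # hypothetical timestamp (seconds) of last data packet


def plan_session_sync(
    sessions: Dict[int, Session],
    now: float,
    idle_threshold: float = 30.0,  # assumed heuristic, not from the draft
    max_fss_per_fsq: int = 64,     # assumed batching limit, not from the draft
) -> Tuple[List[int], List[List[Tuple[int, int]]]]:
    """Return (silently cleared local ids, FSQ batches of (sid, remote sid))."""
    # Step 1: drop sessions that were not established; no CDN is sent.
    cleared = [sid for sid, s in sessions.items() if not s.established]
    for sid in cleared:
        del sessions[sid]

    # Step 2: treat sessions with no recent data traffic as possibly stale.
    stale = sorted(
        (s for s in sessions.values() if now - s.last_data_rx > idle_threshold),
        key=lambda s: s.local_id,
    )

    # Step 3: one FSS AVP per stale session; slicing never yields an empty
    # batch, so every FSQ carries at least one FSS AVP.
    pairs = [(s.local_id, s.remote_id) for s in stale]
    batches = [
        pairs[i:i + max_fss_per_fsq]
        for i in range(0, len(pairs), max_fss_per_fsq)
    ]
    return cleared, batches
```

   Because an endpoint MAY send further FSQs before earlier ones are
   answered, the batches returned here could all be transmitted
   back to back.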

   The Failover Session State AVP is described as follows:

   Failover Session State AVP for L2TPv3 sessions:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |M|H| rsvd  |      Length       |         Vendor Id [IETF]      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Attribute Type [TBD]     |            Reserved           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Session Id                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Remote Session Id                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Failover Session State AVP for L2TPv2 sessions:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |M|H| rsvd  |      Length       |         Vendor Id [IETF]      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      Attribute Type [TBD]     |            Reserved           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Reserved           |           Session Id          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Reserved           |       Remote Session Id       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   This AVP MAY be hidden (the H-bit set to 0 or 1). The AVP is
   mandatory (the M-bit is set to 1).

   Session Id identifies the local session id the sender had assigned,
   for which it would like to query the state on its peer. Remote
   Session Id is the remote session id for the same session.

   Before all sessions are synchronized using the FSQ/FSR mechanism, if
   an endpoint receives an ICRQ for a session it believes is already in
   established state, it MUST respond to such an ICRQ with a CDN,
   setting the Assigned/Local Session ID AVP ([L2TPv2] section 4.4.4,
   [L2TPv3] section 5.4.4) to its local session id, and clear the
   session that it considered established. An endpoint could assign
   least recently used session ids to avoid this situation.
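
   The two Failover Session State AVP layouts above can be packed and
   parsed along the following lines. This is only a sketch under stated
   assumptions: FSS_ATTRIBUTE_TYPE is a made-up placeholder because the
   attribute type is still [TBD], the function names are illustrative,
   and AVP hiding (H-bit set to 1) is not handled.

```python
import struct
from typing import Tuple

IETF_VENDOR_ID = 0           # Vendor Id used for IETF-defined AVPs
FSS_ATTRIBUTE_TYPE = 0xFFFF  # hypothetical placeholder; the draft says [TBD]
M_BIT = 0x8000               # mandatory bit, set to 1 for this AVP
LENGTH_MASK = 0x03FF         # Length is the low 10 bits of the first word
FSS_AVP_LENGTH = 16          # 6-byte AVP header + 2 reserved + 8 value bytes


def encode_fss_avp(session_id: int, remote_session_id: int, version: int) -> bytes:
    """Pack an unhidden FSS AVP for L2TP version 2 or 3."""
    head = struct.pack(
        "!HHHH", M_BIT | FSS_AVP_LENGTH, IETF_VENDOR_ID, FSS_ATTRIBUTE_TYPE, 0
    )
    if version == 3:
        # 32-bit Session Id and Remote Session Id
        return head + struct.pack("!II", session_id, remote_session_id)
    if version == 2:
        # 16-bit ids, each preceded by 16 reserved bits
        return head + struct.pack("!HHHH", 0, session_id, 0, remote_session_id)
    raise ValueError("unsupported L2TP version")


def decode_fss_avp(avp: bytes, version: int) -> Tuple[int, int]:
    """Return (Session Id, Remote Session Id) from an unhidden FSS AVP."""
    flags_len, vendor, attr, _reserved = struct.unpack("!HHHH", avp[:8])
    if len(avp) != FSS_AVP_LENGTH or (flags_len & LENGTH_MASK) != len(avp):
        raise ValueError("bad AVP length")
    if vendor != IETF_VENDOR_ID or attr != FSS_ATTRIBUTE_TYPE:
        raise ValueError("not an FSS AVP")
    if version == 3:
        sid, rsid = struct.unpack("!II", avp[8:])
        return sid, rsid
    if version == 2:
        _r1, sid, _r2, rsid = struct.unpack("!HHHH", avp[8:])
        return sid, rsid
    raise ValueError("unsupported L2TP version")
```

   Both versions occupy 16 octets; only the width of the id fields
   differs, which is why the caller has to supply the L2TP version of
   the tunnel being recovered.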

   When an endpoint receives an FSQ message, it MUST ensure that for
   each FSS AVP in the FSQ message it includes an FSS AVP in a Failover
   Session Response (FSR) message, message type [TBD]. There is no
   one-to-one correspondence between FSQ messages and FSR messages.
   Therefore an endpoint could respond to multiple FSQs using one FSR
   message, or it could respond to one FSQ with multiple FSRs.

   For each FSS AVP received in an FSQ, an endpoint MUST validate the
   Remote Session Id and determine if it is paired with the Session Id
   specified in the message. If the FSS AVP is not valid (i.e. the
   session is non-existing or it is paired with a different remote
   session id), then the Session Id field in the FSS AVP in the
   response MUST be set to zero. When a session is discovered to be
   paired with a mismatching session id, the local session MUST NOT be
   cleared, but rather marked stale, to be queried later using another
   FSQ message. An example dialogue in Appendix D elaborates the
   endpoint's behavior on mismatching session ids. Also, when
   responding to an FSQ with an FSR message, the Remote Session Id in
   the FSS AVP is always set to the received value of the Session Id in
   the FSS AVP of the FSQ message.

   When an endpoint receives an FSR message, it MUST use the Remote
   Session Id field to identify the local session and silently (without
   sending a CDN) clear the session if the Session Id in the AVP was
   zero. Otherwise it can consider the session to be in established
   state and recovered.

3.0 IANA Considerations

   This document requires four new "AVP Attributes" and two new
   messages to be assigned through IETF Consensus [RFC2434] as
   indicated in Section 10.1 of [RFC2661].

   These are:

      Failover Capability AVP (section 2.1)
      Tunnel Recovery AVP (section 2.2.1)
      Suggested Control Sequence AVP (section 2.2.1)
      Failover Session State AVP (section 2.3)

      Failover Session Query Message (FSQ) (section 2.3)
      Failover Session Response Message (FSR) (section 2.3)

4.0 Security Considerations

   The failover mechanism described here leaves some room (1 in 2^32)
   for an intruder to discover the old tunnel id of an existing tunnel
   by trying out various possibilities in the Recover Tunnel Id and
   Recover Remote Tunnel Id fields of the Tunnel Recovery AVP.

   It also introduces an opportunity for an intruder to spoof the
   FSQ/FSR messages and learn the active sessions.

5.0 Acknowledgements

   Leo Huber of Extreme Networks provided valuable suggestions to help
   define the failover concept. Ly Loi reviewed the draft and provided
   suggestions on improving it.

6.0 Author Information

   Vipin Jain
   Riverstone Networks
   5200 Great America Parkway
   Santa Clara, CA 95054
   Phone: +1 408.878.0464
   Email: vipinietf@yahoo.com

   Paul W. Howard
   Juniper Networks
   10 Technology Park Drive
   Westford, MA 01886
   Email: phoward@juniper.net

   Sam Henderson
   Cisco Systems
   7025 Kit Creek Rd.
   PO Box 14987
   Research Triangle Park, NC 27709
   Email: samh@cisco.com

   Keyur Parikh
   Sentito Networks
   2096 Gaither Road
   Suite 100
   Rockville, MD 20850
   Email: kparikh@sentito.com

   W. Mark Townsley
   Cisco Systems
   7025 Kit Creek Road
   PO Box 14987
   Research Triangle Park, NC 27709
   Email: townsley@cisco.com

7.0 References

   [L2TPv2] Townsley, et al., "Layer Two Tunneling Protocol L2TP",
            RFC2661

   [L2TPv3] Lau, Townsley, Goyret, "Layer Two Tunneling Protocol
            (version 3)", RFCxxx

Appendix A

   This section describes some design considerations that came up
   during discussions when developing the proposal:

   A.1 Backward compatibility and extensibility

   - The mechanism should be backwards compatible; i.e. it should not
     redefine existing behavior of [L2TP] compliant systems.

   - The protocol should allow a peer to detect failover capabilities
     in advance, so that it can fall back to other failover mechanisms
     should the peer not support the proposed failover protocol.

   - The protocol should allow future extensions to the fail-over
     mechanism with ease.

   A.2 Less failover recovery time

   The mechanism should have the least possible time to recover from
   failover (target of 3-5 seconds for 30k tunnels). Specifically it
   should take the following into consideration:

   - Faster recovery: by exchanging fewer messages to recover from
     failover.

   - CPU intensiveness: the less CPU intensive a proposal is, the
     better are the chances of faster recovery.

   - Parallel establishment of various tunnels: by keeping different
     tunnel reestablishments independent of one another.

   A.3 Less Payload data loss

   The mechanism should have the least possible impact on data flows
   for sessions with sequencing enabled.

   A.4 Minimum interference with pre-failure control traffic

   The mechanism should define a way of clearly distinguishing the
   messages that were sent before failover from those that are sent
   after. Specifically, it should define a mechanism that avoids
   confusion between sequence numbers that were used before and after
   if the same Tunnel Id is used.

   A.5 Simplicity

   The simpler the protocol is, the better are the chances of it being
   adopted by everybody. The following would help achieve this:

   - Use of existing AVPs, messages and packet formats.

   - Avoid introducing special considerations and mechanisms a new
     implementation would have to deal with.

   - Simpler post fail-over synchronization mechanism.

   A.6 Security

   The mechanism should provide a way to authenticate peers when
   resynchronization is happening after a failover.

   A.7 Scalability

   It is very important for a proposed protocol to work well for a
   scalable deployment. This includes dealing with all design
   considerations discussed above for scalable deployments, having
   thousands of tunnels or sessions or a mix of the two. A target of
   30,000 tunnels carrying 150,000 to 200,000 sessions from 300 peers
   was considered during the design.

Appendix B

   The description below outlines the failover protocol operation for
   an example tunnel. The failover protocol does not preclude an
   endpoint from recovering multiple tunnels in parallel. It also
   allows an endpoint to send multiple FSQs to recover quickly.

   Pre Failover Exchange (section 2.1):

   Endpoint                                       Peer

   (assigned tid = x, failover capable)
   SCCRQ     ----------------------------------> validate SCCRQ

                             (assigned tid = y, failover capable)
   validate  <---------------------------------- send SCCRP
   SCCRP, etc.
       ....
       ....
   < This Node fails >

   Failed endpoint establishes recovery tunnel (section 2.2.1).
   Initiate recovery tunnel establishment for the old tunnel 'x':

   Failed Endpoint                                Peer

   (assigned tid = z, Recovery AVP)
   SCCRQ     ----------------------------------> Detects failover
   (recover tid = x, recover remote tid = y)     validate SCCRQ

        (Suggested Control Sequence AVP, Suggested Ns/Nr = 3/100)
   validate  <---------------------------------- send SCCRP
   SCCRP               (recover tid = y, recover remote tid = x)
   reset Ns = 3, Nr = 100
   on the recovered tunnel

   SCCCN     ----------------------------------> validate and reset
                                                 Ns = 100, Nr = 3 on
                                                 the recovered tunnel.

   Terminate the recovery tunnel
   tid = 'z'
   StopCCN   ----------------------------------> Cleanup 'w'

   Session states are synchronized; both endpoints may send FSQs and
   clean up stale sessions (section 2.3):

   (FSS AVP for sessions s1, s2, s3..)
   send FSQ  ----------------------------------> compute the state
                                                 of sessions in FSQ

                            (FSS AVP for sessions s1, s2, s3...)
   deletes   <---------------------------------- send FSR
   stale sessions, if any

                            (FSS AVP for sessions s7, s8, s9...)
   compute   <---------------------------------- send FSQ
   the state of sessions
   in FSQ

   (FSS AVP for sessions s7, s8, s9...)
   send FSR  ----------------------------------> delete stale
                                                 sessions, if any

Appendix C

   This section shows an example dialogue to illustrate double failure
   recovery.
   Although the illustration assumes two endpoints failing almost at
   the same time, the behavior of the two endpoints would be similar
   even if the failures are interlaced.

      Failed endpoint                     Failed endpoint
      (assume old tid = A)                (assume old tid = B)

      Recovery AVP = (A, B)
      SCCRQ      -----------------------> valid SCCRQ -----------+
      (recovery tunnel 'C')                                      |
                                                                 |
                                          Recovery AVP = (B, A)  |
 +--- valid      <----------------------- Send SCCRQ             |
 |    SCCRQ                               (recovery tunnel 'D')  |
 |                                                               |
 |                                        No SCS AVP             |
 |    Validate   <----------------------- send SCCRP <-----------+
 | +- SCCRP; Reset 'A'
 | |  Ns, Nr set to zero
 | |
 | |  No SCS AVP
 +-|> Send SCCRP -----------------------> Validate SCCRP
   |                                      Reset 'B';
   |                                      Ns, Nr set to zero ----+
   |                                                             |
   +> Send SCCCN -----------------------> Validate SCCCN;        |
                                          Reset 'B' again;       |
                                          Ns, Nr set to zero     |
                                                                 |
      Validate   <----------------------- Send SCCCN <-----------+
      SCCCN; Reset 'A' again;
      Ns, Nr set to zero

   FSQs and FSRs for the old tunnel (A, B) are exchanged on the
   recovered tunnel. This should be no different from handling
   simultaneous FSQs and FSRs between two nodes when only one node had
   failed.

Appendix D

   Session id mismatch could not be a result of failure on one of the
   endpoints. However, the failover session recovery procedure could
   exacerbate the situation, resulting in a permanent mismatch in
   session ids between two endpoints. The dialogue below outlines the
   behavior described in section 2.3 to handle such situations
   gracefully. FSS AVP contents are written as (Session Id, Remote
   Session Id).

   Failed endpoint                     Non failed endpoint
   (assume a mismatch)                 (assume a mismatch)
   Sid = A, Remote Sid = B             Sid = B, Remote Sid = C
   Sid = C, Remote Sid = D

   FSS AVP (A, B)
   send FSQ -------------------------> No (B, A) pair exists;
                                       rather (B, C) exists.
                                       If it clears B then peer
                                       doesn't know if C is stale
                                       on other end. Instead if it
                                       marks B stale and queries
                                       the session state via FSQ,
                                       C would be cleared on the
                                       other end.

                                       FSS AVP (0, A)
   Clears A <------------------------- send FSR

               ... some time later ...

                                       FSS AVP (B, C)
   No (C, B) <------------------------ send FSQ
   Mark C Stale

   FSS AVP (0, B)
   Send FSR -------------------------> Clears B
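
   The FSQ/FSR rules of section 2.3, as exercised by the dialogue
   above, can be sketched as follows. The Endpoint class, its method
   names, and the choice to echo the responder's own session id in the
   Session Id field of a valid response are assumptions made for
   illustration; this document only requires that field to be zero when
   the pairing is invalid. FSS AVPs are modelled as (Session Id, Remote
   Session Id) tuples.

```python
from typing import Dict, List, Set, Tuple

FSS = Tuple[int, int]  # (Session Id, Remote Session Id) carried in an FSS AVP


class Endpoint:
    """Hypothetical per-tunnel session table: local id -> remote id."""

    def __init__(self, sessions: Dict[int, int]) -> None:
        self.sessions = dict(sessions)
        self.stale: Set[int] = set()  # local ids to query later via FSQ

    def handle_fsq(self, fss_list: List[FSS]) -> List[FSS]:
        """Build the FSS AVPs of the FSR answering one FSQ."""
        replies: List[FSS] = []
        for peer_sid, our_sid in fss_list:
            if self.sessions.get(our_sid) == peer_sid:
                # Pairing matches: report the session as established.
                replies.append((our_sid, peer_sid))
            else:
                # Unknown or mismatched: Session Id is zero in the reply.
                # A mismatched local session is marked stale, not cleared.
                if our_sid in self.sessions:
                    self.stale.add(our_sid)
                replies.append((0, peer_sid))
        return replies

    def handle_fsr(self, fss_list: List[FSS]) -> None:
        """Apply an FSR: Remote Session Id names our local session."""
        for peer_sid, our_sid in fss_list:
            if peer_sid == 0:
                # Silently clear the session; no CDN is sent.
                self.sessions.pop(our_sid, None)
            self.stale.discard(our_sid)


# Replay of the dialogue above with A, B, C, D = 1, 2, 3, 4.
A, B, C, D = 1, 2, 3, 4
failed = Endpoint({A: B, C: D})
peer = Endpoint({B: C})

first_fsr = peer.handle_fsq([(A, B)])     # peer marks B stale
failed.handle_fsr(first_fsr)              # failed endpoint clears A
second_fsr = failed.handle_fsq([(B, C)])  # failed endpoint marks C stale
peer.handle_fsr(second_fsr)               # peer clears B
```

   After this replay the failed endpoint still holds session C, marked
   stale, which it would query in a later FSQ of its own.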