Network Working Group                                      Johan Sjoberg
INTERNET-DRAFT                                         Magnus Westerlund
Expires: March 2005                                             Ericsson
                                                           Ari Lakaniemi
                                                                   Nokia
                                                      September 30, 2004


          Real-Time Transport Protocol (RTP) Payload Format for
            Extended AMR Wideband (AMR-WB+) Audio Codec

Status of this memo

   By submitting this Internet-Draft, I (we) certify that any applicable
   patent or other IPR claims of which I am (we are) aware have been
   disclosed, and any of which I (we) become aware will be disclosed, in
   accordance with RFC 3668 (BCP 79).

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This document is a submission of the IETF AVT WG. Comments should be
   directed to the AVT WG mailing list, avt@ietf.org.

Abstract

   This document specifies a real-time transport protocol (RTP) payload
   format to be used for Extended AMR Wideband (AMR-WB+) encoded audio
   signals. The AMR-WB+ codec is an audio extension of the AMR-WB codec
   providing additional frame types designed to give higher quality of
   music and speech than the original frame types. A media type
   registration is included for AMR-WB+.

Sjoberg, et. al.                                                [Page 1]

INTERNET-DRAFT      RTP payload format for AMR-WB+    September 30, 2004

TABLE OF CONTENTS

   1. Definitions
      1.1. Glossary
      1.2. Terminology
   2. Introduction
   3. Background on AMR-WB+ and Design Principles
      3.1. The AMR-WB+ Audio Codec
      3.2. Multi-rate Encoding and Rate Adaptation
      3.3. Voice Activity Detection and Discontinuous Transmission
      3.4. Support for Multi-Channel Session
      3.5. Unequal Bit-error Detection and Protection
      3.6. Robustness against Packet Loss
         3.6.1. Use of Forward Error Correction (FEC)
         3.6.2. Use of Frame Interleaving
      3.7. AMR-WB+ Audio over IP scenarios
   4. RTP Payload Format for AMR-WB+
      4.1. RTP Header Usage
      4.2. Payload Structure
      4.3. Payload definitions
         4.3.1. The Payload Table of Contents
         4.3.2. Audio Data
         4.3.3. Methods for Forming the Payload
         4.3.4. Payload Examples
      4.4. Interleaving Considerations
      4.5. Implementation Considerations
         4.5.1. ISF recovery when frames are lost
   5. Congestion Control
   6. Security Considerations
      6.1. Confidentiality
      6.2. Authentication and Integrity
      6.3. Decoding Validation
   7. Payload Format Parameters
      7.1. Media Type Registration
      7.2. Mapping Media Type Parameters into SDP
         7.2.1. Offer-Answer Model Considerations
         7.2.2. Examples
   8. IANA Considerations
   9. Contributors
   10. Acknowledgements
   11. References
      11.1. Normative references
      11.2. Informative References
   12. Authors' Addresses
   13. IPR Notice
   14. Copyright Notice
   15. Changes

Sjoberg, et. al.             Standards Track                    [Page 2]

1. Definitions

1.1. Glossary

   3GPP    - the Third Generation Partnership Project
   AMR     - Adaptive Multi-Rate Codec
   AMR-WB  - Adaptive Multi-Rate Wideband Codec
   AMR-WB+ - Extended Adaptive Multi-Rate Wideband Codec
   CMR     - Codec Mode Request
   CN      - Comfort Noise
   DTX     - Discontinuous Transmission
   FEC     - Forward Error Correction
   FT      - Frame Type
   ISF     - Internal Sampling Frequency
   SCR     - Source Controlled Rate Operation
   SID     - Silence Indicator (the frames containing only CN
             parameters)
   TFI     - Transport Frame Index
   TS      - Timestamp
   VAD     - Voice Activity Detection
   UED     - Unequal Error Detection
   UEP     - Unequal Error Protection

1.2. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [2].

   n^r is exponentiation where n is multiplied by itself r times; n and
   r are integers.

   k%m denotes the modulo operation (k mod m), i.e. the remainder part
   from the operation k/m; k and m are integers.

2. Introduction

   This document specifies the payload format for packetization of
   Extended Adaptive Multi-Rate Wideband (AMR-WB+) [1] encoded audio
   signals into the Real-time Transport Protocol (RTP) [3]. The payload
   format supports transmission of mono or stereo audio, aggregation of
   multiple frames per payload, and mechanisms enhancing robustness
   against packet loss.

   The AMR-WB+ codec is an extension of the Adaptive Multi-Rate Wideband
   (AMR-WB) codec and therefore has some features not available in
   AMR-WB. From a transport point of view, the new features are native
   support for stereophonic audio and the possibility to use different
   internal sampling frequencies. The primary usage scenario for AMR-WB+
   is transport over IP; therefore, unlike for AMR-WB, interworking with
   other transport networks is not needed.

   AMR-WB+ will mainly be used in streaming scenarios, where the benefit
   of an octet-aligned format in reducing server complexity is
   considered substantial. Therefore, nothing similar to the
   bandwidth-efficient mode defined in [7] is specified for AMR-WB+. The
   bandwidth saved by such a mode would also be very small for the
   extension frame types, as they are all octet aligned.

   The codec's built-in support for stereo encoding makes multi-channel
   support in the RTP payload format, as done for AMR and AMR-WB [7],
   difficult to implement, but also less needed. Therefore, the
   multi-channel support specified in the AMR and AMR-WB payload format
   is not specified for AMR-WB+.

   Because of all these changes and the different scope of the AMR-WB+
   codec, this document defines a new RTP payload format that differs
   significantly from the ones for AMR and AMR-WB [7].

   There is no file format for AMR-WB+ defined within this
   specification.
   Instead, the 3GPP-defined, ISO-based 3GP file format [14] will
   support AMR-WB+ and provides all functionality needed from a file
   format. This format also supports storage of AMR and AMR-WB, plus
   other multimedia formats, allowing for synchronized playback. As the
   3GP format provides much greater capability than the previously
   defined formats for AMR and AMR-WB, it is expected to be used and to
   be sufficient for all use cases.

   Background on AMR-WB+ and design principles can be found in Section
   3. The payload format itself is specified in Section 4 and follows
   the principles used in [3], [9], and [7]. In Section 7, a media type
   registration is provided.

3. Background on AMR-WB+ and Design Principles

   The Extended Adaptive Multi-Rate Wideband (AMR-WB+) [1] audio codec
   is designed for compression of speech and audio, achieving a low
   bit-rate with good quality. The codec is specified by 3GPP, and the
   primary target applications within 3GPP are the packet-switched
   streaming service (PSS) [13] and the multimedia messaging service
   (MMS). However, due to its flexibility and robustness, AMR-WB+ is
   very well suited for streaming services in highly varying transport
   environments, e.g. the Internet.

   Because of the flexibility of this codec, the behavior in a
   particular application is controlled by several parameters that
   select options or specify the acceptable values for a variable. These
   options and variables are described in general terms at appropriate
   points in the text of this specification as parameters to be
   established through out-of-band means. In Section 7, all of the
   parameters are specified in the form of a media type registration for
   the AMR-WB+ encoding.
   The method used to signal these parameters at session setup or to
   arrange prior agreement of the participants is beyond the scope of
   this document; however, Section 7.2 provides a mapping of the
   parameters into the Session Description Protocol (SDP) [6] for those
   applications that use SDP.

3.1. The AMR-WB+ Audio Codec

   The AMR-WB+ audio codec was originally developed by 3GPP to be used
   for streaming and messaging services in GSM and 3G cellular systems.

   AMR-WB+ is designed as an audio extension to the AMR-WB speech codec.
   The new extension frame types add new functionality to the codec in
   order to provide high audio quality for a large range of signals,
   including music. Stereophonic operation has also been added, where a
   new high-efficiency hybrid stereo coding algorithm enables stereo
   operation at bit-rates as low as 6.2 kbit/s in total.

   The AMR-WB+ audio codec includes the nine frame types specified for
   AMR-WB, extended with additional new frame types with bit-rates
   ranging from 5.2 to 48 kbit/s. Whereas the AMR-WB frame types employ
   a 16000 Hz sampling frequency and operate only on monophonic signals,
   the extension frame types can operate at a number of internal
   sampling frequencies (ISFs), both in mono and stereo; see Table 24 in
   [1]. However, the output sampling frequency of the decoder is limited
   to 8, 16, 24, 32 or 48 kHz.

   The audio processing is performed on equal-size superframes, each
   corresponding to 2048 samples per encoded channel. For each
   superframe, the codec makes a number of encoding decisions, choosing
   between different encoding algorithms and block lengths, which gives
   a fidelity-optimized encoding that adapts to the signal
   characteristics of the source. Each superframe is encoded as 4
   equal-size transport frames, i.e. each corresponding to 512 samples
   per channel, and each individually decodable. For an individual
   transport frame to be decodable, its position within the superframe
   must be known.

   An AMR-WB+ frame type is constructed from two different parameters:
   core bit-rate and stereo bit-rate. The core bit-rate denotes the bits
   available for the core codec, while the stereo bit-rate denotes the
   bit-rate added to the core bit-rate when stereo encoding is enabled.
   To calculate the correct bit-rate, the ISF must also be taken into
   account. The total bit-rate of the frame is calculated as the sum of
   the core bit-rate and the stereo bit-rate, multiplied by the ISF,
   where 25600 Hz has been normalized to 1.

   The AMR-WB+ standard specifies eight core bit-rates, sixteen stereo
   bit-rates and thirteen different ISF values. These can be found in
   Tables 22, 23 and 24 in [1].

   In addition to the AMR-WB frame types 0-9, there are four pre-defined
   AMR-WB+ extension frame types, which have fixed core bit-rates,
   stereo bit-rates and ISFs; see Table 21 in [1]. These four
   pre-defined frame types also have a fixed input sampling frequency to
   the encoder, set at 16 or 24 kHz respectively. These frame types
   share the property with the AMR-WB modes that each frame is only
   capable of representing 20 ms of audio signal.

   Since there is a large number of possible parameter combinations, a
   limited normative combination set of core bit-rates and stereo
   bit-rates has been defined; see Table 25 in [1]. Note that the first
   16 entries in this table are the same as the entries in Table 21,
   which incorporates the original AMR-WB modes.

   The total bit-rate specified by the frame type, in conjunction with
   the chosen ISF, defines the actual codec bit-rate. A number of
   combinations produce the same codec bit-rate. For example, one
   possible way of producing a 32 kbps audio stream is to use frame type
   41, i.e. 25.6 kbps, and an ISF of 32 kHz (5/4 * (19.2 + 6.4) = 32
   kbps); another way is to use frame type 47 and an ISF of 25.6 kHz
   (1 * (24 + 8) = 32 kbps).
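
   As an informal illustration of the bit-rate arithmetic above (it is
   not part of the codec specification), the following Python sketch
   computes the codec bit-rate from a core bit-rate, a stereo bit-rate
   and an ISF. The function and constant names, and the choice of kbit/s
   and Hz as units, are assumptions made for this example; the rate
   values are the ones quoted in the text for frame types 41 and 47.

```python
# Illustrative sketch only; names and units are assumptions for this
# example and are not taken from the codec specification [1].
ISF_REFERENCE_HZ = 25600  # the ISF that the text normalizes to 1


def total_bitrate_kbps(core_kbps, stereo_kbps, isf_hz):
    """Return (core + stereo) scaled by the ISF relative to 25600 Hz."""
    return (core_kbps + stereo_kbps) * isf_hz / ISF_REFERENCE_HZ


if __name__ == "__main__":
    # Frame type 41 (19.2 + 6.4 kbit/s) at an ISF of 32 kHz.
    print(total_bitrate_kbps(19.2, 6.4, 32000))
    # Frame type 47 (24 + 8 kbit/s) at an ISF of 25.6 kHz.
    print(total_bitrate_kbps(24.0, 8.0, 25600))
```

   Both calls evaluate to approximately 32 kbit/s (up to floating-point
   rounding), matching the two combinations given in the example.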

   The duration of one AMR-WB+ audio transport frame can vary and
   depends on the ISF. Since a frame always corresponds to 512 samples
   at the used ISF, its duration is limited to the range 13.33 to 40 ms.
   With the RTP TS clock rate of 72000 Hz, this results in an AMR-WB+
   transport frame length of 960 to 2880 ticks. If the internal sampling
   rate is set to 25600 Hz, a transport frame is equal to 20 ms and the
   superframe is equal to 80 ms.

   The encoder is able to change the used ISF and encoding frame type
   (both mono and stereo) during an encoding session. For the extension
   frame types with index 16-47, ISF changes are constrained to occur at
   superframe boundaries, i.e. within a superframe the ISF is constant.
   This limitation does not apply to frame types with index 0-9, i.e.
   the original AMR-WB frame types.

3.2. Multi-rate Encoding and Rate Adaptation

   The multi-rate encoding capability of AMR-WB+ is designed for
   preserving high audio quality under a wide range of bandwidth
   requirements and transmission conditions.

   AMR-WB+ enables seamless switching between frame types using the same
   number of audio channels and the same internal sampling frequency.
   Every AMR-WB+ codec implementation is required to support all the
   respective audio coding frame types defined by the codec and must be
   able to handle switching between any two frame types. Switching
   between frame types employing a different number of audio channels or
   a different internal sampling frequency is possible, but may not be
   seamless. Therefore, it is recommended to perform such switches
   infrequently and, if possible, during periods where the input is
   silent.

3.3. Voice Activity Detection and Discontinuous Transmission

   AMR-WB+ supports the same algorithms for voice activity detection
   (VAD) and generation of comfort noise (CN) parameters during silence
   periods as used by the AMR-WB codec.
   However, these can only be used together with the AMR-WB frame types
   (FT=0-8). As with the AMR-WB codec, this option allows the number of
   transmitted bits and packets during silence periods to be reduced to
   a minimum when operating in the AMR-WB frame types (FT = 0..8). The
   operation of sending CN parameters at regular intervals during
   silence periods is usually called discontinuous transmission (DTX) or
   source controlled rate (SCR) operation. The AMR-WB+ frames containing
   CN parameters are called Silence Indicator (SID) frames. More details
   about the VAD and DTX functionality can be found in [4] and [5].

3.4. Support for Multi-Channel Session

   Some of the AMR-WB+ frame types support encoding of stereophonic
   audio. Because of this native support for two-channel stereophonic
   signals, it does not seem necessary to support multi-channel
   transport with separate codecs as done in the AMR-WB RTP payload [7].

   The codec is capable of stereo to mono downmixing. Thus, a receiver
   capable only of mono playout can still decode and play signals
   originally encoded as stereo. However, to avoid spending bit-rate on
   stereo encoding that will not be utilized, a mechanism for signalling
   mono-only support is defined.

3.5. Unequal Bit-error Detection and Protection

   The audio bits encoded in each AMR-WB frame are sorted according to
   their different perceptual sensitivity to bit errors. This property
   can be exploited, e.g. in cellular systems, to achieve better voice
   quality by using unequal error protection and detection (UEP and UED)
   mechanisms. However, the bits of the extension frame types of the
   AMR-WB+ codec do not have a consistent sensitivity property and are
   not sorted in sensitivity order. Thus, UEP or UED cannot be utilized
   with the extension frame types. If UEP or UED and a payload format
   supporting them are needed, the RTP payload format for the AMR-WB
   frame types defined in RFC 3267 [7] should be used.

3.6. Robustness against Packet Loss

   The payload format supports several means, including forward error
   correction (FEC) and frame interleaving, to increase robustness
   against packet loss.

3.6.1. Use of Forward Error Correction (FEC)

   The simple scheme of repetition of previously sent data is one way of
   achieving FEC. Another possible scheme, which can be more bandwidth
   efficient, is to use payload-external FEC, e.g. RFC 2733 [11], which
   generates extra packets containing repair data. For the AMR-WB+
   extension frame types, it is only possible to use the codec to send
   redundant copies using the same frame type and internal sampling
   frequency. Such a scheme is described next.

   This involves the simple retransmission of previously transmitted
   frames together with the current frame(s). This is done by using a
   sliding window to group the audio frames to be sent in each payload.
   Figure 1 below shows an example.

   --+--------+--------+--------+--------+--------+--------+--------+--
     | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
   --+--------+--------+--------+--------+--------+--------+--------+--

     <---- p(n-1) ---->
              <----- p(n) ----->
                       <---- p(n+1) ---->
                                <---- p(n+2) ---->
                                         <---- p(n+3) ---->
                                                  <---- p(n+4) ---->

   Figure 1: An example of redundant transmission.

   In this example, each frame is retransmitted once in the following
   RTP payload packet. Here, f(n-2)..f(n+4) denotes a sequence of audio
   frames and p(n-1)..p(n+4) a sequence of payload packets.

   The use of this approach does not require signaling at session setup.
   In other words, the audio sender can choose to use this scheme
   without consulting the receiver. This is because a packet containing
   redundant frames will not look different from a packet with only new
   frames. For a certain timestamp, the receiver may receive multiple
   copies of a frame containing encoded audio data or frames indicated
   as NO_DATA.
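
   The sliding-window grouping of Figure 1 can be sketched informally
   as follows. This is only an illustration under stated assumptions:
   the function name, the "copies" parameter and the representation of
   frames as list elements are hypothetical and are not defined by this
   payload format. Frames are kept oldest first within each payload.

```python
# Illustrative sketch only; the function name, the `copies` parameter
# and the list-based frame representation are assumptions made for
# this example.


def redundant_payloads(frames, copies=1):
    """Group frames so each payload repeats the previous `copies` frames.

    Returns one list per payload. Payload i holds frames i-copies .. i,
    oldest first, clipped at the start of the sequence.
    """
    if copies < 0:
        raise ValueError("copies must be non-negative")
    payloads = []
    for i in range(len(frames)):
        start = max(0, i - copies)
        payloads.append(frames[start:i + 1])
    return payloads


if __name__ == "__main__":
    for payload in redundant_payloads(["f0", "f1", "f2", "f3"]):
        print(payload)
```

   With copies=1, every frame except the last one appears in two
   consecutive payloads, which corresponds to the pattern shown in
   Figure 1.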

   This redundancy scheme provides the same functionality as the one
   described in RFC 2198 "RTP Payload for Redundant Audio Data" [12]. In
   most cases the mechanism in this payload format is more efficient and
   simpler than requiring both endpoints to support RFC 2198 in
   addition. There is one situation in which use of RFC 2198 is
   indicated: if a codec other than AMR-WB+ is desired for the redundant
   encoding, the AMR-WB+ payload format cannot carry it.

   The sender is responsible for selecting an appropriate amount of
   redundancy based on feedback about the channel, e.g. in RTCP receiver
   reports. The sender is also responsible for avoiding congestion,
   which may be exacerbated by redundancy (see Section 5 for more
   details).

3.6.2. Use of Frame Interleaving

   To decrease protocol overhead, the payload design allows several
   audio frames to be encapsulated into a single RTP packet. One of the
   drawbacks of such an approach is that a packet loss then means the
   loss of several consecutive audio frames, which usually causes
   clearly audible distortion in the reconstructed audio. Interleaving
   of frames can improve the audio quality in such cases by distributing
   the consecutive losses into a series of single-frame losses. However,
   interleaving and bundling several frames per payload also increase
   the end-to-end delay and set higher buffering requirements, and are
   therefore not appropriate for all usage scenarios. Nevertheless,
   streaming applications will most likely be able to exploit
   interleaving to improve audio quality in lossy transmission
   conditions.

   This payload design supports the use of frame interleaving as an
   option. The usage of this feature needs to be negotiated or at least
   signalled.

   The interleaving supported by this format is rather flexible. For
   example, a continuous pattern can be defined, as the below example
   shows.

   --+--------+--------+--------+--------+--------+--------+--------+--
     | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
   --+--------+--------+--------+--------+--------+--------+--------+--
              [ P(n)   ]
     [ P(n+1) ]                 [ P(n+1) ]
                       [ P(n+2) ]                 [ P(n+2) ]
                                         [ P(n+3) ]                 [P(
                                                           [ P(n+4) ]

   Figure 2: Example of interleaving pattern that has constant delay.

   In Figure 2 the consecutive frames, denoted f(n-2) to f(n+4), are
   aggregated two per packet with interleaving. The packets, P(n) to
   P(n+4), contain a pattern that allows for constant delay in both the
   interleaving and the deinterleaving process. The deinterleaving
   buffer in this example needs to have room for at least 3 frames,
   including the one that is ready to be consumed. This is needed, for
   example, when f(n) is the next frame to be played: the receiver will
   then have consumed all previous frames and will need to have f(n),
   f(n+1) and f(n+3) in the buffer. When it is time to consume f(n+1),
   no further RTP packet is needed. When f(n+2) is to be consumed,
   P(n+3) is needed and the deinterleaving buffer will contain f(n+2),
   f(n+3) and f(n+5).

3.7. AMR-WB+ Audio over IP scenarios

   Since the primary target application for the AMR-WB+ codec is
   packet-switched streaming, the most relevant usage scenario for this
   payload format is IP end-to-end between a server and a terminal, as
   shown in Figure 3.

   +----------+                          +----------+
   |          |    IP/UDP/RTP/AMR-WB+    |          |
   |  SERVER  |<------------------------>| TERMINAL |
   |          |                          |          |
   +----------+                          +----------+

   Figure 3: Server to terminal IP scenario

4. RTP Payload Format for AMR-WB+

   The AMR-WB+ payload format is different from the AMR and AMR-WB
   payload formats [7]. The structure is simpler and consists only of a
   table of contents and the audio data.
   The payload format has two modes: the basic mode and the interleaved
   mode. The main structural difference between the two modes is the
   extension of the table of contents with a timestamp offset field in
   the interleaved mode. The basic mode supports aggregation of multiple
   consecutive frames in a payload. The interleaved mode supports
   aggregation of multiple frames that are non-consecutive in time.

   It is possible to have frames of different internal sampling
   frequencies in the same payload. However, frequent switching of the
   internal sampling frequency is not expected. For the extension frame
   types, the codec is restricted to switching ISF on superframe
   boundaries. However, to avoid any limitation on how many frames can
   be present in a payload, the payload format allows for switching at
   any frame in the payload.

   The payload format is designed around the property that AMR-WB+
   frames can be sorted and identified based on the RTP timestamp of
   each audio frame. For example, the timestamp of the audio frames is
   used to identify duplicates. The timestamp is also used in the
   deinterleaving buffer to regenerate the correct order of the frames
   before decoding.

   The interleaving scheme of this payload format is significantly more
   flexible than the one present in RFC 3267. The AMR and AMR-WB payload
   format is only capable of using periodic patterns with frames taken
   from an interleaving group at fixed intervals. This interleaving
   scheme allows any pattern as long as the time difference between any
   two frames that are adjacent in the payload is no more than 0.91
   seconds, i.e. the maximum field value divided by the RTP timestamp
   rate (65535/72000). By using extra NO_DATA frames, even that can be
   extended.

   To allow for error resiliency through redundant transmission, the
   periods covered by multiple packets MAY overlap in time.
   A receiver MUST be prepared to receive any audio frame multiple
   times. All multiply sent frames MUST use the same frame type (or
   NO_DATA) and internal sampling frequency and have the same RTP
   timestamp.

   The payload is always made an integral number of octets long by
   padding with zero bits if necessary. If additional padding is
   required to bring the payload length to a larger multiple of octets
   or for some other purpose, then the P bit in the RTP header MAY be
   set and padding appended as specified in [3].

4.1. RTP Header Usage

   The format of the RTP header is specified in [3]. This payload format
   uses the fields of the header in a manner consistent with that
   specification.

   The RTP timestamp corresponds to the sampling instant of the first
   sample encoded for the first frame in the packet. The timestamp clock
   frequency SHALL be 72000 Hz. This frequency allows the frame duration
   to be an integer number of RTP timestamp ticks for the used internal
   sampling frequencies, and also gives reasonable conversion factors to
   the used audio sampling frequencies. See Section 4.3.1 for how to
   derive the RTP timestamp for any audio frame beyond the first one.

   The RTP header marker bit (M) SHALL be set to 1 if the first frame
   carried in the packet contains an audio frame that is the first in a
   talkspurt. For all other packets the marker bit SHALL be set to zero
   (M=0).

   The assignment of an RTP payload type for this new packet format is
   outside the scope of this document, and will not be specified here.
   It is expected that the RTP profile under which this payload format
   is being used will assign a payload type for this encoding or specify
   that the payload type is to be bound dynamically.

   The media type parameter "channels" is used to indicate the maximum
   number of channels allowed to be used for a given payload type. A
   payload type where channels=1 (mono) SHALL only carry mono content,
   while a payload type for which channels=2 has been declared MAY carry
   both mono and stereo content.

4.2. Payload Structure

   The complete payload consists of a payload table of contents and
   audio data representing one or more audio frames. The following
   diagram shows the general payload format layout:

   +-------------------+----------------
   | table of contents | audio data ...
   +-------------------+----------------

   Payloads containing more than one audio frame are called compound
   payloads.

   The following sections describe the variations taken by the payload
   format depending on whether the AMR-WB+ session is set up to use the
   basic mode or the interleaved mode.

4.3. Payload definitions

4.3.1. The Payload Table of Contents

   The table of contents (ToC) consists of a list of ToC entries where
   each entry corresponds to an audio frame carried in the payload,
   i.e.,

   +----------------+----------------+- ... -+----------------+
   |  ToC entry #1  |  ToC entry #2  |          ToC entry #N  |
   +----------------+----------------+- ... -+----------------+

   When multiple frames are present in a packet, the ToC entries SHALL
   be placed in the packet in order of their creation time.

   All fields in the RTP payload are in network byte order, i.e. with
   the leftmost bit being the most significant.

   A ToC entry takes the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F| Frame Type  |TFI|R|   ISF   |  Timestamp offset (optional)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   F (1 bit): If set to 1, indicates that this frame is followed by
      another audio frame in this payload; if set to 0, indicates that
      this frame is the last frame in this payload.

   Frame Type (FT) (7 bits): Indicates the audio codec frame type used
      for the corresponding frame.
      Indicates the combination of AMR-WB+ core and stereo rate, special
      AMR-WB+ frame types, the AMR-WB rate, or comfort noise, as
      specified by Table 25 in [1].

   Transport Frame Index (TFI) (2 bits): An index from 0 (first) to 3
      (last) indicating this transport frame's position in the
      superframe. This field SHALL be set to 0 for Frame Type values
      0-9.

   ISF (5 bits): Indicates the internal sampling frequency employed for
      the corresponding frame. The index values correspond to internal
      sampling frequencies as specified in Table 24 in [1]. This field
      SHALL be set to 0 for Frame Type values 0-13.

   Timestamp offset (16 bits): When using interleaved mode, this field
      SHALL be present, otherwise not. The field indicates the number of
      RTP timestamp ticks by which this frame is offset in relation to
      the previous frame's RTP timestamp value. The RTP timestamp offset
      for the first audio frame SHALL be 0. The field is in network byte
      order and is a 16-bit unsigned integer.

   R: Reserved bit, SHALL be set to 0 and SHALL be ignored by receivers.

   The RTP timestamp value for a frame is the timestamp value of the
   first sample encoded in the frame. The timestamp value for a frame is
   derived differently depending on whether basic or interleaved mode is
   used. In both cases the first frame in a compound packet has an RTP
   timestamp equal to the one given in the RTP header.

   In the basic mode, the RTP timestamp of any subsequent frame is
   derived by adding together the frame durations of all the previous
   frames and adding that sum to the RTP header timestamp value. For
   example, if the RTP header timestamp value is 12345 and the frame
   duration is 16 ms (internal sampling frequency = 32 kHz), then the
   RTP timestamp of a fourth frame present in the payload will be
   12345 + 3 * 1152 = 15801.
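
   The ToC entry layout and the basic-mode timestamp rule can be
   sketched informally in Python as follows. This is a non-normative
   illustration: all function and type names are assumptions made for
   this example, and the mapping from the 5-bit ISF index to a frequency
   (Table 24 in [1]) is not reproduced, so the caller has to supply the
   ISF in Hz. The frame duration is computed as 512 samples at the ISF,
   expressed in ticks of the 72000 Hz RTP clock.

```python
# Illustrative sketch only. All names are assumptions for this example.
# The ISF index to frequency mapping (Table 24 in [1]) is NOT included.
import struct
from collections import namedtuple

TocEntry = namedtuple("TocEntry", "f ft tfi isf_index ts_offset")

RTP_CLOCK_HZ = 72000
SAMPLES_PER_FRAME = 512


def parse_toc(payload, interleaved=False):
    """Parse the ToC at the start of `payload` (a bytes object).

    Returns (entries, number_of_octets_consumed).
    """
    entries, pos = [], 0
    size = 4 if interleaved else 2  # two or four octets per entry
    while True:
        if pos + size > len(payload):
            raise ValueError("truncated table of contents")
        (word,) = struct.unpack_from("!H", payload, pos)
        offset = None
        if interleaved:
            (offset,) = struct.unpack_from("!H", payload, pos + 2)
        entry = TocEntry(
            f=word >> 15,             # 1 bit: another entry follows
            ft=(word >> 8) & 0x7F,    # 7 bits: frame type
            tfi=(word >> 6) & 0x3,    # 2 bits: transport frame index
            isf_index=word & 0x1F,    # 5 bits; reserved bit R is skipped
            ts_offset=offset,         # 16 bits, interleaved mode only
        )
        entries.append(entry)
        pos += size
        if entry.f == 0:
            return entries, pos


def frame_ticks(isf_hz):
    """Duration of a 512-sample frame in 72000 Hz timestamp ticks."""
    return SAMPLES_PER_FRAME * RTP_CLOCK_HZ // isf_hz


def basic_mode_timestamps(header_ts, durations):
    """RTP timestamps of consecutive frames, modulo 2^32."""
    timestamps, ts = [], header_ts
    for ticks in durations:
        timestamps.append(ts % 2**32)
        ts += ticks
    return timestamps


if __name__ == "__main__":
    # Four frames at an ISF of 32 kHz, RTP header timestamp 12345.
    print(basic_mode_timestamps(12345, [frame_ticks(32000)] * 4))
```

   For the example in the text, frame_ticks(32000) gives 1152 ticks and
   the fourth timestamp is 15801. An implementation would additionally
   have to handle the fixed-duration frame types and undefined FT
   values, which this sketch leaves out.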

   In interleaved mode, the RTP timestamp of a frame is derived by
   adding to the RTP header timestamp the sum of the timestamp offset
   fields of all ToC entries up to and including that frame, using
   modulo arithmetic. The following example derives the RTP TS for the
   third frame in a compound packet, which has the following header and
   ToC information:

      RTP header TS:          12345
      Frame 1 offset field:       0
      Frame 2 offset field:   13824
      Frame 3 offset field:   18432

   In this case one simply adds together the offset values up to the
   current frame to compute the frame timestamp. For example, the
   timestamp of Frame 3 is

      (12345 + 0 + 13824 + 18432) % 2^32 = 44601

   where % stands for the modulo operation.

   The value of Frame Type is defined in Table 25 in [1]. FT=14
   (AUDIO_LOST) is used to indicate frames that are lost. A NO_DATA
   (FT=15) frame could mean either that there is no data produced by the
   audio encoder for that frame or that no data for that frame is
   transmitted in the current payload (i.e., valid data for that frame
   could be sent in either an earlier or later packet). The duration of
   these non-included frames depends on the internal sampling frequency
   indicated by the ISF field.

   For frame types with index 0-13, the ISF field SHALL be set to 0 and
   has no meaning. The frame length for these frame types is fixed to 20
   ms in time, which is an RTP timestamp duration of 1440 ticks. For
   frame types with index 0-9, the TFI field SHALL be set to 0 and has
   no meaning.

   If a ToC entry with an undefined FT value is received, the whole
   packet SHOULD be discarded. This is to avoid the loss of data
   synchronization in the depacketization process, which can result in a
   severe degradation in audio quality.

   Note that packets containing only NO_DATA frames SHOULD NOT be
   transmitted.
Also, NO_DATA frames at the end of a frame sequence to be carried in a payload SHOULD NOT be included in the transmitted packet. The AMR-WB+ SCR/DTX is identical to the AMR-WB SCR/DTX described in [5] and SHALL only be used in combination with the AMR-WB frame types (0-8).

When multiple frames are present, their ToC entries are placed in the ToC in order of their creation time, independent of the payload mode. In basic mode the frames are consecutive in time, while in interleaved mode the frames may be non-consecutive in time and may even have varying inter-frame distances.

The following figure shows an example of a ToC of three entries in basic mode.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1| Frame Type1 | 0 |0|  ISF 1  |1| Frame Type2 | 1 |0|  ISF 2  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0| Frame Type3 | 2 |0|  ISF 3  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The following figure shows an example of a ToC of three entries in interleaved mode.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1| Frame Type1 | 2 |0|  ISF 1  |      Timestamp offset 1       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1| Frame Type2 | 0 |0|  ISF 2  |      Timestamp offset 2       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0| Frame Type3 | 3 |0|  ISF 3  |      Timestamp offset 3       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.3.2. Audio Data

Audio data of a payload contains one or more audio frames or comfort noise frames, as described in the ToC of the payload. Note that for ToC entries with FT=14 or 15 there is no corresponding audio frame in the audio data.

Each audio frame for an extension frame type represents an AMR-WB+ transport frame corresponding to the encoding of 512 samples of audio sampled at the internal sampling frequency specified by the ISF indicator. Frame types with index 10-13 are the exception: they can only use a single internal sampling frequency (25600 Hz). The encoding rates (core and stereo) are indicated in the frame type field of the corresponding ToC entry. The octet length of the audio frame is implicitly defined by the frame type field and is given in tables 21 and 25 of [1]. The order and numbering notation of the bits are as specified in [1]. As specified there, the bits of the AMR-WB audio frames (frame type values in the range 0-8) have been rearranged in order of decreasing sensitivity. For the AMR-WB+ extension frame types and comfort noise frames, the bits are in the order produced by the encoder.

The last octet of each audio frame MUST be padded with zeroes at the end if not all bits in the octet are used. In other words, each audio frame MUST be octet-aligned. Note, however, that all frame types specified in [1] already produce octet-aligned frames.

4.3.3. Methods for Forming the Payload

The payload begins with the table of contents, which consists of a list of ToC entries of two bytes (basic mode) or four bytes (interleaved mode) each. The audio data follows the table of contents; all of the octets comprising an audio frame are appended to the payload as a unit. The audio frames are packed in the same order as their corresponding ToC entries are arranged in the ToC list, with the exception that if a given frame has a ToC entry with FT=14 or 15, there are no data octets present for that frame.

4.3.4. Payload Examples

4.3.4.1. Example 1, Basic Payload Carrying Multiple Frames

The following diagram shows a payload from a session that carries three AMR-WB+ frames of 14 kbps coding frame type (FT=26) with a frame length of 280 bits.
The internal sampling frequency in this example is 25.6 kHz (ISF = 8). The TFI for the first frame is 2, indicating that the first transport frame in this payload is the third in a superframe. The following frames are consecutive, i.e., the fourth transport frame of that superframe (TFI = 3) and the first transport frame of the next superframe (TFI = 0).

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1|   FT = 26   | 2 |0| ISF = 8 |1|   FT = 26   | 3 |0| ISF = 8 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|   FT = 26   | 0 |0| ISF = 8 |   f1(0..7)    |   f1(8..15)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                              ...                              :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| f1(272..279)  |   f2(0..7)    |              ...              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                              ...                              :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      ...                      | f2(272..279)  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   f3(0..7)    |                      ...                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                              ...                              :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|              ...              | f3(272..279)  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.3.4.2. Example 2, Payload in Interleaved Mode

This example shows a payload with three frames of 24 kbps stereo coding frame type (FT=40). This payload uses the interleaved mode. The frames 1, 2 and 3 are not consecutive in time. In playout order they are frames 1, 8, and 15, and the TFI values (1, 0, and 3) match this. The internal sampling frequency in this example is 32 kHz (ISF = 10). Successive frames in the payload are therefore seven frame durations apart, which gives a Timestamp offset of 7 * 1152 = 8064 ticks.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1|   FT = 40   | 1 |0| ISF = 10|     Timestamp offset = 0      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1|   FT = 40   | 0 |0| ISF = 10|    Timestamp offset = 8064    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|   FT = 40   | 3 |0| ISF = 10|    Timestamp offset = 8064    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   f1(0..7)    |   f1(8..15)   |  f1(16..23)   |  f1(24..31)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                              ...                              :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| f1(448..455)  | f1(456..463)  | f1(464..471)  | f1(472..479)  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   f2(0..7)    |   f2(8..15)   |  f2(16..23)   |  f2(24..31)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                              ...                              :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| f2(448..455)  | f2(456..463)  | f2(464..471)  | f2(472..479)  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   f3(0..7)    |   f3(8..15)   |  f3(16..23)   |  f3(24..31)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:                              ...                              :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| f3(448..455)  | f3(456..463)  | f3(464..471)  | f3(472..479)  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.4. Interleaving Considerations

The flexible interleaving scheme requires some further usage considerations. As presented in the example in Section 3.6.2, a given interleaving pattern requires a deinterleaving buffer of a certain size. This required buffer space, expressed as a number of frame slots, is signalled using the "interleaving" media type parameter. The number of frame slots needed can be converted into an actual memory requirement by considering the largest (in bytes) combination of AMR-WB+'s core and stereo rates.
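The conversion from frame slots to memory mentioned above can be sketched as follows. This is an illustration only: the function name and the slot count of 15 are hypothetical, and the 60-octet frame size is simply the 480-bit frame of Example 2, used here as a stand-in. A real implementation would take the largest frame size from the tables in [1].

```python
def deinterleaving_buffer_octets(frame_slots, max_frame_octets):
    """Upper bound on the deinterleaving buffer size, in octets.

    frame_slots      -- value signalled in the "interleaving" parameter
    max_frame_octets -- size of the largest frame the receiver must store
    """
    if frame_slots < 0:
        raise ValueError("frame_slots must be non-negative")
    if max_frame_octets <= 0:
        raise ValueError("max_frame_octets must be positive")
    # Every slot must be able to hold the largest possible frame.
    return frame_slots * max_frame_octets


# Illustrative values only: 15 slots of 480-bit (60-octet) frames.
EXAMPLE_FRAME_OCTETS = 480 // 8
print(deinterleaving_buffer_octets(15, EXAMPLE_FRAME_OCTETS))  # 900
```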
However, the frame buffer size alone is not always sufficient to determine when it is appropriate to start consuming frames from the interleaving buffer. This happens in two cases: when the internal sampling frequency is switched, and when the interleaving pattern changes. For this reason the "int-delay" media type parameter is defined. It allows a sender to indicate the minimum amount of media time that needs to be present in the buffer before the receiver starts to consume media from it.

4.5. Implementation Considerations

An application implementing this payload format MUST understand all the payload parameters in the out-of-band signaling used. For example, if an application uses SDP, all the SDP and MIME parameters in this document MUST be understood. This requirement ensures that an implementation can always decide whether or not it is capable of communicating.

Both basic and interleaved mode SHALL be implemented. The implementation burden of both is rather small, and requiring both ensures interoperability. It is also RECOMMENDED to implement the AMR-WB format in RFC 3267 [7] for applications or scenarios where interoperability with AMR-WB-only codecs is necessary.

When doing error concealment, certain precautions are needed because the internal sampling frequency may be switched. The main difficulty is that packet loss also causes the loss of information such as timestamps, frame lengths, and the chosen ISF. As a result, concealment may be done using incorrect frame lengths, which in the worst case can make some of the subsequent frames unusable. More information and an example algorithm solving this problem are available in Section 4.5.1 below.
As the AMR-WB+ codec contains all the functionality of the AMR-WB codec, anyone supporting the AMR-WB+ codec and this payload format is RECOMMENDED to also implement the payload format in RFC 3267 [7] for the AMR-WB frame types. This will significantly help interoperability with other devices that only support AMR-WB, in applications and scenarios where this is possible. Otherwise an endpoint that is in fact capable of everything except the RTP payload format for AMR-WB will not be able to communicate.

4.5.1. ISF recovery when frames are lost

In case of packet loss, proper error concealment has to be initiated in the AMR-WB+ decoder for the lost frames associated with the lost packets. Proper frame loss concealment requires a codec framing that matches the timestamps of the correctly received frames. Hence, it is necessary to recover the timestamps of the lost frames. A difficulty with this may arise because the codec frame length associated with the ISF may have changed during the frame loss.

The task of recovering the timestamps of lost frames is illustrated with an example. Assume that two frames with timestamps t0 and t1 have been received properly, with ISF values isf0 and isf1, respectively. The associated frame lengths (in timestamp ticks) are L0 and L1, respectively. Three frames with timestamps x1 to x3 have been lost. The example further assumes that there is one ISF change during the frame loss, from isf0 to isf1, as shown in the figure below.
What is generally not known in the decoder, and what is required for recovery of the timestamps, is:

* the ISFs associated with the lost frames
* how many frames have been lost

  |<---L0--->|<---L0--->|<-L1->|<-L1->|<-L1->|
  |   Rxd    |   lost   | lost | lost | Rxd  |
--+----------+----------+------+------+------+--
  t0         x1         x2     x3     t1

In the following, an example algorithm is given by which the timestamps and ISFs belonging to lost frames can be recovered. As in the above example, it is assumed that two frames have been received properly with timestamps t0 and t1, ISF values isf0 and isf1, and associated frame lengths L0 and L1, respectively. Furthermore, the TFIs of the two received frames are denoted by tfi0 and tfi1, respectively.

Example Algorithm:

Start:  # check for frame loss
   If (t0 + L0) == t1 Then goto End   # no frame loss

Step 1: # check case with no ISF change
   If (isf0 != isf1) Then goto Step 2   # At least one ISF change
   If (isFractional((t1 - t0)/L0)) Then goto Step 3   # More than 1 ISF change
   Return recovered timestamps as x(n) = t0 + n*L1 and associated ISF equal to isf0, for 0