<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">

<?rfc rfcedstyle="yes" ?>
<?rfc subcompact="no" ?>
<?rfc toc="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="no" ?>

<rfc number="5369" category="info">
<front>
    <title abbrev="Transcoding Framework">
Framework for Transcoding with the Session Initiation Protocol (SIP)      
    </title>
    <author initials="G." surname="Camarillo" fullname="Gonzalo Camarillo">
      <organization>Ericsson</organization>
      <address>
	<postal>
          <street>Hirsalantie 11</street>
	  <code>02420</code> 
	  <city>Jorvas</city> 
	  <country>Finland</country>
 	</postal>
	<email>Gonzalo.Camarillo@ericsson.com</email>
      </address>
    </author>

    <date month="October" year="2008" />

    <keyword>SIP</keyword>
    <keyword>transcoding</keyword>

    <abstract>
    <t>
This document defines a framework for transcoding with SIP. This
framework includes how to discover the need for transcoding services in
a session and how to invoke those transcoding services. Two models for
transcoding services invocation are discussed: the conference bridge
model and the third-party call control model. Both models meet the
requirements for SIP regarding transcoding services invocation to
support deaf, hard of hearing, and speech-impaired individuals.
    </t>
    </abstract>
</front>
<middle>

<section title="Introduction">
<t>
Two user agents involved in a SIP <xref target="RFC3261"/> dialog may
find it impossible to establish a media session due to a variety of
incompatibilities.  Assuming that both user agents understand the same
session description format (e.g., SDP <xref
target="RFC4566"/>), incompatibilities can be found at
the user agent level and at the user level. At the user agent level,
both terminals may not support any common codec or may not support
common media types (e.g., a text-only terminal and an audio-only
terminal).  At the user level, a deaf person will not understand
anything said over an audio stream.
</t>
<t>
In order to make communications possible in the presence of
incompatibilities, user agents need to introduce intermediaries that
provide transcoding services to a session. From the SIP point of view,
the introduction of a transcoder is done in the same way to resolve
both user level and user agent level incompatibilities. So, the
invocation mechanisms described in this document are generally
applicable to any type of incompatibility related to how the
information that needs to be communicated is encoded.
</t>
<t>
<list style="hanging">
<t>
Furthermore, although this framework focuses on transcoding, the
mechanisms described are applicable to media manipulation in
general. It would be possible to use them, for example, to invoke a
server that simply increases the volume of an audio stream.
</t>
</list>
</t>
<t>
This document does not describe media server discovery. That is an
orthogonal problem that one can address using user agent provisioning
or other methods.
</t>
<t>
The remainder of this document is organized as follows. <xref
target="sec-discovery"/> deals with the discovery of the need for
transcoding services for a particular session. <xref
target="sec-invocation"/> introduces the third-party call control and
conference bridge transcoding invocation models, which are further
described in Sections <xref target="sec-invocation-3pcc" format="counter"/> and <xref
target="sec-invocation-conf" format="counter"/>, respectively. Both models meet the
requirements regarding transcoding services invocation in RFC 3351
<xref target="RFC3351"/>, which support deaf, hard of hearing, and
speech-impaired individuals.
</t>
</section>

<section title="Discovery of the Need for Transcoding Services"
anchor="sec-discovery">
<t>
According to the one-party consent model defined in RFC 3238 <xref
target="RFC3238"/>, services that involve media manipulation
invocation are best invoked by one of the endpoints involved in the
communication, as opposed to being invoked by an intermediary in the
network. Following this principle, one of the endpoints should be the
one detecting that transcoding is needed for a particular session.
</t>
<t>
In order to decide whether or not transcoding is needed, a user agent
needs to know the capabilities of the remote user agent. A user agent
acting as an offerer <xref target="RFC3264"/> typically obtains this
knowledge by downloading a presence document that includes media
capabilities (e.g., Bob is available on a terminal that only supports
audio) or by getting an SDP description of media capabilities as
defined in RFC 3264 <xref target="RFC3264"/>.
</t>
<t>
Presence documents are typically received in a NOTIFY request <xref
target="RFC3265"/> as a result of a subscription. SDP media
capabilities descriptions are typically received in a 200 (OK)
response to an OPTIONS request or in a 488 (Not Acceptable Here)
response to an INVITE.
</t>
<t>
In the absence of presence information, routing logic that involves
parallel forking to several user agents may make it difficult (or
impossible) for the caller to know which user agent will answer the
next call attempt. For example, a call attempt may reach the user's
voicemail while the next one may reach a SIP phone where the user is
available. If both terminating user agents have different
capabilities, the caller cannot know, even after the first call
attempt, whether or not transcoding will be necessary for the
session. This is a well-known SIP problem that is referred to as HERFP
(Heterogeneous Error Response Forking Problem). Resolving HERFP is
outside the scope of this document.
</t>
<t>
It is recommended that an offerer does not invoke transcoding services
before making sure that the answerer does not support the capabilities
needed for the session. Making wrong assumptions about the answerer's
capabilities can lead to situations where two transcoders are
introduced (one by the offerer and one by the answerer) in a session
that would not need any transcoding services at all.
</t>
<t>
<list style="hanging">
<t>
An example of the situation above is a call between two GSM (Global System
for Mobile Communications) phones (without using transcoding-free
operation). Both phones use a GSM codec, but the speech is converted
from GSM to PCM (Pulse Code Modulation) by the originating MSC (Mobile
Switching Center) and from PCM back to GSM by the terminating MSC.
</t>
</list>
</t>
<t>
Note that transcoding services can be symmetric (e.g., speech-to-text
plus text-to-speech) or asymmetric (e.g., a one-way speech-to-text
transcoding for a hearing-impaired user that can talk).
</t>

</section>
<?rfc needLines="10" ?>
<section title="Transcoding Services Invocation"
anchor="sec-invocation">
<t>
Once the need for transcoding for a particular session has been
identified as described in <xref target="sec-discovery"/>, one of the
user agents needs to invoke transcoding services.
</t>
<t>
As stated earlier, transcoder location is outside the scope of this
document. So, we assume that the user agent invoking transcoding
services knows the URI of a server that provides them.
</t>
<t>
Invoking transcoding services from a server (T) for a session between
two user agents (A and B) involves establishing two media sessions;
one between A and T and another between T and B.  How to invoke T's
services (i.e., how to establish both A-T and T-B sessions) depends on
how we model the transcoding service. We have considered two models
for invoking a transcoding service. The first is to use third-party
call control <xref target="RFC3725"/>, also referred to as 3pcc. The
second is to use a (dial-in and dial-out) conference bridge that
negotiates the appropriate media parameters on each individual leg
(i.e., A-T and T-B).
</t>
<t>
<xref target="sec-invocation-3pcc"/> analyzes the applicability of the
third-party call control model, and <xref
target="sec-invocation-conf"/> analyzes the applicability of the
conference bridge transcoding invocation model.
</t>


<section title="Third-Party Call Control Transcoding Model"
anchor="sec-invocation-3pcc">
<t>
In the 3pcc transcoding model, defined in <xref
target="RFC4117"/>, the user agent invoking the
transcoding service has a signalling relationship with the transcoder
and another signalling relationship with the remote user agent. There
is no signalling relationship between the transcoder and the remote
user agent, as shown in <xref target="fig-3pcc"/>.
</t>
<t>
<figure anchor="fig-3pcc" title="Third-Party Call Control Model">
<artwork><![CDATA[
       +-------+                                                          
       |       |                                                          
       |   T   |**                                                        
       |       |  **                                                      
       +-------+    **                                                    
         ^   *        **                                                  
         |   *          **                                                
         |   *            **                                              
        SIP  *              **                                            
         |   *                **                                          
         |   *                  **                                        
         v   *                    **                                      
       +-------+               +-------+                                  
       |       |               |       |                                  
       |   A   |<-----SIP----->|   B   |                                  
       |       |               |       |                                  
       +-------+               +-------+                                  
                                                                          
                                                                          
        <-SIP-> Signalling                                                
        ******* Media                                                     
]]></artwork></figure> 
</t>


<t>
This model is suitable for advanced endpoints that are able to
perform third party call control. It allows endpoints to invoke
transcoding services on a stream basis. That is, the media streams
that need transcoding are routed through the transcoder while the
streams that do not need it are sent directly between the endpoints.
This model also allows invoking one transcoder for the sending
direction and a different one for the receiving direction of the same
stream.
</t>
<t>
Invoking a transcoder in the middle of an ongoing session is also
quite simple. This is useful when session changes occur (e.g., an
audio session is upgraded to an audio/video session) and the endpoints
cannot cope with the changes (e.g., they had common audio
codecs but no common video codecs).
</t>
<t>
The privacy level that is achieved using 3pcc is high, since the
transcoder does not see the signalling between both endpoints. In
this model, the transcoder only has access to the information that is
strictly needed to perform its function.
</t>

</section>



<section title="Conference Bridge Transcoding Model"
anchor="sec-invocation-conf">
<t>
In a centralized conference, there are a number of media streams
between the conference server and each participant of a conference.
For a given media type (e.g., audio) the conference server sends, over
each individual stream, the media received over the rest of the
streams, typically performing some mixing. If the capabilities of all
the endpoints participating in the conference are not the same, the
conference server may have to send audio to different participants
using different audio codecs.
</t>
<t>
Consequently, we can model a transcoding service as a two-party
conference server that may change not only the codec in use, but also
the format of the media (e.g., audio to text).
</t>
<t>
Using this model, T behaves as a B2BUA (Back-to-Back User Agent) and
the whole A-T-B session is established as described in <xref
target="RFC5370"/>. <xref target="fig-conf"/>
shows the signalling relationships between the endpoints and the
transcoder.
</t>
<t>
<figure anchor="fig-conf" title="Conference Bridge Model">
<artwork><![CDATA[
                 +-------+                                                          
                 |       |**                                                        
                 |   T   |  **                                                      
                 |       |\   **                                                    
                 +-------+ \\   **                                                  
                   ^   *     \\   **                                                
                   |   *       \\   **                                              
                   |   *         SIP  **                                            
                  SIP  *           \\   **                                          
                   |   *             \\   **                                        
                   |   *               \\   **                                      
                   v   *                 \    **                                    
                 +-------+               +-------+                                  
                 |       |               |       |                                  
                 |   A   |               |   B   |                                  
                 |       |               |       |                                  
                 +-------+               +-------+                                  
                                                                                    
                                                                                    
                  <-SIP-> Signalling                                                
                  ******* Media                                                     
]]></artwork></figure> 
</t>
<t>
In the conferencing bridge model, the endpoint invoking the
transcoder is generally involved in less signalling exchanges than in
the 3pcc model. This may be an important feature for endpoints using
low-bandwidth or high-delay access links (e.g., some wireless
accesses).
</t>

<?rfc needLines="10" ?>
<t>
On the other hand, this model is less flexible than the 3pcc model.
It is not possible to use different transcoders for different streams
or for different directions of a stream.
</t>
<t>
Invoking a transcoder in the middle of an ongoing session or changing
from one transcoder to another requires the remote endpoint to
support the Replaces <xref target="RFC3891"/> extension. At present,
not many user agents support it.
</t>
<t>
Simple endpoints that cannot perform 3pcc and thus cannot use the
3pcc model, of course, need to use the conference bridge model.
</t>

</section>

</section>


<section title="Security Considerations"
anchor="sec-security">
<t>
The specifications of the 3pcc and the conferencing transcoding models
discuss security issues directly related to the implementation of
those models. Additionally, there are some considerations that apply
to transcoding in general.
</t>
<t>
In a session, a transcoder has access to at least some of the media
exchanged between the endpoints. In order to avoid rogue transcoders
getting access to those media, it is recommended that endpoints
authenticate the transcoder. TLS <xref target="RFC5246"/> and S/MIME
<xref target="RFC3850"/> can be used for this purpose.
</t>
<t>
To achieve a higher degree of privacy, endpoints following the 3pcc
transcoding model can use one transcoder in one direction and a
different one in the other direction. This way, no single transcoder
has access to all the media exchanged between the endpoints.
</t>
<t>
The fact that transcoders need to access media exchanged between the
endpoints implies that endpoints cannot use end-to-end media security
mechanisms. Media encryption would not allow the transcoder to access
the media, and media integrity protection would not allow the
transcoder to modify the media (which is obviously necessary to
perform the transcoding function). Nevertheless, endpoints can still
use media security between the transcoder and themselves.
</t>

</section>


<section title="Contributors" anchor="sec-contributors">
<t>
This document is the result of discussions amongst the conferencing
design team. The members of this team include Eric Burger, Henning
Schulzrinne, and Arnoud van Wijk.
</t>
</section>


</middle>
<back>
<?rfc needLines="10" ?>
    <references title="Normative References">

      <?rfc include="reference.RFC.3238" ?>
      <?rfc include="reference.RFC.3261" ?>
      <?rfc include="reference.RFC.3264" ?>
      <?rfc include="reference.RFC.3265" ?>
      <?rfc include="reference.RFC.3351" ?>
      <?rfc include="reference.RFC.3725" ?>
      <?rfc include="reference.RFC.3850" ?>
      <?rfc include="reference.RFC.3891" ?>
      <?rfc include="reference.RFC.4117" ?>
      <?rfc include="reference.RFC.5246" ?>

      <!-- ietf-sipping-transc-conf became RFC 5370 -->
      <reference anchor="RFC5370">
	  <front>
	  <title>The Session Initiation Protocol (SIP) Conference Bridge Transcoding Model</title>
	      <author initials="G" surname="Camarillo" fullname="Gonzalo Camarillo">
              <organization/>
              </author>
          <date month="October" year="2008"/>
          </front>
          <seriesInfo name="RFC" value="5370"/>
      </reference>


    </references>

    <references title="Informative References">
      <?rfc include="reference.RFC.4566" ?>
    </references>
  </back>

</rfc>

<!--  LocalWords:  xref PPR PPA SAA RTA RTR LIR LIA CDATA
 -->
