Minutes - Speech Services Control WG (speechsc) 
Reported by Tom Taylor 

Wednesday, July 17 at 0900-1130 
=================================== 

Chairs: Eric Burger (eburger@showshore.com) 
David Oran (oran@cisco.com) 

0900 - Agenda Bashing/Charter Review (Chairs) 
============================================= 

The proposed agenda was accepted. 

0910 - Work Roadmap & Timeline (Chairs) 
======================================= 

Dave Oran presented. The charts are available at 
http://www.ietf.org/proceedings/02jul/slides/speechsc-0/sld002.htm 

Charter 
------- 

The Working Group is chartered. The name has changed from CATS to SPEECHSC. The 
scope is initially limited to Automatic Speech Recognition (ASR), Text To Speech 
(TTS), and Speaker Verification (SV). This will expand later after the group has 
demonstrated its ability to meet deliverables. Scott Bradner (sob@harvard.edu) 
noted that there had been concern within the IESG that the group is too narrowly 
focused; he would be disappointed if the scope didn't expand. There should be a 
strong bias toward protocol reuse. The group is to coordinate with ETSI Aurora, 
ITU-T SG 16 (Question 15), W3C, and any other interested groups that emerge. 

Work Items 
---------- 

Dave listed the milestones set by the charter. The Working Group is already late 
on requirements publication, hence this is the highest priority. 

Timeline for Work Items 
----------------------- 

The Chairs would like to do Working Group Last Call on requirements by early 
August. (Hence the meeting will focus on this.) 

They would like to kick off work on protocol analysis immediately following 
Working Group Last Call of requirements. (The meeting wrapup will include a 
discussion of ways and means.) 

0930 - Discuss requirements document (draft-burger-speechsc-reqts-00.txt) 
========================================================================= 

This document, an update to draft-burger-cats-reqts-00.txt, was posted a month 
ago. A small number of people generated a substantial number of postings on the 
list. The Chairs wanted to take as long as necessary to cover the open issues 
identified in the document and raised on the list. 

Open Issues 
----------- 

Identified in the reqts document: 

(1) Means of detection of Speech Synthesis Markup Language (SSML) 
Proposed resolution: require content type header. 
Accepted. 
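As an illustration only (the specific header name and media type are assumptions, 
not part of the resolution), a request could distinguish plain text from SSML as 
follows: 
    Content-Type: text/plain            (plain text to synthesize) 
    Content-Type: application/ssml+xml  (SSML document) 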

(2) Should control channels be long-lived? 
There was only one comment on the list: allow, not require long-lived control 
channels. 
Question: does this mean requiring that the control channels be set up in 
advance? 
The discussion distinguished long-lived vs. session-based vs. on-demand control 
channel setup. 
Long-lived: set up in advance. 
Session-based: (note "session" is undefined). 
On-demand: per utterance. 
It was proposed that the protocol should support the first two, and may allow 
on-demand setup. On-demand raises design issues if support is stronger than MAY. 

Proposed summation: there is agreement on session-based setup and on setup 
longer-lived than a session, but there is some question of whether durations 
shorter than a session are needed. 

(3) For parameters that persist across a session, allow setting on a per-session 
basis? 
The proposal is to allow session parameters. There was no discussion. 

(4) Allow for speech markers, as specified for MRCP over RTSP? 
Two comments on list: Stephane Maes (smaes@us.ibm.com) stated that speech 
markers are needed and must be efficient. Dan Burnett (burnett@nuance.com) asked 
whether SSML was not adequate. The proposed resolution is that SSML is a good 
initial hypothesis. 
Discussants noted that we have to support markers in messaging. SSML is 
acceptable for now. The protocol must provide an efficient mechanism for 
reporting that a marker has been sensed. 

Stephane Maes noted that SSML can reference audio files: you don't know at the 
beginning how many files you are going to play. It was recognized that this is a 
separate issue from markers. The proposed resolution was accepted. 
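For reference, a minimal sketch of how markers appear in SSML (element and 
attribute names follow the W3C SSML drafts; how the server reports reaching a 
mark is left to protocol design): 
    <speak> 
      Your flight departs at 
      <mark name="before-time"/> eight thirty <mark name="after-time"/> 
      tomorrow morning. 
    </speak> 
When synthesis reaches a <mark> element, the server would report the mark name 
back to the client. 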

(5) Should ASR support alternative grammar formats? 
Stephane Maes said yes, we need that. 
Stephane added that we need an extensibility mechanism, but not discovery. 
Dan Burnett agreed. 
Stephane noted that we should differentiate between capability discovery for 
resource management and capability discovery for control. 
Dave Oran restated the conclusion: there is a need to discover the capabilities 
of a given device, but this is not necessarily part of this protocol. There was 
further discussion, but Dave suggested we read RFC 2533 and then revisit the 
discussion. It may be a matter of incorporating that protocol within this one, as 
SIP has done. 
Proposal: the protocol must be able to explicitly signal grammar format and 
support extensibility, but we will say nothing for now on capability discovery. 
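A possible illustration of explicit grammar format signalling (the media type 
names here are assumptions used only as examples): 
    Content-Type: application/srgs+xml          (W3C XML-form SRGS grammar) 
    Content-Type: application/x-vendor-grammar  (an alternative or proprietary format) 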

(6) Is there a need for all the parameters specified for MRCP over RTSP? 
List comments: Yes, we need to go beyond the W3C grammar, and also need 
extensibility (Maes, Burnett). 
Proposed resolution: Yes. Moreover, we need to be able to specify parameters on 
per-session basis. The exact set is to be decided as part of the protocol 
analysis and design phase. 
At this point there was some discussion of parameter setting beyond the session 
and within a session. There SHOULD be a capability to reset parameters within a 
session. It was noted that processor adjustment is done per-call, hence the 
protocol at least needs to allow adjustment per call. There is also some need to 
transfer data between servers (e.g. on background noise). Note: session and call 
are not necessarily related concepts. 

A question was raised on the handling of conferences (multiple speakers). 
Dave Oran suggested a protocol requirement to recognize different SSRCs in the 
RTP stream. There is a problem here: a conference could have multiple speakers 
associated with an SSRC. 

(7) The scope of the requirements should go beyond ASR, TTS, and SV/SR 
(Speaker Recognition). 
Proposed resolution: not for now. 
Steve (***don't know E-mail address) remarked that the main market is still for 
pre-recorded speech. The Chairs responded that this is a solved problem, not 
something we need to work on. However, we can recognize that it will be present. 
Text to express this is requested. 

Stephane Maes suggested that we need some requirement for extensibility of 
scope. Dave Oran asked how one would determine whether the protocol meets such a 
requirement; he saw this as something preferable to leave to the design stage. 
Text is requested for consideration, if this avenue is to be pursued. 

The question was raised whether DTMF is in scope. Dave Oran noted that other 
mechanisms are available for handling DTMF. Eric Burger added that in ASR, DTMF 
would be invisible to the protocol: it would be specified in the grammar. For a 
DTMF server, use another protocol such as Megaco. 
There was a suggestion that one might want conversion between voice and 
DTMF. Eric responded that this was an application function. 

(8) Does the protocol have to cope with both parallel and serial composition of 
servers? 
Proposal: the charter limits topology. Serial chaining involves OPES proxy 
issues. 

There was some discussion about cases associated with wireless LAN. The Chairs' 
response was that OPES issues are matters of delegation, trust, security, and 
traceability. We would have to convince the IESG that these issues do not arise 
or are well met in this case. It would represent a major expansion of work to 
generate the required analysis. See RFC 3238 for more information. 
Compromise: note this as an area of research and possible future enhancement. 

(9) Does the requirement not to redo RTSP or SIP/msuri restrict the ability to 
use markers and other playout options like pacing? 
Proposed resolution: reword the requirement to clarify that the intent is not to 
impose such a restriction. 

(10) Clarify the OPES requirement. 
Proposed resolution: add a reference to RFC 3238. The intent is that the client 
side of the protocol will operate on behalf of one user. Stephane Maes will 
supply text. 

(11) Load balancing. 
The Chairs noted that the current text captured the outcome of lengthy 
discussions. The requirements must not preclude load balancing but also must not 
require load balancing. The general feeling was that it is not a fruitful area 
of effort. 

(12) Must be able to control language and prosody for plain text. 
Proposed: this is a matter of clarification: SSML provides the desired control. 
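A minimal SSML sketch of the kind of control meant here (attribute values are 
illustrative only): 
    <speak xml:lang="en-US"> 
      <prosody rate="slow" pitch="high"> 
        Please listen carefully. 
      </prosody> 
      <voice xml:lang="fr-FR">Bonjour.</voice> 
    </speak> 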

(13) Need "full control" over TTS engine (Maes). VCR and other fine-grained 
control should be lower priority (Burnett). 
Dan Burnett clarified: VCR control are audio controls, not TTS controls. It was 
agreed such controls are needed, but they are not a high priority for TTS 
applications. The counter-argument was that we have the analogue in text 
operations: e.g. skip paragraph, go back to previous page. Stephane's point was 
that real-time controls are needed, and he is not sure why we would call them 
out specifically for lower priority. 

This issue is one for the list to consider. We need a more detailed explication 
of control requirements. Note that there is a problem of interaction with SSML. 
There is also the question of what kind of units to use when skipping ahead, for 
instance: seconds, paragraphs, ... 

(14) Must handle prompting, recording, possibly utterance verification, and 
retraining, in addition to recording for analysis (Maes). 
Proposed resolution: design for extensibility, but no specific requirements in 
the protocol other than for recording for now. 

(15) Grammar sharing. 
The Chairs proposed to adopt the Burnett phrasing of requirements: 

(i) A server implementation needs to be able to store large grammars originally 
provided by the client, and 
(ii) we need the ability within the protocol to reference grammars already known 
to the server (e.g. built in). Dave saw this as a name space issue. 

The distinction between globally unique and well-known names was noted, but this 
was seen as a design issue. 
The question of control of grammar use was raised. The Chairs suggested that 
this is a matter of passing it only to trusted entities. 
There was a suggestion that (i) is a matter of cache control. 
There is the issue of who can use which grammar, but the meeting agreed that 
this is outside of scope. Discovery of grammars is also out of scope. 

It was agreed that the protocol must not preclude grammar sharing across 
sessions. 

Dan Burnett is to supply text. 
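For illustration of the name space idea (the URI schemes shown are assumptions, 
not agreed syntax), a recognition request might reference either kind of grammar: 
    http://example.com/grammars/pizza-order.grxml   (grammar supplied or fetched by the client) 
    builtin:grammar/digits                          (grammar already known to the server) 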

(16) Need to cover speaker enrollment, identification and classification as 
well as recognition as part of SV. Multiple methods are needed. 
Resolution: will add this to requirements. Dan Burnett is to provide more 
details. 

(17) Why a requirement on cross-utterance state? 
Dan Burnett explained: he wants to make sure the implementation option remains 
open. Hence his concern is that there be no requirement that cross-utterance 
information be held only in the client. Stephane saw this as an example of a 
number of cases where extensibility will be needed. Dave Oran suggested we need 
a way to express in the protocol that some barrier has been crossed and 
resynchronization is needed. Looking at it another way: we need to be able to 
indicate that different transactions, not necessarily sequential, are 
correlated. Stephane suggested we add to this that the specific kind of 
correlation is proprietary. Following on, it is important that the server be 
able to give a result and say what context it applies to. 

(19) Need simultaneous performance of multiple functions on the same streams. 
The meeting agreed to add the requirement but not to consider parallel 
decomposition for now. (Could be happening behind the scenes due to OPES 
issues.) 

Stephane wondered if we always assume the output of the engine goes back to the 
issuer of the command. The Chairs' answer was "yes", on security grounds: there 
are too many hacking scenarios otherwise. 

It was noted that the security section needs expansion. It should distinguish 
between requirements on the protocol (being put together in this document) and 
requirements on the system (not to be documented). 

Other agenda points 
=================== 

The requirements discussion took all the time available, so intervening points 
of the agenda were not covered. 

1115 - Wrap-up and next steps 
============================= 

The intent is to reissue the requirements draft by July 27. The group would aim 
for Working Group Last Call by early August, with text going to the IESG by the 
end of August. 
Issue: do use cases go into the requirements, or will they just be used as a 
guide? 
Stephane Maes proposed a short summary in the requirements, but mainly use them 
as a guide. 

Steve asked what the group would do about discovery and resource management. 
Dave Oran pointed out that this is a generic problem for client-server 
protocols. He suggested just leaving it to system architects. This implies that 
clients and servers become limited in applicability to the discovery mechanisms 
they implement. 

The list is now at speechsc@ietf.org. 

Note 1: the posted IETF agenda still has this as a "CATS" BoF. We are in fact 
an approved WG, and are called SPEECHSC, as we previously reported to the mailing 
list. 

Note 2: the mrcp@showshore.com mailing list will be decommissioned immediately 
following this IETF. PLEASE subscribe to the speechsc@ietf.org mailing list as 
soon as convenient.