DISPATCH                                                     A. Amirante
Internet-Draft                                               T. Castaldi
Expires: July 11, 2010                                        L. Miniero
                                                           S. P. Romano
                                                    University of Napoli
                                                         January 7, 2010


             Session Recording for Conferences using SMIL
                     draft-romano-dcon-recording-01

Abstract

   This document deals with session recording, specifically with the
   recording of multimedia conferences, both centralized and
   distributed.  Each medium involved is recorded separately and then
   properly tagged.  A SMIL [W3C.CR-SMIL3-20080115] metadata file is
   used to put all the separate recordings together and to handle their
   synchronization, as well as the possibly asynchronous opening and
   closure of media within the context of a conference.  This SMIL
   metadata file can subsequently be used by an interested user, by
   means of a compliant player, in order to passively receive a playout
   of the whole multimedia conference session.  The motivation for this
   document comes from our experience with our conferencing framework,
   Meetecho, for which we implemented a recording functionality.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on July 11, 2010.


Amirante, et al.
Expires July 11, 2010                                           [Page 1]
Internet-Draft            DCON Session Recording            January 2010

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the BSD License.

Table of Contents

   1.  Introduction
   2.  Conventions
   3.  Terminology
   4.  Recording
     4.1.  Audio/Video
     4.2.  Chat
     4.3.  Slides
     4.4.  Whiteboard
   5.  Tagging
     5.1.  SMIL Head
     5.2.  SMIL Body
       5.2.1.  Audio/Video
       5.2.2.  Chat
       5.2.3.  Slides
       5.2.4.  Whiteboard
   6.  Playout
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
   Authors' Addresses

1.  Introduction

   This document deals with session recording, specifically with the
   recording of multimedia conferences, both centralized and
   distributed.  Each medium involved is recorded separately and then
   properly tagged.  Such functionality is often required in many
   conferencing systems, and is of great interest to the XCON [RFC5239]
   Working Group.

   The motivation for this document comes from our experience with our
   conferencing framework, Meetecho, for which we implemented a
   recording functionality.  Meetecho is a standards-based conferencing
   framework, and so we tried our best to implement recording in a
   standard fashion as well.

   In the approach presented in this document, a SMIL
   [W3C.CR-SMIL3-20080115] metadata file is used to put all the
   separate recordings together and to handle their synchronization, as
   well as the possibly asynchronous opening and closure of media
   within the context of a conference.  This SMIL metadata file can
   subsequently be used by an interested user, by means of a compliant
   player, in order to passively receive a playout of the whole
   multimedia conference session.

   The document presents the approach by sequentially describing the
   several required steps.  So, in Section 4 the recording step is
   presented, with an overview of how each involved medium might be
   recorded and stored for future use.  As will be explained in the
   following sections, existing approaches might be exploited to
   achieve these steps (e.g., MEDIACTRL [RFC5567]).  Then, in Section 5
   the tagging process is described, by showing how each medium can be
   addressed in a SMIL metadata file, with specific focus upon the
   timing and inter-media synchronization aspects.  Finally, Section 6
   is devoted to describing how a potential player for the recorded
   session can be implemented and what it is supposed to achieve.
2.  Conventions

   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT
   RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as
   described in BCP 14, RFC 2119 [RFC2119] and indicate requirement
   levels for compliant implementations.

3.  Terminology

   TBD.

4.  Recording

   When a multimedia conference is realized over the Internet, several
   media might be involved at the same time.  Besides, these media
   might come and go asynchronously during the lifetime of the same
   conference.  This makes it quite clear that, in case such a
   conference needs to be recorded in order to allow a subsequent,
   possibly offline, playout, these media need to be recorded in a
   format that is aware of all the timing-related aspects.

   A typical example is a videoconference with slide sharing.  While
   audio and video have a life of their own, slide changes might be
   triggered at a completely different pace.  Besides, the start of a
   slideshow might occur much later than the start of the audio/video
   session.  All these requirements must be taken into account when
   dealing with session recording in a conference.  Besides, it is
   important that all the individual recordings be taken in a standard
   fashion, in order to achieve the maximum compatibility among
   different solutions and avoid any proprietary mechanism or approach
   that could prevent a successful playout later on.

   In this document, we present our approach towards media recording in
   a conference.  Specifically, we will deal with the recording of the
   following media:

   o  audio and video streams (in Section 4.1);

   o  text chats (in Section 4.2);

   o  slide presentations (in Section 4.3);

   o  whiteboards (in Section 4.4).

   Additional media that might be involved in a conference (e.g.,
   desktop or application sharing) are not presented in this document,
   and their description is left to future extensions.
4.1.  Audio/Video

   In a conferencing system compliant with [RFC5239], audio and video
   streams contributed by participants are carried in RTP channels
   [RFC3550].  These RTP channels may or may not be secured (e.g., by
   means of SRTP/ZRTP).  Whether or not these channels are secured,
   however, is not an issue in this case.  In fact, as is usually the
   case, all the participants terminate their media streams at a
   central point (a mixer entity), with which they would have a secured
   connection.  This means that the mixer would get access to the
   unencrypted payloads, and would be able to mix and/or store them
   accordingly.

   From a high-level topology point of view, this is how a recorder for
   audio and video streams could be envisaged:

                      SIP  +------------+  SIP
                /----------|  XCON AS   |--------\
               /           +------------+         \
              /                 |MEDIACTRL         \
             /                  |                   \
       +-----+              +-----+              +-----+
       |     |     RTP      |     |     RTP      |     |
       |UA-A +<------------>+Mixer+<------------>+UA-B |
       |     |              |     |              |     |
       +-----+              +-++--+              +-----+
                  RTP UA-A   |  |   RTP UA-B
                  (Rx+Tx)    |  |   (Rx+Tx)
                             V  V
                         +----------+
                         |          |
                         | Recorder |
                         |          |
                         +----------+

                   Figure 1: Audio/Video Recorder

   [Editors' Note: this is a slightly modified version of the topology
   proposed on the DISPATCH mailing list,
   http://www.ietf.org/mail-archive/web/dispatch/current/
   msg00256.html where the Application Server has been specialized in
   an XCON-aware AS, and the AS<->Mixer protocol is the Media Control
   Channel Framework protocol (CFW) specified in
   [I-D.ietf-mediactrl-sip-control-framework].]

   That said, actually recording audio and video streams in a
   conference may be accomplished in several ways.  Two different
   approaches might be highlighted:

   1.  recording each contribution from/to each participant in a
       separate file (Figure 2);

   2.  recording the overall mix (one for audio and one for video, or
       more if several mixes for the same media type are available) in
       a dedicated file (Figure 3).
                               +-------+
                               | UAC-C |
                               +-------+
                                   "
                                   "  C (RTP)
                                   "
                                   v
   +-------+     A (RTP)    +----------+    B (RTP)     +-------+
   | UAC-A |===============>| Recorder |<===============| UAC-B |
   +-------+                +----------+                +-------+
                                 *
                                 *
                                 ****> A.gsm, A.h263
                                 ****> B.g711, B.h264
                                 ****> C.amr

                Figure 2: Recording individual streams

                               +-------+
                               | UAC-C |
                               +-------+
                                   "
                                   "  C (RTP)
                                   "
                                   v
   +-------+     A (RTP)    +----------+    B (RTP)     +-------+
   | UAC-A |===============>| Recorder |<===============| UAC-B |
   +-------+                +----------+                +-------+
                                 *
                                 *
                                 ****> (A+B+C).wav, (A+B+C).h263

                  Figure 3: Recording mixed streams

   Of the two, the second is probably more feasible.  In fact, the
   first would require a potentially vast amount of separate
   recordings, which would need to be subsequently muxed and correlated
   to each other.  Besides, within the context of a multimedia
   conference, most of the times the streams are already mixed for all
   the participants, and so recording the mix directly would be a clear
   advantage.  Such an approach, of course, assumes that all the
   streams pass through a central point where the mixing occurs: this
   is the case depicted in Figure 1.  The recording would take place at
   that point.  Such a central point, the mixer (which in this case
   would also act as the recorder, or as a frontend to it), might be a
   MEDIACTRL-based [RFC5567] Media Server.

   Considering the similar nature of audio and video (both being RTP-
   based and mixed by probably the same entity), they are analysed in
   the same section of this document.  The same applies to tagging and
   playout as well.  It is important to note that, in case any policy
   is involved (e.g., moderation by means of the BFCP [RFC4582]), the
   mixer would take it into account when recording.
   In fact, the same policies applied to the actual conference with
   respect to the delivery of audio and video to the participants need
   to be enforced for the recording as well.

   In a more general way, if the mixer does not support a direct
   recording of the mixes it prepares, recording a mix can be achieved
   by attaching the recorder entity (whatever it is) as a passive
   participant to the conference.  This would allow the recorder to
   receive all the involved audio and video streams already properly
   mixed, with policies already taken into consideration.  This
   approach is depicted in Figure 4.

                               +-------+
                               |  UAC  |
                               |   C   |
                               +-------+
                                 " ^
                        C (RTP)  " "
                                 " "
                      A+B (RTP)  v "
   +-------+    A (RTP)    +--------+   A+C (RTP)   +-------+
   |  UAC  |==============>| Media  |==============>|  UAC  |
   |   A   |<==============| Server |<==============|   B   |
   +-------+   B+C (RTP)   +--------+    B (RTP)    +-------+
                               "
                               "  A+B+C (RTP)
                               v
                          +----------+
                          | Recorder |
                          +----------+
                               *
                               ****> (A+B+C).wav, (A+B+C).h263

             Figure 4: Recorder as a passive participant

   Whether or not the mixer is MEDIACTRL-based, it is quite likely that
   the AS handling the multimedia conference business logic has some
   control over the mixing involved.  This means it can request the
   recording of each available audio and/or video mix in a conference,
   if only by adding the passive participant as mentioned above.
   Besides, events occurring at the media level, or the business logic
   in the AS itself, allow the AS to take note of timing information
   for each of the recorded media.  For instance, the AS may take note
   of when the video mixing started, in order to properly tag the video
   recording in the tagging phase.  Both the recordings and the timing
   events list would subsequently be used in order to prepare the
   metadata information of the audio and video in the overall session
   recording description.  Such a phase is described in Section 5.2.1.
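   As an illustration only, the timing events list mentioned above
   might be kept by the AS in a form such as the following sketch;
   element names, attributes and file names are all invented for this
   example and are by no means normative:

   <!-- Hypothetical event list an AS might keep for the recorded
        mixes; every name and value here is illustrative only. -->
   <events conference="f44gf">
      <event time="0s"    type="audio-mix-started" uri="conf-audio.wav"/>
      <event time="12s"   type="video-mix-started" uri="conf-video.h263"/>
      <event time="3540s" type="video-mix-stopped"/>
      <event time="3600s" type="audio-mix-stopped"/>
   </events>

   Such a list would later be translated into the timing attributes
   associated with the audio and video media in the SMIL metadata file
   (Section 5.2.1).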
   In a MEDIACTRL Media Server, such functionality might be
   accomplished by means of the Mixer Control Package
   [I-D.ietf-mediactrl-mixer-control-package].  At the end of the
   conference, URLs to the actual recordings would be made available
   for the AS to use.  The AS might then subsequently access those
   recordings according to its business logic, e.g., to store them
   somewhere else (the MS storage might be temporary) or to implement
   an offline transcoding and/or mixing of all the recordings in order
   to obtain a single file representative of the whole audio/video
   participation in the conference.  Practical examples of these
   scenarios are presented in [I-D.ietf-mediactrl-call-flows].

   Of course, if the recording of a mix is not possible or desired, one
   could still fall back to the first approach, that is, individually
   recording all the incoming contributions.  This is the case, for
   instance, with conferencing systems that don't implement video
   mixing, but instead just rely on a switching/forwarding of the
   potentially several video streams to each participant.  This
   functionality can also be achieved by means of the same control
   package previously introduced, since it allows for the recording of
   both mixes and individual connections.  Once the conference ends,
   the AS can then decide what to do with the recordings, e.g., mixing
   them all together offline (thus obtaining an overall mix) or leaving
   them as they are.  The tagging process would then take the decision
   into account, and address the resulting media accordingly.

4.2.  Chat

   What has been said about audio and video partially applies to text
   chats as well.  In fact, just as a central mixer is usually involved
   for audio and video, for instant messaging most of the times the
   contributions by all participants pass through a central node from
   where they are forwarded to the other participants.
   This is the case, for instance, of XMPP [RFC3920] and MSRP [RFC4975]
   based text conferences.  If so, recording of the text part of a
   conference is not hard to achieve either.  The AS just needs to
   implement some form of logging, in order to store all the messages
   flowing through the text conference central node, together with
   information on the senders of these messages and timing-related
   information.  Of course, the AS may not directly be the text
   conference mixer: the same considerations apply, however, in the
   sense that the remote mixer must be able to implement the
   aforementioned logging, and must be able to receive related
   instructions from the controlling AS.

   Besides, considering the possibly protocol-agnostic nature of the
   conferencing system (as envisaged in [RFC5239]), several different
   instant messaging protocols may be involved in the same conference.
   Just as the conferencing system would act as a protocol gateway
   during the lifetime of the conference (i.e., provide MSRP users
   with the text coming from XMPP participants and vice versa), all the
   contributions coming from the different instant messaging protocols
   would need to be recorded in the same log, and in the same format,
   to avoid ambiguity later on.  An example of a recorder for instant
   messaging is presented in Figure 5.

                                  +-------+
                                  | UAC-C |
                                  +-------+
                                     ^
                            C (MSRP) "  '10.11.24 Hi!'
                                     "
                                     v
   +-------+     A (XMPP)      +----------+      B (IRC)      +-------+
   | UAC-A |<=================>| Recorder |<=================>| UAC-B |
   +-------+ '10.11.26 Hey C'  +----------+ '10.11.30 Hey man'+-------+
                                    *
                                    *
                                    *     [..]
                                    ****> 10.11.24 Hi!
                                    ****> 10.11.26 Hey C
                                    ****> 10.11.30 Hey man
                                          [..]

                 Figure 5: Recording a text conference
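   To sketch the idea of a single, protocol-agnostic log, the messages
   of Figure 5 might be stored along the following lines; the format is
   purely illustrative (element names, timestamps and identifiers are
   invented), and the standard format actually exposed to players is
   discussed later on:

   <!-- Illustrative only: a protocol-agnostic chat log the AS might
        keep; all names and values are invented for this example. -->
   <chatlog conference="f44gf">
      <message time="10.11.24" sender="UAC-C" protocol="MSRP">Hi!</message>
      <message time="10.11.26" sender="UAC-A" protocol="XMPP">Hey C</message>
      <message time="10.11.30" sender="UAC-B" protocol="IRC">Hey man</message>
   </chatlog>

   Whatever the internal format, the key point is that sender and
   timing information survive the gatewaying between protocols.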
   The same considerations already mentioned about optional policies
   apply to text conferences as well: i.e., if a UAC is not allowed to
   contribute text to the chat, this contribution is excluded both from
   the mix the other participants receive and from the ongoing
   recording.

   Considerations about the format of the recording are left to
   Section 5.2.2.  Until then, we just assume the AS has a way to
   record text conferences somehow, in a format it is familiar with.
   This format would subsequently be converted to another, standard,
   format that a player would be able to access.

4.3.  Slides

   Another medium typically available in a multimedia conference over
   the Internet is the slide presentation.  In fact, slides, whatever
   format they're in, are still the most common way of presenting
   something within a collaboration framework.  The problem is that,
   most of the times, these slides are deployed in a proprietary way
   (e.g., Microsoft Powerpoint and the like).  This means that, besides
   the recording aspect of the issue, the delivery itself of such
   slides can be problematic when considered in a standards-based
   conferencing framework.

   Considering that no standard way of implementing such functionality
   is commonly available yet, we assume that a conferencing framework
   makes such slides available to the participants in a conference as a
   slideshow, that is, a series of static images whose appearance might
   be dictated by a dedicated protocol.  For instance, a presenter may
   trigger the change of a slide by means of an instant messaging
   protocol, providing each authorized participant with a URL from
   where to get the current slide, with optional metadata to describe
   its content.  An example is presented in Figure 6.  The presenter
   has previously uploaded the presentation in a proprietary format.
   The presentation has been converted to images, and a description of
   the new format has been sent back to the presenter (e.g., an XML
   metadata file).  At this point, the presenter makes use of XMPP to
   inform the other participants about the current slide, by providing
   an HTTP URL to the related image.

                             +-----------+
                             | Presenter |
                             +-----------+
                                   "
                           (XMPP)  "  Current presentation: f44gf
                                   "  Current slide number: 4
                                   "  URL: http://example.com/f44gf/4.jpg
                                   v
   +-------+      (XMPP)     +----------+     (XMPP)      +-------+
   | UAC-A |<================| ConfServ |================>| UAC-B |
   +-------+                 +----------+                 +-------+
       |                                                      |
       |  HTTP GET                                  HTTP GET  |
       |  (http://example.com/f44gf/4.jpg)                    |
       v                  (http://example.com/f44gf/4.jpg)    v

               Figure 6: Presentation sharing via XMPP

   From this assumption, the recording of each slide presentation would
   be relatively trivial to achieve.  In fact, the AS would just need
   to have access to the set of images (with the optional metadata
   involved) of each presentation, and to the additional information
   related to presenters and to when each slide was triggered.  For
   instance, the AS may take note of the fact that slide 4 from
   presentation "f44gf" of the example above has been presented by UAC
   "spromano" from second 56 of the conference to second 302.  Properly
   recording all those events would allow for a subsequent tagging,
   thus allowing for the integration of this medium in the whole
   session recording description together with the other media
   involved.  This phase will be described in Section 5.2.3.

4.4.  Whiteboard

   To conclude the overview on the analysed media, we consider a
   further medium which is quite commonly deployed in multimedia
   conferences: the shared whiteboard.  There are several ways of
   implementing such functionality.
   While some standard solutions exist, they are rarely used within the
   context of commercial conferencing applications, which usually
   prefer to implement whiteboarding in a proprietary fashion.  Without
   delving into a discussion on this aspect, suffice it to say that,
   for a successful recording of a whiteboard session, most of the
   times it is enough to just record the individual contributions of
   each involved participant (together with the usual timing-related
   information).  In fact, this would allow for a subsequent replay of
   the whiteboard session in an easy way.  Unlike audio and video,
   whiteboarding usually is a very lightweight medium, and so recording
   the individual contributions rather than the resulting mix (as we
   suggested in Section 4.1) is advisable.  These contributions may
   subsequently be mixed together in order to obtain a standard
   recording (e.g., a series of images, animations, or even a low
   framerate video).  An example of recording for this medium is
   presented in Figure 7.

                                 +-------+
                                 | UAC-C |
                                 +-------+
                                     "
                            C (XMPP) "  10.11.20: line
                                     "
                                     v
   +-------+      A (XMPP)     +-----------+     B (XMPP)     +-------+
   | UAC-A |==================>| WB server |<=================| UAC-B |
   +-------+  10.10.56: circle +-----------+  10.12.30: text  +-------+
                                     *
                                     *
                                     ****> 10.10.56: circle (A)
                                     ****> 10.11.20: line   (C)
                                     ****> 10.12.30: text   (B)

               Figure 7: Recording a whiteboard session

   The recording process may be enriched by the population of a
   parallel event list.  For instance, such a list might include events
   such as the creation of a new whiteboard, the clearing of an
   existing whiteboard, or the addition of a background image replacing
   the previously existing content.  Such events would be precious in a
   subsequent playout of the recorded steps, since they would allow for
   a more lightweight replication in case seeking is involved.
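   A sketch of the resulting log, enriched with one of the events
   mentioned above (a whiteboard clearing), might look as follows;
   element names, attributes and timestamps are illustrative only:

   <!-- Illustrative whiteboard log with a parallel "clear" event;
        all names and values are invented for this example. -->
   <wblog conference="f44gf" whiteboard="wb0">
      <draw  time="10.10.56" sender="UAC-A" shape="circle"/>
      <draw  time="10.11.20" sender="UAC-C" shape="line"/>
      <clear time="10.11.45"/>
      <draw  time="10.12.30" sender="UAC-B" shape="text"/>
   </wblog>

   A player seeking past the "clear" event would only need to replicate
   the drawings that follow it, rather than the whole history.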
   For instance, if 70 drawings have been made, but at second 560 of
   the conference the whiteboard has been cleared and since then only 5
   drawings have been added, a viewer seeking to second 561 would just
   need the clear event and the 5 subsequent drawings to be replicated.
   Anyway, further discussion upon the tagging process of this medium
   is presented in Section 5.2.4.

5.  Tagging

   Once the different media have been recorded and stored, and their
   timing information collected somehow, this information needs to be
   properly tagged in order to allow intra-media and inter-media
   synchronization in case a playout is invoked.  Besides, it would be
   desirable to make use of standard means for achieving such
   functionality.  For these reasons, we chose to make use of the
   Synchronized Multimedia Integration Language
   [W3C.CR-SMIL3-20080115], which fulfills all the aforementioned
   requirements, besides being a well-established W3C standard.  In
   fact, timing information is very easy to address using this
   specification, and VCR-like controls (start, pause, stop, rewind,
   fast forward, seek and the like) are all easily deployable in a
   player using the format.  The SMIL specification provides means to
   address different media by using custom tags (e.g., audio, img,
   textstream and so on), and for each of these media the related
   timing can be easily described.

   The following subsections will describe how a SMIL metadata file
   could be prepared in order to map to the media recorded as described
   in Section 4.  Specifically, considering how a SMIL file is assumed
   to be constructed, the head will be described in Section 5.1, while
   the body (with different focus for each medium) will be presented in
   Section 5.2.

5.1.  SMIL Head

   As specified in [W3C.CR-SMIL3-20080115], a SMIL file is composed of
   two separate sections: a head and a body.
   The head, among all the needed information, also includes details
   about the allowed layouts for a multimedia presentation.
   Considering the number of media that might have been involved in a
   single conference, properly constructing such a section definitely
   makes much sense.  In fact, all the involved media need to be placed
   so as not to prevent access to other concurrent media within the
   context of the same recording.  For instance, this is how a series
   of different media might be placed in a layout according to
   different screen resolutions:

   [..]

   That said, it is important that this section of the SMIL file be
   constructed properly.  In fact, the layout description also contains
   explicit region identifiers, which are referred to when describing
   media in the body section.

   TBD. (?)

5.2.  SMIL Body

   The SMIL head section described previously is very important for
   what concerns presentation-related settings, but does not contain
   any timing-related information.  Such information, in fact, belongs
   to a separate section of the SMIL file, the so-called body.  This
   body contains the information on all the media involved in the
   recorded session, and for each medium timing information is
   provided.  This timing information includes not only when each
   medium appears and when it goes away, but also details on the
   medium's lifetime as well.  By correlating the timing information
   for each medium, a SMIL reader can infer inter-media synchronization
   and present the recorded session as it was conceived to appear.
   Besides, the involved media can be grouped in the body in order to
   implement sequential and/or parallel playback involving a subset of
   the available media.  This is made possible by making use of the
   <seq> and <par> elements.
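   To give a feeling of how head and body fit together, the following
   is a minimal, purely illustrative SMIL file for a recorded
   conference; region identifiers, file names and timing values are all
   invented for this example:

   <!-- Illustrative only: a minimal SMIL description of a recorded
        conference; identifiers, file names and timings are invented. -->
   <smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
     <head>
       <layout>
         <root-layout width="1024" height="768"/>
         <region id="videoRegion"  left="0"   top="0"   width="512"
                 height="384"/>
         <region id="slidesRegion" left="512" top="0"   width="512"
                 height="384"/>
         <region id="chatRegion"   left="0"   top="384" width="1024"
                 height="384"/>
       </layout>
     </head>
     <body>
       <par>
         <!-- audio mix spans the whole conference -->
         <audio src="conf-audio.wav" begin="0s"/>
         <!-- video mixing started 12 seconds into the conference -->
         <video src="conf-video.h263" region="videoRegion" begin="12s"/>
         <!-- slides appear in sequence, starting at second 56 -->
         <seq begin="56s">
           <img src="f44gf/4.jpg" region="slidesRegion" dur="246s"/>
           <img src="f44gf/5.jpg" region="slidesRegion" dur="120s"/>
         </seq>
         <textstream src="conf-chat.rt" region="chatRegion" begin="0s"/>
       </par>
     </body>
   </smil>

   Here the audio mix spans the whole session, the video mix begins 12
   seconds in, and the slideshow starts at second 56, mirroring the
   kind of timing events collected during the recording phase.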
   The <par> element in particular is of great interest to this
   document, since in a multimedia conference many media are presented
   to participants at the same time.

   That said, it is important to be able to separately address each
   involved medium.  To do so, SMIL makes use of well-specified
   elements.  For instance, a