David Griol
Carlos III University of Madrid
Jesús García-Herrero
Carlos III University of Madrid
José Manuel Molina
Carlos III University of Madrid
Vol. 2 No. 3 (2013), Articles, pages 37-53


In this paper we present a novel framework for integrating visual sensor networks and speech-based interfaces. Our proposal follows the standard reference architecture for fusion systems (JDL) and combines techniques from Artificial Intelligence, Natural Language Processing and User Modeling to provide enhanced interaction with its users. Firstly, the framework integrates a Cooperative Surveillance Multi-Agent System (CS-MAS), which includes several types of autonomous agents working in a coalition to track targets and make inferences about their positions. Secondly, enhanced conversational agents facilitate human-computer interaction by means of speech. Thirdly, a statistical methodology models the user's conversational behavior; the model is learned from an initial corpus and improved with the knowledge acquired from successive interactions. Finally, we propose a technique that fuses these multimodal information sources and uses the result to decide the next system action.

