Paper Abstract and Keywords
Presentation
2019-05-31 10:00
Cross-modal Search using Visually Grounded Multilingual Speech Signal
Yasunori Ohishi, Akisato Kimura, Takahito Kawanishi, Kunio Kashino (NTT), David Harwath, James Glass (MIT)
PRMU2019-11
Abstract
We evaluate a deep neural network model that learns to associate images with audio captions describing their content, on crossmodal search tasks (image and speech retrieval). We show that a trilingual model trained simultaneously on English, Hindi, and newly recorded Japanese audio caption data outperforms the corresponding monolingual models. Further, we demonstrate that the trilingual model implicitly learns meaningful word-level translations grounded in images.
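The abstract describes learning a shared latent space for images and multilingual spoken captions. Below is a minimal sketch, assuming a PyTorch implementation, of that general recipe: a convolutional image encoder and a convolutional audio (spectrogram) encoder map into a common embedding space and are trained with a triplet-style margin ranking loss so that matched image/speech pairs outscore in-batch impostors. The layer sizes, mel-spectrogram input format, and margin value are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageEncoder(nn.Module):
    """Maps an RGB image into the shared latent space."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):
        return F.normalize(self.fc(self.conv(x).flatten(1)), dim=-1)


class AudioEncoder(nn.Module):
    """Maps a spoken caption (as a mel-spectrogram) into the same space.
    Language-agnostic: English, Hindi, and Japanese captions would all
    pass through this one encoder in a trilingual setup."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, (n_mels, 5), padding=(0, 2)), nn.ReLU(),
            nn.Conv2d(64, 128, (1, 5), stride=(1, 2), padding=(0, 2)), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):  # x: (batch, 1, n_mels, time)
        return F.normalize(self.fc(self.conv(x).flatten(1)), dim=-1)


def triplet_ranking_loss(img_emb, aud_emb, margin=0.2):
    """Matched image/speech pairs must outscore in-batch impostors by a
    margin, in both retrieval directions (image->speech, speech->image)."""
    sim = img_emb @ aud_emb.t()          # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)        # matched-pair scores, shape (B, 1)
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)  # drop diagonal
    loss_i2a = (F.relu(margin + sim - pos) * mask).mean()
    loss_a2i = (F.relu(margin + sim.t() - pos) * mask).mean()
    return loss_i2a + loss_a2i


if __name__ == "__main__":
    img_enc, aud_enc = ImageEncoder(), AudioEncoder()
    images = torch.randn(8, 3, 224, 224)   # dummy image batch
    speech = torch.randn(8, 1, 40, 1024)   # dummy spectrograms, 1024 frames
    loss = triplet_ranking_loss(img_enc(images), aud_enc(speech))
    loss.backward()
    # At search time: embed the query once, then rank items from the other
    # modality by dot product in the shared space.
```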
Keyword
Vision and spoken language / Shared latent space / Crossmodal search / Convolutional neural network
Reference Info.
IEICE Tech. Rep., vol. 119, no. 64, PRMU2019-11, pp. 283-288, May 2019. |
Paper #
PRMU2019-11 |
Date of Issue
2019-05-23 (PRMU) |
ISSN
Online edition: ISSN 2432-6380 |
Copyright and reproduction
All rights are reserved and no part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Notwithstanding, instructors are permitted to photocopy isolated articles for noncommercial classroom use without fee. (License No.: 10GA0019/12GB0052/13GB0056/17GB0034/18GB0034) |
|