Prompt Learning for Object Detection with Vision-Language Model

Mariko Tomariguchi

Presentation	2023-05-19 Prompt Learning for Object Detection with Vision-Language Model, Mariko Tomariguchi (OKI)
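Abstract (in Japanese; translated)	Two-stage object detection models crop the features of regions that are likely to contain an object and classify the object from those features. This work examines how the surrounding context outside the object region affects classification in object detection, and proposes a better prompt learning method for object detection with Vision-Language models. First, prompts that include or exclude the surrounding context are created by running CLIP prompt learning on augmented image data. Next, the object detection model is trained to predict classes using the language embeddings obtained by feeding the learned prompts into the CLIP language encoder. On the LVIS dataset, the method achieves 20.3% $\mathrm{AP}$ with prompts that include the surrounding context and 21.6% $\mathrm{AP}$ with prompts that exclude it. On the LVIS frequent classes in particular, it achieves 27.9% $\mathrm{AP}_f$ and 29.1% $\mathrm{AP}_f$, respectively.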
Abstract (in English)	Two-stage object detection models crop features from the regions most likely to contain objects and classify the objects from those features. In this work, we investigate how information about an object's surroundings influences object classification, and we improve the prompt learning method for object detection with Vision-Language models. We train the learnable vectors that correspond to CLIP's input prompts on augmented data to create prompts with and without surrounding information. We then train the object detection model so that its classification score is computed from the language embeddings obtained by passing the learned prompts through the CLIP language encoder. Our method achieves 20.3% $\mathrm{AP}$ on the LVIS dataset with prompts that include surroundings and 21.6% $\mathrm{AP}$ with prompts that do not. On the LVIS frequent classes, it achieves 27.9% $\mathrm{AP}_f$ and 29.1% $\mathrm{AP}_f$, respectively.
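The classification step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact scoring function, so this assumes cosine similarity between a region feature and each class's language embedding, scaled by a temperature and passed through a softmax, as is common in CLIP-style classifiers. The function names, the temperature value, and the toy 3-dimensional vectors are hypothetical stand-ins, not the author's implementation or real CLIP outputs.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def classify_region(region_feat, class_embeds, temperature=0.01):
    """Return class probabilities for one region feature.

    class_embeds stands in for the language embeddings produced by passing
    one learned prompt per class through a text encoder (assumed here).
    """
    r = l2_normalize(region_feat)
    logits = []
    for emb in class_embeds:
        t = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(r, t))
        logits.append(cosine / temperature)
    # Numerically stable softmax over the classes.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data: three class embeddings and one region feature closest to class 1.
class_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
region_feat = [0.1, 0.9, 0.2]
probs = classify_region(region_feat, class_embeds)
pred = max(range(len(probs)), key=probs.__getitem__)
```

Under this assumption, swapping in prompts learned with or without surrounding context changes only `class_embeds`; the detector's region features and the scoring rule stay the same.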
Keywords	deep learning / object detection / Mask R-CNN / prompt learning / CLIP
Report number	PRMU2023-12
Issue date	2023-05-11 (PRMU)

Conference information
Committee	PRMU / IPSJ-CVIM
Dates	2023/5/18 (2 days, starting on this date)
Venue	Nagoya Institute of Technology
Topic	Neural scene representations such as NeRF
Chair	Seiichi Uchida (Kyushu Univ.)
Vice chairs	Takuya Funatomi (NAIST) / Mitsuru Anpai (Denso IT Lab.)
Secretaries	Kouta Yamaguchi (CyberAgent) / Yusuke Matsui (Univ. of Tokyo)
Assistant secretaries	Nakamasa Inoue (Tokyo Inst. of Tech.) / Yasutomo Kawanishi (Riken)

Paper details
Sponsoring committees	Technical Committee on Pattern Recognition and Media Understanding / Special Interest Group on Computer Vision and Image Media
Language of main text	JPN
Title	Prompt Learning for Object Detection with Vision-Language Model
Subtitle
Keyword (1)	deep learning
Keyword (2)	object detection
Keyword (3)	Mask R-CNN
Keyword (4)	prompt learning
Keyword (5)	CLIP
First author	Mariko Tomariguchi
First author affiliation	Oki Electric Industry Co., Ltd. (abbreviated: OKI)
Presentation date	2023-05-19
Report number	PRMU2023-12
Volume (vol)	vol.123
Issue (no)	PRMU-30
Page range	pp.62-67 (PRMU)
Number of pages	6
Issue date	2023-05-11 (PRMU)