Presentation 2022-05-20
Proposal for a Method of Estimating IT System Failure Locations Using Alert Scoring
Reiko Kondo, Kazutaka Ogihara, Takashi Shiraishi
Abstract(in Japanese) (See Japanese page)
Abstract(in English) In systems where services (microservices) composed of many cooperating applications are deployed across multiple virtual and physical machines, the devices depend on one another, so a single failure triggers alerts from a wide range of devices and makes the failure location difficult to identify. Alert thresholds are set to help identify the failure location, but if a threshold is set inappropriately, the affected equipment cannot be identified as the failure location. Moreover, because the number of devices is large, several different failures can occur within a short period, and alerts originating from different failure locations may be investigated as if they belonged to the same failure. We therefore propose a failure location estimation method that scores each device's likelihood of being the failure location based on the alerts of each device. The method has three features. First, by integrating and analyzing the dependencies between applications and infrastructure, it can estimate the failure location from alerts raised by a wide range of devices, from applications to infrastructure. Second, by reflecting the propagation of alerts between devices in the score, it can estimate as a failure location even a device that raises no alert, for example because of a threshold setting error. Third, by grouping related alerts based on configuration information and the dependencies between devices, it is expected to be applicable to the classification of multiple failures. With the proposed technology, operators can investigate the highest-scoring equipment first, which is expected to shorten the time required for fault recovery.
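The abstract does not state the scoring formula, so the following Python sketch only illustrates the general idea under stated assumptions and is not the authors' method. It assumes a hypothetical dependency map (`DEPENDS_ON`), a hypothetical per-hop attenuation factor (`DECAY`), and made-up device names. Each alert adds a score to its own device and an attenuated score to every device it transitively depends on, so a device that raises no alert can still rank first. Alerts are then grouped by connectivity in the dependency graph, which separates unrelated failures.

```python
from collections import defaultdict, deque

# Hypothetical topology: each device maps to the devices it depends on.
DEPENDS_ON = {
    "frontend": ["cart-svc"],
    "cart-svc": ["node-1"],
    "orders-svc": ["node-1"],
    "node-1": [],
    "batch-job": ["node-2"],
    "node-2": [],
}

# Hypothetical alert set: node-1 has failed but raises no alert itself
# (for example, because its threshold is set incorrectly).
ALERTING = ["frontend", "cart-svc", "orders-svc", "batch-job"]

DECAY = 0.75  # assumed attenuation per dependency hop, not from the paper


def score_devices(depends_on, alerting, decay=DECAY):
    """Add 1.0 to each alerting device and a decayed share to every
    device it transitively depends on (breadth-first traversal)."""
    scores = defaultdict(float)
    for source in alerting:
        seen = {source}
        queue = deque([(source, 1.0)])
        while queue:
            device, weight = queue.popleft()
            scores[device] += weight
            for dep in depends_on.get(device, []):
                if dep not in seen:
                    seen.add(dep)
                    queue.append((dep, weight * decay))
    return dict(scores)


def group_alerts(depends_on, alerting):
    """Group alerting devices that are connected through the dependency
    graph, treating dependencies as undirected edges."""
    neighbors = defaultdict(set)
    for device, deps in depends_on.items():
        for dep in deps:
            neighbors[device].add(dep)
            neighbors[dep].add(device)

    groups, assigned = [], set()
    for start in sorted(alerting):
        if start in assigned:
            continue
        component, queue = {start}, deque([start])
        while queue:
            device = queue.popleft()
            for nxt in neighbors[device]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        group = sorted(component & set(alerting))
        assigned.update(group)
        groups.append(group)
    return groups


if __name__ == "__main__":
    scores = score_devices(DEPENDS_ON, ALERTING)
    for device, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{device}: {score}")
    print(group_alerts(DEPENDS_ON, ALERTING))
```

In this toy topology the silent `node-1` receives the highest score because three alerting services depend on it, while the `batch-job` alert falls into a separate group. A real deployment would derive the dependency map from configuration data (for example, Kubernetes and Istio metadata) rather than a hand-written dictionary.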
Keyword(in Japanese) (See Japanese page)
Keyword(in English) Container / Operation / System configuration / Scoring / Multiple failures / Docker / Kubernetes / Istio
Paper # ICM2022-8
Date of Issue 2022-05-12 (ICM)

Conference Information
Committee ICM / IPSJ-CSEC / IPSJ-IOT
Conference Date 2022/5/19 (2 days)
Place (in Japanese) (See Japanese page)
Place (in English)
Topics (in Japanese) (See Japanese page)
Topics (in English)
Chair Kazuhiko Kinoshita(Tokushima Univ.)
Vice Chair Haruo Ooishi(NTT) / Eiji Takahashi(NEC)
Secretary Haruo Ooishi(Bosco) / Eiji Takahashi(Fujitsu)
Assistant Yoshifumi Kato(NTT)

Paper Information
Registration To Technical Committee on Information and Communication Management / Special Interest Group on Computer Security / Special Interest Group on Internet and Operation Technology
Language JPN
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) Proposal for a Method of Estimating IT System Failure Locations Using Alert Scoring
Sub Title (in English)
Keyword(1) Container
Keyword(2) Operation
Keyword(3) System configuration
Keyword(4) Scoring
Keyword(5) Multiple failures
Keyword(6) Docker
Keyword(7) Kubernetes
Keyword(8) Istio
1st Author's Name Reiko Kondo
1st Author's Affiliation FUJITSU LIMITED(FUJITSU)
2nd Author's Name Kazutaka Ogihara
2nd Author's Affiliation FUJITSU LIMITED(FUJITSU)
3rd Author's Name Takashi Shiraishi
3rd Author's Affiliation FUJITSU LIMITED(FUJITSU)
Date 2022-05-20
Paper # ICM2022-8
Volume (vol) vol.122
Number (no) ICM-32
Page pp.36-41 (ICM)
#Pages 6
Date of Issue 2022-05-12 (ICM)