抄録(英) |
The botnet detection is imperative. Among several detection schemes, the promising one uses the communication sequences. The main idea of that scheme is that the communication sequences represent special feature since they are controlled by programs. That sequence is tokenized to truncated sequences by $n$-gram and the numbers of each pattern\'s occurrence are used as a feature vector. However, although the features are normalized by the total number of all patterns\' occurrences, the number of occurrences in larger $n$ are less than those of smaller $n$. That is, regardless of the value of $n$, the previous scheme normalizes it by the total number of all patterns\' occurrences. As a result, normalized long patterns\' features become very small value and are hidden by others. In order to overcome this shortcoming, in this paper, we propose tit. We realize the emphasizing by two ideas. The first idea is normalizing occurrences by the total number of occurrences in each $n$ instead of the total number of all patterns\' occurrences. By doing this, smaller occurrences in larger $n$ are normalized by smaller values and the feature becomes more balanced with larger value. The second idea is giving weights to the normalized features by calculating ranks of the normalized feature. By weighting features according to the ranks, we can get more outstanding features of longer patterns. By the computer simulation with real dataset, we show the effectiveness of our scheme. |