Time-interval temporal patterns can beat and explain the malware

Abstract

Malware-based cyber-attacks mainly aim to obtain sensitive data, steal intellectual property, deny access to critical services and data, and achieve financial gain. Malware has continuously evolved, becoming more sophisticated and evasive, and it remains a major cyber-security threat. To keep pace with malware's evolution, there is a critical need for new, advanced malware detection methods. Widely used solutions, such as antivirus software and other static host-based intrusion detection systems, have limitations, particularly in detecting new, unknown, and evasive malware. Many of the limitations of static analysis can be overcome by combining dynamic malware analysis with machine learning (ML) algorithms: executing the malware in an isolated environment (e.g., a sandbox) enables the acquisition of rich behavioral and time-oriented information about its behavior. Prior studies have proposed various detection methods based on dynamically extracted API calls, but beyond simple order-based approaches, more advanced time-based methods have not been explored. In this paper, we propose a more comprehensive detection framework that, by analyzing the raw multivariate time-series data associated with malware execution, can accurately capture malware behavior and provide clear explainability regarding both that behavior and the detection model's decisions. We are the first to mine and automatically discover meaningful and explainable time-interval temporal API call patterns associated with malware behavior and to leverage them, using a variety of ML algorithms, for malware detection and categorization. To evaluate our proposed solution, we established a comprehensive dynamic-analysis environment using Cuckoo Sandbox and analyzed more than 17,000 portable executables executed in Windows 10, the most widely used operating system today.
We conducted extensive experiments on malware detection and categorization and compared the performance of our solution to state-of-the-art methods, including non-time-oriented methods (classic ML algorithms) and order-based methods (LSTM networks). The results show that our proposed solution outperforms the other methods, obtaining 99.6% detection accuracy for unknown malware and 97.65% categorization accuracy. In a more complex scenario of detecting an unknown malware type with an unseen modus operandi, our method obtained almost 90% detection accuracy, outperforming the state-of-the-art methods. To demonstrate the human explainability our method provides, we present temporal patterns that we discovered for different malware families; these patterns shed light on malware behavior and can be used by cyber-security experts to better understand malware, better defend against future attacks, and even attribute malware campaigns to the cyber-attackers launching them.
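
To make the notion of a time-interval temporal pattern concrete, the following is a minimal illustrative sketch, not the mining procedure used in this work. It assumes that each API call is represented as a symbolic interval with a start and end time, and that pairs of intervals are described with Allen-style relations (before, meets, overlaps, and so on); the abstract does not specify the relation set, so this choice is an assumption. The `Interval` type, the `relation` and `pairwise_relations` helpers, and the timestamps are hypothetical.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    """A symbolic time interval: one API call with its start and end time."""
    symbol: str   # API-call label (illustrative values only)
    start: float
    end: float


def relation(a: Interval, b: Interval) -> str:
    """Allen-style relation of a to b, assuming a sorts first by (start, end)."""
    if a.start == b.start and a.end == b.end:
        return "equals"
    if a.start == b.start:
        return "starts"        # same start, a ends earlier
    if a.end < b.start:
        return "before"
    if a.end == b.start:
        return "meets"
    if a.end == b.end:
        return "finished-by"   # a starts earlier, same end
    if a.end > b.end:
        return "contains"
    return "overlaps"


def pairwise_relations(intervals: List[Interval]) -> List[Tuple[str, str, str]]:
    """Describe every ordered pair of intervals by its temporal relation."""
    ordered = sorted(intervals, key=lambda i: (i.start, i.end))
    return [(a.symbol, relation(a, b), b.symbol)
            for a, b in combinations(ordered, 2)]


# Hypothetical trace: three API calls with made-up timestamps (seconds).
trace = [
    Interval("NtCreateFile", 0.0, 2.0),
    Interval("NtWriteFile", 1.0, 4.0),
    Interval("NtSetValueKey", 5.0, 6.0),
]
for triple in pairwise_relations(trace):
    print(triple)
```

Unlike a purely order-based representation, which would only record that the three calls occurred in sequence, the interval view also records that the first two calls overlap in time. Under these assumptions, a pattern is a set of such relations, and it can be read directly by an analyst.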