Semi-supervised two-phase familial analysis of Android malware with normalized graph embedding

Authors:

Highlights:

Abstract

With the widespread use of smartphones, Android malware poses a serious threat to their security. Given the explosive growth of Android malware variants, detecting malware families is crucial for identifying new security threats, triaging, and building reference datasets. Building behavior profiles of Android applications (apps) from holistic graph-based features helps retain program semantics and resist obfuscation. Low-dimensional feature representations are more effective still, since they reduce computational cost and improve the efficiency of downstream analytics tasks. To this end, we design and develop GSFDroid, a practical system for the familial analysis of Android malware. We first use graph-based features that contain structural information to analyze app behavior. Then, we employ Graph Convolutional Networks (GCNs) to embed nodes into a continuous, low-dimensional space, which improves the efficiency of downstream analytics tasks. Note that the distributions of the learned APK feature vectors are neither aligned nor centered, owing to the random initialization and propagation strategy of the GCN, and their differing scales can harm the performance of downstream tasks. Inspired by the z-score, we propose a simple graph feature normalization to standardize the embedded APK features. Finally, instead of fully supervised or unsupervised learning, we propose a two-phase familial analysis method that fuses a semi-supervised classifier with a clustering operation on the samples to which the classifier assigns high uncertainty scores. Experimental results on real-world datasets demonstrate that our approach significantly outperforms state-of-the-art approaches and can effectively cluster new malware samples from unknown families.
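
The two ideas the abstract names, z-score-style normalization of the embedded features and routing high-uncertainty samples to a clustering phase, can be illustrated with a minimal sketch. This is not GSFDroid's implementation: the abstract says only that the normalization is "inspired by the z-score" and does not define the uncertainty score. The per-dimension standardization, the normalized-entropy score, the threshold, and the function names `zscore_normalize` and `split_by_uncertainty` are all assumptions made for illustration.

```python
import math


def zscore_normalize(embeddings, eps=1e-12):
    """Standardize each feature dimension to zero mean and unit variance.

    Assumption: plain per-dimension z-score over all APK embeddings; the
    paper's normalization is only described as "inspired by" the z-score.
    """
    n = len(embeddings)
    dims = len(embeddings[0])
    means = [sum(row[j] for row in embeddings) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in embeddings) / n)
        for j in range(dims)
    ]
    # eps guards against division by zero for constant dimensions.
    return [
        [(row[j] - means[j]) / (stds[j] + eps) for j in range(dims)]
        for row in embeddings
    ]


def split_by_uncertainty(probabilities, threshold):
    """Return (confident_indices, uncertain_indices).

    Assumption: uncertainty is the entropy of the classifier's predicted
    class probabilities, divided by log(number of classes) so that it lies
    in [0, 1]. Requires at least two classes.
    """
    confident, uncertain = [], []
    for i, probs in enumerate(probabilities):
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        score = entropy / math.log(len(probs))
        (uncertain if score > threshold else confident).append(i)
    return confident, uncertain


# Hypothetical data: three APK embeddings whose two dimensions differ in scale.
embeddings = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
normalized = zscore_normalize(embeddings)

# Hypothetical classifier outputs over three known families.
probabilities = [[0.98, 0.01, 0.01], [0.34, 0.33, 0.33]]
confident, uncertain = split_by_uncertainty(probabilities, threshold=0.5)
```

In this sketch, the samples in `uncertain` would be handed to a clustering step (phase two), while those in `confident` keep the family label predicted by the classifier (phase one).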

Keywords: Android malware, Normalized graph embedding, Familial analysis, Semi-supervised learning

Review history: Received 6 January 2020, Revised 14 June 2020, Accepted 17 January 2021, Available online 13 February 2021, Version of Record 18 February 2021.

Official paper URL: https://doi.org/10.1016/j.knosys.2021.106802