PotentRegion4MalDetect: Advanced Features from Potential Malicious Regions for Malware Detection
Abstract
Malware developers exploit the fact that most detection models focus on the entire binary to extract the feature rather than on the regions of potential maliciousness. Therefore, they reverse engineer a benign binary and inject malicious code into it. This obfuscation technique circumvents the malware detection models and deceives the ML classifiers due to the prevalence of benign features compared to malicious features. However, extracting the features from the potential malicious regions enhances the accuracy and decreases false positives. Hence, we propose a novel model named PotentRegion4MalDetect that extracts features from the potential malicious regions. PotentRegion4MalDetect determines the nodes with potential maliciousness in the partially preprocessed Control Flow Graph (CFG) using the malicious strings given by StringSifter. Then, it extracts advanced features of the identified potential malicious regions alongside the features from the completely preprocessed CFG. The features extracted from the completely preprocessed CFG mitigate obfuscation techniques that attempt to disguise malicious content, such as suspicious strings. The experiments reveal that the PotentRegion4MalDetect requires fewer entries to save the features for all binaries than the model focusing on the entire binary, reducing memory overhead, faster computation, and lower storage requirements. These advanced features give an 8.13% increase in SHapley Additive exPlanations (SHAP) Absolute Mean and a 1.44% increase in SHAP Beeswarm value compared to those extracted from the entire binary. The advanced features outperform the features extracted from the entire binary by producing more than 99% accuracy, precision, recall, AUC, F1-score, and 0.064% FPR.