MalMixer: Few-Shot Malware Classification with Retrieval-Augmented Semi-Supervised Learning

Li, Jiliang; Zhang, Yifan; Huang, Yu; Leach, Kevin

Computer Science > Cryptography and Security

arXiv:2409.13213 (cs)

[Submitted on 20 Sep 2024 (v1), last revised 17 Apr 2025 (this version, v4)]

Abstract: Recent growth and proliferation of malware have tested practitioners' ability to promptly classify new samples according to malware families. In contrast to labor-intensive reverse engineering efforts, machine learning approaches have demonstrated increased speed and accuracy. However, most existing deep-learning malware family classifiers must be calibrated using a large number of samples that are painstakingly manually analyzed before training. Furthermore, as novel malware samples arise that are beyond the scope of the training set, additional reverse engineering effort must be employed to update the training set. The sheer volume of new samples found in the wild creates substantial pressure on practitioners' ability to reverse engineer enough malware to adequately train modern classifiers. In this paper, we present MalMixer, a malware family classifier using semi-supervised learning that achieves high accuracy with sparse training data. We present a domain-knowledge-aware data augmentation technique for malware feature representations, enhancing few-shot performance of semi-supervised malware family classification. We show that MalMixer achieves state-of-the-art performance in few-shot malware family classification settings. Our research confirms the feasibility and effectiveness of lightweight, domain-knowledge-aware data augmentation methods for malware features and shows the capabilities of similar semi-supervised classifiers in addressing malware classification issues.

Subjects:	Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Cite as:	arXiv:2409.13213 [cs.CR]
	(or arXiv:2409.13213v4 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2409.13213

Submission history

From: Yifan Zhang
[v1] Fri, 20 Sep 2024 04:50:49 UTC (14,668 KB)
[v2] Fri, 13 Dec 2024 06:42:48 UTC (14,677 KB)
[v3] Tue, 15 Apr 2025 07:56:42 UTC (14,699 KB)
[v4] Thu, 17 Apr 2025 17:51:35 UTC (14,699 KB)
