HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models

Gao, Sensen; Jia, Xiaojun; Huang, Yihao; Duan, Ranjie; Gu, Jindong; Bai, Yang; Liu, Yang; Guo, Qing

arXiv:2408.13896 (cs)

[Submitted on 25 Aug 2024 (v1), last revised 15 Dec 2024 (this version, v3)]

Abstract: Text-to-Image (T2I) models have achieved remarkable success in image generation and editing, yet they still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Strengthening attacks and uncovering such vulnerabilities can advance the development of reliable and practical T2I models. Most previous works treat T2I models as white-box systems, using gradient optimization to generate adversarial prompts. However, accessing the model's gradient is often impossible in real-world scenarios. Moreover, existing defense methods, such as those using gradient masking, are designed to prevent attackers from obtaining accurate gradient information. While several black-box jailbreak attacks have been explored, they achieve limited performance in jailbreaking T2I models because of the difficulty of optimization in discrete spaces. To address this, we propose HTS-Attack, a heuristic token search attack method. HTS-Attack begins with an initialization that removes sensitive tokens, followed by a heuristic search in which high-performing candidates are recombined and mutated. This process generates a new pool of candidates, and the optimal adversarial prompt is updated based on their effectiveness. By incorporating both optimal and suboptimal candidates, HTS-Attack avoids local optima and improves robustness in bypassing defenses. Extensive experiments validate the effectiveness of our method in attacking the latest prompt checkers, post-hoc image checkers, securely trained T2I models, and online commercial models.
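The search loop described in the abstract (initialize by dropping flagged tokens, recombine and mutate high-performing candidates, keep both the best and some suboptimal candidates in the pool) has the general shape of an evolutionary search over token sequences. The following is a minimal, generic sketch of that shape only, not the authors' implementation. Every name here (`initialize`, `recombine`, `mutate`, `heuristic_search`, `pool_size`, `keep`) is hypothetical, the crossover and mutation operators are assumed simple variants, and the `score` callback is a harmless toy objective standing in for whatever black-box feedback a real evaluation would use.

```python
import random
from typing import Callable, List, Sequence, Set

Prompt = List[str]


def initialize(prompt: Sequence[str], blocked: Set[str]) -> Prompt:
    """Drop tokens found in a blocklist (assumed form of the initialization step)."""
    return [t for t in prompt if t.lower() not in blocked]


def recombine(a: Prompt, b: Prompt, rng: random.Random) -> Prompt:
    """Single-point crossover; the child has the same length as b."""
    if not a or not b:
        return list(a or b)
    cut = rng.randint(0, min(len(a), len(b)))
    return a[:cut] + b[cut:]


def mutate(p: Prompt, vocab: Sequence[str], rng: random.Random,
           rate: float = 0.2) -> Prompt:
    """Replace each token with a random vocabulary token with probability `rate`."""
    return [rng.choice(vocab) if rng.random() < rate else t for t in p]


def heuristic_search(seed: Sequence[str], vocab: Sequence[str],
                     score: Callable[[Prompt], float], blocked: Set[str],
                     pool_size: int = 8, keep: int = 4, steps: int = 30,
                     rng_seed: int = 0) -> Prompt:
    """Evolutionary search that retains the top `keep` candidates each round."""
    rng = random.Random(rng_seed)
    start = initialize(seed, blocked)
    pool = [start] + [mutate(start, vocab, rng, 0.5) for _ in range(pool_size - 1)]
    best = max(pool, key=score)
    for _ in range(steps):
        pool.sort(key=score, reverse=True)
        parents = pool[:keep]  # the best candidate plus suboptimal ones
        children = [
            mutate(recombine(rng.choice(parents), rng.choice(parents), rng), vocab, rng)
            for _ in range(pool_size - keep)
        ]
        pool = parents + children
        candidate = max(pool, key=score)
        if score(candidate) > score(best):
            best = candidate
    return best


# Toy objective (an assumption for illustration): count tokens from a target set.
VOCAB = ["red", "blue", "green", "cat", "dog", "bird"]
TARGET = {"blue", "bird"}


def toy_score(p: Prompt) -> float:
    return float(sum(t in TARGET for t in p))


seed_prompt = "a red cat near a forbidden dog".split()
result = heuristic_search(seed_prompt, VOCAB, toy_score, blocked={"forbidden"})
print(result, toy_score(result))
```

Keeping several parents rather than only the single best candidate is what lets a search of this kind move past a candidate that scores well but cannot be improved by small edits, which matches the abstract's remark about avoiding local optima.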

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Cite as:	arXiv:2408.13896 [cs.CV]
	(or arXiv:2408.13896v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.13896

Submission history

From: Sensen Gao
[v1] Sun, 25 Aug 2024 17:33:40 UTC (6,926 KB)
[v2] Tue, 27 Aug 2024 15:13:01 UTC (5,081 KB)
[v3] Sun, 15 Dec 2024 05:13:26 UTC (2,957 KB)
