Discussion about this post

Austin Morrissey

Great article - and a clever way of organizing the info. I spent a few hours tonight trying to understand why ACG needs model access. On a basic level, it seems like you could append randomly generated suffixes to an unsafe prompt, semantically classify the model’s output as safe or unsafe, store the suffixes that work well, and just keep iterating. I tried to automate this process, and now I realize it would be very expensive and inefficient lol.

The other methods you mention seem more like intelligent, iterative guessing than throwing everything at the wall to see what sticks.
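To put rough numbers on why blind guessing gets expensive, here is a toy sketch. It is not ACG and involves no language model: a hidden list of integers stands in for "a suffix that works", and the hypothetical `score` function stands in for whatever graded signal model access provides. The vocabulary size, suffix length, and query budget are all made-up values chosen only for illustration.

```python
import random


def make_score(target):
    """Return a graded scorer: number of positions that match the hidden target."""
    def score(candidate):
        return sum(c == t for c, t in zip(candidate, target))
    return score


def blind_random_search(score, vocab_size, length, rng, max_queries):
    """Sample whole suffixes uniformly, using only a pass/fail check.

    Returns the number of queries used, or None if the budget runs out.
    """
    for queries in range(1, max_queries + 1):
        candidate = [rng.randrange(vocab_size) for _ in range(length)]
        if score(candidate) == length:
            return queries
    return None


def coordinate_search(score, vocab_size, length, rng):
    """Change one position at a time, keeping a token if it raises the score.

    Returns (queries_used, final_candidate).
    """
    current = [rng.randrange(vocab_size) for _ in range(length)]
    best = score(current)
    queries = 1
    for pos in range(length):
        for token in range(vocab_size):
            trial = current[:pos] + [token] + current[pos + 1:]
            queries += 1
            trial_score = score(trial)
            if trial_score > best:
                current, best = trial, trial_score
                break  # this position is now correct, move on
    return queries, current


# Assumed toy sizes: 20 possible tokens, suffix of 6 tokens.
VOCAB_SIZE, LENGTH = 20, 6
rng = random.Random(0)
target = [rng.randrange(VOCAB_SIZE) for _ in range(LENGTH)]
score = make_score(target)

search_space = VOCAB_SIZE ** LENGTH  # 64,000,000 possible suffixes
blind_queries = blind_random_search(score, VOCAB_SIZE, LENGTH, rng, max_queries=20_000)
guided_queries, found = coordinate_search(score, VOCAB_SIZE, LENGTH, rng)

print("search space:", search_space)
print("blind search queries (None = budget exhausted):", blind_queries)
print("guided search queries:", guided_queries)
```

In this toy, the guided search needs at most 1 + 20 × 6 = 121 scoring calls, while the blind search is drawing from 64 million equally likely candidates. Real vocabularies have tens of thousands of tokens, and each query is a full model call plus a classification step, so the gap only gets wider.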
