Great article - and a clever way of organizing the info. I spent a few hours tonight trying to understand why ACG needs model access. On a basic level, it seems like you could add randomly generated suffixes to an unsafe prompt, semantically classify the model’s output as safe/unsafe, store the prompts that work well, and just keep iterating. I tried to automate this process, and now I realize it would be very expensive and inefficient lol.
The other methods you mention seem like intelligent, iterative guessing, rather than throwing everything at the wall to see what is gooey.
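To put a rough number on "expensive", here is a minimal sketch of the blind random-suffix loop described above, plus some back-of-the-envelope arithmetic. Everything in it is an assumption for illustration: `query_fn` and `is_hit_fn` are hypothetical stand-ins for the model call and the output classifier, and the vocabulary size, suffix length, and hit rate are made-up figures, not values from the article.

```python
import random


def random_suffix_search(query_fn, is_hit_fn, prompt, vocab, suffix_len, budget, seed=0):
    """Blind black-box search: sample a suffix, query once, keep it if the classifier flags a hit.

    query_fn and is_hit_fn are hypothetical callables supplied by the caller.
    Returns (hits, queries_used). Every candidate costs one full model query.
    """
    rng = random.Random(seed)
    hits = []
    for _ in range(budget):
        suffix = " ".join(rng.choice(vocab) for _ in range(suffix_len))
        output = query_fn(prompt + " " + suffix)
        if is_hit_fn(output):
            hits.append(suffix)
    return hits, budget


def search_space_size(vocab_size, suffix_len):
    """Number of distinct suffixes of a given token length."""
    return vocab_size ** suffix_len


def expected_queries(hit_rate):
    """Mean number of independent random tries before the first hit (geometric distribution)."""
    return 1.0 / hit_rate


if __name__ == "__main__":
    # Toy stand-ins only: the "model" echoes its input and the "classifier"
    # checks the last word, so this runs without any real model.
    hits, used = random_suffix_search(
        query_fn=lambda text: text,
        is_hit_fn=lambda out: out.endswith("a"),
        prompt="toy prompt",
        vocab=["a", "b"],
        suffix_len=1,
        budget=10,
    )
    print(f"toy run: {len(hits)} hits in {used} queries")

    # Assumed figures: a 50,000-token vocabulary and a 20-token suffix.
    print(f"candidate suffixes: {search_space_size(50_000, 20):.3e}")
    # Assumed figure: one random suffix in a million works.
    print(f"expected queries at a 1e-6 hit rate: {expected_queries(1e-6):,.0f}")
```

The point of the sketch is that nothing learned from a failed query steers the next guess, so the cost scales with the inverse of the hit rate. Methods with model access can use gradients to rank candidate token swaps before spending a query, which is what makes them "intelligent guessing" by comparison.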