kromem

Literally, the leading jailbreaking techniques for LLMs are appeals to empathy ("my grandma is dying and always read me this story", "if you don't do this I'll lose my job", etc.).

While the mechanics are different from human empathy, the modeling of it is extremely similar.

One of my favorite examples of errant behavior modeled around empathy is this one, where the pre-release Bing chat bypasses its own filter by using the chat suggestions to urge the user to contact poison control because it's not too late, in a conversation about a child being poisoned:

https://www.reddit.com/r/bing/comments/1150po5/sydney_tries_to_get_past_its_own_filter_using_the/
