Extraction of human preferences (AI Safety Camp)

I was accepted as a participant in the AI Safety Camp in 2020. The camp was held remotely that year, since that was the year the coronavirus pandemic hit.

The team I was a part of worked on the extraction of human preferences. You can see more here. The main research question we wanted to answer was: “Are human preferences present in the hidden states of a reinforcement learning agent that was trained in an environment where a human is present?”
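
To give a feel for what “present in the hidden states” can mean in practice: one common way to test whether a network’s hidden states carry some piece of information is to train a simple probe on them. The sketch below is purely illustrative and not our team’s actual method or code; the data, labels, and dimensions are all made up.

```python
# Illustrative sketch only: probing an agent's hidden states for
# preference-related information with a linear classifier.
# The data below is random placeholder data, not real agent activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: one hidden-state vector per timestep, recorded while
# a trained RL agent acts in the environment, paired with a binary label
# encoding some aspect of the human's preferences at that timestep.
hidden_states = rng.normal(size=(5000, 128))       # (timesteps, hidden_dim)
preference_labels = rng.integers(0, 2, size=5000)  # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, preference_labels, test_size=0.2, random_state=0
)

# If the probe beats chance on held-out data, the hidden states carry
# at least linearly decodable information about the preference label.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.3f}")
```

With real activations instead of random noise, above-chance held-out accuracy would suggest the agent’s hidden states encode something about the human’s preferences, which is the kind of evidence the research question asks about.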

After working on the project for slightly over a year, we produced a blog post that made it to The Alignment Forum.

Although we didn’t get the results we were hoping for, I’m happy that I could contribute to the technical development of the field of AI safety, at least a little bit. I’m sad that I never met my teammates in real life, but glad to have been part of such a great team.
