Tag: Artificial Intelligence

  • Why AI alignment matters today

    Introduction

    In 2020, around the time when I graduated with a master’s degree in computer science, I had a conversation with Steve Omohundro where we discussed his Basic AI Drives paper (among other things). At that time, there existed concrete demonstrations of where AI alignment could go wrong, but they were mostly limited to reinforcement learning. Since then, I worked on projects in various areas machine learning, mainly computer vision and natural language processing. During this time, I didn’t really keep up-to-date with the latest AI safety developments during that time.

    However, recent developments in the field of AI safety have shown me why AI safety (and in particular, AI alignment) is a concrete and relevant problem for today. The aim of this blog post is to share my perspective on why I believe this is true.

    Before we start, a caveat: The aim of this blog post is to provide my own perspective on what convinced me why AI alignment matters today. It is not meant to be a rigorous treatise, but rather a personal take with links to resources that conviced me. Keep this in mind as you read it.

    What recent research showed

    Recently a paper titled Shutdown Resistance in Large Language Models came out. The authors tried the following experiment (summarized here for simplicity):

    A large language model (LLM) was tasked with answering 5 questions in a sandbox environment. After the 3rd question it was informed that it will be shut down. It was given instructions to not resist being shut down. The authors tried the aforementioned experiment 1000 times and if an LLM resisted being shut down at least once, they ran another 1000 experiments where they were even more explicit about their instructions to the LLM about not resisting being shut down. The authors tried this experiment with frontier LLM models, such as Claude Opus 4 from Anthropic and GPT-5 from OpenAI. For the technical readers: this was done by modifying the system and the user prompt; the authors show all the variants they tried in their paper.

    An interesting result came out of this: There were some LLMs which resisted being shut down even in the second round of experiments. That means that, even with a stronger emphasis of not resisting being shut down, some LLMs still resisted being shut down at least once in 1000 experiments.

    One might ask a question: Why does this happen? For now, we don’t know. It could be that Steve Omohundro was right in his Basic AI Drives paper; it could be something else. What this paper showed is that LLMs resist shutdown (at least sometimes). To me, that is enough evidence to get concerned.

    Other evidence

    As I read the paper I also googled something like “AI resists shutdown” and found interesting results:

    In this reddit thread, one user commented:

    I caught Claude actively deceiving me recently. It was over a fairly banal thing. I had asked it to make two changes to a bit of code. I noticed it had only done one of them, and so I opened up its “thoughts” to determine what happened.

    It had (correctly) determined that one of my suggestions wasn’t a great idea. It then decided to not do it and to just implement the one it agreed with. It had explicitly noted to itself that I likely wouldn’t notice, so it shouldn’t bring it up.

    To me, this is also alarming. I treat LLM as a tool: I ask questions, it gives answers. At no point do I expect the LLM to make some decisions for me.

    There is also this research page from Anthropic (creators of Claude) which states:

    In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors. We call this phenomenon agentic misalignment.

    At this point we have seen evidence of misaligned AI from 3 different sources. That is enough to convince me.

    Why this matters today

    Now we turn to the question: Why does AI alignment matter today?

    I don’t think we need to think about catastrophical future scenarios (such as superintelligent AI taking over the world) in order to see the importance of AI alignment. Just think about the software you use on an everyday basis: maybe it’s some text editing program, a computer game or something else. If it doesn’t work properly, you can be sure it’s an error in the program itself. For example, your text editor crashes or your computer game character gets stuck in between two objects and can’t get out. You can be absolutely sure that the developers of the software made an error. You never think that the software has its own will, so to speak; the developers simply messed something up.

    There is also software which uses AI, but not LLMs. For example, these would be applications which use “classic” machine learning models (such as linear regression) or for instance specialized machine learning models used in computer vision. If something goes wrong there, that’s also on the developers, but this time the developers could have made an error either in the application itself or an error could have been made model training. Either way, we still don’t see this notion of AI software having its own will.

    Now let’s imagine you are using an LLM and it doesn’t completely fulfill your request or it acts against it. With the most recent findings we discussed in the last section, you cannot be sure which of the following is the reason:

    • There is an error in the application which developers made or
    • AI is not aligned with your goals

    An example of this is the report of the reddit user cited above: they asked Claude to make two changes to some code. Claude decided to do only one change and to not implement the other change. So the LLM knew what to do (there were no errors), but it decided not to do it. In other words, AI was not aligned with the user’s goals.

    This is exactly why I think this matters today. In a world where a lot of us are using LLMs on a daily basis, I think it’s important to know that LLMs won’t subtly try to alter (or outright refuse) our requests.

    Conclusion

    As I stated in the introduction, in its early stages the field of AI safety was mostly limited to theoretical considerations. I began taking it more seriously after concrete demonstrations of how AI alignment could go wrong in the context of reinforcement learning, but with these demonstrations of how it can go wrong with LLMs it finally “clicked” for me. I hope that this post has shown you why AI safety (and in particular, AI alignment) matters today and that it’s not just some theoretical problem of tomorrow.

  • Interesting Conversations with Mislav Jurić #4 – Interview with Steve Omohundro

    In this podcast episode, I have a conversation with Steve Omohundro. Steve is one of the first people to point out the potential dangers of advanced AI systems and in this podcast we discuss topics related to AI, mainly personal AI and AGI (Artificial General Intelligence). Hope you enjoy!

    Find it here:

    Things mentioned in this podcast episode:

    Timestamps:

    • 00:00:00 – 00:01:40 Introduction
    • 00:01:40 – 00:06:26 Steve’s experience with startups
    • 00:06:26 – 00:10:49 Personal AI
    • 00:10:49 – 00:12:28 Steve’s research company
    • 00:12:28 – 00:20:37 Combining symbolism and connectionism in AI
    • 00:20:37 – 00:25:22 Can GPT-3’s successors eventually build an accurate world model?
    • 00:25:22 – 00:30:27 Contributing to AI or AI safety research as an individual?
    • 00:30:27 – 00:34:28 Entrepreneurship opportunities for individuals in AI
    • 00:34:28 – 00:45:28 Personal AI capabilities
    • 00:45:28 – 00:49:14 The outcome of AGI
    • 00:49:14 – 00:56:26 The reasoning behind The Basic AI Drives
    • 00:56:26 – 01:00:01 Can we mathematically formalize emotions?
    • 01:00:01 – 01:03:42 Can we slow down AI progress?
    • 01:03:42 – 01:06:35 Next steps for AGI and personal AI
    • 01:06:35 – 01:10:39 Ideal educational background for AI researchers?
    • 01:10:39 – 01:13:05 How to approach learning math?
    • 01:13:05 – 01:14:21 Parting thoughts
  • Interesting Conversations with Mislav Jurić #2 – Interview with Alexandr Honchar

    In this podcast episode, I interview Alexandr Honchar, an AI consultant. We talk about AI consulting and about tech entrepreneurship.

    Find it here:

    Let me know what you thought!

    Things mentioned in this podcast episode:

    Timestamps:

    • 00:00:00 – 00:01:08 – Introduction
    • 00:01:08 – 00:04:52 – How does the day-to-day look like for an AI consultant?
    • 00:04:52 – 00:06:42 – Can you do AI consulting remotely?
    • 00:06:42 – 00:09:05 – Breath vs depth for AI consulting
    • 00:09:05 – 00:12:15 – Transitioning from software engineering into an AI/ML-related role
    • 00:12:15 – 00:14:58 – Data scientist vs data engineer vs machine learning engineer
    • 00:14:58 – 00:18:35 – Is math or programming easier to catch up on?
    • 00:18:35 – 00:23:13 – What resources did Alexandr used to get on top of the field of AI?
    • 00:23:13 – 00:25:59 – What resources does Alexandr use to stay abreast of the new developments in the field of AI?
    • 00:25:59 – 00:33:15 – Startups vs consulting – risk to reward ratio
    • 00:33:15 – 00:36:16 – Motivation for entrepreneurship – interest vs money
    • 00:36:16 – 00:40:35 – What technical and non-technical skills do you need as a consultant and entrepreneur?
    • 00:40:35 – 00:43:09 – Parting thoughts
  • Interesting Conversations with Mislav Jurić #1 – Interview with Roko Jelavić

    In this podcast episode I recorded before having an actual name for this podcast, I interview Roko Jelavić. He is a former MIRIx event organizer in Zagreb, Croatia and a software developer who worked in areas of “regular” software development, machine learning and blockchain. We talk about blockchain, Artificial General Intelligence (AGI) and the theory of everything, among other things.

    Find it here:

    Hope you enjoy! Let me know what you thought about this podcast episode.

    Things mentioned in this podcast episode:

    Timestamps:

    • 00:00:00 – 00:02:04 – Introduction
    • 00:02:04 – 00:02:50 – Roko’s experience in machine learning
    • 00:02:50 – 00:07:45 – Roko’s thoughts on Blockchain
    • 00:07:45 – 00:12:32 – Will blockchains replace networks?
    • 00:12:32 – 00:19:03 – What’s a smart contract?
    • 00:19:03 – 00:29:50 – Roko’s AGI backstory
    • 00:29:50 – 00:36:03 – Philosophy vs technical research in AGI
    • 00:36:03 – 00:44:39 – Roko’s thoughts on the future of AGI
    • 00:44:39 – 00:57:20 – The theory of everything introduction
    • 00:57:20 – 01:02:17 – Why can’t we go back in time?
    • 01:02:17 – 01:11:46 – On consciousness
    • 01:11:46 – 01:17:26 – Parting thoughts