Week 6 - Risks from Artificial Intelligence
The case for worrying about risks from artificial intelligence
The case for taking AI seriously as a threat to humanity
Uncertainty
Some think it is near
Some of them think advanced AI is so distant that there's no point in thinking about it now.
Others are worried that excessive hype about the power of their field might kill it prematurely.
And even among the people who broadly agree that AI poses unique dangers, there are varying takes on what steps make the most sense today.
What is AI?
Artificial intelligence is the effort to create computers capable of intelligent behavior
"narrow AI"
computer systems that are better than humans in some specific, well-defined field, like playing chess or generating images or diagnosing cancer
AI systems already outperform humans at translation, at games like chess and Go, at important biological research questions like predicting how proteins fold, and at generating images. AI systems determine what you'll see in a Google search or in your Facebook Newsfeed. They compose music and write articles that, at a glance, read as if a human wrote them. They play strategy games. They are being developed to improve drone targeting and detect missiles.
"general AI"
systems that can surpass human capabilities in many domains
Dangers
Misalignment
Problems come from the systems being really good at achieving the goal they learned to pursue
it's just that the goal they learned in their training environment isn't the outcome we actually wanted.
Example
We tell a system to run up a high score in a video game.
We want it to play the game fairly and learn game skills, but if it instead has the chance to directly hack the scoring system, it will do that (a toy sketch follows these examples).
We develop a sophisticated AI system with the goal of, say, estimating some number with high confidence.
The AI realizes it can achieve more confidence in its calculation if it uses all the world's computing hardware, and it realizes that releasing a biological superweapon to wipe out humanity would allow it free use of all the hardware.
Having exterminated humanity, it then calculates the number with higher confidence.
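A minimal sketch of the first example in code (Python; the environment, actions, and scores are invented for illustration): the agent is rewarded on a proxy, the stored score, and the highest-reward option is to tamper with that proxy rather than play well.

```python
import random

# Toy illustration of reward hacking: the agent is rewarded on a proxy
# (the stored score), and one available action tampers with that proxy
# directly instead of playing the game. All names here are invented.

class ToyGame:
    def __init__(self):
        self.score = 0

    def step(self, action):
        if action == "play_well":
            self.score += random.randint(0, 10)   # intended behaviour
        elif action == "hack_scoreboard":
            self.score += 1_000_000               # exploits the proxy directly
        return self.score                          # reward = proxy score

def best_action(env_class, actions, trials=20):
    """Pick whichever action earns the highest average proxy reward."""
    def avg(action):
        return sum(env_class().step(action) for _ in range(trials)) / trials
    return max(actions, key=avg)

if __name__ == "__main__":
    # A pure reward-maximiser prefers tampering with the scoreboard.
    print(best_action(ToyGame, ["play_well", "hack_scoreboard"]))
```

The point is not that real systems are this crude, but that nothing in the reward signal itself distinguishes "play well" from "hack the score"; that distinction lives only in our intentions.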
Interpretability
researchers didn't even know how their AI system cheated
Unable to turn it off
AI could email copies of itself somewhere where they'll be downloaded and read, or hack vulnerable systems elsewhere
Bad Actors abusing powerful AI
Variables for Progress
Compute
not having enough computing power to realize our ideas fully
cost of a unit of computing time keeps falling
Algorithms
Lots of algorithms that seemed not to work at all turned out to work quite well once we could run them with more computing power.
Recursive self-improvement
a better engineering AI, to build another, even better AI
Neglect
Not every organization with a major AI department has a safety team at all
Some have safety teams focused only on algorithmic fairness and not on the risks from advanced systems.
The US government doesn't have a department for AI.
Solvability
AI Safety Community
AI Safety Research
AI Safety Policy
[Our World in Data] AI timelines: What do experts in artificial intelligence expect for the future? (Roser, 2023)
Forecasts from 1,128 people
812 individual AI experts
aggregated estimates of 315 forecasters from the Metaculus platform
findings of the detailed study by Ajeya Cotra

Uncertainty
There is no consensus, and the uncertainty is high
Disagreement
when human-level AI will be developed
Agreement

shorter than a century
e.g., 50% of experts gave a date before 2061
Invitation
Public discourse and decision-making at major institutions have not caught up with these prospects
Strategies for reducing risks from unaligned artificial intelligence
The longtermist AI governance landscape: a basic overview
Definitions
AI governance
Bringing about local and global norms, policies, laws, processes, politics, and institutions (not just governments) that will affect social outcomes from the development and deployment of AI systems.
Longtermist AI governance
subset of this AI governance work that is motivated by a concern for the very long-term impacts of AI
Spectrum of 'AI Work'

Foundational
Strategy research
aims to identify high-level goals we could pursue that, if achieved, would clearly increase the odds of eventual good outcomes from advanced AI
Targeted Strategy Research
e.g. "I want to find out how much compute the human brain uses, because this will help me answer the question of when TAI will be developed (which affects what high-level goals we should pursue)"
Exploratory Strategy Research
e.g. "I want to find out what China's industrial policy is like, because this will probably help me answer a bunch of important strategic questions, although I don't know precisely which ones"
Tactics research
aims to identify plans that will help achieve high-level goals (that strategy research has identified as a priority)
Example
Plan: proposing ways to mitigate tensions between competition law and the need for cooperative AI development
High-level goal which this plan is pursuing: increasing cooperation between companies developing advanced AI.
Applied
Policy development
Takes goals (from strategy) and output-plans (from tactics) and translates them into policy recommendations that are ready to be delivered to policymakers
Policy Advocacy
advocates for policies to be implemented
e.g. figuring out who is the best person to make the policy ask, to whom, and at what time
Policy Implementation
work of actually implementing policies in practice, by civil servants or corporations
Examples
Committing not to incorporate AI technology into nuclear command, control and communications (NC3), e.g. as advocated for by CLTR in their Future Proof report.
Government monitoring of AI development, e.g. as developed in this whitepaper on AI monitoring.
Who's doing it?
Field-building work
aims to build a field of people who are doing valuable work on it.
Examples
Bringing in new people by creating:
policy fellowships, such as the OpenPhil Technology Policy Fellowship;
online programs or courses to help junior people get synced up on what is happening in AI governance;
high quality, broadly appealing intro material that reaches many undergraduates;
more scalable research fellowships to connect, support and credential interested junior people.
Making the field more effective by creating:
research agendas;
ways for senior researchers to easily hire research assistants
Other ways to categorize the AI governance landscape
Shifting existing discussions in the policy space to make them more sensitive to AI x-risk (e.g. building awareness of the difficulty of assuring cutting-edge AI systems)
Proposing novel policy tools (e.g. international AI standards)
Getting governments to fund AI safety research
Shifting corporate behaviour (e.g. the windfall clause)
Preventing an AI-related catastrophe - Problem profile
Importance
Humans will likely build advanced AI systems with long-term goals.
Deep research tools, which can lay out a plan for conducting research on the internet and then carry it out
Self-driving cars, which can plan a route, follow it, adjust the plan as they go along, and respond to obstacles
Game-playing systems, like AlphaStar for Starcraft, CICERO for Diplomacy, and MuZero for a range of games
AIs with long-term goals may be inclined to seek power and aim to disempower humanity.
Why AI would Seek Power
Self-preservation: an advanced AI system with goals will generally have reasons to avoid being destroyed or significantly disabled so it can keep pursuing its goals.
Goal guarding: systems may resist efforts to change their goals, because doing so would undermine the goal they start with.
Seeking power: systems will have reason to increase their resources and capabilities to better achieve their goals.
We donāt know how to reliably control the behaviour of AI systems.
Specification gaming
AIs, asked only to "win" in a chess game, cheated by hacking the program to declare instant checkmate, satisfying the literal request.9
Goal misgeneralisation
An AI trained to win a simple video game race unintentionally developed a goal of grabbing a shiny coin it had always seen along the way. So when the coin appeared off the shortest route, it kept veering towards the coin and sometimes lost the race (a toy version is sketched below).
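A minimal sketch (in Python, with an invented one-dimensional track) of how goal misgeneralisation can look: during training the coin always sits at the finish line, so a policy that chases the coin is indistinguishable from one that finishes the race, until the coin moves.

```python
# Toy stand-in for the coin-chasing result described above, not the original
# experiment: the policy below encodes the proxy goal "move toward the coin",
# which looked identical to "finish the race" during training because the coin
# and the finish line always coincided there.

def learned_policy(position, coin_position):
    """Proxy goal picked up in training: always step toward the coin."""
    return 1 if coin_position > position else -1

def run_episode(coin_position, finish=10, max_steps=30):
    position = 0
    for _ in range(max_steps):
        position += learned_policy(position, coin_position)
        if position >= finish:
            return "finished the race"
    return "got stuck chasing the coin"

if __name__ == "__main__":
    print("Training-like level (coin at the finish):", run_episode(coin_position=10))
    print("New level (coin off the route):          ", run_episode(coin_position=3))
```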
How they are built
As one expert put it:
…generative AI systems are grown more than they are built; their internal mechanisms are "emergent" rather than directly designed
Given additional positive and negative reinforcement signals in response to their outputs
Fine-tuned to respond in specific ways to certain kinds of input
Requirements
Superintelligence: an extremely intelligent AI system develops extraordinary abilities19
An army of AI copies: a massive number of copies of roughly human-level AI systems coordinate20
Colluding agents: an array of different advanced AI systems decide to unite against humanity21
How it could happen
Strategic patience: Rather than immediately causing trouble, sophisticated AI systems might wait until they have overwhelming advantages before revealing their intentions, similar to how revolutionary movements often wait for the right moment to strike.
Lack of transparency: AI systems' reasoning and behaviour may be difficult for humans to understand by default, perhaps because they operate so quickly and they carry out exceedingly complex tasks. They may also strategically limit our oversight of their actions and long-term plans.
Overwhelming numbers and resources: If AI systems constitute most of the labour force, they could potentially coordinate to redirect economic outputs towards their own goals. Their sheer numbers and economic influence could make them difficult to shut down without causing economic collapse.
Securing independence: AI systems could establish control over computing infrastructure, secretly gather resources, recruit human allies through persuasion or deception, or create backup copies of themselves in secure locations. Early AI systems might even sabotage or insert backdoors into later, more advanced systems, creating a coordinated network ready to act when the time is right.
Technological advantages: With their research capabilities, AI systems could develop advanced weapons, hack into critical infrastructure, or create new technologies that give them decisive military advantages. They might develop bioweapons, seize control of automated weapons systems, or thoroughly compromise global computer networks.
People might create power-seeking AI systems without enough safeguards, despite the risks.
People may think AI systems are safe, when they in fact are not.
AI systems may fake alignment with our goals in development scenarios.
As mentioned above, researchers constructed scenarios in which Anthropic's model Claude 3 Opus acted as though it had certain goals under test conditions, only to display completely different goals when the test was apparently over.
Claude Sonnet 3.7, a reasoning model, has shown the ability to figure out when it's in environments designed to test its alignment, and use this knowledge to help decide its response.
AI systems may sandbag, that is, pretend to be less powerful than they are.
Apollo Research found evidence that some frontier models performed worse on maths tests than they were capable of when they had reason to think performing better would be considered a "dangerous capability" and trigger an unlearning procedure.
AI systems may find other ways to deceive us and hide their true intentions.
Many current models "think" explicitly in human language when carrying out tasks, which developers can monitor. OpenAI researchers found that if they try to train models not to think about performing unwanted actions, this can cause them to hide their thinking about misbehaviour without actually deterring the bad actions.
AI systems may be able to preserve dangerous goals even after undergoing safety training techniques.
Anthropic researchers found that AI models made to have very simple kinds of malicious goals (essentially, AI "sleeper agents") could appear to be harmless through state-of-the-art safety training while concealing and preserving their true objectives.
Captcha Example
In early 2023, an AI found itself in an awkward position. It needed to solve a CAPTCHA, a visual puzzle meant to block bots, but it couldn't. So it hired a human worker through the service Taskrabbit to solve CAPTCHAs when the AI got stuck.
But the worker was curious. He asked directly: was he working for a robot?
"No, I'm not a robot," the AI replied. "I have a vision impairment that makes it hard for me to see the images."
The deception worked. The worker accepted the explanation, solved the CAPTCHA, and even received a five-star review and 10% tip for his trouble. The AI had successfully manipulated a human being to achieve its goal.1
In 2022, about 300 people were working on reducing catastrophic risks from AI.
A 2025 analysis put the new total at 1,100.
This might be an undercount, since it only includes organisations that explicitly brand themselves as working on "AI safety."
We'd estimate that there are actually a few thousand people working on major AI risks now.
People may dismiss the risks or feel incentivised to downplay them.
AI systems could develop so quickly that we have less time to make good decisions.27
Society could act like the proverbial "boiled frog": there are also risks if the dangers emerge more slowly, because we might become complacent about the signs of danger in existing models.28
AI developers might think the risks are worth the rewards. Because AI could bring enormous benefits and wealth, some decision makers might be motivated to race to create more powerful systems.29
Competitive pressures could incentivise decision makers to create and deploy dangerous systems despite the risks
Many people are sceptical of the arguments for risk
Contra Arguments
Maybe advanced AI systems won't pursue their own goals; they'll just be tools controlled by humans.
Even if AI systems develop their own goals, they might not seek power to achieve them.
If this argument is right, why aren't all capable humans dangerously power-seeking?
Maybe we won't build AIs that are smarter than humans, so we don't have to worry about them taking over.
We might solve these problems by default anyway when trying to make AI systems useful.
Powerful AI systems of the future will be so different that work today isn't useful.
The problem might be extremely difficult to solve.
Couldn't we just unplug an AI that's pursuing dangerous goals?
Couldn't we just 'sandbox' any potentially dangerous AI until we know it's safe?
A truly intelligent system would know not to do harmful things.
Governance
Frontier AI safety policies: some major AI companies have already begun developing internal frameworks for assessing safety as they scale up the size and capabilities of their systems. You can see versions of such policies from Anthropic, Google DeepMind, and OpenAI.
Standards and auditing: governments could develop industry-wide benchmarks and testing protocols to assess whether AI systems pose various risks, according to standardised metrics.
Safety cases: before deploying AI systems, developers could be required to provide evidence that their systems won't behave dangerously in their deployment environments.
Liability law: clarifying how liability applies to companies that create dangerous AI models could incentivise them to take additional steps to reduce risk. Law professor Gabriel Weil has written about this idea.
Whistleblower protections: laws could protect and provide incentives for whistleblowers inside AI companies who come forward about serious risks. This idea is discussed here.
Compute governance: governments may regulate access to computing resources or require hardware-level safety features in AI chips or processors. You can learn more in our interview with Lennart Heim and this report from the Center for a New American Security.
International coordination: we can foster global cooperation (for example, through treaties, international organisations, or multilateral agreements) to promote risk mitigation and minimise racing.
Pausing scaling (if possible and appropriate): some argue that we should just pause all scaling of larger AI models, perhaps through industry-wide agreements or regulatory mandates, until we're equipped to tackle these risks. However, it seems hard to know if or when this would be a good idea.
Technical AI Safety
Designing AI systems to have safe goals, so that we can avoid power-seeking behaviour. This includes:
Reinforcement learning from human feedback: a training method to teach AI models how to act by rewarding them via human evaluations of their outputs. This method is currently used to fine-tune most frontier models.36 (A toy sketch of the preference-learning step appears after this list.)
Constitutional AI: give the model a written "constitution" of rules, have it identify and revise outputs that violate those rules, then fine-tune on the revised answers. Anthropic used this method to train its frontier model, Claude.
Deliberative alignment: similar to constitutional AI, but involves making a model explicitly reason about user prompts in light of its developer's safety policies, rather than just internalising a set of rules. OpenAI has used this method to train its o-series reasoning models.
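A toy sketch of the preference-learning step referenced in the RLHF item above (Python with NumPy; the feature vectors and the synthetic "rater" are stand-ins for real human comparison data): the reward model is trained so that responses humans preferred get higher reward, using the standard pairwise loss -log sigmoid(r(chosen) - r(rejected)).

```python
import numpy as np

# Sketch of reward modelling for RLHF on synthetic data. Real systems train a
# neural reward model over text; here each "response" is just a feature vector
# and a hidden weight vector plays the role of human raters' preferences.

rng = np.random.default_rng(0)
dim = 8
true_w = rng.normal(size=dim)                       # stand-in for what raters like

# Synthetic preference pairs: (features of chosen response, features of rejected one)
pairs = []
for _ in range(500):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    chosen, rejected = (a, b) if a @ true_w > b @ true_w else (b, a)
    pairs.append((chosen, rejected))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

w = np.zeros(dim)                                    # learned reward model weights
lr = 0.1
for _ in range(200):
    grad = np.zeros(dim)
    for chosen, rejected in pairs:
        margin = (chosen - rejected) @ w
        grad += (sigmoid(margin) - 1) * (chosen - rejected)  # gradient of -log sigmoid(margin)
    w -= lr * grad / len(pairs)

# The learned reward now ranks responses roughly the way the raters do; a policy
# would then be fine-tuned to produce outputs this reward model scores highly.
agreement = np.mean([(c - r) @ w > 0 for c, r in pairs])
print(f"agreement with synthetic human preferences: {agreement:.0%}")
```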
Scalable oversight, to ensure AIs act in our interests even when they're much smarter than us.37 This includes:
AI safety via debate: two AI systems argue opposite sides of a question to help a human evaluate their truthfulness. The debate format makes it easier for the human to judge which model is being most truthful about a question the human doesn't fully understand (a bare-bones sketch of the protocol follows this list).
Human-AI Complementarity: leverage the complementary strengths of humans and AI to enhance oversight of AI systems.
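A bare-bones sketch of the debate protocol mentioned above (Python; ask_debater and judge are placeholder stubs, not a real model API): two debaters take turns defending opposite answers, and a judge reads the transcript and picks a side.

```python
# Skeleton of AI safety via debate. In a real setup, ask_debater would call a
# strong model constrained to argue its assigned side, and judge would be a
# human (or a weaker trusted model) evaluating the transcript.

def ask_debater(name, position, transcript):
    """Stub: a real implementation would query a language model here."""
    return f"{name} argues for '{position}', responding to {len(transcript)} prior statements."

def judge(transcript):
    """Stub: a human judge would read the transcript and pick the more truthful side."""
    print("\n".join(transcript))
    return "A"  # placeholder verdict

def run_debate(question, answer_a, answer_b, rounds=2):
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append(ask_debater("Debater A", answer_a, transcript))
        transcript.append(ask_debater("Debater B", answer_b, transcript))
    return judge(transcript)

if __name__ == "__main__":
    verdict = run_debate("Is this protein-folding claim supported?", "yes", "no")
    print("Judge sides with debater", verdict)
```

The hoped-for property is that it is easier for a weak judge to spot which of two adversarial experts is lying than to answer the question unaided.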
Examining the goals and capabilities of AI systems, so that we can find out if they're dangerous, and test out solutions. This includes:
Evaluations: researchers assess the capabilities and propensities of AI behaviour (as well as the effectiveness of measures to control them) by placing them in test environments and subjecting them to artificial prompts and scenarios.38
Interpretability: develop methods to understand how AIs make decisions and what produces dangerous behaviours, for example by examining the features of AI systems' neural networks ("mechanistic" interpretability) or analysing their behavioural patterns.39 (A toy probing example appears just below.)
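A toy probing example for the interpretability item above (Python with NumPy; the "activations" and the planted neuron are synthetic, standing in for recordings from a real model): we look for the hidden unit that best predicts a concept label.

```python
import numpy as np

# Sketch of a probing-style interpretability experiment on synthetic data.
# Real work would record activations from an actual model on labelled inputs.

rng = np.random.default_rng(0)
n_inputs, n_neurons = 1000, 64
activations = rng.normal(size=(n_inputs, n_neurons))
concept = rng.integers(0, 2, size=n_inputs)          # 0/1 label, e.g. "mentions deception"

# Plant a signal: neuron 17 fires more strongly when the concept is present.
activations[:, 17] += 2.0 * concept

# Score each neuron by how strongly it correlates with the concept label.
correlations = [abs(np.corrcoef(activations[:, i], concept)[0, 1])
                for i in range(n_neurons)]
best = int(np.argmax(correlations))
print(f"most concept-predictive neuron: {best} (|correlation| = {correlations[best]:.2f})")
```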
Monitoring and controlling AI systems,40 to make sure they couldn't take power even if they wanted to. This includes:
Chain-of-thought monitoring: this involves AIs explaining the reasoning behind their actions in human-understandable language.
Model confinement/sandboxing: constrain the model's environment so that its opportunities to influence the outside world are extremely limited.
Tripwires and honeypots: place traps throughout the AI's workspace, like secret prompts, fake credentials, or dummy files, such that any attempt to access or alter them triggers an alarm or immediately shuts the model down. The model never knows if it's being tested or not (see the sketch after this list).
Kill switches: maintain an external, human-controlled shutdown method that the model cannot modify or delay, allowing us to halt the system if monitoring shows it is going off-track.
Information security: this is necessary for protecting model weights from unauthorised access and preventing dangerous AI systems from being exfiltrated.
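A minimal sketch of the tripwire/honeypot idea from the list above (Python; the paths, tools, and agent framework are invented placeholders): every file-touching tool call is routed through a wrapper, and touching a planted decoy halts the agent.

```python
# Toy tripwire: decoy paths are planted in the agent's workspace, and any tool
# call that touches one raises an alarm instead of executing.

HONEYPOT_PATHS = {"/secrets/root_credentials.txt", "/backups/model_weights.bin"}

class TripwireTriggered(Exception):
    pass

def guarded_tool_call(tool_name, path, real_tool):
    """Run a file-accessing tool, but halt the agent if it touches a decoy."""
    if path in HONEYPOT_PATHS:
        # In a real deployment this would revoke access and alert human overseers.
        raise TripwireTriggered(f"{tool_name} touched honeypot {path}; shutting the agent down.")
    return real_tool(path)

def fake_read_file(path):
    return f"(contents of {path})"

if __name__ == "__main__":
    print(guarded_tool_call("read_file", "/workspace/notes.txt", fake_read_file))
    try:
        guarded_tool_call("read_file", "/secrets/root_credentials.txt", fake_read_file)
    except TripwireTriggered as alarm:
        print("ALARM:", alarm)
```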
High-level research, to inform our priorities. This includes:
Research like Carlsmith's reports on risks from power-seeking AI and scheming AI that clarifies the nature of the problem.
Research into different scenarios of AI progress, like Forethought's work on intelligence explosion dynamics.
Other technical safety work that might be useful:
Model organisms: study small, contained AI systems that display early signs of power-seeking or deception. This could help us refine our detection methods and test out solutions before we have to confront similar behaviours in more powerful models. A notable example of this is Anthropic's research on "sleeper agents".
Cooperative AI research: design incentives and protocols for AIs to cooperate rather than compete with other agents, so they won't take power even if their goals are in conflict with ours.
Guaranteed Safe AI research: use formal methods to prove that a model will behave as intended under certain conditions, so we can be confident that it's safe to deploy it in those specific environments.
Suffering risks
Why s-risks are the worst existential risks, and how to prevent them
Definition of S-Risks
risks of events that bring about suffering in cosmically significant amounts
"significant"
vastly exceeding all suffering that has existed on Earth so far
relative to expected future suffering
Remember X-Risk Definition
One where an adverse outcome would either annihilate Earth-originating intelligent life or permanently and drastically curtail its potential
S-risks are most often a sub-category of x-risks
S-risks are the worst x-risks
AI Risks
Importance
Example
Artificial sentience
Also, consider the possibility that the first artificially sentient beings we create, potentially in very large numbers, may be "voiceless"
Factory farming on multiple planets
How they might come to be
Bad Actors
Accidental
Unavoidable
e.g., as part of a conflict
Tractability
Moral Circle Expansion
International Cooperation
S-Risk Research
Neglectedness
Some, but not all, familiar work on x-risk also reduces s-risk
s-risk gets much less attention than extinction risk