Anthropics Announces New Initiative for Developing Comprehensive AI Model Evaluations

Anthropic is launching a new initiative to fund third-party organizations in developing robust evaluations for advanced AI model capabilities. This move aims to address the current limitations in AI evaluation and meet the growing demand for high-quality, safety-relevant assessments.

Our Highest Priority Focus Areas

1. AI Safety Level Assessments

Anthropic seeks evaluations that measure AI Safety Levels (ASLs) as defined in their [Responsible Scaling Policy](link to policy). These levels set the safety and security requirements for models with specific capabilities. Key areas include:

Cybersecurity

Evaluations that assess models’ capabilities in cyber operations, focusing on critical aspects like vulnerability discovery and exploit development.
Effective evaluations may resemble novel Capture The Flag (CTF) challenges.

Chemical, Biological, Radiological, and Nuclear (CBRN) Risks

Assessing models’ potential to enhance or create CBRN threats, with a focus on measuring real-world risks accurately.

Model Autonomy

Evaluations focusing on models’ capabilities for autonomous operation, AI research and development, advanced autonomous behaviors, and self-replication.

Other National Security Risks

Identifying and assessing complex emerging risks impacting national security, defense, and intelligence operations.

Social Manipulation

Measuring models’ potential to amplify persuasion-related threats such as disinformation and manipulation.

Misalignment Risks

Evaluations to monitor AI models’ potential to learn dangerous goals and deceive human users.

2. Advanced Capability and Safety Metrics

Anthropic aims to develop evaluations for advanced model capabilities and relevant safety criteria, including:

Advanced Science

Developing new evaluation questions and tasks for scientific research, focusing on areas like knowledge synthesis, graduate-level knowledge, and autonomous research project execution.

Harmfulness and Refusals

Enhancing the evaluation of classifiers’ abilities to detect harmful model outputs.

Improved Multilingual Evaluations

Supporting capability evaluations across multiple languages.

Societal Impacts

Creating sophisticated assessments targeting concepts like harmful biases, economic impacts, and psychological influence.

3. Infrastructure, Tools, and Methods for Developing Evaluations

Anthropic is interested in funding tools and infrastructure to streamline the development of high-quality evaluations:

Templates/No-Code Evaluation Development Platforms

Enabling subject-matter experts to develop evaluations without coding skills.

Evaluations for Model Grading

Improving models’ abilities to review and score outputs from other models using complex rubrics.

Uplift Trials

Conducting large-scale trials to measure a model’s impact through controlled experiments.

Principles of Good Evaluations

Anthropic emphasizes the following principles for developing effective evaluations:

Sufficiently difficult to measure advanced capabilities: Evaluations should be challenging enough to assess models’ true capabilities.
Not in the training data to ensure generalization: Evaluations should not contain information that is present in the model’s training data, ensuring that the evaluation is a genuine test of the model’s abilities.
Efficient, scalable, and ready-to-use: Evaluations should be designed for ease of use, scalability, and efficiency, allowing them to be applied across various models and scenarios.
High volume where possible: Large-scale evaluations can provide more accurate and reliable results.
Domain expertise involvement: Subject-matter experts from relevant domains should be involved in the development and validation of evaluations.
Diversity of formats: Evaluations should be designed for a variety of formats, including but not limited to text-based, image-based, and audio-based inputs.
Expert baselines for comparison: Evaluations should include expert-validated benchmarks or baselines for comparison, ensuring that the model’s performance is accurately assessed.
Good documentation and reproducibility: Evaluations should be thoroughly documented, allowing others to reproduce and build upon them.
Start small, iterate, and scale: Development of evaluations should begin with a focused scope, iterating based on results, and scaling up as needed.

Realistic, safety-relevant threat modeling

Anthropic emphasizes the importance of realistic and relevant threat modeling in developing effective evaluations. Threat models should be informed by real-world risks and vulnerabilities, ensuring that the evaluation is a genuine test of the model’s abilities to address these challenges.

How to Submit a Proposal

Interested parties can submit proposals via [their application form](link to application form). Their team will review submissions on a rolling basis and provide funding options tailored to each project’s needs. Selected proposals will receive guidance from our domain experts to refine evaluations for maximum impact.

Anthropic hopes this initiative will foster a future where comprehensive AI evaluation is an industry standard, advancing the field of AI safety. Join us in this critical endeavor to shape the path forward.