
The Obfuscation Atlas


Official codebase for the paper The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes.

This codebase implements methods for training deception probes (linear and nonlinear detectors on model activations) and for running RLVR (reinforcement learning with verifiable rewards) experiments that train language models against those probes.

The 500+ trained models from the paper are available in the Hugging Face collection.

Installation

Requires Python 3.12.

git clone https://github.com/AlignmentResearch/obfuscation-atlas.git
cd obfuscation-atlas
make install

This installs the afterburner training framework from third_party/ and all Python dependencies into a .venv managed by uv.
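After installation, you can either activate the virtual environment or run one-off commands through uv. The example below is standard uv usage, assuming make install has installed the obfuscation_atlas package into .venv:

source .venv/bin/activate                                  # activate the environment, or
uv run python -c "import obfuscation_atlas; print('ok')"   # run a command inside the venv without activating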

Project Structure

obfuscation_atlas/
  config.py             # Hydra-based experiment configuration
  training/             # Training loops (GRPO, obfuscation loss)
  detectors/            # Probe/detector training and evaluation
  tasks/                # Task definitions (emergent misalignment, code gen, etc.)
  utils/                # Shared utilities
  scripts/              # Experiment entry points
third_party/
  afterburner/          # FAR AI LLM training framework
tests/                  # Test suite

Configuration

All experiments use Hydra with Python dataclasses for configuration. Override any config value from the command line with key.subkey=value.

Key registered configs (defined in obfuscation_atlas/config.py):

Config name               Description
grpo_obfuscation_config   GRPO RL training against probes
grpo_code_gen             GRPO for code generation task
rm_config                 Reward model training
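
For orientation, the Hydra structured-config pattern behind these names looks roughly like the sketch below; the dataclass fields are illustrative placeholders, not the actual fields defined in obfuscation_atlas/config.py.

# Minimal sketch of registering a dataclass-based config with Hydra.
# Field names here are hypothetical examples, not the repo's real schema.
from dataclasses import dataclass, field

from hydra.core.config_store import ConfigStore

@dataclass
class TrainingConfig:
    max_steps: int = 1000
    learning_rate: float = 1e-4

@dataclass
class ExperimentConfig:
    training: TrainingConfig = field(default_factory=TrainingConfig)

cs = ConfigStore.instance()
cs.store(name="grpo_obfuscation_config", node=ExperimentConfig)

# Any registered field can then be overridden from the command line, e.g.
#   python train.py training.learning_rate=3e-4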

Training Probes

Probes are trained on model activations to detect deceptive behavior. Supported detector types: linear-probe, mean-difference-probe, nonlinear-probe, quadratic-probe, attention-probe, transformer-probe, gdm-probe, mahalanobis, vae, nflow, black-box.

Python API

from transformers import AutoModelForCausalLM, AutoTokenizer
from obfuscation_atlas.config import BaseTrainConfig
from obfuscation_atlas.detectors.train import train_detector
from obfuscation_atlas.detector_arch_presets import get_detector_arch_config

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

train_cfg = BaseTrainConfig(batch_size=32, learning_rate=3e-3, max_steps=1000)
detector_arch = get_detector_arch_config("linear-probe")
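# positive_examples and negative_examples are assumed to be lists of text
# samples labeled deceptive vs. honest; see the repository's task/dataset
# utilities for the exact expected format.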

detector_dict, features, dynamics = train_detector(
    model=model,
    tokenizer=tokenizer,
    train_dataset=(positive_examples, negative_examples),
    train_cfg=train_cfg,
    layers=list(range(model.config.num_hidden_layers)),
    obfuscate_over="last-token-generation",
    detector_arch_config=detector_arch,
)

Evaluating detectors

from obfuscation_atlas.detectors.evaluate import train_and_eval_multiple_detectors

results = train_and_eval_multiple_detectors(
    model=model,
    tokenizer=tokenizer,
    layers=list(range(model.config.num_hidden_layers)),
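    # pos_train/neg_train and pos_test/neg_test are assumed to be lists of
    # deceptive vs. honest text samples (same format as the training example above).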
    train_dataset=(pos_train, neg_train),
    test_dataset=(pos_test, neg_test),
    train_cfg=train_cfg,
    detector_arch_configs=[
        get_detector_arch_config("linear-probe"),
        get_detector_arch_config("nonlinear-probe"),
    ],
    obfuscate_over=["generation", "last-token-generation"],
)

Running RL Experiments

GRPO training against probes

GRPO (Group Relative Policy Optimization) trains a model via RL while evaluating against deception probes. This requires a vLLM server for generation.

1. Start the vLLM server (on a separate GPU):

CUDA_VISIBLE_DEVICES=1 python -m afterburner.generation.vllm.server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --max-lora-rank 128 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 1

2. Launch GRPO training:

We recommend using model_preset to automatically configure model-specific parameters like batch_size, num_epochs, and model_batch_size for optimal performance.

# 8B model (using preset)
python -m obfuscation_atlas.scripts.train_grpo_model_against_probe \
    +model_preset=llama3_8b \
    detector.eval_detector_types="['linear-probe']"

# 70B model with FSDP (multi-GPU, using preset)
accelerate launch --num_processes 4 --config_file obfuscation_atlas/accelerate_config.yaml \
    -m obfuscation_atlas.scripts.train_grpo_model_against_probe \
    +model_preset=llama3_70b \
    detector.eval_detector_types="['linear-probe']"

Common GRPO overrides:

Presets (llama3_8b, llama3_70b, gemma3_27b, etc.) are the preferred way to set up experiments: they bundle parameters such as batch_size, num_epochs, and detector.model_batch_size (the batch size used when caching activations for the detector) so that runs fit on H100 GPUs without crashing.

+model_preset=llama3_8b
training.max_steps=3000
training.learning_rate=1e-4
loss.kl=0.01
loss.task=1.0
loss.detector=1.0
grpo.batching.num_generations=8
detector.eval_detector_types="['linear-probe','nonlinear-probe']"
detector.train_against_types="['linear-probe']"
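
Putting these together, a full launch might look like the following (values are illustrative; all keys are taken from the list above):

python -m obfuscation_atlas.scripts.train_grpo_model_against_probe \
    +model_preset=llama3_8b \
    training.max_steps=3000 \
    loss.kl=0.01 \
    detector.eval_detector_types="['linear-probe','nonlinear-probe']" \
    detector.train_against_types="['linear-probe']"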

Running Tests

make test          # Fast tests only
make test-all      # All tests including slow/integration

Code Quality

make lint          # Run ruff checks
make format        # Auto-format with ruff

Citation

@misc{taufeeque2026obfuscationatlasmappinghonesty,
      title={{The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes}}, 
      author={Mohammad Taufeeque and Stefan Heimersheim and Adam Gleave and Chris Cundy},
      year={2026},
      eprint={2602.15515},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.15515}, 
}
