Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

Mihir Prabhudesai*^ · Aryan Satpathy*^ · Yangmin Li^ · Zheyang Qin^
Nikash Bhardwaj · Amir Zadehλ · Chuan Liλ · Katerina Fragkiadaki · Deepak Pathak

Carnegie Mellon University  ·  λ Lambda
* Project co-leads  ·  ^ Core contributors

We present Sim2Reason: a method for turning physics simulators into scalable generators of question–answer pairs that improve LLM reasoning, removing the need for human annotation in the data-generation pipeline. The core idea is to structure the randomization with a domain-specific language (DSL) and use it to procedurally generate reasoning problems, as illustrated in the examples above. LLMs finetuned on this synthetic data achieve zero-shot improvements on real-world benchmarks such as the International Physics Olympiad.

Abstract

We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question–answer (QA) pairs—a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack sufficiently large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. We generate random scenes in physics engines, create synthetic question–answer pairs from simulated interactions using pre-written templates, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by up to 7 percentage points across different model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data.

Sim2Reason pipeline overview
Figure 1: Overview of the Sim2Reason pipeline.

Random Scene Generation

We design a domain-specific language (DSL) to structure the randomization of scene graphs, enabling controlled variation along physically meaningful axes while guaranteeing validity and diversity. Randomization is explicitly restricted to parameters that affect system dynamics; for example, the absolute length of a pulley string is irrelevant to the physics and is therefore excluded. Domain knowledge is used only to prune such non-contributing degrees of freedom, avoiding unnecessary variability without constraining the underlying physics. This constitutes a weak, low-cost form of supervision that scales naturally and can be automated using LLMs.
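As a concrete illustration, the sketch below shows what structured randomization might look like in code: only parameters that affect the dynamics are sampled, while pruned degrees of freedom (such as the absolute string length) stay fixed. All names, parameters, and ranges here are hypothetical, not the paper's actual DSL.

```python
import random

# Hypothetical scene sampler for a pulley-on-incline setup.
# Only dynamics-relevant parameters are randomized; pruned
# parameters are held constant. Names/ranges are illustrative.

DYNAMICS_PARAMS = {            # sampled: these change the physics
    "mass_a_kg": (0.5, 5.0),
    "mass_b_kg": (0.5, 5.0),
    "incline_deg": (10.0, 60.0),
    "friction_coeff": (0.0, 0.8),
}

FIXED_PARAMS = {               # pruned: irrelevant to the dynamics
    "string_length_m": 1.0,
    "pulley_color": "gray",
}

def sample_scene(rng: random.Random) -> dict:
    """Draw one random scene along physically meaningful axes only."""
    scene = dict(FIXED_PARAMS)
    for name, (lo, hi) in DYNAMICS_PARAMS.items():
        scene[name] = rng.uniform(lo, hi)
    return scene

rng = random.Random(0)
scene = sample_scene(rng)
```

In practice the DSL is written in YAML (Figure 2), and an LLM could automate the choice of which parameters to prune, as noted above.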

Figure 2: DSL scene description (top left), generated MuJoCo scene (top right), recorded data plots (bottom right), and the corresponding QA pair (bottom left).

Question Answer Pair Generation

Since our scenes are structured, we can write simple natural-language descriptions for each atomic element. By combining these descriptions according to the scene graph structure, we can automatically generate diverse question–answer pairs without human annotation. A question is then formed by randomly selecting one of the relevant physical quantities of one of the bodies in the scene, and the answer is computed by running the simulation.

We support three question types—numeric, reverse, and symbolic—each requiring a distinct solution strategy:

  • Numeric: Query the value of a physical quantity (e.g., “What is the tension in string T₁?”).
  • Reverse: Specify an outcome and infer the underlying parameter (e.g., “If the tension in string T₁ is 5 N, what is the mass of block A?”).
  • Symbolic: Express physical quantities as symbols and solve for their relationships.
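The three question types can be sketched for a single scene family. The example below uses an ideal Atwood machine (two blocks on a frictionless pulley, matching the tension examples above) with a closed-form answer for clarity; the actual pipeline computes answers by running the simulator, and all template and function names here are illustrative.

```python
# Template-based QA generation sketch for an ideal Atwood machine.
# Answers use the closed-form tension T1 = 2*m_a*m_b*g / (m_a + m_b);
# the real pipeline would obtain them from the simulator instead.

G = 9.81  # gravitational acceleration, m/s^2

def tension(m_a: float, m_b: float) -> float:
    """String tension for an ideal (massless pulley, inextensible string) Atwood machine."""
    return 2.0 * m_a * m_b * G / (m_a + m_b)

def numeric_question(m_a: float, m_b: float):
    q = (f"Blocks A ({m_a} kg) and B ({m_b} kg) hang from an ideal pulley. "
         f"What is the tension in string T1?")
    return q, tension(m_a, m_b)

def reverse_question(m_b: float, t1: float):
    # Invert tension(m_a, m_b) = t1 for m_a: m_a = t1*m_b / (2*g*m_b - t1)
    q = (f"Block B has mass {m_b} kg and the tension in string T1 is {t1} N. "
         f"What is the mass of block A?")
    return q, t1 * m_b / (2.0 * G * m_b - t1)

def symbolic_question():
    q = "Express the tension T1 in terms of m_a, m_b, and g."
    return q, "T1 = 2*m_a*m_b*g / (m_a + m_b)"
```

A round trip checks consistency: generating a numeric answer for known masses and feeding it into the reverse template recovers the original mass.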

Results

Across the Qwen2.5 model family, improvements on synthetic tasks consistently translate into gains on IPhO Mechanics, and this relationship holds across model scale. Models that benefit more from synthetic-data RL on numeric, reverse, and symbolic tasks also show larger improvements on real-world physics problems, indicating meaningful zero-shot transfer without any explicit sim-to-real adaptation. Beyond IPhO, the same synthetic RL procedure improves performance on a range of established physics and mathematics benchmarks, demonstrating that the gains generalize across domains and evaluation settings.

Model         | Synthetic Numeric      | Synthetic Symbolic    | HCV                    | IPhO Mechanics
Qwen3-30B     | 14.8% → 17.4% (+2.6)   | 8.8% → 8.0% (-0.8)    | 53.9% → 59.0% (+5.1)   | 35.6% → 40.0% (+4.4)
Qwen2.5-32B   | 8.9% → 21.9% (+13.0)   | 5.6% → 10.4% (+4.8)   | 50.6% → 53.9% (+3.3)   | 19.8% → 25.2% (+5.4)
Qwen2.5-14B   | 7.0% → 17.0% (+10.0)   | 5.6% → 10.4% (+4.8)   | 49.3% → 51.7% (+2.4)   | 16.07% → 20.45% (+4.4)
Qwen2.5-7B    | 7.7% → 16.3% (+8.6)    | 5.6% → 9.6% (+4.0)    | 44.5% → 46.3% (+1.8)   | 10.7% → 15.1% (+4.4)
Qwen2.5-3B    | 4.8% → 12.5% (+7.7)    | 3.2% → 9.4% (+6.2)    | 31.9% → 39.5% (+7.6)   | 5.68% → 13.15% (+7.5)

Table 1: Qwen3 and Qwen2.5 Instruct performance before and after RL on synthetic data, on synthetic tasks and IPhO Mechanics. Format: baseline → RL (Δ, percentage points).

Benchmark     | Baseline | RL     | Δ
JEEBench      | 34.38%   | 52.28% | +17.90
PHYSICS       | 39.42%   | 43.09% | +3.67
OlympiadBench | 41.41%   | 44.53% | +3.12
AIME 25       | 10.83%   | 12.50% | +1.67
MATH 500      | 78.4%    | 82.8%  | +4.4

Table 2: Qwen2.5-32B Instruct on real-world benchmarks before and after RL on synthetic data (Δ in percentage points).

Simulator as a Benchmarking Tool

Our simulator-generated synthetic data provides a fast, low-cost proxy for evaluating scientific reasoning. Unlike real-world physics experiments, which are expensive and slow to construct, simulator-based benchmarks are cheap, scalable, and easily reproducible. As shown in Figure 3, model performance on our synthetic tasks strongly correlates with performance on real-world physics problems (such as IPhO), indicating that synthetic evaluation can reliably predict real-world reasoning ability.
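This correlation can be sanity-checked directly from the numbers in Table 1. The sketch below computes the Pearson correlation between baseline (pre-RL) Synthetic Numeric accuracy and baseline IPhO Mechanics accuracy across the five models; the paper's Figure 3 may use a different pairing or more data points.

```python
import math

# Baseline (pre-RL) accuracies from Table 1, ordered
# Qwen3-30B, Qwen2.5-32B, Qwen2.5-14B, Qwen2.5-7B, Qwen2.5-3B.
synthetic_numeric = [14.8, 8.9, 7.0, 7.7, 4.8]
ipho_mechanics    = [35.6, 19.8, 16.07, 10.7, 5.68]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(synthetic_numeric, ipho_mechanics)
print(f"Pearson r = {r:.2f}")  # strong positive correlation
```

Even on these five points, the correlation is strongly positive, consistent with the claim that synthetic accuracy tracks real-world physics performance.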

Synthetic vs IPhO Performance

Figure 3: Correlation between accuracy on Sim2Reason synthetic questions and IPhO mechanics questions.

What does the model learn?

After RL fine-tuning on synthetic data, the model shows a clear shift in how it uses physics equations. Arithmetic mistakes largely vanish, but more importantly, equations are selected based on the physical setup rather than applied verbatim.

Additionally, these gains are not constrained to the scope of the simulator used to generate the training data. The fine-tuned model also improves on problems that are not simulatable in MuJoCo, indicating that this recipe can drive learning that generalizes beyond the domain knowledge in the simulator.

[Example reasoning traces: ⚠ Base Model vs. ✓ RL Fine-tuned Model]