SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

Three reasoning tracks

SciR takes a clean computational problem, one with a known and checkable answer, and rewrites it to read like real scientific documents. It covers three kinds of reasoning: deduction, induction, and causal abduction. Pick a track below to build a task step by step and watch how models handle it.

See the figure that summarizes the whole pipeline

Developmental biology. Premises about cell lineages are given. The task: decide whether a hypothesis follows from them.

Start from a formal deduction tree

Legend: given premise (listed below) · derived step (intermediate conclusion, not a premise) · hypothesis (root) · missing premise → unknown
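The legend's last entry, a missing premise leading to "unknown", can be sketched in a few lines. This is a minimal illustration that assumes propositional Horn rules rather than SciR's first-order premises. The function names, the `not_` prefix for negation, and the atom names are all hypothetical and not taken from the benchmark.

```python
def forward_chain(facts, rules):
    """Return every atom derivable from `facts` with Horn rules (body, head)."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and set(body) <= known:
                known.add(head)
                changed = True
    return known


def negate(atom):
    """Toggle the hypothetical 'not_' prefix used here to encode negation."""
    return atom[4:] if atom.startswith("not_") else "not_" + atom


def label(facts, rules, hypothesis):
    """Three-way label: 'valid' if the hypothesis is derived, 'invalid' if
    its negation is derived, 'unknown' if neither can be derived."""
    derived = forward_chain(facts, rules)
    if hypothesis in derived:
        return "valid"
    if negate(hypothesis) in derived:
        return "invalid"
    return "unknown"


# Hypothetical two-step tree: one derived step, then the root hypothesis.
RULES = [
    (("lineage_marker_present",), "progenitor_state"),
    (("progenitor_state", "signal_active"), "cell_differentiates"),
]
FULL = {"lineage_marker_present", "signal_active"}
PARTIAL = {"lineage_marker_present"}  # one given premise removed

print(label(FULL, RULES, "cell_differentiates"))
print(label(PARTIAL, RULES, "cell_differentiates"))
```

With every premise present the root is derived, so the label is "valid". Removing one given premise breaks the chain at the derived step, so the same hypothesis becomes "unknown".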

Build the task up, layer by layer

Premises in first-order logic

Feed the finished task to a model

Pharmacology. Drug profiles and some interacting drug pairs are given. The task: induce the rule behind the interactions.

Start from a hidden rule

Build the task up

Drug profiles & observed interactions
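To make the induction task concrete, the sketch below checks whether a candidate rule accounts for a set of observed interacting pairs. It is an illustration under stated assumptions: the drug names, the property tags, and both candidate rules are invented for this example and are not SciR data. Because only some interacting pairs are given, the check asks whether the rule covers every observed pair, not whether it predicts those pairs exactly.

```python
from itertools import combinations

# Hypothetical drug profiles: each drug maps to a set of property tags.
PROFILES = {
    "drug_a": {"inhibits_E1"},
    "drug_b": {"substrate_E1"},
    "drug_c": {"substrate_E2"},
    "drug_d": {"inhibits_E2"},
}
# Hypothetical observed interactions (unordered pairs).
OBSERVED = {frozenset({"drug_a", "drug_b"}), frozenset({"drug_c", "drug_d"})}


def inhibitor_substrate_rule(p, q):
    """Candidate rule: p inhibits an enzyme that q is a substrate of."""
    return any(
        "substrate_" + tag.split("_", 1)[1] in q
        for tag in p
        if tag.startswith("inhibits_")
    )


def both_inhibitors_rule(p, q):
    """Competing candidate: both drugs are inhibitors of something."""
    def is_inhibitor(tags):
        return any(t.startswith("inhibits_") for t in tags)
    return is_inhibitor(p) and is_inhibitor(q)


def predicted_pairs(rule, profiles):
    """All unordered pairs the rule flags, trying both argument orders."""
    return {
        frozenset({x, y})
        for x, y in combinations(sorted(profiles), 2)
        if rule(profiles[x], profiles[y]) or rule(profiles[y], profiles[x])
    }


def explains(rule, profiles, observed):
    """True if every observed interaction is predicted by the rule."""
    return observed <= predicted_pairs(rule, profiles)


print(explains(inhibitor_substrate_rule, PROFILES, OBSERVED))
print(explains(both_inhibitors_rule, PROFILES, OBSERVED))
```

The first candidate covers both observed pairs. The second covers neither, since it only flags the pair of two inhibitors.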

Feed the finished task to a model

Cell signalling. Intervention data on the Sachs (2005) protein network is given. The task: work out how a new protein, XYZ, connects.

Start from a causal graph

Legend: known protein · new protein (XYZ) and its true connection

Build the task up

Simulated concentration data
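The sketch below shows one way intervention data of this kind could be simulated, and why it carries information about where XYZ attaches. Everything here is an assumption made for illustration: a toy chain Raf → Mek → Erk stands in for the full network, XYZ is attached to Mek arbitrarily, and the linear-Gaussian mechanism with unit weights is a simplification, not the simulator SciR uses.

```python
import random
import statistics

# Toy graph (assumption): Raf -> Mek -> Erk, with hypothetical XYZ as a child of Mek.
PARENTS = {"Raf": [], "Mek": ["Raf"], "Erk": ["Mek"], "XYZ": ["Mek"]}
ORDER = ["Raf", "Mek", "Erk", "XYZ"]  # a topological order of the graph


def sample(rng, do=None):
    """Draw one sample from a linear-Gaussian model.

    `do` maps node names to clamped values (a hard intervention); every
    other node is the sum of its parents plus small Gaussian noise.
    """
    do = do or {}
    values = {}
    for node in ORDER:
        if node in do:
            values[node] = do[node]
        else:
            parent_sum = sum(values[p] for p in PARENTS[node])
            values[node] = parent_sum + rng.gauss(0.0, 0.1)
    return values


def mean_xyz(do, n=500, seed=0):
    """Average XYZ level over `n` samples drawn under intervention `do`."""
    rng = random.Random(seed)
    return statistics.fmean(sample(rng, do)["XYZ"] for _ in range(n))


# Clamping XYZ's parent shifts XYZ; clamping a non-ancestor leaves it near baseline.
print(round(mean_xyz({"Mek": 2.0}), 2))
print(round(mean_xyz({"Erk": 2.0}), 2))
```

In this toy setup, clamping Mek moves the mean of XYZ to about the clamped value, while clamping Erk leaves it near zero. Clamping Raf would also shift XYZ, through Mek, so telling a direct parent from an upstream ancestor takes more than one intervention.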

Feed the finished task to a model

Results

We put all three tracks through six models and three solver setups, making both extraction and inference progressively harder. Open the results to see where each one breaks.

Full accuracy table

Accuracy (%) on n = 200 tasks per cell. Pick a track and solver; the best model per column is highlighted. R marks reasoning models.

What the labels mean: NL, Obf, Easy, Hard, CoT, NS, SymbCoT*

Scaling extraction difficulty. NL: premises in clean natural language. Obf: premises hidden in multi-document scientific text.

Scaling inference complexity. Easy: smaller formal object, fewer inference steps. Hard: larger formal object, more inference steps.

Which solver. CoT: the LLM reasons step by step. NS: the LLM formalises, a symbolic solver answers. SymbCoT*: the LLM formalises, a second LLM call answers.

Accuracy along the two axes (shown as a chart)

Chance-normalised accuracy. Each panel is one (tier, rendering) cell; the red dot is direct CoT, the green square is the neuro-symbolic solver, and the arrow shows the lift. Hover any column to read both numbers.
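The page does not spell out how the accuracy is normalised. One common convention, assumed here and not confirmed by the source, rescales accuracy so that random guessing maps to 0 and a perfect score maps to 1. The function name and the example chance level are illustrative.

```python
def chance_normalised(accuracy, chance):
    """Rescale accuracy so that `chance` maps to 0.0 and 1.0 stays 1.0.

    Both arguments are fractions in [0, 1]. Scores below chance come out
    negative. This is an assumed convention, not SciR's stated formula.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (accuracy - chance) / (1.0 - chance)


# Example with a hypothetical two-way task (chance = 0.5).
print(chance_normalised(0.75, 0.5))  # 0.5
```

Under this convention a three-way task such as valid / invalid / unknown would use a chance level of 1/3, assuming balanced labels.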

Each model's inference vs extraction profile

Solving a task takes two steps: pull the premises out of the documents (extraction), then reason over them (inference). Three settings tease these apart, one for each step alone and one for both together.

How we isolate extraction and inference

scientific documents → extraction → formal object → inference → answer (valid · invalid · unknown)

NL · CoT (natural language · chain-of-thought): inference. Premises are already clean, so extraction is trivial. This isolates reasoning.

Obf · NS (obfuscated · neuro-symbolic): extraction. A verified solver handles the inference, so the score reflects finding and formalising premises in messy text.

Obf · CoT (obfuscated · chain-of-thought): both. The model must extract and reason. This is the full task.

In the scatter below, there is one point per model (averaged over the selected tracks and tiers): x = extraction (Obf · NS), y = inference (NL · CoT), marker size = joint (Obf · CoT). A point above the diagonal means the model reasons better than it extracts.
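The mapping from the three settings to a point in the scatter can be written down directly. The sketch below is illustrative only: the class name, its methods, and the numbers in the example are placeholders, not results from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """One model's point in the scatter, in accuracy points."""
    extraction: float  # x axis: Obf · NS
    inference: float   # y axis: NL · CoT
    joint: float       # marker size: Obf · CoT

    def gap(self):
        """Signed vertical distance from the diagonal y = x."""
        return self.inference - self.extraction

    def side(self):
        """Where the point sits relative to the diagonal."""
        if self.gap() > 0:
            return "above diagonal: reasons better than it extracts"
        if self.gap() < 0:
            return "below diagonal: extracts better than it reasons"
        return "on diagonal"


# Placeholder numbers, not measured results.
example = Profile(extraction=40.0, inference=70.0, joint=35.0)
print(example.gap(), example.side())
```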

Finding: Reasoning models lead on both axes, but the gap is widest on inference: deepseek-r1 beats gpt-4o by 53 points on inference versus 23 on extraction.
