SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

¹Idiap Research Institute, ²EPFL, ³University of Sheffield, ⁴University of Manchester, ⁵Cancer Research UK Manchester Institute

Three reasoning tracks

SciR takes a clean computational problem, one with a known and checkable answer, and rewrites it to read like real scientific documents. It covers three kinds of reasoning: deduction, induction, and causal abduction. Pick a track below to build a task step by step and watch how models handle it.

Figure: SciR pipeline overview (summarizes the whole pipeline).

Deduction (developmental biology). Premises about cell lineages are given. The task: decide whether a hypothesis follows from them.

1. Start from a formal deduction tree

Legend: given premise (listed below) · derived step (intermediate conclusion, not a premise) · hypothesis (root) · missing premise → unknown

2. Build the task up, layer by layer

Premises in first-order logic

3. Feed the finished task to a model
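
The three-way labelling on this track can be illustrated with a minimal sketch. It assumes premises are ground facts plus Horn-style rules, and that a literal prefixed with `~` is a negation; the predicate names, the rule format, and the `label` function are illustrative assumptions, not SciR's actual generator or solver. The sketch forward-chains over the premises and returns `unknown` when a needed premise is missing.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# A rule is (body literals, head literal). Hypothetical format for illustration.
Rule = Tuple[FrozenSet[str], str]


def negate(literal: str) -> str:
    """Return the complementary literal, using '~' as the negation prefix."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def forward_chain(facts: Iterable[str], rules: List[Rule]) -> Set[str]:
    """Apply rules until no new literal can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in derived and body <= derived:
                derived.add(head)
                changed = True
    return derived


def label(facts: Iterable[str], rules: List[Rule], hypothesis: str) -> str:
    """Label the hypothesis as valid, invalid, or unknown."""
    derived = forward_chain(facts, rules)
    if hypothesis in derived:
        return "valid"
    if negate(hypothesis) in derived:
        return "invalid"
    return "unknown"


# Toy cell-lineage premises (made up for this example).
rules = [
    (frozenset({"stem(c1)"}), "divides(c1)"),
    (frozenset({"divides(c1)", "signal(c1)"}), "neural(c1)"),
    (frozenset({"neural(c1)"}), "~muscle(c1)"),
]
facts = {"stem(c1)", "signal(c1)"}

print(label(facts, rules, "neural(c1)"))         # valid
print(label(facts, rules, "muscle(c1)"))         # invalid
print(label({"stem(c1)"}, rules, "neural(c1)"))  # unknown: signal(c1) is missing
```

Dropping a single premise turns a `valid` instance into an `unknown` one, which matches the "missing premise → unknown" case in the tree legend.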

Induction (pharmacology). Drug profiles and some interacting drug pairs are given. The task: induce the rule behind the interactions.

1. Start from a hidden rule

2. Build the task up

Drug profiles & observed interactions

3. Feed the finished task to a model
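
As a rough illustration of this track, the sketch below generates observed interactions from a hidden rule and checks whether a candidate rule reproduces them. The profile fields (`inhibits`, `metabolised_by`), the drug names, and the rule itself are assumptions invented for the example; the page does not specify SciR's real rule family.

```python
from itertools import combinations
from typing import Callable, Dict, Set, Tuple

Profile = Dict[str, Set[str]]
RuleFn = Callable[[Profile, Profile], bool]


def hidden_rule(a: Profile, b: Profile) -> bool:
    """Hypothetical rule: one drug inhibits an enzyme that metabolises the other."""
    return bool(a["inhibits"] & b["metabolised_by"]
                or b["inhibits"] & a["metabolised_by"])


def shared_enzyme(a: Profile, b: Profile) -> bool:
    """A plausible but wrong candidate: both drugs share a metabolising enzyme."""
    return bool(a["metabolised_by"] & b["metabolised_by"])


def observe(profiles: Dict[str, Profile], rule: RuleFn) -> Dict[Tuple[str, str], bool]:
    """Label every unordered drug pair with the rule's verdict."""
    return {(x, y): rule(profiles[x], profiles[y])
            for x, y in combinations(sorted(profiles), 2)}


def consistent(candidate: RuleFn, profiles: Dict[str, Profile],
               observations: Dict[Tuple[str, str], bool]) -> bool:
    """True if the candidate rule agrees with every observed pair."""
    return all(candidate(profiles[x], profiles[y]) == seen
               for (x, y), seen in observations.items())


# Made-up drug profiles.
profiles = {
    "drugA": {"inhibits": {"E1"}, "metabolised_by": {"E2"}},
    "drugB": {"inhibits": set(), "metabolised_by": {"E1"}},
    "drugC": {"inhibits": {"E2"}, "metabolised_by": {"E3"}},
    "drugD": {"inhibits": set(), "metabolised_by": {"E3"}},
}

observations = observe(profiles, hidden_rule)
print(consistent(hidden_rule, profiles, observations))    # True
print(consistent(shared_enzyme, profiles, observations))  # False
```

Because the ground-truth rule is known, any rule a model proposes can be checked mechanically against the observations rather than judged by reading its explanation.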

Causal abduction (cell signalling). Intervention data on the Sachs et al. (2005) protein network is given. The task: work out how a new protein, XYZ, connects.

1. Start from a causal graph

Legend: known protein · new protein (XYZ) and its true connection

2. Build the task up

Simulated concentration data

3. Feed the finished task to a model
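
To make the idea of simulated intervention data concrete, here is a small sketch under strong assumptions: a toy three-protein chain stands in for the Sachs network, the mechanism is a linear structural equation model with Gaussian noise, and XYZ is wired as a child of `P2`. The graph, weights, and function names are all hypothetical. Clamping a protein and comparing the mean of XYZ at two clamp levels shows which interventions reach XYZ.

```python
import random
from statistics import mean
from typing import Dict, Optional

# Hypothetical toy graph: P1 -> P2 -> P3, with the new protein XYZ as a child of P2.
PARENTS: Dict[str, Dict[str, float]] = {
    "P1": {},
    "P2": {"P1": 0.8},
    "P3": {"P2": 0.7},
    "XYZ": {"P2": 0.9},
}
ORDER = ["P1", "P2", "P3", "XYZ"]  # topological order of the graph


def sample(rng: random.Random, do: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Draw one sample; proteins listed in `do` are clamped to the given value."""
    do = do or {}
    values: Dict[str, float] = {}
    for node in ORDER:
        if node in do:
            values[node] = do[node]  # intervention cuts the node off from its parents
        else:
            signal = sum(w * values[p] for p, w in PARENTS[node].items())
            values[node] = signal + rng.gauss(0.0, 1.0)
    return values


def mean_shift(target: str, intervened: str, rng: random.Random, n: int = 2000) -> float:
    """Difference in the target's mean between a high and a low clamp."""
    high = mean(sample(rng, {intervened: 2.0})[target] for _ in range(n))
    low = mean(sample(rng, {intervened: -2.0})[target] for _ in range(n))
    return high - low


rng = random.Random(0)
shift_p2 = mean_shift("XYZ", "P2", rng)  # large: P2 is a parent of XYZ
shift_p3 = mean_shift("XYZ", "P3", rng)  # near zero: P3 is not upstream of XYZ
print(round(shift_p2, 2), round(shift_p3, 2))
```

The pattern of which interventions move XYZ, and which interventions on XYZ move other proteins, is the evidence a solver has to explain when it proposes a connection.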


Results

We put all three tracks through six models and three solver setups, making both extraction and inference progressively harder. Open the results to see where each one breaks.

Full accuracy table

Accuracy (%) on n = 200 tasks per cell. Pick a track and solver; the best model per column is highlighted. R marks reasoning models.

What the labels mean: NL, Obf, Easy, Hard, CoT, NS, SymbCoT*

Scaling extraction difficulty
  NL: premises in clean natural language
  Obf: premises hidden in multi-document scientific text

Scaling inference complexity
  Easy: smaller formal object, fewer inference steps
  Hard: larger formal object, more inference steps

Which solver
  CoT: the LLM reasons step by step
  NS: the LLM formalises, a symbolic solver answers
  SymbCoT*: the LLM formalises, a second LLM call answers

Accuracy along the two axes (shown as a chart)

Chance-normalised accuracy. Each panel is one (tier, rendering) cell; the red dot is direct CoT, the green square is the neuro-symbolic solver, and the arrow shows the lift. Hover any column to read both numbers.
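
The page does not spell out its normalisation formula. A common convention, assumed here, rescales accuracy so that chance maps to 0 and a perfect score maps to 1: (accuracy − chance) / (1 − chance). The sketch below applies that convention; the function names and the chance levels in the usage lines are illustrative, not values taken from SciR.

```python
def chance_normalised(accuracy: float, chance: float) -> float:
    """Rescale accuracy so chance level maps to 0 and perfect accuracy to 1.

    Both arguments are fractions in [0, 1]. This is the usual convention,
    assumed here rather than confirmed by the source.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (accuracy - chance) / (1.0 - chance)


def lift(cot_accuracy: float, ns_accuracy: float, chance: float) -> float:
    """Gain of the neuro-symbolic solver over direct CoT, on the normalised scale."""
    return chance_normalised(ns_accuracy, chance) - chance_normalised(cot_accuracy, chance)


# Illustrative numbers only: a two-way task where CoT is at chance.
print(chance_normalised(0.5, 0.5))  # 0.0
print(lift(0.5, 0.75, 0.5))         # 0.5
```

Normalising this way keeps panels comparable when tasks have different numbers of answer options, since raw accuracy at chance differs from task to task.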

Each model's inference vs extraction profile

Solving a task takes two steps: pull the premises out of the documents (extraction), then reason over them (inference). Three settings tease these apart, one for each step alone and one for both together.

How we isolate extraction and inference
scientific documents → extraction → formal object → inference → answer (valid · invalid · unknown)
NL · CoT (natural language · chain-of-thought): inference

Premises are already clean, so extraction is trivial. This isolates reasoning.

Obf · NS (obfuscated · neuro-symbolic): extraction

A verified solver handles the inference, so the score reflects finding and formalising premises in messy text.

Obf · CoT (obfuscated · chain-of-thought): both

The model must extract and reason. This is the full task.

The scatter below has one point per model (averaged over the selected tracks and tiers): x = extraction (Obf · NS), y = inference (NL · CoT), marker size = joint (Obf · CoT). A point above the diagonal means the model reasons better than it extracts.
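
A minimal sketch of how one scatter point could be computed from per-cell accuracies. The dictionary layout, the function names, and the numbers are placeholders invented for the example, not results from the benchmark.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Accuracies (%) per (rendering, solver) setting, one entry per selected track and tier.
Cells = Dict[Tuple[str, str], List[float]]


def profile(cells: Cells) -> Tuple[float, float, float]:
    """Return (x, y, size) for one model's point in the scatter."""
    x = mean(cells[("Obf", "NS")])      # extraction: a verified solver does the inference
    y = mean(cells[("NL", "CoT")])      # inference: premises are already clean
    size = mean(cells[("Obf", "CoT")])  # joint: the model extracts and reasons
    return x, y, size


def side_of_diagonal(x: float, y: float) -> str:
    """'above' means the model reasons better than it extracts."""
    if y > x:
        return "above"
    if y < x:
        return "below"
    return "on"


# Placeholder values for a hypothetical model.
placeholder: Cells = {
    ("Obf", "NS"): [40.0, 60.0],
    ("NL", "CoT"): [70.0, 90.0],
    ("Obf", "CoT"): [30.0, 50.0],
}

x, y, size = profile(placeholder)
print(x, y, size, side_of_diagonal(x, y))  # 50.0 80.0 40.0 above
```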

Finding: reasoning models lead on both axes, but the gap is widest on inference. deepseek-r1 beats gpt-4o by 53 points on inference but only 23 on extraction.