SciR takes a clean computational problem, one with a known and checkable answer, and rewrites it to read like real scientific documents. It covers three kinds of reasoning: deduction, induction, and causal abduction. Pick a track below to build a task step by step and watch how models handle it.
Developmental biology. Premises about cell lineages are given. The task: decide whether a hypothesis follows from them.
Pharmacology. Drug profiles and some interacting drug pairs are given. The task: induce the rule behind the interactions.
Cell signalling. Intervention data on the Sachs et al. (2005) protein network is given. The task: work out how a new protein, XYZ, connects to it.
We put all three tracks through six models and three solver setups, making both extraction and inference progressively harder. Open the results to see where each one breaks.
Accuracy (%) on n = 200 tasks per cell. Pick a track and solver; the best model per column is highlighted. R marks reasoning models.
Chance-normalised accuracy. Each panel is one (tier, rendering) cell; the red dot is direct CoT, the green square is the neuro-symbolic solver, and the arrow shows the lift. Hover any column to read both numbers.
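The normalisation formula is not spelled out here. A common convention, assumed in this sketch, rescales raw accuracy so that chance maps to 0 and a perfect score maps to 1. The function names `chance_normalised` and `lift` are illustrative and not taken from the benchmark.

```python
def chance_normalised(accuracy: float, chance: float) -> float:
    """Rescale accuracy so chance -> 0.0 and perfect -> 1.0 (assumed convention)."""
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (accuracy - chance) / (1.0 - chance)


def lift(cot_accuracy: float, ns_accuracy: float, chance: float) -> float:
    """Arrow length in a panel: normalised neuro-symbolic score minus
    normalised direct-CoT score (both as fractions, not percentages)."""
    return chance_normalised(ns_accuracy, chance) - chance_normalised(cot_accuracy, chance)
```

With made-up numbers for a binary task (chance 0.5), a direct-CoT accuracy of 0.6 and a neuro-symbolic accuracy of 0.8 normalise to roughly 0.2 and 0.6, a lift of about 0.4. Under this convention a score below chance comes out negative.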
Solving a task takes two steps: pull the premises out of the documents (extraction), then reason over them (inference). Three settings tease these apart, one for each step alone and one for both together.
Inference only (NL · CoT). Premises are already clean, so extraction is trivial. This isolates reasoning.
Extraction only (Obf · NS). A verified solver handles the inference, so the score reflects finding and formalising premises in messy text.
Joint (Obf · CoT). The model must both extract and reason. This is the full task.
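One way to picture the three settings is as a two-stage pipeline in which each stage can be handed either to the model or to a trusted component. The sketch below is a toy illustration under that assumption, not the benchmark's code: `forward_chain` stands in for a verified solver on simple if-then rules, and `run_task`, `Rule`, and `Premises` are hypothetical names.

```python
from typing import Callable, FrozenSet, Tuple

Rule = Tuple[FrozenSet[str], str]                    # (body atoms, head atom)
Premises = Tuple[FrozenSet[str], Tuple[Rule, ...]]   # (known facts, rules)


def forward_chain(premises: Premises, hypothesis: str) -> bool:
    """Toy stand-in for a verified solver: apply rules until nothing new
    is derived, then check whether the hypothesis was reached."""
    facts, rules = set(premises[0]), premises[1]
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if body <= facts and head not in facts:
                facts.add(head)
                changed = True
    return hypothesis in facts


def run_task(source: object, hypothesis: str,
             extract: Callable[[object], Premises],
             infer: Callable[[Premises, str], bool]) -> bool:
    """Extraction followed by inference; the setting decides who does each."""
    return infer(extract(source), hypothesis)


# Clean premises: A holds, A implies B, B implies C.
clean: Premises = (
    frozenset({"A"}),
    ((frozenset({"A"}), "B"), (frozenset({"B"}), "C")),
)
print(run_task(clean, "C", extract=lambda p: p, infer=forward_chain))
```

In this picture, the inference-only setting passes clean premises through an identity `extract` and lets the model play `infer`; the extraction-only setting lets the model play `extract` on the messy text and keeps a trusted solver as `infer`; the joint setting gives both roles to the model.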
In the scatter below, each point is one model, averaged over the selected tracks and tiers: x = extraction (Obf · NS), y = inference (NL · CoT), and marker size = joint (Obf · CoT). A point above the diagonal means the model reasons better than it extracts.
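Reading the scatter amounts to comparing each model's two single-step scores. A minimal sketch, assuming the three accuracies are available per model; `ModelPoint` and `bottleneck` are hypothetical names and the values in the usage lines are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelPoint:
    name: str
    extraction: float  # x axis: Obf · NS accuracy
    inference: float   # y axis: NL · CoT accuracy
    joint: float       # marker size: Obf · CoT accuracy


def bottleneck(p: ModelPoint) -> str:
    """Name the weaker step. Above the diagonal (y > x) the model reasons
    better than it extracts, so extraction is the bottleneck."""
    if p.inference > p.extraction:
        return "extraction"
    if p.inference < p.extraction:
        return "inference"
    return "balanced"


# Hypothetical points, not benchmark results.
above = ModelPoint("model-a", extraction=40.0, inference=70.0, joint=35.0)
below = ModelPoint("model-b", extraction=80.0, inference=55.0, joint=50.0)
print(bottleneck(above), bottleneck(below))
```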