The persistent question haunting the machine learning community is deceptively simple: do large language models genuinely understand mathematics, or are they sophisticated pattern-matching engines that have memorized the statistical regularities of mathematical language? Current benchmarks, from MATH to GSM8K, provide little clarity on this distinction. Models achieve impressive performance on these datasets, yet the evaluation framework itself may be fundamentally flawed: by presenting problems already couched in formal mathematical notation and established symbolic conventions, existing benchmarks allow models to succeed through superficial statistical associations rather than demonstrating the capacity for abstract reasoning. Math Takes Two represents a methodological departure that confronts this ambiguity directly, by eliminating the very scaffolding that has obscured our understanding of machine mathematical cognition.

The core insight motivating this work draws from cognitive science and evolutionary linguistics: mathematical reasoning in humans did not emerge in isolation but co-evolved with the communicative need to share abstract concepts precisely. On this hypothesis, genuine mathematical understanding should manifest when agents face the constraint of developing shared symbolic representations de novo. Rather than providing agents with predefined mathematical syntax or symbolic conventions, Math Takes Two instantiates a scenario in which two neural agents must jointly discover and agree upon numerical abstractions to solve a visually grounded task. The task design is crucial: it poses a problem where numerical reasoning confers a clear advantage for extrapolation beyond the training distribution, creating selective pressure for the emergence of systematic counting or enumeration protocols.

The experimental framework operates as follows: two agents interact in an environment where they must coordinate to solve tasks that fundamentally require quantitative reasoning. Critically, neither agent arrives with preexisting mathematical knowledge or formal notation. Instead, they must develop a shared protocol, a communicative system that encodes numerical information. The visually grounded nature of the task ensures that agents cannot rely purely on linguistic pattern-matching; they must ground their emergent symbolic system in perceptual primitives and then abstract upward to systematic numerical representation. Success metrics extend beyond task performance to ask whether the discovered protocols exhibit properties characteristic of genuine numerical systems: compositionality, systematicity, and generalization to unseen cardinalities. This design circumvents the central confound plaguing traditional benchmarks: high performance through pure memorization of statistical patterns becomes nearly impossible when the agents themselves must invent the language of mathematics.
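To make that setup concrete, the sketch below implements a two-agent signaling game of the kind described above: a sender emits discrete symbols, a receiver decodes them into a cardinality, and the pair is probed on counts never seen in training. This is an illustrative sketch under assumptions, not the benchmark's implementation; the vocabulary size, message length, Gumbel-softmax training trick, and the use of a raw count in place of a visual encoder are all choices made here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 8       # symbols available to the emergent protocol (assumed)
MSG_LEN = 3     # fixed message length (assumed)
MAX_COUNT = 10  # training cardinalities run 1..MAX_COUNT

class Sender(nn.Module):
    """Observes the scene (here a raw count stands in for a visual
    encoder's output) and emits a sequence of discrete symbols."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, MSG_LEN * VOCAB),
        )

    def forward(self, count, tau=1.0):
        logits = self.net(count).view(-1, MSG_LEN, VOCAB)
        # Straight-through Gumbel-softmax keeps symbols discrete in the
        # forward pass while remaining differentiable for training.
        return F.gumbel_softmax(logits, tau=tau, hard=True)

class Receiver(nn.Module):
    """Reads the message and predicts the cardinality it encodes."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(MSG_LEN * VOCAB, hidden), nn.ReLU(),
            nn.Linear(hidden, MAX_COUNT * 3),  # head wide enough for extrapolation probes
        )

    def forward(self, msg):
        return self.net(msg.flatten(1))

sender, receiver = Sender(), Receiver()
opt = torch.optim.Adam(
    list(sender.parameters()) + list(receiver.parameters()), lr=1e-3)

for step in range(2000):
    counts = torch.randint(1, MAX_COUNT + 1, (64,))    # training distribution
    msg = sender(counts.float().unsqueeze(1) / MAX_COUNT)
    loss = F.cross_entropy(receiver(msg), counts - 1)  # classes are 0-indexed
    opt.zero_grad()
    loss.backward()
    opt.step()

# Extrapolation probe: cardinalities the pair never saw during training.
with torch.no_grad():
    held_out = torch.arange(MAX_COUNT + 1, 2 * MAX_COUNT + 1)
    msg = sender(held_out.float().unsqueeze(1) / MAX_COUNT)
    pred = receiver(msg).argmax(dim=-1) + 1
    print("held-out counts:", held_out.tolist())
    print("receiver guess: ", pred.tolist())
```

In the benchmark proper the sender would consume rendered scenes rather than a scalar count, but the experimental shape (discrete messages, a decoding receiver, and a held-out extrapolation probe) is the same.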

This contribution sits at the intersection of several critical research threads in contemporary AI. First, it engages with the emergentist perspective in cognitive science—the notion that complex reasoning capacities arise from simpler communicative and interactive dynamics rather than being hardcoded into model architecture. Second, it relates to recent work on compositional generalization and systematic reasoning, where researchers have demonstrated that standard language models often fail to develop truly compositional understanding despite strong performance on in-distribution benchmarks. Third, it connects to multi-agent communication protocols and emergent language research, where studies have shown that agents can develop sophisticated communicative systems under appropriate constraints. Math Takes Two synthesizes these threads by asking whether mathematical reasoning—perhaps the most abstract and systematic form of human cognition—can emerge through the same principles that govern simpler communicative phenomena.

The benchmark's design choices reflect careful consideration of what constitutes evidence for genuine mathematical reasoning. By requiring agents to discover latent structure without predefined mathematical language, the framework tests whether models can perform the foundational cognitive work of abstraction and systematization. Traditional benchmarks implicitly assume this work has already been done—they provide the abstract categories (numbers, operations, equations) and ask models to manipulate them. Math Takes Two inverts this assumption, asking whether models can perform the abstraction itself. The visually grounded component ensures that reasoning remains tethered to concrete perceptual primitives, mirroring how human mathematical cognition develops from sensorimotor foundations.
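As a concrete picture of what such grounding and extrapolation pressure might look like, here is a minimal data-generation sketch. The renderer, object appearance, and split boundaries (training on cardinalities 1 to 10, evaluating on 11 to 30) are hypothetical choices for illustration; the benchmark's actual stimuli are presumably richer than single pixels.

```python
import numpy as np

def render_scene(count, size=32, rng=None):
    """Place `count` single-pixel 'objects' at distinct random locations,
    yielding a minimal visually grounded counting stimulus."""
    rng = rng if rng is not None else np.random.default_rng()
    img = np.zeros((size, size), dtype=np.float32)
    flat = rng.choice(size * size, size=count, replace=False)
    img[np.unravel_index(flat, img.shape)] = 1.0
    return img

def make_split(cardinalities, n_per_count, seed=0):
    """Render n_per_count scenes for every cardinality in the split."""
    rng = np.random.default_rng(seed)
    return [(render_scene(c, rng=rng), c)
            for c in cardinalities for _ in range(n_per_count)]

train = make_split(range(1, 11), n_per_count=200)  # cardinalities seen in training
test = make_split(range(11, 31), n_per_count=50)   # extrapolation probe only
```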

CuraFeed Take: This work exposes a critical blind spot in how the field evaluates mathematical reasoning. The obsession with benchmark performance metrics has created a false sense of understanding—we've been measuring the wrong thing. Models may achieve 90% accuracy on MATH without ever developing the kind of abstract, systematic numerical cognition that Math Takes Two probes. The real significance here isn't just another benchmark; it's a methodological clarification that will likely reshape how researchers think about evaluating reasoning capabilities more broadly. We should expect to see a bifurcation in the field: models that excel on traditional benchmarks may perform surprisingly poorly on Math Takes Two, revealing that current scaling and training paradigms optimize for pattern-matching over genuine abstraction. The winners will be researchers who recognize that emergent reasoning requires fundamentally different evaluation frameworks—ones that force models to construct meaning rather than retrieve it. Watch particularly for results on extrapolation tasks (cardinalities far beyond the training distribution) and for evidence of whether discovered protocols exhibit the compositional structure characteristic of human numerical systems. If models fail to develop such structure, we'll have strong evidence that current architectures lack something essential for mathematical cognition.
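For readers who want to run that compositionality check themselves, topographic similarity (Brighton & Kirby, 2006) is the standard probe from the emergent-language literature: correlate pairwise distances between meanings with edit distances between the messages that encode them. Whether Math Takes Two reports this exact metric is an assumption on our part; the sketch below shows the general recipe.

```python
from itertools import combinations
from scipy.stats import spearmanr

def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def topographic_similarity(counts, messages):
    """Spearman correlation between pairwise meaning distances
    (|c_i - c_j|) and edit distances between the matching messages."""
    meaning_d, message_d = [], []
    for i, j in combinations(range(len(counts)), 2):
        meaning_d.append(abs(counts[i] - counts[j]))
        message_d.append(edit_distance(messages[i], messages[j]))
    rho, _ = spearmanr(meaning_d, message_d)
    return rho

# Toy check: a unary protocol (count encoded by message length) is
# perfectly topographic, so the score should come out at 1.0.
counts = list(range(1, 8))
unary_messages = [(1,) * c for c in counts]
print(topographic_similarity(counts, unary_messages))
```

A high score indicates that nearby meanings get nearby messages, the kind of systematic structure human numeral systems exhibit; a protocol that works like an arbitrary lookup table scores near zero even when task accuracy is perfect.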
