Hypothesis Evaluation Engine
for Plasma Physics

Define symbolic hypotheses with physical units. Score them against data through a multi-layer pipeline. Record provenance.

Hypothesis Builder

Define symbolic expressions with physical units and parameters. Validate dimensional consistency.

Evaluation Pipeline

Score hypotheses through L0–L4: physics gates, fit quality, robustness, parsimony, epistemic uncertainty.

Falsification

Adversarial testing and degeneracy detection. Find where your hypothesis breaks down.

Experiment Planning

Fisher information gain and cost estimation. Determine optimal next measurements.

Search Generators

Evolutionary and symbolic regression. Discover candidate expressions from data.

Provenance

Browse evaluation history. Every run is recorded with full reproducibility metadata.

LAYERS: L0 Physics / L1 Fit / L2 Robustness / L3 Parsimony / L4 Epistemic
DATA: Ohmic Heating / Spitzer Resistivity / Simple Transport

How It Works

01

Define a Hypothesis

Write a symbolic expression using standard mathematical notation — for example, a * J**2 for ohmic heating power. Assign physical units to each input variable and the output. Declare free parameters with bounds and initial guesses. The system parses your expression into a sympy tree and validates it before anything else runs.
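A minimal sketch of this step, assuming hypothetical parameter/variable declarations (the `parameters` and `variables` structures here are illustrative, not the engine's actual API):

```python
import sympy as sp

# Parse the ohmic-heating expression from the text into a sympy tree.
expr = sp.sympify("a * J**2")

# Illustrative declarations: free parameters with bounds and guesses,
# plus the set of input variables.
parameters = {"a": {"bounds": (0.0, 10.0), "guess": 1.0}}
variables = {"J"}

# Basic validation before anything else runs: every free symbol in the
# expression must be a declared parameter or input variable.
free = {str(s) for s in expr.free_symbols}
undeclared = free - set(parameters) - variables
assert not undeclared, f"undeclared symbols: {undeclared}"

print(expr, "| ops:", sp.count_ops(expr))
```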

02

L0 — Physics Gates

The first evaluation layer is a hard pass/fail gate. The dimensional consistency checker walks your expression tree and propagates physical units through every operation — multiplication combines units, addition requires matching units, exponentiation with a dimensional base demands a numeric exponent, and transcendental functions require dimensionless arguments. If the inferred output dimension doesn’t match your declared output unit, the hypothesis fails immediately and no further layers run. Positivity constraints and limiting-case checks also live here.
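The propagation rules above can be sketched as a small recursive walker. This is an assumed toy implementation, with units represented as dicts of base-dimension exponents (e.g. current density in A/m² is `{"A": 1, "m": -2}`); it covers only the operations named in the text:

```python
import sympy as sp

def infer_units(expr, units):
    """Toy dimensional checker: propagate unit-exponent dicts bottom-up."""
    if expr.is_Number:
        return {}                         # pure numbers are dimensionless
    if expr.is_Symbol:
        return units[str(expr)]
    if expr.is_Mul:                       # multiplication combines units
        out = {}
        for arg in expr.args:
            for dim, p in infer_units(arg, units).items():
                out[dim] = out.get(dim, 0) + p
        return {d: p for d, p in out.items() if p}
    if expr.is_Add:                       # addition requires matching units
        first = infer_units(expr.args[0], units)
        for arg in expr.args[1:]:
            if infer_units(arg, units) != first:
                raise ValueError("inconsistent units in sum")
        return first
    if expr.is_Pow:                       # dimensional base needs numeric exponent
        base_u = infer_units(expr.base, units)
        if base_u and not expr.exp.is_Number:
            raise ValueError("dimensional base with symbolic exponent")
        return {d: p * int(expr.exp) for d, p in base_u.items()}
    raise NotImplementedError(expr.func)  # sketch: transcendentals omitted

expr = sp.sympify("a * J**2")
units = {
    "a": {"kg": 1, "m": 3, "s": -3, "A": -2},  # resistivity, ohm*m in SI base units
    "J": {"A": 1, "m": -2},                    # current density, A/m^2
}
out = infer_units(expr, units)
print(out)  # kg m^-1 s^-3, i.e. W/m^3: a volumetric heating power, as expected
```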

03

L1 — Fit Quality

Your hypothesis is fitted to data using scipy.optimize.curve_fit. The data is split into train and test partitions. The held-out fit scorer measures R² on the test split — if the model can’t explain held-out data above a threshold, it fails. The extrapolation scorer tests predictions on the extreme edges of the data range, catching models that only work in the interpolation regime.
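The held-out scoring loop can be sketched on synthetic data (the split sizes and threshold here are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Synthetic ohmic-heating-like data: P = 2.5 * J**2 plus Gaussian noise.
J = rng.uniform(1.0, 10.0, 200)
P = 2.5 * J**2 + rng.normal(0.0, 1.0, 200)

def model(J, a):
    return a * J**2

# Train/test partitions (75/25 split is an illustrative choice).
idx = rng.permutation(J.size)
train, test = idx[:150], idx[150:]

popt, _ = curve_fit(model, J[train], P[train], p0=[1.0])

# Held-out R^2 on the test partition only.
resid = P[test] - model(J[test], *popt)
r2 = 1.0 - np.sum(resid**2) / np.sum((P[test] - P[test].mean()) ** 2)
print(f"a = {popt[0]:.3f}, held-out R^2 = {r2:.4f}")
```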

04

L2 — Robustness

The noise sensitivity scorer adds Gaussian noise to the input data multiple times, refits the model on each perturbed dataset, and measures the coefficient of variation of the fitted parameters. A model whose parameters swing wildly under small perturbations is fragile. The falsifier scorer can optionally delegate to an adversarial falsifier or degeneracy detector for deeper probing.
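A sketch of the perturb-and-refit loop, assuming a fixed noise scale and repetition count (both illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)

def model(J, a):
    return a * J**2

J = rng.uniform(1.0, 10.0, 100)
P = 2.5 * J**2 + rng.normal(0.0, 0.5, 100)

# Refit under repeated Gaussian perturbations of the targets.
fits = []
for _ in range(50):
    P_noisy = P + rng.normal(0.0, 0.5, P.size)
    popt, _ = curve_fit(model, J, P_noisy, p0=[1.0])
    fits.append(popt[0])

fits = np.array(fits)
cv = fits.std() / abs(fits.mean())  # coefficient of variation of the parameter
print(f"CV(a) = {cv:.4f}")          # small CV -> parameter is robust to noise
```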

05

L3 — Parsimony

Following the minimum description length principle, this layer penalizes complexity. It counts nodes in the sympy expression tree and computes an MDL score relative to a baseline. Simpler hypotheses that explain the data equally well are preferred — Occam’s razor formalized as an information-theoretic quantity.
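The node-counting idea can be sketched directly on sympy trees; the `mdl_score` normalization and baseline value here are illustrative assumptions, not the engine's actual scoring formula:

```python
import sympy as sp

def node_count(expr):
    """Count nodes in a sympy expression tree (operator + leaves)."""
    return 1 + sum(node_count(arg) for arg in expr.args)

def mdl_score(expr, baseline_nodes=10):
    # Illustrative normalization: trees at or below the baseline score 1.0,
    # larger trees are penalized toward 0.
    return min(1.0, baseline_nodes / node_count(expr))

simple = sp.sympify("a * J**2")
complex_ = sp.sympify("a * J**2 + b * sin(c * J) + d * exp(e * J)")
print(node_count(simple), node_count(complex_))
print(mdl_score(simple), mdl_score(complex_))
```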

06

L4 — Epistemic Uncertainty

Bootstrap resampling of the data produces an ensemble of fits. The predictive uncertainty scorer computes the 5th–95th percentile prediction interval across all bootstrap fits and checks whether real data points fall within it. The out-of-distribution detector computes the Mahalanobis distance of each data point from the mean of the input space, flagging points that lie far from the training distribution.
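Both checks can be sketched on synthetic data. The bootstrap count and the z > 3 OOD threshold are illustrative assumptions, and with a single input variable the Mahalanobis distance reduces to a standardized distance from the mean:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)

def model(J, a):
    return a * J**2

J = rng.uniform(1.0, 10.0, 120)
P = 2.5 * J**2 + rng.normal(0.0, 1.0, 120)

# Bootstrap ensemble: resample with replacement, refit, predict.
preds = []
for _ in range(200):
    i = rng.integers(0, J.size, J.size)
    popt, _ = curve_fit(model, J[i], P[i], p0=[1.0])
    preds.append(model(J, *popt))
preds = np.array(preds)

# 5th-95th percentile prediction band across the bootstrap fits.
lo, hi = np.percentile(preds, [5, 95], axis=0)

# 1-D Mahalanobis distance from the input mean; flag far-out points.
z = np.abs(J - J.mean()) / J.std()
ood = z > 3.0
print(f"max band width = {np.max(hi - lo):.2f}, OOD points: {ood.sum()}")
```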

07

Falsify

After the pipeline, you can stress-test surviving hypotheses. The adversarial falsifier uses multi-start L-BFGS-B optimization to search for input conditions that maximize the gap between prediction and nearest data point. The degeneracy detector fits the model from many random initial conditions — if equally good fits produce wildly different parameter values, the model has structural degeneracy. It also checks for overfitting by comparing train/test RMSE ratios.
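The adversarial search can be sketched as multi-start bounded optimization of the prediction-vs-data gap. The data here deliberately deviates from the fitted form at high J; the start count and domain are illustrative:

```python
import numpy as np
from scipy.optimize import minimize, curve_fit

rng = np.random.default_rng(3)

def model(J, a):
    return a * J**2

# Data that secretly contains a cubic term the model cannot capture.
J = rng.uniform(1.0, 10.0, 100)
P = 2.5 * J**2 - 0.5 * J**3 + rng.normal(0.0, 1.0, 100)
popt, _ = curve_fit(model, J, P, p0=[1.0])

def neg_gap(x):
    """Negative gap between prediction and the nearest data point."""
    j = x[0]
    nearest = np.argmin(np.abs(J - j))
    return -abs(model(j, *popt) - P[nearest])

# Multi-start L-BFGS-B over the input domain; keep the worst case found.
best = None
for start in rng.uniform(1.0, 10.0, (8, 1)):
    res = minimize(neg_gap, start, method="L-BFGS-B", bounds=[(1.0, 10.0)])
    if best is None or res.fun < best.fun:
        best = res
print(f"worst-case J = {best.x[0]:.2f}, gap = {-best.fun:.2f}")
```

Because the true data has a cubic term, a quadratic fit is forced to compromise, and the search surfaces the input regions where the mismatch is largest.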

08

Plan Next Experiments

Given a surviving hypothesis, the information gain calculator computes the Fisher Information Matrix at proposed experimental conditions via finite-difference Jacobians. The determinant of the FIM tells you how much new information a measurement at those conditions would provide about the model parameters — D-optimality. The cost model estimates the price of running the experiment so you can optimize the information-to-cost ratio.
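A sketch of the FIM computation for a two-parameter power-law model, assuming unit Gaussian measurement noise (the model form and noise assumption are illustrative):

```python
import numpy as np

def model(J, theta):
    a, b = theta
    return a * J**b

def fim(J, theta, sigma=1.0, eps=1e-6):
    """Fisher Information Matrix via central finite-difference Jacobian."""
    Jac = np.empty((J.size, len(theta)))
    for k in range(len(theta)):
        up, dn = np.array(theta, float), np.array(theta, float)
        up[k] += eps
        dn[k] -= eps
        Jac[:, k] = (model(J, up) - model(J, dn)) / (2 * eps)
    return Jac.T @ Jac / sigma**2  # FIM under i.i.d. Gaussian noise

theta = [2.5, 2.0]

# Compare two candidate designs: clustered vs. spread-out measurements.
low_info = fim(np.array([1.0, 1.1, 1.2]), theta)
high_info = fim(np.array([1.0, 5.0, 10.0]), theta)

# D-optimality: prefer the design with the larger FIM determinant.
print(np.linalg.det(low_info), np.linalg.det(high_info))
```

Measurements spread across the input range constrain both parameters far better than clustered ones, which is exactly what the determinant comparison reveals.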

09

Search for New Hypotheses

If no hypothesis survives, or you want to explore the space, the search generators can propose new candidates. Evolutionary search uses genetic programming to evolve symbolic expression trees, selecting for fitness (R² on data). Symbolic regression via PySR applies a more sophisticated search. The LLM proposer sends a structured prompt with data statistics to a language model and parses the returned expressions. All discovered expressions are automatically converted into Hypothesis objects with extracted parameters.
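As a stand-in for the evolutionary and PySR generators, the fit-and-rank step can be sketched with a hand-written candidate pool; the candidate forms and scoring here are illustrative, not the engine's actual search:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(4)

# Toy candidate pool: propose forms, fit each, rank by R^2 on the data.
candidates = {
    "a*J": lambda J, a: a * J,
    "a*J**2": lambda J, a: a * J**2,
    "a*J**3": lambda J, a: a * J**3,
}

J = rng.uniform(1.0, 10.0, 150)
P = 2.5 * J**2 + rng.normal(0.0, 1.0, 150)

scores = {}
for name, f in candidates.items():
    popt, _ = curve_fit(f, J, P, p0=[1.0])
    resid = P - f(J, *popt)
    scores[name] = 1.0 - np.sum(resid**2) / np.sum((P - P.mean()) ** 2)

best_form = max(scores, key=scores.get)
print(best_form, round(scores[best_form], 4))
```

A real generator mutates and recombines expression trees rather than enumerating a fixed pool, but the selection pressure is the same: fitness measured as R² against the data.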