hanalyze
🌐 English | 日本語

hanalyze is a Haskell-native statistical engineering toolkit: regression, GLMM, Bayesian inference (HMC/NUTS/Gibbs/ADVI), Gaussian processes, design of experiments, multi-objective optimisation, and HTML reporting integrated under one API.
Core modelling and optimisation logic is implemented in Haskell, with numerical linear algebra delegated to hmatrix/BLAS/LAPACK. No R/Stan/Python bridge required.
Benchmarks (see below) show competitive accuracy with Python/R references in the tested cases. Performance varies by domain: optimisation and small-to-medium MCMC workloads are often faster in these benchmarks, while large-scale ML/GLM workloads are currently slower than sklearn.
Highlights
- Haskell-native: types catch many dtype/API mismatches; shape checks happen at runtime where needed
- Algorithms in Haskell, BLAS for numerics: hmatrix/BLAS/LAPACK powers linear algebra; no R/Stan/Python bridge
- HTML reporting: MathJax/Mermaid + Vega-Lite visualisations in one call; PNG/SVG export available for supported plots
- Dirty-data defence: 8 warning codes + auto-sniff (delim/header/encoding) + cleaning DSL
- Hackage dataframe: Polars-like DataFrame used directly; CSV native, Parquet/JSON support through dataframe
Capabilities
Features grouped by category. Each capability links to a usage doc and (where relevant) a theory doc.
Statistical inference (Hanalyze.Stat.*)
| Feature | Module | Usage | Theory |
|---|---|---|---|
| 12 hypothesis tests (t/χ²/ANOVA/Wilcoxon/KS/Shapiro/Levene/Bartlett/...) | Hanalyze.Stat.Test | stat/01-test.md | - |
| Multiple-testing correction (Bonferroni/Holm/BH/BY) | Hanalyze.Stat.MultipleTesting | stat/06-multipletesting.md | - |
| Bootstrap CI / permutation tests | Hanalyze.Stat.Bootstrap | stat/07-bootstrap.md | - |
| Effect size + power analysis (Cohen's d/η²/Cramér V/n estimation) | Hanalyze.Stat.Effect | stat/09-effect.md | - |
| Cross-validation (k-fold/stratified/LOO) + Grid search | Hanalyze.Stat.CV | stat/04-cv.md | - |
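As a rough illustration of what the multiple-testing corrections listed above compute, the sketch below implements Bonferroni, Holm, and Benjamini-Hochberg adjusted p-values on plain lists. It is not hanalyze's implementation: the names `bonferroni`, `holm`, `benjaminiHochberg`, and `viaSorted` are hypothetical and say nothing about the actual Hanalyze.Stat.MultipleTesting API.

```haskell
import Data.List (sortOn)

-- Hypothetical helper names; not the Hanalyze.Stat.MultipleTesting API.

-- | Bonferroni: multiply every p-value by the number of tests, cap at 1.
bonferroni :: [Double] -> [Double]
bonferroni ps = map (\p -> min 1 (p * m)) ps
  where
    m = fromIntegral (length ps)

-- | Run a step-wise adjustment on the ascending p-values,
-- then put the results back into the caller's original order.
viaSorted :: ([Double] -> [Double]) -> [Double] -> [Double]
viaSorted adjust ps = map snd (sortOn fst (zip order (adjust sorted)))
  where
    (order, sorted) = unzip (sortOn snd (zip [0 :: Int ..] ps))

-- | Holm step-down: scale the i-th smallest p by (m - i + 1),
-- then take a running maximum so adjusted values never decrease.
holm :: [Double] -> [Double]
holm ps = viaSorted step ps
  where
    m = fromIntegral (length ps)
    step sorted =
      scanl1 max (zipWith (\i p -> min 1 ((m - i + 1) * p)) [1 ..] sorted)

-- | Benjamini-Hochberg step-up: scale the i-th smallest p by m / i,
-- then take a running minimum from the largest rank downwards.
benjaminiHochberg :: [Double] -> [Double]
benjaminiHochberg ps = viaSorted step ps
  where
    m = fromIntegral (length ps)
    step sorted =
      scanr1 min (zipWith (\i p -> min 1 (p * m / i)) [1 ..] sorted)

main :: IO ()
main = do
  let ps = [0.01, 0.04, 0.03, 0.005]  -- made-up p-values
  print (bonferroni ps)
  print (holm ps)
  print (benjaminiHochberg ps)
```

Bonferroni and Holm control the family-wise error rate, while Benjamini-Hochberg controls the false discovery rate, so the three generally return different adjusted values for the same input.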
Regression (Hanalyze.Model.*)
Machine learning (Hanalyze.Model.* / Hanalyze.Stat.*)
Bayesian (Hanalyze.MCMC.* / Hanalyze.Stat.* / Hanalyze.Model.HBM)
Optimisation (Hanalyze.Optim.*)
Design of experiments (Hanalyze.Design.*)
| Feature | Module | Usage | Theory |
|---|---|---|---|
| DoE (Factorial / Block / Mixed / RSM / Optimal / Power / Quality) | Hanalyze.Design.{Factorial,Block,Mixed,RSM,Optimal,Power,Quality,MultiRSM,Anova} | doe/01-doe.md | doe/theory-doe.md |
| Orthogonal arrays (L4/L8/L9/L12/L16/L18) + Taguchi (S/N + inner/outer) + process capability (Cp/Cpk) | Hanalyze.Design.{Orthogonal,Taguchi,Quality} | doe/02-orthogonal-taguchi.md | doe/theory-doe.md |
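For readers unfamiliar with the quantities in the second row, here is a minimal sketch of the textbook Taguchi signal-to-noise ratios and the Cp/Cpk process-capability indices. The function names are hypothetical and are not the Hanalyze.Design API; the nominal-the-best ratio has several variants in the literature, and the mean²/variance form is assumed here.

```haskell
-- Hypothetical helper names; not the Hanalyze.Design.{Taguchi,Quality} API.

mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- | Unbiased sample variance (divides by n - 1).
variance :: [Double] -> Double
variance xs = sum [(x - mu) * (x - mu) | x <- xs] / fromIntegral (length xs - 1)
  where
    mu = mean xs

-- | Smaller-the-better S/N ratio in dB: -10 log10 (mean of y^2).
snSmaller :: [Double] -> Double
snSmaller ys = -10 * logBase 10 (mean (map (\y -> y * y) ys))

-- | Larger-the-better S/N ratio in dB: -10 log10 (mean of 1 / y^2).
snLarger :: [Double] -> Double
snLarger ys = -10 * logBase 10 (mean (map (\y -> 1 / (y * y)) ys))

-- | Nominal-the-best S/N ratio in dB: 10 log10 (mean^2 / variance).
snNominal :: [Double] -> Double
snNominal ys = 10 * logBase 10 (mean ys * mean ys / variance ys)

-- | Cp compares the spec width with 6 sigma and ignores centring.
cp :: Double -> Double -> [Double] -> Double
cp lsl usl ys = (usl - lsl) / (6 * sqrt (variance ys))

-- | Cpk uses the distance from the mean to the nearer spec limit.
cpk :: Double -> Double -> [Double] -> Double
cpk lsl usl ys = min (usl - mu) (mu - lsl) / (3 * sqrt (variance ys))
  where
    mu = mean ys

main :: IO ()
main = do
  let ys = [9.8, 10.1, 10.0, 10.3, 9.9]  -- made-up measurements
  print (snSmaller ys, snLarger ys, snNominal ys)
  print (cp 9.0 11.0 ys, cpk 9.0 11.0 ys)
```

Cpk equals Cp only when the process mean sits exactly midway between the specification limits; otherwise it is smaller.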
Visualisation (Hanalyze.Viz.*)
Data I/O (Hanalyze.DataIO.*)
| Feature | Module | Usage |
|---|---|---|
| CSV/TSV/SSV (cassava) + Parquet/JSON (Hackage dataframe) | Hanalyze.DataIO.{CSV,External,Convert} | io/01-dirty-data.md |
| Dirty-data defence (W001-W008 warnings + auto-sniff + clean DSL) | Hanalyze.DataIO.{Health,Sniff,Clean,Log} | io/01-dirty-data.md |
| Reshape (pivot_wider / one-hot / lag-lead / rolling window) | Hanalyze.DataIO.Reshape | io/02-reshape.md |
| Preprocessing (impute / groupBy / derived columns / melt) | Hanalyze.DataIO.Preprocess | io/01-dirty-data.md |
| Long-form regrid (regridLong) | Hanalyze.DataIO.Preprocess + Hanalyze.Stat.Interpolate | io/03-regrid.md |
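To give a feel for what delimiter auto-sniffing involves, the sketch below shows one simple heuristic: pick the candidate delimiter whose per-line count is most consistent with the first line. This is an assumption for illustration only; Hanalyze.DataIO.Sniff may use a different algorithm, and `candidates`, `score`, and `sniffDelimiter` are hypothetical names. Quoted fields are deliberately ignored to keep the example short.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Hypothetical heuristic; Hanalyze.DataIO.Sniff may work differently.

-- | Candidate delimiters: comma, tab, semicolon (CSV/TSV/SSV).
candidates :: [Char]
candidates = ",\t;"

-- | Number of lines whose delimiter count matches the first line's count.
-- Scores 0 when the first line does not contain the delimiter at all.
score :: [String] -> Char -> Int
score [] _ = 0
score ls@(first : _) d
  | n == 0 = 0
  | otherwise = length (filter ((== n) . count) ls)
  where
    count = length . filter (== d)
    n = count first

-- | Best-scoring delimiter, or Nothing when no candidate appears.
sniffDelimiter :: String -> Maybe Char
sniffDelimiter txt
  | null ls || best == 0 = Nothing
  | otherwise = Just bestD
  where
    ls = filter (not . null) (lines txt)
    (bestD, best) =
      maximumBy (comparing snd) [(d, score ls d) | d <- candidates]

main :: IO ()
main = do
  print (sniffDelimiter "price;promo;sales\n10;1;140\n12;0;98\n")  -- Just ';'
  print (sniffDelimiter "no delimiters here\n")                    -- Nothing
```

A production sniffer would also need to handle quoted fields, header detection, and encoding, which this sketch leaves out.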
Quick start
30 seconds via CLI
git clone https://github.com/frenzieddoll/hanalyze
cd hanalyze
cabal build all
# Regress sales on price + promo, write an HTML report.
hanalyze regress data/readme/sales.csv "price promo" sales --report sales.html
# β₀=185.05 β(price)=-4.37 β(promo)=+32.29 R²=0.995
data/readme/sales.csv is a 20-row demo CSV shipped with the repository
(price, promo, sales). The generated sales.html includes coefficients,
fit diagnostics, and an interactive prediction widget, all from a single
command.
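The coefficients and R² printed by the command above are the usual outputs of an ordinary least-squares fit. The sketch below shows that computation directly with hmatrix, assuming the library is available. It is not the code path the CLI uses, `ols` is a hypothetical name, and the data is a made-up toy set (not data/readme/sales.csv) built so that y = 1 + 2·x₁ + 3·x₂ exactly.

```haskell
import qualified Numeric.LinearAlgebra as LA

-- | Ordinary least squares with an intercept: returns (coefficients, R^2).
-- Each inner list of xs holds the predictors for one observation.
-- Hypothetical helper; not a hanalyze function.
ols :: [[Double]] -> [Double] -> ([Double], Double)
ols xs ys = (LA.toList beta, 1 - ssRes / ssTot)
  where
    design   = LA.fromLists (map (1 :) xs)  -- prepend the intercept column
    y        = LA.fromList ys
    beta     = design LA.<\> y              -- least-squares solve
    resid    = y - design LA.#> beta        -- residual vector
    yBar     = LA.sumElements y / fromIntegral (LA.size y)
    centered = LA.cmap (subtract yBar) y
    ssRes    = LA.sumElements (resid * resid)
    ssTot    = LA.sumElements (centered * centered)

main :: IO ()
main = do
  -- Toy data (made up): y = 1 + 2*x1 + 3*x2 with no noise.
  let xs = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 1]]
      ys = [3, 8, 7, 12, 14]
      (coeffs, r2) = ols xs ys
  putStrLn ("coefficients (intercept first): " ++ show coeffs)
  putStrLn ("R^2: " ++ show r2)
```

Because the toy response is noise-free, the recovered coefficients should be close to 1, 2, and 3 and R² close to 1, up to floating-point error.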
30 seconds via Haskell API
import qualified Stat.Test as ST
import qualified Numeric.LinearAlgebra as LA

main :: IO ()
main = do
  let xs = LA.fromList [12, 14, 13, 15, 17, 11]
      ys = LA.fromList [18, 22, 20, 19, 25, 17]
      result = ST.tTestWelch xs ys ST.TwoSided
  -- Prints (p-value, effect size), e.g. (0.012, Just ("Cohen's d", -1.85))
  print (ST.trPValue result, ST.trEffect result)
See docs/01-quickstart.md for a fuller introduction.
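For context on what a Welch test reports, here is a minimal standalone sketch of the test statistic, the Welch-Satterthwaite degrees of freedom, and Cohen's d with a pooled standard deviation. It uses plain lists and made-up data. The names `welch` and `cohenD` are hypothetical, and hanalyze's own effect-size convention may differ from the pooled form assumed here.

```haskell
-- Hypothetical helper names; these are not hanalyze functions.

size :: [Double] -> Double
size = fromIntegral . length

mean :: [Double] -> Double
mean xs = sum xs / size xs

-- | Unbiased sample variance (divides by n - 1).
variance :: [Double] -> Double
variance xs = sum [(x - mu) * (x - mu) | x <- xs] / (size xs - 1)
  where
    mu = mean xs

-- | Welch's t statistic and Welch-Satterthwaite degrees of freedom.
welch :: [Double] -> [Double] -> (Double, Double)
welch xs ys = (t, df)
  where
    a  = variance xs / size xs   -- squared standard error, sample 1
    b  = variance ys / size ys   -- squared standard error, sample 2
    t  = (mean xs - mean ys) / sqrt (a + b)
    df = (a + b) * (a + b) / (a * a / (size xs - 1) + b * b / (size ys - 1))

-- | Cohen's d using the pooled standard deviation.
cohenD :: [Double] -> [Double] -> Double
cohenD xs ys = (mean xs - mean ys) / sqrt pooled
  where
    pooled =
      ((size xs - 1) * variance xs + (size ys - 1) * variance ys)
        / (size xs + size ys - 2)

main :: IO ()
main = do
  let xs = [1, 2, 3, 4, 5]   -- made-up samples
      ys = [2, 4, 6, 8, 10]
  print (welch xs ys)
  print (cohenD xs ys)
```

The p-value follows from the Student-t distribution with `df` degrees of freedom, which needs a distribution library and is therefore left out of this sketch.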
CLI
hanalyze help                        list subcommands
hanalyze regress <file> <x> <y>      LM/GLM/GP/HBM regression + HTML report
hanalyze info <file>                 per-column type/statistics
hanalyze hist <file> <col>           histogram with theoretical PDF overlay
hanalyze ridge <file> ...            regularised regression (Ridge/Lasso/EN)
hanalyze kernel <file> ...           kernel regression (NW/KR/RFF), multi-D inputs
hanalyze spline <file> ...           spline regression
hanalyze multireg <file> ...         multi-output regression + interactive HTML
hanalyze melt <file> ...             long-form transform
hanalyze regrid <file> ...           time-axis grid alignment
hanalyze doe ortho <NAME> -f ...     orthogonal-array generation
hanalyze taguchi sn / analyze        Taguchi method
hanalyze clean <file> --rule ...     dirty-data cleaning
For per-command flags, run hanalyze <cmd> --help or see docs/01-quickstart.md.
Examples / demos
demo/ contains many demos (60+ as of this release). Highlights:
| Demo | Summary |
|---|---|
| demo/regression/HBMRegressionDemo.hs | HBM Bayesian linear regression with NUTS + HTML |
| demo/regression/RFFDemo.hs | Large-scale GP via Random Fourier Features |
| demo/regression/RobustGPDemo.hs | Robust GP with Student-t observation likelihood |
| demo/doe-optim/NSGADemo.hs | NSGA-II + Pareto on the ZDT suite |
| demo/doe-optim/BayesOptDemo.hs | BO on Branin / Hartmann6 |
| demo/bayesian/HBMComparisonDemo.hs | Compare HBMs with WAIC / LOO |
| demo/bayesian/SimpsonParadoxDemo.hs | Disentangle Simpson's paradox via hierarchical model |
| demo/io/DirtyDataDemo.hs | Auto-defend against 19 dirty CSV variants |
Run: dist-newstyle/build/x86_64-linux/ghc-9.6.7/hanalyze-0.1.0.0/x/<demo-name>/build/<demo-name>/<demo-name>.
Where hanalyze fits
Rather than a complete Python/R replacement, hanalyze targets specific
workflows where Haskell integration, single-binary CLI, and tight reporting
add value.
Strong fit
- Haskell-native pipelines that need stats/Bayes/optim without calling out to Python
- Single-binary CLI distribution (one hanalyze binary, no Python venv)
- Dirty-CSV defence + cleaning + analysis in one workflow
- DoE / Taguchi / orthogonal arrays for manufacturing and process tuning
- HTML reports straight from the analysis (no separate templating step)
- Type-safe analysis pipelines that catch dtype/API mismatches early
Not a goal: keep using existing tools for
- Large-scale DataFrame work (pandas / polars / data.table)
- GPU deep learning (PyTorch / JAX)
- The full breadth of scikit-learn's mature model zoo
- The full Stan / PyMC MCMC diagnostics ecosystem
- The full expressive range of ggplot2
Comparison vs Python
R is included in the feature map only β no numerical bench against R has been run.
Numbers below come from bench/results/{haskell,python}/*.csv; see
bench/results/SUMMARY.md for the full table and
benchmark conditions (OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1,
single-thread, deterministic seeds).
| Domain | Result in these benchmarks |
|---|---|
| Single-objective optim (DE/CMAES/L-BFGS/NM) | Often faster than scipy in tested cases (Rosenbrock_2D/DE 134×, Ackley/CMAES 49×, Griewank/CMAES 54×). On Sphere_30D/L-BFGS the reported objective value is 8.1e-40 vs scipy 2.6e-11 in this run. |
| Multi-objective optim (NSGA-II) | Comparable or favourable in the ZDT/DTLZ suite (DTLZ2_3 1.43× faster, ZDT1/2/3 within ±5% of pymoo). HV/IGD figures match or slightly improve on pymoo in these runs. |
| Bayesian optim (BO) | Comparable on Branin (1.15×); on Hartmann6 the best objective in this run was -3.07 vs skopt -2.77. |
| Simulated annealing (Tsallis SA) | Comparable; Rastrigin_10D reaches 0.0 in this run (scipy dual_annealing reports 7.8e-14). |
| Classical regression (LM/Ridge/Lasso/GLMM) | Comparable in tested cases; LME 30× faster than statsmodels in our LME run. |
| Large-scale GLM/Lasso (n ≥ 10k) | Currently slower than sklearn (3-5× in tested cases); sklearn's Cython inner loops dominate. |
| Kernel/GP | Currently slower than sklearn (2.5-4.7× in tested cases). |
| Bayesian MCMC (NUTS/HMC) | NUTS with ESS comparable to blackjax (mu: 839 vs 810) on the 8-schools benchmark; 7.4× faster than PyMC; 2.8× slower than blackjax (JAX-JIT advantage). |
| HBM (probabilistic programming) | Polymorphic DSL with selected PyMC-style modelling features and selected distributions (Truncated/Censored/MvNormal/LKJ/...). |
| VI / WAIC / LOO | ADVI 3.0× faster than numpyro SVI on a small logistic posterior; LOO 2.9× faster than arviz on (S=1000, N=200) log-lik matrix. |
| Hypothesis tests / bootstrap / k-fold | Welch t-test 39× faster, KS 11×, k-fold split 2.2× faster than scipy/sklearn in tested cases. |
| Time series / Spline / GAM | ARIMA 128× faster than statsmodels; Spline PCHIP comparable to scipy; GAM ~1.6× slower than pygam in tested cases. |
| Survival analysis (KM/Cox PH) | Comparable to lifelines in tested cases (KM/CoxPH). |
| Multi-output regression / Regrid | MultiLM 2.3× faster than sklearn; regridLong 20× faster than a hand-written pandas+scipy synthesis. |
| Visualisation | Vega-Lite specs via hvega (grammar-of-graphics-style); HTML reports built-in. |
See docs/comparison/python-r.md for the feature map, and bench/results/SUMMARY.md for numbers.
Benchmark highlights
Selected results from bench/results/SUMMARY.md. Each entry is a single
benchmark configuration; absolute objective values depend on iteration
counts, seeds, and tolerances; see the SUMMARY for full conditions.
- NUTS 8-schools (warmup 500, samples 1000): hanalyze 1492 ms with ESS(mu) 839 vs blackjax 530 ms / ESS 810 in this run
- Holt-Winters seasonal n=500 p=12: hanalyze 0.19 ms vs statsmodels MLE 96 ms in this run (note: hanalyze uses fixed α=0.3 closed-form; statsmodels does MLE)
- Sphere_30D/DE: hanalyze 1.0e-26 vs scipy 2.8e-5 on this benchmark
- Sphere_30D/L-BFGS: hanalyze 8.1e-40 vs scipy 2.6e-11 on this benchmark
- Rastrigin_10D/SA: hanalyze 0.0 vs scipy dual_annealing 7.8e-14 in this run
- Hartmann6/BO: hanalyze -3.07 vs skopt -2.77 in this run
- DTLZ2_3/NSGA-II: hanalyze 528 ms vs pymoo 758 ms (1.43× faster in this run)
- DE Rosenbrock_2D: hanalyze 1.2 ms vs scipy 164 ms (134× faster in this run)
- Constrained Quad2D (eq): hanalyze 0.062 ms vs scipy SLSQP 0.69 ms in this run
- regridLong on jagged long-form: hanalyze 0.99 ms vs pandas+scipy synthesis 19.4 ms in this run
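Reproduce: with OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 set, run cabal run bench-<name> once for each of regression, kernel, optim, mo, bo, mcmc-b7, mcmc-extras, ts-extras, optim-plus, stat-util, multi-output, regrid (cabal run accepts one target per invocation), then the matching bench/python/bench_*.py scripts (see bench/README.md).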
Architecture
graph TD
IO[DataIO.* CSV/Parquet/JSON]
IO --> DF[Hackage dataframe]
DF --> Models[Model.* regression/ML/Bayesian/TS/Survival]
DF --> Stat[Stat.* tests/CV/effect/interpret]
Models --> Optim[Optim.* optimisation]
Models --> MCMC[MCMC.* samplers]
Models --> Viz[Viz.* HTML/PNG/SVG]
Stat --> Viz
MCMC --> Viz
Optim --> Design[Design.* DoE/Taguchi]
All modules talk to Hackage dataframe directly. The internal DataFrame.Core was retired.
Roadmap & API stability
- Stable (API expected to remain backward-compatible within minor versions): Hanalyze.DataIO.*, Hanalyze.Stat.{Test, Bootstrap, MultipleTesting, ClassMetrics, CV, Effect, Distribution}, Hanalyze.Model.{LM, GLM, Spline, Regularized, RandomForest, DecisionTree, TimeSeries, Survival, GAM}, Hanalyze.Optim.{NelderMead, LBFGS, DifferentialEvolution, CMAES, NSGA, BayesOpt, SimulatedAnnealing, ParticleSwarm}, Hanalyze.Design.*, Hanalyze.Viz.{Scatter, Bar, Histogram}.
- Experimental (API may evolve): Hanalyze.Model.HBM DSL, Hanalyze.MCMC.NUTS (mass-matrix adaptation is opt-in), Hanalyze.Stat.VI (ADVI), Hanalyze.Model.{GP, RFF, GPRobust, GLMM}, Hanalyze.Viz.ReportBuilder. Behaviour is benchmarked but type signatures may shift.
- Future direction: a unified top-level Hanalyze.* re-export layer, a Pipeline-style Unfitted → Fitted API, and a backend-abstraction typeclass for swapping hmatrix/Massiv/Accelerate are under consideration but not on a fixed schedule.
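One common way to express an Unfitted → Fitted distinction in Haskell is a phantom type parameter that tracks the model's stage, so that predicting from an unfitted model is rejected at compile time. The sketch below illustrates that general technique on a toy model. It is purely hypothetical: `Stage`, `MeanModel`, `newModel`, `fit`, and `predict` are invented for this example and do not describe a planned or existing hanalyze API.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}

-- Hypothetical illustration only; none of these names exist in hanalyze.

-- | Lifecycle stage, used only at the type level.
data Stage = Unfitted | Fitted

-- | A toy model that predicts the mean of its training targets.
-- The phantom parameter records whether it has been fitted.
newtype MeanModel (s :: Stage) = MeanModel Double

-- | A fresh model carries a placeholder value and the 'Unfitted tag.
newModel :: MeanModel 'Unfitted
newModel = MeanModel 0

-- | Fitting consumes an unfitted model and returns a fitted one.
-- An empty training list yields NaN in this sketch.
fit :: [Double] -> MeanModel 'Unfitted -> MeanModel 'Fitted
fit ys _ = MeanModel (sum ys / fromIntegral (length ys))

-- | Prediction is only defined for fitted models.
predict :: MeanModel 'Fitted -> Double
predict (MeanModel m) = m

main :: IO ()
main = do
  let model = fit [1, 2, 3] newModel
  print (predict model)  -- prints 2.0
  -- print (predict newModel)  -- would not compile: 'Unfitted is not 'Fitted
```

The stage tag exists only in the types, so this encoding adds no runtime cost.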
Module layout
src/
  DataIO/ - CSV/JSON/Parquet IO + health checks + sniff + clean DSL + reshape (9 mods)
  Stat/   - tests/distributions/interpolation/effect/CV/bootstrap/interpret etc. (21 mods)
  Model/  - LM/GLM/GLMM/Spline/Kernel/GP/RFF/HBM/PCA/Cluster/Tree/TS/Survival (23 mods)
  Optim/  - single-obj (NM/LBFGS/DE/CMAES/SA/PSO) + multi-obj (NSGA/BO/Pareto) (18 mods)
  Design/ - Factorial/Block/RSM/Optimal/Orthogonal/Taguchi (11 mods)
  Viz/    - Vega-Lite-based visualisation + ReportBuilder (15 mods)
  MCMC/   - MH/HMC/NUTS/Gibbs/Slice (6 mods)
As of this release: 103 modules, 238 tests.
Build
cabal build all # library + all executables (60+ demos)
cabal test # hspec test suite
cabal repl # interactive REPL
Major dependencies: hmatrix (BLAS/LAPACK), hvega (Vega-Lite), statistics, mwc-random, dataframe (Hackage Polars-like), massiv (parallel arrays), ad (auto-diff), async.
Tested on GHC 9.6.7 + cabal 3.14.2.
Running benchmarks
# 1. Generate shared test data (fixed-seed, deterministic)
cabal run bench-data-gen
# 2. Haskell side (cabal run takes one target per invocation, so loop)
for b in regression kernel optim mo bo; do
  OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 cabal run "bench-$b"
done
# 3. Python side (need bench/venv from bench/requirements.txt)
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 \
bench/venv/bin/python bench/python/bench_regression.py
# (similarly for kernel, optim, mo, bo)
# 4. Aggregate (Markdown table)
bench/venv/bin/python bench/aggregate.py > bench/results/SUMMARY.md
Development
- Issues / PRs: github.com/frenzieddoll/hanalyze
- Adding tests: append hspec specs in test/Spec.hs
- Adding benchmarks: place bench/haskell/Bench*.hs and a matching Python script
- Coding rules: see CONTRIBUTING.md (no list-passing on hot paths, minimise unsafe*, ...)
License
BSD-3-Clause License; see LICENSE.
Author
Toshiaki Honda frenzieddoll@gmail.com