dataframe: A fast, safe, and intuitive DataFrame library.

[ data, library, mit, program ]

A fast, safe, and intuitive DataFrame library for exploratory data analysis.

library dataframe

Modules

DataFrame
- DataFrame.DecisionTree
- DataFrame.Display
  - Terminal
    - DataFrame.Display.Terminal.Colours
    - DataFrame.Display.Terminal.Plot
    - DataFrame.Display.Terminal.PrettyPrint
  - Web
    - DataFrame.Display.Web.Plot
- DataFrame.Errors
- DataFrame.Functions
- IO
  - DataFrame.IO.CSV
  - DataFrame.IO.JSON
  - DataFrame.IO.Parquet
    - DataFrame.IO.Parquet.Binary
    - DataFrame.IO.Parquet.Decompress
    - DataFrame.IO.Parquet.Dictionary
    - DataFrame.IO.Parquet.Encoding
    - DataFrame.IO.Parquet.Levels
    - DataFrame.IO.Parquet.Page
    - DataFrame.IO.Parquet.Schema
    - DataFrame.IO.Parquet.Seeking
    - DataFrame.IO.Parquet.Thrift
    - DataFrame.IO.Parquet.Time
    - DataFrame.IO.Parquet.Utils
  - Utils
    - DataFrame.IO.Utils.RandomAccess
- Internal
  - DataFrame.Internal.Binary
  - DataFrame.Internal.Column
  - DataFrame.Internal.DataFrame
  - DataFrame.Internal.Expression
  - DataFrame.Internal.Grouping
  - DataFrame.Internal.Interpreter
  - DataFrame.Internal.Nullable
  - DataFrame.Internal.Parsing
  - DataFrame.Internal.Row
  - DataFrame.Internal.Schema
  - DataFrame.Internal.Statistics
  - DataFrame.Internal.Types
- DataFrame.Lazy
  - IO
    - DataFrame.Lazy.IO.Binary
    - DataFrame.Lazy.IO.CSV
  - Internal
    - DataFrame.Lazy.Internal.DataFrame
    - DataFrame.Lazy.Internal.Executor
    - DataFrame.Lazy.Internal.LogicalPlan
    - DataFrame.Lazy.Internal.Optimizer
    - DataFrame.Lazy.Internal.PhysicalPlan
- DataFrame.Monad
- Operations
  - DataFrame.Operations.Aggregation
  - DataFrame.Operations.Core
  - DataFrame.Operations.Join
  - DataFrame.Operations.Merge
  - DataFrame.Operations.Permutation
  - DataFrame.Operations.Statistics
  - DataFrame.Operations.Subset
  - DataFrame.Operations.Transformations
  - DataFrame.Operations.Typing
- DataFrame.Operators
- DataFrame.Synthesis
- DataFrame.TH
- DataFrame.Typed
  - DataFrame.Typed.Access
  - DataFrame.Typed.Aggregate
  - DataFrame.Typed.Expr
  - DataFrame.Typed.Freeze
  - DataFrame.Typed.Generic
  - DataFrame.Typed.Join
  - DataFrame.Typed.Lazy
  - DataFrame.Typed.Operations
  - DataFrame.Typed.Record
  - DataFrame.Typed.Schema
  - DataFrame.Typed.TH
  - DataFrame.Typed.Types

library dataframe:arrow-bridge

Modules

DataFrame
- IO
  - DataFrame.IO.Arrow
- DataFrame.IR
  - DataFrame.IR.ExprJson

Flags

Manual Flags

Name	Description	Default
no-csv	Exclude the CSV reader/writer (`dataframe-csv`). Enable with `-f +no-csv` (or `flags: no-csv` in cabal.project) to trim the dep set of the meta package when CSV is not needed.	Disabled
no-parquet	Exclude the Parquet reader/writer (`dataframe-parquet` plus pinch, zstd, snappy, streamly, http-conduit). Enable with `-f +no-parquet`.	Disabled
no-th	Exclude the Template Haskell splices (`dataframe-th`, `dataframe-csv-th`, `dataframe-parquet-th`). Enable with `-f +no-th` to drop the `template-haskell` dep for downstream packages that don't use compile-time schema derivation.	Disabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.

Downloads

dataframe-2.1.0.0.tar.gz [browse] (Cabal source package)
Package description (revised from the package)

Note: This package has metadata revisions in the cabal description newer than included in the tarball. To unpack the package including the revisions, use 'cabal get'.

Maintainer's Corner

Package maintainers

mchav

Versions [RSS]	0.1.0.0, 0.1.0.1, 0.1.0.2, 0.1.0.3, 0.2.0.0, 0.2.0.1, 0.2.0.2, 0.3.0.0, 0.3.0.1, 0.3.0.2, 0.3.0.3, 0.3.0.4, 0.3.1.1, 0.3.1.2, 0.3.2.0, 0.3.3.0, 0.3.3.1, 0.3.3.2, 0.3.3.3, 0.3.3.4, 0.3.3.5, 0.3.3.6, 0.3.3.7, 0.3.3.8, 0.3.3.9, 0.3.4.0, 0.3.4.1, 0.3.5.0, 0.4.0.0, 0.4.0.2, 0.4.0.3, 0.4.0.4, 0.4.0.5, 0.4.0.6, 0.4.0.7, 0.4.0.8, 0.4.0.9, 0.4.0.10, 0.4.1.0, 0.5.0.0, 0.5.0.1, 0.6.0.0, 0.7.0.0, 1.0.0.0, 1.0.0.1, 1.1.0.0, 1.1.1.0, 1.1.2.0, 1.1.2.1, 1.2.0.0, 1.3.0.0, 2.0.0.0, 2.1.0.0, 2.1.0.1, 2.1.0.2
Change log	CHANGELOG.md
Dependencies	aeson (>=0.11 && <3), base (<0), bytestring (>=0.11 && <0.13), containers (>=0.6.7 && <0.9), dataframe (>=1 && <3), dataframe-core (>=1.0 && <1.1), dataframe-csv (>=1.0 && <1.1), dataframe-csv-th (>=1.0 && <1.1), dataframe-json (>=1.0 && <1.1), dataframe-lazy (>=1.0 && <1.1), dataframe-learn (>=1.0 && <1.1), dataframe-operations (>=1.0 && <1.1), dataframe-parquet (>=1.0 && <1.1), dataframe-parquet-th (>=1.0 && <1.1), dataframe-parsing (>=1.0 && <1.1), dataframe-th (>=1.0 && <1.1), dataframe-viz (>=1.0 && <1.1), directory (>=1.3.0.0 && <2), filepath (>=1.4 && <2), process (>=1.6 && <2), random (>=1 && <2), text (>=2.0 && <3), time (>=1.12 && <2), unix (>=2 && <3), vector (>=0.13 && <0.14) [details]
Tested with	ghc ==9.4.8 || ==9.6.7 || ==9.8.4 || ==9.10.3 || ==9.12.2
License	MIT
Copyright	(c) 2024-2025 Michael Chavinda
Author	Michael Chavinda
Maintainer	mschavinda@gmail.com
Uploaded	by mchav at 2026-05-16T05:45:16Z
Revised	Revision 1 made by mchav at 2026-06-07T22:26:15Z
Category	Data
Bug tracker	https://github.com/mchav/dataframe/issues
Source repo	head: git clone https://github.com/mchav/dataframe
Distributions	Stackage:2.1.0.2
Reverse Dependencies	6 direct, 0 indirect [details]
Executables	lazy-bench, dataframe, synthesis, dataframe-benchmark-example
Downloads	1494 total (298 in the last 30 days)
Status	Docs available [build log] Last success reported on 2026-05-16 [all 1 reports]

Readme for dataframe-2.1.0.0

User guide | Discord

DataFrame

Tabular data analysis in Haskell. Read CSV, Parquet, and JSON files, transform columns with a typed expression DSL, and optionally lock down your entire schema at the type level for compile-time safety.

The library ships three API layers — all operating on the same underlying DataFrame type at runtime:

- Untyped (import qualified DataFrame as D): string-based column names, great for exploration and scripting.
- Typed (import qualified DataFrame.Typed as T): phantom-type schema tracking with compile-time column validation.
- Monadic API: write your transformation as a self-contained pipeline.

Why this library?

- Concise, declarative, composable data pipelines using the |> pipe operator.
- Choose your level of type safety: keep it lightweight for quick analysis, or lock it down for production pipelines.
- High performance from Haskell's optimizing compiler and an efficient columnar memory model with bitmap-backed nullability.
- Designed for interactivity: a custom REPL, IHaskell notebook support, terminal and web plotting, and helpful error messages.
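
Pipelines read left to right because a pipe operator is just reverse function application. The following is a minimal sketch in plain Haskell and does not use the library: the local (|>) is defined here for illustration only (it behaves like Data.Function.&, and the library's own definition may differ), and bigAmountsDoubled is a hypothetical name operating on a plain list that stands in for a column.

```haskell
-- Illustrative only: a local pipe operator defined as reverse application.
-- The library exports its own (|>); this sketch does not import it.
infixl 1 |>
(|>) :: a -> (a -> b) -> b
x |> f = f x

-- Hypothetical pipeline over a plain list standing in for a column:
-- keep amounts above 40, double them, and keep the first three results.
bigAmountsDoubled :: [Int] -> [Int]
bigAmountsDoubled amounts =
    amounts
        |> filter (> 40)
        |> map (* 2)
        |> take 3
```

Because each stage is an ordinary function, a stage can be added, removed, or reordered without touching the others.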

Install

cabal update
cabal install dataframe

To use as a dependency in a project:

build-depends: base >= 4, dataframe

Works with GHC 9.4 through 9.12. A custom REPL with all imports pre-loaded is available after installing:

dataframe

Quick Start

Save this as Example.hs and run with cabal run Example.hs:

#!/usr/bin/env cabal
{- cabal:
  build-depends: base >= 4, dataframe
-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications #-}

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Operators

main :: IO ()
main = do
    let sales = D.fromNamedColumns
            [ ("product", D.fromList [1, 1, 2, 2, 3, 3 :: Int])
            , ("amount",  D.fromList [100, 120, 50, 20, 40, 30 :: Int])
            ]

    -- Group by product and compute totals
    print $ sales
        |> D.groupBy ["product"]
        |> D.aggregate [ F.sum (F.col @Int "amount") `as` "total"
                       , F.count (F.col @Int "amount") `as` "orders"
                       ]

-----------------------
product | total | orders
--------|-------|-------
  Int   |  Int  |  Int
--------|-------|-------
1       | 220   | 2
2       | 70    | 2
3       | 70    | 2

Reading from files works the same way:

df <- D.readCsv "data.csv"
df <- D.readParquet "data.parquet"

-- Hugging Face datasets
df <- D.readParquet "hf://datasets/scikit-learn/iris/default/train/0000.parquet"

Interactive REPL

The dataframe REPL comes with all imports pre-loaded. Here's a typical exploration session:

dataframe> df <- D.readCsv "./data/housing.csv"
dataframe> D.dimensions df
(20640, 10)

dataframe> D.describeColumns df
------------------------------------------------------------------------
    Column Name     | ## Non-null Values | ## Null Values |     Type
--------------------|--------------------|----------------|-------------
        Text        |         Int        |      Int       |     Text
--------------------|--------------------|----------------|-------------
 total_bedrooms     | 20433              | 207            | Maybe Double
 ocean_proximity    | 20640              | 0              | Text
 median_house_value | 20640              | 0              | Double
 median_income      | 20640              | 0              | Double
 households         | 20640              | 0              | Double
 population         | 20640              | 0              | Double
 total_rooms        | 20640              | 0              | Double
 housing_median_age | 20640              | 0              | Double
 latitude           | 20640              | 0              | Double
 longitude          | 20640              | 0              | Double

The :declareColumns macro generates typed column references from a dataframe, so you can use column names directly in expressions instead of writing F.col @Double "median_income" every time:

dataframe> :declareColumns df
"longitude :: Expr Double"
"latitude :: Expr Double"
"housing_median_age :: Expr Double"
"total_rooms :: Expr Double"
"total_bedrooms :: Expr (Maybe Double)"
"population :: Expr Double"
"households :: Expr Double"
"median_income :: Expr Double"
"median_house_value :: Expr Double"
"ocean_proximity :: Expr Text"

dataframe> df |> D.groupBy ["ocean_proximity"]
              |> D.aggregate [F.mean median_house_value `as` "avg_value"]
-------------------------------------
 ocean_proximity |     avg_value
-----------------|-------------------
      Text       |       Double
-----------------|-------------------
 <1H OCEAN       | 240084.28546409807
 INLAND          | 124805.39200122119
 ISLAND          | 380440.0
 NEAR BAY        | 259212.31179039303
 NEAR OCEAN      | 249433.97742663656

Create new columns from existing ones:

dataframe> df |> D.derive "rooms_per_household" (total_rooms / households) |> D.take 3
-----------------------------------------------------------------------------------------------------------------
 longitude | latitude | housing_median_age | total_rooms | ... | ocean_proximity | rooms_per_household
-----------|----------|--------------------|-------------|-----|-----------------|--------------------
  Double   |  Double  |       Double       |   Double    | ... |      Text       |       Double
-----------|----------|--------------------|-------------|-----|-----------------|--------------------
 -122.23   | 37.88    | 41.0               | 880.0       | ... | NEAR BAY        | 6.984126984126984
 -122.22   | 37.86    | 21.0               | 7099.0      | ... | NEAR BAY        | 6.238137082601054
 -122.24   | 37.85    | 52.0               | 1467.0      | ... | NEAR BAY        | 8.288135593220339

Type mismatches are caught as compile errors — adding a Double column to a Text column won't silently produce garbage:

dataframe> df |> D.derive "nonsense" (latitude + ocean_proximity)

<interactive>:14:47: error: [GHC-83865]
    • Couldn't match type 'Text' with 'Double'
        Expected: Expr Double
          Actual: Expr Text
    • In the second argument of '(+)', namely 'ocean_proximity'
      In the second argument of 'derive', namely
        '(latitude + ocean_proximity)'

Template Haskell

For scripts and projects, Template Haskell can generate column bindings at compile time.

Generate column references from a CSV

declareColumnsFromCsvFile (in DataFrame.TH, also re-exported from DataFrame) reads your CSV at compile time and generates typed Expr bindings for every column:

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Operators

-- Reads housing.csv at compile time and generates:
--   latitude :: Expr Double
--   total_rooms :: Expr Double
--   ocean_proximity :: Expr Text
--   ... one binding per column
$(D.declareColumnsFromCsvFile "./data/housing.csv")

main :: IO ()
main = do
    df <- D.readCsv "./data/housing.csv"
    print $ df
        |> D.derive "rooms_per_household" (total_rooms / households)
        |> D.filterWhere (median_income .>. 5)
        |> D.groupBy ["ocean_proximity"]
        |> D.aggregate [F.mean median_house_value `as` "avg_value"]

Compare this to the manual version which requires spelling out every column name and type:

-- Without TH — every column needs its name and type spelled out
df |> D.derive "rooms_per_household"
        (F.col @Double "total_rooms" / F.col @Double "households")
   |> D.filterWhere (F.col @Double "median_income" .>. F.lit 5)

Generate a schema type from a CSV

deriveSchemaFromCsvFile generates a type synonym for use with the typed API — instead of manually writing out every column name and type:

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE DataKinds #-}

import qualified DataFrame.Typed as T

-- Generates:
-- type HousingSchema = '[ T.Column "longitude" Double
--                        , T.Column "latitude" Double
--                        , T.Column "total_rooms" Double
--                        , ...
--                        ]
$(T.deriveSchemaFromCsvFile "HousingSchema" "./data/housing.csv")

Generate a schema (and a row bridge) from a record ADT

When the canonical row shape lives in your code as a Haskell record, deriveSchemaFromType produces both the typed schema and a HasSchema instance that converts between [Order] and a DataFrame (or TypedDataFrame OrderSchema) at runtime:

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE OverloadedStrings #-}  -- needed for the T.Text literals below
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE TypeFamilies #-}

import Data.Int (Int64)
import qualified Data.Text as T
import qualified DataFrame as D
import qualified DataFrame.Typed as DT

data Order = Order
    { orderId :: Int64
    , region  :: T.Text
    , amount  :: Double
    } deriving (Show, Eq)

$(DT.deriveSchemaFromType ''Order)
-- expands to:
--   type OrderSchema =
--     '[DT.Column "order_id" Int64, DT.Column "region" T.Text, DT.Column "amount" Double]
--   instance DT.HasSchema Order where
--     type Schema Order = OrderSchema
--     toColumns   = ...
--     fromColumns = ...

xs :: [Order]
xs = [Order 1 "us" 10.0, Order 2 "eu" 20.5]

-- Untyped: [Order] <-> DataFrame
df :: D.DataFrame
df = D.fromRecords xs

xs' :: Either T.Text [Order]
xs' = D.toRecords df          -- runtime-checked

-- Typed: [Order] <-> TypedDataFrame OrderSchema
tdf :: DT.TypedDataFrame OrderSchema
tdf = DT.fromRecordsTyped xs

Field names are translated camelCase → snake_case by default; override the translation with deriveSchemaFromTypeWith defaultSchemaOptions{nameTransform = id} (or any String -> String).
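
To make the default translation concrete, here is a minimal sketch of a camelCase to snake_case function of type String -> String, the shape that nameTransform accepts. The name camelToSnake is hypothetical and this is not the library's implementation; in particular, how the library treats acronyms or digits is not covered here.

```haskell
import Data.Char (isUpper, toLower)

-- Hypothetical stand-in for a camelCase -> snake_case name transform.
-- Every upper-case letter after the first character starts a new word.
camelToSnake :: String -> String
camelToSnake [] = []
camelToSnake (c : cs) = toLower c : concatMap step cs
  where
    step ch
        | isUpper ch = ['_', toLower ch]
        | otherwise  = [ch]
```

With this sketch, orderId maps to order_id and region is left unchanged, matching the column names in the expanded schema above.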

If all you need is a runtime Schema to drive readCsvWithSchema (no typed-dataframe machinery), there's a companion splice in DataFrame.Internal.Schema (re-exported from DataFrame):

$(D.deriveSchema ''Order)
-- emits:
--   orderSchema     :: Schema
--   orderSchema     = makeSchema [("order_id", schemaType @Int64), ...]
--   orderOrderId    :: Expr Int64
--   orderOrderId    = col "order_id"
--   orderRegion     :: Expr Text
--   orderRegion     = col "region"
--   orderAmount     :: Expr Double
--   orderAmount     = col "amount"

orders :: IO D.DataFrame
orders = do
    df <- D.readCsvWithSchema orderSchema "orders.csv"
    pure (D.filter orderAmount (> 100) df)

Each record field gets a typed accessor named <lower-first TyConName><UpperFirst FieldName>, so data Order = Order { customerId :: Int } yields orderCustomerId :: Expr Int, defined as col "customer_id". This is the same shape that $(D.declareColumns df) produces from a runtime DataFrame, but driven by the ADT instead of an existing frame.
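
The naming rule can be sketched as a pure function on strings. This is an illustration of the stated convention only, not code from the library; accessorName, lowerFirst, and upperFirst are hypothetical helper names.

```haskell
import Data.Char (toLower, toUpper)

-- Lower-case the first character, leaving the rest untouched.
lowerFirst :: String -> String
lowerFirst [] = []
lowerFirst (c : cs) = toLower c : cs

-- Upper-case the first character, leaving the rest untouched.
upperFirst :: String -> String
upperFirst [] = []
upperFirst (c : cs) = toUpper c : cs

-- Hypothetical helper: <lower-first TyConName><UpperFirst FieldName>.
accessorName :: String -> String -> String
accessorName tyCon field = lowerFirst tyCon ++ upperFirst field
```

For the Order record above, the fields orderId, region, and amount give orderOrderId, orderRegion, and orderAmount, the same names listed in the splice output.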

If you'd rather not depend on Template Haskell, the same schema is available via GHC.Generics:

{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE UndecidableInstances #-}

import GHC.Generics (Generic)
import DataFrame.Typed (Schema)
import qualified DataFrame.Typed as DT

data Order = Order { … } deriving (Generic)

type OrderSchema = DT.SchemaOf Order

instance DT.HasSchema Order where
    type Schema Order = OrderSchema
    toColumns   = DT.genericToColumns
    fromColumns = DT.genericFromColumns

Typed API

When you want compile-time guarantees that column names exist and types match, wrap your DataFrame in a TypedDataFrame:

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified DataFrame as D
import qualified DataFrame.Typed as T
import Data.Text (Text)
import DataFrame.Operators

type EmployeeSchema =
    '[ T.Column "name"       Text
     , T.Column "department" Text
     , T.Column "salary"     Double
     ]

main :: IO ()
main = do
    df <- D.readCsv "employees.csv"
    case T.freeze @EmployeeSchema df of
        Nothing  -> putStrLn "Schema mismatch!"
        Just tdf -> do
            let result = tdf
                    |> T.derive @"bonus" (T.col @"salary" * T.lit 0.1)
                    |> T.filterWhere (T.col @"salary" .>. T.lit 50000)
                    |> T.select @'["name", "bonus"]
            print (T.thaw result)

T.freeze validates the runtime DataFrame against your schema once at the boundary. After that, every column access is checked at compile time:

-- Typo in column name → compile error
tdf |> T.filterWhere (T.col @"slary" .>. T.lit 50000)
-- error: Column "slary" not found in schema

-- Wrong type → compile error
tdf |> T.filterWhere (T.col @"name" .>. T.lit 50000)
-- error: Couldn't match type 'Text' with 'Double'
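
The idea of validating once at the boundary can be sketched in plain Haskell. This is not how T.freeze is implemented: ColumnInfo and checkSchema are hypothetical names, types are compared as plain strings, and the sketch assumes that extra columns in the runtime frame are tolerated, which the library may or may not do.

```haskell
-- Hypothetical runtime description of a frame: (column name, type name) pairs.
type ColumnInfo = (String, String)

-- Validate once at the boundary: every expected column must be present with
-- the expected type. Nothing signals a mismatch, like the Nothing branch of
-- the case expression in the example above.
checkSchema :: [ColumnInfo] -> [ColumnInfo] -> Maybe [ColumnInfo]
checkSchema expected actual
    | all matches expected = Just expected
    | otherwise            = Nothing
  where
    matches (name, ty) = lookup name actual == Just ty
```

After a successful check, downstream code can rely on the expected columns without re-validating them at every access.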

filterAllJust goes further — it strips Maybe from every column in the schema type, so downstream code can't accidentally treat cleaned columns as nullable:

-- Before: TypedDataFrame '[Column "score" (Maybe Double), Column "name" Text]
let cleaned = T.filterAllJust tdf
-- After:  TypedDataFrame '[Column "score" Double, Column "name" Text]

cleaned |> T.derive @"scaled" (T.col @"score" * T.lit 100)
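
As a rough analogy in plain Haskell (not the library's implementation), dropping rows with missing values changes the row type, so later code cannot see a Maybe at all. RawRow, CleanRow, dropMissing, and scaled are hypothetical names, and tuples stand in for typed rows.

```haskell
import Data.Maybe (mapMaybe)

-- Hypothetical row types before and after removing nulls.
type RawRow   = (Maybe Double, String)
type CleanRow = (Double, String)

-- Keep only rows whose score is present. The result type no longer
-- mentions Maybe, so callers cannot treat the score as nullable.
dropMissing :: [RawRow] -> [CleanRow]
dropMissing = mapMaybe keep
  where
    keep (Just score, name) = Just (score, name)
    keep (Nothing, _)       = Nothing

-- Downstream arithmetic works directly on Double.
scaled :: [CleanRow] -> [Double]
scaled = map (\(score, _) -> score * 100)
```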

Features

I/O: CSV, TSV, Parquet (Snappy, ZSTD, Gzip), JSON. Read Parquet from HTTP URLs and Hugging Face datasets (hf:// URIs). Column projection and predicate pushdown for Parquet reads.

Operations: filter, select, derive, groupBy, aggregate, joins (inner, left, right, full outer), sort, sample, stratified sample, distinct, k-fold splits.

Expressions: typed column references (F.col @Double "x"), arithmetic, comparisons, logical operators, nullable-aware three-valued logic (.==, .&&), string matching (like, regex), casting, and user-defined functions via lift/lift2.

Statistics: mean, median, mode, variance, standard deviation, percentiles, inter-quartile range, correlation, skewness, frequency tables, imputation.

Plotting: terminal plots (histogram, scatter, line, bar, box, pie, heatmap, stacked bar, correlation matrix) and interactive HTML plots.

Lazy engine: streaming query execution for files that don't fit in memory. Rule-based optimizer with filter fusion, predicate pushdown, and dead column elimination. Pull-based executor with configurable batch sizes.

Interop: Arrow C Data Interface for zero-copy round-trips with Python and Polars.

ML: decision trees (TAO algorithm), feature synthesis, k-fold cross-validation, stratified sampling.

Notebooks: IHaskell integration with pre-built Binder examples.
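
The three-valued logic mentioned under Expressions can be illustrated with Maybe Bool, where Nothing means "unknown". This is a sketch of Kleene-style (SQL-like) semantics under the assumption that the library's nullable-aware operators behave similarly; and3 and eq3 are hypothetical names, not the library's .&& and .== operators.

```haskell
-- Kleene-style AND over nullable booleans, with Nothing meaning "unknown".
-- A definite False on either side decides the result even if the other
-- side is unknown.
and3 :: Maybe Bool -> Maybe Bool -> Maybe Bool
and3 (Just False) _            = Just False
and3 _            (Just False) = Just False
and3 (Just True)  (Just True)  = Just True
and3 _            _            = Nothing

-- Nullable equality: the result is unknown if either side is missing.
eq3 :: Eq a => Maybe a -> Maybe a -> Maybe Bool
eq3 (Just x) (Just y) = Just (x == y)
eq3 _        _        = Nothing
```

The design point is that a missing value does not silently become False; it stays unknown unless the other operand settles the answer.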

Lazy Queries

For files too large to fit in memory, DataFrame.Lazy provides a streaming query engine. Declare a schema, build a query plan with the same familiar operations, and runDataFrame runs it through an optimizer before streaming results batch-by-batch:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications #-}

import qualified DataFrame.Lazy as L
import qualified DataFrame.Functions as F
import DataFrame.Operators
import DataFrame.Internal.Schema (Schema, schemaType)
import Data.Text (Text)

mySchema :: Schema
mySchema = [ ("name",   schemaType @Text)
           , ("weight", schemaType @Double)
           , ("height", schemaType @Double)
           ]

main :: IO ()
main = do
    result <- L.runDataFrame $
        L.scanCsv mySchema "large_file.csv"
        |> L.filter  (F.col @Double "height" .>. F.lit 1.7)
        |> L.select  ["name", "weight", "height"]
        |> L.derive  "bmi" (F.col @Double "weight"
                           / (F.col @Double "height" * F.col @Double "height"))
        |> L.take 1000
    print result

The optimizer pushes the filter into the scan, drops unreferenced columns before reading, and stops pulling batches once 1000 rows have been collected.
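
The early-termination behaviour of a pull-based pipeline can be sketched with ordinary lazy lists. This is an analogy under stated assumptions, not the library's executor: Batch, scanBatches, and takeFiltered are hypothetical names, and a lazy list of batches stands in for a streaming file scan.

```haskell
-- A batch of rows; a single Double column keeps the sketch small.
type Batch = [Double]

-- Hypothetical scan: an unbounded stream of batches of the given size,
-- containing 0, 1, 2, ... in order. Batches are produced only on demand.
scanBatches :: Int -> [Batch]
scanBatches size = go 0
  where
    go start =
        map fromIntegral [start .. start + size - 1] : go (start + size)

-- Pull rows that pass the predicate and stop after n results. Laziness
-- means no further batches are generated once n rows are collected.
takeFiltered :: Int -> (Double -> Bool) -> [Batch] -> [Double]
takeFiltered n p = take n . concatMap (filter p)
```

Even though scanBatches never ends, takeFiltered 3 (> 4) (scanBatches 4) finishes after reading two batches, which mirrors how a row limit lets the consumer stop pulling from the source.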

Documentation

User guide: https://dataframe.readthedocs.io/en/latest/
API reference: https://hackage.haskell.org/package/dataframe/docs/DataFrame.html
Coming from pandas, Polars, dplyr, or Frames?
Cookbook (SQL-style patterns)
Tutorials
Discord: https://discord.gg/8u8SCWfrNC