dataframe: A fast, safe, and intuitive DataFrame library.

A fast, safe, and intuitive DataFrame library for exploratory data analysis.


Modules

  • DataFrame
    • DataFrame.Functions
    • DataFrame.Lazy

Versions [RSS] 0.1.0.0, 0.1.0.1, 0.1.0.2, 0.1.0.3, 0.2.0.0, 0.2.0.1, 0.2.0.2, 0.3.0.0, 0.3.0.1
Change log CHANGELOG.md
Dependencies array (>=0.5 && <0.6), attoparsec (>=0.12 && <=0.14.4), base (>=4.17.2.0 && <4.22), bytestring (>=0.11 && <=0.12.2.0), containers (>=0.6.7 && <0.8), directory (>=1.3.0.0 && <=1.3.9.0), filepath (>=1.0.0.0 && <=1.5.4.0), hashable (>=1.2 && <=1.5.0.0), random (>=1 && <=1.3.1), snappy (>=0.2.0.0 && <=0.2.0.4), statistics (>=0.16.2.1 && <=0.16.3.0), template-haskell (>=2.0 && <=2.30), text (>=2.0 && <=2.1.2), time (>=1.12 && <=1.14), vector (>=0.13 && <0.14), vector-algorithms (>=0.9 && <0.10), zstd (>=0.1.2.0 && <=0.1.3.0) [details]
Tested with ghc ==9.8.3 || ==9.6.6 || ==9.4.8 || ==9.10.1 || ==9.12.1 || ==9.12.2
License GPL-3.0-or-later
Copyright (c) 2024-2024 Michael Chavinda
Author Michael Chavinda
Maintainer mschavinda@gmail.com
Category Data
Bug tracker https://github.com/mchav/dataframe/issues
Source repo head: git clone https://github.com/mchav/dataframe
Uploaded by mchav at 2025-07-27T19:07:19Z
Distributions LTSHaskell:0.2.0.2, NixOS:0.1.0.3, Stackage:0.3.0.0
Executables dataframe, one_billion_row_challenge, california_housing, chipotle
Downloads 54 total (16 in the last 30 days)
Status Docs not available [build log]
All reported builds failed as of 2025-07-27 [all 2 reports]

Readme for dataframe-0.3.0.0

User guide | Discord

DataFrame

A fast, safe, and intuitive DataFrame library.

Why use this DataFrame library?

  • Encourages concise, declarative, and composable data pipelines.
  • Static typing makes code easier to reason about and catches many bugs at compile time, before your code ever runs.
  • Delivers high performance thanks to Haskell’s optimizing compiler and efficient memory model.
  • Designed for interactivity: expressive syntax, helpful error messages, and sensible defaults.

Example usage

Interactive environment

ghci> import qualified DataFrame as D
ghci> import DataFrame ((|>))
ghci> df <- D.readCsv "./data/housing.csv"
ghci> D.columnInfo df
--------------------------------------------------------------------------------------------------------------------
index |    Column Name     | # Non-null Values | # Null Values | # Partially parsed | # Unique Values |     Type    
------|--------------------|-------------------|---------------|--------------------|-----------------|-------------
 Int  |        Text        |        Int        |      Int      |        Int         |       Int       |     Text    
------|--------------------|-------------------|---------------|--------------------|-----------------|-------------
0     | total_bedrooms     | 20433             | 207           | 0                  | 1924            | Maybe Double
1     | ocean_proximity    | 20640             | 0             | 0                  | 5               | Text        
2     | median_house_value | 20640             | 0             | 0                  | 3842            | Double      
3     | median_income      | 20640             | 0             | 0                  | 12928           | Double      
4     | households         | 20640             | 0             | 0                  | 1815            | Double      
5     | population         | 20640             | 0             | 0                  | 3888            | Double      
6     | total_rooms        | 20640             | 0             | 0                  | 5926            | Double      
7     | housing_median_age | 20640             | 0             | 0                  | 52              | Double      
8     | latitude           | 20640             | 0             | 0                  | 862             | Double      
9     | longitude          | 20640             | 0             | 0                  | 844             | Double
ghci> :exposeColumns df
ghci> import qualified DataFrame.Functions as F
ghci> df |> D.groupBy ["ocean_proximity"] |> D.aggregate [(F.mean median_house_value) `F.as` "avg_house_value" ]
--------------------------------------------
index | ocean_proximity |  avg_house_value  
------|-----------------|-------------------
 Int  |      Text       |       Double      
------|-----------------|-------------------
0     | <1H OCEAN       | 240084.28546409807
1     | INLAND          | 124805.39200122119
2     | ISLAND          | 380440.0          
3     | NEAR BAY        | 259212.31179039303
4     | NEAR OCEAN      | 249433.97742663656
ghci> df |> D.derive "rooms_per_household" (total_rooms / households) |> D.take 10
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
index | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households |   median_income    | median_house_value | ocean_proximity | rooms_per_household
------|-----------|----------|--------------------|-------------|----------------|------------|------------|--------------------|--------------------|-----------------|--------------------
 Int  |  Double   |  Double  |       Double       |   Double    |  Maybe Double  |   Double   |   Double   |       Double       |       Double       |      Text       |       Double       
------|-----------|----------|--------------------|-------------|----------------|------------|------------|--------------------|--------------------|-----------------|--------------------
0     | -122.23   | 37.88    | 41.0               | 880.0       | Just 129.0     | 322.0      | 126.0      | 8.3252             | 452600.0           | NEAR BAY        | 6.984126984126984  
1     | -122.22   | 37.86    | 21.0               | 7099.0      | Just 1106.0    | 2401.0     | 1138.0     | 8.3014             | 358500.0           | NEAR BAY        | 6.238137082601054  
2     | -122.24   | 37.85    | 52.0               | 1467.0      | Just 190.0     | 496.0      | 177.0      | 7.2574             | 352100.0           | NEAR BAY        | 8.288135593220339  
3     | -122.25   | 37.85    | 52.0               | 1274.0      | Just 235.0     | 558.0      | 219.0      | 5.6431000000000004 | 341300.0           | NEAR BAY        | 5.8173515981735155 
4     | -122.25   | 37.85    | 52.0               | 1627.0      | Just 280.0     | 565.0      | 259.0      | 3.8462             | 342200.0           | NEAR BAY        | 6.281853281853282  
5     | -122.25   | 37.85    | 52.0               | 919.0       | Just 213.0     | 413.0      | 193.0      | 4.0368             | 269700.0           | NEAR BAY        | 4.761658031088083  
6     | -122.25   | 37.84    | 52.0               | 2535.0      | Just 489.0     | 1094.0     | 514.0      | 3.6591             | 299200.0           | NEAR BAY        | 4.9319066147859925 
7     | -122.25   | 37.84    | 52.0               | 3104.0      | Just 687.0     | 1157.0     | 647.0      | 3.12               | 241400.0           | NEAR BAY        | 4.797527047913447  
8     | -122.26   | 37.84    | 42.0               | 2555.0      | Just 665.0     | 1206.0     | 595.0      | 2.0804             | 226700.0           | NEAR BAY        | 4.294117647058823  
9     | -122.25   | 37.84    | 52.0               | 3549.0      | Just 707.0     | 1551.0     | 714.0      | 3.6912000000000003 | 261100.0           | NEAR BAY        | 4.970588235294118
ghci> df |> D.derive "nonsense_feature" (latitude + ocean_proximity) |> D.take 10

<interactive>:14:47: error: [GHC-83865]
    • Couldn't match type ‘Text’ with ‘Double’
      Expected: Expr Double
        Actual: Expr Text
    • In the second argument of ‘(+)’, namely ‘ocean_proximity’
      In the second argument of ‘derive’, namely
        ‘(latitude + ocean_proximity)’
      In the second argument of ‘(|>)’, namely
        ‘derive "nonsense_feature" (latitude + ocean_proximity)’

Key features in example:

  • Intuitive, SQL-like API to get from data to insights.
  • Create type-safe references to columns in a dataframe using :exposeColumns
  • Type-safe column transformations for faster and safer exploration.
  • Fluid, chaining API that makes code easy to reason about.
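
The compile-time rejection of latitude + ocean_proximity in the session above follows from column expressions carrying their element type. The sketch below is a minimal illustration of that idea using only base. The Expr type, eval, and lookupIn are hypothetical stand-ins written for this example, not the library's actual definitions, and the sample row reuses the first row of the housing output shown earlier.

```haskell
-- Hypothetical stand-in for a typed column expression.
-- This is NOT the library's definition of Expr, only an illustration.
data Expr a
  = Col String                             -- named column holding values of type a
  | Lit a                                  -- constant value
  | Unary (a -> a) (Expr a)                -- function applied to one sub-expression
  | Binary (a -> a -> a) (Expr a) (Expr a) -- function combining two sub-expressions

-- Arithmetic is only defined between expressions of the same numeric type.
instance Num a => Num (Expr a) where
  (+) = Binary (+)
  (-) = Binary (-)
  (*) = Binary (*)
  abs = Unary abs
  signum = Unary signum
  fromInteger = Lit . fromInteger

instance Fractional a => Fractional (Expr a) where
  (/) = Binary (/)
  fromRational = Lit . fromRational

-- Evaluate an expression for one row, given a way to look up column values.
eval :: (String -> a) -> Expr a -> a
eval look (Col name) = look name
eval _ (Lit x) = x
eval look (Unary f e) = f (eval look e)
eval look (Binary f l r) = f (eval look l) (eval look r)

-- Hypothetical row representation: column name paired with a Double value.
lookupIn :: [(String, Double)] -> String -> Double
lookupIn row name = case lookup name row of
  Just v -> v
  Nothing -> error ("missing column: " ++ name)

totalRooms, households :: Expr Double
totalRooms = Col "total_rooms"
households = Col "households"

oceanProximity :: Expr String
oceanProximity = Col "ocean_proximity"

roomsPerHousehold :: Expr Double
roomsPerHousehold = totalRooms / households

-- The next line does not type check (Expr Double vs Expr String),
-- so the mistake is caught before any data is read:
-- nonsense = totalRooms + oceanProximity

main :: IO ()
main = print (eval (lookupIn [("total_rooms", 880.0), ("households", 126.0)]) roomsPerHousehold)
```

Because the element type is a parameter of the expression, mixing a Double column with a text column is a type error rather than a runtime failure, which is the behaviour the GHC error message above demonstrates.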

Standalone script example

-- Useful Haskell extensions.
{-# LANGUAGE OverloadedStrings #-} -- Allow string literals to be interpreted as other string types, such as Text.
{-# LANGUAGE TypeApplications #-} -- Convenience syntax for specifying a type: `read @Int "1"` instead of `read "1" :: Int`.

import qualified DataFrame as D -- import for general functionality.
import qualified DataFrame.Functions as F -- import for column expressions.

import DataFrame ((|>)) -- import the chaining operator unqualified.

main :: IO ()
main = do
    df <- D.readTsv "./data/chipotle.tsv"
    let quantity = F.col "quantity" :: D.Expr Int -- A typed reference to a column.
    print (df
      |> D.select ["item_name", "quantity"]
      |> D.groupBy ["item_name"]
      |> D.aggregate [ (F.sum quantity)     `F.as` "sum_quantity"
                     , (F.mean quantity)    `F.as` "mean_quantity"
                     , (F.maximum quantity) `F.as` "maximum_quantity"
                     ]
      |> D.sortBy D.Descending ["sum_quantity"]
      |> D.take 10)

Output:

-------------------------------------------------------------------------------------------
index |          item_name           | sum_quantity |   mean_quantity    | maximum_quantity
------|------------------------------|--------------|--------------------|-----------------
 Int  |             Text             |     Int      |       Double       |       Int
------|------------------------------|--------------|--------------------|-----------------
0     | Chicken Bowl                 | 761          | 1.0482093663911847 | 3
1     | Chicken Burrito              | 591          | 1.0687160940325497 | 4
2     | Chips and Guacamole          | 506          | 1.0563674321503131 | 4
3     | Steak Burrito                | 386          | 1.048913043478261  | 3
4     | Canned Soft Drink            | 351          | 1.1661129568106312 | 4
5     | Chips                        | 230          | 1.0900473933649288 | 3
6     | Steak Bowl                   | 221          | 1.04739336492891   | 3
7     | Bottled Water                | 211          | 1.3024691358024691 | 10
8     | Chips and Fresh Tomato Salsa | 130          | 1.1818181818181819 | 15
9     | Canned Soda                  | 126          | 1.2115384615384615 | 4

A fuller example that uses many of the constructs in the API is in the ./examples folder.
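
For readers who want to see what the group, aggregate, sort, and take steps in the script compute, here is a rough equivalent over a plain list using only base. It is a sketch of the semantics, not of how the library is implemented. The local (|>) definition assumes the library's operator is plain left-to-right application, and the Row and Summary types, the function names, and the sample rows are all invented for illustration (the rows are not taken from chipotle.tsv).

```haskell
import Data.Function (on)
import Data.List (sortOn)
import Data.List.NonEmpty (NonEmpty (..))
import qualified Data.List.NonEmpty as NE
import Data.Ord (Down (..))

-- Assumed behaviour of a left-to-right chaining operator: x |> f = f x.
infixl 1 |>
(|>) :: a -> (a -> b) -> b
x |> f = f x

-- Hypothetical row type: (item_name, quantity).
type Row = (String, Int)

-- Hypothetical result row holding the three aggregates from the script.
data Summary = Summary
  { itemName :: String
  , sumQuantity :: Int
  , meanQuantity :: Double
  , maxQuantity :: Int
  }
  deriving (Show, Eq)

-- Aggregate one non-empty group of rows that share an item name.
summarise :: NonEmpty Row -> Summary
summarise rows@((name, _) :| _) =
  Summary
    { itemName = name
    , sumQuantity = total
    , meanQuantity = fromIntegral total / fromIntegral (length rows)
    , maxQuantity = maximum quantities
    }
  where
    quantities = fmap snd rows
    total = sum quantities

-- Group by item name, aggregate, sort by total descending, keep the first n.
topItems :: Int -> [Row] -> [Summary]
topItems n rows =
  rows
    |> sortOn fst                   -- bring equal item names together
    |> NE.groupBy ((==) `on` fst)   -- one non-empty group per item name
    |> map summarise                -- sum, mean and maximum per group
    |> sortOn (Down . sumQuantity)  -- descending by total quantity
    |> take n

-- Made-up rows, for illustration only.
sampleRows :: [Row]
sampleRows =
  [ ("Chicken Bowl", 2)
  , ("Chips", 1)
  , ("Chicken Bowl", 1)
  , ("Canned Soda", 4)
  , ("Chips", 1)
  ]

main :: IO ()
main = mapM_ print (topItems 2 sampleRows)
```

Grouping into NonEmpty lists lets summarise read the group key and take a maximum without partial functions, since every group is guaranteed to have at least one row.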

Visual example

Screencast of usage in GHCI

Installing

Jupyter notebook

CLI

  • Install Haskell (ghc + cabal) via ghcup selecting all the default options.
  • Install snappy (needed for Parquet support). On Debian or Ubuntu, run: sudo apt install libsnappy-dev.
  • Install dataframe by running cabal update && cabal install dataframe.
  • Open a Haskell REPL with dataframe loaded by running cabal repl --build-depends dataframe.
  • Follow along with any of the tutorials below.

What is exploratory data analysis?

We provide a primer here and show how to do some common analyses.

Coming from other dataframe libraries

Familiar with another dataframe library? Get started:

Supported input formats

  • CSV
  • Apache Parquet (still buggy and experimental)

Future work

  • Apache Arrow compatibility
  • Integration with more common data formats (CSV is currently the only stable input format; Parquet support is experimental)
  • Support windowed plotting (currently only supports ASCII plots)
  • Host the whole library and JupyterLab on Azure with authentication and isolation.