Copyright	(c) 2025
License	GPL-3.0
Maintainer	mschavinda@gmail.com
Stability	experimental
Portability	POSIX
Safe Haskell	None
Language	Haskell2010

DataFrame

Contents

Core data structures
Display operations
Core dataframe operations
Types
I/O
Operations
Errors
Plotting
Convenience functions

Description

Batteries-included entry point for the DataFrame library.

This module re-exports the most commonly used pieces of the dataframe library so you can get productive fast in GHCi, IHaskell, or scripts.

Naming convention

Use the D. ("DataFrame") prefix for core table operations.
Use the F. ("Functions") prefix for the expression DSL (columns, math, aggregations).

Example session:

We provide a script that imports the core functionality and defines helpful macros for writing safe code.

$ curl --output dataframe "https://raw.githubusercontent.com/mchav/dataframe/refs/heads/main/scripts/dataframe.sh"
$ chmod +x dataframe
$ export PATH=$PATH:$PWD/dataframe
$ dataframe
Configuring library for fake-package-0...
Warning: No exposed modules
GHCi, version 9.6.7: https://www.haskell.org/ghc/  :? for help
Loaded GHCi configuration from /tmp/cabal-repl.-242816/setcwd.ghci
========================================
              📦Dataframe
========================================

✨  Modules were automatically imported.

💡  Use prefix D for core functionality.
        ● E.g. D.readCsv "/path/to/file"
💡  Use prefix F for expression functions.
        ● E.g. F.sum (F.col @Int "value")

✅ Ready.
Loaded GHCi configuration from ./dataframe.ghci
ghci>

Quick start

Load a CSV, select a few columns, filter, derive a column, then group + aggregate:

-- 1) Load data
ghci> df0 <- D.readCsv "data/housing.csv"
ghci> D.describeColumns df0
--------------------------------------------------------------------------------------------------------------------
index |    Column Name     | # Non-null Values | # Null Values | # Partially parsed | # Unique Values |     Type
------|--------------------|-------------------|---------------|--------------------|-----------------|-------------
 Int  |        Text        |        Int        |      Int      |        Int         |       Int       |     Text
------|--------------------|-------------------|---------------|--------------------|-----------------|-------------
0     | ocean_proximity    | 20640             | 0             | 0                  | 5               | Text
1     | median_house_value | 20640             | 0             | 0                  | 3842            | Double
2     | median_income      | 20640             | 0             | 0                  | 12928           | Double
3     | households         | 20640             | 0             | 0                  | 1815            | Double
4     | population         | 20640             | 0             | 0                  | 3888            | Double
5     | total_bedrooms     | 20640             | 0             | 0                  | 1924            | Maybe Double
6     | total_rooms        | 20640             | 0             | 0                  | 5926            | Double
7     | housing_median_age | 20640             | 0             | 0                  | 52              | Double
8     | latitude           | 20640             | 0             | 0                  | 862             | Double
9     | longitude          | 20640             | 0             | 0                  | 844             | Double

-- 2) Project & filter
ghci> :exposeColumn df
ghci> df1 = D.filterWhere (ocean_proximity F.== F.lit "ISLAND") df0 D.|> D.select [F.name median_house_value, F.name median_income, F.name ocean_proximity]

-- 3) Add a derived column using the expression DSL
--    (col types are explicit via TypeApplications)
ghci> df2 = D.derive "rooms_per_household" (total_rooms / households) df0

-- 4) Group + aggregate
ghci> let grouped   = D.groupBy ["ocean_proximity"] df0
ghci> let summary   =
         D.aggregate
             [ F.maximum median_house_value `F.as` "max_house_value"]
             grouped
ghci> D.take 5 summary
-----------------------------------------
index | ocean_proximity | max_house_value
------|-----------------|----------------
 Int  |      Text       |     Double
------|-----------------|----------------
0     | <1H OCEAN       | 500001.0
1     | INLAND          | 500001.0
2     | ISLAND          | 450000.0
3     | NEAR BAY        | 500001.0
4     | NEAR OCEAN      | 500001.0

Simple operations (cheat sheet)

Most users only need a handful of verbs:

I/O

```
D.readCsv :: FilePath -> IO DataFrame
```
```
D.readTsv :: FilePath -> IO DataFrame
```

D.writeCsv :: FilePath -> DataFrame -> IO ()

D.readParquet :: FilePath -> IO DataFrame

Exploration

```
D.take :: Int -> DataFrame -> DataFrame
```

D.takeLast :: Int -> DataFrame -> DataFrame

D.describeColumns :: DataFrame -> DataFrame

```
D.summarize :: DataFrame -> DataFrame
```

Row ops

D.filterWhere :: Expr Bool -> DataFrame -> DataFrame

D.sortBy :: SortOrder -> [Text] -> DataFrame -> DataFrame

Column ops

D.select :: [Text] -> DataFrame -> DataFrame

D.exclude :: [Text] -> DataFrame -> DataFrame

D.rename :: [(Text,Text)] -> DataFrame -> DataFrame

D.derive :: Text -> D.Expr a -> DataFrame -> DataFrame

Group & aggregate

D.groupBy :: [Text] -> DataFrame -> GroupedDataFrame

D.aggregate :: [(Text, F.UExpr)] -> GroupedDataFrame -> DataFrame

Joins

D.innerJoin / D.leftJoin / D.rightJoin / D.fullJoin

Expression DSL (F.*) at a glance

Columns (typed):

F.col @Text   "ocean_proximity"
F.col @Double "total_rooms"
F.lit @Double 1.0

Math & comparisons (overloaded by type):

(+), (-), (*), (/), abs, log, exp, round
(F.eq), (F.gt), (F.geq), (F.lt), (F.leq)

Aggregations (for D.aggregate):

F.count @a (F.col @a "c")
F.sum   @Double (F.col @Double "x")
F.mean  @Double (F.col @Double "x")
F.min   @t (F.col @t "x")
F.max   @t (F.col @t "x")

REPL power-tool: ':exposeColumns'

Use :exposeColumns df in GHCi/IHaskell to turn each column of a bound DataFrame into a local binding with the same (mangled if needed) name and the column's concrete vector type. This is great for quick ad-hoc analysis, plotting, or hand-rolled checks.

-- Suppose df has columns: "passengers" :: Int, "fare" :: Double, "payment" :: Text
ghci> :set -XTemplateHaskell
ghci> :exposeColumns df

-- Now you have in scope:
ghci> :type passengers
passengers :: Expr Int

ghci> :type fare
fare :: Expr Double

ghci> :type payment
payment :: Expr Text

-- You can use them directly:
ghci> D.derive "fare_with_tip" (fare * F.lit 1.2)

Notes:

Name mangling: spaces and non-identifier characters are replaced (e.g. "trip id" -> trip_id).
Optional/nullable columns are exposed as Expr (Maybe a).

Synopsis

empty :: DataFrame
null :: DataFrame -> Bool
data DataFrame
data GroupedDataFrame
columnAsDoubleVector :: Text -> DataFrame -> Either DataFrameException (Vector Double)
columnAsFloatVector :: Text -> DataFrame -> Either DataFrameException (Vector Float)
columnAsIntVector :: Text -> DataFrame -> Either DataFrameException (Vector Int)
columnAsVector :: Columnable a => Text -> DataFrame -> Vector a
toDoubleMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Double))
toFloatMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Float))
toIntMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Int))
toMarkdownTable :: DataFrame -> Text
fromList :: (Columnable a, ColumnifyRep (KindOf a) a) => [a] -> Column
toList :: Columnable a => Column -> [a]
data Column
fromUnboxedVector :: (Columnable a, Unbox a) => Vector a -> Column
fromVector :: (Columnable a, ColumnifyRep (KindOf a) a) => Vector a -> Column
hasElemType :: Columnable a => Column -> Bool
hasMissing :: Column -> Bool
isNumeric :: Column -> Bool
toVector :: forall a v. (Vector v a, Columnable a) => Column -> Either DataFrameException (v a)
type Row = Vector Any
fromAny :: Columnable a => Any -> Maybe a
toAny :: Columnable a => a -> Any
toRowList :: DataFrame -> [Row]
toRowVector :: [Text] -> DataFrame -> Vector Row
data Expr a
module DataFrame.Display
fold :: (a -> DataFrame -> DataFrame) -> [a] -> DataFrame -> DataFrame
rename :: Text -> Text -> DataFrame -> DataFrame
dimensions :: DataFrame -> (Int, Int)
columnNames :: DataFrame -> [Text]
insertVector :: Columnable a => Text -> Vector a -> DataFrame -> DataFrame
insertColumn :: Text -> Column -> DataFrame -> DataFrame
insertVectorWithDefault :: Columnable a => a -> Text -> Vector a -> DataFrame -> DataFrame
insertUnboxedVector :: (Columnable a, Unbox a) => Text -> Vector a -> DataFrame -> DataFrame
cloneColumn :: Text -> Text -> DataFrame -> DataFrame
renameMany :: [(Text, Text)] -> DataFrame -> DataFrame
describeColumns :: DataFrame -> DataFrame
fromNamedColumns :: [(Text, Column)] -> DataFrame
fromUnnamedColumns :: [Column] -> DataFrame
fromRows :: [Text] -> [[Any]] -> DataFrame
valueCounts :: Columnable a => Text -> DataFrame -> [(a, Int)]
schemaType :: Columnable a => SchemaType
data HeaderSpec
- = NoHeader
- | UseFirstRow
- | ProvideNames [Text]
data ReadOptions = ReadOptions {
- headerSpec :: HeaderSpec
- typeSpec :: TypeSpec
- safeRead :: Bool
- chunkSize :: Int
- dateFormat :: String
}
data TypeSpec
- = InferFromSample Int
- | SpecifyTypes [SchemaType]
- | NoInference
defaultReadOptions :: ReadOptions
readCsv :: FilePath -> IO DataFrame
readCsvWithOpts :: ReadOptions -> FilePath -> IO DataFrame
readSeparated :: Char -> ReadOptions -> FilePath -> IO DataFrame
readTsv :: FilePath -> IO DataFrame
writeCsv :: FilePath -> DataFrame -> IO ()
writeSeparated :: Char -> FilePath -> DataFrame -> IO ()
module DataFrame.IO.Unstable.CSV
readParquet :: FilePath -> IO DataFrame
range :: (Int, Int) -> DataFrame -> DataFrame
take :: Int -> DataFrame -> DataFrame
filter :: Columnable a => Text -> (a -> Bool) -> DataFrame -> DataFrame
drop :: Int -> DataFrame -> DataFrame
select :: [Text] -> DataFrame -> DataFrame
selectBy :: [SelectionCriteria] -> DataFrame -> DataFrame
data SelectionCriteria
byIndexRange :: (Int, Int) -> SelectionCriteria
byName :: Text -> SelectionCriteria
byNameProperty :: (Text -> Bool) -> SelectionCriteria
byNameRange :: (Text, Text) -> SelectionCriteria
byProperty :: (Column -> Bool) -> SelectionCriteria
cube :: (Int, Int) -> DataFrame -> DataFrame
dropLast :: Int -> DataFrame -> DataFrame
exclude :: [Text] -> DataFrame -> DataFrame
filterAllJust :: DataFrame -> DataFrame
filterBy :: Columnable a => (a -> Bool) -> Text -> DataFrame -> DataFrame
filterJust :: Text -> DataFrame -> DataFrame
filterNothing :: Text -> DataFrame -> DataFrame
filterWhere :: Expr Bool -> DataFrame -> DataFrame
kFolds :: RandomGen g => g -> Int -> DataFrame -> [DataFrame]
randomSplit :: RandomGen g => g -> Double -> DataFrame -> (DataFrame, DataFrame)
sample :: RandomGen g => g -> Double -> DataFrame -> DataFrame
takeLast :: Int -> DataFrame -> DataFrame
module DataFrame.Operations.Transformations
groupBy :: [Text] -> DataFrame -> GroupedDataFrame
aggregate :: [(Text, UExpr)] -> GroupedDataFrame -> DataFrame
distinct :: DataFrame -> DataFrame
module DataFrame.Operations.Sorting
module DataFrame.Operations.Merge
module DataFrame.Operations.Join
sum :: (Columnable a, Num a, Unbox a) => Text -> DataFrame -> Maybe a
correlation :: Text -> Text -> DataFrame -> Maybe Double
frequencies :: Text -> DataFrame -> DataFrame
interQuartileRange :: Text -> DataFrame -> Maybe Double
mean :: Text -> DataFrame -> Maybe Double
median :: Text -> DataFrame -> Maybe Double
skewness :: Text -> DataFrame -> Maybe Double
standardDeviation :: Text -> DataFrame -> Maybe Double
summarize :: DataFrame -> DataFrame
variance :: Text -> DataFrame -> Maybe Double
module DataFrame.Errors
module DataFrame.Display.Terminal.Plot
(|>) :: a -> (a -> b) -> b

Core data structures

empty :: DataFrame Source #

O(1) Creates an empty dataframe

null :: DataFrame -> Bool Source #

Checks if the dataframe is empty (has no columns).

Returns True if the dataframe has no columns, False otherwise. Note that a dataframe with columns but no rows is not considered null.

data DataFrame Source #

Instances

Instances details

Monoid DataFrame Source #
Instance details Defined in DataFrame.Operations.Merge Methods mempty :: DataFrame # mappend :: DataFrame -> DataFrame -> DataFrame # mconcat :: [DataFrame] -> DataFrame #
Semigroup DataFrame Source #	Vertically merge two dataframes using shared columns. Columns that exist in only one dataframe are padded with Nothing.
Instance details Defined in DataFrame.Operations.Merge Methods (<>) :: DataFrame -> DataFrame -> DataFrame # sconcat :: NonEmpty DataFrame -> DataFrame # stimes :: Integral b => b -> DataFrame -> DataFrame #
Show DataFrame Source #
Instance details Defined in DataFrame.Internal.DataFrame Methods showsPrec :: Int -> DataFrame -> ShowS # show :: DataFrame -> String # showList :: [DataFrame] -> ShowS #
Eq DataFrame Source #
Instance details Defined in DataFrame.Internal.DataFrame Methods (==) :: DataFrame -> DataFrame -> Bool # (/=) :: DataFrame -> DataFrame -> Bool #

data GroupedDataFrame Source #

A record that contains information about how and what rows are grouped in the dataframe. This can only be used with aggregate.

Instances

Instances details

Show GroupedDataFrame Source #
Instance details Defined in DataFrame.Internal.DataFrame Methods showsPrec :: Int -> GroupedDataFrame -> ShowS # show :: GroupedDataFrame -> String # showList :: [GroupedDataFrame] -> ShowS #
Eq GroupedDataFrame Source #
Instance details Defined in DataFrame.Internal.DataFrame Methods (==) :: GroupedDataFrame -> GroupedDataFrame -> Bool # (/=) :: GroupedDataFrame -> GroupedDataFrame -> Bool #

columnAsDoubleVector :: Text -> DataFrame -> Either DataFrameException (Vector Double) Source #

Retrieves a column as an unboxed vector of Double values.

Returns Left with a DataFrameException if the column cannot be converted to doubles. This may occur if the column contains non-numeric data.

columnAsFloatVector :: Text -> DataFrame -> Either DataFrameException (Vector Float) Source #

Retrieves a column as an unboxed vector of Float values.

Returns Left with a DataFrameException if the column cannot be converted to floats. This may occur if the column contains non-numeric data.

columnAsIntVector :: Text -> DataFrame -> Either DataFrameException (Vector Int) Source #

Retrieves a column as an unboxed vector of Int values.

Returns Left with a DataFrameException if the column cannot be converted to ints. This may occur if the column contains non-numeric data or values outside the Int range.

columnAsVector :: Columnable a => Text -> DataFrame -> Vector a Source #

Get a specific column as a vector.

You must specify the type via type applications.

Examples

Expand

>>> columnAsVector @Int "age" df
[25, 30, 35, ...]

>>> columnAsVector @Text "name" df
["Alice", "Bob", "Charlie", ...]

Throws

Expand

error - if the column type doesn't match the requested type

toDoubleMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Double)) Source #

Returns a dataframe as a two dimensional vector of doubles.

Converts all columns in the dataframe to double vectors and transposes them into a row-major matrix representation.

This is useful for handing data over into ML systems.

Returns Left with an error if any column cannot be converted to doubles.

toFloatMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Float)) Source #

Returns a dataframe as a two dimensional vector of floats.

Converts all columns in the dataframe to float vectors and transposes them into a row-major matrix representation.

This is useful for handing data over into ML systems.

Returns Left with an error if any column cannot be converted to floats.

toIntMatrix :: DataFrame -> Either DataFrameException (Vector (Vector Int)) Source #

Returns a dataframe as a two dimensional vector of ints.

Converts all columns in the dataframe to int vectors and transposes them into a row-major matrix representation.

This is useful for handing data over into ML systems.

Returns Left with an error if any column cannot be converted to ints.

toMarkdownTable :: DataFrame -> Text Source #

For showing the dataframe as markdown in notebooks.

fromList :: (Columnable a, ColumnifyRep (KindOf a) a) => [a] -> Column Source #

O(n) Convert a list to a column. Automatically picks the best representation of a vector to store the underlying data in.

Examples:

> fromList [(1 :: Int), 2, 3, 4]
[1,2,3,4]

toList :: Columnable a => Column -> [a] Source #

O(n) Converts a column to a list. Throws an exception if the wrong type is specified.

Examples:

> column = fromList [(1 :: Int), 2, 3, 4]
> toList Int column
[1,2,3,4]
> toList Double column
exception: ...

data Column Source #

Our representation of a column is a GADT that can store data based on the underlying data.

This allows us to pattern match on data kinds and limit some operations to only some kinds of vectors. E.g. operations for missing data only happen in an OptionalColumn.

Instances

Instances details

Show Column Source #
Instance details Defined in DataFrame.Internal.Column Methods showsPrec :: Int -> Column -> ShowS # show :: Column -> String # showList :: [Column] -> ShowS #
Eq Column Source #
Instance details Defined in DataFrame.Internal.Column Methods (==) :: Column -> Column -> Bool # (/=) :: Column -> Column -> Bool #
Ord Column Source #
Instance details Defined in DataFrame.Internal.Column Methods compare :: Column -> Column -> Ordering # (<) :: Column -> Column -> Bool # (<=) :: Column -> Column -> Bool # (>) :: Column -> Column -> Bool # (>=) :: Column -> Column -> Bool # max :: Column -> Column -> Column # min :: Column -> Column -> Column #

fromUnboxedVector :: (Columnable a, Unbox a) => Vector a -> Column Source #

O(n) Convert an unboxed vector to a column. This avoids the extra conversion if you already have the data in an unboxed vector.

Examples:

> import qualified Data.Vector.Unboxed as V
> fromUnboxedVector (VB.fromList [(1 :: Int), 2, 3, 4])
[1,2,3,4]

fromVector :: (Columnable a, ColumnifyRep (KindOf a) a) => Vector a -> Column Source #

O(n) Convert a vector to a column. Automatically picks the best representation of a vector to store the underlying data in.

Examples:

> import qualified Data.Vector as V
> fromVector (VB.fromList [(1 :: Int), 2, 3, 4])
[1,2,3,4]

hasElemType :: Columnable a => Column -> Bool Source #

Checks if a column is of a given type values.

hasMissing :: Column -> Bool Source #

Checks if a column contains missing values.

isNumeric :: Column -> Bool Source #

Checks if a column contains numeric values.

toVector :: forall a v. (Vector v a, Columnable a) => Column -> Either DataFrameException (v a) Source #

Converts a column to a vector of a specific type.

This is a type-safe conversion that requires the column's element type to exactly match the requested type. You must specify the desired type via type applications.

Type Parameters

Expand

a: The element type to convert to
v: The vector type (e.g., Vector, Vector)

Examples

Expand

>>> toVector @Int @VU.Vector column
Right (unboxed vector of Ints)

>>> toVector @Text @VB.Vector column
Right (boxed vector of Text)

Returns

Expand

Right - The converted vector if types match
Left TypeMismatchException - If the column's type doesn't match the requested type

Examples

Expand

>>> toRowList df
[Row {name = "Alice", age = 25, ...}, Row {name = "Bob", age = 30, ...}, ...]

Performance note

Expand

This function materializes all rows into a list, which may be memory-intensive for large dataframes. Consider using toRowVector if you need random access or streaming operations.

toRowVector :: [Text] -> DataFrame -> Vector Row Source #

Converts the dataframe to a vector of rows with only the specified columns.

Each row will contain only the columns named in the names parameter. This is useful when you only need a subset of columns or want to control the column order in the resulting rows.

Parameters

Expand

names: List of column names to include in each row. The order of names determines the order of fields in the resulting rows.
df: The dataframe to convert.

Examples

Expand

>>> toRowVector ["name", "age"] df
Vector of rows with only name and age fields

>>> toRowVector [] df  -- Empty column list
Vector of empty rows (one per dataframe row)

data Expr a Source #

Instances

Instances details

(IsString a, Columnable a) => IsString (Expr a) Source #
Instance details Defined in DataFrame.Internal.Expression Methods fromString :: String -> Expr a #
(Floating a, Columnable a) => Floating (Expr a) Source #
Instance details Defined in DataFrame.Internal.Expression Methods pi :: Expr a # exp :: Expr a -> Expr a # log :: Expr a -> Expr a # sqrt :: Expr a -> Expr a # (**) :: Expr a -> Expr a -> Expr a # logBase :: Expr a -> Expr a -> Expr a # sin :: Expr a -> Expr a # cos :: Expr a -> Expr a # tan :: Expr a -> Expr a # asin :: Expr a -> Expr a # acos :: Expr a -> Expr a # atan :: Expr a -> Expr a # sinh :: Expr a -> Expr a # cosh :: Expr a -> Expr a # tanh :: Expr a -> Expr a # asinh :: Expr a -> Expr a # acosh :: Expr a -> Expr a # atanh :: Expr a -> Expr a # log1p :: Expr a -> Expr a # expm1 :: Expr a -> Expr a # log1pexp :: Expr a -> Expr a # log1mexp :: Expr a -> Expr a #
(Num a, Columnable a) => Num (Expr a) Source #
Instance details Defined in DataFrame.Internal.Expression Methods (+) :: Expr a -> Expr a -> Expr a # (-) :: Expr a -> Expr a -> Expr a # (*) :: Expr a -> Expr a -> Expr a # negate :: Expr a -> Expr a # abs :: Expr a -> Expr a # signum :: Expr a -> Expr a # fromInteger :: Integer -> Expr a #
(Fractional a, Columnable a) => Fractional (Expr a) Source #
Instance details Defined in DataFrame.Internal.Expression Methods (/) :: Expr a -> Expr a -> Expr a # recip :: Expr a -> Expr a # fromRational :: Rational -> Expr a #
Show a => Show (Expr a) Source #
Instance details Defined in DataFrame.Internal.Expression Methods showsPrec :: Int -> Expr a -> ShowS # show :: Expr a -> String # showList :: [Expr a] -> ShowS #
(Eq a, Columnable a) => Eq (Expr a) Source #
Instance details Defined in DataFrame.Internal.Expression Methods (==) :: Expr a -> Expr a -> Bool # (/=) :: Expr a -> Expr a -> Bool #
(Ord a, Columnable a) => Ord (Expr a) Source #
Instance details Defined in DataFrame.Internal.Expression Methods compare :: Expr a -> Expr a -> Ordering # (<) :: Expr a -> Expr a -> Bool # (<=) :: Expr a -> Expr a -> Bool # (>) :: Expr a -> Expr a -> Bool # (>=) :: Expr a -> Expr a -> Bool # max :: Expr a -> Expr a -> Expr a # min :: Expr a -> Expr a -> Expr a #

:: Columnable a
=> Text	Column Name
-> Vector a	Vector to add to column
-> DataFrame	DataFrame to add column to
-> DataFrame

Adds a vector to the dataframe. If the vector has less elements than the dataframe and the dataframe is not empty the vector is converted to type `Maybe a` filled with Nothing to match the size of the dataframe. Similarly, if the vector has more elements than what's currently in the dataframe, the other columns in the dataframe are change to `Maybe Type` and filled with Nothing.

Example

Expand

ghci> import qualified Data.Vector as V

ghci> D.insertVector "numbers" (V.fromList [1..10]) D.empty

---------------
index | numbers
------|--------
 Int  |   Int
------|--------
0     | 1
1     | 2
2     | 3
3     | 4
4     | 5
5     | 6
6     | 7
7     | 8
8     | 9
9     | 10

insertColumn Source #

Arguments

:: Text	Column Name
-> Column	Column to add
-> DataFrame	DataFrame to add the column to
-> DataFrame

O(n) Add a column to the dataframe.

Example

Expand

ghci> D.insertColumn "numbers" (D.fromList [1..10]) D.empty

---------------
index | numbers
------|--------
 Int  |   Int
------|--------
0     | 1
1     | 2
2     | 3
3     | 4
4     | 5
5     | 6
6     | 7
7     | 8
8     | 9
9     | 10

insertVectorWithDefault Source #

Arguments

:: Columnable a
=> a	Default Value
-> Text	Column name
-> Vector a	Data to add to column
-> DataFrame	DataFrame to add the column to
-> DataFrame

O(k) Add a column to the dataframe providing a default. This constructs a new vector and also may convert it to an unboxed vector if necessary. Since columns are usually large the runtime is dominated by the length of the list, k.

insertUnboxedVector Source #

Arguments

:: (Columnable a, Unbox a)
=> Text	Column Name
-> Vector a	Unboxed vector to add to column
-> DataFrame	DataFrame to add the column to
-> DataFrame

O(n) Adds an unboxed vector to the dataframe.

Same as insertVector but takes an unboxed vector. If you insert a vector of numbers through insertVector it will either way be converted into an unboxed vector so this function saves that extra work/conversion.

cloneColumn :: Text -> Text -> DataFrame -> DataFrame Source #

O(n) Clones a column and places it under a new name in the dataframe.

Example

Expand

ghci> import qualified Data.Vector as V

ghci> df = insertVector "numbers" (V.fromList [1..10]) D.empty

ghci> D.cloneColumn "numbers" "others" df

------------------------
index | numbers | others
------|---------|-------
 Int  |   Int   |  Int
------|---------|-------
0     | 1       | 1
1     | 2       | 2
2     | 3       | 3
3     | 4       | 4
4     | 5       | 5
5     | 6       | 6
6     | 7       | 7
7     | 8       | 8
8     | 9       | 9
9     | 10      | 10

renameMany :: [(Text, Text)] -> DataFrame -> DataFrame Source #

O(n) Renames many columns.

Example

Expand

ghci> import qualified Data.Vector as V

ghci> df = D.insertVector "others" (V.fromList [11..20]) (D.insertVector "numbers" (V.fromList [1..10]) D.empty)

ghci> df

------------------------
index | numbers | others
------|---------|-------
 Int  |   Int   |  Int
------|---------|-------
0     | 1       | 11
1     | 2       | 12
2     | 3       | 13
3     | 4       | 14
4     | 5       | 15
5     | 6       | 16
6     | 7       | 17
7     | 8       | 18
8     | 9       | 19
9     | 10      | 20

ghci> D.renameMany [("numbers", "first_10"), ("others", "next_10")] df

--------------------------
index | first_10 | next_10
------|----------|--------
 Int  |   Int    |   Int
------|----------|--------
0     | 1        | 11
1     | 2        | 12
2     | 3        | 13
3     | 4        | 14
4     | 5        | 15
5     | 6        | 16
6     | 7        | 17
7     | 8        | 18
8     | 9        | 19
9     | 10       | 20

describeColumns :: DataFrame -> DataFrame Source #

O(n * k ^ 2) Returns the number of non-null columns in the dataframe and the type associated with each column.

Example

Expand

ghci> import qualified Data.Vector as V

ghci> df = D.insertVector "others" (V.fromList [11..20]) (D.insertVector "numbers" (V.fromList [1..10]) D.empty)

ghci> D.describeColumns df

-----------------------------------------------------------------------------------------------------
index | Column Name | # Non-null Values | # Null Values | # Partially parsed | # Unique Values | Type
------|-------------|-------------------|---------------|--------------------|-----------------|-----
 Int  |    Text     |        Int        |      Int      |        Int         |       Int       | Text
------|-------------|-------------------|---------------|--------------------|-----------------|-----
0     | others      | 10                | 0             | 0                  | 10              | Int
1     | numbers     | 10                | 0             | 0                  | 10              | Int

fromNamedColumns :: [(Text, Column)] -> DataFrame Source #

Creates a dataframe from a list of tuples with name and column.

Example

Expand

ghci> df = D.fromNamedColumns [("numbers", D.fromList [1..10]), ("others", D.fromList [11..20])]

ghci> df

------------------------
index | numbers | others
------|---------|-------
 Int  |   Int   |  Int
------|---------|-------
0     | 1       | 11
1     | 2       | 12
2     | 3       | 13
3     | 4       | 14
4     | 5       | 15
5     | 6       | 16
6     | 7       | 17
7     | 8       | 18
8     | 9       | 19
9     | 10      | 20

fromUnnamedColumns :: [Column] -> DataFrame Source #

Create a dataframe from a list of columns. The column names are "0", "1"... etc. Useful for quick exploration but you should probably always rename the columns after or drop the ones you don't want.

Example

Expand

ghci> df = D.fromUnnamedColumns [D.fromList [1..10], D.fromList [11..20]]

ghci> df

-----------------
index |  0  |  1
------|-----|----
 Int  | Int | Int
------|-----|----
0     | 1   | 11
1     | 2   | 12
2     | 3   | 13
3     | 4   | 14
4     | 5   | 15
5     | 6   | 16
6     | 7   | 17
7     | 8   | 18
8     | 9   | 19
9     | 10  | 20

fromRows :: [Text] -> [[Any]] -> DataFrame Source #

Create a dataframe from a list of column names and rows.

Example

Expand

ghci> df = D.fromRows [A, B] [[D.toAny 1, D.toAny 11], [D.toAny 2, D.toAny 12], [D.toAny 3, D.toAny 13]]

ghci> df

-----------------
index |  A  |  B
------|-----|----
 Int  | Int | Int
------|-----|----
0     | 1   | 11
1     | 2   | 12
2     | 3   | 13

valueCounts :: Columnable a => Text -> DataFrame -> [(a, Int)] Source #

O (k * n) Counts the occurences of each value in a given column.

Example

Expand

ghci> df = D.fromUnnamedColumns [D.fromList [1..10], D.fromList [11..20]]

ghci> D.valueCounts @Int "0" df

[(1,1),(2,1),(3,1),(4,1),(5,1),(6,1),(7,1),(8,1),(9,1),(10,1)]

Types

schemaType :: Columnable a => SchemaType Source #

Construct a SchemaType for the given a.

Examples

Expand

>>> :set -XTypeApplications
>>> schemaType @T.Text == schemaType @T.Text
True

>>> show (schemaType @Double)
"Double"

I/O

data HeaderSpec Source #

Constructors

NoHeader	File has no header row
UseFirstRow	Use first row as column names
ProvideNames [Text]	Supply names for a no-header file

Instances

Instances details

Show HeaderSpec Source #
Instance details Defined in DataFrame.IO.CSV Methods showsPrec :: Int -> HeaderSpec -> ShowS # show :: HeaderSpec -> String # showList :: [HeaderSpec] -> ShowS #
Eq HeaderSpec Source #
Instance details Defined in DataFrame.IO.CSV Methods (==) :: HeaderSpec -> HeaderSpec -> Bool # (/=) :: HeaderSpec -> HeaderSpec -> Bool #

data ReadOptions Source #

CSV read parameters.

Constructors

ReadOptions

Fields

headerSpec :: HeaderSpec
Where to get the headers from. (default: UseFirstRow)
typeSpec :: TypeSpec
Whether/how to infer types. (default: InferFromSample 100)
safeRead :: Bool
Whether to partially parse values into Maybe/Either. (default: True)
chunkSize :: Int
Default chunk size (in bytes) for csv reader. (default: 512'000)

dateFormat :: String

Format of date fields as recognized by the Data.Time.Format module.

Examples:

   > parseTimeM True defaultTimeLocale "%Y%-m%-d" "2010304" :: Maybe Day
   Just 2010-03-04
   > parseTimeM True defaultTimeLocale "%d%-m%-Y" "0432010" :: Maybe Day
   Just 2010-03-04

data TypeSpec Source #

Constructors

InferFromSample Int
SpecifyTypes [SchemaType]
NoInference

defaultReadOptions :: ReadOptions Source #

readCsv :: FilePath -> IO DataFrame Source #

Read CSV file from path and load it into a dataframe.

Example

Expand

ghci> D.readCsv "./data/taxi.csv"

readCsvWithOpts :: ReadOptions -> FilePath -> IO DataFrame Source #

Read CSV file from path and load it into a dataframe.

Example

Expand

ghci> D.readCsvWithOpts "./data/taxi.csv" (D.defaultReadOptions { dateFormat = "%d%-m%-Y" })

readSeparated :: Char -> ReadOptions -> FilePath -> IO DataFrame Source #

Read text file with specified delimiter into a dataframe.

Example

Expand

ghci> D.readSeparated ';' D.defaultReadOptions "./data/taxi.txt"

readTsv :: FilePath -> IO DataFrame Source #

Read TSV (tab separated) file from path and load it into a dataframe.

Example

Expand

ghci> D.readTsv "./data/taxi.tsv"

writeCsv :: FilePath -> DataFrame -> IO () Source #

writeSeparated Source #

Arguments

:: Char	Separator
-> FilePath	Path to write to
-> DataFrame
-> IO ()

module DataFrame.IO.Unstable.CSV

readParquet :: FilePath -> IO DataFrame Source #

Read a parquet file from path and load it into a dataframe.

Example

Expand

ghci> D.readParquet "./data/mtcars.parquet"

Operations

range :: (Int, Int) -> DataFrame -> DataFrame Source #

O(k * n) Take a range of rows of a DataFrame.

take :: Int -> DataFrame -> DataFrame Source #

O(k * n) Take the first n rows of a DataFrame.

filter Source #

Arguments

:: Columnable a
=> Text	Column to filter by
-> (a -> Bool)	Filter condition
-> DataFrame	Dataframe to filter
-> DataFrame

O(n * k) Filter rows by a given condition.

filter "x" even df

drop :: Int -> DataFrame -> DataFrame Source #

O(k * n) Drop the first n rows of a DataFrame.

select :: [Text] -> DataFrame -> DataFrame Source #

O(n) Selects a number of columns in a given dataframe.

select ["name", "age"] df

selectBy :: [SelectionCriteria] -> DataFrame -> DataFrame Source #

O(n) select columns by column predicate name.

data SelectionCriteria Source #

byIndexRange :: (Int, Int) -> SelectionCriteria Source #

Criteria for selecting columns whose indices are in the given (inclusive) range.

selectBy [byIndexRange (0, 5)] df

byName :: Text -> SelectionCriteria Source #

Criteria for selecting a column by name.

selectBy [byName "Age"] df

equivalent to:

select ["Age"] df

byNameProperty :: (Text -> Bool) -> SelectionCriteria Source #

Criteria for selecting columns whose name satisfies given predicate.

selectBy [byNameProperty (T.isPrefixOf "weight")] df

byNameRange :: (Text, Text) -> SelectionCriteria Source #

Criteria for selecting columns whose names are in the given lexicographic range (inclusive).

selectBy [byNameRange ("a", "c")] df

byProperty :: (Column -> Bool) -> SelectionCriteria Source #

Criteria for selecting columns whose property satisfies given predicate.

selectBy [byProperty isNumeric] df

cube :: (Int, Int) -> DataFrame -> DataFrame Source #

O(k) cuts the dataframe in a cube of size (a, b) where a is the length and b is the width.

cube (10, 5) df

dropLast :: Int -> DataFrame -> DataFrame Source #

O(k * n) Drop the last n rows of a DataFrame.

exclude :: [Text] -> DataFrame -> DataFrame Source #

O(n) inverse of select

exclude ["Name"] df

filterAllJust :: DataFrame -> DataFrame Source #

O(n * k) removes all rows with Nothing from the dataframe.

filterAllJust df

filterBy :: Columnable a => (a -> Bool) -> Text -> DataFrame -> DataFrame Source #

O(k) a version of filter where the predicate comes first.

filterBy even "x" df

filterJust :: Text -> DataFrame -> DataFrame Source #

O(k) removes all rows with Nothing in a given column from the dataframe.

filterJust "col" df

filterNothing :: Text -> DataFrame -> DataFrame Source #

O(k) returns all rows with Nothing in a give column.

filterNothing "col" df

filterWhere :: Expr Bool -> DataFrame -> DataFrame Source #

O(k) filters the dataframe with a boolean expression.

filterWhere (F.col @Int x + F.col y F.> 5) df

kFolds :: RandomGen g => g -> Int -> DataFrame -> [DataFrame] Source #

Creates n folds of a dataframe.

Example

Expand

ghci> import System.Random
ghci> D.kFolds (mkStdGen 137) 5 df

randomSplit :: RandomGen g => g -> Double -> DataFrame -> (DataFrame, DataFrame) Source #

Split a dataset into two. The first in the tuple gets a sample of p (0 <= p <= 1) and the second gets (1 - p). This is useful for creating test and train splits.

Example

Expand

ghci> import System.Random
ghci> D.randomSplit (mkStdGen 137) 0.9 df

sample :: RandomGen g => g -> Double -> DataFrame -> DataFrame Source #

Sample a dataframe. The double parameter must be between 0 and 1 (inclusive).

Example

Expand

ghci> import System.Random
ghci> D.sample (mkStdGen 137) 0.1 df

takeLast :: Int -> DataFrame -> DataFrame Source #

O(k * n) Take the last n rows of a DataFrame.

module DataFrame.Operations.Transformations

groupBy :: [Text] -> DataFrame -> GroupedDataFrame Source #

O(k * n) groups the dataframe by the given rows aggregating the remaining rows into vector that should be reduced later.

aggregate :: [(Text, UExpr)] -> GroupedDataFrame -> DataFrame Source #

Aggregate a grouped dataframe using the expressions given. All ungrouped columns will be dropped.

distinct :: DataFrame -> DataFrame Source #

Filter out all non-unique values in a dataframe.

module DataFrame.Operations.Sorting

module DataFrame.Operations.Merge

module DataFrame.Operations.Join

sum :: (Columnable a, Num a, Unbox a) => Text -> DataFrame -> Maybe a Source #

Calculates the sum of a given column as a standalone value.

correlation :: Text -> Text -> DataFrame -> Maybe Double Source #

Calculates the Pearson's correlation coefficient between two given columns as a standalone value.

frequencies :: Text -> DataFrame -> DataFrame Source #

Show a frequency table for a categorical feaure.

Examples:

ghci> df <- D.readCsv "./data/housing.csv"

ghci> D.frequencies "ocean_proximity" df

----------------------------------------------------------------------------
index |   Statistic    | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN
------|----------------|-----------|--------|--------|----------|-----------
 Int  |      Text      |    Any    |  Any   |  Any   |   Any    |    Any
------|----------------|-----------|--------|--------|----------|-----------
0     | Count          | 9136      | 6551   | 5      | 2290     | 2658
1     | Percentage (%) | 44.26%    | 31.74% | 0.02%  | 11.09%   | 12.88%

interQuartileRange :: Text -> DataFrame -> Maybe Double Source #

Calculates the inter-quartile range of a given column as a standalone value.

mean :: Text -> DataFrame -> Maybe Double Source #

Calculates the mean of a given column as a standalone value.

median :: Text -> DataFrame -> Maybe Double Source #

Calculates the median of a given column as a standalone value.

skewness :: Text -> DataFrame -> Maybe Double Source #

Calculates the skewness of a given column as a standalone value.

standardDeviation :: Text -> DataFrame -> Maybe Double Source #

Calculates the standard deviation of a given column as a standalone value.

summarize :: DataFrame -> DataFrame Source #

Descriptive statistics of the numeric columns.