Copyright	(c) 2025
License	GPL-3.0
Maintainer	mschavinda@gmail.com
Stability	experimental
Portability	POSIX
Safe Haskell	None
Language	Haskell2010

DataFrame

Contents

Core data structures
Core dataframe operations
I/O
Operations
Errors
Plotting
Convenience functions

Description

Batteries-included entry point for the DataFrame library.

This module re-exports the most commonly used pieces of the dataframe library so you can get productive fast in GHCi, IHaskell, or scripts.

Naming convention

* Use the D. ("DataFrame") prefix for core table operations.
* Use the F. ("Functions") prefix for the expression DSL (columns, math, aggregations).

Example session:

-- GHCi quality-of-life:
ghci> :set -XOverloadedStrings -XTypeApplications
ghci> import qualified DataFrame as D
ghci> import qualified DataFrame.Functions as F
ghci> import Data.Text (Text)

Quick start

Load a CSV, select a few columns, filter, derive a column, then group + aggregate:

-- 1) Load data
ghci> df0 <- D.readCsv "data/housing.csv"
ghci> D.describeColumns df0
--------------------------------------------------------------------------------------------------------------------
index |    Column Name     | # Non-null Values | # Null Values | # Partially parsed | # Unique Values |     Type
------|--------------------|-------------------|---------------|--------------------|-----------------|-------------
 Int  |        Text        |        Int        |      Int      |        Int         |       Int       |     Text
------|--------------------|-------------------|---------------|--------------------|-----------------|-------------
0     | ocean_proximity    | 20640             | 0             | 0                  | 5               | Text
1     | median_house_value | 20640             | 0             | 0                  | 3842            | Double
2     | median_income      | 20640             | 0             | 0                  | 12928           | Double
3     | households         | 20640             | 0             | 0                  | 1815            | Double
4     | population         | 20640             | 0             | 0                  | 3888            | Double
5     | total_bedrooms     | 20640             | 0             | 0                  | 1924            | Maybe Double
6     | total_rooms        | 20640             | 0             | 0                  | 5926            | Double
7     | housing_median_age | 20640             | 0             | 0                  | 52              | Double
8     | latitude           | 20640             | 0             | 0                  | 862             | Double
9     | longitude          | 20640             | 0             | 0                  | 844             | Double

-- 2) Project & filter
ghci> let df1 = D.filter @Text "ocean_proximity" (== "ISLAND") df0 D.|> D.select ["median_house_value", "median_income", "ocean_proximity"]

-- 3) Add a derived column using the expression DSL
--    (col types are explicit via TypeApplications)
ghci> df2 = D.derive "rooms_per_household" (F.col @Double "total_rooms" / F.col @Double "households") df0

-- 4) Group + aggregate
ghci> let grouped   = D.groupBy ["ocean_proximity"] df0
ghci> let summary   =
         D.aggregate
             [ F.maximum (F.col @Double "median_house_value") `F.as` "max_house_value"]
             grouped
ghci> D.take 5 summary
-----------------------------------------
index | ocean_proximity | max_house_value
------|-----------------|----------------
 Int  |      Text       |     Double
------|-----------------|----------------
0     | <1H OCEAN       | 500001.0
1     | INLAND          | 500001.0
2     | ISLAND          | 450000.0
3     | NEAR BAY        | 500001.0
4     | NEAR OCEAN      | 500001.0
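
The D.|> operator used in step 2 is listed in the Synopsis with type a -> (a -> b) -> b, which is reverse function application. As a minimal sketch of how such a chain reads, the block below defines a local stand-in and pipes a plain list through two steps. The local definition and its fixity (infixl 1) are assumptions for illustration, not taken from the library.

```haskell
-- Local stand-in for a reverse-application operator; not the library's definition.
-- The fixity is an assumption chosen so that chains associate left to right.
infixl 1 |>

(|>) :: a -> (a -> b) -> b
x |> f = f x

main :: IO ()
main = do
  -- Keep the even numbers, then double them, reading left to right.
  let result = [1 .. 10 :: Int] |> filter even |> map (* 2)
  print result -- [4,8,12,16,20]
```

Because function application binds tighter than the operator, each step on the right can be a partially applied function that is still waiting for its final argument.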

Simple operations (cheat sheet)

Most users only need a handful of verbs:

I/O

D.readCsv :: FilePath -> IO DataFrame

D.writeCsv :: FilePath -> DataFrame -> IO ()

D.readParquet :: FilePath -> IO DataFrame

Exploration

D.take :: Int -> DataFrame -> DataFrame

D.takeLast :: Int -> DataFrame -> DataFrame

D.describeColumns :: DataFrame -> DataFrame

D.summarize :: DataFrame -> DataFrame

Row ops

D.filter  :: Columnable a => Text -> (a -> Bool) -> DataFrame -> DataFrame

D.sortBy  :: SortOrder -> [Text] -> DataFrame -> DataFrame

Column ops

D.select     :: [Text] -> DataFrame -> DataFrame

D.exclude    :: [Text] -> DataFrame -> DataFrame

D.rename     :: Text -> Text -> DataFrame -> DataFrame

D.renameMany :: [(Text, Text)] -> DataFrame -> DataFrame

D.derive :: Text -> D.Expr a -> DataFrame -> DataFrame

Group & aggregate

D.groupBy   :: [Text] -> DataFrame -> GroupedDataFrame

D.aggregate :: [(Text, F.UExpr)] -> GroupedDataFrame -> DataFrame

Joins

D.innerJoin / D.leftJoin / D.rightJoin / D.fullJoin

Expression DSL (F.*) at a glance

Columns (typed):

F.col   @Text   "ocean_proximity"
F.col   @Double "total_rooms"
F.lit   @Double 1.0

Math & comparisons (overloaded by type):

(+), (-), (*), (/), abs, log, exp, round
(F.eq), (F.gt), (F.geq), (F.lt), (F.leq)

Aggregations (for aggregate):

F.count @a (F.col @a "c")
F.sum   @Double (F.col @Double "x")
F.mean  @Double (F.col @Double "x")
F.min   @t (F.col @t "x")
F.max   @t (F.col @t "x")

REPL power-tool: ':exposeColumns'

Use :exposeColumns df in GHCi/IHaskell to turn each column of a bound DataFrame into a local binding with the same (mangled if needed) name, typed as an expression (Expr a) over the column's element type. This is great for quick ad-hoc analysis, plotting, or hand-rolled checks.

-- Suppose df has columns: "passengers" :: Int, "fare" :: Double, "payment" :: Text
ghci> :set -XTemplateHaskell
ghci> :exposeColumns df

-- Now you have in scope:
ghci> :type passengers
passengers :: Expr Int

ghci> :type fare
fare :: Expr Double

ghci> :type payment
payment :: Expr Text

-- You can use them directly:
ghci> D.derive "fare_with_tip" (fare * F.lit 1.2) df

Notes:

Name mangling: spaces and non-identifier characters are replaced (e.g. "trip id" -> trip_id).
Optional/nullable columns are exposed as Expr (Maybe a).

Synopsis

empty :: DataFrame
data DataFrame
data GroupedDataFrame
columnAsVector :: Columnable a => Text -> DataFrame -> Vector a
toMatrix :: DataFrame -> Vector (Vector Float)
fromList :: (Columnable a, ColumnifyRep (KindOf a) a) => [a] -> Column
toList :: Columnable a => Column -> [a]
fromVector :: (Columnable a, ColumnifyRep (KindOf a) a) => Vector a -> Column
toVector :: Columnable a => Column -> Vector a
data Column
fromUnboxedVector :: (Columnable a, Unbox a) => Vector a -> Column
data Expr a
fold :: (a -> DataFrame -> DataFrame) -> [a] -> DataFrame -> DataFrame
rename :: Text -> Text -> DataFrame -> DataFrame
dimensions :: DataFrame -> (Int, Int)
columnNames :: DataFrame -> [Text]
insertVector :: Columnable a => Text -> Vector a -> DataFrame -> DataFrame
insertColumn :: Text -> Column -> DataFrame -> DataFrame
insertVectorWithDefault :: Columnable a => a -> Text -> Vector a -> DataFrame -> DataFrame
insertUnboxedVector :: (Columnable a, Unbox a) => Text -> Vector a -> DataFrame -> DataFrame
cloneColumn :: Text -> Text -> DataFrame -> DataFrame
renameMany :: [(Text, Text)] -> DataFrame -> DataFrame
describeColumns :: DataFrame -> DataFrame
fromNamedColumns :: [(Text, Column)] -> DataFrame
fromUnnamedColumns :: [Column] -> DataFrame
valueCounts :: Columnable a => Text -> DataFrame -> [(a, Int)]
defaultOptions :: ReadOptions
data ReadOptions = ReadOptions {
- hasHeader :: Bool
- inferTypes :: Bool
- safeRead :: Bool
- chunkSize :: Int
}
readCsv :: String -> IO DataFrame
readSeparated :: Char -> ReadOptions -> String -> IO DataFrame
readTsv :: String -> IO DataFrame
readParquet :: String -> IO DataFrame
filter :: Columnable a => Text -> (a -> Bool) -> DataFrame -> DataFrame
range :: (Int, Int) -> DataFrame -> DataFrame
take :: Int -> DataFrame -> DataFrame
drop :: Int -> DataFrame -> DataFrame
select :: [Text] -> DataFrame -> DataFrame
selectBy :: (Text -> Bool) -> DataFrame -> DataFrame
cube :: (Int, Int) -> DataFrame -> DataFrame
dropLast :: Int -> DataFrame -> DataFrame
exclude :: [Text] -> DataFrame -> DataFrame
filterAllJust :: DataFrame -> DataFrame
filterBy :: Columnable a => (a -> Bool) -> Text -> DataFrame -> DataFrame
filterJust :: Text -> DataFrame -> DataFrame
filterWhere :: Expr Bool -> DataFrame -> DataFrame
selectIntRange :: (Int, Int) -> DataFrame -> DataFrame
selectRange :: (Text, Text) -> DataFrame -> DataFrame
takeLast :: Int -> DataFrame -> DataFrame
apply :: (Columnable b, Columnable c) => (b -> c) -> Text -> DataFrame -> DataFrame
safeApply :: (Columnable b, Columnable c) => (b -> c) -> Text -> DataFrame -> Either DataFrameException DataFrame
derive :: Columnable a => Text -> Expr a -> DataFrame -> DataFrame
applyMany :: (Columnable b, Columnable c) => (b -> c) -> [Text] -> DataFrame -> DataFrame
applyInt :: Columnable b => (Int -> b) -> Text -> DataFrame -> DataFrame
applyDouble :: Columnable b => (Double -> b) -> Text -> DataFrame -> DataFrame
applyWhere :: (Columnable a, Columnable b) => (a -> Bool) -> Text -> (b -> b) -> Text -> DataFrame -> DataFrame
applyAtIndex :: Columnable a => Int -> (a -> a) -> Text -> DataFrame -> DataFrame
impute :: Columnable b => Text -> b -> DataFrame -> DataFrame
groupBy :: [Text] -> DataFrame -> GroupedDataFrame
aggregate :: [(Text, UExpr)] -> GroupedDataFrame -> DataFrame
distinct :: DataFrame -> DataFrame
sortBy :: SortOrder -> [Text] -> DataFrame -> DataFrame
data SortOrder
- = Ascending
- | Descending
(|||) :: DataFrame -> DataFrame -> DataFrame
join :: JoinType -> [Text] -> DataFrame -> DataFrame -> DataFrame
data JoinType
- = INNER
- | LEFT
- | RIGHT
- | FULL_OUTER
innerJoin :: [Text] -> DataFrame -> DataFrame -> DataFrame
sum :: (Columnable a, Num a, Unbox a) => Text -> DataFrame -> Maybe a
correlation :: Text -> Text -> DataFrame -> Maybe Double
median :: Text -> DataFrame -> Maybe Double
variance :: Text -> DataFrame -> Maybe Double
mean :: Text -> DataFrame -> Maybe Double
skewness :: Text -> DataFrame -> Maybe Double
frequencies :: Text -> DataFrame -> DataFrame
interQuartileRange :: Text -> DataFrame -> Maybe Double
standardDeviation :: Text -> DataFrame -> Maybe Double
summarize :: DataFrame -> DataFrame
data DataFrameException where
- TypeMismatchException :: forall a b. (Typeable a, Typeable b) => TypeErrorContext a b -> DataFrameException
- ColumnNotFoundException :: Text -> Text -> [Text] -> DataFrameException
- EmptyDataSetException :: Text -> DataFrameException
data TypeErrorContext a b = MkTypeErrorContext {
- userType :: Either String (TypeRep a)
- expectedType :: Either String (TypeRep b)
- errorColumnName :: Maybe String
- callingFunctionName :: Maybe String
}
typeMismatchError :: String -> String -> String
addCallPointInfo :: Maybe String -> Maybe String -> String -> String
columnNotFound :: Text -> Text -> [Text] -> String
emptyDataSetError :: Text -> String
guessColumnName :: Text -> [Text] -> Text
typeAnnotationSuggestion :: String -> String
editDistance :: Text -> Text -> Int
data PlotType
- = Histogram
- | Scatter
- | Line
- | Bar
- | BoxPlot
- | Pie
- | StackedBar
- | Heatmap
data PlotConfig = PlotConfig {
- plotType :: PlotType
- plotSettings :: Plot
}
defaultPlotConfig :: PlotType -> PlotConfig
plotHistogram :: HasCallStack => Text -> DataFrame -> IO ()
plotHistogramWith :: HasCallStack => Text -> PlotConfig -> DataFrame -> IO ()
extractNumericColumn :: HasCallStack => Text -> DataFrame -> [Double]
plotScatter :: HasCallStack => Text -> Text -> DataFrame -> IO ()
plotScatterWith :: HasCallStack => Text -> Text -> PlotConfig -> DataFrame -> IO ()
plotScatterBy :: HasCallStack => Text -> Text -> Text -> DataFrame -> IO ()
plotScatterByWith :: HasCallStack => Text -> Text -> Text -> PlotConfig -> DataFrame -> IO ()
extractStringColumn :: HasCallStack => Text -> DataFrame -> [Text]
plotLines :: HasCallStack => Text -> [Text] -> DataFrame -> IO ()
plotLinesWith :: HasCallStack => Text -> [Text] -> PlotConfig -> DataFrame -> IO ()
plotBoxPlots :: HasCallStack => [Text] -> DataFrame -> IO ()
plotBoxPlotsWith :: HasCallStack => [Text] -> PlotConfig -> DataFrame -> IO ()
plotStackedBars :: HasCallStack => Text -> [Text] -> DataFrame -> IO ()
plotStackedBarsWith :: HasCallStack => Text -> [Text] -> PlotConfig -> DataFrame -> IO ()
plotHeatmap :: HasCallStack => DataFrame -> IO ()
plotHeatmapWith :: HasCallStack => PlotConfig -> DataFrame -> IO ()
isNumericColumn :: DataFrame -> Text -> Bool
plotAllHistograms :: HasCallStack => DataFrame -> IO ()
plotCorrelationMatrix :: HasCallStack => DataFrame -> IO ()
plotBars :: HasCallStack => Text -> DataFrame -> IO ()
plotBarsWith :: HasCallStack => Text -> Maybe Text -> PlotConfig -> DataFrame -> IO ()
plotSingleBars :: HasCallStack => Text -> PlotConfig -> DataFrame -> IO ()
plotGroupedBarsWith :: HasCallStack => Text -> Text -> PlotConfig -> DataFrame -> IO ()
getCategoricalCounts :: HasCallStack => Text -> DataFrame -> Maybe [(Text, Double)]
groupWithOther :: Int -> [(Text, Double)] -> [(Text, Double)]
plotBarsTopN :: HasCallStack => Int -> Text -> DataFrame -> IO ()
plotBarsTopNWith :: HasCallStack => Int -> Text -> PlotConfig -> DataFrame -> IO ()
plotGroupedBarsWithN :: HasCallStack => Int -> Text -> Text -> PlotConfig -> DataFrame -> IO ()
isNumericColumnCheck :: Text -> DataFrame -> Bool
plotValueCounts :: HasCallStack => Text -> DataFrame -> IO ()
plotValueCountsWith :: HasCallStack => Text -> Int -> PlotConfig -> DataFrame -> IO ()
plotBarsWithPercentages :: HasCallStack => Text -> DataFrame -> IO ()
smartPlotBars :: HasCallStack => Text -> DataFrame -> IO ()
plotCategoricalSummary :: HasCallStack => DataFrame -> IO ()
isNumericType :: Typeable a => Bool
vectorToDoubles :: (Typeable a, Show a) => Vector a -> [Double]
unboxedVectorToDoubles :: (Typeable a, Unbox a, Show a) => Vector a -> [Double]
plotPie :: HasCallStack => Text -> Maybe Text -> DataFrame -> IO ()
plotPieWith :: HasCallStack => Text -> Maybe Text -> PlotConfig -> DataFrame -> IO ()
groupWithOtherForPie :: Int -> [(Text, Double)] -> [(Text, Double)]
plotPieWithPercentages :: HasCallStack => Text -> DataFrame -> IO ()
plotPieWithPercentagesConfig :: HasCallStack => Text -> PlotConfig -> DataFrame -> IO ()
plotPieTopN :: HasCallStack => Int -> Text -> DataFrame -> IO ()
plotPieTopNWith :: HasCallStack => Int -> Text -> PlotConfig -> DataFrame -> IO ()
smartPlotPie :: HasCallStack => Text -> DataFrame -> IO ()
plotPieGrouped :: HasCallStack => Text -> Text -> DataFrame -> IO ()
plotPieGroupedWith :: HasCallStack => Text -> Text -> PlotConfig -> DataFrame -> IO ()
plotPieComparison :: HasCallStack => [Text] -> DataFrame -> IO ()
plotBinaryPie :: HasCallStack => Text -> DataFrame -> IO ()
plotMarketShare :: HasCallStack => Text -> DataFrame -> IO ()
plotMarketShareWith :: HasCallStack => Text -> PlotConfig -> DataFrame -> IO ()
(|>) :: a -> (a -> b) -> b

Core data structures

empty :: DataFrame Source #

O(1) Creates an empty dataframe

data DataFrame Source #

Instances

Monoid DataFrame Source #
Instance details Defined in DataFrame.Operations.Merge Methods mempty :: DataFrame # mappend :: DataFrame -> DataFrame -> DataFrame # mconcat :: [DataFrame] -> DataFrame #
Semigroup DataFrame Source #
Instance details Defined in DataFrame.Operations.Merge Methods (<>) :: DataFrame -> DataFrame -> DataFrame # sconcat :: NonEmpty DataFrame -> DataFrame # stimes :: Integral b => b -> DataFrame -> DataFrame #
Show DataFrame Source #
Instance details Defined in DataFrame.Internal.DataFrame Methods showsPrec :: Int -> DataFrame -> ShowS # show :: DataFrame -> String # showList :: [DataFrame] -> ShowS #
Eq DataFrame Source #
Instance details Defined in DataFrame.Internal.DataFrame Methods (==) :: DataFrame -> DataFrame -> Bool # (/=) :: DataFrame -> DataFrame -> Bool #

data GroupedDataFrame Source #

A record that contains information about how and what rows are grouped in the dataframe. This can only be used with aggregate.

Instances

Show GroupedDataFrame Source #
Instance details Defined in DataFrame.Internal.DataFrame Methods showsPrec :: Int -> GroupedDataFrame -> ShowS # show :: GroupedDataFrame -> String # showList :: [GroupedDataFrame] -> ShowS #
Eq GroupedDataFrame Source #
Instance details Defined in DataFrame.Internal.DataFrame Methods (==) :: GroupedDataFrame -> GroupedDataFrame -> Bool # (/=) :: GroupedDataFrame -> GroupedDataFrame -> Bool #

columnAsVector :: Columnable a => Text -> DataFrame -> Vector a Source #

Get a specific column as a vector.

You must specify the type via type applications.

toMatrix :: DataFrame -> Vector (Vector Float) Source #

Returns a dataframe as a two-dimensional vector of floats.

All entries in the dataframe must be doubles. This is useful for handing data over to ML systems.

fromList :: (Columnable a, ColumnifyRep (KindOf a) a) => [a] -> Column Source #

O(n) Convert a list to a column. Automatically picks the best representation of a vector to store the underlying data in.

Examples:

> fromList [(1 :: Int), 2, 3, 4]
[1,2,3,4]

toList :: Columnable a => Column -> [a] Source #

O(n) Converts a column to a list. Throws an exception if the wrong type is specified.

Examples:

> column = fromList [(1 :: Int), 2, 3, 4]
> toList @Int column
[1,2,3,4]
> toList @Double column
exception: ...

fromVector :: (Columnable a, ColumnifyRep (KindOf a) a) => Vector a -> Column Source #

O(n) Convert a vector to a column. Automatically picks the best representation of a vector to store the underlying data in.

Examples:

> import qualified Data.Vector as V
> fromVector (V.fromList [(1 :: Int), 2, 3, 4])
[1,2,3,4]

toVector :: Columnable a => Column -> Vector a Source #

O(n) Converts a column to a boxed vector. Throws an exception if the wrong type is specified.

Examples:

> column = fromList [(1 :: Int), 2, 3, 4]
> toVector @Int column
[1,2,3,4]
> toVector @Double column
exception: ...

data Column Source #

A column is represented as a GADT whose constructor depends on the type of the underlying data.

This allows us to pattern match on data kinds and limit some operations to only some kinds of vectors. E.g. operations for missing data only happen in an OptionalColumn.

Instances

Show Column Source #
Instance details Defined in DataFrame.Internal.Column Methods showsPrec :: Int -> Column -> ShowS # show :: Column -> String # showList :: [Column] -> ShowS #
Eq Column Source #
Instance details Defined in DataFrame.Internal.Column Methods (==) :: Column -> Column -> Bool # (/=) :: Column -> Column -> Bool #

fromUnboxedVector :: (Columnable a, Unbox a) => Vector a -> Column Source #

O(n) Convert an unboxed vector to a column. This avoids the extra conversion if you already have the data in an unboxed vector.

Examples:

> import qualified Data.Vector.Unboxed as V
> fromUnboxedVector (V.fromList [(1 :: Int), 2, 3, 4])
[1,2,3,4]

data Expr a Source #

Instances

(Floating a, Columnable a) => Floating (Expr a) Source #
Instance details Defined in DataFrame.Internal.Expression Methods pi :: Expr a # exp :: Expr a -> Expr a # log :: Expr a -> Expr a # sqrt :: Expr a -> Expr a # (**) :: Expr a -> Expr a -> Expr a # logBase :: Expr a -> Expr a -> Expr a # sin :: Expr a -> Expr a # cos :: Expr a -> Expr a # tan :: Expr a -> Expr a # asin :: Expr a -> Expr a # acos :: Expr a -> Expr a # atan :: Expr a -> Expr a # sinh :: Expr a -> Expr a # cosh :: Expr a -> Expr a # tanh :: Expr a -> Expr a # asinh :: Expr a -> Expr a # acosh :: Expr a -> Expr a # atanh :: Expr a -> Expr a # log1p :: Expr a -> Expr a # expm1 :: Expr a -> Expr a # log1pexp :: Expr a -> Expr a # log1mexp :: Expr a -> Expr a #
(Num a, Columnable a) => Num (Expr a) Source #
Instance details Defined in DataFrame.Internal.Expression Methods (+) :: Expr a -> Expr a -> Expr a # (-) :: Expr a -> Expr a -> Expr a # (*) :: Expr a -> Expr a -> Expr a # negate :: Expr a -> Expr a # abs :: Expr a -> Expr a # signum :: Expr a -> Expr a # fromInteger :: Integer -> Expr a #
(Fractional a, Columnable a) => Fractional (Expr a) Source #
Instance details Defined in DataFrame.Internal.Expression Methods (/) :: Expr a -> Expr a -> Expr a # recip :: Expr a -> Expr a # fromRational :: Rational -> Expr a #
Show a => Show (Expr a) Source #
Instance details Defined in DataFrame.Internal.Expression Methods showsPrec :: Int -> Expr a -> ShowS # show :: Expr a -> String # showList :: [Expr a] -> ShowS #

Core dataframe operations

fold :: (a -> DataFrame -> DataFrame) -> [a] -> DataFrame -> DataFrame Source #

A left fold for dataframes that takes the dataframe as the last argument. This makes it easier to chain operations.

Example

ghci> D.fold (const id) [1..5] df

-----------------
index |  0  |  1
------|-----|----
 Int  | Int | Int
------|-----|----
0     | 1   | 11
1     | 2   | 12
2     | 3   | 13
3     | 4   | 14
4     | 5   | 15
5     | 6   | 16
6     | 7   | 17
7     | 8   | 18
8     | 9   | 19
9     | 10  | 20
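
To make the argument order concrete, here is a minimal sketch of a fold with this shape. It assumes the steps are applied from left to right, as "left fold" suggests. A plain list stands in for DataFrame, and the names Frame and foldFrame are hypothetical, so this illustrates the idea rather than the library's implementation.

```haskell
-- A list of Ints stands in for DataFrame here; this is an illustration only.
type Frame = [Int]

-- Left fold that takes the accumulated frame as its last argument,
-- mirroring the argument order of the documented signature.
foldFrame :: (a -> Frame -> Frame) -> [a] -> Frame -> Frame
foldFrame f xs frame = foldl (flip f) frame xs

main :: IO ()
main =
  -- The first step drops 1 element and the second drops 2 more.
  print (foldFrame drop [1, 2] [10, 20, 30, 40, 50]) -- [40,50]
```

Keeping the frame last means a partially applied foldFrame f xs is itself a Frame -> Frame step that can be chained with other operations.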

rename :: Text -> Text -> DataFrame -> DataFrame Source #

O(n) Renames a single column.

Example

ghci> import qualified Data.Vector as V

ghci> df = D.insertVector "numbers" (V.fromList [1..10]) D.empty

ghci> D.rename "numbers" "others" df

--------------
index | others
------|-------
 Int  |  Int
------|-------
0     | 1
1     | 2
2     | 3
3     | 4
4     | 5
5     | 6
6     | 7
7     | 8
8     | 9
9     | 10

dimensions :: DataFrame -> (Int, Int) Source #

O(1) Get DataFrame dimensions i.e. (rows, columns)

Example

ghci> D.dimensions df

(100, 3)

columnNames :: DataFrame -> [Text] Source #

O(k) Get column names of the DataFrame in order of insertion.

Example

ghci> D.columnNames df

["col_a", "col_b", "col_c"]

insertVector Source #

Arguments

:: Columnable a
=> Text	Column Name
-> Vector a	Vector to add to column
-> DataFrame	DataFrame to add column to
-> DataFrame

Adds a vector to the dataframe. If the vector has fewer elements than the dataframe and the dataframe is not empty, the vector is converted to type `Maybe a` and filled with Nothing to match the size of the dataframe. Similarly, if the vector has more elements than what's currently in the dataframe, the other columns in the dataframe are changed to `Maybe Type` and filled with Nothing.

Example

ghci> import qualified Data.Vector as V

ghci> D.insertVector "numbers" (V.fromList [1..10]) D.empty

---------------
index | numbers
------|--------
 Int  |   Int
------|--------
0     | 1
1     | 2
2     | 3
3     | 4
4     | 5
5     | 6
6     | 7
7     | 8
8     | 9
9     | 10
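
The padding rule described above can be sketched on plain lists. This is an illustration under stated assumptions, not the library's code: padTo and alignColumns are hypothetical names, padding is assumed to go at the end, and unlike the library this sketch wraps values in Maybe even when the lengths already match.

```haskell
-- Pad a column with Nothing up to a target length.
-- Padding at the end is an assumption made for this sketch.
padTo :: Int -> [a] -> [Maybe a]
padTo n xs = take (max n (length xs)) (map Just xs ++ repeat Nothing)

-- Align an existing column and a new column to a common length.
alignColumns :: [a] -> [b] -> ([Maybe a], [Maybe b])
alignColumns existing new = (padTo n existing, padTo n new)
  where
    n = max (length existing) (length new)

main :: IO ()
main = do
  -- The new column is shorter, so the new column is padded.
  print (alignColumns [1, 2, 3 :: Int] "ab")
  -- The new column is longer, so the existing column is padded.
  print (alignColumns [1 :: Int] "ab")
```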

insertColumn Source #

Arguments

:: Text	Column Name
-> Column	Column to add
-> DataFrame	DataFrame to add the column to
-> DataFrame

O(n) Add a column to the dataframe.

Example

ghci> D.insertColumn "numbers" (D.fromList [1..10]) D.empty

---------------
index | numbers
------|--------
 Int  |   Int
------|--------
0     | 1
1     | 2
2     | 3
3     | 4
4     | 5
5     | 6
6     | 7
7     | 8
8     | 9
9     | 10

insertVectorWithDefault Source #

Arguments

:: Columnable a
=> a	Default Value
-> Text	Column name
-> Vector a	Data to add to column
-> DataFrame	DataFrame to add the column to
-> DataFrame

O(k) Add a column to the dataframe providing a default. This constructs a new vector and also may convert it to an unboxed vector if necessary. Since columns are usually large the runtime is dominated by the length of the list, k.

insertUnboxedVector Source #

Arguments

:: (Columnable a, Unbox a)
=> Text	Column Name
-> Vector a	Unboxed vector to add to column
-> DataFrame	DataFrame to add the column to
-> DataFrame

O(n) Adds an unboxed vector to the dataframe.

Same as insertVector but takes an unboxed vector. A vector of numbers inserted through insertVector is converted into an unboxed vector anyway, so this function saves that extra conversion.

cloneColumn :: Text -> Text -> DataFrame -> DataFrame Source #

O(n) Clones a column and places it under a new name in the dataframe.

Example

ghci> import qualified Data.Vector as V

ghci> df = D.insertVector "numbers" (V.fromList [1..10]) D.empty

ghci> D.cloneColumn "numbers" "others" df

------------------------
index | numbers | others
------|---------|-------
 Int  |   Int   |  Int
------|---------|-------
0     | 1       | 1
1     | 2       | 2
2     | 3       | 3
3     | 4       | 4
4     | 5       | 5
5     | 6       | 6
6     | 7       | 7
7     | 8       | 8
8     | 9       | 9
9     | 10      | 10

renameMany :: [(Text, Text)] -> DataFrame -> DataFrame Source #

O(n) Renames many columns.

Example

ghci> import qualified Data.Vector as V

ghci> df = D.insertVector "others" (V.fromList [11..20]) (D.insertVector "numbers" (V.fromList [1..10]) D.empty)

ghci> df

------------------------
index | numbers | others
------|---------|-------
 Int  |   Int   |  Int
------|---------|-------
0     | 1       | 11
1     | 2       | 12
2     | 3       | 13
3     | 4       | 14
4     | 5       | 15
5     | 6       | 16
6     | 7       | 17
7     | 8       | 18
8     | 9       | 19
9     | 10      | 20

ghci> D.renameMany [("numbers", "first_10"), ("others", "next_10")] df

--------------------------
index | first_10 | next_10
------|----------|--------
 Int  |   Int    |   Int
------|----------|--------
0     | 1        | 11
1     | 2        | 12
2     | 3        | 13
3     | 4        | 14
4     | 5        | 15
5     | 6        | 16
6     | 7        | 17
7     | 8        | 18
8     | 9        | 19
9     | 10       | 20

describeColumns :: DataFrame -> DataFrame Source #

O(n * k ^ 2) Returns a per-column summary of the dataframe: the counts of non-null, null, partially parsed and unique values, plus the type associated with each column.

Example

ghci> import qualified Data.Vector as V

ghci> df = D.insertVector "others" (V.fromList [11..20]) (D.insertVector "numbers" (V.fromList [1..10]) D.empty)

ghci> D.describeColumns df

-----------------------------------------------------------------------------------------------------
index | Column Name | # Non-null Values | # Null Values | # Partially parsed | # Unique Values | Type
------|-------------|-------------------|---------------|--------------------|-----------------|-----
 Int  |    Text     |        Int        |      Int      |        Int         |       Int       | Text
------|-------------|-------------------|---------------|--------------------|-----------------|-----
0     | others      | 10                | 0             | 0                  | 10              | Int
1     | numbers     | 10                | 0             | 0                  | 10              | Int

fromNamedColumns :: [(Text, Column)] -> DataFrame Source #

Creates a dataframe from a list of tuples with name and column.

Example

ghci> df = D.fromNamedColumns [("numbers", D.fromList [1..10]), ("others", D.fromList [11..20])]

ghci> df

------------------------
index | numbers | others
------|---------|-------
 Int  |   Int   |  Int
------|---------|-------
0     | 1       | 11
1     | 2       | 12
2     | 3       | 13
3     | 4       | 14
4     | 5       | 15
5     | 6       | 16
6     | 7       | 17
7     | 8       | 18
8     | 9       | 19
9     | 10      | 20

fromUnnamedColumns :: [Column] -> DataFrame Source #

Create a dataframe from a list of columns. The column names are "0", "1", and so on. Useful for quick exploration, but you should probably always rename the columns afterwards or drop the ones you don't want.

Example

ghci> df = D.fromUnnamedColumns [D.fromList [1..10], D.fromList [11..20]]

ghci> df

-----------------
index |  0  |  1
------|-----|----
 Int  | Int | Int
------|-----|----
0     | 1   | 11
1     | 2   | 12
2     | 3   | 13
3     | 4   | 14
4     | 5   | 15
5     | 6   | 16
6     | 7   | 17
7     | 8   | 18
8     | 9   | 19
9     | 10  | 20

valueCounts :: Columnable a => Text -> DataFrame -> [(a, Int)] Source #

O(k * n) Counts the occurrences of each value in a given column.

Example

ghci> df = D.fromUnnamedColumns [D.fromList [1..10], D.fromList [11..20]]

ghci> D.valueCounts @Int "0" df

[(1,1),(2,1),(3,1),(4,1),(5,1),(6,1),(7,1),(8,1),(9,1),(10,1)]
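
The counting itself can be sketched on a plain list. The helper countValues below is hypothetical and uses Data.Map from the containers package that ships with GHC. It returns counts in ascending key order, which is a property of this sketch and not something the documentation above promises for valueCounts.

```haskell
import qualified Data.Map.Strict as Map

-- Count how often each value occurs; results come back in ascending key order.
countValues :: Ord a => [a] -> [(a, Int)]
countValues xs = Map.toList (Map.fromListWith (+) [(x, 1) | x <- xs])

main :: IO ()
main =
  print (countValues ["INLAND", "ISLAND", "INLAND", "NEAR BAY", "INLAND"])
  -- [("INLAND",3),("ISLAND",1),("NEAR BAY",1)]
```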

I/O

defaultOptions :: ReadOptions Source #

data ReadOptions Source #

CSV read parameters.

Constructors

ReadOptions
Fields

- hasHeader :: Bool
  Whether or not the CSV file has a header. (default: True)
- inferTypes :: Bool
  Whether to try and infer types. (default: True)
- safeRead :: Bool
  Whether to partially parse values into `Maybe`/`Either`. (default: True)
- chunkSize :: Int
  Default chunk size (in bytes) for the CSV reader. (default: 512'000)

readCsv :: String -> IO DataFrame Source #

Read CSV file from path and load it into a dataframe.

Example

ghci> df <- D.readCsv "./data/taxi.csv"

readSeparated :: Char -> ReadOptions -> String -> IO DataFrame Source #

Read text file with specified delimiter into a dataframe.

Example

ghci> df <- D.readSeparated ';' D.defaultOptions "./data/taxi.txt"

readTsv :: String -> IO DataFrame Source #

Read TSV (tab separated) file from path and load it into a dataframe.

Example

ghci> df <- D.readTsv "./data/taxi.tsv"

readParquet :: String -> IO DataFrame Source #

Read a parquet file from path and load it into a dataframe.

Example

ghci> df <- D.readParquet "./data/mtcars.parquet"

Operations

filter Source #

Arguments

:: Columnable a
=> Text	Column to filter by
-> (a -> Bool)	Filter condition
-> DataFrame	Dataframe to filter
-> DataFrame

O(n * k) Filter rows by a given condition.

filter "x" even df

range :: (Int, Int) -> DataFrame -> DataFrame Source #

O(k * n) Take a range of rows of a DataFrame.

take :: Int -> DataFrame -> DataFrame Source #

O(k * n) Take the first n rows of a DataFrame.

drop :: Int -> DataFrame -> DataFrame Source #

O(k * n) Drop the first n rows of a DataFrame.

select :: [Text] -> DataFrame -> DataFrame Source #

O(n) Selects a number of columns in a given dataframe.

select ["name", "age"] df

selectBy :: (Text -> Bool) -> DataFrame -> DataFrame Source #

O(n) Selects the columns whose names satisfy a predicate.

cube :: (Int, Int) -> DataFrame -> DataFrame Source #

O(k) cuts the dataframe in a cube of size (a, b) where a is the length and b is the width.

cube (10, 5) df

dropLast :: Int -> DataFrame -> DataFrame Source #

O(k * n) Drop the last n rows of a DataFrame.

exclude :: [Text] -> DataFrame -> DataFrame Source #

O(n) inverse of select

exclude ["Name"] df

filterAllJust :: DataFrame -> DataFrame Source #

O(n * k) removes all rows with Nothing from the dataframe.

filterAllJust df

filterBy :: Columnable a => (a -> Bool) -> Text -> DataFrame -> DataFrame Source #

O(k) a version of filter where the predicate comes first.

filterBy even "x" df

filterJust :: Text -> DataFrame -> DataFrame Source #

O(k) removes all rows with Nothing in a given column from the dataframe.

filterJust "x" df

filterWhere :: Expr Bool -> DataFrame -> DataFrame Source #

O(k) filters the dataframe with a boolean expression over its columns. Rows for which the expression evaluates to True are kept.

filterWhere ((F.col @Int "x" + F.col @Int "y") `F.gt` F.lit 5) df

selectIntRange :: (Int, Int) -> DataFrame -> DataFrame Source #

O(n) Selects columns by a range of column indices.

selectRange :: (Text, Text) -> DataFrame -> DataFrame Source #

O(n) Selects columns by a range of column names.

takeLast :: Int -> DataFrame -> DataFrame Source #

O(k * n) Take the last n rows of a DataFrame.

apply Source #

Arguments

:: (Columnable b, Columnable c)
=> (b -> c)	function to apply
-> Text	Column name
-> DataFrame	DataFrame to apply operation to
-> DataFrame

O(k) Apply a function to a given column in a dataframe.

safeApply Source #

Arguments

:: (Columnable b, Columnable c)
=> (b -> c)	function to apply
-> Text	Column name
-> DataFrame	DataFrame to apply operation to
-> Either DataFrameException DataFrame

O(k) Safe version of the apply function. Returns (instead of throwing) the error.

derive :: Columnable a => Text -> Expr a -> DataFrame -> DataFrame Source #

O(k) Evaluates an expression over a combination of columns in a dataframe and stores the result in a column with the given name.

applyMany :: (Columnable b, Columnable c) => (b -> c) -> [Text] -> DataFrame -> DataFrame Source #

O(k * n) Apply a function to given column names in a dataframe.

applyInt Source #

Arguments

:: Columnable b
=> (Int -> b)	function to apply
-> Text	Column name
-> DataFrame	DataFrame to apply operation to
-> DataFrame

O(k) Convenience function that applies to an int column.

applyDouble Source #

Arguments

:: Columnable b
=> (Double -> b)	function to apply
-> Text	Column name
-> DataFrame	DataFrame to apply operation to
-> DataFrame

O(k) Convenience function that applies to a Double column.

applyWhere :: (Columnable a, Columnable b) => (a -> Bool) -> Text -> (b -> b) -> Text -> DataFrame -> DataFrame Source #

O(k * n) Apply a function to a column only if there is another column value that matches the given criterion.

applyWhere (<20) "Age" (const "Gen-Z") "Generation" df
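The conditional update performed by applyWhere can be pictured on a list of records. This is a minimal sketch of the idea only; Person and relabelWhere are hypothetical names introduced for the illustration and are not part of the library.

```haskell
-- Hypothetical row type used only for this illustration.
data Person = Person
  { age :: Int
  , generation :: String
  }
  deriving (Eq, Show)

-- Apply f to `generation` only in rows whose `age` satisfies the predicate;
-- all other rows are returned unchanged.
relabelWhere :: (Int -> Bool) -> (String -> String) -> [Person] -> [Person]
relabelWhere p f = map step
  where
    step r
      | p (age r) = r {generation = f (generation r)}
      | otherwise = r
```

For example, relabelWhere (< 20) (const "Gen-Z") rewrites the generation field only for people younger than 20.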

applyAtIndex Source #

Arguments

:: Columnable a
=> Int	Index
-> (a -> a)	function to apply
-> Text	Column name
-> DataFrame	DataFrame to apply operation to
-> DataFrame

O(k) Apply a function to the column at a given index.

impute :: Columnable b => Text -> b -> DataFrame -> DataFrame Source #

Replace all instances of Nothing in a column with the given value.
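On a single column modelled as a list of optional values, imputation amounts to replacing each Nothing with a default. The sketch below assumes that model; imputeWith is a hypothetical helper, not the library function.

```haskell
import Data.Maybe (fromMaybe)

-- Replace every missing cell with the given default value.
-- The result no longer needs Maybe, since every cell is now present.
imputeWith :: a -> [Maybe a] -> [a]
imputeWith def = map (fromMaybe def)
```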

groupBy :: [Text] -> DataFrame -> GroupedDataFrame Source #

O(k * n) Groups the dataframe by the given columns, collecting the remaining columns into vectors that should be reduced later.

aggregate :: [(Text, UExpr)] -> GroupedDataFrame -> DataFrame Source #

Aggregate a grouped dataframe using the expressions given. All ungrouped columns will be dropped.
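The two-step shape of groupBy followed by aggregate (collect values per key, then reduce each collection) can be sketched with plain lists. This is an illustration under simplifying assumptions: rows are (key, value) pairs, and groupValues and aggregateWith are hypothetical names rather than library functions.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Step 1: group rows by key, collecting each key's values into a list.
-- sortOn is stable, so values keep their original order within a group.
groupValues :: Ord k => [(k, v)] -> [(k, [v])]
groupValues rows =
  [ (k, map snd grp)
  | grp@((k, _) : _) <- groupBy ((==) `on` fst) (sortOn fst rows)
  ]

-- Step 2: reduce each group's collected values to a single result.
aggregateWith :: ([v] -> r) -> [(k, [v])] -> [(k, r)]
aggregateWith f = map (fmap f)
```

Composing the two, aggregateWith sum . groupValues yields one total per key.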

distinct :: DataFrame -> DataFrame Source #

Removes duplicate rows from a dataframe, keeping one copy of each distinct row.

sortBy :: SortOrder -> [Text] -> DataFrame -> DataFrame Source #

O(k log n) Sorts the dataframe by the given columns in the given order.

sortBy Ascending ["Age"] df
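Sorting with an explicit order parameter can be sketched on plain lists as follows. Order and sortRows are hypothetical stand-ins for SortOrder and sortBy; the key function plays the role of the column list, and a tuple key gives a multi-column sort.

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

-- Hypothetical stand-in for the library's SortOrder type.
data Order = Asc | Desc

-- Sort rows by a key extracted from each row, in the requested order.
-- Wrapping the key in Down reverses the comparison for descending sorts.
sortRows :: Ord k => Order -> (row -> k) -> [row] -> [row]
sortRows Asc key = sortBy (comparing key)
sortRows Desc key = sortBy (comparing (Down . key))
```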

data SortOrder Source #

Sort order taken as a parameter by the sortBy function.

Constructors

Ascending
Descending

Instances

Eq SortOrder Source #
Defined in DataFrame.Operations.Sorting
Methods
(==) :: SortOrder -> SortOrder -> Bool #
(/=) :: SortOrder -> SortOrder -> Bool #

(|||) :: DataFrame -> DataFrame -> DataFrame Source #

Add two dataframes side by side/horizontally.

join :: JoinType -> [Text] -> DataFrame -> DataFrame -> DataFrame Source #

Join two dataframes using SQL join semantics.

Only inner join is implemented for now.

data JoinType Source #

Equivalent to SQL join types.

Constructors

INNER
LEFT
RIGHT
FULL_OUTER

innerJoin :: [Text] -> DataFrame -> DataFrame -> DataFrame Source #

Inner join of two dataframes. Note: to support chaining, the left dataframe of the join is passed as the second (right-hand) dataframe argument.
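SQL inner-join semantics keep only the pairs of rows whose keys match on both sides. A minimal nested-loop sketch on plain lists is shown below; innerJoinOn is a hypothetical name, and the library's actual join strategy and argument order are not assumed here.

```haskell
-- Pair every left row with every right row that has an equal key.
-- Rows without a match on the other side are dropped.
-- This nested loop costs O(n * m); it illustrates the semantics only.
innerJoinOn :: Eq k => (a -> k) -> (b -> k) -> [a] -> [b] -> [(a, b)]
innerJoinOn keyL keyR ls rs =
  [ (l, r)
  | l <- ls
  , r <- rs
  , keyL l == keyR r
  ]
```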

sum :: (Columnable a, Num a, Unbox a) => Text -> DataFrame -> Maybe a Source #

Calculates the sum of a given column as a standalone value.

correlation :: Text -> Text -> DataFrame -> Maybe Double Source #

Calculates Pearson's correlation coefficient between two given columns as a standalone value.
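For reference, Pearson's coefficient is the covariance of the two columns divided by the product of their standard deviations. The sketch below computes it on plain lists; pearson is a hypothetical helper, and the conditions under which it returns Nothing (mismatched lengths, fewer than two points, or a constant column) are assumptions of this sketch, not documented library behaviour.

```haskell
-- Pearson's r = sum (dx * dy) / sqrt (sum dx^2 * sum dy^2),
-- where dx and dy are deviations from the respective means.
pearson :: [Double] -> [Double] -> Maybe Double
pearson xs ys
  | n < 2 || length ys /= n = Nothing -- not enough or mismatched data
  | sxx == 0 || syy == 0 = Nothing -- a constant column has no defined r
  | otherwise = Just (sxy / sqrt (sxx * syy))
  where
    n = length xs
    mean v = sum v / fromIntegral (length v)
    dx = map (subtract (mean xs)) xs
    dy = map (subtract (mean ys)) ys
    sxy = sum (zipWith (*) dx dy)
    sxx = sum (map (\d -> d * d) dx)
    syy = sum (map (\d -> d * d) dy)
```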

median :: Text -> DataFrame -> Maybe Double Source #

Calculates the median of a given column as a standalone value.
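A common definition of the median is the middle element of the sorted values, or the average of the two middle elements when the count is even. The sketch below assumes that definition; medianOf is a hypothetical helper and returns Nothing for an empty column.

```haskell
import Data.List (sort)

-- Median of a list: middle element when the length is odd,
-- mean of the two middle elements when it is even.
medianOf :: [Double] -> Maybe Double
medianOf [] = Nothing
medianOf xs
  | even n = Just ((sorted !! (mid - 1) + sorted !! mid) / 2)
  | otherwise = Just (sorted !! mid)
  where
    sorted = sort xs
    n = length xs
    mid = n `div` 2
```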

variance :: Text -> DataFrame -> Maybe Double Source #

Calculates the variance of a given column as a standalone value.

mean :: Text -> DataFrame -> Maybe Double Source #

Calculates the mean of a given column as a standalone value.

skewness :: Text -> DataFrame -> Maybe Double Source #

Calculates the skewness of a given column as a standalone value.

frequencies :: Text -> DataFrame -> DataFrame Source #

Show a frequency table for a categorical feature.

Examples:

ghci> df <- D.readCsv "./data/housing.csv"

ghci> D.frequencies "ocean_proximity" df

----------------------------------------------------------------------------
index |   Statistic    | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN
------|----------------|-----------|--------|--------|----------|-----------
 Int  |      Text      |    Any    |  Any   |  Any   |   Any    |    Any
------|----------------|-----------|--------|--------|----------|-----------
0     | Count          | 9136      | 6551   | 5      | 2290     | 2658
1     | Percentage (%) | 44.26%    | 31.74% | 0.02%  | 11.09%   | 12.88%
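Each percentage in the table is the category count divided by the total number of rows (for example 9136 / 20640 is about 44.26%). A minimal sketch of that computation on a plain list follows; frequencyTable is a hypothetical name and returns (value, count, percentage) triples rather than a dataframe.

```haskell
import Data.List (group, sort)

-- Count each distinct value and express the count as a percentage
-- of the total number of values.
frequencyTable :: Ord a => [a] -> [(a, Int, Double)]
frequencyTable xs =
  [ (v, c, 100 * fromIntegral c / total)
  | grp@(v : _) <- group (sort xs)
  , let c = length grp
  ]
  where
    total = fromIntegral (length xs)
```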

interQuartileRange :: Text -> DataFrame -> Maybe Double Source #

Calculates the inter-quartile range of a given column as a standalone value.

standardDeviation :: Text -> DataFrame -> Maybe Double Source #

Calculates the standard deviation of a given column as a standalone value.
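Standard deviation is the square root of the variance. The text above does not say whether the library divides by n - 1 (sample) or n (population); the sketch below assumes the sample form, and sampleVariance and sampleStdDev are hypothetical helpers.

```haskell
-- Sample variance: sum of squared deviations from the mean, divided by n - 1.
-- Returns Nothing when fewer than two values are available.
sampleVariance :: [Double] -> Maybe Double
sampleVariance xs
  | n < 2 = Nothing
  | otherwise = Just (sum [(x - m) * (x - m) | x <- xs] / fromIntegral (n - 1))
  where
    n = length xs
    m = sum xs / fromIntegral n

-- Sample standard deviation: square root of the sample variance.
sampleStdDev :: [Double] -> Maybe Double
sampleStdDev = fmap sqrt . sampleVariance
```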

summarize :: DataFrame -> DataFrame Source #

Descriptive statistics of the numeric columns.

Errors

data DataFrameException where Source #

Constructors

TypeMismatchException :: forall a b. (Typeable a, Typeable b) => TypeErrorContext a b -> DataFrameException
ColumnNotFoundException :: Text -> Text -> [Text] -> DataFrameException
EmptyDataSetException :: Text -> DataFrameException

Instances

Exception DataFrameException Source #
Defined in DataFrame.Errors
Methods
toException :: DataFrameException -> SomeException #
fromException :: SomeException -> Maybe DataFrameException #
displayException :: DataFrameException -> String #

Show DataFrameException Source #
Defined in DataFrame.Errors
Methods
showsPrec :: Int -> DataFrameException -> ShowS #
show :: DataFrameException -> String #
showList :: [DataFrameException] -> ShowS #

data TypeErrorContext a b Source #

Constructors

MkTypeErrorContext
Fields
userType :: Either String (TypeRep a)
expectedType :: Either String (TypeRep b)
errorColumnName :: Maybe String
callingFunctionName :: Maybe String

typeMismatchError :: String -> String -> String Source #

addCallPointInfo :: Maybe String -> Maybe String -> String -> String Source #

columnNotFound :: Text -> Text -> [Text] -> String Source #

emptyDataSetError :: Text -> String Source #

guessColumnName :: Text -> [Text] -> Text Source #

typeAnnotationSuggestion :: String -> String Source #

editDistance :: Text -> Text -> Int Source #
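These helpers are undocumented here. One plausible reading, and it is only an assumption, is that editDistance is a Levenshtein distance and that guessColumnName uses it to suggest the closest existing column when a name is not found. The sketch below illustrates that idea on String; levenshtein and closestName are hypothetical names, not the library's functions.

```haskell
import Data.List (foldl', minimumBy)
import Data.Ord (comparing)

-- Levenshtein distance computed row by row: each row holds the distances
-- between a prefix of s and every prefix of t.
levenshtein :: String -> String -> Int
levenshtein s t = last (foldl' nextRow [0 .. length t] s)
  where
    nextRow [] _ = []
    nextRow prev@(p : ps) c = scanl step (p + 1) (zip3 t prev ps)
      where
        -- left: insertion, up: deletion, diag: substitution or match.
        step left (tc, diag, up) =
          minimum [left + 1, up + 1, diag + (if tc == c then 0 else 1)]

-- Suggest the candidate name with the smallest distance to the target.
closestName :: String -> [String] -> Maybe String
closestName _ [] = Nothing
closestName target names =
  Just (minimumBy (comparing (levenshtein target)) names)
```

With this sketch, a misspelt "median_incme" would be matched to "median_income" rather than "households".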

Plotting

data PlotType Source #

Constructors

Histogram
Scatter
Line
Bar
BoxPlot
Pie
StackedBar
Heatmap

Instances

Show PlotType Source #
Defined in DataFrame.Display.Terminal.Plot
Methods
showsPrec :: Int -> PlotType -> ShowS #
show :: PlotType -> String #
showList :: [PlotType] -> ShowS #

Eq PlotType Source #
Defined in DataFrame.Display.Terminal.Plot
Methods
(==) :: PlotType -> PlotType -> Bool #
(/=) :: PlotType -> PlotType -> Bool #