dataframe-0.7.0.0: A fast, safe, and intuitive DataFrame library.
Safe Haskell: None
Language: Haskell2010

DataFrame.IO.Parquet

Documentation

data ParquetReadOptions

Options for reading Parquet data.

These options are applied in this order:

  1. predicate filtering
  2. column projection
  3. row range

Entries in selectedColumns are matched against leaf column names only.
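The ordering above matters because the row range counts rows that survive the predicate, not rows in the file. The following is a minimal sketch of that ordering over plain lists, assuming rows are simple association lists; Row, Opts, and applyOpts are hypothetical stand-ins for illustration and are not part of DataFrame.IO.Parquet.

```haskell
-- Illustrative model only: Row, Opts and applyOpts are hypothetical
-- stand-ins, not types or functions from DataFrame.IO.Parquet.
type Row = [(String, Int)]

data Opts = Opts
    { optColumns :: Maybe [String]
    , optPredicate :: Maybe (Row -> Bool)
    , optRange :: Maybe (Int, Int)
    }

-- Apply the three steps in the documented order:
-- predicate filtering, then column projection, then row range.
applyOpts :: Opts -> [Row] -> [Row]
applyOpts opts = sliceRows . projectCols . filterRows
  where
    filterRows = maybe id filter (optPredicate opts)
    projectCols = maybe id (\cols -> map (filter ((`elem` cols) . fst))) (optColumns opts)
    -- start-inclusive, end-exclusive slice
    sliceRows = maybe id (\(start, end) -> take (end - start) . drop start) (optRange opts)
```

In this model, with ten rows where id runs from 0 to 9 and score is id * 10, a predicate of score > 30, columns ["id"], and range (1, 3) yields the rows with id 5 and 6: the range is taken over the filtered rows, and score is available to the predicate even though it is not in the output.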

Constructors

ParquetReadOptions 

Fields

  • selectedColumns :: Maybe [Text]

    Columns to keep in the final dataframe. If set, only these columns are returned. Predicate-referenced columns are read automatically when needed and projected out after filtering.

  • predicate :: Maybe (Expr Bool)

    Optional row filter expression applied before projection.

  • rowRange :: Maybe (Int, Int)

    Optional row slice (start, end) with start-inclusive/end-exclusive semantics.
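The interaction between selectedColumns and predicate can be pictured as computing two column sets: the columns that must be decoded and the columns that are returned. This is a rough sketch under that assumption; columnsToDecode is a hypothetical helper, and the library's actual bookkeeping may differ.

```haskell
import Data.List (nub)

-- Hypothetical helper, not a library function.
-- First argument: the selectedColumns setting.
-- Second argument: column names referenced by the predicate.
-- Result: the columns to decode, or Nothing for "decode everything"
-- (assumed behaviour when no projection is requested).
columnsToDecode :: Maybe [String] -> [String] -> Maybe [String]
columnsToDecode Nothing _ = Nothing
columnsToDecode (Just selected) predicateCols = Just (nub (selected ++ predicateCols))
```

The output columns remain exactly those in selectedColumns; the extra predicate columns exist only long enough for filtering.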

defaultParquetReadOptions :: ParquetReadOptions

Default Parquet read options.

Equivalent to:

ParquetReadOptions
    { selectedColumns = Nothing
    , predicate = Nothing
    , rowRange = Nothing
    }

readParquet :: FilePath -> IO DataFrame

Read a Parquet file from the given path and load it into a DataFrame.

Example

ghci> D.readParquet "./data/mtcars.parquet"

readParquetWithOpts :: ParquetReadOptions -> FilePath -> IO DataFrame

Read a Parquet file using explicit read options.

Example

ghci> D.readParquetWithOpts
ghci|   (D.defaultParquetReadOptions{D.selectedColumns = Just ["id"], D.rowRange = Just (0, 10)})
ghci|   "./tests/data/alltypes_plain.parquet"

When selectedColumns is set and predicate references other columns, those predicate columns are auto-included for decoding, and the result is then projected back to the requested output columns.

cleanColPath :: [SNode] -> [String] -> [String]

Strip Parquet encoding artifact names (REPEATED wrappers and their single list-element children) from a raw column path, leaving user-visible names.
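To make the idea of stripping encoding artifacts concrete, here is one possible reading of that description as a small sketch. It assumes a simplified schema node carrying only a name, a REPEATED flag, and children; Node and stripWrappers are hypothetical, and the library's SNode type and cleanColPath logic are not shown on this page and may differ.

```haskell
-- Hypothetical, simplified schema node. The library's SNode type is not
-- shown on this page and may look quite different.
data Node = Node
    { nodeName :: String
    , nodeRepeated :: Bool
    , nodeChildren :: [Node]
    }

-- Walk the raw path alongside the schema. A REPEATED node with exactly one
-- child is treated as a list wrapper: its name is dropped, and so is the
-- child's name when it is the next path segment.
stripWrappers :: [Node] -> [String] -> [String]
stripWrappers _ [] = []
stripWrappers nodes (p : ps) =
    case filter ((== p) . nodeName) nodes of
        [] -> p : ps -- name not in the schema: leave the rest untouched
        (n : _)
            | nodeRepeated n, [c] <- nodeChildren n ->
                case ps of
                    (q : qs) | q == nodeName c -> stripWrappers (nodeChildren c) qs
                    _ -> stripWrappers [c] ps
            | otherwise -> p : stripWrappers (nodeChildren n) ps
```

Under these assumptions, a column tags stored with the three-level list encoding has the raw path ["tags", "list", "element"], which this sketch reduces to ["tags"]; a plain column such as ["id"] passes through unchanged.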

readParquetFiles :: FilePath -> IO DataFrame

Read Parquet files from a directory or glob path.

This is equivalent to calling readParquetFilesWithOpts with defaultParquetReadOptions.

readParquetFilesWithOpts :: ParquetReadOptions -> FilePath -> IO DataFrame

Read multiple Parquet files (directory or glob) using explicit options.

If path is a directory, all non-directory entries are read. If path is a glob, matching files are read.

For multi-file reads, rowRange is applied once after concatenation (global range semantics).

Example

ghci> D.readParquetFilesWithOpts
ghci|   (D.defaultParquetReadOptions{D.selectedColumns = Just ["id"], D.rowRange = Just (0, 5)})
ghci|   "./tests/data/alltypes_plain*.parquet"
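The global range semantics for multi-file reads can be illustrated with a small list model, where each inner list stands for the rows of one file. Both functions below are hypothetical and exist only to contrast the documented behaviour (slice once after concatenation) with the alternative of slicing each file separately.

```haskell
-- Hypothetical model: each inner list stands for the rows of one file.
-- Documented behaviour: concatenate first, then slice once.
globalRange :: (Int, Int) -> [[row]] -> [row]
globalRange (start, end) = take (end - start) . drop start . concat

-- For contrast only: slicing each file on its own gives a different result.
perFileRange :: (Int, Int) -> [[row]] -> [row]
perFileRange (start, end) = concatMap (take (end - start) . drop start)
```

For two files holding rows [0, 1, 2] and [3, 4, 5], a range of (2, 5) under globalRange selects [2, 3, 4], spanning the file boundary, whereas perFileRange would select [2, 5].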