dataframe-0.7.0.0: A fast, safe, and intuitive DataFrame library.
Safe Haskell: None
Language: Haskell2010

DataFrame.Operations.Join

Documentation

data JoinType Source #

Equivalent to SQL join types.

Constructors

INNER 
LEFT 
RIGHT 
FULL_OUTER 

Instances

Show JoinType Source # 
Defined in DataFrame.Operations.Join

join :: JoinType -> [Text] -> DataFrame -> DataFrame -> DataFrame Source #

Join two dataframes using SQL join semantics.

joinStrategyThreshold :: Int Source #

Row-count threshold for the build side. When the build side exceeds this, sort-merge join is used instead of hash join to avoid L3 cache thrashing.
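
The threshold rule can be illustrated with a minimal sketch. The names Strategy and chooseStrategy are hypothetical and not part of this module, and the threshold is passed as a parameter because its actual value is not stated here.

```haskell
-- Hypothetical names; only the "exceeds threshold => sort-merge" rule
-- comes from the documentation above.
data Strategy = HashJoin | SortMergeJoin
  deriving (Eq, Show)

-- | Pick a join strategy from the build-side row count.
chooseStrategy :: Int -> Int -> Strategy
chooseStrategy threshold buildRows
  | buildRows > threshold = SortMergeJoin  -- build side too large for a cache-friendly hash index
  | otherwise             = HashJoin
```

A build side exactly at the threshold still uses the hash join in this sketch, since the text says "exceeds".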

data CompactIndex Source #

A compact index mapping hash values to contiguous slices of original row indices. All indices live in a single unboxed vector; the HashMap stores (offset, length) into that vector.

Constructors

CompactIndex 

buildCompactIndex :: Vector Int -> CompactIndex Source #

Build a compact index from a vector of row hashes. Sorts (hash, originalIndex) pairs by hash, then scans for contiguous runs to populate the offset map.
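
As a rough illustration of the sort-then-scan construction, here is a list-based sketch. It is not the library's implementation: CompactIndexSketch, buildCompactIndexSketch and lookupRows are hypothetical names, and plain lists with an IntMap stand in for the unboxed vector and HashMap.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)
import qualified Data.IntMap.Strict as IM

-- Hypothetical stand-in for CompactIndex.
data CompactIndexSketch = CompactIndexSketch
  { sortedRows :: [Int]                 -- original row indices, ordered by hash
  , offsets    :: IM.IntMap (Int, Int)  -- hash -> (offset, length) into sortedRows
  } deriving (Eq, Show)

buildCompactIndexSketch :: [Int] -> CompactIndexSketch
buildCompactIndexSketch hashes = CompactIndexSketch rows (IM.fromList entries)
  where
    -- Pair each hash with its original row index and sort by hash (stable).
    pairs   = sortOn fst (zip hashes [0 ..])
    rows    = map snd pairs
    -- Contiguous runs of equal hashes.
    runs    = groupBy ((==) `on` fst) pairs
    -- Starting offset of each run within `rows`.
    starts  = scanl (+) 0 (map length runs)
    entries = [ (h, (start, length run)) | (run@((h, _) : _), start) <- zip runs starts ]

-- | Original row indices whose hash equals h.
lookupRows :: Int -> CompactIndexSketch -> [Int]
lookupRows h (CompactIndexSketch rows offs) =
  case IM.lookup h offs of
    Nothing         -> []
    Just (off, len) -> take len (drop off rows)
```

For the hashes [7, 3, 7, 3, 9], lookupRows 7 returns the rows [0, 2], read from one contiguous slice of the sorted row list.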

findGroupEnd :: Vector Int -> Int -> Int -> Int -> Int Source #

Find the end of a contiguous run of equal values starting at j.
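
The meaning of the three Int arguments is not documented here, so the following sketch assumes an order of (value, start position j, length n) and an exclusive end position. findRunEnd is a hypothetical name, and Data.Array is used in place of Vector.

```haskell
import Data.Array (Array, listArray, (!))

-- | First position at or after j (and below n) whose element differs from v.
-- Assumed argument order: array, value of the run, start j, array length n.
findRunEnd :: Array Int Int -> Int -> Int -> Int -> Int
findRunEnd xs v j n
  | j < n && xs ! j == v = findRunEnd xs v (j + 1) n  -- still inside the run
  | otherwise            = j                          -- exclusive end of the run
```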

sortWithIndices :: Vector Int -> (Vector Int, Vector Int) Source #

Sort a hash vector, returning sorted hashes and corresponding original indices. Sorts an index array using hash values as the comparison key, avoiding the intermediate pair vector used by the naive zip-then-sort approach.
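
A minimal sketch of the index-sorting idea, assuming an immutable array for O(1) hash lookups. sortWithIndicesSketch is a hypothetical name, and lists stand in for the vectors in the real signature.

```haskell
import Data.Array (listArray, (!))
import Data.List (sortBy)
import Data.Ord (comparing)

-- | Returns (sorted hashes, original indices in sorted order).
sortWithIndicesSketch :: [Int] -> ([Int], [Int])
sortWithIndicesSketch hashes = (map (arr !) order, order)
  where
    n     = length hashes
    arr   = listArray (0, n - 1) hashes
    -- Sort the index list itself; hashes are consulted only as comparison keys,
    -- so no (hash, index) pairs are materialised.
    order = sortBy (comparing (arr !)) [0 .. n - 1]
```

Because sortBy is stable, rows with equal hashes keep their original relative order.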

fillCrossProduct :: Vector Int -> Vector Int -> Int -> Int -> Int -> Int -> MVector s Int -> MVector s Int -> Int -> ST s () Source #

Write the cross product of two index ranges into mutable vectors.
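
The Int arguments of fillCrossProduct are not described here, so the sketch below takes the two index ranges as plain lists and a single start position. fillCrossSketch and crossProductSketch are hypothetical names, and STUArray stands in for MVector.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, getElems, newArray, writeArray)

-- | Write every (l, r) pairing into the two output arrays, starting at `start`.
fillCrossSketch
  :: [Int] -> [Int] -> STUArray s Int Int -> STUArray s Int Int -> Int -> ST s ()
fillCrossSketch ls rs outL outR start =
  forM_ (zip [start ..] [ (l, r) | l <- ls, r <- rs ]) $ \(pos, (l, r)) -> do
    writeArray outL pos l
    writeArray outR pos r

-- | Pure wrapper: allocate both buffers, fill them, and read them back.
crossProductSketch :: [Int] -> [Int] -> ([Int], [Int])
crossProductSketch ls rs = runST $ do
  let n = length ls * length rs
  outL <- newArray (0, n - 1) 0
  outR <- newArray (0, n - 1) 0
  fillCrossSketch ls rs outL outR 0
  l <- getElems outL
  r <- getElems outR
  pure (l, r)
```

Two left rows and three right rows sharing a key produce six output pairs, which is why low-cardinality keys can inflate the output so quickly.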

keyColIndices :: Set Text -> DataFrame -> [Int] Source #

Compute key-column indices from the column index map.

innerJoin :: [Text] -> DataFrame -> DataFrame -> DataFrame Source #

Performs an inner join on two dataframes using the specified key columns. Returns only rows where the key values exist in both dataframes.

Example

ghci> df = D.fromNamedColumns [("key", D.fromList ["K0", "K1", "K2", "K3"]), ("A", D.fromList ["A0", "A1", "A2", "A3"])]
ghci> other = D.fromNamedColumns [("key", D.fromList ["K0", "K1", "K2"]), ("B", D.fromList ["B0", "B1", "B2"])]
ghci> D.innerJoin ["key"] df other

-----------------
 key  |  A  |  B
------|-----|----
 Text | Text| Text
------|-----|----
 K0   | A0  | B0
 K1   | A1  | B1
 K2   | A2  | B2

buildHashColumn :: [Text] -> DataFrame -> Vector Int Source #

Compute hashes for the given key column names in a DataFrame.

hashProbeKernel Source #

Arguments

:: CompactIndex

Built once from the full right/build side.

-> Vector Int

Probe hashes (one batch).

-> (Vector Int, Vector Int) 

Probe one batch of rows against a pre-built CompactIndex. Returns (probeExpandedIxs, buildExpandedIxs). Unlike hashInnerKernel, it does not build the index (the index is pre-built once) and has no cross-product row guard; the caller controls the probe batch size.
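
The probe step can be sketched as follows. This is an illustration under simplifying assumptions, not the library's kernel: IndexSketch, buildIndexSketch and probeSketch are hypothetical names, and an IntMap from hash to row list replaces the CompactIndex.

```haskell
import qualified Data.IntMap.Strict as IM

-- Hypothetical simplified index: hash -> build-side row indices (ascending).
type IndexSketch = IM.IntMap [Int]

-- | Built once from the build side.
buildIndexSketch :: [Int] -> IndexSketch
buildIndexSketch hashes =
  IM.fromListWith (flip (++)) [ (h, [i]) | (i, h) <- zip [0 ..] hashes ]

-- | Probe one batch: emit one (probe row, build row) pair per hash match.
probeSketch :: IndexSketch -> [Int] -> ([Int], [Int])
probeSketch index probeHashes = unzip
  [ (p, b)
  | (p, h) <- zip [0 ..] probeHashes
  , b <- IM.findWithDefault [] h index
  ]
```

Building the index once and calling probeSketch per batch keeps the size of each result bounded by the batch the caller chooses.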

maxJoinOutputRows :: Int Source #

Maximum number of output rows allowed from a join kernel. Exceeding this limit indicates a cross-product explosion (e.g. low-cardinality keys).

The hash-based inner join kernel (hashInnerKernel) builds a compact index on buildHashes (second argument) and probes with probeHashes (first argument). It returns (probeExpandedIndices, buildExpandedIndices) and uses a dynamically growing output buffer to avoid pre-allocating the full cross-product size, which can be astronomically large for low-cardinality keys.

sortMergeInnerKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int) Source #

Sort-merge inner join kernel. Sorts both sides by hash, walks in lockstep. Returns (leftExpandedIndices, rightExpandedIndices). Uses a dynamically growing output buffer instead of a two-pass count-then-allocate strategy, which OOMs when low-cardinality keys produce large cross products.
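
A list-based sketch of the lockstep walk, for illustration only. sortMergeInnerSketch is a hypothetical name, and lazy lists take the place of the growing output buffer.

```haskell
import Data.List (sortOn)

-- | Returns (left row indices, right row indices) for every hash match.
sortMergeInnerSketch :: [Int] -> [Int] -> ([Int], [Int])
sortMergeInnerSketch leftHashes rightHashes = unzip (go ls rs)
  where
    -- Sort (hash, originalIndex) pairs on each side.
    ls = sortOn fst (zip leftHashes [0 ..])
    rs = sortOn fst (zip rightHashes [0 ..])

    go [] _ = []
    go _ [] = []
    go xs@((hl, _) : xs') ys@((hr, _) : ys')
      | hl < hr   = go xs' ys   -- left hash too small: advance left
      | hl > hr   = go xs ys'   -- right hash too small: advance right
      | otherwise =
          -- Equal hashes: take both runs and emit their cross product.
          let (lRun, lRest) = span ((== hl) . fst) xs
              (rRun, rRest) = span ((== hl) . fst) ys
          in [ (i, j) | (_, i) <- lRun, (_, j) <- rRun ] ++ go lRest rRest
```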

assembleInner :: Set Text -> DataFrame -> DataFrame -> Vector Int -> Vector Int -> DataFrame Source #

Assemble the result DataFrame for an inner join from expanded index vectors.

leftJoin :: [Text] -> DataFrame -> DataFrame -> DataFrame Source #

Performs a left join on two dataframes using the specified key columns. Returns all rows from the left dataframe, with matching rows from the right dataframe. Non-matching rows will have Nothing/null values for columns from the right dataframe.

Example

ghci> df = D.fromNamedColumns [("key", D.fromList ["K0", "K1", "K2", "K3"]), ("A", D.fromList ["A0", "A1", "A2", "A3"])]
ghci> other = D.fromNamedColumns [("key", D.fromList ["K0", "K1", "K2"]), ("B", D.fromList ["B0", "B1", "B2"])]
ghci> D.leftJoin ["key"] df other

------------------------
 key  |  A  |     B
------|-----|----------
 Text | Text| Maybe Text
------|-----|----------
 K0   | A0  | Just B0
 K1   | A1  | Just B1
 K2   | A2  | Just B2
 K3   | A3  | Nothing

hashLeftKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int) Source #

Hash-based left join kernel. Returns (leftExpandedIndices, rightExpandedIndices) where right indices use -1 as sentinel for unmatched rows. Uses a dynamically growing output buffer to avoid pre-allocating the full cross-product size (which can be astronomically large for low-cardinality keys).
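
The -1 sentinel convention can be sketched like this. hashLeftSketch is a hypothetical name, and an IntMap of row lists stands in for the compact index.

```haskell
import qualified Data.IntMap.Strict as IM

-- | Returns (left row indices, right row indices); -1 marks "no right match".
hashLeftSketch :: [Int] -> [Int] -> ([Int], [Int])
hashLeftSketch leftHashes rightHashes =
  unzip (concatMap expand (zip [0 ..] leftHashes))
  where
    -- hash -> right-side row indices, in ascending order
    index = IM.fromListWith (flip (++)) [ (h, [j]) | (j, h) <- zip [0 ..] rightHashes ]
    expand (i, h) = case IM.lookup h index of
      Nothing -> [(i, -1)]               -- keep the left row, flag the right side
      Just js -> [ (i, j) | j <- js ]    -- one output row per matching right row
```

Every left row appears at least once in the output, which is what distinguishes this from the inner kernel.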

sortMergeLeftKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int) Source #

Sort-merge left join kernel. Returns (leftExpandedIndices, rightExpandedIndices) with -1 sentinel. Uses a dynamically growing output buffer instead of a two-pass count-then-allocate strategy, which OOMs when low-cardinality keys produce large cross products.

assembleLeft :: Set Text -> DataFrame -> DataFrame -> Vector Int -> Vector Int -> DataFrame Source #

Assemble the result DataFrame for a left join. The right index vector uses -1 as a sentinel for unmatched rows, and right-side columns are gathered via gatherWithSentinel.
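
The signature of gatherWithSentinel is not shown on this page, so the following is only a guess at its behaviour on a single column: negative indices become Nothing, all others are looked up. gatherWithSentinelSketch is a hypothetical name.

```haskell
import Data.Array (listArray, (!))

-- | Gather column values by row index; a negative index yields Nothing.
gatherWithSentinelSketch :: [a] -> [Int] -> [Maybe a]
gatherWithSentinelSketch column = map pick
  where
    arr = listArray (0, length column - 1) column
    pick i
      | i < 0     = Nothing         -- sentinel: no matching row
      | otherwise = Just (arr ! i)  -- regular gather
```

Applied to the B column of the leftJoin example with indices [0, 1, 2, -1], this gives Just "B0", Just "B1", Just "B2", Nothing, matching the Maybe Text column in the output table.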

rightJoin :: [Text] -> DataFrame -> DataFrame -> DataFrame Source #

Performs a right join on two dataframes using the specified key columns. Returns all rows from the right dataframe, with matching rows from the left dataframe. Non-matching rows will have Nothing/null values for columns from the left dataframe.

Example

ghci> df = D.fromNamedColumns [("key", D.fromList ["K0", "K1", "K2", "K3"]), ("A", D.fromList ["A0", "A1", "A2", "A3"])]
ghci> other = D.fromNamedColumns [("key", D.fromList ["K0", "K1"]), ("B", D.fromList ["B0", "B1"])]
ghci> D.rightJoin ["key"] df other

-----------------
 key  |  A  |  B
------|-----|----
 Text | Text| Text
------|-----|----
 K0   | A0  | B0
 K1   | A1  | B1

hashFullOuterKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int) Source #

Hash-based full outer join kernel. Builds compact indices on both sides. Returns (leftExpandedIndices, rightExpandedIndices) with -1 sentinels.
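
A rough sketch of the two-sided sentinel output. The output ordering (matched and left-only rows first, then right-only rows) is an assumption, fullOuterSketch is a hypothetical name, and an IntSet of left hashes replaces the second compact index.

```haskell
import qualified Data.IntMap.Strict as IM
import qualified Data.IntSet as IS

-- | Returns (left row indices, right row indices); -1 marks a missing side.
fullOuterSketch :: [Int] -> [Int] -> ([Int], [Int])
fullOuterSketch leftHashes rightHashes = unzip (matchedOrLeftOnly ++ rightOnly)
  where
    rightIndex = IM.fromListWith (flip (++)) [ (h, [j]) | (j, h) <- zip [0 ..] rightHashes ]
    leftKeys   = IS.fromList leftHashes
    -- Left pass: matches, or (i, -1) when the right side has no such hash.
    matchedOrLeftOnly = concat
      [ maybe [(i, -1)] (map (\j -> (i, j))) (IM.lookup h rightIndex)
      | (i, h) <- zip [0 ..] leftHashes
      ]
    -- Right pass: rows whose hash never occurs on the left.
    rightOnly = [ (-1, j) | (j, h) <- zip [0 ..] rightHashes, not (IS.member h leftKeys) ]
```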

sortMergeFullOuterKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int) Source #

Sort-merge full outer join kernel. Returns (leftExpandedIndices, rightExpandedIndices) with -1 sentinels.

assembleFullOuter :: Set Text -> DataFrame -> DataFrame -> Vector Int -> Vector Int -> DataFrame Source #

Assemble the result DataFrame for a full outer join. Both index vectors use -1 as a sentinel, and all columns are gathered via gatherWithSentinel. Key columns are coalesced (first non-null wins).
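
"First non-null wins" corresponds to the Alternative instance of Maybe. A minimal sketch on one key column, where coalesceKeys is a hypothetical name and Maybe models nullable cells:

```haskell
import Control.Applicative ((<|>))

-- | Combine the left and right copies of a key column row by row,
-- preferring the left value when it is present.
coalesceKeys :: [Maybe a] -> [Maybe a] -> [Maybe a]
coalesceKeys = zipWith (<|>)
```

A row that exists only on the right has Nothing in the left key column, so its key is taken from the right.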