| Safe Haskell | None |
|---|---|
| Language | Haskell2010 |
DataFrame.Operations.Join
Synopsis
- data JoinType = INNER | LEFT | RIGHT | FULL_OUTER
- join :: JoinType -> [Text] -> DataFrame -> DataFrame -> DataFrame
- joinStrategyThreshold :: Int
- data CompactIndex = CompactIndex {}
- buildCompactIndex :: Vector Int -> CompactIndex
- findGroupEnd :: Vector Int -> Int -> Int -> Int -> Int
- sortWithIndices :: Vector Int -> (Vector Int, Vector Int)
- fillCrossProduct :: Vector Int -> Vector Int -> Int -> Int -> Int -> Int -> MVector s Int -> MVector s Int -> Int -> ST s ()
- keyColIndices :: Set Text -> DataFrame -> [Int]
- innerJoin :: [Text] -> DataFrame -> DataFrame -> DataFrame
- buildHashColumn :: [Text] -> DataFrame -> Vector Int
- hashProbeKernel :: CompactIndex -> Vector Int -> (Vector Int, Vector Int)
- maxJoinOutputRows :: Int
- hashInnerKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int)
- sortMergeInnerKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int)
- assembleInner :: Set Text -> DataFrame -> DataFrame -> Vector Int -> Vector Int -> DataFrame
- leftJoin :: [Text] -> DataFrame -> DataFrame -> DataFrame
- hashLeftKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int)
- sortMergeLeftKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int)
- assembleLeft :: Set Text -> DataFrame -> DataFrame -> Vector Int -> Vector Int -> DataFrame
- rightJoin :: [Text] -> DataFrame -> DataFrame -> DataFrame
- fullOuterJoin :: [Text] -> DataFrame -> DataFrame -> DataFrame
- hashFullOuterKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int)
- sortMergeFullOuterKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int)
- assembleFullOuter :: Set Text -> DataFrame -> DataFrame -> Vector Int -> Vector Int -> DataFrame
Documentation
data JoinType Source #
Equivalent to SQL join types.
Constructors
| INNER | |
| LEFT | |
| RIGHT | |
| FULL_OUTER | |
join :: JoinType -> [Text] -> DataFrame -> DataFrame -> DataFrame Source #
Join two dataframes using SQL join semantics.
joinStrategyThreshold :: Int Source #
Row-count threshold for the build side. When the build side exceeds this, sort-merge join is used instead of hash join to avoid L3 cache thrashing.
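The threshold rule above can be sketched as a small selection function. This is only an illustration: the `JoinStrategy` type, the `chooseStrategy` function, and the value of `exampleThreshold` are hypothetical and are not taken from the library.

```haskell
-- Hypothetical strategy type; not part of the documented API.
data JoinStrategy = HashJoin | SortMergeJoin
  deriving (Eq, Show)

-- Placeholder value for illustration only; the real threshold is not shown here.
exampleThreshold :: Int
exampleThreshold = 500000

-- Pick a strategy from the build-side row count:
-- sort-merge once the build side exceeds the threshold, hash join otherwise.
chooseStrategy :: Int -> JoinStrategy
chooseStrategy buildRows
  | buildRows > exampleThreshold = SortMergeJoin
  | otherwise = HashJoin
```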
data CompactIndex Source #
A compact index mapping hash values to contiguous slices of
original row indices. All indices live in a single unboxed vector;
the HashMap stores (offset, length) into that vector.
Constructors
| CompactIndex | |
buildCompactIndex :: Vector Int -> CompactIndex Source #
Build a compact index from a vector of row hashes.
Sorts (hash, originalIndex) pairs by hash, then scans for
contiguous runs to populate the offset map.
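A minimal list-based sketch of the sort-then-scan construction described above. The `Index` record, `buildIndex`, and `lookupRows` are hypothetical stand-ins; the real `CompactIndex` uses an unboxed vector and a HashMap, and its fields are not shown on this page.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)
import qualified Data.Map.Strict as M

-- Hypothetical stand-in for CompactIndex:
-- 'slots' holds original row indices ordered by hash,
-- 'offsets' maps each hash to an (offset, length) slice of 'slots'.
data Index = Index
  { slots   :: [Int]
  , offsets :: M.Map Int (Int, Int)
  } deriving (Eq, Show)

buildIndex :: [Int] -> Index
buildIndex hashes = Index (map snd sorted) (M.fromList runs)
  where
    -- Pair each hash with its original row index, then sort by hash (stable).
    sorted = sortOn fst (zip hashes [0 ..])
    -- Contiguous runs of equal hashes.
    groups = groupBy ((==) `on` fst) sorted
    -- Starting offset of each run within 'slots'.
    starts = scanl (+) 0 (map length groups)
    runs   = [ (h, (o, length g)) | (g@((h, _) : _), o) <- zip groups starts ]

-- Rows whose hash equals h, read from the contiguous slice.
lookupRows :: Index -> Int -> [Int]
lookupRows (Index s offs) h = case M.lookup h offs of
  Nothing     -> []
  Just (o, n) -> take n (drop o s)
```

With the hashes `[7, 3, 7, 5]`, rows 0 and 2 share hash 7 and end up adjacent in `slots`, so a single `(offset, length)` pair describes them.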
findGroupEnd :: Vector Int -> Int -> Int -> Int -> Int Source #
Find the end of a contiguous run of equal values starting at j.
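As a rough sketch, the run scan might look like the following. The argument order (sorted values, the run's value, start position `j`, total length `n`) and the exclusive end position are assumptions; the page gives only the type. `groupEnd` is a hypothetical name, and `Data.Array` stands in for the vector type.

```haskell
import Data.Array (Array, listArray, (!))

-- Assumed argument order: sorted values, run value, start j, length n.
-- Returns the first position at or after j whose value differs from v
-- (or n if the run reaches the end), i.e. an exclusive end index.
groupEnd :: Array Int Int -> Int -> Int -> Int -> Int
groupEnd xs v j n
  | j < n && xs ! j == v = groupEnd xs v (j + 1) n
  | otherwise            = j

-- Small sample input for experimentation.
sample :: Array Int Int
sample = listArray (0, 4) [1, 1, 1, 2, 3]
```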
sortWithIndices :: Vector Int -> (Vector Int, Vector Int) Source #
Sort a hash vector, returning sorted hashes and corresponding original indices. Sorts an index array using hash values as the comparison key, avoiding the intermediate pair vector used by the naive zip-then-sort approach.
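The index-sort idea can be illustrated with plain lists and an immutable array. This is a sketch under assumptions, not the library's implementation: `sortWithIxs` is a hypothetical name, and `Data.Array` replaces the unboxed vector.

```haskell
import Data.Array (listArray, (!))
import Data.List (sortBy)
import Data.Ord (comparing)

-- Sort the index list [0 .. n-1] using the hash at each index as the key,
-- then read the sorted hashes back through the resulting order.
-- No intermediate list of (hash, index) pairs is built.
sortWithIxs :: [Int] -> ([Int], [Int])
sortWithIxs hashes = (map (arr !) order, order)
  where
    n     = length hashes
    arr   = listArray (0, n - 1) hashes
    order = sortBy (comparing (arr !)) [0 .. n - 1]
```

Because `sortBy` is stable, rows with equal hashes keep their original relative order.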
fillCrossProduct :: Vector Int -> Vector Int -> Int -> Int -> Int -> Int -> MVector s Int -> MVector s Int -> Int -> ST s () Source #
Write the cross product of two index ranges into mutable vectors.
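A simplified sketch of writing a cross product into two mutable output buffers. It assumes list inputs and `STUArray` buffers rather than the vector types in the signature, and the names `fillCross` and `crossProduct` are hypothetical.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, getElems, newArray, writeArray)

-- Write every (l, r) pairing into the two buffers, starting at position pos.
-- Left indices go to outL, right indices to outR, at the same position.
fillCross :: [Int] -> [Int] -> STUArray s Int Int -> STUArray s Int Int -> Int -> ST s ()
fillCross ls rs outL outR pos =
  forM_ (zip [pos ..] [ (l, r) | l <- ls, r <- rs ]) $ \(i, (l, r)) -> do
    writeArray outL i l
    writeArray outR i r

-- Allocate exactly |ls| * |rs| slots, fill them, and freeze to lists.
crossProduct :: [Int] -> [Int] -> ([Int], [Int])
crossProduct ls rs = runST $ do
  let n = length ls * length rs
  outL <- newArray (0, n - 1) 0
  outR <- newArray (0, n - 1) 0
  fillCross ls rs outL outR 0
  (,) <$> getElems outL <*> getElems outR
```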
keyColIndices :: Set Text -> DataFrame -> [Int] Source #
Compute key-column indices from the column index map.
innerJoin :: [Text] -> DataFrame -> DataFrame -> DataFrame Source #
Performs an inner join on two dataframes using the specified key columns. Returns only rows where the key values exist in both dataframes.
Example
ghci> df = D.fromNamedColumns [("key", D.fromList ["K0", "K1", "K2", "K3"]), ("A", D.fromList ["A0", "A1", "A2", "A3"])]
ghci> other = D.fromNamedColumns [("key", D.fromList ["K0", "K1", "K2"]), ("B", D.fromList ["B0", "B1", "B2"])]
ghci> D.innerJoin ["key"] df other
-----------------
key | A | B
------|-----|----
Text | Text| Text
------|-----|----
K0 | A0 | B0
K1 | A1 | B1
K2 | A2 | B2
buildHashColumn :: [Text] -> DataFrame -> Vector Int Source #
Compute hashes for the given key column names in a DataFrame.
hashProbeKernel Source #
Arguments
| :: CompactIndex | Built once from the full right/build side. |
| -> Vector Int | Probe hashes (one batch). |
| -> (Vector Int, Vector Int) | |
Probe one batch of rows against a pre-built CompactIndex.
Returns (probeExpandedIxs, buildExpandedIxs).
Unlike hashInnerKernel, does not build the index (it is pre-built once)
and has no cross-product row guard — the caller controls probe batch size.
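Probing a batch against a pre-built index can be sketched as below. `BuildIndex`, `buildIndex`, and `probeBatch` are hypothetical names, and a `Data.Map` of row lists stands in for `CompactIndex`. The returned probe indices are relative to the batch, which is an assumption about how a caller would stitch batches together.

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical stand-in for a pre-built index: hash -> build-side row indices.
type BuildIndex = M.Map Int [Int]

-- Built once from the whole build side; rows per hash stay in ascending order.
buildIndex :: [Int] -> BuildIndex
buildIndex hashes =
  M.fromListWith (flip (++)) [ (h, [i]) | (h, i) <- zip hashes [0 ..] ]

-- Probe one batch: each probe row is paired with every build row of equal hash.
-- No row-count guard is applied here; the caller bounds the batch size.
probeBatch :: BuildIndex -> [Int] -> ([Int], [Int])
probeBatch idx probeHashes = unzip
  [ (p, b)
  | (p, h) <- zip [0 ..] probeHashes
  , b <- M.findWithDefault [] h idx
  ]
```

The index is built once and then reused across calls to `probeBatch`, which mirrors the split between build and probe described above.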
maxJoinOutputRows :: Int Source #
Maximum number of output rows allowed from a join kernel. Exceeding this limit indicates a cross-product explosion (e.g. low-cardinality keys).
hashInnerKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int) Source #
Hash-based inner join kernel.
Builds compact index on buildHashes (second arg), probes with
probeHashes (first arg).
Returns (probeExpandedIndices, buildExpandedIndices).
Uses a dynamically growing output buffer to avoid pre-allocating the full
cross-product size (which can be astronomically large for low-cardinality keys).
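The interplay between a hash-based inner kernel and an output-row limit can be sketched as follows. Everything here is illustrative: `hashInnerWithLimit` is a hypothetical name, the limit is passed as a parameter rather than read from `maxJoinOutputRows`, and reporting the overflow as `Left` is an assumption, since the page does not say how the library reacts.

```haskell
import qualified Data.Map.Strict as M

-- Inner join on hashes with a row guard.
-- First argument: maximum allowed output rows (hypothetical parameter).
-- Second argument: probe hashes. Third argument: build hashes.
hashInnerWithLimit :: Int -> [Int] -> [Int] -> Either String ([Int], [Int])
hashInnerWithLimit limit probeHashes buildHashes
  -- Only the first (limit + 1) pairs are forced, so an exploding
  -- cross product is detected without materialising all of it.
  | length (take (limit + 1) pairs) > limit = Left "join output exceeds row limit"
  | otherwise                               = Right (unzip pairs)
  where
    idx   = M.fromListWith (flip (++)) [ (h, [i]) | (h, i) <- zip buildHashes [0 ..] ]
    pairs = [ (p, b)
            | (p, h) <- zip [0 ..] probeHashes
            , b <- M.findWithDefault [] h idx
            ]
```

Two rows on each side sharing one key already yield four output rows, which is why low-cardinality keys are the typical trigger for the guard.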
sortMergeInnerKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int) Source #
Sort-merge inner join kernel.
Sorts both sides by hash, walks in lockstep.
Returns (leftExpandedIndices, rightExpandedIndices).
Uses a dynamically growing output buffer instead of a two-pass count-then-allocate
strategy, which OOMs when low-cardinality keys produce large cross products.
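A list-based sketch of the sort-then-lockstep walk. It assumes plain lists in place of vectors and builds the output lazily instead of using a growing buffer; `sortMergeInner` is a hypothetical name.

```haskell
import Data.List (sortOn)

-- Sort both sides by hash (keeping original row indices), then walk both
-- sorted lists together, emitting the cross product of each matching run.
sortMergeInner :: [Int] -> [Int] -> ([Int], [Int])
sortMergeInner leftHashes rightHashes = unzip (go ls rs)
  where
    ls = sortOn fst (zip leftHashes [0 ..])
    rs = sortOn fst (zip rightHashes [0 ..])

    go [] _ = []
    go _ [] = []
    go xs@((hl, _) : _) ys@((hr, _) : _)
      -- Left hash is smaller: skip the whole left run.
      | hl < hr = go (dropWhile ((== hl) . fst) xs) ys
      -- Right hash is smaller: skip the whole right run.
      | hl > hr = go xs (dropWhile ((== hr) . fst) ys)
      -- Equal hashes: pair every left row in the run with every right row.
      | otherwise =
          let (lrun, xs') = span ((== hl) . fst) xs
              (rrun, ys') = span ((== hr) . fst) ys
          in [ (li, ri) | (_, li) <- lrun, (_, ri) <- rrun ] ++ go xs' ys'
```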
assembleInner :: Set Text -> DataFrame -> DataFrame -> Vector Int -> Vector Int -> DataFrame Source #
Assemble the result DataFrame for an inner join from expanded index vectors.
leftJoin :: [Text] -> DataFrame -> DataFrame -> DataFrame Source #
Performs a left join on two dataframes using the specified key columns. Returns all rows from the left dataframe, with matching rows from the right dataframe. Non-matching rows will have Nothing/null values for columns from the right dataframe.
Example
ghci> df = D.fromNamedColumns [("key", D.fromList ["K0", "K1", "K2", "K3"]), ("A", D.fromList ["A0", "A1", "A2", "A3"])]
ghci> other = D.fromNamedColumns [("key", D.fromList ["K0", "K1", "K2"]), ("B", D.fromList ["B0", "B1", "B2"])]
ghci> D.leftJoin ["key"] df other
------------------------
key | A | B
------|-----|----------
Text | Text| Maybe Text
------|-----|----------
K0 | A0 | Just B0
K1 | A1 | Just B1
K2 | A2 | Just B2
K3 | A3 | Nothing
hashLeftKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int) Source #
Hash-based left join kernel.
Returns (leftExpandedIndices, rightExpandedIndices) where
right indices use -1 as sentinel for unmatched rows.
Uses a dynamically growing output buffer to avoid pre-allocating the full
cross-product size (which can be astronomically large for low-cardinality keys).
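The -1 sentinel convention can be illustrated with a small list-based left kernel. This is a sketch, not the library's code: `hashLeft` is a hypothetical name and a `Data.Map` replaces the compact index.

```haskell
import qualified Data.Map.Strict as M

-- Left join on hashes: every left row appears at least once.
-- Unmatched left rows are paired with the sentinel -1 on the right.
hashLeft :: [Int] -> [Int] -> ([Int], [Int])
hashLeft leftHashes rightHashes = unzip (concatMap expand (zip [0 ..] leftHashes))
  where
    idx = M.fromListWith (flip (++)) [ (h, [i]) | (h, i) <- zip rightHashes [0 ..] ]
    expand (l, h) = case M.lookup h idx of
      Nothing -> [(l, -1)]              -- no match: keep the row, mark with -1
      Just rs -> [ (l, r) | r <- rs ]   -- one output row per matching right row
```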
sortMergeLeftKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int) Source #
Sort-merge left join kernel.
Returns (leftExpandedIndices, rightExpandedIndices) with -1 sentinel.
Uses a dynamically growing output buffer instead of a two-pass count-then-allocate
strategy, which OOMs when low-cardinality keys produce large cross products.
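The growing-buffer strategy mentioned for these kernels can be sketched in `ST` with a capacity-doubling array. The `Buf` type and its functions are hypothetical; the library's actual buffer and growth policy are not described on this page, so doubling is an assumption.

```haskell
import Control.Monad (forM_)
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, getBounds, newArray, readArray, writeArray)
import Data.STRef (STRef, newSTRef, readSTRef, writeSTRef)

-- Hypothetical growable buffer: a backing array plus the number of used slots.
data Buf s = Buf (STRef s (STUArray s Int Int)) (STRef s Int)

newBuf :: Int -> ST s (Buf s)
newBuf cap = do
  arr <- newArray (0, max 1 cap - 1) 0
  Buf <$> newSTRef arr <*> newSTRef 0

-- Append one value, doubling the capacity when the buffer is full.
push :: Buf s -> Int -> ST s ()
push (Buf arrRef lenRef) x = do
  arr <- readSTRef arrRef
  len <- readSTRef lenRef
  (_, hi) <- getBounds arr
  arr' <-
    if len <= hi
      then pure arr
      else do
        bigger <- newArray (0, 2 * (hi + 1) - 1) 0
        forM_ [0 .. hi] $ \i -> readArray arr i >>= writeArray bigger i
        writeSTRef arrRef bigger
        pure bigger
  writeArray arr' len x
  writeSTRef lenRef (len + 1)

-- Read back only the used prefix.
bufToList :: Buf s -> ST s [Int]
bufToList (Buf arrRef lenRef) = do
  arr <- readSTRef arrRef
  len <- readSTRef lenRef
  mapM (readArray arr) [0 .. len - 1]

-- Usage: start with capacity 2 and push n values without knowing n up front.
collect :: Int -> [Int]
collect n = runST $ do
  b <- newBuf 2
  forM_ [1 .. n] (push b)
  bufToList b
```

The point of the design is that memory tracks the rows actually emitted, so no up-front counting pass over the cross product is needed.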
assembleLeft :: Set Text -> DataFrame -> DataFrame -> Vector Int -> Vector Int -> DataFrame Source #
Assemble the result DataFrame for a left join.
Right index vectors use -1 sentinel, gathered via gatherWithSentinel.
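A minimal sketch of a sentinel-aware gather, assuming list columns. `gatherMaybe` is a hypothetical name used here to avoid implying anything about the real `gatherWithSentinel`, whose signature is not shown on this page.

```haskell
import Data.Array (listArray, (!))

-- Gather values from a column by index. A negative index (the -1 sentinel)
-- yields Nothing; any other index yields Just the value at that position.
gatherMaybe :: [a] -> [Int] -> [Maybe a]
gatherMaybe column ixs = map pick ixs
  where
    arr = listArray (0, length column - 1) column
    pick i
      | i < 0     = Nothing
      | otherwise = Just (arr ! i)
```

Applied to a right-hand column `["B0", "B1", "B2"]` with indices `[0, 1, 2, -1]`, this yields three `Just` values followed by `Nothing`, the same shape as the `Maybe Text` column in the left join example.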
rightJoin :: [Text] -> DataFrame -> DataFrame -> DataFrame Source #
Performs a right join on two dataframes using the specified key columns. Returns all rows from the right dataframe, with matching rows from the left dataframe. Non-matching rows will have Nothing/null values for columns from the left dataframe.
Example
ghci> df = D.fromNamedColumns [("key", D.fromList ["K0", "K1", "K2", "K3"]), ("A", D.fromList ["A0", "A1", "A2", "A3"])]
ghci> other = D.fromNamedColumns [("key", D.fromList ["K0", "K1"]), ("B", D.fromList ["B0", "B1"])]
ghci> D.rightJoin ["key"] df other
-----------------
key | A | B
------|-----|----
Text | Text| Text
------|-----|----
K0 | A0 | B0
K1 | A1 | B1
hashFullOuterKernel :: Vector Int -> Vector Int -> (Vector Int, Vector Int) Source #
Hash-based full outer join kernel.
Builds compact indices on both sides.
Returns (leftExpandedIndices, rightExpandedIndices) with -1 sentinels.
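A list-based sketch of a full outer kernel with -1 sentinels on both sides. It is illustrative only: `hashFullOuter` is a hypothetical name, a `Map` on the right and a `Set` on the left stand in for the two compact indices, and the output row order (left-driven rows first, then right-only rows) is an assumption.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

hashFullOuter :: [Int] -> [Int] -> ([Int], [Int])
hashFullOuter leftHashes rightHashes = unzip (matchedOrLeft ++ rightOnly)
  where
    rightIdx = M.fromListWith (flip (++)) [ (h, [i]) | (h, i) <- zip rightHashes [0 ..] ]
    leftKeys = S.fromList leftHashes

    -- Every left row: paired with its matches, or with -1 when there are none.
    matchedOrLeft = concat
      [ maybe [(l, -1)] (map (\r -> (l, r))) (M.lookup h rightIdx)
      | (l, h) <- zip [0 ..] leftHashes
      ]

    -- Right rows whose hash never occurs on the left get -1 on the left side.
    rightOnly =
      [ (-1, r)
      | (r, h) <- zip [0 ..] rightHashes
      , not (S.member h leftKeys)
      ]
```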