{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE DeriveAnyClass #-}
{-# OPTIONS_GHC -fno-warn-unused-imports #-}
-- |
-- Module      :  $header
-- Copyright   :  (c) Laurent P. René de Cotret
-- License     :  MIT
-- Maintainer  :  laurent.decotret@outlook.com
-- Portability :  portable
--
module Data.Frame.Tutorial (
    -- * Introduction
    -- $introduction

    -- * Quick start
    -- $quickstart

    -- * Defining types
    -- $construction

    -- * Advanced indexing
    -- $advindexing

    -- * Merging dataframes
    -- ** Zipping
    -- $zipping

    -- ** Merging by key
    -- $merging

) where

import Data.Frame as Frame
import Data.Functor.Identity (Identity)
import qualified Data.List (zipWith)
import Data.Vector (Vector)
import qualified Data.Vector as Vector
import GHC.Generics (Generic)

{- $introduction

This is a short user guide on how to get started using @javelin-frames@.

The central data structure at the heart of this package is the dataframe.
A dataframe, represented by @`Frame` t@ for some record-type @t@, is a
record whose values are arrays representing columns.

-}

{- $quickstart
Let's look at a real example. We'll import the "Data.Frame" module to disambiguate
some functions:

>>> import Data.Frame as Frame

and we need extensions to derive instances automatically:

>>> :set -XDeriveGeneric
>>> :set -XDeriveAnyClass

We want to process simple data having to do with students (name, age)
and their grade in math class.

== Defining dataframes

All dataframes must be defined as record types with a type parameter @f@,
where each field involves the `Column` type family, like so:

>>> :{
    data Student f
         = MkStudent { studentName      :: Column f String
                     , studentAge       :: Column f Int
                     , studentMathGrade :: Column f Char
                     }
         deriving (Generic)
:}

To use the functionality of this package, we must derive an instance
of `Frameable` for our type @Student@:

>>> deriving instance Frameable Student

Note that the derivation is automatically done for you, through the `Generic`
instance for @Student@.

One caveat of this approach is that instances for other typeclasses (e.g. `Show`, `Eq`)
must be derived separately. Here's the derivation for rendering a single student
to a string (we'll explain what `Row` shortly):

>>> deriving instance Show (Row Student)

== Constructing dataframes

The type parameter @f@ allows us to represent the data in two separate contexts:
a single student or a dataframe of students. For a single student, @f@ becomes `Identity`,
while for a dataframe of students, @f@ becomes `Vector`. To simplify reading the code,
two type synonyms are provided: @`Row` Student@ for a single student, and @`Frame` Student@
for a dataframe.

Let's now build a dataframe. We use `fromRows` to pack individual students into a dataframe:

>>> :{
    students = fromRows
             [ MkStudent "Albert" 12 'C'
             , MkStudent "Beatrice" 13 'B'
             , MkStudent "Clara" 12 'A'
             ]
    :}

Individual students like @MkStudent "Albert" 23 'C'@ are of type @`Row` Student@, but
the dataframe @students@ has type @`Frame` Student@.

We can render the dataframe @students@ into a nice string using `display`
(and print that string using using `putStrLn`):

>>> putStrLn (display students)
studentName | studentAge | studentMathGrade
----------- | ---------- | ----------------
   "Albert" |         12 |              'C'
 "Beatrice" |         13 |              'B'
    "Clara" |         12 |              'A'

== Operations on columns

A dataframe is a columnar data structure; operations on columns are very efficient.

We can query for a column using a field selector, just like a normal record:

>>> studentName students
["Albert","Beatrice","Clara"]

Although the notation suggests that this is a list, columns are really `Vector`:

>>> :t (studentName students)
(studentName students) :: Vector [Char]

This means that you can use the efficient operations provided by the "Data.Vector"
module to operate on columns.

== Operations on rows

Many operations that treat a dataframe as an array
of rows are provided.

There's `mapRows` to map each row to a new structure:

>>> :{
    putStrLn
        $ display
            $ mapRows
                (\(MkStudent name age grade) -> MkStudent name (2*age) grade)
                students
:}
studentName | studentAge | studentMathGrade
----------- | ---------- | ----------------
   "Albert" |         24 |              'C'
 "Beatrice" |         26 |              'B'
    "Clara" |         24 |              'A'

There's `filterRows` to keep specific rows:

>>> :{
    putStrLn
        $ display
            $ filterRows
                (\(MkStudent _ _ grade) -> grade < 'C')
                students
:}
studentName | studentAge | studentMathGrade
----------- | ---------- | ----------------
 "Beatrice" |         13 |              'B'
    "Clara" |         12 |              'A'

Finally, there's `foldlRows` to summarize a dataframe by using whole rows:

>>> import Data.Char (ord)
>>> :{
    foldlRows
        (\acc (MkStudent _ age grade) -> acc + age + ord grade)
        (0 :: Int)
        students
:}
235

== Lookups

Since dataframes are highly structured, we can efficiently query
them in two flavours: querying by integer index, and querying by key.

=== Querying by integer index

Querying by integer index is supported for all dataframes. Use
the `ilookup` function to retrieve a row:

>>> ilookup 0 students
Just (MkStudent {studentName = "Albert", studentAge = 12, studentMathGrade = 'C'})
>>> ilookup 2 students
Just (MkStudent {studentName = "Clara", studentAge = 12, studentMathGrade = 'A'})
>>> ilookup 1000 students
Nothing

If we need the specific field of a specific row, it is much more efficient
to use `iat`:

>>> students `iat` (1, studentMathGrade)
Just 'B'

=== Querying by key

Querying by integer index may not be natural, and could be error-prone.
We can specify a column (or set of columns) that represent our
dataframe index. Just like a database table, an index speeds up
the lookup of a dataframe by key.

To do this, we must write an instance of `Indexable` for our type @Student@,
where we want to be able to efficiently search by student name:

>>> :set -XTypeFamilies
>>> :{
    instance Indexable Student where
        type Key Student = String
        index = studentName
:}

Now, we can use the functions `Frame.lookup` and `at` (similar to `ilookup`
and `iat`, respectively) which take key (in our case, student names)
instead of integer indices.

>>> Frame.lookup "Beatrice" students
Just (MkStudent {studentName = "Beatrice", studentAge = 13, studentMathGrade = 'B'})

>>> Frame.lookup "Vivienne" students
Nothing

>>> students `at` ("Albert", studentAge)
Just 12

And there you have it! This was a quick tour. Read on to learn more details
and more advanced functionality.
-}

{- $construction

To start using the machinery of this package, one must define the appropriate type.
Types that can be turned into dataframes are non-empty, higher-kinded, record types.
In particular, every field must make use of the `Column` type family.

Let's look at an example:

>>> newtype Address = MkAddress String deriving (Show, Eq)
>>> data Merchandise = Clothes | Food | Cars deriving (Show, Eq)
>>> :{
    data Store f
        = MkStore { storeName        :: Column f String
                  , storeAddress     :: Column f Address
                  , storeId          :: Column f Int
                  , storeMerchandise :: Column f Merchandise
                  }
        deriving (Generic)
:}

Here, we define a higher-kinded record type @Store@ with four fields.
The type parameter @f@ allows the various functions in this package
to switch between a column-oriented format and single-rows.

In practice the type @f@ can only be `Identity` (for a single row),
or `Vector` (for a dataframe)

For ergonomics, the type synonym @`Row` t@ is provided to represent a
single row. The type synonym @`Frame` t@ is provided to represent
a dataframe.

One caveat of this design is that instances (e.g. for `Show` or `Eq`)
must be defined in separate expressions:

>>> deriving instance Show (Row Store)
>>> deriving instance Eq (Row Store)
>>> deriving instance Eq (Frame Store)

Let's consider a single @Store@:

>>> (MkStore "Maxi" (MkAddress "17 Delicious Av.") 1 Food) :: Row Store
MkStore {storeName = "Maxi", storeAddress = MkAddress "17 Delicious Av.", storeId = 1, storeMerchandise = Food}

so @`Row` Store@ is exactly what we would expect from Haskell's regular
record types.

In order to access dataframe functionality, we need to ask our code
to generate some boilerplate automatically. We do this by deriving an
instance of `Frameable`:

>>> :set -XDeriveAnyClass
>>> deriving instance Frameable Store

Note that deriving an instance of `Frameable` requires that our type @Store@ have
a `Generic` instance. This allows @javelin-frames@ to inspect our type @Store@
and write an implementation of `Frameable` automatically.

== Limitations

At this time, `Frameable` can only be derived for higher-kinded record types that
do NOT nest. For example, consider the following hierarchy:

>>> :{
    data Location f
        = MkLocation { longitude :: Column f Double
                     , latitude  :: Column f Double
                     , elevation :: Column f Double
                     }
        deriving (Generic, Frameable)
    data Company f
        = MkCompany { companyName    :: Column f String
                    , companyId      :: Column f Int
                    , companyAddress :: Location f -- Nesting happens here
                  }
        deriving (Generic)
:}

The following will unfortunately fail with a potentially confusing type error:

@
deriving instance Frameable Company
@

Are you an expert in generics who wants to help us figure it out? Feel free to
[raise an issue or open a pull request](https://github.com/LaurentRDC/javelin).
-}

{- $advindexing

For some record type @t@ with an instance of `Frameable`, we can query for specific
rows and elements using `ilookup` and `iat` respectively.

However, many types can naturally be indexed by a subset of the columns, which becomes a key
This key is similar to primary keys in databases.

We can derive an instance of `Indexable` to allow us to query data from a
dataframe not by the integer index of the rows, but by some key instead.

== Simple keys

The simplest example is that of keys derived from a single column.

We start with a data definition:

>>> :{
    newtype Address = Addr String deriving (Show)
    data Store f
        = MkStore { storeName        :: Column f String
                  , storeAddress     :: Column f Address
                  , storeId          :: Column f Int
                  }
        deriving (Generic, Frameable)
    deriving instance Show (Row Store)
:}

In this example, we assume that the @storeId@ column is unique. We will
therefore use it as a key. All we need to do is derive an instance of `Indexable`:

>>> :set -XTypeFamilies
>>> :{
    instance Indexable Store where
        type Key Store = Int
        index = storeId
:}

As an example, let's build a dataframe of stores:

>>> :{
    stores = fromRows
           [ MkStore "Store A" (Addr "8712 1st Avenue") 787123745
           , MkStore "Store B" (Addr "90 2st Street")   188712313
           , MkStore "Store C" (Addr "109 3rd Street")  910823870
           ]
:}

Finally, we can look up @Store A@ by its unique ID using `Frame.lookup`:

>>> Frame.lookup 787123745 stores
Just (MkStore {storeName = "Store A", storeAddress = Addr "8712 1st Avenue", storeId = 787123745})

== Compound keys

Sometimes, it is preferable to identify rows through multiple columns. Again in
in analogy with databases, the key is a _compound key_.

Let's consider another example, that of movie actors:

>>> :{
    data Actor f
        = MkActor { actorFirstName   :: Column f String
                  , actorLastName    :: Column f String
                  , actorAge         :: Column f Int
                  }
        deriving (Generic, Frameable)
    deriving instance Show (Row Actor)
:}

In this case, we can identify actors by their first and last name,
which creates a compound key:

>>> :{
    instance Indexable Actor where
        type Key Actor = (String, String)
        index :: Frame Actor -> Vector (Key Actor)
        index df = Vector.zipWith (,) (actorFirstName df) (actorLastName df)
:}

We define some data

>>> :{
    actors = fromRows
           [ MkActor "George" "Clooney" 63
           , MkActor "Brad"   "Pitt"    61
           , MkActor "George" "Takei"   87
           ]
:}

Finally, we can look up George Clooney's age using `at`:

>>> actors `at` ( ("George", "Clooney"), actorAge )
Just 63

-}

{- $zipping

The simplest way to combine dataframes is analogous to the `Data.List.zipWith` operation
for lists: two dataframes can be combined row-by-row, in order, using `zipRowsWith`.

Here is an example:

>>> data Race = Cat | Dog deriving Show
>>> :{
    data Pet f
        = MkPet { petName :: Column f String
                , petAge  :: Column f Int
                }
        deriving (Generic, Frameable)
    pets = fromRows
         [ MkPet "Milo"    10
         , MkPet "Litchi"  4
         , MkPet "Piccolo" 15
         , MkPet "Cloud"   3
         ]
:}

>>> :{
    data PetInfo f
        = MkPetInfo { petInfoName  :: Column f String
                    , petInfoRace  :: Column f Race
                    }
        deriving (Generic, Frameable)
    petInfos = fromRows
             [ MkPetInfo "Milo"    Cat
             , MkPetInfo "Cloud"   Cat
             , MkPetInfo "Piccolo" Dog
             , MkPetInfo "Litchi"  Dog
             ]
:}

>>> :{
    data PetSummary f
        = MkPetSummary { petSummaryName :: Column f String
                       , petSummaryAge  :: Column f Int
                       , petSummaryRace :: Column f Race
                       }
        deriving (Generic, Frameable)
    deriving instance Show (Row PetSummary)
:}

>>> :{
    putStrLn
        $ display
            $ zipRowsWith
                (\(MkPet name age) (MkPetInfo _ race) -> MkPetSummary name age race)
                pets
                petInfos
:}
petSummaryName | petSummaryAge | petSummaryRace
-------------- | ------------- | --------------
        "Milo" |            10 |            Cat
      "Litchi" |             4 |            Cat
     "Piccolo" |            15 |            Dog
       "Cloud" |             3 |            Dog


Hmm this doesn't look right, if you manually inspect the two source dataframes.
This is because rows are combined in order. You may want to sort rows using
`sortRowsBy` or `sortRowsByUnique`, before applying `zipRowsWith`:

>>> import Data.Function (on)
>>> :{
    putStrLn
        $ display
            $ zipRowsWith
                (\(MkPet name age) (MkPetInfo _ race) -> MkPetSummary name age race)
                (sortRowsBy (compare `on` petName) pets)
                (sortRowsBy (compare `on` petInfoName) petInfos)
:}
petSummaryName | petSummaryAge | petSummaryRace
-------------- | ------------- | --------------
       "Cloud" |             3 |            Cat
      "Litchi" |             4 |            Dog
        "Milo" |            10 |            Cat
     "Piccolo" |            15 |            Dog

There is a more robust way to merge dataframes, if each dataframe has a natural
key (in the case above, pet names). See below.
-}

{- $merging

If you want to merge dataframes whose rows have a natural key (i.e. have an instance of `Indexable`),
then you should take a look at `mergeWithStrategy`.
In this function, for each key present in __either__ dataframe,
a merging strategy is applied. This strategy encodes how the merge should proceed in three cases:

* The key is present in the left dataframe, but not the right;
* The key is present in the right dataframe, but not the left;
* The key is present in both dataframes.

Let's see how to make use of this functionality by combining the information
about containers being shipped. Unfortunately, the data is spotty, so
we will need to make decisions about missing data.

>>> :{
    data ContainerOrigin f
        = MkContainerOrigin { containerOriginId      :: Column f Int
                            , containerOriginCountry :: Column f String
                            }
        deriving (Generic, Frameable)
    instance Indexable ContainerOrigin where
        type Key ContainerOrigin = Int
        index = containerOriginId
    containerOrigins = fromRows
                     [ MkContainerOrigin 1 "Canada"
                     , MkContainerOrigin 2 "Mexico"
                     -- missing container origin for container #3
                     , MkContainerOrigin 4 "Poland"
                     , MkContainerOrigin 5 "N/A" -- bad data
                     ]
:}

>>> :{
    data ContainerDest f -- Container destination
        = MkContainerDest { containerDestId      :: Column f Int
                          , containerDestCountry :: Column f String
                          }
        deriving (Generic, Frameable)
    instance Indexable ContainerDest where
        type Key ContainerDest = Int
        index = containerDestId
    containerDests = fromRows
                   [ MkContainerDest 1 "Japan"
                   , MkContainerDest 2 "Canada"
                   , MkContainerDest 3 "USA"
                   -- missing container destination for #4
                   , MkContainerDest 5 "France"
                   ]
:}

We will first start by merging the dataframes only when we have complete data
(i.e. an inner join). We first define the shape of the resulting dataframe:

>>> :{
    data ContainerJourney f
        = MkContainerJourney { containerJourneyId   :: Column f Int
                             , containerJourneyOrig :: Column f String
                             , containerJourneyDest :: Column f String
                             }
        deriving (Generic, Frameable)
    deriving instance Show (Row ContainerJourney)
:}

and define our merging strategy. The three row-wise merge cases are handled by the constructor
t`These`, namely the constructors:

* v`This`: The key is present in the left dataframe, but not the right;
* v`That`: The key is present in the right dataframe, but not the left;
* v`These`: The key is present in both dataframes (not to be confused with the type constructor t`These`).

In the simplest case, we only care about keys present in both dataframe (`These`)

>>> :{
    completeDataStrategy :: Int -> These (Row ContainerOrigin) (Row ContainerDest) -> Maybe (Row ContainerJourney)
    completeDataStrategy containerId (These (MkContainerOrigin _ origin) (MkContainerDest _ dest))
        = Just $ MkContainerJourney containerId origin dest
    completeDataStrategy _ _ = Nothing -- not enough data
:}

Sidenote: @completeDataStrategy@ is equivalent to `matchedStrategy`. We re-defined it for illustrative purposes.

>>> :{
    putStrLn
        $ display
            $ mergeWithStrategy
                completeDataStrategy
                containerOrigins
                containerDests
:}
containerJourneyId | containerJourneyOrig | containerJourneyDest
------------------ | -------------------- | --------------------
                 1 |             "Canada" |              "Japan"
                 2 |             "Mexico" |             "Canada"
                 5 |                "N/A" |             "France"

As expected, we do not have enough information to reconstruct the journey for container 3 (no known origin)
and container 4 (no known destination).
However, container 5's origin isn't valid data. We can further tweak the merge strategy to take this into account.

We crudely define what is a valid country name:

>>> validCountry name = not (name == "N/A")

and we can now define a new merging strategy. Returning a `Nothing` result from a merging strategy
effectively cancels the merge:

>>> :{
    completeDataStrategy' :: Int -> These (Row ContainerOrigin) (Row ContainerDest) -> Maybe (Row ContainerJourney)
    completeDataStrategy' containerId (These (MkContainerOrigin _ origin) (MkContainerDest _ dest))
        | validCountry origin && validCountry dest = Just $ MkContainerJourney containerId origin dest
        | otherwise                                = Nothing
    completeDataStrategy' _ _ = Nothing -- not enough data
:}

>>> :{
    putStrLn
        $ display
            $ mergeWithStrategy
                completeDataStrategy'
                containerOrigins
                containerDests
:}
containerJourneyId | containerJourneyOrig | containerJourneyDest
------------------ | -------------------- | --------------------
                 1 |             "Canada" |              "Japan"
                 2 |             "Mexico" |             "Canada"

What if we can tolerate some missing data? Here, we only care where the container is going, but not
necessarily its origin. Let's redefine our resulting dataframe to take this into account:

>>> :{
    data PartialContainerJourney f
        = MkPartialContainerJourney { partialContainerJourneyId   :: Column f Int
                                    , partialContainerJourneyOrig :: Column f (Maybe String)
                                    , partialContainerJourneyDest :: Column f String
                                    }
        deriving (Generic, Frameable)
    deriving instance Show (Row PartialContainerJourney)
:}

>>> :{
    maybeOriginStrategy :: Int -> These (Row ContainerOrigin) (Row ContainerDest) -> Maybe (Row PartialContainerJourney)
    maybeOriginStrategy containerId (These (MkContainerOrigin _ origin) (MkContainerDest _ dest))
        | validCountry origin && validCountry dest = Just $ MkPartialContainerJourney containerId (Just origin) dest
        | validCountry dest                        = Just $ MkPartialContainerJourney containerId Nothing       dest
        | otherwise                                = Nothing
    maybeOriginStrategy containerId (That (MkContainerDest _ dest))
                                                   = Just $ MkPartialContainerJourney containerId Nothing       dest
    maybeOriginStrategy _           (This _)       = Nothing -- we require a destination
:}

>>> :{
    putStrLn
        $ display
            $ mergeWithStrategy
                maybeOriginStrategy
                containerOrigins
                containerDests
:}
partialContainerJourneyId | partialContainerJourneyOrig | partialContainerJourneyDest
------------------------- | --------------------------- | ---------------------------
                        1 |               Just "Canada" |                     "Japan"
                        2 |               Just "Mexico" |                    "Canada"
                        3 |                     Nothing |                       "USA"
                        5 |                     Nothing |                    "France"
-}