dataframe

An intuitive, dynamically-typed DataFrame library.

Stackage Nightly 2025-06-16:	0.2.0.1
Latest on Hackage:	0.2.0.1

GPL-3.0-or-later licensed by Michael Chavinda

Maintained by [email protected]

This version can be pinned in stack with: dataframe-0.2.0.1@sha256:d9721182035f409d1667181008cf70c70591eb1cbc877439ec17b4843e2ea320,5041

Module documentation for 0.2.0.1

DataFrame

Depends on 12 packages (full list with versions):

array, attoparsec, base, bytestring, containers, directory, hashable, statistics, text, time, vector, vector-algorithms

DataFrame

A tool for exploratory data analysis.

Installing

CLI

Install Haskell (GHC + cabal) via ghcup, selecting all the default options.
To install dataframe, run cabal update && cabal install dataframe
Open a Haskell REPL with dataframe loaded by running cabal repl --build-depends dataframe
Follow along with any one of the tutorials below.

Jupyter notebook

Use the Dockerfile in ihaskell-dataframe to build and run an image with dataframe integration.
For a preview, check out the California Housing notebook.

What is exploratory data analysis?

We provide a primer here and show how to do some common analyses.

Coming from other dataframe libraries

Familiar with another dataframe library? Get started:

Example usage

Code example

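{-# LANGUAGE OverloadedStrings #-} -- lets string literals be used as column names

import qualified DataFrame as D

import DataFrame ((|>))

main :: IO ()
main = do
    df <- D.readTsv "./data/chipotle.tsv"
    -- Per item: maximum, mean and total quantity, sorted by total.
    print $ df
      |> D.select ["item_name", "quantity"]
      |> D.groupBy ["item_name"]
      |> D.aggregate (zip (repeat "quantity") [D.Maximum, D.Mean, D.Sum])
      |> D.sortBy D.Descending ["Sum_quantity"]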

Output:

----------------------------------------------------------------------------------------------------
index |               item_name               | Sum_quantity |   Mean_quantity    | Maximum_quantity
------|---------------------------------------|--------------|--------------------|-----------------
 Int  |                 Text                  |     Int      |       Double       |       Int       
------|---------------------------------------|--------------|--------------------|-----------------
0     | Chips and Fresh Tomato Salsa          | 130          | 1.1818181818181819 | 15              
1     | Izze                                  | 22           | 1.1                | 3               
2     | Nantucket Nectar                      | 31           | 1.1481481481481481 | 3               
3     | Chips and Tomatillo-Green Chili Salsa | 35           | 1.1290322580645162 | 3               
4     | Chicken Bowl                          | 761          | 1.0482093663911847 | 3               
5     | Side of Chips                         | 110          | 1.0891089108910892 | 8               
6     | Steak Burrito                         | 386          | 1.048913043478261  | 3               
7     | Steak Soft Tacos                      | 56           | 1.018181818181818  | 2               
8     | Chips and Guacamole                   | 506          | 1.0563674321503131 | 4               
9     | Chicken Crispy Tacos                  | 50           | 1.0638297872340425 | 2

Full example in ./app folder using many of the constructs in the API.
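
To make the pipeline in the code example concrete, here is a minimal sketch in plain Haskell of what grouping by a key and then computing a maximum, mean and sum per group amounts to. It does not use dataframe and is not how the library implements these operations. The rows, the Summary type and the function names are made up for illustration.

```haskell
import Data.List (sortOn)
import qualified Data.Map.Strict as M
import Data.Ord (Down (..))

-- Toy rows: (item_name, quantity). Values are invented for this sketch.
orders :: [(String, Int)]
orders =
  [ ("Izze", 1)
  , ("Chicken Bowl", 2)
  , ("Izze", 2)
  , ("Chicken Bowl", 1)
  , ("Chicken Bowl", 1)
  ]

-- Running summary for one group: sum, row count and maximum.
data Summary = Summary { sSum :: Int, sCount :: Int, sMax :: Int }

-- Merge two partial summaries of the same group.
combine :: Summary -> Summary -> Summary
combine (Summary s1 c1 m1) (Summary s2 c2 m2) =
  Summary (s1 + s2) (c1 + c2) (max m1 m2)

-- Group rows by key, folding each quantity into its group's summary.
summarise :: [(String, Int)] -> M.Map String Summary
summarise rows = M.fromListWith combine [(k, Summary q 1 q) | (k, q) <- rows]

-- The mean is derived from the sum and the count.
mean :: Summary -> Double
mean s = fromIntegral (sSum s) / fromIntegral (sCount s)

-- Groups ordered by descending sum, like sorting on Sum_quantity.
ranked :: [(String, Summary)]
ranked = sortOn (Down . sSum . snd) (M.toList (summarise orders))

main :: IO ()
main = mapM_ report ranked
  where
    report (name, s) =
      putStrLn (name ++ ": sum=" ++ show (sSum s)
                     ++ " mean=" ++ show (mean s)
                     ++ " max=" ++ show (sMax s))
```

Keeping a count next to the sum means the mean can be computed in the same single pass as the other aggregates.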

Visual example

Screencast of usage in GHCI

Future work

Apache Arrow and Parquet compatibility
Integration with common data formats (currently only supports CSV)
Support windowed plotting (currently only supports ASCII plots)
Create a lazy API that builds an execution graph instead of running eagerly (will be used to compute on files larger than RAM)

Contributing

Please first submit an issue and we can discuss there.

Changes

Revision history for dataframe

0.2.0.1

Fix bug with new comparison expressions. gt and geq were actually implemented as lt and leq.
Changes to make the library work with GHC 9.10.1 and 9.12.2

0.2.0.0

Replace `Function` ADT with a column expression syntax.

Previously, we tried to stay as close to Haskell as possible. The explicit ordering of the column names in the first part of the tuple determined the function arguments, and the second part was a regular Haskell function that we evaluated piece-wise on each row.

let multiply (a :: Int) (b :: Double) = fromIntegral a * b
let withTotalPrice = D.deriveFrom (["quantity", "item_price"], D.func multiply) "total_price" df

Now, we have a column expression syntax that mirrors PySpark and Polars.

let withTotalPrice = D.derive "total_price" (D.lift fromIntegral (D.col @Int "quantity") * D.col @Double "item_price") df
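
As a rough illustration of why an expression like the one above can be written with ordinary arithmetic operators, here is a toy model in plain Haskell. It is not the library's implementation. The Expr type, the dynamically typed Row, and the local col and lift functions are assumptions made for this sketch, named only to echo the calls in the changelog entry.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}

import Data.Dynamic (Dynamic, Typeable, fromDynamic, toDyn)
import qualified Data.Map.Strict as M

-- A toy row: column name mapped to a dynamically typed value.
type Row = M.Map String Dynamic

-- A column expression is modelled as a function from a row to a value.
-- The result is Nothing when a column is missing or has another type.
newtype Expr a = Expr { eval :: Row -> Maybe a }

-- Typed column reference, in the spirit of `D.col @Int "quantity"`.
col :: forall a. Typeable a => String -> Expr a
col name = Expr (\row -> M.lookup name row >>= fromDynamic)

-- Apply an ordinary function to an expression, in the spirit of `D.lift`.
lift :: (a -> b) -> Expr a -> Expr b
lift f (Expr g) = Expr (fmap f . g)

-- Arithmetic on expressions is pointwise, so `*` can combine two columns.
instance Num a => Num (Expr a) where
  Expr f + Expr g = Expr (\r -> (+) <$> f r <*> g r)
  Expr f * Expr g = Expr (\r -> (*) <$> f r <*> g r)
  Expr f - Expr g = Expr (\r -> (-) <$> f r <*> g r)
  abs = lift abs
  signum = lift signum
  fromInteger n = Expr (const (Just (fromInteger n)))

-- Same shape as the derived column in the changelog entry.
totalPrice :: Expr Double
totalPrice = lift fromIntegral (col @Int "quantity") * col @Double "item_price"

main :: IO ()
main = do
  let row = M.fromList
        [ ("quantity", toDyn (2 :: Int))
        , ("item_price", toDyn (4.5 :: Double))
        ]
  print (eval totalPrice row) -- prints Just 9.0
```

In this model, the type application on col is what ties a column name to a Haskell type, so a mismatch such as multiplying an Int column by a Double column without lift is rejected at compile time.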

Adds a coverage report to the repository (thanks to @oforero)

We don’t have good test coverage right now. This will help us determine where to invest. @oforero provided a script to make an HPC HTML report for coverage.

Convenience functions for comparisons

Instead of lifting every Boolean operation by hand, we provide eq, leq, etc.
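
The idea can be sketched in plain Haskell as follows: each comparison helper is a binary operator lifted over expressions once, so callers do not repeat the lifting. This is a toy model, not the library's code. The Expr and Row types, lift2, and the signatures given to eq, lt, leq, gt and geq are assumptions for illustration.

```haskell
-- Toy model: a row is a pair of Ints, and an expression is a function
-- from a row to a value.
type Row = (Int, Int)

newtype Expr a = Expr { eval :: Row -> a }

-- Generic lifting of a binary function over two expressions.
lift2 :: (a -> b -> c) -> Expr a -> Expr b -> Expr c
lift2 f (Expr x) (Expr y) = Expr (\row -> f (x row) (y row))

-- Comparison helpers: each one is a single lifted operator.
eq :: Eq a => Expr a -> Expr a -> Expr Bool
eq = lift2 (==)

lt, leq, gt, geq :: Ord a => Expr a -> Expr a -> Expr Bool
lt = lift2 (<)
leq = lift2 (<=)
gt = lift2 (>)   -- greater-than must map to (>), not (<)
geq = lift2 (>=) -- likewise (>=), not (<=)

-- Two hypothetical columns of the toy row.
colA, colB :: Expr Int
colA = Expr fst
colB = Expr snd

main :: IO ()
main = do
  let row = (3, 5)
  print (eval (colA `gt` colB) row)  -- prints False
  print (eval (colA `lt` colB) row)  -- prints True
  print (eval (colA `geq` colA) row) -- prints True
```

A check that evaluates gt and lt on the same unequal pair and expects opposite answers is enough to catch a swap between the two.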

0.1.0.3

Use older version of correlation for IHaskell integration

0.1.0.2

Change namespace from Data.DataFrame to DataFrame
Add toVector function for converting columns to vectors.
Add impute function for replacing Nothing values in optional columns.
Add filterAllJust to filter out all rows with missing data.
Add distinct function that returns a dataframe with distinct rows.
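
For readers new to these operations, the following plain-Haskell sketch shows the behaviour described for imputing, filtering out rows with missing data, and keeping distinct rows, using lists of Maybe values instead of a dataframe. The function names imputeWith, completeRows and distinctRows are hypothetical and are not the library's API.

```haskell
import Data.List (nub)
import Data.Maybe (fromMaybe, isJust)

-- Replace missing values in an optional column with a fallback.
imputeWith :: a -> [Maybe a] -> [a]
imputeWith fallback = map (fromMaybe fallback)

-- Keep only the rows in which every cell is present.
completeRows :: [[Maybe a]] -> [[Maybe a]]
completeRows = filter (all isJust)

-- Drop repeated rows, keeping the first occurrence of each.
distinctRows :: Eq a => [a] -> [a]
distinctRows = nub

main :: IO ()
main = do
  print (imputeWith 0 ([Just 1, Nothing, Just 3] :: [Maybe Int]))
  print (completeRows ([[Just 1, Just 2], [Just 3, Nothing]] :: [[Maybe Int]]))
  print (distinctRows ([(1, 'a'), (2, 'b'), (1, 'a')] :: [(Int, Char)]))
```

Note that nub is quadratic in the number of rows, which is fine for a sketch but not for large data.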

0.1.0.1

Fixed parse failure on nested, escaped quotation.
Fixed column info when field name isn’t found.

0.1.0.0

Initial release