
DataFrame

A fast, safe, and intuitive DataFrame library.

Why use this DataFrame library?

  • Encourages concise, declarative, and composable data pipelines.
  • Static typing makes code easier to reason about and catches many bugs at compile time—before your code ever runs.
  • Delivers high performance thanks to Haskell’s optimizing compiler and efficient memory model.
  • Designed for interactivity: expressive syntax, helpful error messages, and sensible defaults.
  • Works seamlessly in both command-line and notebook environments—great for exploration and scripting alike.

Example usage

Interactive environment

Screencast of usage in GHCi

Key features in example:

  • Intuitive, SQL-like API to get from data to insights.
  • Create typed, completion-ready references to columns in a dataframe using :exposeColumns (sketched below).
  • Type-safe column transformations for faster and safer exploration.
  • Fluid, chaining API that makes code easy to reason about.
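
For example, a quick GHCi session combining these features might look like the following. This is a minimal sketch: it assumes the chipotle dataset used in the script below, and that :exposeColumns generates the typed quantity binding.

ghci> import qualified DataFrame as D
ghci> import qualified DataFrame.Functions as F
ghci> import DataFrame ((|>))
ghci> df <- D.readTsv "./data/chipotle.tsv"
ghci> :exposeColumns df
ghci> df |> D.groupBy ["item_name"] |> D.aggregate [F.sum quantity `F.as` "total_quantity"]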

Standalone script example

-- Useful Haskell extensions.
{-# LANGUAGE OverloadedStrings #-} -- Allows string literals to be interpreted as other string types (e.g. Text).
{-# LANGUAGE TypeApplications #-} -- Convenience syntax for specifying types: `sum @Int a b` instead of `sum a b :: Int`.

import qualified DataFrame as D -- import for general functionality.
import qualified DataFrame.Functions as F -- import for column expressions.

import DataFrame ((|>)) -- import the chaining operator unqualified.

main :: IO ()
main = do
    df <- D.readTsv "./data/chipotle.tsv"
    let quantity = F.col "quantity" :: D.Expr Int -- A typed reference to a column.
    print (df
      |> D.select ["item_name", "quantity"]
      |> D.groupBy ["item_name"]
      |> D.aggregate [ (F.sum quantity)     `F.as` "sum_quantity"
                     , (F.mean quantity)    `F.as` "mean_quantity"
                     , (F.maximum quantity) `F.as` "maximum_quantity"
                     ]
      |> D.sortBy D.Descending ["sum_quantity"]
      |> D.take 10)

Output:

------------------------------------------------------------------------------------------
index |          item_name           | sum_quantity |   mean_quantity    | maximum_quantity
------|------------------------------|--------------|--------------------|----------------
 Int  |             Text             |     Int      |       Double       |       Int      
------|------------------------------|--------------|--------------------|----------------
0     | Chicken Bowl                 | 761          | 1.0482093663911847 | 3              
1     | Chicken Burrito              | 591          | 1.0687160940325497 | 4              
2     | Chips and Guacamole          | 506          | 1.0563674321503131 | 4              
3     | Steak Burrito                | 386          | 1.048913043478261  | 3              
4     | Canned Soft Drink            | 351          | 1.1661129568106312 | 4              
5     | Chips                        | 230          | 1.0900473933649288 | 3              
6     | Steak Bowl                   | 221          | 1.04739336492891   | 3              
7     | Bottled Water                | 211          | 1.3024691358024691 | 10             
8     | Chips and Fresh Tomato Salsa | 130          | 1.1818181818181819 | 15             
9     | Canned Soda                  | 126          | 1.2115384615384615 | 4 

A full example using many of the constructs in the API is in the ./examples folder.

Installing

Jupyter notebook

CLI

  • Run the installation script: curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/mchav/dataframe/refs/heads/main/scripts/install.sh | sh
  • Download the run script with: curl --output dataframe "https://raw.githubusercontent.com/mchav/dataframe/refs/heads/main/scripts/dataframe.sh"
  • Make the script executable: chmod +x dataframe
  • Add the script to your PATH: export PATH="$PATH:$(pwd)"
  • Run the script with: dataframe

What is exploratory data analysis?

We provide a primer here and show how to do some common analyses.
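
As a minimal first-pass sketch of such an analysis with this library (describeColumns is named in the 0.3.0.2 changelog entry below; the rest reuses the quick-start example):

ghci> df <- D.readTsv "./data/chipotle.tsv"
ghci> D.describeColumns df  -- per-column types, counts, and summary statistics
ghci> df |> D.select ["item_name", "quantity"] |> D.take 5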

Coming from other dataframe libraries

Familiar with another dataframe library? Get started:

Supported input formats

  • CSV
  • Apache Parquet

Supported output formats

  • CSV

Future work

  • Apache Arrow compatibility.
  • Integration with more data formats (SQLite, PostgreSQL, JSON Lines, XLSX).
  • Host the whole library + Jupyter lab on Azure with auth and isolation.

Changes

Revision history for dataframe

0.3.3.0

  • Better error messaging for expression failures.
  • Fix bug where exponentials were not being parsed properly during CSV parsing.
  • toMatrix now returns either an exception or a vector of vectors of doubles.
  • Add sample, kFolds, and randomSplit for sampling dataframes.
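
A hypothetical sketch of the new sampling helpers; the argument shapes below (a split ratio, a fold count, a sampling fraction) are assumptions and the real signatures may differ:

ghci> (train, test) <- D.randomSplit 0.8 df -- assumed: 80/20 train/test split
ghci> folds <- D.kFolds 5 df                -- assumed: five cross-validation folds
ghci> subset <- D.sample 0.1 df             -- assumed: sample 10% of rows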

0.3.2.0

  • Fix dataframe semigroup instance. Appending two columns of the same name but different types now gives a column of Either a b (work by @jhrcek).
  • Fix left expansion of semigroup instance (work by @jhrcek).
  • Added hasElemType function that can be used with selectBy to filter columns by type. E.g. selectBy [byProperty (hasElemType @Int)] df.
  • Added basic support for program synthesis for feature generation (synthesizeFeatureExpr) and symbolic regression (fitRegression).
  • Web plotting no longer embeds the entire script.
  • Added relu, min, and max functions for expressions.
  • Add fromRows function to build a dataframe from rows (sketched after this list). Also add toAny function that converts a value to a dynamic-like Columnable value.
  • isNumeric function now recognises Integer types.
  • Added readCsvWithOpts function that allows read specification.
  • Expose option to specify data formats when parsing CSV.
  • Added setup script for Hasktorch example.
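
A hypothetical sketch of building a dataframe from rows with fromRows and toAny; the argument order (column names first, then rows) is an assumption:

ghci> import qualified Data.Text as T
ghci> let rows = [[D.toAny ("Chicken Bowl" :: T.Text), D.toAny (2 :: Int)]]
ghci> D.fromRows ["item_name", "quantity"] rows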

0.3.1.2

  • Update granite version, again, for stackage.

0.3.1.1

  • Aggregation now works on expressions rather than just column references.
  • Export writeCsv.
  • Loosen bounds for dependencies to keep library on stackage.
  • Add filterNothing function that returns all rows where the given column is Nothing.
  • Add IfThenElse function for conditional expressions.
  • Add synthesizeFeatureExpr function that does a search for a predictive variable in a Double dataframe.

0.3.1.0

  • Add new selectBy function which subsumes all the other select functions (see the combined sketch after this list). Specifically we can:
    • selectBy [byName "x"] df: normal select.
    • selectBy [byProperty isNumeric] df: all columns with a given property.
    • selectBy [byNameProperty (T.isPrefixOf "weight")] df: select by column name predicate.
    • selectBy [byIndexRange (0, 5)] df: picks the first six columns.
    • selectBy [byNameRange ("a", "c")] df: select names within a range.
  • Cut down dependencies to reduce binary/installation size.
  • Add module for web plots that uses chartjs.
  • Web plots can open in the browser.
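
Putting the selectBy variants together, a small self-contained sketch (note that the name-predicate form needs a qualified Data.Text import for T.isPrefixOf):

ghci> import qualified Data.Text as T
ghci> D.selectBy [D.byProperty D.isNumeric] df
ghci> D.selectBy [D.byName "item_name", D.byNameProperty (T.isPrefixOf "weight")] df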

0.3.0.4

  • Fix bug with parquet reader.

0.3.0.3

  • Improved parquet reader. The reader now supports most parquet files downloaded from internet sources.
    • Supports all primitive parquet types, plain and uncompressed.
    • Can decode both v1 and v2 data pages.
    • Supports Snappy and ZSTD compression.
    • Supports RLE/bitpacking encoding for primitive types.
    • Backward compatible with INT96 type.
    • From the parquet-testing repo we can successfully read the following:
      • alltypes_dictionary.parquet
      • alltypes_plain.parquet
      • alltypes_plain.snappy.parquet
      • alltypes_tiny_pages_plain.parquet
      • binary_truncated_min_max.parquet
      • datapage_v1-corrupt-checksum.parquet
      • datapage_v1-snappy-compressed-checksum.parquet
      • datapage_v1-uncompressed-checksum.parquet
  • Improve CSV parsing: Parse bytestring and convert to text only at the end. Remove some redundancies in parsing with suggestions from @Jhingon.
  • Faster correlation computation.
  • Update version of granite that ships with dataframe and add new scatterBy plot.

0.3.0.2

  • Re-enable Parquet.
  • Change columnInfo to describeColumns.
  • We can now convert columns to lists.
  • Fast reductions and groupings. GroupBys are now a dataframe construct not a column construct (thanks to @stites).
  • Filter is now faster because we do mutation on the index vector.
  • The frequencies table now correctly displays percentages (thanks @kayvank).
  • Show table implementations have been unified (thanks @metapho-re).
  • We now compute statistics on null columns.
  • Drastic improvement in plotting since we now use granite.

0.3.0.1

  • Temporarily remove Parquet support. I think it'll be worth creating a spin-off of snappy that doesn't rely on C bindings. Also, I'll probably spin Parquet off into a separate library.

0.3.0.0

  • Now supports inner joins
ghci> df |> D.innerJoin ["key_1", "key_2"] other
  • Aggregations are now expressions allowing for more expressive aggregation logic. Previously:
D.aggregate [("quantity", D.Mean), ("price", D.Sum)] df
Now:
D.aggregate [(F.sum (F.col @Double "label") / F.count (F.col @Double "label")) `F.as` "positive_rate"] df
  • In GHCI, you can now create type-safe bindings for each column and use those in expressions.
ghci> :exposeColumns df
ghci> D.aggregate [(F.sum label / F.count label) `F.as` "positive_rate"] df
  • Added pandas and polars benchmarks.
  • Performance improvements to groupBy.
  • Various bug fixes.

0.2.0.2

  • Experimental Apache Parquet support.
  • Rename column conversion functions (changed from toColumn and toColumn' to fromVector and fromList).
  • Rename the dataframe constructor to fromNamedColumns.
  • Create an error context for error messages so we can change the exceptions as they are thrown.
  • Provide safe versions of building block functions that allow us to build good traces.
  • Add readthedocs support.

0.2.0.1

  • Fix bug with new comparison expressions. gt and geq were actually implemented as lt and leq.
  • Changes to make the library work with GHC 9.10.1 and 9.12.2.

0.2.0.0

Replace the Function ADT with a column expression syntax.

Previously, we tried to stay as close to Haskell as possible. We used the explicit ordering of the column names in the first part of the tuple to determine the function arguments, and a regular Haskell function that we evaluated piecewise on each row.

let multiply (a :: Int) (b :: Double) = fromIntegral a * b
let withTotalPrice = D.deriveFrom (["quantity", "item_price"], D.func multiply) "total_price" df

Now, we have a column expression syntax that mirrors Pyspark and Polars.

let withTotalPrice = D.derive "total_price" (D.lift fromIntegral (D.col @Int "quantity") * (D.col @Double "item_price")) df

Adds a coverage report to the repository (thanks to @oforero).

We don’t have good test coverage right now. This will help us determine where to invest. @oforero provided a script to make an HPC HTML report for coverage.

Convenience functions for comparisons

Instead of lifting all boolean operations, we provide eq, leq, etc.
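
For instance, a sketch of filtering with one of these helpers; this assumes filter accepts a boolean column expression and that a numeric literal can stand in for a column (both are assumptions, so the real API may differ):

let singles = D.filter ((D.col @Int "quantity") `D.eq` 1) df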

0.1.0.3

  • Use an older version of correlation for IHaskell integration.

0.1.0.2

  • Change namespace from Data.DataFrame to DataFrame.
  • Add toVector function for converting columns to vectors.
  • Add impute function for replacing Nothing values in optional columns.
  • Add filterAllJust to filter out all rows with missing data.
  • Add distinct function that returns a dataframe with distinct rows.

0.1.0.1

  • Fixed parse failure on nested, escaped quotations.
  • Fixed column info when field name isn’t found.

0.1.0.0

  • Initial release