dataframe
An intuitive, dynamically-typed DataFrame library.
Stackage Nightly 2025-06-16: 0.2.0.1
Latest on Hackage: 0.2.0.1
dataframe-0.2.0.1@sha256:d9721182035f409d1667181008cf70c70591eb1cbc877439ec17b4843e2ea320,5041
Module documentation for 0.2.0.1
DataFrame
A tool for exploratory data analysis.
Installing
CLI
- Install Haskell (ghc + cabal) via ghcup, selecting all the default options.
- To install dataframe, run
cabal update && cabal install dataframe
- Open a Haskell repl with dataframe loaded by running
cabal repl --build-depends dataframe
- Follow along with any one of the tutorials below.
Jupyter notebook
- Use the Dockerfile in ihaskell-dataframe to build and run an image with dataframe integration.
- For a preview, check out the California Housing notebook.
What is exploratory data analysis?
We provide a primer here and show how to do some common analyses.
Coming from other dataframe libraries
Familiar with another dataframe library? Get started:
Example usage
Code example
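{-# LANGUAGE OverloadedStrings #-} -- column names are Text literals
import qualified DataFrame as D
import DataFrame ((|>))

main :: IO ()
main = do
    -- Load the tab-separated orders file.
    df <- D.readTsv "./data/chipotle.tsv"
    -- Maximum, mean, and total quantity per item, ordered by total.
    print $ df
        |> D.select ["item_name", "quantity"]
        |> D.groupBy ["item_name"]
        |> D.aggregate (zip (repeat "quantity") [D.Maximum, D.Mean, D.Sum])
        |> D.sortBy D.Descending ["Sum_quantity"]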
Output:
----------------------------------------------------------------------------------------------------
index | item_name | Sum_quantity | Mean_quantity | Maximum_quantity
------|---------------------------------------|--------------|--------------------|-----------------
Int | Text | Int | Double | Int
------|---------------------------------------|--------------|--------------------|-----------------
0 | Chips and Fresh Tomato Salsa | 130 | 1.1818181818181819 | 15
1 | Izze | 22 | 1.1 | 3
2 | Nantucket Nectar | 31 | 1.1481481481481481 | 3
3 | Chips and Tomatillo-Green Chili Salsa | 35 | 1.1290322580645162 | 3
4 | Chicken Bowl | 761 | 1.0482093663911847 | 3
5 | Side of Chips | 110 | 1.0891089108910892 | 8
6 | Steak Burrito | 386 | 1.048913043478261 | 3
7 | Steak Soft Tacos | 56 | 1.018181818181818 | 2
8 | Chips and Guacamole | 506 | 1.0563674321503131 | 4
9 | Chicken Crispy Tacos | 50 | 1.0638297872340425 | 2
A full example using many of the constructs in the API is in the ./app folder.
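To make the select, group, aggregate, and sort steps above concrete, here is a minimal sketch of the same computation over plain lists, using only base. It does not use the dataframe API: the Row and Summary types, the function names, and the sample data are all hypothetical and exist only for illustration.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)
import Data.Ord (Down (..))

-- Hypothetical row type: (item_name, quantity).
type Row = (String, Int)

-- One output row per group: total, mean, and maximum quantity.
data Summary = Summary
    { itemName :: String
    , sumQuantity :: Int
    , meanQuantity :: Double
    , maxQuantity :: Int
    }
    deriving (Show, Eq)

-- Sort by key so equal keys are adjacent, group them, then reduce each group.
summarise :: [Row] -> [Summary]
summarise rows =
    [ Summary name total (fromIntegral total / fromIntegral (length qs)) (maximum qs)
    | grp@((name, _) : _) <- groupBy ((==) `on` fst) (sortOn fst rows)
    , let qs = map snd grp
    , let total = sum qs
    ]

-- Order the summaries by total quantity, largest first.
byTotalDescending :: [Summary] -> [Summary]
byTotalDescending = sortOn (Down . sumQuantity)

-- Made-up sample data, not taken from chipotle.tsv.
sampleRows :: [Row]
sampleRows =
    [ ("Izze", 1)
    , ("Chicken Bowl", 2)
    , ("Izze", 3)
    , ("Chicken Bowl", 1)
    , ("Chicken Bowl", 2)
    ]

main :: IO ()
main = mapM_ print (byTotalDescending (summarise sampleRows))
```

The sketch sorts before grouping because Data.List.groupBy only merges adjacent equal keys. How the library groups rows internally is not described here.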
Visual example
Future work
- Apache Arrow and Parquet compatibility
- Integration with common data formats (currently only supports CSV)
- Support windowed plotting (currently only supports ASCII plots)
- Create a lazy API that builds an execution graph instead of running eagerly (will be used to compute on files larger than RAM)
Contributing
- Please first submit an issue and we can discuss there.
Changes
Revision history for dataframe
0.2.0.1
- Fix bug with new comparison expressions. gt and geq were actually implemented as lt and leq.
- Changes to make the library work with GHC 9.10.1 and 9.12.2.
0.2.0.0
Replace Function ADT with a column expression syntax.
Previously, we tried to stay as close to Haskell as possible. We used the explicit ordering of the column names in the first part of the tuple to determine the function arguments, and the second part was a regular Haskell function that we evaluated piece-wise on each row.
let multiply (a :: Int) (b :: Double) = fromIntegral a * b
let withTotalPrice = D.deriveFrom (["quantity", "item_price"], D.func multiply) "total_price" df
Now, we have a column expression syntax that mirrors PySpark and Polars.
let withTotalPrice = D.derive "total_price" (D.lift fromIntegral (D.col @Int "quantity") * (D.col @Double "item_price")) df
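As an illustration of the general idea behind a column expression syntax, the sketch below builds a small expression tree and evaluates it row by row. It is a toy written against base only, not the library's implementation: the Expr type, its constructors, col, eval, deriveColumn, and the Dynamic-based row representation are all assumptions made for this example.

```haskell
{-# LANGUAGE GADTs #-}
{-# LANGUAGE TypeApplications #-}

import Data.Dynamic (Dynamic, fromDynamic, toDyn)
import Data.Typeable (Typeable)

-- A toy row: column name paired with a dynamically typed value.
type Row = [(String, Dynamic)]

-- A toy expression tree (hypothetical, not the library's type).
data Expr a where
    Col :: Typeable a => String -> Expr a
    Lit :: a -> Expr a
    Lift :: (b -> a) -> Expr b -> Expr a
    Lift2 :: (b -> c -> a) -> Expr b -> Expr c -> Expr a

-- Arithmetic on expressions builds a bigger tree instead of computing a value.
instance Num a => Num (Expr a) where
    (+) = Lift2 (+)
    (-) = Lift2 (-)
    (*) = Lift2 (*)
    abs = Lift abs
    signum = Lift signum
    negate = Lift negate
    fromInteger = Lit . fromInteger

-- Refer to a column by name at a chosen type, e.g. col @Int "quantity".
col :: Typeable a => String -> Expr a
col = Col

-- Evaluate against one row; Nothing means a missing or mistyped column.
eval :: Row -> Expr a -> Maybe a
eval row (Col name) = lookup name row >>= fromDynamic
eval _ (Lit x) = Just x
eval row (Lift f e) = f <$> eval row e
eval row (Lift2 f l r) = f <$> eval row l <*> eval row r

-- Apply the expression to every row, producing the derived column.
deriveColumn :: Expr a -> [Row] -> [Maybe a]
deriveColumn e = map (`eval` e)

-- Same shape as the changelog example: quantity (Int) times item_price (Double).
totalPrice :: Expr Double
totalPrice = Lift fromIntegral (col @Int "quantity") * col @Double "item_price"

-- Made-up sample rows.
sampleRows :: [Row]
sampleRows =
    [ [("quantity", toDyn (2 :: Int)), ("item_price", toDyn (2.5 :: Double))]
    , [("quantity", toDyn (1 :: Int)), ("item_price", toDyn (4.0 :: Double))]
    ]

main :: IO ()
main = print (deriveColumn totalPrice sampleRows)
```

Because the expression is data rather than an opaque function, the column names and types it depends on can be inspected before any row is evaluated, which the tuple-plus-function style could not offer.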
Adds a coverage report to the repository (thanks to @oforero)
We don’t have good test coverage right now. This will help us determine where to invest. @oforero provided a script to make an HPC HTML report for coverage.
Convenience functions for comparisons
Instead of lifting all Bool operations, we provide eq, leq, etc.
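The difference between lifting a comparison by hand and using a named helper can be sketched as follows. This is a toy model using only base: the Expr newtype, lift2, lit, and the signatures of eq, lt, leq, gt, and geq are assumptions for illustration and are not copied from the library.

```haskell
-- A toy column expression: a function from a row to a value.
newtype Expr row a = Expr {runExpr :: row -> a}

-- Generic lifting of any binary function over two expressions.
lift2 :: (a -> b -> c) -> Expr row a -> Expr row b -> Expr row c
lift2 f (Expr l) (Expr r) = Expr (\row -> f (l row) (r row))

-- A constant expression that ignores the row.
lit :: a -> Expr row a
lit x = Expr (const x)

-- Named comparison helpers, each defined from the matching operator.
eq, lt, leq, gt, geq :: Ord a => Expr row a -> Expr row a -> Expr row Bool
eq = lift2 (==)
lt = lift2 (<)
leq = lift2 (<=)
gt = lift2 (>)
geq = lift2 (>=)

-- Here the "row" is just an Int, so the identity function reads its value.
atLeastTwo :: Expr Int Bool
atLeastTwo = Expr id `geq` lit 2

main :: IO ()
main = print (filter (runExpr atLeastTwo) [1, 2, 3])
```

Defining each helper directly from its operator keeps the pairing of name and operator visible on one line, which makes a swapped definition easier to spot in review.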
0.1.0.3
- Use older version of correlation for ihaskell integration
0.1.0.2
- Change namespace from Data.DataFrame to DataFrame
- Add toVector function for converting columns to vectors.
- Add impute function for replacing Nothing values in optional columns.
- Add filterAllJust to filter out all rows with missing data.
- Add distinct function that returns a dataframe with distinct rows.
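The three data-cleaning operations in the list above can be sketched on plain lists with base alone. The function names imputeWith, dropMissing, and distinctRows are hypothetical stand-ins chosen to avoid implying the library's actual signatures, and the Row type and sample data are made up.

```haskell
import Data.List (nub)
import Data.Maybe (fromMaybe, isJust)

-- Hypothetical row: a name plus an optional score.
type Row = (String, Maybe Int)

-- Replace every missing value in a column with a default.
imputeWith :: a -> [Maybe a] -> [a]
imputeWith def = map (fromMaybe def)

-- Keep only rows whose optional field is present.
dropMissing :: [Row] -> [Row]
dropMissing = filter (isJust . snd)

-- Remove duplicate rows, keeping the first occurrence of each.
distinctRows :: Eq a => [a] -> [a]
distinctRows = nub

-- Made-up sample data with one missing value and one duplicate row.
sampleRows :: [Row]
sampleRows = [("a", Just 1), ("b", Nothing), ("a", Just 1)]

main :: IO ()
main = do
    print (imputeWith 0 (map snd sampleRows))
    print (dropMissing sampleRows)
    print (distinctRows sampleRows)
```

Note that nub is quadratic in the number of rows, which is fine for a sketch but would not suit large tables.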
0.1.0.1
- Fixed parse failure on nested, escaped quotation.
- Fixed column info when field name isn’t found.
0.1.0.0
- Initial release