dataframe
A fast, safe, and intuitive DataFrame library.
| Snapshot | Version |
|-----------------------------|---------|
| LTS Haskell 24.11 | 0.2.0.2 |
| Stackage Nightly 2026-04-21 | 1.1.2.0 |
| Latest on Hackage | 1.1.2.0 |
dataframe-1.1.2.0@sha256:093ec876b2192f06e10a090310a75aa76b5dd6c9886023209de029b68241472e,10982

Module documentation for 1.1.2.0
- DataFrame
- DataFrame.DecisionTree
- DataFrame.Display
- DataFrame.Display.Terminal
- DataFrame.Display.Web
- DataFrame.Errors
- DataFrame.Functions
- DataFrame.IO
- DataFrame.IO.CSV
- DataFrame.IO.JSON
- DataFrame.IO.Parquet
- DataFrame.IO.Parquet.Binary
- DataFrame.IO.Parquet.ColumnStatistics
- DataFrame.IO.Parquet.Compression
- DataFrame.IO.Parquet.Dictionary
- DataFrame.IO.Parquet.Encoding
- DataFrame.IO.Parquet.Levels
- DataFrame.IO.Parquet.Page
- DataFrame.IO.Parquet.Seeking
- DataFrame.IO.Parquet.Thrift
- DataFrame.IO.Parquet.Time
- DataFrame.IO.Parquet.Types
- DataFrame.Internal
- DataFrame.Internal.Binary
- DataFrame.Internal.Column
- DataFrame.Internal.DataFrame
- DataFrame.Internal.Expression
- DataFrame.Internal.Grouping
- DataFrame.Internal.Interpreter
- DataFrame.Internal.Nullable
- DataFrame.Internal.Parsing
- DataFrame.Internal.Row
- DataFrame.Internal.Schema
- DataFrame.Internal.Statistics
- DataFrame.Internal.Types
- DataFrame.Lazy
- DataFrame.Monad
- DataFrame.Operations
- DataFrame.Operators
- DataFrame.Synthesis
- DataFrame.Typed
DataFrame
Tabular data analysis in Haskell. Read CSV, Parquet, and JSON files, transform columns with a typed expression DSL, and optionally lock down your entire schema at the type level for compile-time safety.
The library ships three API layers — all operating on the same underlying DataFrame type at runtime:
- Untyped (`import qualified DataFrame as D`) — string-based column names, great for exploration and scripting.
- Typed (`import qualified DataFrame.Typed as T`) — phantom-type schema tracking with compile-time column validation.
- Monadic API — write your transformation as a self-contained pipeline.
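The monadic layer looks like this (a sketch adapted from the changelog examples further down; the housing columns and file path are illustrative):

```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TemplateHaskell #-}

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Monad
import Data.Text (Text)
import DataFrame.Functions ((.&&), (.>=))

-- Generate typed column bindings from the CSV at compile time.
$(F.declareColumnsFromCsvFile "./data/housing.csv")

main :: IO ()
main = do
  df <- D.readCsv "./data/housing.csv"
  -- Each monadic step sees the columns derived by earlier steps.
  print $ execFrameM df $ do
    is_expensive <- deriveM "is_expensive" (median_house_value .>= 500000)
    filterWhereM (is_expensive .&& median_income .>= 8)
```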
Why this library?
- Concise, declarative, composable data pipelines using the `|>` pipe operator.
- Choose your level of type safety: keep it lightweight for quick analysis, or lock it down for production pipelines.
- High performance from Haskell’s optimizing compiler and an efficient columnar memory model with bitmap-backed nullability.
- Designed for interactivity: a custom REPL, IHaskell notebook support, terminal and web plotting, and helpful error messages.
Install
```
cabal update
cabal install dataframe
```

To use as a dependency in a project:

```
build-depends: base >= 4, dataframe
```
Works with GHC 9.4 through 9.12. A custom REPL with all imports pre-loaded is available after installing:

```
dataframe
```
Quick Start
Save this as Example.hs and run with cabal run Example.hs:
```haskell
#!/usr/bin/env cabal
{- cabal:
build-depends: base >= 4, dataframe
-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications #-}

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Operators

main :: IO ()
main = do
  let sales = D.fromNamedColumns
        [ ("product", D.fromList [1, 1, 2, 2, 3, 3 :: Int])
        , ("amount", D.fromList [100, 120, 50, 20, 40, 30 :: Int])
        ]
  -- Group by product and compute totals
  print $ sales
    |> D.groupBy ["product"]
    |> D.aggregate [ F.sum (F.col @Int "amount") `as` "total"
                   , F.count (F.col @Int "amount") `as` "orders"
                   ]
```
```
-----------------------
product | total | orders
--------|-------|-------
    Int |   Int |    Int
--------|-------|-------
      1 |   220 |      2
      2 |    70 |      2
      3 |    70 |      2
```
Reading from files works the same way:
```haskell
df <- D.readCsv "data.csv"
df <- D.readParquet "data.parquet"
-- Hugging Face datasets
df <- D.readParquet "hf://datasets/scikit-learn/iris/default/train/0000.parquet"
```
Interactive REPL
The dataframe REPL comes with all imports pre-loaded. Here’s a typical exploration session:
```
dataframe> df <- D.readCsv "./data/housing.csv"
dataframe> D.dimensions df
(20640, 10)
dataframe> D.describeColumns df
------------------------------------------------------------------------
Column Name        | # Non-null Values | # Null Values | Type
-------------------|-------------------|---------------|-------------
Text               | Int               | Int           | Text
-------------------|-------------------|---------------|-------------
total_bedrooms     | 20433             | 207           | Maybe Double
ocean_proximity    | 20640             | 0             | Text
median_house_value | 20640             | 0             | Double
median_income      | 20640             | 0             | Double
households         | 20640             | 0             | Double
population         | 20640             | 0             | Double
total_rooms        | 20640             | 0             | Double
housing_median_age | 20640             | 0             | Double
latitude           | 20640             | 0             | Double
longitude          | 20640             | 0             | Double
```
The `:declareColumns` macro generates typed column references from a dataframe, so you can use column names directly in expressions instead of writing `F.col @Double "median_income"` every time:
```
dataframe> :declareColumns df
"longitude :: Expr Double"
"latitude :: Expr Double"
"housing_median_age :: Expr Double"
"total_rooms :: Expr Double"
"total_bedrooms :: Expr (Maybe Double)"
"population :: Expr Double"
"households :: Expr Double"
"median_income :: Expr Double"
"median_house_value :: Expr Double"
"ocean_proximity :: Expr Text"
```
```
dataframe> df |> D.groupBy ["ocean_proximity"]
           |> D.aggregate [F.mean median_house_value `as` "avg_value"]
-------------------------------------
ocean_proximity  | avg_value
-----------------|-------------------
Text             | Double
-----------------|-------------------
<1H OCEAN        | 240084.28546409807
INLAND           | 124805.39200122119
ISLAND           | 380440.0
NEAR BAY         | 259212.31179039303
NEAR OCEAN       | 249433.97742663656
```
Create new columns from existing ones:
```
dataframe> df |> D.derive "rooms_per_household" (total_rooms / households) |> D.take 3
-----------------------------------------------------------------------------------------------------------------
 longitude | latitude | housing_median_age | total_rooms | ... | ocean_proximity | rooms_per_household
-----------|----------|--------------------|-------------|-----|-----------------|--------------------
 Double    | Double   | Double             | Double      | ... | Text            | Double
-----------|----------|--------------------|-------------|-----|-----------------|--------------------
 -122.23   | 37.88    | 41.0               | 880.0       | ... | NEAR BAY        | 6.984126984126984
 -122.22   | 37.86    | 21.0               | 7099.0      | ... | NEAR BAY        | 6.238137082601054
 -122.24   | 37.85    | 52.0               | 1467.0      | ... | NEAR BAY        | 8.288135593220339
```
Type mismatches are caught as compile errors — adding a Double column to a Text column won’t silently produce garbage:
```
dataframe> df |> D.derive "nonsense" (latitude + ocean_proximity)

<interactive>:14:47: error: [GHC-83865]
    • Couldn't match type 'Text' with 'Double'
      Expected: Expr Double
        Actual: Expr Text
    • In the second argument of '(+)', namely 'ocean_proximity'
      In the second argument of 'derive', namely
        '(latitude + ocean_proximity)'
```
Template Haskell
For scripts and projects, Template Haskell can generate column bindings at compile time.
Generate column references from a CSV
`declareColumnsFromCsvFile` reads your CSV at compile time and generates typed `Expr` bindings for every column:
```haskell
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Operators

-- Reads housing.csv at compile time and generates:
--   latitude :: Expr Double
--   total_rooms :: Expr Double
--   ocean_proximity :: Expr Text
--   ... one binding per column
$(F.declareColumnsFromCsvFile "./data/housing.csv")

main :: IO ()
main = do
  df <- D.readCsv "./data/housing.csv"
  print $ df
    |> D.derive "rooms_per_household" (total_rooms / households)
    |> D.filterWhere (median_income .>. 5)
    |> D.groupBy ["ocean_proximity"]
    |> D.aggregate [F.mean median_house_value `as` "avg_value"]
```
Compare this to the manual version which requires spelling out every column name and type:
```haskell
-- Without TH — every column needs its name and type spelled out
df |> D.derive "rooms_per_household"
        (F.col @Double "total_rooms" / F.col @Double "households")
   |> D.filterWhere (F.col @Double "median_income" .>. F.lit 5)
```
Generate a schema type from a CSV
`deriveSchemaFromCsvFile` generates a type synonym for use with the typed API — instead of manually writing out every column name and type:
```haskell
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE DataKinds #-}

import qualified DataFrame.Typed as T

-- Generates:
--   type HousingSchema = '[ T.Column "longitude" Double
--                         , T.Column "latitude" Double
--                         , T.Column "total_rooms" Double
--                         , ...
--                         ]
$(T.deriveSchemaFromCsvFile "HousingSchema" "./data/housing.csv")
```
Typed API
When you want compile-time guarantees that column names exist and types match, wrap your DataFrame in a TypedDataFrame:
```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified DataFrame as D
import qualified DataFrame.Typed as T
import Data.Text (Text)
import DataFrame.Operators

type EmployeeSchema =
  '[ T.Column "name" Text
   , T.Column "department" Text
   , T.Column "salary" Double
   ]

main :: IO ()
main = do
  df <- D.readCsv "employees.csv"
  case T.freeze @EmployeeSchema df of
    Nothing  -> putStrLn "Schema mismatch!"
    Just tdf -> do
      let result = tdf
            |> T.derive @"bonus" (T.col @"salary" * T.lit 0.1)
            |> T.filterWhere (T.col @"salary" .>. T.lit 50000)
            |> T.select @'["name", "bonus"]
      print (T.thaw result)
```
T.freeze validates the runtime DataFrame against your schema once at the boundary. After that, every column access is checked at compile time:
```haskell
-- Typo in column name → compile error
tdf |> T.filterWhere (T.col @"slary" .>. T.lit 50000)
-- error: Column "slary" not found in schema

-- Wrong type → compile error
tdf |> T.filterWhere (T.col @"name" .>. T.lit 50000)
-- error: Couldn't match type 'Text' with 'Double'
```
filterAllJust goes further — it strips Maybe from every column in the schema type, so downstream code can’t accidentally treat cleaned columns as nullable:
```haskell
-- Before: TypedDataFrame '[Column "score" (Maybe Double), Column "name" Text]
let cleaned = T.filterAllJust tdf
-- After:  TypedDataFrame '[Column "score" Double, Column "name" Text]
cleaned |> T.derive @"scaled" (T.col @"score" * T.lit 100)
```
Features
- I/O: CSV, TSV, Parquet (Snappy, ZSTD, Gzip), JSON. Read Parquet from HTTP URLs and Hugging Face datasets (`hf://` URIs). Column projection and predicate pushdown for Parquet reads.
- Operations: filter, select, derive, groupBy, aggregate, joins (inner, left, right, full outer), sort, sample, stratified sample, distinct, k-fold splits.
- Expressions: typed column references (`F.col @Double "x"`), arithmetic, comparisons, logical operators, nullable-aware three-valued logic (`.==`, `.&&`), string matching (like, regex), casting, and user-defined functions via `lift`/`lift2`.
- Statistics: mean, median, mode, variance, standard deviation, percentiles, inter-quartile range, correlation, skewness, frequency tables, imputation.
- Plotting: terminal plots (histogram, scatter, line, bar, box, pie, heatmap, stacked bar, correlation matrix) and interactive HTML plots.
- Lazy engine: streaming query execution for files that don't fit in memory. Rule-based optimizer with filter fusion, predicate pushdown, and dead column elimination. Pull-based executor with configurable batch sizes.
- Interop: Arrow C Data Interface for zero-copy round-trips with Python and Polars.
- ML: decision trees (TAO algorithm), feature synthesis, k-fold cross-validation, stratified sampling.
- Notebooks: IHaskell integration with pre-built Binder examples.
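As a sketch of how the expression pieces compose (the `income` column and thresholds are illustrative; the operators and functions are the ones used elsewhere in this README):

```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications #-}

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Operators
import Data.Text (Text)

-- Bucket rows with a conditional expression, then filter on a comparison.
bracketed :: D.DataFrame -> D.DataFrame
bracketed df = df
  |> D.derive "bracket"
       (F.ifThenElse (F.col @Double "income" .>. F.lit 100000)
                     (F.lit ("high" :: Text))
                     (F.lit "low"))
  |> D.filterWhere (F.col @Double "income" .>. F.lit 20000)
```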
Lazy Queries
For files too large to fit in memory, DataFrame.Lazy provides a streaming query engine. Declare a schema, build a query plan with the same familiar operations, and runDataFrame runs it through an optimizer before streaming results batch-by-batch:
```haskell
import qualified DataFrame.Lazy as L
import qualified DataFrame.Functions as F
import DataFrame.Operators
import DataFrame.Internal.Schema (Schema, schemaType)
import Data.Text (Text)

mySchema :: Schema
mySchema = [ ("name", schemaType @Text)
           , ("weight", schemaType @Double)
           , ("height", schemaType @Double)
           ]

main :: IO ()
main = do
  result <- L.runDataFrame $
    L.scanCsv mySchema "large_file.csv"
      |> L.filter (F.col @Double "height" .>. F.lit 1.7)
      |> L.select ["name", "weight", "height"]
      |> L.derive "bmi" (F.col @Double "weight"
                          / (F.col @Double "height" * F.col @Double "height"))
      |> L.take 1000
  print result
```
The optimizer pushes the filter into the scan, drops unreferenced columns before reading, and stops pulling batches once 1000 rows have been collected.
Documentation
- User guide: https://dataframe.readthedocs.io/en/latest/
- API reference: https://hackage.haskell.org/package/dataframe/docs/DataFrame.html
- Coming from pandas, Polars, dplyr, or Frames?
- Cookbook (SQL-style patterns)
- Tutorials
- Discord: https://discord.gg/8u8SCWfrNC
Changes
Revision history for dataframe
1.1.2.0
- Safe read can now choose between Either and Maybe for error handling.
- Add countAll and over (window) functions.
- Add mkRandom to make random columns.
- Faster CSV parsing.
1.1.1.0
New features
- Add `DataFrame.Typed.Lazy` module — a type-safe lazy query pipeline combining compile-time schema tracking with deferred execution.
- Add `fromCsv` function for parsing a CSV string directly into a DataFrame.
- Add `DataKinds` extension and `DataFrame.Typed` import to the GHCi file for easier interactive typed dataframe workflows.
Performance
- Specialize and inline aggregation functions (`sum`, `mean`, `variance`, `median`, `stddev`, etc.) to avoid expensive numeric conversions at runtime.
- Remove `Ord` constraint from `Columnable'` and move it to call sites, reducing unnecessary constraint propagation.
- Replace exponential type-level `If` nesting in typed schema families with linear helper type families, fixing slow compilation times.
Bug fixes
- Fix Functions module compilation under GHC 9.10 (#194).
- Document and test `safeColumns` option in `ParquetReadOptions` (#190).
1.1.0.0
Breaking changes
- Remove `OptionalColumn` constructor; fold nullability into `BoxedColumn`/`UnboxedColumn` via a bit-packed bitmap.
- Remove `NFData` instance from `Columnable` constraint.
New features
- Add `toCsv` and `toSeparated` for converting a DataFrame to CSV/delimited text without writing to a file.
- `safeRead` now defaults reading columns to `Maybe a`.
- Split SIMD CSV reader into a separate `dataframe-fastcsv` package.
Bug fixes
- Fix joins for missing key columns (#187).
- Fix single column not found error when using typed dataframe.
- Fix Synthesis to use `SafeLookup` constraint.
- Fix `writeSeparated` ignoring separator parameter (was hardcoded to comma).
Internal
- Slice groups now use custom backpermute instead of converting unboxed vectors.
- Reuse comparison operators in Subset.
- Refactor `getRowAsText` for readability using pattern guards.
1.0.0.1
- `toMarkdownTable` is now `toMarkdown` (mostly used internally).
- Provide `toMarkdown'` that outputs a String.
- Add associativity to nullable operators.
- Better null dataframe handling/error messages for core operations.
- Fix some function display names.
- Examples are now built in CI.
1.0.0.0
- Fix mappend to respect schema of empty columns.
- Add cast operators that force column schema
- Add null aware operators so some operations are easier.
- Add arrow shim with python example.
- Add numeric promotion for numeric operations.
- Add column provenance tracking
- Add stratified sampling
- Read files from hugging face
0.7.0.0
- This release adds A LOT of AI code to the repo (which we’ll now pause in favour of refactoring, testing, and completeness for 1.0)
- The lazy reader now has a custom binary format that it spills to. This almost halved the time it takes to run the 1 billion row challenge. The lazy evaluation now also supports Parquet.
- You can now explicitly set the schema of a subset of columns (named) as opposed to setting a list.
- Join order is changed from pipelining style to function call style.
- Fix a bug where division was marked as commutative.
- `Columnable` now requires `NFData`.
- Parquet reads are now faster across the board.
- Better test coverage.
0.6.0.0
- New typed API; see https://dataframe.readthedocs.io/en/latest/using_dataframe_in_a_standalone_script.html
- Faster joins
- Fine-grained Parquet reads using `readParquetFilesWithOpts`.
0.5.0.0
- SortOrder now takes an expression rather than a string.
- selectBy now respects the original column order.
- Some changes to the internal representation of the expression GADT.
- Added fixity to binary operations for column comparisons.
- Fixed ZSTD decompression for Parquet files.
- Added logical type handling for timestamps and decimals.
- aggregation now handles null dataframes gracefully.
- Drop random dep to 1.2 and use random-shuffle for shuffling instead.
- Join now merges columns to a `These`.
- Added `writeTsv` function.
- Move expression operators to the `Operators` namespace.
0.4.1.0
- Improve signal handling of dataframe repl.
- `writeCsv` now correctly writes `Maybe` values (thanks to @mcoady).
- Create a boilerplate package in cache that will be used to start the repl.
- Tree implementation is now a TAO tree instead of greedy CART trees.
- Add `sampleM` and `takeM` functions.
0.4.0.10
- License in cabal was wrong.
- Remove ollama-haskell dependencies.
- readParquetFiles for reading globs.
- Fixed printing of `neq` function.
0.4.0.9
- Update license to MIT.
0.4.0.8
- LLM guided decision tree
- Parquet fixes: floats read properly and bit width now properly interpreted.
- renameM added to monadic functions.
- No need to explicitly import text for declareColumns
0.4.0.7
- Pretty printer for expressions.
- Fix issue with how `pow` was getting displayed.
- `exposeColumns` has been renamed to `declareColumns`.
0.4.0.6
- Even faster groupby: uses radix sort rather than mergesort.
- Created `rowValue` function — `df |> D.toRowList |> map (D.rowValue some_column)`.
- SIMD reads for TSV files (thanks to @jhingon).
- Fix string representation of recodeWithDefault.
- Fix gain function for decision tree.
- Add disallowed pairs for decision tree analyst.
- Decision tree percentiles are now a tree-level configuration.
0.4.0.5
- Faster groupby: does fewer allocations by keeping everything in a mutable vector.
- declareColumnsFromCsvFile now infers types from a sample rather than reading the whole dataframe.
- Decision trees API is now more configurable.
- Add annotation to show what expressions were used to derive a column.
0.4.0.4
- More robust synthesis based decision tree
- Improved performance on sum and mean.
- recodeWithCondition - to change a value given a condition
- medianMaybe, genericPercentile, percentile - self explanatory
- all the maybe functions as dataframe functions
- Fix concatColumnsEither when types are the same.
- Decision tree implementation is more robust now.
0.4.0.3
- Improved performance for folds and reductions.
- Improve standalone mean and correlation functions.
- Remove buggy boxedness check in aggregations.
- CSV files shouldn’t have spaces in headers.
- Small decision tree implementation (experimental).
0.4.0.1
- Fuse literals in binary expressions and conditionals: we can now express computations like `df |> D.groupBy [F.name ocean_proximity] |> D.aggregate ["rand" .= F.sum (F.ifThenElse (ocean_proximity .== "ISLAND") 1 0)]`.
- Unary aggregations no longer mistakenly box unboxed instances.
0.4.0.0
- `readSeparated` no longer takes the separator as an argument. This is now placed into `readOptions`.
- Some improvements to the synthesis demo.
- Add a `declareColumnsParquetFile` function.
- Column conversion functions now take expressions instead of strings.
- Add more monadic functions to make previously tricky transformations easier to write:
```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TemplateHaskell #-}
module Main where

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Monad
import Data.Text (Text)
import DataFrame.Functions ((.&&), (.>=))

$(F.declareColumnsFromCsvFile "./data/housing.csv")

main :: IO ()
main = do
  df <- D.readCsv "./data/housing.csv"
  print $ execFrameM df $ do
    is_expensive <- deriveM "is_expensive" (median_house_value .>= 500000)
    meanBedrooms <- inspectM (D.meanMaybe total_bedrooms)
    totalBedrooms <- imputeM total_bedrooms meanBedrooms
    filterWhereM (totalBedrooms .>= 200 .&& is_expensive)
```
0.3.5.0
- Add a `deriveWithExpr` that returns an expression that you can use in subsequent expressions.
- Add `declareColumnsFromCsvFile` which can create the expressions up front for use in scripts:

```haskell
import qualified DataFrame as D
import qualified DataFrame.Functions as F
import Data.Text (Text)
import DataFrame.Functions ((.==), (.>=))

$(F.declareColumnsFromCsvFile "./data/housing.csv")

main :: IO ()
main = do
  df <- D.readCsv "./data/housing.csv"
  let (df', test) = D.deriveWithExpr "test" (median_house_value .>= 500000) df
  print (D.filterWhere test df')
```

- Fix bounds on random.
- Parquet Column chunks weren’t reading properly because we didn’t correctly calculate the list size.
- Sum function had a bug where the first number was summed twice.
- Add monadic interface for building dataframe expressions that makes schema evolution nice.
```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TemplateHaskell #-}
module Main where

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import DataFrame.Monad
import Data.Text (Text)
import DataFrame.Functions ((.&&), (.>=))

$(F.declareColumnsFromCsvFile "./data/housing.csv")

main :: IO ()
main = do
  df <- D.readCsv "./data/housing.csv"
  print $ runFrameM df $ do
    is_expensive <- deriveM "is_expensive" (median_house_value .>= 500000)
    filterWhereM is_expensive
    luxury <- deriveM "luxury" (is_expensive .&& median_income .>= 8)
    filterWhereM luxury
```

- Change the order of exponentiation, putting the exponent second. It was initially first because of an internal efficiency detail, but that's silly.
- Fix bug where we didn’t concat columns from row groups.
0.3.4.1
- Faster sum operation (now does a reduction instead of collecting the vector and aggregating).
- Update the fixity of comparison operations. Before: `(x + y) .<= 10`. Now: `x + y .<= 10`.
- Revert sort for groupby back to mergesort.
0.3.4.0
- Fix right join - previously erased some values in the key.
- Change sort API so we can sort on different rows.
- Add `meanMaybe` and `stddevMaybe` that work on `Maybe` values.
- More efficient numeric groupby — use radix sort for indices and pre-sort when collecting.
0.3.3.9
- Fix compilation issue for ghc 9.12.*
0.3.3.8
- More efficient inner joins using hashmaps.
- Initial JSON lines implementation
- More robust logic when specifying CSV types.
- Strip spaces from titles and rows in CSV reading.
- Auto parsing bools in CSV.
- Add `imputeWith`, `bind`, `nRows`, `nColumns`, and `recodeWithDefault` functions.
- Better support for proper markdown.
- Fix bug with full outer join.
- Unify `insertVector` and `insertList` functions into `insert`.
0.3.3.7
- Many functions now rely on expressions (not strings).
- full, left, and right join now implemented.
- fastCsv now strips quotations from text.
- Add “NA” as a nullish pattern.
- Add bin parameter to terminal plotting.
- Implement filterAllNothing for null handling.
- Remove behaviour where we parse mixed types as `Either`.
- Add `whenPresent`, `whenBothPresent` and `recode` functions.
- Web charts now show on first load.
- Add deriveMay function for multiple column derivations.
0.3.3.6
- Fix bug where doubles were parsing as ints
- Fix bugs where optionals were left in boxed column (instead of optionals)
- Change syntax for conditional operations so it doesn’t clash with regular operations.
0.3.3.5
- Fix parsing logic for doubles. Entire parsing logic is still a work in progress.
- Speed up index selection by using backPermute.
- Add `mode` function to `Functions`.
- Rewrite some expressions to make evaluation more efficient.
- Show correct number of rows in message after truncating for display.
- Add experimental fast CSV parsers (thanks @jhingon)
- Add support to read dataframes from SQL databases.
0.3.3.4
- Add linting CI step + fix existing lint errors.
- Show now only prints 10 rows. To print more, use the new `display` function, which takes the number of rows as a parameter in its configuration.
- Add `toDouble`, `div`, and `mod` functions.
- Define an `IsString` instance for columns so you can use string literals without `F.lit`.
- Improved filter performance.
- Make beam search loss function configurable for synthesizing features.
0.3.3.3
- Split `toMatrix` into more specific `to<Type>Matrix` functions.
0.3.3.2
- Update documentation on both readthedocs and hackage.
0.3.3.1
- Fix bug in `randomSplit` causing two splits to overlap.
0.3.3.0
- Better error messaging for expression failures.
- Fix bug where exponentials were not being parsed properly during CSV parsing.
- `toMatrix` now returns either an exception or a vector of vectors of doubles.
- Add `sample`, `kFolds`, and `randomSplit` to sample dataframes.
0.3.2.0
- Fix dataframe semigroup instance. Appending two rows of the same name but different types now gives a row of `Either a b` (work by @jhrcek).
- Fix left expansion of semigroup instance (work by @jhrcek).
- Added `hasElemType` function that can be used with `selectBy` to filter columns by type, e.g. `selectBy [byProperty (hasElemType @Int)] df`.
- Added basic support for program synthesis for feature generation (`synthesizeFeatureExpr`) and symbolic regression (`fitRegression`).
- Web plotting doesn't embed the entire script anymore.
- Added `relu`, `min`, and `max` functions for expressions.
- Add `fromRows` function to build a dataframe from rows. Also add `toAny` function that converts a value to a dynamic-like Columnable value.
- `isNumeric` function now recognises `Integer` types.
- Added `readCsvWithOpts` function that allows read specification.
- Expose option to specify data formats when parsing CSV.
- Added setup script for Hasktorch example.
0.3.1.2
- Update granite version, again, for stackage.
0.3.1.1
- Aggregation now works on expressions rather than just column references.
- Export writeCsv
- Loosen bounds for dependencies to keep library on stackage.
- Add `filterNothing` function that returns all empty rows of a column.
- Add `IfThenElse` function for conditional expressions.
- Add `synthesizeFeatureExpr` function that does a search for a predictive variable in a `Double` dataframe.
0.3.1.0
- Add new `selectBy` function which subsumes all the other select functions. Specifically we can:
  - `selectBy [byName "x"] df`: normal select.
  - `selectBy [byProperty isNumeric] df`: all columns with a given property.
  - `selectBy [byNameProperty (T.isPrefixOf "weight")] df`: select by column name predicate.
  - `selectBy [byIndexRange (0, 5)] df`: picks the first six columns.
  - `selectBy [byNameRange ("a", "c")] df`: select names within a range.
- Cut down dependencies to reduce binary/installation size.
- Add module for web plots that uses chartjs.
- Web plots can open in the browser.
0.3.0.4
- Fix bug with parquet reader.
0.3.0.3
- Improved parquet reader. The reader now supports most parquet files downloaded from internet sources
- Supports all primitive parquet types plain and uncompressed.
- Can decode both v1 and v2 data pages.
- Supports Snappy and ZSTD compression.
- Supports RLE/bitpacking encoding for primitive types
- Backward compatible with INT96 type.
- From the parquet-testing repo we can successfully read the following:
- alltypes_dictionary.parquet
- alltypes_plain.parquet
- alltypes_plain.snappy.parquet
- alltypes_tiny_pages_plain.parquet
- binary_truncated_min_max.parquet
- datapage_v1-corrupt-checksum.parquet
- datapage_v1-snappy-compressed-checksum.parquet
- datapage_v1-uncompressed-checksum.parquet
- Improve CSV parsing: Parse bytestring and convert to text only at the end. Remove some redundancies in parsing with suggestions from @Jhingon.
- Faster correlation computation.
- Update version of granite that ships with dataframe and add new scatterBy plot.
0.3.0.2
- Re-enable Parquet.
- Change columnInfo to describeColumns
- We can now convert columns to lists.
- Fast reductions and groupings. GroupBys are now a dataframe construct not a column construct (thanks to @stites).
- Filter is now faster because we do mutation on the index vector.
- Frequency tables now correctly display percentages (thanks @kayvank).
- Show table implementations have been unified (thanks @metapho-re)
- We now compute statistics on null columns
- Drastic improvement in plotting since we now use granite.
0.3.0.1
- Temporarily remove Parquet support. I think it’ll be worth creating a spin off of snappy that doesn’t rely on C bindings. Also I’ll probably spin Parquet off into a separate library.
0.3.0.0
- Now supports inner joins:

```
ghci> df |> D.innerJoin ["key_1", "key_2"] other
```

- Aggregations are now expressions, allowing for more expressive aggregation logic. Previously: `D.aggregate [("quantity", D.Mean), ("price", D.Sum)] df`; now: ``D.aggregate [(F.sum (F.col @Double "label") / (F.count (F.col @Double "label")) `F.as` "positive_rate")]``.
- In GHCi, you can now create type-safe bindings for each column and use those in expressions:

```
ghci> :exposeColumns df
ghci> D.aggregate [(F.sum label / F.count label) `F.as` "positive_rate"]
```
- Added pandas and polars benchmarks.
- Performance improvements to `groupBy`.
- Various bug fixes.
0.2.0.2
- Experimental Apache Parquet support.
- Rename conversion functions (changed from `toColumn` and `toColumn'` to `fromVector` and `fromList`).
- Rename constructor for dataframe to fromNamedColumns
- Create an error context for error messages so we can change the exceptions as they are thrown.
- Provide safe versions of building block functions that allow us to build good traces.
- Add readthedocs support.
0.2.0.1
- Fix bug with new comparison expressions. gt and geq were actually implemented as lt and leq.
- Changes to make library work with ghc 9.10.1 and 9.12.2
0.2.0.0
Replace Function adt with a column expression syntax.
Previously, we tried to stay as close to Haskell as possible: we used the explicit ordering of the column names in the first part of the tuple to determine the function arguments, and a regular Haskell function that we evaluated piece-wise on each row.

```haskell
let multiply (a :: Int) (b :: Double) = fromIntegral a * b
let withTotalPrice = D.deriveFrom (["quantity", "item_price"], D.func multiply) "total_price" df
```

Now, we have a column expression syntax that mirrors PySpark and Polars.

```haskell
let withTotalPrice = D.derive "total_price" (D.lift fromIntegral (D.col @Int "quantity") * (D.col @Double "item_price")) df
```
Adds a coverage report to the repository (thanks to @oforero)
We don’t have good test coverage right now. This will help us determine where to invest. @oforero provided a script to make an HPC HTML report for coverage.
Convenience functions for comparisons
Instead of lifting all bool operations we provide eq, leq etc.
0.1.0.3
- Use older version of correlation for ihaskell integration.
0.1.0.2
- Change namespace from `Data.DataFrame` to `DataFrame`.
- Add `toVector` function for converting columns to vectors.
- Add `impute` function for replacing `Nothing` values in optional columns.
- Add `filterAllJust` to filter out all rows with missing data.
- Add `distinct` function that returns a dataframe with distinct rows.
0.1.0.1
- Fixed parse failure on nested, escaped quotation.
- Fixed column info when field name isn’t found.
0.1.0.0
- Initial release