Deprecated

sparkle

Distributed Apache Spark applications in Haskell

Version on this page:	0.5.0.1
LTS Haskell 12.26:	0.7.4
Stackage Nightly 2018-09-28:	0.7.4
Latest on Hackage:	0.7.4@rev:1

BSD-3-Clause licensed by Tweag I/O

Maintained by [email protected]

This version can be pinned in stack with: sparkle-0.5.0.1@sha256:2771e60f2ee37f8aeca43d89fa731cf8d54db5154dc510f038fa00e6f2988d38,2859

Module documentation for 0.5.0.1

Control
- Control.Distributed
  - Control.Distributed.Spark
    - Control.Distributed.Spark.Closure
    - Control.Distributed.Spark.Context
    - Control.Distributed.Spark.ML
      - Control.Distributed.Spark.ML.Feature
        - Control.Distributed.Spark.ML.Feature.CountVectorizer
        - Control.Distributed.Spark.ML.Feature.RegexTokenizer
        - Control.Distributed.Spark.ML.Feature.StopWordsRemover
      - Control.Distributed.Spark.ML.LDA
    - Control.Distributed.Spark.PairRDD
    - Control.Distributed.Spark.RDD
    - Control.Distributed.Spark.SQL

Depends on 17 packages (full list with versions):

base, binary, bytestring, choice, distributed-closure, filepath, jni, jvm, jvm-streaming, process, regex-tdfa, singletons, sparkle, streaming, text, vector, zip-archive

Used by 1 package in lts-9.21 (full list with versions):

sparkle

sparkle: Apache Spark applications in Haskell

sparkle [spär′kəl]: a library for writing resilient analytics applications in Haskell that scale to thousands of nodes, using Spark and the rest of the Apache ecosystem under the hood. See this blog post for the details.

This is an early tech preview, not production ready.

Getting started

The tl;dr using the hello app as an example on your local machine:

$ stack build hello
$ stack exec -- sparkle package sparkle-example-hello
$ stack exec -- spark-submit --master 'local[1]' sparkle-example-hello.jar

How to use

To run a Spark application the process is as follows:

- create an application in the apps/ folder, in-repo or as a submodule;
- add your app to stack.yaml;
- build the app;
- package your app into a deployable JAR container;
- submit it to a local or cluster deployment of Spark.

If you run into issues, read the Troubleshooting section below first.

Build

Linux

Requirements

- the Stack build tool (version 1.2 or above);
- either the Nix package manager,
- or OpenJDK, Gradle and Spark (version 1.6) installed from your distro.

To build:

$ stack build

You can optionally get Stack to download Spark and Gradle into a local sandbox (using Nix) so that builds are reproducible. This is the recommended way to build sparkle. Alternatively, you’ll need these installed through your OS distribution’s package manager for the next steps (and you’ll need to tell Stack how to find the JVM header files and shared libraries).

To use Nix, set the following in your ~/.stack/config.yaml (or pass --nix to all Stack commands, see the Stack manual for more):

nix:
  enable: true

Other platforms

sparkle is not directly supported on non-Linux operating systems (e.g. Mac OS X or Windows), but you can use Docker to run sparkle inside a container on those platforms. First,

$ stack docker pull

Then, just add --docker as an argument to all Stack commands, e.g.

$ stack --docker build

By default, Stack uses the tweag/sparkle build and test Docker image, which includes everything that the Nix setup described in the Linux section provides. See the Stack manual for how to modify the Docker settings.

Package

To package your app as a JAR directly consumable by Spark:

$ stack exec -- sparkle package <app-executable-name>

Submit

Finally, to run your application, for example locally:

$ stack exec -- spark-submit --master 'local[1]' <app-executable-name>.jar

The <app-executable-name> is any executable name as given in the .cabal file for your app. See apps in the apps/ folder for examples.
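
The build, package and submit steps above can be chained in a small wrapper. The following is only a sketch: the function names `sparkle_steps` and `run_sparkle_app` and the `RUN` variable are hypothetical helpers, not part of sparkle. It assumes you are in the repository root and that the master defaults to the `local[1]` used in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: print the three commands for one app.
# $1 is the executable name from the app's .cabal file,
# $2 is the Spark master (assumed default: local[1]).
sparkle_steps() {
    app=$1
    master=${2:-local[1]}
    printf '%s\n' \
        "stack build" \
        "stack exec -- sparkle package $app" \
        "stack exec -- spark-submit --master '$master' $app.jar"
}

# Echo each step; execute it only when RUN=1 is set in the environment.
run_sparkle_app() {
    sparkle_steps "$@" | while IFS= read -r cmd; do
        echo "+ $cmd"
        if [ "${RUN:-0}" = 1 ]; then
            sh -c "$cmd" || exit 1   # stop at the first failing step
        fi
    done
}
```

Calling `run_sparkle_app sparkle-example-hello` only prints the steps; `RUN=1 run_sparkle_app sparkle-example-hello` runs them in order and stops at the first failure.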

See here for other options, including launching a whole cluster from scratch on EC2. This blog post shows you how to get started on the Databricks hosted platform and on Amazon’s Elastic MapReduce.

How it works

sparkle is a tool for creating self-contained Spark applications in Haskell. Spark applications are typically distributed as JAR files, so that’s what sparkle creates. We embed Haskell native object code as compiled by GHC in these JAR files, along with any shared library required by this object code to run. Spark dynamically loads this object code into its address space at runtime and interacts with it via the Java Native Interface (JNI).
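
Since a JAR file is a zip archive, you can inspect what was embedded. This is a rough sketch and not a description of the JAR's actual layout, which this document does not specify: the helper name `list_native_libs` is hypothetical, it assumes the Info-ZIP `unzip` tool is installed, and it assumes the embedded shared libraries carry the usual Linux `.so` suffix.

```shell
#!/bin/sh
# Hypothetical helper: list entries in a JAR whose names look like
# Linux shared libraries (.so, or versioned names such as .so.6).
list_native_libs() {
    jar=$1
    if [ ! -f "$jar" ]; then
        echo "no such JAR: $jar" >&2
        return 1
    fi
    # unzip -Z1 prints one archive entry name per line.
    unzip -Z1 "$jar" | grep '\.so'
}
```

For example, `list_native_libs sparkle-example-hello.jar` after the package step; it returns a non-zero status if the file is missing or no matching entry is found.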

Troubleshooting

`jvm` library or header files not found

You’ll need to tell Stack where to find your local JVM installation. Something like the following in your ~/.stack/config.yaml should do the trick, but check that the paths match what’s on your system:

extra-include-dirs: [/usr/lib/jvm/java-7-openjdk-amd64/include]
extra-lib-dirs: [/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server]

Or use --nix: since it won’t use your globally installed JDK, it will have no trouble finding its own locally installed one.
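
If you stay with a system JDK, a quick way to find the right paths is to search the JDK directory for `jni.h` and `libjvm.so`. The sketch below assumes a Linux JDK layout; the function name `find_jvm_dirs` is hypothetical, and the JDK root you pass in depends on your distribution.

```shell
#!/bin/sh
# Hypothetical helper: print config lines for the directories that
# contain jni.h and libjvm.so under the given JDK root.
find_jvm_dirs() {
    jdk_root=$1
    # -L follows symlinks, which distro JDK directories often are.
    header=$(find -L "$jdk_root" -name jni.h 2>/dev/null | head -n 1)
    lib=$(find -L "$jdk_root" -name libjvm.so 2>/dev/null | head -n 1)
    # Fail if either file could not be located.
    [ -n "$header" ] && [ -n "$lib" ] || return 1
    echo "extra-include-dirs: [$(dirname "$header")]"
    echo "extra-lib-dirs: [$(dirname "$lib")]"
}
```

Running, for instance, `find_jvm_dirs /usr/lib/jvm/java-7-openjdk-amd64` prints two lines in the same form as the configuration shown above, or fails if either file is missing.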

Can’t build sparkle on OS X

OS X is not a supported platform for now. Several issues must be resolved before sparkle works on OS X; they are tracked in this ticket.

Gradle <= 2.12 incompatible with JDK 9

If you’re using JDK 9, note that you’ll need to either downgrade to JDK 8 or update your Gradle version, since Gradle versions up to and including 2.12 are not compatible with JDK 9.
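
To check which side of that boundary your Gradle is on, you can compare version strings. This is a sketch only: the function name `gradle_newer_than_2_12` is hypothetical, and it relies on `sort -V` (version sort), which is available in GNU coreutils but not in every `sort` implementation.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the given Gradle version
# is strictly newer than 2.12.
gradle_newer_than_2_12() {
    version=$1
    # Reject 2.12 itself, then check that the version sorts last.
    [ "$version" != "2.12" ] &&
        [ "$(printf '%s\n' "2.12" "$version" | sort -V | tail -n 1)" = "$version" ]
}
```

One way to use it is `gradle_newer_than_2_12 "$(gradle --version | awk '/^Gradle /{print $2}')"`, assuming `gradle --version` prints a line of the form `Gradle <version>`.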

License

sparkle is free software, and may be redistributed under the terms specified in the LICENSE file.

About

Tweag I/O

sparkle is maintained by Tweag I/O.

Have questions? Need help? Tweet at @tweagio.

Changes

Change Log

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog.

[0.5] - 2017-02-21

Added

Bind to expm1
Add bindings to dayofmonth, current_timestamp and current_date.
Add support for the dataframe condition expressions
Add bindings to withColumnRenamed, columns, printSchema, Column.expr.
Bind DataFrame distinct.
Add bindings for log and log1p for Columns.
Add binding to Column.cast.
Add bindings getList and array for columns.
Add bindings: schema for rows, Metadata type, javaRDD, range, Row getters and constructors, StructType constructors, createDataFrame, more DataType bindings.

Changed

Prevent Haskell exceptions from escaping apply.
Update sparkle to work with latest jni which uses ForeignPtr for java references.
Move StructType and friends to modules StructField, DataType and Metadata.
Rename createRow, rowGet, rowSize, joinPairRDD to have the same names as the java methods.

[0.4]

Added

Support for reading/writing Parquet files.
More RDD method bindings: repartition, treeAggregate, binaryRecords, aggregateByKey, mapPartitions, mapPartitionsWithIndex.
More complete DataFrame support.
Intero support.
stack ghci support.
Support Template Haskell splices and ANN annotations that use sparkle code.

Changed

Fixed

More reliable initialization of embedded shared library.
Cleanup temporary files properly.

[0.3] - 2016-12-27

Added

Dockerfile to build sparkle.
Compatibility with singletons-2.2.
Add the identity Reify/Reflect instances.
Change JNI bindings to use new JNI.String type, instead of ByteString. This new type guarantees the invariants required by the JNI API (null-termination in particular).

Changed

Remove Reify/Reflect instances for Int. Only instances for sized types remain.

Fixed

Fix type in Reify Int making it incorrect.

[0.2.0] - 2016-12-13

Added

New binding: getOrCreateSQLContext.

Changed

getOrCreate renamed to getOrCreateSparkContext.

[0.1.0.1] - 2016-06-12

Added

More bindings to more call*Method JNI functions.

Changed

Use getOrCreate to get SparkContext.

[0.1.0] - 2016-04-25

Initial release
