sequor

A sequence labeler based on Collins's sequence perceptron. https://bitbucket.org/gchrupala/sequor

Latest on Hackage: 0.7.5


BSD3 licensed by Grzegorz Chrupała
Sequor
======

Sequor is a sequence labeler based on Collins's (2002)
perceptron. Sequor has a flexible feature template language and is
meant mainly for NLP applications such as Named Entity labeling, Part
of Speech tagging or syntactic chunking. It includes the SemiNER named
entity recognizer, with pre-trained models for German and English (see
`Named Entity Recognition (SemiNER)`_).

Sequor is especially useful if your dataset has a large label set. In
this case it is likely to run faster and use much less RAM than a
sequence labeler based on Conditional Random Fields. Additionally,
Sequor implements options that let you control the size of the model
and trade off speed against accuracy:

- size of the beam
- label dictionary
- feature hashing

See https://bitbucket.org/gchrupala/sequor/wiki/Options for details.
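
To illustrate the idea behind feature hashing, here is a minimal Python
sketch. It is not Sequor's implementation: the hash function (CRC32),
the key layout, the vector size and the helper names ``hashed_index``
and ``score`` are all assumptions chosen for illustration. The point is
that weights live in a fixed-size vector, so memory use does not grow
with the number of distinct features, at the cost of occasional
collisions.

```python
import zlib


def hashed_index(feature: str, label: str, size: int) -> int:
    """Map a (feature, label) pair to a slot in a fixed-size weight vector."""
    key = f"{feature}|{label}".encode("utf-8")
    # CRC32 is an arbitrary choice for this sketch; any stable hash would do.
    return zlib.crc32(key) % size


def score(weights, features, label):
    """Sum the weights of all hashed (feature, label) slots."""
    return sum(weights[hashed_index(f, label, len(weights))] for f in features)


# The vector size is fixed up front, whatever the number of features seen.
weights = [0.0] * 1024
features = ["word=Berlin", "suffix3=lin", "is_capitalized"]

# A perceptron-style update that rewards the label LOC for these features.
for f in features:
    weights[hashed_index(f, "LOC", len(weights))] += 1.0

print(score(weights, features, "LOC"))
```

A smaller vector saves memory but makes collisions, and therefore
accuracy loss, more likely.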

Installation
------------

The easiest way to compile and install Sequor is to:

1. Install the `Haskell platform <http://www.haskell.org/platform/>`_
2. Run::

       cabal update
       cabal install sequor --prefix=`pwd`

Cabal should then download and install the necessary packages, and
install the sequor binary in ./bin and the data files in ./share.



Usage
-----
With Sequor you can learn a model from sequences manually annotated
with labels, and then apply this model to new data in order to add
labels. Sequor is meant to be used mainly with linguistic data, for
example to learn Part of Speech tagging, syntactic chunking or Named
Entity labeling::

    Usage: sequor command [OPTION...] [ARG...]
    train:     train model
      train [OPTION...] TEMPLATE-FILE TRAIN-FILE MODEL-FILE
          --rate=NUM           (0.01) learning rate
          --beam=INT           (10) beam size
          --iter=INT           (10) number of iterations
          --min-count=INT      (100) minimum feature frequency for label dictionary
          --heldout=FILE       path to heldout data
          --hash               use hashing instead of feature dictionary
          --hash-sample=INT    (1000) sample size to estimate number of features when hashing
          --hash-max-size=INT  maximum size of parameter vector when hashing

See https://bitbucket.org/gchrupala/sequor/wiki/Options for more
details about the training options.

::

    predict:   predict using model
      predict MODEL-FILE

    version:   print version
      version

    help:      print usage information
      help

Data files should be in the UTF-8 encoding.

As an example, we can train and apply a model using the data annotated
with syntactic chunk labels in the data directory::

    ./bin/sequor train data/all.features data/train.conll model \
        --rate 0.1 --beam 10 --iter 5 --hash \
        --heldout data/devel.conll

    ./bin/sequor predict model < data/test.conll > data/test.labels
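
To check the predictions, one option is to compute per-token accuracy
against the gold labels. The Python sketch below is not part of Sequor
and rests on two assumptions that this README does not state: that the
gold label is the last whitespace-separated column of each non-blank
line in the annotated file, and that the prediction file has one label
per non-blank line in the same order. The helper names
``read_last_column`` and ``token_accuracy`` are hypothetical.

```python
def read_last_column(lines):
    """Return the last whitespace-separated field of every non-blank line."""
    return [line.split()[-1] for line in lines if line.strip()]


def token_accuracy(gold_lines, predicted_lines):
    """Fraction of tokens whose predicted label equals the gold label."""
    gold = read_last_column(gold_lines)
    predicted = read_last_column(predicted_lines)
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted files have different token counts")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Tiny in-memory stand-ins for the two files (assumed format, see above).
gold_demo = ["He PRP B-NP", "runs VBZ B-VP", "", "Stop VB B-VP"]
pred_demo = ["B-NP", "B-VP", "", "B-NP"]
print(token_accuracy(gold_demo, pred_demo))
```

With real files, the same function can be given the lines of
``data/test.conll`` and ``data/test.labels``, opened with
``encoding="utf-8"``.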

Feature template syntax
-----------------------

Sequor uses a mini language to specify which features to extract from
data. For details see https://bitbucket.org/gchrupala/sequor/wiki/Templates


Named Entity Recognition (SemiNER)
----------------------------------

Sequor includes the SemiNER named entity recognizer, with pre-trained
models for German and English.

The German model is trained on the CoNLL 2003 data and
recognizes the following labels:

- PER - people
- ORG - organizations
- LOC - locations such as cities and countries
- MISC - miscellaneous entities such as nationalities

The German model is described in [Chrupala_and_Klakow_2010]_.

The English model is trained on the BBN Wall Street Journal data and
recognizes the following labels:

* CARDINAL - cardinal number
* DATE - calendar date
* GPE:CITY - city
* GPE:COUNTRY - country
* GPE:STATE_PROVINCE - state or province
* MONEY - currency
* NORP:NATIONALITY - nationality
* NORP:OTHER -
* NORP:POLITICAL - political affiliation
* ORDINAL - ordinal number
* ORGANIZATION - organization
* PERCENT - percentage
* PERSON - people
* QUANTITY - numerical quantity

See https://bitbucket.org/gchrupala/sequor/wiki/SemiNER for usage
information.


Sequence perceptron
-------------------

Compared to the commonly used Conditional Random Field model, the
Sequence Perceptron algorithm is simpler, more efficient and often has
similar performance.

The sequence perceptron was introduced in [Collins_2002]_.
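
The following Python sketch shows the general shape of sequence
perceptron training with beam-search decoding. It is an illustration
under stated assumptions, not Sequor's code: the two-feature template in
``features``, the toy data and all function names are hypothetical, and
the parameter averaging described by Collins is left out for brevity.

```python
from collections import defaultdict


def features(words, i, prev_label, label):
    """Hypothetical template: one word feature and one transition feature."""
    return [("word", words[i], label), ("prev", prev_label, label)]


def decode(words, labels, weights, beam=10):
    """Beam search: keep the `beam` best partial label sequences per position."""
    hypotheses = [(0.0, [])]
    for i in range(len(words)):
        expanded = []
        for total, seq in hypotheses:
            prev = seq[-1] if seq else "<s>"
            for label in labels:
                gain = sum(weights[f] for f in features(words, i, prev, label))
                expanded.append((total + gain, seq + [label]))
        expanded.sort(key=lambda h: h[0], reverse=True)
        hypotheses = expanded[:beam]
    return hypotheses[0][1]


def train(data, labels, iterations=5, rate=1.0, beam=10):
    """Perceptron training: on a mistake, move weights toward the gold sequence."""
    weights = defaultdict(float)
    for _ in range(iterations):
        for words, gold in data:
            guess = decode(words, labels, weights, beam)
            if guess == gold:
                continue
            for i in range(len(words)):
                gold_prev = gold[i - 1] if i else "<s>"
                guess_prev = guess[i - 1] if i else "<s>"
                for f in features(words, i, gold_prev, gold[i]):
                    weights[f] += rate  # reward features of the gold sequence
                for f in features(words, i, guess_prev, guess[i]):
                    weights[f] -= rate  # penalize features of the wrong guess
    return weights


# Toy training set, invented for this example.
labels = ["PER", "O", "LOC"]
data = [
    (["Anna", "visits", "Paris"], ["PER", "O", "LOC"]),
    (["Paris", "is", "big"], ["LOC", "O", "O"]),
]
model = train(data, labels)
print(decode(["Anna", "visits", "Paris"], labels, model))
```

Unlike a Conditional Random Field, the update needs only the single best
label sequence under the current weights rather than expectations over
all sequences, which is one reason training is simpler and cheaper.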

.. [Collins_2002] Collins, Michael. 2002. Discriminative training
   methods for Hidden Markov Models: Theory and experiments with
   perceptron algorithms. EMNLP 2002.
   http://www.clic.cs.columbia.edu/~mcollins/papers/tagperc.pdf


.. [Chrupala_and_Klakow_2010] Grzegorz Chrupała and Dietrich
   Klakow. 2010. A Named Entity Labeler for German: exploiting
   Wikipedia and distributional clusters. LREC.
   http://grzegorz.chrupala.me/papers/lrec-2010.pdf