Tools for manipulating sequence clusters

Latest on Hackage: 0.1.5


GPL licensed by Ketil Malde
Maintained by Ketil Malde
This package contains the tools listed below.

To build them, you will need a Haskell compiler (the most likely
candidate being GHC), with my bioinformatics library and the SimpleArgs
module installed (downloadable from: <>).

filter - remove unwanted sequences from a clustering
Usage: filter seq.list < cluster.L > cluster2.L
cluster2.L will only contain sequence labels found in seq.list.
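
As a rough illustration of the filtering step (a Python sketch, not the tool's Haskell source), the example below assumes that a "label"-formatted clustering holds one cluster per line with whitespace-separated sequence labels. That layout, the name filter_clusters, and the choice to drop clusters that end up empty are all assumptions made for the example.

```python
def filter_clusters(cluster_lines, keep_labels):
    """Keep only labels present in keep_labels; drop clusters left empty."""
    keep = set(keep_labels)
    result = []
    for line in cluster_lines:
        # Assumed layout: one cluster per line, labels separated by whitespace.
        kept = [label for label in line.split() if label in keep]
        if kept:
            result.append(" ".join(kept))
    return result


clusters = ["seq1 seq2 seq3", "seq4", "seq5 seq6"]
wanted = ["seq1", "seq3", "seq5"]
filtered = filter_clusters(clusters, wanted)
print(filtered)
```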

hist - produce a histogram of cluster sizes from a "label"-formatted
clustering.
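
A minimal Python sketch of what such a histogram amounts to, again assuming one cluster per line with whitespace-separated labels. The name size_histogram and the (size, count) output shape are hypothetical, and the tool's actual output format may differ.

```python
from collections import Counter


def size_histogram(cluster_lines):
    """Return sorted (cluster_size, number_of_clusters) pairs."""
    sizes = Counter(
        len(line.split()) for line in cluster_lines if line.strip()
    )
    return sorted(sizes.items())


hist_rows = size_histogram(["a b c", "d", "e f g", "h"])
for size, count in hist_rows:
    print(size, count)
```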

clusc - compare clusterings, calculating numerous pair-based and
entropy based indices.
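
The description does not say which indices clusc reports. As an example of what a pair-based index looks like, the Python sketch below computes the Jaccard and Rand indices from the sets of sequence pairs that each clustering places in the same cluster. The choice of these two indices, the function names, and the treatment of sequences missing from one clustering as singletons are assumptions.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """All unordered label pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pair_indices(clustering_a, clustering_b):
    """Return (jaccard, rand) for two clusterings given as lists of label lists."""
    items = {x for c in clustering_a for x in c} | {x for c in clustering_b for x in c}
    total = len(items) * (len(items) - 1) // 2  # number of possible pairs
    pairs_a = co_clustered_pairs(clustering_a)
    pairs_b = co_clustered_pairs(clustering_b)
    both = len(pairs_a & pairs_b)      # together in both clusterings
    either = len(pairs_a | pairs_b)    # together in at least one
    neither = total - either           # separated in both
    jaccard = both / either if either else 1.0
    rand = (both + neither) / total if total else 1.0
    return jaccard, rand


a = [["s1", "s2", "s3"], ["s4"]]
b = [["s1", "s2"], ["s3", "s4"]]
jaccard, rand = pair_indices(a, b)
print(jaccard, rand)
```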

xcerpt - given a file containing a list of sequence labels (e.g. a
"label" formatted clustering), extract matching sequences
from a FASTA file. Like "agrep -d '^>'" without the bugs.

Usage: xcerpt list.txt fasta.seq
creates "fasta.seq.match" and ""
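
The extraction can be sketched in Python as follows. The sketch assumes a record's label is the first whitespace-delimited word after '>' on the header line, and it returns the non-matching records as a second list. Both points, and the name split_fasta, are assumptions and not confirmed behaviour of xcerpt.

```python
def split_fasta(fasta_lines, labels):
    """Split FASTA lines into (matching, non-matching) records by label."""
    wanted = set(labels)
    matched, unmatched = [], []
    target = unmatched  # lines before the first header count as non-matching
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            label = fields[0] if fields else ""
            target = matched if label in wanted else unmatched
        target.append(line)
    return matched, unmatched


fasta = [">seq1 first", "ACGT", ">seq2", "GGCC", "TTAA", ">seq3", "CCCC"]
matched, unmatched = split_fasta(fasta, ["seq1", "seq3"])
print("\n".join(matched))
```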

add_single - add singletons to a clustering.
Usage: add_single all.L clustering.L
creates clustering.L_s listing all sequences in all.L but not in
clustering.L, one per line.
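
In other words, the output is a set difference. A small Python sketch, assuming all.L holds one label per line and clustering.L holds whitespace-separated labels per cluster (the name missing_singletons is hypothetical):

```python
def missing_singletons(all_labels, cluster_lines):
    """Labels from all_labels that appear in no cluster, in input order."""
    clustered = {label for line in cluster_lines for label in line.split()}
    return [label for label in all_labels if label not in clustered]


singletons = missing_singletons(["s1", "s2", "s3", "s4"], ["s1 s3"])
print("\n".join(singletons))  # one label per line
```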

ace2contigs - parse an ACE assembly file, and output the contigs in a
FASTA file (named by tacking on .fasta to the ACE file name),
and the corresponding quality information (.qual).

ace2fasta - parse an ACE assembly, and output each assembly in a separate
FASTA formatted file, with the necessary gaps inserted to align the
sequences (suitable for import into e.g. Seaview)

ace2clusters - parse an ACE assembly, and output clusters composed of the
sequences used for each contig. The format is similar to TGICL's,
with cluster output as one line consisting of a '>' and the contig name,
and the next line containing the names of the sequences that comprise
the cluster.
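
A simplified Python sketch of the conversion, to show the shape of the output. It assumes that contigs are introduced by "CO <name> ..." lines and that their reads are listed on "AF <read> ..." lines, and it ignores every other ACE record type. Separating the sequence names with single spaces is also an assumption, as is the name ace_to_clusters.

```python
def ace_to_clusters(ace_lines):
    """Return TGICL-like lines: '>contig' followed by a line of read names."""
    clusters = []  # list of (contig_name, [read_names])
    for line in ace_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank lines, bare tags, unbroken sequence lines
        if fields[0] == "CO":
            clusters.append((fields[1], []))
        elif fields[0] == "AF" and clusters:
            clusters[-1][1].append(fields[1])
    out = []
    for name, reads in clusters:
        out.append(">" + name)
        out.append(" ".join(reads))
    return out


ace = [
    "AS 2 3", "",
    "CO Contig1 8 2 1 U", "ACGTACGT", "",
    "BQ", "20 20 20 20 20 20 20 20", "",
    "AF readA U 1", "AF readB C 3", "",
    "CO Contig2 4 1 1 U", "ACGT", "",
    "AF readC U 1",
]
cluster_lines = ace_to_clusters(ace)
print("\n".join(cluster_lines))
```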

clusterlibs - given a table of regular expressions and library names,
along with a clustering (TGICL-format), output a table of clusters
with the library name prepended to the sequences.
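
The tagging step might look like the Python sketch below. It takes the table as (regular expression, library name) pairs and prepends the first matching library name to each sequence label. The "library:label" form, the "unknown" fallback, and the name tag_with_library are assumptions, since the description does not specify the exact output layout.

```python
import re


def tag_with_library(cluster_lines, library_table, separator=":"):
    """Prepend a library name to each label on the non-header lines."""
    compiled = [(re.compile(pattern), name) for pattern, name in library_table]
    tagged = []
    for line in cluster_lines:
        if line.startswith(">"):
            tagged.append(line)  # contig header line, kept unchanged
            continue
        labels = []
        for label in line.split():
            # First pattern that matches wins; "unknown" is a hypothetical fallback.
            library = next(
                (name for rx, name in compiled if rx.search(label)), "unknown"
            )
            labels.append(library + separator + label)
        tagged.append(" ".join(labels))
    return tagged


table = [(r"^liv", "liver"), (r"^brn", "brain")]
clusters = [">CL1Contig1", "liv001 brn042 xyz9"]
tagged = tag_with_library(clusters, table)
print("\n".join(tagged))
```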