sequenceTools
A package with tools for processing DNA sequencing data
Version on this page: | 1.5.2 |
LTS Haskell 22.43: | 1.5.3.1 |
Stackage Nightly 2024-12-05: | 1.5.3.1 |
Latest on Hackage: | 1.5.3.1 |
sequenceTools-1.5.2@sha256:1571e5ecd2db5fe95025b14e2c8f4fedc1366f9dbe696f6e38097d20183079e8,3018
Module documentation for 1.5.2
- SequenceTools
SequenceTools
This repository contains some programs that I use for processing sequencing data.
Installation
Simple Installation via stack and hackage
This installation installs the latest version that is up on hackage:
- Download stack (https://docs.haskellstack.org/en/stable/README/#how-to-install).
- Run
stack install sequenceTools --resolver nightly
. You should now have the executables from this package under~/.local/bin
. - Add
~/.local/bin
to your PATH, for example by adding to your~/.profile
or~/.bash_profile
the linePATH=$PATH:$HOME/.local/bin
. Runsource ~/.profile
orsource ~/.bash_profile
, respectively, to update your path.
Installation from source via stack
- Download stack (https://docs.haskellstack.org/en/stable/README/#how-to-install).
- Clone this repository via
git clone https://github.com/stschiff/sequenceTools.git
- Install via
cd sequenceTools; stack install
Commands
pileupCaller
The main tool in this repository is the program pileupCaller
to sample alleles from low coverage sequence data. The first step is to generate a “pileup” file at all positions you wish to genotype. To do that, here is a typical command line, which restricts to mapping and base quality of 30:
samtools mpileup -R -B -q30 -Q30 -l <list_of_positions.txt> \
-f <reference_genome.fasta> \
Sample1.bam Sample2.bam Sample3.bam > pileup.txt
Important Note: You should definitely use the -B
flag, which disables base alignment quality recalibration. This mechanism is turned on by default and causes huge reference bias with low coverage ancient DNA data. This flag disables the mechanism.
In the above command line, the file “list_of_positions.txt” should either contain positions (0-based) or a bed file (see samtools manual for details). The output is a simple text file with all positions that could be genotyped in the three samples.
Next, you need to run my tool pileupCaller, which you run like this:
pileupCaller --randomHaploid --sampleNames Sample1,Sample2,Sample3 \
--samplePopName MyPop -f <Eigenstrat.snp> \
-e <My_output_prefix> < pileup.txt
Here, options --sampleNames
gives the names of the samples that is output in the Eigenstrat *.ind
file, and and -–samplePopName
is optional to also give the population names in that file (defaults to Unknown
, you can also change it later in the output). Then, (required) option -f
needs an Eigenstrat positions file. This is required for pileupCaller to know what is the reference and which the alternative allele in your reference dataset that you want to call. An Eigenstrat positions file is a line-based file format, where each line denotes a SNP position, and there are exactly six required columns, denoting in order i) SNP ID, ii) chromosome, iii) genetic position (can be set to zero), iv) physical position, v) reference allele, vi) alternate allele. Here is an example:
rs0000 11 0.000000 0 A C
rs1111 11 0.001000 100000 A G
rs2222 11 0.002000 200000 A T
rs3333 11 0.003000 300000 C A
rs4444 11 0.004000 400000 G A
rs5555 11 0.005000 500000 T A
rs6666 11 0.006000 600000 G T
Finally, the -e
option specifies Eigenstrat as output format and gives the prefix for the *.ind
, *.pos
and *.geno
files. Without the -e
option, pileupCaller will output in FreqSum format, described here, which is useful for debugging your pipeline, since it’s just a single file that is output into the terminal and can therefore easily be inspected.
You can also get some help by typing pileupCaller -h
, which shows a lot more option, for example the sampling method, minimal coverage and other important options.
Note that you can also fuse the two steps above into one unix pipe:
samtools mpileup -R -B -q30 -Q30 -l <list_of_positions.txt> \
-f <reference_genome.fasta> \
Sample1.bam Sample2.bam Sample3.bam | \
pileupCaller --randomHaploid --sampleNames Sample1,Sample2,Sample3 \
--samplePopName MyPop -f <Eigenstrat.snp> \
-e <My_output_prefix>
There is however an issue here: If you have aligned your read data to a version of the reference genome that uses chr1
, chr2
and so on as chromosome names, the resulting Eigenstrat file will be valid, but won’t merge with other Eigenstrat datasets that use chromosome names 1
, 2
and so on. I would therefore recommend to strip the chr
from your chromosome names if necessary. You can do that easily using a little UNIX filter using the sed
tool. In the full pipeline, it looks like this:
samtools mpileup -R -B -q30 -Q30 -l <list_of_positions.txt> \
-f <reference_genome.fasta> \
Sample1.bam Sample2.bam Sample3.bam | sed 's/chr//' | \
pileupCaller --sampleNames Sample1,Sample2,Sample3 \
--samplePopName MyPop -f <Eigenstrat.snp> \
-o EigenStrat -e <My_output_prefix>
vcf2eigenstrat
Simple tool to convert a VCF file to an Eigenstrat file. Pretty self-explanatory. Please run vcf2eigenstrat --help
to output some documentation.
genoStats
A simple tool to get some per-individual statistics from an Eigenstrat or Freqsum-file. Run genoStats --help
for documentation.
Scripts
This package also contains several haskell wrapper scripts for the following ADMIXTOOLS and EIGENSOFT commands: convertf, mergeit, qp3Pop, qpDstat and smartPCA. The original tools require parameter files as input, which I find tedious to use in bioinformatics pipelines. I wrote those wrapper scripts to be able to start the tools with a simple command line option interface.
If you have stack
installed your system (see above), you should be able to run those scripts on your machine without any difficult setup. Simply clone this repository, navigate to the scripts
subfolder and invoke any script using standard bash execution, for example
./convertf_wrapper.hs
If you start this the first time it may take a while, since stack
downloads all dependencies and even the script interpreter for you, but after that it should start instantanious. If you want to use the scripts from your path, I suggest to put symbolic links into any folder that is already on your path (for example ~/.local/bin
).
Changes
V 1.2.3 : Adapted to newest sequence-formats. Had to change all the chromosome-related code to the newType Chrom datatype. Also started implementing normaliseBimWithVCF.
V 1.2.4: normaliseBimWithVCF is ready.
V 1.3.0: Lots of refactoring. Lots of testing. Removed some features in vcf2eigenstrat and in pileupCaller, including the option in pileupCaller to call without a SNP file.
V 1.3.1: Bumped dependency on sequence-formats to new sequence-formats-1.4.0, which includes strand-information in pileup data, as well as rsIds in freqSum to output the correct rsId, and an option to parse chromosomes X, Y and MT.
V 1.4.0: Added single strand mode, and new triallelic treatment.
V 1.4.0.1: Improved README, fixed output bug in genoStats.hs
V 1.4.0.3: Updated to new sequence-formats version, now including reading of genetic position from eigenstrat files.
V 1.4.0.4:
- Fixed eigenstrat-output in pileupCaller to add a dot after the outputprefix before the file extensions.
- Updated haskell-stack wrapper scripts for EIGENSOFT and ADMIXTOOLS.
- Moved unmaintained scripts into unmaintained folder.
V 1.5.0: Added support for Plink output
V 1.5.1: Added automatic building
V 1.5.2: Fixed a bug with –samplePopName having to be entered after -p or -e. Fixed a bug in the sequence-formats dependency.