LAST: Genome-Scale Sequence Comparison
======================================

Introduction
------------

LAST finds similar regions between sequences, and aligns them.  LAST
is similar to BLAST, but it copes better with giga-scale biological
sequences.  It can also indicate the reliability of each aligned
column, and it can use sequence quality data.


Requirements
------------

To handle mammalian genomes, you will need at least 2 gigabytes of
RAM, and a few tens of gigabytes of disk space.  To install the
software, you need a C++ compiler.

Optional: to run the scripts, you need a Unix-like environment with
Python.  To make dotplots, you need the Python Imaging Library.

Luxury: to handle mammalian genomes with maximum efficiency, it's good
to have about 16-20 gigabytes of RAM.


Installation
------------

Just go into the src directory and type 'make'.  (If you checked it
out using subversion, then type 'make' in the top-level directory, not
the src directory.)  This should make three programs: lastdb, lastal,
and lastex.  Run the programs without arguments to get usage messages.


Example 1: Compare the human and fugu mitochondrial genomes
-----------------------------------------------------------

You can find these sequences in the examples directory: humanMito.fa
and fuguMito.fa.  Firstly, make a LAST database of the human
sequence::

  lastdb -c humanMito humanMito.fa

This will make some new files whose names begin with "humanMito".
Here, we used "-c" to soft-mask lowercase letters.  (Lowercase
indicates repetitive sequence, and "soft-masking" helps to avoid
uninteresting repetitive alignments.)  Secondly, compare the fugu
sequence to the human database::

  lastal -o myalns.maf humanMito fuguMito.fa

This will write alignments in a file called "myalns.maf".  To view the
alignments, you'll want to avoid text-wrapping, e.g. 'less -S
myalns.maf'.

For an example of aligning multiple mitochondrial genomes, see
multiMito.sh in the examples directory.


Example 2: Compare the cat and mouse genomes
--------------------------------------------

Let's assume you have the cat and mouse genomes in FASTA-format files:
cat/chr*.fa and mouse/chr*.fa.  We'll assume also that repetitive
regions are in lowercase.  We can compare them using these steps::

  lastdb -c -s20G -v mousedb mouse/chr*.fa
  lastal -o myalns.maf -v mousedb cat/chr*.fa

The "-s20G" tells lastdb that more than 20 GiB of memory are available,
so it can put the whole mouse genome into one database volume.  (For
mouse, it will use well under 20 GiB.)  Without this option, lastdb
assumes that less than 2 GiB are available, so it splits the
chromosomes across multiple volumes, which makes lastal slower
(e.g. 1 day instead of several hours).  The "-v" (verbose) option just
makes each program write progress messages on the screen.
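
As a rough illustration of what a size such as "20G" means, here is a
minimal Python sketch that converts the K, M, and G suffixes to bytes,
using the binary multiples described for the -s option.  The function
name parse_size is hypothetical and not part of LAST, and only
uppercase suffixes are handled here.

```python
def parse_size(text):
    """Convert a size such as '20G' or '512K' to a number of bytes.

    Hypothetical helper: K, M, and G are taken as binary multiples
    (kibi, mebi, gibi).  A plain number is taken as bytes.
    """
    multipliers = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = text.strip()
    suffix = text[-1:]
    if suffix in multipliers:
        return int(text[:-1]) * multipliers[suffix]
    return int(text)


print(parse_size("20G"))  # 21474836480
```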

Next, we might want to remove paralogs or make a dotplot: see the
accompanying document last-scripts.txt.


Example 3: Map DNA reads to the human genome
--------------------------------------------

Let's assume you have the human genome in FASTA-format files
(human/chr*.fa), and the reads in FASTQ-Sanger format (reads.fastq).
This time, we will not mask repeats, because we want to map repetitive
reads too::

  lastdb -m1111110 -s20G -v humandb human/chr*.fa
  lastal -Q1 -o myalns.maf -v humandb reads.fastq

The "-m1111110" makes it better at finding short, strong alignments.
(The default settings are tuned for long, weak alignments.)  The "-Q1"
indicates that the reads are in FASTQ-Sanger format.  For more ideas
on read mapping, see the accompanying document tag-seeds.txt.


Output Formats
--------------

lastal can write alignments in two formats: tabular and MAF.  MAF
format looks like this::

  a score=15
  s chr3L        19433515 23 + 24543557 TTTGGGAGTTGAAGTTTTCGCCC
  s H04BA01F1907        2 21 +       25 TTTGGGAGTTGAAGGTT--GCCC
  p 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.85 0.759 0.662 - - 0.533 0.574 0.593 0.564

Lines starting with "s" contain: the sequence name, the start position
of the alignment, the number of nucleotides in the alignment, the
strand, the total size of the sequence, and the aligned nucleotides.
Start positions are zero-based: if the alignment starts at the
beginning of the sequence, the start position is zero.  If the strand
is "-", the start position is counted along the reverse-complemented
sequence.  The line starting with "p" contains the probability of each
pair of aligned letters.  The same alignment in tabular format looks
like this::

  15 chr3L 19433515 23 + 24543557 H04BA01F1907 2 21 + 25 17,2:0,4

The final column shows the sizes and offsets of gapless blocks in the
alignment.  In this case, we have a block of size 17, then an offset
of size 2 in the upper sequence and 0 in the lower sequence, then a
block of size 4.  Probabilities are not shown in this format.
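
The tabular line above can be read with a few lines of Python.  This is
only a sketch: the function names are hypothetical, and forward_start
is derived from the rule that "-" strand coordinates are counted along
the reverse-complemented sequence.

```python
def parse_blocks(text):
    """Parse the final tabular column, e.g. '17,2:0,4'.

    Returns a list in which an int is a gapless block size and a
    (upper, lower) tuple is an offset in the two sequences.
    """
    items = []
    for part in text.split(","):
        if ":" in part:
            upper, lower = part.split(":")
            items.append((int(upper), int(lower)))
        else:
            items.append(int(part))
    return items


def spans(items):
    """Total letters covered in the upper and lower sequences."""
    upper = lower = 0
    for item in items:
        if isinstance(item, tuple):
            upper += item[0]
            lower += item[1]
        else:
            upper += item
            lower += item
    return upper, lower


def forward_start(start, aln_size, seq_size, strand):
    """Zero-based start of the aligned region on the forward strand."""
    if strand == "+":
        return start
    return seq_size - start - aln_size


line = "15 chr3L 19433515 23 + 24543557 H04BA01F1907 2 21 + 25 17,2:0,4"
fields = line.split()
items = parse_blocks(fields[11])
print(items)         # [17, (2, 0), 4]
print(spans(items))  # (23, 21)
```

The two totals agree with the alignment sizes in columns 4 and 9 of the
tabular line (23 and 21), which is a handy consistency check.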


Steps in lastal
---------------

1) Find initial matches:
     keep those with multiplicity <= m and length >= l.

2) Extend gapless alignments from the initial matches:
     keep those with score >= d.

3) Extend gapped alignments from the gapless alignments:
     keep those with score >= e.

4) Non-redundantize the gapped alignments:
     remove those that share an endpoint with a higher-scoring alignment.

5) Calculate probabilities (OFF by default).

6) Redo the gapped extensions using centroid alignment (OFF by default).
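
Step 4 can be pictured with the following Python sketch.  It is not
lastal's actual code: the Alignment tuple is a hypothetical
representation in which "start" and "end" are (database position, query
position) pairs, and we assume that starts are compared with starts and
ends with ends.  Alignments with equal scores do not remove each other
in this sketch.

```python
from collections import namedtuple

# Hypothetical representation: start and end are (database, query) pairs.
Alignment = namedtuple("Alignment", "score start end")


def non_redundantize(alignments):
    """Drop alignments that share an endpoint with a higher-scoring one."""
    kept = []
    for a in alignments:
        redundant = any(
            b.score > a.score and (b.start == a.start or b.end == a.end)
            for b in alignments
        )
        if not redundant:
            kept.append(a)
    return kept


alns = [
    Alignment(50, (100, 10), (180, 90)),
    Alignment(40, (100, 10), (150, 60)),  # same start as the first one
    Alignment(30, (300, 10), (350, 60)),
]
for aln in non_redundantize(alns):
    print(aln.score)  # prints 50, then 30
```

This quadratic loop is chosen for clarity; a practical implementation
would index the endpoints instead.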


What the Probabilities Mean
---------------------------

The probabilities indicate the reliability of each pairing, *assuming
that the alignment is not wholly spurious*.  In more detail, gapped
alignments are extended (in step 3) from either side of a "core"
alignment (derived from the gapless alignment of step 2).  The
probabilities are contingent upon the core being correct, and pairings
within the core automatically get a probability of 1.  So the
probabilities give no indication of whether an alignment is wholly
spurious.  To assess that, please use lastex.

The probabilities are calculated as follows.  We assume that each
gapped extension has probability proportional to: exp(lambda * score).
Here, lambda is the scale parameter of the scoring matrix (YK Yu et
al. 2003, PNAS 100(26):15688-93).  Then, the probability of each
letter-pair is the sum of the probabilities of all possible gapped
extensions that include this pairing.
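
To make the calculation concrete, here is a toy Python sketch that
enumerates a handful of candidate extensions explicitly.  The scores,
the letter-pairs, and the value of lambda are invented for
illustration, and the function name is hypothetical.  A real
implementation would not list the extensions one by one, since there
are far too many of them.

```python
import math


def pair_probabilities(extensions, lam):
    """Probability of each letter-pair, given (score, pairs) extensions.

    Each extension gets weight exp(lam * score); a pair's probability is
    the normalized sum of the weights of the extensions that contain it.
    """
    weights = [math.exp(lam * score) for score, _ in extensions]
    total = sum(weights)
    probs = {}
    for weight, (_, pairs) in zip(weights, extensions):
        for pair in pairs:
            probs[pair] = probs.get(pair, 0.0) + weight / total
    return probs


# Invented example: pairs are (database position, query position).
extensions = [
    (10, {(0, 0), (1, 1), (2, 2)}),
    (8, {(0, 0), (1, 1)}),
    (5, {(0, 0), (1, 2)}),
]
for pair, prob in sorted(pair_probabilities(extensions, 0.5).items()):
    print(pair, round(prob, 3))
```

In this toy case the pair (0, 0) appears in every extension, so its
probability is 1, while pairs that appear only in low-scoring
extensions get small probabilities.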


Options for lastdb
------------------

Main Options
~~~~~~~~~~~~

-h  Show all options and their default settings.

-p  Interpret the sequences as proteins.  The default is to interpret
    them as DNA.

-c  Be case-sensitive: lowercase letters will then be forbidden in
    initial matches (even in skipped positions).


Advanced Options
~~~~~~~~~~~~~~~~

-s  Split large databases into "volumes" of at most the specified
    number of bytes (excluding buckets).  If a single sequence exceeds
    this amount, however, it is not split.  The default is tuned for 2
    gigabytes of RAM: if you have more, increase this to make lastal
    go faster.  You can use suffixes K, M, and G to specify kibibytes,
    mebibytes, and gibibytes.  Example: if you have 6 GiB of RAM,
    "-s 5G" seems to work well.

-m  Specify skipped positions in initial matches, e.g. "-m 110101". In
    this example, every third and fifth position out of six will be
    skipped.

-u  Use a subset seed in the specified file.  The -m option will then
    be ignored.  For an example of the format, see yass.seed in the
    examples directory.

-w  Allow initial matches to start only at every "w"th position in each
    database sequence.  This reduces time and storage requirements, at
    the expense of sensitivity.  To emulate BLAT, use "-w 11".

-a  Specify your own alphabet, e.g. "-a 0123".  The default (DNA)
    alphabet is equivalent to "-a ACGT".  The protein alphabet (-p) is
    equivalent to "-a ACDEFGHIKLMNPQRSTVWY".  Non-alphabet letters are
    allowed in sequences, but by default they are forbidden in initial
    matches (even in skipped positions) and get the mismatch score
    when aligned to anything.  If -a is specified, -p is ignored.

-b  Specify the depth of "buckets" used to accelerate initial match
    finding.  The deeper the faster, but the more memory is needed.
    The default is to use the maximum depth that consumes at most one
    byte per possible match start position.  This option has no effect
    on the results.

-x  Don't make a full LAST database; just count sequences and letters.
    This is useful with lastex.  (Letter counting is never
    case-sensitive.)

-v  Be verbose: write messages about what lastdb is doing.
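
The effect of the -m pattern can be sketched in Python as follows.
This is a simplified illustration, not lastdb's implementation: the
function name is hypothetical, the pattern is assumed to repeat
cyclically for matches longer than the pattern, and the rules about
lowercase and non-alphabet letters are ignored.

```python
def seed_match_length(pattern, a, b):
    """Count leading positions where a and b match under the pattern.

    A '1' in the pattern means the letters must be identical; a '0'
    means the position is skipped.  Assumption: the pattern is reused
    cyclically along the sequences.
    """
    length = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if pattern[i % len(pattern)] == "1" and x != y:
            break
        length += 1
    return length


# The two sequences differ only at the third and fifth positions.
print(seed_match_length("110101", "ACGTAC", "ACTTGC"))  # 6
print(seed_match_length("111111", "ACGTAC", "ACTTGC"))  # 2
```

With "110101" the differences fall on skipped positions, so the match
covers all six letters; with an all-ones pattern it stops at the first
difference.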


Options for lastal
------------------

Main Options
~~~~~~~~~~~~

-h  Show all options and their default settings.

-o  Write output to the specified file, instead of the screen.

-s  Specify which query strand should be used: 0 means reverse only, 1
    means forward only, and 2 means both.

-f  Choose the output format: 0 means tabular and 1 means MAF.


Score Parameters
~~~~~~~~~~~~~~~~

-r  Match score.

-q  Mismatch cost.

-p  Obtain match and mismatch scores from the specified file.  Options
    -r and -q will be ignored.  For examples of the format, see HOXD70
    and TiTv212 in the examples directory.  Any letters that aren't in
    the file will get the lowest score in the file when aligned to
    anything.  Asymmetric scores are allowed: query letters correspond
    to columns and database letters correspond to rows.  Other options
    can be specified on lines starting with "#last", but command line
    options override them.

-a  Gap existence cost.

-b  Gap extension cost.  A gap of size k costs: a + b*k.

-c  This option allows use of "generalized affine gap costs" (SF
    Altschul 1998, Proteins 32(1):88-96).  Here, a "gap" may consist
    of unaligned regions of both sequences.  If these unaligned
    regions have sizes j and k, where j <= k, the cost is: a + b*(k-j)
    + c*j.  If c >= a + 2b (the default), it reduces to standard
    affine gaps.

-F  Align DNA queries to a protein database, using the specified
    frameshift cost.  A value of 15 seems to be reasonable.

-x  Maximum score dropoff for gapped alignments.  Gapped alignments
    are forbidden from having any internal region with score < -x.
    This serves two purposes: accuracy (avoid spurious internal
    regions in alignments) and speed (the smaller the faster).

-y  Maximum score dropoff for gapless alignments.

-d  Minimum score for gapless alignments.  For guidance on choosing
    this parameter, use lastex.

-e  Minimum score for gapped alignments.  For guidance on choosing
    this parameter, use lastex.
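
The gap cost formulas for options -a, -b, and -c can be written out as
a short Python sketch.  The function names are hypothetical, and the
values a=7 and b=1 are chosen only for illustration.

```python
def affine_gap_cost(a, b, k):
    """Cost of a standard gap of size k."""
    return a + b * k


def generalized_gap_cost(a, b, c, j, k):
    """Cost of a generalized gap with unaligned regions of sizes j and k."""
    j, k = sorted((j, k))  # the formula assumes j <= k
    return a + b * (k - j) + c * j


a, b = 7, 1
c = a + 2 * b  # the threshold at which generalized gaps stop helping

# With j = 0, the generalized cost equals the standard affine cost.
print(affine_gap_cost(a, b, 3))             # 10
print(generalized_gap_cost(a, b, c, 0, 3))  # 10

# With c >= a + 2b, one generalized gap is never cheaper than two
# ordinary gaps covering the same unaligned regions.
print(generalized_gap_cost(a, b, c, 2, 3))                # 26
print(affine_gap_cost(a, b, 2) + affine_gap_cost(a, b, 3))  # 19
```

Since the generalized form is then never cheaper than two ordinary
gaps, the alignment behaves as if only standard affine gaps were
available.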


Miscellaneous Options
~~~~~~~~~~~~~~~~~~~~~

-u  Specify treatment of lowercase letters for gapless and gapped
    extensions.  0 means mask them for neither stage; 1 means mask
    them for gapless extensions but not for gapped extensions; 2 means
    mask them for both stages.  "Mask" means give them the worst
    mismatch score when aligned to anything.  Note that treatment of
    lowercase for initial matches is set by lastdb's -c option.

-m  Maximum multiplicity for initial matches.  Each initial match is
    lengthened until it occurs at most this many times in the database
    volume.

-l  Minimum length for initial matches.  (Skipped positions are
    included in the length.)

-n  Maximum number of gapless alignments per query position.  When
    lastal extends gapless alignments from initial matches that start
    at one query position, if it gets n successful extensions, it
    skips any remaining initial matches starting at that position.
    This option has no effect unless n is less than m (the maximum
    multiplicity set by -m).

-k  Look for initial matches starting only at every "k"th position in
    the query.  This increases speed at the expense of sensitivity.

-i  Search queries in batches of at most this many bytes.  If a single
    sequence exceeds this amount, however, it is not split.  You can
    use suffixes K, M, and G to specify kibibytes, mebibytes, and
    gibibytes.  This option has no effect on the results (apart from
    their order).  Higher values can reduce disk reads.

-w  This option is a kludge to avoid catastrophic time and memory
    usage when self-comparing a large sequence.  If a large identical
    match is found, then gapped alignments will not be triggered from
    repeats (typically tandem repeats) within the identical match
    whose start positions are offset by this distance or less.  Use
    "-w 0" to turn this off.

-t  Temperature for calculating probabilities: the probability of each
    gapped extension is made proportional to exp(score / t).  The
    default is 1/lambda, where lambda is the scale parameter of the
    scoring matrix.

-g  This option allows use of "gamma-centroid alignment" (M Hamada et
    al. 2009, Bioinformatics 25(4):465-73).  Such alignments only
    include pairings with probability > 1/(1+g).  When g=1, this is
    the same as "centroid alignment" (LE Carvalho & CE Lawrence 2008,
    PNAS 105(9):3209-14).  The reported alignment score is that of the
    ordinary gapped alignment, not of the (gamma-)centroid alignment.

-G  Use an alternative genetic code in the specified file.  For an
    example of the format, see vertebrateMito.gc in the examples
    directory.  By default, the standard genetic code is used.  This
    option has no effect unless DNA-versus-protein alignment is
    selected with option -F.

-v  Be verbose: write messages about what lastal is doing.

-j  Output type: 0 means counts of initial matches (of all lengths); 1
    means gapless alignments; 2 means gapped alignments before
    non-redundantization; 3 means gapped alignments after
    non-redundantization; 4 means alignments with probabilities; 5
    means centroid alignments.  Match counts (-j 0) respect the
    minimum length option but not the maximum multiplicity option.
    It's a bad idea to try -j 0 when comparing a large sequence to
    itself.

-Q  This option allows lastal to use sequence quality scores, or
    PSSMs, for the queries.  0 means read queries in FASTA format
    (without quality scores); 1 means FASTQ-Sanger format; 2 means
    FASTQ-Solexa format; 3 means PRB format; 4 means read PSSMs.  The
    FASTQ formats look like this::

      @mySequenceName
      TTTTTTTTGCCTCGGGCCTGAGTTCTTAGCCGCG
      +
      55555555*&5-/55*5//5(55,5#&$)$)*+$

    The "+" may optionally be followed by a name (ignored), and the
    sequence and quality codes are allowed to wrap onto more than one
    line.  For FASTQ-Sanger, the quality scores are obtained by
    subtracting 33 from the ASCII values of the characters below the
    "+", and for FASTQ-Solexa, they are obtained by subtracting 64.
    PRB format stores four quality scores (A, C, G, T) per position,
    with one sequence per line, like this::

      -40   40  -40  -40      -12    1  -12   -3      -10   10  -40  -40

    Since PRB does not store sequence names, lastal uses the line
    number (starting from 1) as the name.  In FASTQ-Sanger format, the
    quality scores are related to error probabilities like this:
    qScore = -10log10[p].  In FASTQ-Solexa and PRB, however, qScore =
    -10log10[p/(1-p)].  In lastal's MAF output, the quality scores are
    written on lines starting with "q".  For FASTQ, they are written
    with the same encoding as the input.  For PRB, they are written in
    the FASTQ-Solexa (ASCII-64) encoding.

    The quality scores influence alignment scores as follows.  Let Qiy
    be the probability that the base at position i is y (y = A, C, G,
    or T).  Let Sxy be the scoring matrix, and let T be the
    "temperature" parameter (by default 1/lambda).  Then, the score
    for aligning base x (A, C, G, or T) to position i is::

      Rix = T * ln[ sum(y){ Qiy * exp[ Sxy / T ] } ]

    Finally, PSSM means "position-specific scoring matrix".  The
    format is::

      myLovelyPSSM
             A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
      1 M   -2 -2 -3 -4 -2 -1 -3 -3 -2  1  2 -2  8 -1 -3 -2 -1 -2 -2  0
      2 S    0 -2  0  1  3 -1 -1 -1 -2 -3 -3 -1 -2 -3 -2  5  0 -4 -3 -2
      3 D   -1 -2  0  7 -4 -1  1 -2 -2 -4 -4 -2 -4 -4 -2 -1 -2 -5 -4 -4

    The sequence appears in the second column, and columns 3 onwards
    contain the position-specific scores.  Any letters not specified
    by any column will get the lowest score in each row.  This format
    is a simplified version of PSI-BLAST's ASCII format: the
    non-simplified version is allowed too.  If you use PSSMs, options
    -r, -q, and -p are mostly ignored, except that they determine the
    default value of -y.
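
The quality-score formulas given for option -Q can be sketched in
Python as below.  The function names, the +1/-1 scoring matrix, the
temperature of 1.0, and the equal sharing of the error probability
among the three other bases are all assumptions made for illustration;
they are not taken from lastal.

```python
import math


def sanger_error_prob(char):
    """FASTQ-Sanger: qScore = ASCII - 33 = -10log10[p]."""
    q = ord(char) - 33
    return 10 ** (-q / 10)


def solexa_error_prob(char):
    """FASTQ-Solexa: qScore = ASCII - 64 = -10log10[p/(1-p)]."""
    q = ord(char) - 64
    odds = 10 ** (-q / 10)
    return odds / (1 + odds)


def quality_adjusted_score(x, base_probs, matrix, temperature):
    """Rix = T * ln[ sum(y){ Qiy * exp[ Sxy / T ] } ]."""
    total = sum(
        prob * math.exp(matrix[x][y] / temperature)
        for y, prob in base_probs.items()
    )
    return temperature * math.log(total)


bases = "ACGT"
# Assumed toy matrix: +1 for a match, -1 for a mismatch.
matrix = {x: {y: (1 if x == y else -1) for y in bases} for x in bases}

p = sanger_error_prob("5")  # ASCII 53 gives quality 20, so p = 0.01
# Assumption: the error probability is shared equally by the other bases.
probs = {y: (1 - p if y == "A" else p / 3) for y in bases}

print(quality_adjusted_score("A", probs, matrix, 1.0))  # slightly below 1
print(quality_adjusted_score("C", probs, matrix, 1.0))  # slightly above -1
```

When a base is certain (probability 1), the formula returns the
ordinary matrix score; as the quality drops, match and mismatch scores
move toward each other.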


Parallelization and Memory Sharing
----------------------------------

If you run two lastal jobs at the same time on the same computer,
using the same lastdb database, they will share memory for the
database.  In other words, the database gets loaded into memory only
once.  So you can run lastal in parallel without excessive memory
requirements.

This memory sharing will be less effective if the database has more
than one volume.  (If there is more than one file ending in ".suf",
there is more than one volume.)  The lastdb -s option controls
voluming.
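
A quick way to check the number of volumes is to count the ".suf"
files, for example with this Python sketch.  The function name is
hypothetical, and the sketch assumes only that the database files begin
with the database name, so it would also count files of another
database whose name starts with the same prefix.

```python
import glob


def count_volumes(db_name):
    """Count files that start with db_name and end in '.suf'."""
    return len(glob.glob(glob.escape(db_name) + "*.suf"))


print(count_volumes("humandb"))
```

If the count is greater than 1, rebuilding the database with a larger
lastdb -s value (memory permitting) should make memory sharing more
effective.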


Credits & Citation
------------------

LAST was developed by Martin C. Frith, Michiaki Hamada, Toshiyuki
Sato, and Paul Horton at the Computational Biology Research Center.
Many thanks to Hajime Harada for setting up the repository and
website, and Takako Sugawara for making the logo.  LAST includes
public domain code kindly provided by Yi-Kuo Yu and Stephen Altschul
at the NCBI.  If you use LAST in your research, please cite one (or
more) of the following publications:

| Incorporating sequence quality data into alignment improves DNA read mapping
| Martin C. Frith, Raymond Wan, Paul Horton
| Nucleic Acids Research, 2010 (in press)

| Parameters for Accurate Genome Alignment
| Martin C. Frith, Michiaki Hamada, Paul Horton
| BMC Bioinformatics, 2010 (in press)

Questions, Comments, Problems
-----------------------------

Please email: last (ATmark) cbrc (dot) jp.  If reporting a problem,
please describe exactly how to trigger the problem.
