ECHO: Short-read Error Correction
v1.12

Prerequisites:
-------------
1. GNU GCC (version >= 4.1)
    http://gcc.gnu.org/

2. Python (version >= 2.6)
   (64-bit Python is needed for large data sets, e.g. >= 10M reads)
    http://www.python.org/

3. SciPy (version >= 0.7)
    http://www.scipy.org/

4. NumPy (version >= 1.3)
    http://numpy.scipy.org/

License:
-------
ECHO is released under the BSD License. See the LICENSE file for more details.

Quickstart:
----------
1. In the directory containing the source files, type
    > make

2. In the same directory, type:
    > python ErrorCorrection.py -o output/sample_data.fastq sample_data.txt
    
    - sample_data.txt is the input file.
    - output/sample_data.fastq is the output file name.

3. The output files will be in the directory called "output".
   There will be 3 files:
    a. sample_data.fastq
        The corrected reads in FASTQ format.
    b. sample_data.fastq.seq
        Only the reads, without the quality scores.
    c. sample_data.fastq.qual
        Only the quality scores.

4. The output file sample_data.fastq.seq can be compared to sample_answer.txt, which contains the original reads without errors. There will be differences between sample_data.fastq.seq and sample_answer.txt because not every error is corrected by the error correction.

5. To run with 4 threads, type
    > python ErrorCorrection.py --ncpu 4 -o output/sample_data.fastq sample_data.txt

Installation:
------------
In the directory containing the source files, type
    > make

For ECHO to allocate more than 4GB of memory, it must be compiled with the -m64 option (default). This requires a 64-bit machine.

Usage:
-----
ECHO accepts 4 input formats:
    1) Plain text format
    2) Plain text format gzipped
    3) FASTQ format
    3) FASTQ format gzipped

Plain text format is a file with one line per read. Note that this is *not* the same as the FASTA format. The Plain text format does *not* allow reads to span across multiple lines, and it does not have "comment" lines. The read should be oriented so that bases sequenced first are at the beginning of the line.

Also note that FASTA format is *not* supported by  ECHO at the moment.

Either the plain text format or the FASTQ format may be gzipped, and used as input to ECHO.

Note that the parsing method used by ECHO is selected depending on the *extension* of the input file:
    .gz is parsed as a gzipped Plain text format file.
    .fastq is parsed as a FASTQ file.
    .fastq.gz is parsed as a gzipped FASTQ file.

The output file format of ECHO is standard FASTQ format.

To run ECHO, use the following command
> python ErrorCorrection.py -o output.fastq input.txt

NOTE: The above command will generally work on relatively small data set sizes. The exact size depends on the amount of avaiable RAM. Please read the below 'Notes on Memory Usage' for additional details on running ECHO on larger data sets.

Help messages for additional options can be found by using
> python ErrorCorrection.py -h

Notes on Memory Usage:
---------------------
It is very important that ECHO does not saturate all available memory. If this happens, ECHO will begin to swap with the hard disk, which will cause it to run very slowly, much more slowly than it would otherwise if certain parameters were set correctly.

There are two parameters that should be tweaked:
a) -b denotes the number of reads per read block.

b) --nh will divide the kmer space into nh blocks.
    The larger the data set, the higher nh needs to be.
    Exactly what this value needs to be depends on the available RAM.

These parameters are set on the command line.

Examples:
For 5M reads, the following will typically run well with 8GB memory:
> python ErrorCorrection.py -b 2000000 --nh 256 -o output/5Mreads.fastq 5Mreads.txt

For 20M reads, the following will typically run well with 8GB memory:
> python ErrorCorrection.py -b 2000000 --nh 1024 -o output/20Mreads.fastq 20Mreads.txt

Using Multiple Threads: 
----------------------
Use the --ncpu command line option to select the number of threads to use during the error correction. Note that because the main bottleneck is due to memory constraints, only a few stages of the error correction process use more than one thread. 

Example:
To use 6 threads at once:
> python ErrorCorrection.py -b 2000000 --nh 256 --ncpu 6 -o output/5Mreads.fastq 5Mreads.txt

Log files:
---------
Every execution of ECHO produces a log file. The default location is the log/ directory. The log file records ECHO's progress during the run.

History of ECHO versions:
------------------------

Changes in 1.12:
----------------
- Fixed bug regarding custom temporary directory names.

Changes in 1.11:
----------------
- Changed struct type for writing out reads to unsigned long long from unsigned long in ErrorCorrection.py. This was causing some compatibility issues.

Changes in 1.10:
----------------
- Fixed a bug where temporary files were created in /tmp, rather than ./tmp.

Changes in 1.09:
----------------
- Removed dependence on C++0x features. Now using several functions from TR1, which has been included since at least gcc 4.1.

Changes in 1.08:
----------------
- Changed which kmer bins are considered for constructing the read overlap graph. Kmer bins with too many reads are not ignored to improve running times.

Changes in 1.07a:
----------------
- Changed default max_paramh to 50 from 24. 

Changes in 1.06a:
----------------
- Changed voting procedure and quality score computation.

--------------------------------------------------------------
Changes from 1.04 and 1.05 are not applicable to this version.
--------------------------------------------------------------

Changes in 1.03:
----------------
- Changed data type for read indexing in the reverse complement file to accommodate for large data sets (>30M reads).

Changes in 1.02:
----------------
- Added support for FASTQ and gzipped FASTQ input files.

Changes in 1.01:
----------------
- Added support for gzipped input files.
- Fixed a bug in the error message when checking for the existence of the input file.

Bug Reports:
-----------
Wei-Chun Kao <wckao@EECS.Berkeley.EDU>
Andrew H. Chan <andrewhc@EECS.Berkeley.EDU>
Department of EECS
U.C. Berkeley