CRISPRessoPooled Genome Mode

Genome Mode Input

In this mode the tool aligns each read to the best location in the genome. Then potential amplicons are discovered looking for regions with enough reads (the default setting is to have at least 1000 reads, but the parameter can be adjusted with the option --min_reads_to_use_region). If a gene annotation file from UCSC is provided, the tool also reports the overlapping gene/s to the region. In this way it is possible to check if the amplified regions map to expected genomic locations and/or also to pseudogenes or other problematic regions. Finally CRISPResso is run in each region discovered.

To run the tool in this mode the user must provide:

  • Paired-end reads (two files) or single-end reads (single file) in FASTQ format (fastq.gz files are also accepted)

  • The full path of the reference genome in bowtie2 format (e.g. /genomes/human_hg19/hg19). Instructions on how to build a custom index or precomputed index for human and mouse genome assembly can be downloaded from the bowtie2 website: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml.

  • Optionally the full path of a gene annotations file from UCSC. The user can download this file from the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgTables?command=start) selecting as table "knownGene", as output format "all fields from selected table" and as file returned "gzip compressed". (e.g. /genomes/human_hg19/gencode_v19.gz)

Genome Mode Output

The output of CRISPRessoPooled Genome mode consists of:

  1. REPORT_READS_ALIGNED_TO_GENOME_ONLY.txt: this file contains the list of all the regions discovered, one per line with the following information:

    1. chr_id: chromosome of the region in the reference genome.

    2. bpstart: start coordinate of the region in the reference genome.

    3. bpend: end coordinate of the region in the reference genome.

    4. fastq_file: location of the fastq.gz file containing the reads mapped to the region.

    5. n_reads: number of reads mapped to the region.

    6. sequence: the sequence, on the reference genome for the region.

  2. MAPPED_REGIONS (folder): this folder contains all the fastq.gz files for the discovered regions.

  3. A set of folders with the CRISPResso report on the regions with enough reads.

  4. SAMPLES_QUANTIFICATION_SUMMARY.txt: this file contains a summary of the quantification and the alignment statistics for each region analyzed (read counts and percentages for the various classes: Unmodified, NHEJ, point mutations, and HDR).

  5. CRISPRessoPooled_RUNNING_LOG.txt: execution log and messages for the external utilities called.

This running mode is particularly useful to check for mapping artifacts or contamination in the library. In an optimal experiment, the list of the regions discovered should contain only the regions for which amplicons were designed.