CRISPRessoPooled Examples

Example:

Using Bioconda:

CRISPRessoPooled -r1 SRR1046762_1.fastq.gz -r2 SRR1046762_2.fastq.gz -f AMPLICONS_FILE.txt --name ONLY_AMPLICONS_SRR1046762 --gene_annotations gencode_v19.gz

Using Docker:

docker run -v ${PWD}:/DATA -w /DATA -i pinellolab/crispresso2 CRISPRessoPooled -r1 SRR1046762_1.fastq.gz -r2 SRR1046762_2.fastq.gz -f AMPLICONS_FILE.txt --name ONLY_AMPLICONS_SRR1046762 --gene_annotations gencode_v19.gz

The output of CRISPRessoPooled Amplicons mode consists of:

  1. REPORT_READS_ALIGNED_TO_AMPLICONS.txt: this file contains the same information provided in the input description file, plus some additional columns:

    a. Demultiplexed_fastq.gz_filename: name of the files containing the raw reads for each amplicon.

    b. n_reads: number of reads recovered for each amplicon.

  2. A set of fastq.gz files, one for each amplicon.

  3. A set of folders, one for each amplicon, containing a full CRISPResso report.

  4. SAMPLES_QUANTIFICATION_SUMMARY.txt: this file contains a summary of the quantification and the alignment statistics for each region analyzed (read counts and percentages for the various classes: Unmodified, NHEJ, point mutations, and HDR).

  5. CRISPRessoPooled_RUNNING_LOG.txt: execution log and messages for the external utilities called.

Genome mode: In this mode the tool aligns each read to the best location in the genome. Then potential amplicons are discovered looking for regions with enough reads (the default setting is to have at least 1000 reads, but the parameter can be adjusted with the option --min_reads_to_use_region). If a gene annotation file from UCSC is provided, the tool also reports the overlapping gene/s to the region. In this way it is possible to check if the amplified regions map to expected genomic locations and/or also to pseudogenes or other problematic regions. Finally CRISPResso is run in each region discovered.

To run the tool in this mode the user must provide:

  1. Paired-end reads (two files) or single-end reads (single file) in FASTQ format (fastq.gz files are also accepted)

  2. The full path of the reference genome in bowtie2 format (e.g. /genomes/human_hg19/hg19). Instructions on how to build a custom index or precomputed index for human and mouse genome assembly can be downloaded from the bowtie2 website: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml.

  3. Optionally the full path of a gene annotations file from UCSC. The user can download this file from the UCSC Genome Browser ( http://genome.ucsc.edu/cgi-bin/hgTables?command=start ) selecting as table "knownGene", as output format "all fields from selected table" and as file returned "gzip compressed". (e.g. /genomes/human_hg19/gencode_v19.gz)

Example:

Using Bioconda:

CRISPRessoPooled -r1 SRR1046762_1.fastq.gz -r2 SRR1046762_2.fastq.gz -x /GENOMES/hg19/hg19 --name ONLY_GENOME_SRR1046762 --gene_annotations gencode_v19.gz

Using Docker:

docker run -v ${PWD}:/DATA -w /DATA -i pinellolab/crispresso2 CRISPRessoPooled -r1 SRR1046762_1.fastq.gz -r2 SRR1046762_2.fastq.gz -x /GENOMES/hg19/hg19 --name ONLY_GENOME_SRR1046762 --gene_annotations gencode_v19.gz

The output of CRISPRessoPooled Genome mode consists of:

  1. REPORT_READS_ALIGNED_TO_GENOME_ONLY.txt: this file contains the list of all the regions discovered, one per line with the following information:

    • chr_id: chromosome of the region in the reference genome.

    • bpstart: start coordinate of the region in the reference genome.

    • bpend: end coordinate of the region in the reference genome.

    • fastq_file: location of the fastq.gz file containing the reads mapped to the region.

    • n_reads: number of reads mapped to the region.

    • sequence: the sequence, on the reference genome for the region.

  2. MAPPED_REGIONS (folder): this folder contains all the fastq.gz files for the discovered regions.

  3. A set of folders with the CRISPResso report on the regions with enough reads.

  4. SAMPLES_QUANTIFICATION_SUMMARY.txt: this file contains a summary of the quantification and the alignment statistics for each region analyzed (read counts and percentages for the various classes: Unmodified, NHEJ, point mutations, and HDR).

  5. CRISPRessoPooled_RUNNING_LOG.txt: execution log and messages for the external utilities called.

    This running mode is particularly useful to check for mapping artifacts or contamination in the library. In an optimal experiment, the list of the regions discovered should contain only the regions for which amplicons were designed.

Mixed mode (Amplicons + Genome): in this mode, the tool first aligns reads to the genome and, as in the Genome mode, discovers aligning regions with reads exceeding a tunable threshold. Next it will align the amplicon sequences to the reference genome and will use only the reads that match both the amplicon locations and the discovered genomic locations, excluding spurious reads coming from other regions, or reads not properly trimmed. Finally CRISPResso is run using each of the surviving regions.

To run the tool in this mode the user must provide:

  • Paired-end reads (two files) or single-end reads (single file) in FASTQ format (fastq.gz files are also accepted)

  • A description file containing the amplicon sequences used to enrich regions in the genome and some additional information (as described in the Amplicons mode section).

  • The reference genome in bowtie2 format (as described in Genome mode section).

  • Optionally the gene annotations from UCSC (as described in Genome mode section).

Example:

Using Bioconda:

CRISPRessoPooled -r1 SRR1046762_1.fastq.gz -r2 SRR1046762_2.fastq.gz -f AMPLICONS_FILE.txt -x /GENOMES/hg19/hg19 --name AMPLICONS_AND_GENOME_SRR1046762 --gene_annotations gencode_v19.gz

Using Docker:

docker run -v ${PWD}:/DATA -w /DATA -i pinellolab/crispresso2 CRISPRessoPooled -r1 SRR1046762_1.fastq.gz -r2 SRR1046762_2.fastq.gz -f AMPLICONS_FILE.txt -x /GENOMES/hg19/hg19 --name AMPLICONS_AND_GENOME_SRR1046762 --gene_annotations gencode_v19.gz

The output of CRISPRessoPooled Mixed Amplicons + Genome mode consists of these files:

  1. REPORT_READS_ALIGNED_TO_GENOME_AND_AMPLICONS.txt: this file contains the same information provided in the input description file, plus some additional columns:

    a. Amplicon_Specific_fastq.gz_filename: name of the file containing the raw reads recovered for the amplicon.

    b. n_reads: number of reads recovered for the amplicon.

    c. Gene_overlapping: gene/s overlapping the amplicon region.

    d. chr_id: chromosome of the amplicon in the reference genome.

    e. bpstart: start coordinate of the amplicon in the reference genome.

    f. bpend: end coordinate of the amplicon in the reference genome.

    g. Reference_Sequence: sequence in the reference genome for the region mapped for the amplicon.

  2. MAPPED_REGIONS (folder): this folder contains all the fastq.gz files for the discovered regions.

  3. A set of folders with the CRISPResso report on the amplicons with enough reads.

  4. SAMPLES_QUANTIFICATION_SUMMARY.txt: this file contains a summary of the quantification and the alignment statistics for each region analyzed (read counts and percentages for the various classes: Unmodified, NHEJ, point mutations, and HDR).

  5. CRISPRessoPooled_RUNNING_LOG.txt: execution log and messages for the external utilities called.

The Mixed mode combines the benefits of the two previous running modes. In this mode it is possible to recover in an unbiased way all the genomic regions contained in the library, and hence discover contaminations or mapping artifacts. In addition, by knowing the location of the amplicon with respect to the reference genome, reads not properly trimmed or mapped to pseudogenes or other problematic regions will be automatically discarded, providing the cleanest set of reads to quantify the mutations in the target regions with CRISPResso.

If the focus of the analysis is to obtain the best quantification of editing efficiency for a set of amplicons, we suggest running the tool in the Mixed mode. The Genome mode is instead suggested to check problematic libraries, since a report is generated for each region discovered, even if the region is not mappable to any amplicon (however, his may be time consuming). Finally the Amplicon mode is the fastest, although the least reliable in terms of quantification accuracy.