This function identifies potential cryptic splice sites by comparing the donor and acceptor splice site motifs of each intron against the provided canonical motifs. A site is treated as cryptic when its sequence does not match a canonical donor motif (GT by default) or a canonical acceptor motif (AG by default). Sites that differ from the canonical patterns are flagged, which makes the function useful for studying aberrant splicing events.

Usage

find_cryptic_splice_sites(input, genome, canonical_donor = c("GT"), canonical_acceptor = c("AG"), verbose = TRUE)

Arguments

input

A data frame containing intron coordinates, ideally generated by extract_introns() and assign_splice_sites(). Must contain columns: seqnames, intron_start, intron_end, strand, transcript_id, intron_number, gene_name, gene_id, donor_ss and acceptor_ss.

genome

A BSgenome object representing the genome sequence. This is used to extract the sequence for each intron to identify splice sites.

canonical_donor

A character vector of canonical donor splice site motifs. Default is c("GT").

canonical_acceptor

A character vector of canonical acceptor splice site motifs. Default is c("AG").

verbose

Logical; if TRUE, progress messages are printed. Default is TRUE.
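
Because canonical_donor and canonical_acceptor are character vectors, more than one motif can be listed as canonical. The base R sketch below shows, on made-up motifs, how widening the donor set changes which sites would be flagged. It assumes the comparison is a simple set-membership test; the package's internal implementation may differ, and the choice of GC as an extra donor motif is only an illustration.

```r
# Toy donor motifs (illustrative values, not real annotation data)
donor_ss <- c("GT", "GC", "AT", "GT")

# Default-like set: only GT counts as canonical
strict_set <- c("GT")
# Wider set: also accept GC donors (an assumption made for this example)
wide_set <- c("GT", "GC")

# A site is flagged when its motif is absent from the canonical set
cryptic_strict <- !(donor_ss %in% strict_set)
cryptic_wide   <- !(donor_ss %in% wide_set)

print(cryptic_strict)
print(cryptic_wide)
```

With the wider set, the GC donor is no longer flagged, while AT is flagged under both settings.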

Value

The input data frame with two additional logical columns:

  • cryptic_donor: TRUE if the donor site is non-canonical.

  • cryptic_acceptor: TRUE if the acceptor site is non-canonical.
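
Since both added columns are logical, they can be used directly for subsetting. The sketch below uses a small hand-made data frame as a stand-in for the returned object; the values are invented and only a few of the real columns are included.

```r
# Toy stand-in for the returned data frame (values are made up)
result <- data.frame(
  transcript_id    = c("T1", "T1", "T2"),
  intron_number    = c("intron1", "intron2", "intron1"),
  donor_ss         = c("GT", "GC", "GT"),
  acceptor_ss      = c("AG", "AG", "AC"),
  cryptic_donor    = c(FALSE, TRUE, FALSE),
  cryptic_acceptor = c(FALSE, FALSE, TRUE)
)

# Introns with at least one non-canonical end
any_cryptic <- result[result$cryptic_donor | result$cryptic_acceptor, ]

# Introns where both ends are canonical
fully_canonical <- result[!result$cryptic_donor & !result$cryptic_acceptor, ]

print(any_cryptic)
print(fully_canonical)
```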

Details

This function performs the following steps:

  • It assigns donor and acceptor splice sites to each intron using the assign_splice_sites function.

  • It compares the identified donor and acceptor splice sites against the provided canonical motifs (GT for donor and AG for acceptor by default). If the splice site sequences do not match the canonical motifs, they are flagged as cryptic.

  • The function returns a data frame with the same intron information, including additional columns cryptic_donor and cryptic_acceptor indicating whether the splice sites are cryptic.

  • If the verbose argument is TRUE, progress messages are printed, along with the total number of cryptic donor and acceptor sites and their respective percentages.
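
The comparison and summary steps above can be sketched in base R as follows. The helper flag_cryptic() is hypothetical and is not the package's code; it assumes the donor_ss and acceptor_ss columns are already present and that matching is a plain membership test against the canonical vectors.

```r
# Hypothetical helper illustrating the flagging and summary steps;
# not the actual implementation of find_cryptic_splice_sites().
flag_cryptic <- function(df,
                         canonical_donor = c("GT"),
                         canonical_acceptor = c("AG"),
                         verbose = TRUE) {
  # A site is cryptic when its motif is not among the canonical ones
  df$cryptic_donor    <- !(df$donor_ss %in% canonical_donor)
  df$cryptic_acceptor <- !(df$acceptor_ss %in% canonical_acceptor)

  if (verbose) {
    n <- nrow(df)
    n_donor    <- sum(df$cryptic_donor)
    n_acceptor <- sum(df$cryptic_acceptor)
    # Report counts and percentages of flagged sites
    message(sprintf("Detected %d cryptic donors (%.2f%% of total).",
                    n_donor, 100 * n_donor / n))
    message(sprintf("Detected %d cryptic acceptors (%.2f%% of total).",
                    n_acceptor, 100 * n_acceptor / n))
  }
  df
}

# Toy input (made-up motifs); "NN" stands for an unresolved sequence
toy <- data.frame(
  donor_ss    = c("GT", "NN", "GT", "GC"),
  acceptor_ss = c("AG", "NN", "AG", "AG")
)

flagged <- flag_cryptic(toy)
print(flagged)
```

Note that under this membership test an unresolved motif such as NN is flagged as cryptic, simply because it is not in the canonical set.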

Examples

suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38))
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
introns_df <- extract_introns(gtf_v1)
#> Preparing input...
#> Removing single-exon transcripts...
#> Single-exon transcripts: 1
#> Extracting intron coordinates...
#> Collecting intron data...
#> Total introns: 3
introns_ss <- assign_splice_sites(introns_df, genome = BSgenome.Hsapiens.UCSC.hg38)
#> Preparing donor splice sites data...
#> Preparing acceptor splice sites data...
#> Merging donor and acceptor splice site data...
cryptic_sites <- find_cryptic_splice_sites(introns_ss, BSgenome.Hsapiens.UCSC.hg38)
#> Identifying cryptic splice sites...
#> Detected 3 cryptic donors (100.00% of total).
#> Detected 3 cryptic acceptors (100.00% of total).
head(cryptic_sites)
#>   seqnames intron_start intron_end strand transcript_id intron_number gene_name
#> 1     chr1         7001       5999      -     GeneB-001       intron1      <NA>
#> 2     chr1         7501       5999      -     GeneB-002       intron1      <NA>
#> 3     chr1         7501       6599      -     GeneB-001       intron2      <NA>
#>   gene_id donor_ss acceptor_ss cryptic_donor cryptic_acceptor
#> 1   GeneB       NN          NN          TRUE             TRUE
#> 2   GeneB       NN          NN          TRUE             TRUE
#> 3   GeneB       NN          NN          TRUE             TRUE