This function identifies potential cryptic splice sites by comparing the donor and acceptor motifs of each intron with canonical splice site motifs. Here, a splice site is treated as cryptic when its motif does not match any of the provided canonical motifs (GT for donors and AG for acceptors by default). Sites that differ from the canonical patterns are flagged, which makes the function useful for studying aberrant splicing events.
Arguments
- input
A data frame containing intron coordinates, ideally generated by extract_introns() and assign_splice_sites(). Must contain columns: seqnames, intron_start, intron_end, strand, transcript_id, intron_number, gene_name, gene_id, donor_ss and acceptor_ss.
- genome
A BSgenome object representing the genome sequence. This is used to extract the sequence for each intron to identify splice sites.
- canonical_donor
A character vector of canonical donor splice site motifs. Default is c("GT").
- canonical_acceptor
A character vector of canonical acceptor splice site motifs. Default is c("AG").
- verbose
Logical; if TRUE, progress messages are printed. Default is TRUE.
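Because canonical_donor and canonical_acceptor are character vectors, more than one motif can be treated as canonical. The base R sketch below illustrates the effect on a toy vector of donor motifs. It assumes the check is a simple membership test against the motif vector, which this page does not confirm, and the choice of GC as an extra accepted donor is only an illustration.

```r
# Toy donor motifs; in real use these would come from the donor_ss column.
donor_ss <- c("GT", "GC", "AT", "GT")

# Default setting: only GT counts as canonical.
cryptic_default <- !(donor_ss %in% c("GT"))

# Wider set (assumption for illustration): GC donors are also accepted.
cryptic_relaxed <- !(donor_ss %in% c("GT", "GC"))

print(cryptic_default)
print(cryptic_relaxed)
```

With the wider motif set, the GC donor is no longer flagged, while the AT donor is flagged under both settings.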
Value
The input data frame with two additional logical columns:
- cryptic_donor: TRUE if the donor site is non-canonical.
- cryptic_acceptor: TRUE if the acceptor site is non-canonical.
Details
This function performs the following steps:
- It assigns donor and acceptor splice sites to each intron using the assign_splice_sites function.
- It compares the identified donor and acceptor splice sites against the provided canonical motifs (GT for donor and AG for acceptor by default). If the splice site sequences do not match the canonical motifs, they are flagged as cryptic.
- It returns a data frame with the same intron information, plus the additional columns cryptic_donor and cryptic_acceptor indicating whether the splice sites are cryptic.
- If the verbose argument is set to TRUE, it prints progress messages, including the total number of cryptic donor and acceptor sites and their respective percentages.
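The comparison and summary steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: flag_cryptic is a hypothetical helper, the toy data frame stands in for the output of assign_splice_sites(), and the message wording is modelled on the example output shown on this page.

```r
# Toy stand-in for a data frame with donor_ss and acceptor_ss columns.
# All values are invented for illustration.
introns <- data.frame(
  transcript_id = c("tx1", "tx1", "tx2", "tx3"),
  donor_ss      = c("GT", "GC", "GT", "AT"),
  acceptor_ss   = c("AG", "AG", "AC", "AC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: flags motifs that are absent from the canonical sets.
flag_cryptic <- function(df,
                         canonical_donor = c("GT"),
                         canonical_acceptor = c("AG"),
                         verbose = TRUE) {
  df$cryptic_donor    <- !(df$donor_ss %in% canonical_donor)
  df$cryptic_acceptor <- !(df$acceptor_ss %in% canonical_acceptor)

  if (verbose) {
    n  <- nrow(df)
    nd <- sum(df$cryptic_donor)
    na <- sum(df$cryptic_acceptor)
    message(sprintf("Detected %d cryptic donors (%.2f%% of total).",
                    nd, 100 * nd / n))
    message(sprintf("Detected %d cryptic acceptors (%.2f%% of total).",
                    na, 100 * na / n))
  }
  df
}

flagged <- flag_cryptic(introns)
print(flagged)
```

The sketch does not guard against an empty input, where the percentage would be undefined.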
Examples
suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38))
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
introns_df <- extract_introns(gtf_v1)
#> Preparing input...
#> Removing single-exon transcripts...
#> Single-exon transcripts: 1
#> Extracting intron coordinates...
#> Collecting intron data...
#> Total introns: 3
introns_ss <- assign_splice_sites(introns_df, genome = BSgenome.Hsapiens.UCSC.hg38)
#> Preparing donor splice sites data...
#> Preparing acceptor splice sites data...
#> Merging donor and acceptor splice site data...
cryptic_sites <- find_cryptic_splice_sites(introns_ss, BSgenome.Hsapiens.UCSC.hg38)
#> Identifying cryptic splice sites...
#> Detected 3 cryptic donors (100.00% of total).
#> Detected 3 cryptic acceptors (100.00% of total).
head(cryptic_sites)
#> seqnames intron_start intron_end strand transcript_id intron_number gene_name
#> 1 chr1 7001 5999 - GeneB-001 intron1 <NA>
#> 2 chr1 7501 5999 - GeneB-002 intron1 <NA>
#> 3 chr1 7501 6599 - GeneB-001 intron2 <NA>
#> gene_id donor_ss acceptor_ss cryptic_donor cryptic_acceptor
#> 1 GeneB NN NN TRUE TRUE
#> 2 GeneB NN NN TRUE TRUE
#> 3 GeneB NN NN TRUE TRUE