This function takes a data frame of intron coordinates and a genome sequence (ideally human or mouse) and returns the data frame with two additional columns holding the donor and acceptor splice site consensus sequences. The sequences are extracted from the specified genome (e.g., human hg38) at positions derived from the intron coordinates, which makes the output useful for downstream analysis of splicing events.
Arguments
- input
A data frame containing intron coordinates with the following columns:
seqnames
: The chromosome name.
intron_start
: The start position of the intron.
intron_end
: The end position of the intron.
strand
: The strand on which the intron is located (+ or -).
transcript_id
: The ID of the transcript to which the intron belongs.
intron_number
: The number of the intron within the transcript.
gene_name
: The name of the gene.
gene_id
: The gene ID.
- genome
The genome sequence (BSgenome object) for the species. Default is the human genome (hg38). This object is required for extracting the consensus sequences from the genome at the specified intron positions.
- verbose
Logical. If TRUE, the function prints progress messages while preparing the splice site data. Default is TRUE.
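As an illustration of the layout expected for input, the sketch below builds a small data frame with the columns listed above and checks that none are missing. The coordinates, IDs, and the required_cols vector are invented for illustration and are not part of the package.

```r
# Toy intron table with the columns described under `input`.
# All values are invented for illustration.
introns_df <- data.frame(
  seqnames      = c("chr1", "chr1"),
  intron_start  = c(1001L, 5001L),
  intron_end    = c(2000L, 5800L),
  strand        = c("+", "-"),
  transcript_id = c("TX1", "TX2"),
  intron_number = c(1L, 1L),
  gene_name     = c("GENE_A", "GENE_B"),
  gene_id       = c("G1", "G2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper vector: the column names listed in the documentation.
required_cols <- c("seqnames", "intron_start", "intron_end", "strand",
                   "transcript_id", "intron_number", "gene_name", "gene_id")

# Stop early with a readable message if any column is absent.
missing_cols <- setdiff(required_cols, names(introns_df))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```

In practice the table would normally come from extract_introns(), as in the Examples section, rather than being built by hand.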
Value
A data frame containing the original intron data, with two additional columns:
donor_ss
: The donor splice site consensus sequence for each intron.
acceptor_ss
: The acceptor splice site consensus sequence for each intron.
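One way the two new columns might be used downstream is to tally donor/acceptor combinations. The sketch below works on a toy stand-in for the returned data frame. The sequences are invented, and treating each value as a two-base string is an assumption made for illustration, since the window the function actually extracts is not stated here.

```r
# Toy stand-in for the returned data frame; sequences are invented and
# their length is an assumption, not the package's actual window size.
result_toy <- data.frame(
  transcript_id = c("TX1", "TX1", "TX2"),
  intron_number = c(1L, 2L, 1L),
  donor_ss      = c("GT", "GT", "GC"),
  acceptor_ss   = c("AG", "AG", "AG"),
  stringsAsFactors = FALSE
)

# Count how often each donor/acceptor pair occurs.
pair_counts <- table(paste(result_toy$donor_ss, result_toy$acceptor_ss, sep = "-"))

# Flag introns carrying the canonical GT-AG pair.
result_toy$canonical <- result_toy$donor_ss == "GT" & result_toy$acceptor_ss == "AG"
```

If the real columns hold longer windows, the comparison would need to target the relevant positions within each string instead of the whole value.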
Details
This function performs the following steps:
First, it prepares the splice site sequences for both donor and acceptor sites by calculating their positions based on the strand orientation and intron coordinates. The donor splice site is typically located at the 5' end of the intron, while the acceptor splice site is at the 3' end.
The function utilizes the getSeq function from the BSgenome package to extract the nucleotide sequences for both donor and acceptor sites from the specified genome (default is hg38 for humans).
The resulting sequences are added as new columns (donor_ss and acceptor_ss) to the original input data frame.
The final data frame includes the splice site sequences for each intron, allowing for analysis of splicing efficiency or identification of consensus motifs.
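The strand-dependent position logic described above can be sketched in base R on a toy sequence. This is not the package's implementation: the real function reads from a BSgenome object via getSeq, whereas here toy_chr, rev_comp, extract_site, and the two-base window (width) are all assumptions chosen to keep the example self-contained.

```r
# Toy "chromosome": invented sequence holding one + strand intron
# (positions 5-12) and one - strand intron (positions 17-24).
toy_chr <- "AAAAGTCCCCAGTTTTCTGGGGACAAAA"

introns <- data.frame(
  intron_start = c(5L, 17L),
  intron_end   = c(12L, 24L),
  strand       = c("+", "-"),
  stringsAsFactors = FALSE
)

# Reverse-complement a DNA string using base R only.
rev_comp <- function(x) {
  reversed <- paste(rev(strsplit(x, "")[[1]]), collapse = "")
  chartr("ACGT", "TGCA", reversed)
}

# Assumed window: the first/last `width` intronic bases.
width <- 2L
is_plus <- introns$strand == "+"

# Donor = 5' end of the intron: intron_start on +, intron_end on -.
donor_start    <- ifelse(is_plus, introns$intron_start, introns$intron_end - width + 1L)
# Acceptor = 3' end of the intron: intron_end on +, intron_start on -.
acceptor_start <- ifelse(is_plus, introns$intron_end - width + 1L, introns$intron_start)

# Pull the window and reverse-complement it for - strand introns.
extract_site <- function(start, plus) {
  s <- substring(toy_chr, start, start + width - 1L)
  if (plus) s else rev_comp(s)
}

introns$donor_ss    <- mapply(extract_site, donor_start, is_plus)
introns$acceptor_ss <- mapply(extract_site, acceptor_start, is_plus)
# Both toy introns yield donor "GT" and acceptor "AG".
```

The key point is that on the minus strand the donor sits at the higher genomic coordinate, and the extracted bases must be reverse-complemented to read 5' to 3' along the transcript.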
Examples
suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38))
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
introns_df <- extract_introns(gtf_v1)
#> Preparing input...
#> Removing single-exon transcripts...
#> Single-exon transcripts: 1
#> Extracting intron coordinates...
#> Collecting intron data...
#> Total introns: 3
result <- assign_splice_sites(introns_df, genome = BSgenome.Hsapiens.UCSC.hg38)
#> Preparing donor splice sites data...
#> Preparing acceptor splice sites data...
#> Merging donor and acceptor splice site data...