Extract Splice Site Motifs for MaxEntScan Analysis (5' or 3')
Source: R/introns.R
extract_ss_motif.Rd
This function extracts splice site motifs, either 5' splice site (5ss) or 3' splice site (3ss), from a data frame of intron coordinates. It retrieves the donor or acceptor motif for each intron, taking strand orientation into account, and can optionally write the motifs to a FASTA file for further analysis (e.g., MaxEntScan).
Arguments
- input
A data frame containing genomic information with the following required columns:
seqnames: Chromosome or scaffold names.
strand: Strand orientation, either '+' or '-'.
intron_start: Start position of the intron.
intron_end: End position of the intron.
transcript_id: Identifier for the transcript.
intron_number: Identifier for the intron within the transcript.
- genome
A genome object from the BSgenome package (default is BSgenome.Hsapiens.UCSC.hg38).
- type
A string indicating which splice site motif to extract. One of "5ss" (donor splice site) or "3ss" (acceptor splice site).
- verbose
Logical; if TRUE, progress messages will be printed. Default is TRUE.
- save_fasta
Logical; if TRUE, a FASTA file will be saved containing the extracted motifs. Default is FALSE.
- output_file
A string specifying the output file path and name for the FASTA file. If NULL, a default name will be used (either "5ss_motif_fasta.fa" or "3ss_motif_fasta.fa").
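The column requirements for input can be checked before calling the function. The following is a minimal base R sketch, not part of GencoDymo2: the toy coordinates, the transcript IDs, and the "intron1" label format are made up for illustration, and in practice the data frame would come from extract_introns().

```r
# Toy input with the six required columns (all values are illustrative)
introns_toy <- data.frame(
  seqnames      = c("chr1", "chr1"),
  strand        = c("+", "-"),
  intron_start  = c(1001L, 5001L),
  intron_end    = c(2000L, 5800L),
  transcript_id = c("tx_plus", "tx_minus"),
  intron_number = c("intron1", "intron1"),  # assumed label format
  stringsAsFactors = FALSE
)

# Columns listed as required in the argument description
required_cols <- c("seqnames", "strand", "intron_start",
                   "intron_end", "transcript_id", "intron_number")

# Report any required column that is absent
missing_cols <- setdiff(required_cols, names(introns_toy))
if (length(missing_cols) > 0) {
  stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
}

# Strand must be '+' or '-' for the strand-aware coordinate logic
bad_strand <- !introns_toy$strand %in% c("+", "-")
if (any(bad_strand)) {
  stop("Unexpected strand values in rows: ",
       paste(which(bad_strand), collapse = ", "))
}
```

A check like this fails early with a readable message instead of an error raised deep inside sequence extraction.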
Value
A data frame with:
- donor_ss_motif or acceptor_ss_motif: 9 bp (5' ss) or 23 bp (3' ss) sequence.
- Genomic coordinates and transcript metadata.
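The 9 bp and 23 bp widths match the input that MaxEntScan expects: 3 exonic plus 6 intronic bases for a donor site, and 20 intronic plus 3 exonic bases for an acceptor site. This page does not state the exact offsets the function uses, so the sketch below is only an illustration of how such windows can be derived. It assumes 1-based inclusive coordinates with intron_start <= intron_end on both strands, and ss_motif_coords is a hypothetical helper, not a GencoDymo2 function.

```r
# Hypothetical helper: genomic window of a MaxEntScan-style motif.
# Assumes 1-based inclusive intron coordinates, intron_start <= intron_end.
ss_motif_coords <- function(strand, intron_start, intron_end,
                            type = c("5ss", "3ss")) {
  type <- match.arg(type)
  plus <- strand == "+"
  if (type == "5ss") {
    # Donor: 3 exonic bases followed by the first 6 intronic bases.
    # On '-' the donor sits at the genomic end of the intron.
    motif_start <- ifelse(plus, intron_start - 3L, intron_end - 5L)
    motif_end   <- ifelse(plus, intron_start + 5L, intron_end + 3L)
  } else {
    # Acceptor: last 20 intronic bases followed by 3 exonic bases.
    # On '-' the acceptor sits at the genomic start of the intron.
    motif_start <- ifelse(plus, intron_end - 19L, intron_start - 3L)
    motif_end   <- ifelse(plus, intron_end + 3L, intron_start + 19L)
  }
  data.frame(strand = strand, motif_start = motif_start,
             motif_end = motif_end, stringsAsFactors = FALSE)
}

donor_coords <- ss_motif_coords(c("+", "-"), c(1001L, 5001L),
                                c(2000L, 5800L), "5ss")
acceptor_coords <- ss_motif_coords(c("+", "-"), c(1001L, 5001L),
                                   c(2000L, 5800L), "3ss")

# Window widths: 9 for donor sites, 23 for acceptor sites
donor_widths    <- donor_coords$motif_end - donor_coords$motif_start + 1L
acceptor_widths <- acceptor_coords$motif_end - acceptor_coords$motif_start + 1L
```

The windows are expressed on the forward strand; sequences for '-' strand introns still need to be reverse-complemented after extraction.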
Details
This function performs the following steps:
- Based on the type argument, the function prepares coordinates for extracting either donor (5ss) or acceptor (3ss) splice site motifs, adjusting the motif start and end positions depending on the strand orientation.
- The motif sequences are then extracted from the specified genome using the getSeq function from the BSgenome package.
- If save_fasta is TRUE, a FASTA file is generated containing the extracted motifs, with transcript IDs and intron numbers used as FASTA headers.
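The extraction and FASTA steps can be illustrated without a genome package. The sketch below is a base R toy, not the function's implementation: a short made-up string stands in for a chromosome, substring() and a hand-written reverse complement stand in for getSeq (which handles strand itself), and the header format (transcript ID and intron number joined by an underscore) is an assumption, since the exact format is not given here.

```r
# Reverse-complement DNA strings using base R only
rev_comp <- function(x) {
  vapply(x, function(s) {
    comp <- chartr("ACGT", "TGCA", s)
    paste(rev(strsplit(comp, "")[[1]]), collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# Toy "chromosome": a forward-strand donor motif at 4-12 and the
# reverse complement of the same motif at 16-24
toy_chr <- paste0("TTT", "CAGGTAAGT", "TTT", "ACTTACCTG", "TTT")

# Precomputed motif windows (illustrative values)
motifs <- data.frame(
  transcript_id = c("tx_plus", "tx_minus"),
  intron_number = c("intron1", "intron1"),
  strand        = c("+", "-"),
  motif_start   = c(4L, 16L),
  motif_end     = c(12L, 24L),
  stringsAsFactors = FALSE
)

# Cut the forward-strand sequence, then flip '-' strand motifs
raw_seq <- substring(toy_chr, motifs$motif_start, motifs$motif_end)
motifs$donor_ss_motif <- ifelse(motifs$strand == "-",
                                rev_comp(raw_seq), raw_seq)

# Interleave headers and sequences: header1, seq1, header2, seq2
# (underscore-joined header is an assumed format)
headers <- paste0(">", motifs$transcript_id, "_", motifs$intron_number)
fasta_lines <- as.vector(rbind(headers, motifs$donor_ss_motif))

out_file <- tempfile(fileext = ".fa")
writeLines(fasta_lines, out_file)
```

Both toy introns yield the same 9 bp donor motif once strand is taken into account, which is the behavior the strand adjustment in the steps above is meant to guarantee.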
Examples
file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
introns <- extract_introns(gtf_v1)
#> Preparing input...
#> Removing single-exon transcripts...
#> Single-exon transcripts: 1
#> Extracting intron coordinates...
#> Collecting intron data...
#> Total introns: 3
suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38))
# Extract donor splice site motifs
motifs_df <- extract_ss_motif(introns, BSgenome.Hsapiens.UCSC.hg38, "5ss", verbose = FALSE)
# Extract acceptor splice site motifs (save_fasta defaults to FALSE, so no file is written)
acceptor_motifs_df <- extract_ss_motif(introns, BSgenome.Hsapiens.UCSC.hg38, "3ss", verbose = FALSE)