Extract Splice Site Motifs for MaxEntScan Analysis (5' or 3')

This function extracts splice site motifs, either 5' splice sites (5ss, donor) or 3' splice sites (3ss, acceptor), for each intron in a genomic dataset. Motif coordinates are computed according to strand orientation, and the extracted sequences can optionally be written to a FASTA file for downstream analysis (e.g., MaxEntScan).

Usage

extract_ss_motif(input, genome, type, verbose, save_fasta, output_file)

Arguments

input

A data frame containing genomic information with the following required columns:

seqnames: Chromosome or scaffold names.
strand: Strand orientation, either '+' or '-'.
intron_start: Start position of the intron.
intron_end: End position of the intron.
transcript_id: Identifier for the transcript.
intron_number: Identifier for the intron within the transcript.
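
The column requirements above can be checked before calling the function. The following is a minimal sketch in base R; check_intron_input is a hypothetical helper, not part of the package, and the toy values (including the "intron1" label) are illustrative assumptions only.

```r
# Hypothetical helper (not part of the package API): verifies that a data
# frame carries the columns listed above and that strand is '+' or '-'.
check_intron_input <- function(input) {
  required <- c("seqnames", "strand", "intron_start",
                "intron_end", "transcript_id", "intron_number")
  missing_cols <- setdiff(required, names(input))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  bad_strand <- !input$strand %in% c("+", "-")
  if (any(bad_strand)) {
    stop("strand must be '+' or '-' (", sum(bad_strand), " invalid rows)")
  }
  invisible(TRUE)
}

# Toy input with made-up values, for illustration only
introns_ok <- data.frame(
  seqnames = "chr1", strand = "+",
  intron_start = 12228, intron_end = 12612,
  transcript_id = "ENST00000456328", intron_number = "intron1"
)
check_intron_input(introns_ok)
```

Running such a check first tends to give a clearer error than a failure deep inside sequence extraction.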

genome

A BSgenome object providing the reference sequence (default is BSgenome.Hsapiens.UCSC.hg38).

type

A string indicating which splice site motif to extract. One of "5ss" (donor splice site) or "3ss" (acceptor splice site).

verbose

Logical; if TRUE, progress messages will be printed. Default is TRUE.

save_fasta

Logical; if TRUE, a FASTA file will be saved containing the extracted motifs. Default is FALSE.

output_file

A string specifying the output file path and name for the FASTA file. If NULL, a default name is chosen according to type: "5ss_motif_fasta.fa" for "5ss" or "3ss_motif_fasta.fa" for "3ss".
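
How save_fasta and output_file interact can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: write_motif_fasta is a hypothetical helper, and the header layout (transcript ID and intron number joined by an underscore) is assumed.

```r
# Hypothetical helper illustrating the documented behavior; not the
# package's own code.
write_motif_fasta <- function(motifs, type = c("5ss", "3ss"), output_file = NULL) {
  type <- match.arg(type)
  # Fall back to the documented default file name when none is given
  if (is.null(output_file)) {
    output_file <- paste0(type, "_motif_fasta.fa")
  }
  seq_col <- if (type == "5ss") "donor_ss_motif" else "acceptor_ss_motif"
  # Assumed header layout: ">transcript_id_intron_number"
  headers <- paste0(">", motifs$transcript_id, "_", motifs$intron_number)
  # Interleave header and sequence lines: h1, s1, h2, s2, ...
  records <- as.vector(rbind(headers, as.character(motifs[[seq_col]])))
  writeLines(records, output_file)
  invisible(output_file)
}

# Toy motifs (made-up sequences), written to a temporary file
toy <- data.frame(
  transcript_id = c("tx1", "tx1"),
  intron_number = c("intron1", "intron2"),
  donor_ss_motif = c("CAGGTAAGT", "AAGGTGAGT")
)
out <- write_motif_fasta(toy, "5ss", output_file = tempfile(fileext = ".fa"))
readLines(out)
```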

Value

A data frame with:

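donor_ss_motif (for "5ss") or acceptor_ss_motif (for "3ss"): the motif sequence, 9 bp for a 5' splice site or 23 bp for a 3' splice site. These are the window sizes MaxEntScan expects (3 exonic + 6 intronic bases, and 20 intronic + 3 exonic bases, respectively).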
Genomic coordinates and transcript metadata.

Details

This function performs the following steps:

Based on the type argument, the function prepares coordinates for extracting either donor (5ss) or acceptor (3ss) splice site motifs, adjusting the motif start and end positions depending on the strand orientation.
The motif sequences are then extracted from the specified genome using the getSeq function from the BSgenome package.
If save_fasta is TRUE, a FASTA file is generated containing the extracted motifs, with transcript IDs and intron numbers used as FASTA headers.
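
The strand-dependent coordinate step can be sketched in base R as below. This is a sketch, not the package's code. It assumes that intron_start and intron_end are 1-based, inclusive genomic coordinates with intron_start <= intron_end on both strands, and that the windows follow the MaxEntScan convention (5ss: 3 exonic + 6 intronic bases; 3ss: 20 intronic + 3 exonic bases). motif_window is a hypothetical helper name.

```r
# Hypothetical helper: computes genomic motif windows per intron.
# On the '-' strand the donor lies at intron_end and the acceptor at
# intron_start, so the offsets are mirrored.
motif_window <- function(introns, type = c("5ss", "3ss")) {
  type <- match.arg(type)
  plus <- introns$strand == "+"
  if (type == "5ss") {
    # 3 exonic bases + 6 intronic bases = 9 bp
    start <- ifelse(plus, introns$intron_start - 3, introns$intron_end - 5)
    end   <- ifelse(plus, introns$intron_start + 5, introns$intron_end + 3)
  } else {
    # 20 intronic bases + 3 exonic bases = 23 bp
    start <- ifelse(plus, introns$intron_end - 19, introns$intron_start - 3)
    end   <- ifelse(plus, introns$intron_end + 3,  introns$intron_start + 19)
  }
  cbind(introns, motif_start = start, motif_end = end)
}

# Toy introns with made-up coordinates, one per strand
toy <- data.frame(
  seqnames = "chr1", strand = c("+", "-"),
  intron_start = c(1000, 5000), intron_end = c(2000, 6000)
)
w5 <- motif_window(toy, "5ss")
w3 <- motif_window(toy, "3ss")
w5$motif_end - w5$motif_start + 1  # window widths: 9 9
w3$motif_end - w3$motif_start + 1  # window widths: 23 23
```

Windows like these, turned into stranded genomic ranges, could then be passed to getSeq, which returns the reverse complement for ranges on the '-' strand so that every motif reads 5' to 3' in the transcript's orientation.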

Examples

file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
introns <- extract_introns(gtf_v1)
#> Preparing input...
#> Removing single-exon transcripts...
#> Single-exon transcripts: 1
#> Extracting intron coordinates...
#> Collecting intron data...
#> Total introns: 3
suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38))
# Extract donor splice site motifs (save_fasta defaults to FALSE, so no file is written)
donor_df <- extract_ss_motif(introns, BSgenome.Hsapiens.UCSC.hg38, "5ss", verbose = FALSE)

# Extract acceptor splice site motifs, again without saving a FASTA file
acceptor_df <- extract_ss_motif(introns, BSgenome.Hsapiens.UCSC.hg38, "3ss", verbose = FALSE)