Extract Coding Sequences (CDS) from GTF Annotations — extract_cds_sequences

Extracts CDS regions from a GTF annotation file or data frame using genomic coordinates and retrieves corresponding DNA sequences from a BSgenome reference.

Usage

extract_cds_sequences(input, genome, save_fasta, output_file, verbose)

Arguments

input: A character string (path to a GTF file) or a data frame containing CDS annotations. A GRanges object is also accepted, as in the example below.
genome: A BSgenome object for the relevant genome. Defaults to human (hg38).
save_fasta: A logical indicating whether to save sequences to a FASTA file. Defaults to FALSE.
output_file: A character string specifying the path of the FASTA output file. If NULL, the file is written to "CDS.fa".
verbose: A logical indicating whether to print progress messages. Defaults to TRUE.

Value

A data frame containing CDS annotations with corresponding sequences. If save_fasta = TRUE, also writes a FASTA file.
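
As a rough illustration of the optional FASTA output, the base R sketch below writes one header line and one sequence line per CDS. The `id` and `sequence` column names and the header format are assumptions made for this example; they are not taken from the package's actual output, and the file is written to a temporary directory rather than the working directory.

```r
# Hypothetical result: one row per CDS with an identifier and its sequence.
# These column names are assumptions, not the package's documented schema.
cds <- data.frame(
  id       = c("cds_1", "cds_2"),
  sequence = c("ATGGCC", "TACAAT"),
  stringsAsFactors = FALSE
)

save_fasta  <- TRUE
output_file <- NULL

if (save_fasta) {
  # Fall back to the documented default file name when no path is given.
  # A temporary directory is used here so the sketch leaves no files behind.
  if (is.null(output_file)) output_file <- file.path(tempdir(), "CDS.fa")
  # FASTA alternates a ">" header line with the sequence line.
  fasta_lines <- as.vector(rbind(paste0(">", cds$id), cds$sequence))
  writeLines(fasta_lines, output_file)
}
```

Reading the file back with `readLines(output_file)` shows the alternating header and sequence lines.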

Details

This function selects the CDS entries from the input GTF, extracts their sequences from the reference genome, and optionally saves them in FASTA format. The result is useful for downstream analyses such as protein translation.
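
The general idea can be sketched in base R without any Bioconductor dependency. This is a minimal illustration, not the package's implementation: the toy genome is a plain character string standing in for the BSgenome object, the column names (`seqnames`, `type`, `start`, `end`, `strand`) are assumed, and `revcomp` is a hypothetical helper. Reverse-complementing minus-strand features is the usual convention for strand-aware extraction and is assumed here rather than confirmed by this page.

```r
# Toy reference: one "chromosome" stored as a plain string (assumption;
# the real function reads sequences from a BSgenome object).
toy_genome <- c(chr1 = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Toy GTF-like table with 1-based, inclusive coordinates, as in GTF.
toy_gtf <- data.frame(
  seqnames = c("chr1", "chr1", "chr1"),
  type     = c("exon", "CDS", "CDS"),
  start    = c(1, 1, 7),
  end      = c(20, 6, 12),
  strand   = c("+", "+", "-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: reverse complement for minus-strand features.
revcomp <- function(x) {
  comp <- chartr("ACGT", "TGCA", x)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Step 1: keep only the CDS rows.
cds <- toy_gtf[toy_gtf$type == "CDS", ]

# Step 2: slice the reference by coordinates, respecting strand.
cds$sequence <- mapply(function(chr, s, e, strand) {
  seq <- substr(toy_genome[[chr]], s, e)
  if (strand == "-") revcomp(seq) else seq
}, cds$seqnames, cds$start, cds$end, cds$strand, USE.NAMES = FALSE)

print(cds)
```

The returned table keeps the annotation columns and gains a `sequence` column, mirroring the "annotations with corresponding sequences" shape described under Value.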

Examples

file_v1 <- system.file("extdata", "gencode.v1.example.gtf.gz", package = "GencoDymo2")
gtf_v1 <- load_file(file_v1)
# Human CDS extraction
suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38))
suppressPackageStartupMessages(library(GenomicRanges))
gtf_granges <- GRanges(gtf_v1)
cds_seqs <- extract_cds_sequences(gtf_granges, BSgenome.Hsapiens.UCSC.hg38, save_fasta = FALSE)
#> Using provided GRanges object...
#> Warning: No CDS features found in the GTF data.
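
The warning above indicates that the example annotation contains no CDS rows. A quick way to check an input before calling the function is to tabulate its feature types, as sketched below on a toy table. The `type` column name is an assumption based on how the GTF feature field is commonly named after import; adjust it to match the actual data.

```r
# Toy stand-in for a loaded GTF table; the `type` column name is an assumption.
toy_gtf <- data.frame(
  type  = c("gene", "transcript", "exon", "exon"),
  start = c(1, 1, 1, 50),
  end   = c(100, 100, 20, 100),
  stringsAsFactors = FALSE
)

# Tabulate feature types and check whether any CDS rows are present.
feature_counts <- table(toy_gtf$type)
has_cds <- "CDS" %in% names(feature_counts)

if (!has_cds) {
  message("No CDS rows present; extraction would have nothing to return.")
}
```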