Human Promoter Classification Using K-mer Encoding and Classical Machine Learning

Classifying human DNA sequences as promoter or non-promoter regions using k-mer frequency vectors and three ML models, with a focus on biological interpretability.

Model comparison: Logistic Regression, SVM, and Random Forest on human promoter classification

Overview

Gene expression begins at promoters — regions of DNA that recruit the transcriptional machinery to the correct starting position. Identifying these regions computationally is a foundational problem in genomics, with direct applications to gene annotation, regulatory network reconstruction, and understanding disease-associated variants in non-coding DNA. This project builds a complete pipeline that takes raw DNA sequences, encodes them as k-mer frequency vectors, and classifies them as promoter or non-promoter using three machine learning models spanning a complexity spectrum: Logistic Regression (linear), SVM with RBF kernel (non-linear), and Random Forest (ensemble). Beyond classification accuracy, the focus is on connecting the learned features back to known promoter biology through feature importance analysis.

Background

Human promoters come in two major flavours. About 25–30% contain a TATA box (consensus sequence TATAAA) roughly 30 base pairs upstream of the transcription start site (TSS), and these tend to drive tissue-specific gene expression. The remaining 70–75% are TATA-less and are instead associated with CpG islands, regions enriched in CG dinucleotides that are typically found at housekeeping genes expressed across all cell types. Both architectures produce distinctive sequence composition signatures that can be captured by counting short subsequences (k-mers).

K-mer frequency encoding is a bag-of-words representation borrowed from natural language processing: just as a document can be represented by word frequencies, a DNA sequence can be represented by the frequencies of all possible subsequences of length k. For k=3, this produces 4^3 = 64 features per sequence. The approach is simple, interpretable, and serves as a strong baseline before moving to more complex representations like convolutional or attention-based architectures.
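The encoding described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function names `kmer_vocabulary` and `kmer_frequencies` are hypothetical, and the choice to skip windows containing non-ACGT characters (such as N) is an assumption.

```python
from collections import Counter
from itertools import product


def kmer_vocabulary(k):
    """All 4**k possible k-mers over A, C, G, T, in lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def kmer_frequencies(seq, k=3):
    """Relative frequency of every possible k-mer in seq, as a fixed-length list.

    Assumption: windows containing characters outside A/C/G/T (e.g. N) are
    ignored rather than raising an error.
    """
    seq = seq.upper()
    vocab = kmer_vocabulary(k)
    # Slide a window of width k across the sequence, one base at a time.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocab)
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[kmer] / total for kmer in vocab]


vocab = kmer_vocabulary(3)
vec = kmer_frequencies("TATAAAGGCGCG", k=3)
print(len(vec))                 # one feature per possible 3-mer
print(vec[vocab.index("GCG")])  # share of windows that read GCG
```

Normalising counts to frequencies keeps sequences of different lengths comparable; raw counts would work equally well if all inputs have the same length.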

Approach

The pipeline is built end-to-end in Python with scikit-learn and consists of four stages:

Results

All three models achieved strong classification performance, with SVM (RBF kernel) leading at 97.4% F1 and 0.997 ROC-AUC. The fact that even the simplest model (Logistic Regression) reached 95.8% F1 indicates the problem is largely linearly separable in k-mer space, with a small non-linear component that the RBF kernel exploits.

Model                Accuracy  Precision  Recall  F1 Score  ROC-AUC
Logistic Regression  0.958     0.961      0.955   0.958     0.993
SVM (RBF)            0.974     0.980      0.969   0.974     0.997
Random Forest        0.966     0.979      0.952   0.965     0.995
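A comparison of this kind could be set up in scikit-learn roughly as follows. This is a sketch under stated assumptions, not the project's configuration: the placeholder data, the 80/20 stratified split, the feature scaling, and every hyperparameter are illustrative choices, so the numbers it prints bear no relation to the table above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data (assumption): random 64-dimensional frequency vectors with
# an artificial label. Swap in the real k-mer matrix X and labels y.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(64), size=400)
signal = X[:, 0] + X[:, 1]
y = (signal > np.median(signal)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling is applied to the two models that are sensitive to feature scale.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM (RBF)": make_pipeline(
        StandardScaler(), SVC(kernel="rbf", probability=True, random_state=0)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1
    results[name] = {
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }
    print(f"{name}: F1={results[name]['f1']:.3f}  ROC-AUC={results[name]['roc_auc']:.3f}")
```

Fitting the scaler inside each pipeline means it only ever sees training data, which avoids leaking test-set statistics into the evaluation.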

Feature importance analysis revealed CG-containing k-mers (ACG, CGA, CGT, TCG, GCG) as the most discriminating features across both Logistic Regression and Random Forest — consistent with the central role of CpG dinucleotides in promoter biology. Notably, GCG carried a positive coefficient toward promoter in the Logistic Regression model, potentially reflecting GC-box motifs that serve as SP1 transcription factor binding sites in proximal promoter regions. AAA and TTT also showed strong positive associations with promoters, consistent with TATA box and other AT-rich functional elements.
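For readers who want to run a similar analysis, the sketch below shows one way to map Logistic Regression coefficients and Random Forest importances back to k-mer names. The data is a toy set with a planted GCG signal (an assumption made purely so the example has something to find), and standardising the features before fitting is likewise an illustrative choice rather than a confirmed detail of the project.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Feature names in the same order as the columns of the k-mer matrix.
kmers = ["".join(p) for p in product("ACGT", repeat=3)]

# Toy data (assumption): the label depends only on the GCG frequency.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(64), size=500)
gcg = X[:, kmers.index("GCG")]
y = (gcg > np.median(gcg)).astype(int)

# Standardise so coefficient magnitudes are comparable across k-mers.
X_std = StandardScaler().fit_transform(X)
logreg = LogisticRegression(max_iter=1000).fit(X_std, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Signed coefficients: positive pushes toward class 1, negative toward class 0.
coefs = dict(zip(kmers, logreg.coef_[0]))
# Impurity-based importances: non-negative and carry no direction.
importances = dict(zip(kmers, forest.feature_importances_))

top_lr = sorted(coefs, key=lambda km: abs(coefs[km]), reverse=True)[:5]
top_rf = sorted(importances, key=importances.get, reverse=True)[:5]

for km in top_lr:
    print(f"LogReg  {km}  coef={coefs[km]:+.3f}")
for km in top_rf:
    print(f"Forest  {km}  importance={importances[km]:.3f}")
```

Only the Logistic Regression coefficients carry a sign, which is why directional statements (toward promoter or toward non-promoter) have to come from the linear model rather than from the forest.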

An unexpected finding emerged from the coefficient directionality: most CG-containing k-mers pointed toward the non-promoter class. This reflects the fact that real genomic DNA — including promoters — carries evolutionary CpG depletion due to cytosine methylation and spontaneous deamination. The synthetic negative sequences, lacking this mutational history, retain CG dinucleotides at the frequency expected from base composition alone. The model is therefore partially learning to distinguish real from synthetic DNA, not purely promoter-specific biology.
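CpG depletion of this kind is commonly quantified with the observed/expected CpG ratio, (number of CG dinucleotides × sequence length) / (number of C × number of G). The source does not say this statistic was computed for the dataset; the helper below, with the hypothetical name `cpg_observed_expected`, is a minimal sketch of how the real and synthetic sequences could be compared.

```python
def cpg_observed_expected(seq):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G).

    A value near 1 means CG occurs about as often as base composition alone
    predicts; values well below 1 indicate CpG depletion.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    # str.count finds non-overlapping matches, and "CG" cannot overlap itself.
    cg = seq.count("CG")
    if c == 0 or g == 0:
        return 0.0  # ratio is undefined without both bases; report 0 by convention
    return cg * n / (c * g)


print(cpg_observed_expected("ACGTACGT"))  # 4.0
print(cpg_observed_expected("GGCCAATT"))  # 0.0
```

Comparing the distribution of this ratio between the positive and negative sets would show directly how much of the classifier's signal could come from the real-versus-synthetic difference.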

Limitations & Future Work

The reported performance likely represents an upper bound. The synthetic negative set, while GC-content matched, lacks the complex sequence structure of real genomic DNA — codon usage patterns, repetitive elements, splice site signals, and the evolutionary CpG depletion signature present across all real sequences. Replacing synthetic negatives with real intergenic sequences from the human genome would provide a more stringent and biologically meaningful evaluation.
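To make the limitation concrete, here is one simple way a GC-content-matched synthetic negative could be generated: each base is drawn independently, with G and C sharing the template's GC fraction. How the project actually built its negative set is not stated, so the function `gc_matched_random_sequence` and the example template are assumptions for illustration only.

```python
import random


def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_matched_random_sequence(template, rng):
    """Random sequence of the same length whose expected GC content matches.

    Bases are sampled independently, so the GC content matches only in
    expectation and dinucleotides such as CG appear at the frequency implied
    by base composition alone.
    """
    gc = gc_content(template)
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=len(template)))


rng = random.Random(0)  # seeded for reproducibility
promoter = "GCGCGGGCTATAAAGGCGCCGCGAGTCCGG"  # made-up example sequence
negative = gc_matched_random_sequence(promoter, rng)
print(negative, round(gc_content(negative), 2))
```

Because every position is independent, a generator like this cannot reproduce codon structure, repeats, or CpG depletion, which is exactly the gap that real intergenic negatives would close.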

Several extensions could strengthen the analysis: a systematic comparison across k-mer sizes (k=3, 4, 5, 6) to identify the optimal resolution for capturing promoter motifs; the addition of CNN or LSTM architectures that preserve positional information lost in the bag-of-words encoding; and cross-species evaluation (training on human, testing on mouse or Drosophila) to assess the evolutionary conservation of learned features. These represent natural next steps toward a preprint-quality contribution.

Key Takeaways
