The Problem Nobody Talks About Enough
21,000 Canadians die of lung cancer every year. Not because we don't have treatments — but because we find it too late.
Most cases are diagnosed at Stage III or IV, when the 5-year survival rate sits below 15%. Catch it at Stage I and that number flips to above 80%. The cancer isn't different. The timing is everything.
Current screening tools like low-dose CT are expensive, expose patients to radiation, and are only recommended for high-risk smokers. The majority of the population has no viable early screening option. A blood test could change that — and the science to build one already exists.
That's what this project is about.
The Biology (Bear With Me)
When cells die, they release fragments of their DNA into the bloodstream. This circulating material is called cell-free DNA, or cfDNA. In cancer patients, a fraction of that cfDNA comes from tumor cells — and it carries a distinct biological fingerprint.
That fingerprint is DNA methylation.
Methylation is a chemical modification — a methyl group (CH₃) attached to specific cytosine bases at locations called CpG sites. It doesn't change your DNA sequence, but it controls which genes get expressed. Think of it as a dimmer switch on top of your genetic code.
Cancer cells have a disrupted methylation pattern. They silence tumor suppressor genes that should be active, and activate oncogene regions that should be quiet. This isn't random — it's systematic, reproducible, and measurable.
The Illumina 450K array measures methylation at more than 450,000 CpG sites simultaneously (about 485,000 probes in total). Each site returns a beta value between 0 (unmethylated) and 1 (fully methylated). That's your data: a matrix of numbers that encodes the epigenetic state of a tissue sample.
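For intuition on where a beta value comes from, here is a minimal sketch. The TCGA data already ships as beta values, so this step is not part of the pipeline. It assumes the commonly cited definition, methylated intensity divided by total intensity plus a small stabilizing offset (100 is the usual default, treated here as an assumption). The intensities are made up for illustration.

```python
import numpy as np


def beta_value(methylated, unmethylated, offset=100.0):
    """Beta = M / (M + U + offset).

    The offset (assumed default: 100) keeps the ratio stable
    when both intensities are close to zero.
    """
    m = np.asarray(methylated, dtype=float)
    u = np.asarray(unmethylated, dtype=float)
    return m / (m + u + offset)


# Hypothetical probe intensities for three CpG sites in one sample
meth = [50.0, 4000.0, 9000.0]
unmeth = [9000.0, 4000.0, 50.0]

betas = beta_value(meth, unmeth)
print(np.round(betas, 3))  # one value per probe, each between 0 and 1
```

Stack one such vector per sample and you get the samples × probes matrix the rest of the pipeline works on.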
The question is whether a machine learning model can read that matrix and tell tumor from healthy.
I Almost Didn't Build This
I'm a first-year Biomedical Engineering student at the University of Waterloo. Before this hackathon, I had never written a single line of machine learning code.
For months I'd been telling myself ML was too complex. Too mathematical. Something you work up to after years of prerequisites, not something you just start. I'd read papers, bookmarked tutorials, and consistently found reasons to wait until I was "ready."
Hack Canada 2026 forced my hand. 36 hours. Solo. No excuses.
The project that came out of it — earlySignal — is imperfect, limited, and one of the things I'm most proud of building. Not because it solves the problem. Because it honestly shows where the problem actually lives.
What I Built
The pipeline has five stages:
- Data acquisition: TCGA-LUAD methylation array data streamed directly from UCSC Xena. 132 patient samples, each with ~413,000 CpG probe measurements. Half tumor tissue, half normal lung tissue.
- Preprocessing: Probes with more than 20% missing values are filtered out, and the remaining missing values are imputed with column means. Result: a clean beta value matrix of 132 samples × ~413,000 probes.
- Feature selection: I built a custom scikit-learn transformer called TopVarianceSelector. It computes the variance of each probe's beta values across all training samples and keeps the top 5,000. The logic is straightforward: a probe that sits at nearly the same value in every sample, tumor or healthy, has near-zero variance and carries no discriminating information. A probe that's consistently methylated in tumor but unmethylated in healthy tissue will show high variance across a mixed dataset, and that variance is your signal. Critically, this filter runs inside the cross-validation loop, not before it. If the filter sees the test fold when deciding which probes to keep, it's using information it shouldn't have access to. That's data leakage, and it can make your results look better than they actually are. The sklearn Pipeline enforces the boundary.
- Model training: XGBoost classifier with 5-fold stratified cross-validation. XGBoost builds decision trees sequentially. Each tree corrects the errors of the previous ensemble rather than averaging independent trees like Random Forest. For complex, non-linear methylation patterns with limited samples, that iterative correction matters.
- Evaluation: CV AUC: 0.924. Holdout AUC: 1.0. Strong results, but this is the easy version of the problem.
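The stages above can be sketched roughly as follows. This is not the earlySignal source, just a minimal illustration under some assumptions: the data is synthetic, the internals of TopVarianceSelector and the preprocess helper are my guess at one reasonable implementation, and scikit-learn's GradientBoostingClassifier stands in for XGBoost to keep dependencies light (with xgboost installed, xgboost.XGBClassifier should slot into the same place).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def preprocess(X, max_missing=0.20):
    """Drop probes (columns) with too many NaNs, then mean-impute the rest."""
    keep = np.isnan(X).mean(axis=0) <= max_missing
    X = X[:, keep]
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


class TopVarianceSelector(TransformerMixin, BaseEstimator):
    """Keep the k probes with the highest variance in the data passed to fit."""

    def __init__(self, k=5000):
        self.k = k

    def fit(self, X, y=None):
        variances = np.var(np.asarray(X), axis=0)
        k = min(self.k, variances.shape[0])
        # Column indices of the k largest variances, learned from this fold only
        self.selected_ = np.argsort(variances)[-k:]
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.selected_]


# Synthetic stand-in for the beta matrix (NOT TCGA data)
rng = np.random.default_rng(0)
y = np.array([0, 1] * 30)                    # 30 "normal", 30 "tumor"
X = rng.uniform(0.4, 0.6, size=(60, 500))
X[y == 1, :20] += 0.3                        # 20 probes shifted in "tumor"
X[rng.random(X.shape) < 0.02] = np.nan       # scattered missing values
X[:, -1] = np.nan                            # one probe missing everywhere

X_clean = preprocess(X)

# The selector sits inside the Pipeline, so each CV split refits it
# on the training folds and never sees the held-out fold.
pipe = Pipeline([
    ("select", TopVarianceSelector(k=50)),
    ("clf", GradientBoostingClassifier(random_state=0)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_clean, y, cv=cv, scoring="roc_auc")
print(f"CV AUC: {scores.mean():.3f}")
```

Because cross_val_score clones and refits the whole Pipeline per fold, the probe list can differ from fold to fold. That is the point: selection is part of the model, not a one-time cleanup step.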
The Result That Actually Matters
Tissue biopsies are clean. Tumor fraction is high, signal is strong, model performs well. But that's not what a blood test sees.
In plasma cfDNA from an early-stage lung cancer patient, tumor-derived DNA typically makes up 0.1 to 5% of total cfDNA. The other 95–99.9% comes from healthy cells: white blood cells, hepatocytes, epithelial cells. At the low end, your signal is buried in noise at a ratio of roughly 1000:1.
To test this, I built an in silico tumor fraction simulation. Take each tumor sample's methylation profile, mix it computationally with a normal profile at a defined ratio, feed the mixed profiles to the trained model, and record the AUC. Repeat across tumor fractions from 100% down to 1%.
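A minimal sketch of that mixing step, under stated assumptions: the mix is a simple linear blend of beta values, diluted tumor profiles are scored against undiluted normal profiles, the data is synthetic, and LogisticRegression stands in for the trained classifier (anything with predict_proba would work). The function names are hypothetical. The arithmetic shows why low fractions are brutal: a probe at 0.9 in tumor and 0.1 in normal reads 0.01 × 0.9 + 0.99 × 0.1 = 0.108 at 1% tumor fraction, barely distinguishable from 0.1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def mix_profiles(tumor, normal, fraction):
    """Linear blend of beta values: fraction * tumor + (1 - fraction) * normal."""
    return fraction * tumor + (1.0 - fraction) * normal


def auc_at_fraction(model, tumor_X, normal_X, fraction, rng):
    """Dilute each tumor profile with a randomly chosen normal profile, then
    score diluted tumors (label 1) against undiluted normals (label 0)."""
    partners = normal_X[rng.integers(0, len(normal_X), size=len(tumor_X))]
    diluted = mix_profiles(tumor_X, partners, fraction)
    X = np.vstack([diluted, normal_X])
    y = np.concatenate([np.ones(len(diluted)), np.zeros(len(normal_X))])
    return roc_auc_score(y, model.predict_proba(X)[:, 1])


def make_samples(rng, n, shift):
    """Synthetic beta values; the first 5 of 30 probes carry the tumor shift."""
    X = rng.normal(0.5, 0.05, size=(n, 30))
    X[:, :5] += shift
    return np.clip(X, 0.0, 1.0)


rng = np.random.default_rng(0)

# Train on "bulk tissue": pure normal vs pure tumor profiles
train_X = np.vstack([make_samples(rng, 100, 0.0), make_samples(rng, 100, 0.3)])
train_y = np.concatenate([np.zeros(100), np.ones(100)])
model = LogisticRegression(max_iter=1000).fit(train_X, train_y)

# Evaluate on fresh samples, diluted to each tumor fraction
test_normal = make_samples(rng, 100, 0.0)
test_tumor = make_samples(rng, 100, 0.3)
results = {
    f: auc_at_fraction(model, test_tumor, test_normal, f, rng)
    for f in (1.0, 0.5, 0.10, 0.05, 0.01)
}
for f, auc in results.items():
    print(f"tumor fraction {f:>5.0%}: AUC = {auc:.3f}")
```

The numbers this toy version prints are not the project's results. It only reproduces the shape of the experiment: same model, same samples, progressively less tumor signal.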
Here's what happens:
- At 100% tumor fraction — bulk tissue, what the model trained on — AUC is 1.0.
- At 10% — AUC starts degrading.
- At 5% — performance drops meaningfully.
- At 1% — AUC hits 0.542. Essentially a coin flip.
This isn't the model failing. This is the field's unsolved problem, quantified. In this simulation, a tissue-trained classifier cannot reliably pick out the tumor signal at the tumor fractions plasma cfDNA actually contains. Solving that likely requires training directly on plasma data, deconvolution models that estimate tumor fraction first, or feature selection specifically optimized for low-fraction detection.
That's not a conclusion I was hoping to reach. It's the most honest and useful thing in the project.
What's Next
This project is an early step in a much longer research question. The 30-day build period from Hack Canada gives me a concrete next step: validating this pipeline on real plasma cfDNA datasets available through GEO, rather than the tissue biopsies it was trained on.
More importantly — the project caught the attention of Dr. Fei Geng at McMaster University, whose research group works on AI-based cancer detection from blood plasma. What resonated most with me about his work isn't just the detection problem — it's the ambition to go further. Not just identifying that cancer is present, but pinpointing it precisely enough to inform treatment decisions. That's the direction this pipeline needs to grow toward.
The longer-term vision is a config-driven framework — swap the cancer type, swap the feature set, same pipeline. Colon cancer, breast cancer, any methylation dataset. The infrastructure for open, reproducible liquid biopsy research is underdeveloped. That's the gap worth filling.
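One way a config-driven setup might look, as a sketch only. None of this exists in earlySignal yet. The class name and every field name are hypothetical, and the defaults simply mirror the numbers used in this post (20% missing-value cutoff, top 5,000 probes, 5 folds, tumor fractions from 100% down to 1%).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings for one run; all field names are illustrative."""
    cohort: str                                   # e.g. a TCGA project code
    max_missing: float = 0.20                     # probe-level NaN cutoff
    n_top_probes: int = 5000                      # variance filter size
    cv_folds: int = 5
    tumor_fractions: tuple = (1.0, 0.10, 0.05, 0.01)

    def __post_init__(self):
        # Fail early on settings that cannot produce a valid run
        if not 0.0 <= self.max_missing <= 1.0:
            raise ValueError("max_missing must be between 0 and 1")
        if self.n_top_probes < 1 or self.cv_folds < 2:
            raise ValueError("need n_top_probes >= 1 and cv_folds >= 2")
        if any(not 0.0 < f <= 1.0 for f in self.tumor_fractions):
            raise ValueError("tumor fractions must lie in (0, 1]")


# Swapping the cancer type means changing one field, not the pipeline code
luad = PipelineConfig(cohort="TCGA-LUAD")
coad = replace(luad, cohort="TCGA-COAD")   # colon adenocarcinoma cohort
print(coad)
```

Keeping the config frozen and validated up front means a run can be logged and reproduced from the config alone, which is most of what "reproducible" needs at this scale.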
What This Actually Taught Me
ML is not as inaccessible as I made it in my head. The mathematical foundations matter (understanding why variance filtering works, why data leakage is dangerous, why XGBoost often outperforms Random Forest on tabular data), but you learn those things by building, not by waiting until you've mastered them in the abstract.
The more important lesson is about honesty in research. The temptation when you get AUC 1.0 on your holdout set is to stop there and call it a win. The tumor fraction simulation exists because I wanted to know what the result actually meant clinically — and the answer was uncomfortable and more valuable than the clean number.
That instinct — to push past the result that makes you look good toward the result that's actually true — is the one I want to carry into every project from here.
earlySignal is open source. GitHub and live demo linked below. Built solo at Hack Canada 2026 as a first-year BME student at the University of Waterloo.