How Mela sorts cause from coincidence
A worked example: the causal-discovery procedure, walked through synthetic 30-day skin data with a known ground-truth edge, showing where the procedure recovered the truth, where it found a spurious pattern, and how the hedging in user-facing copy is calibrated.
A user starts azelaic acid in the third week of her Mela trial. Three weeks later her hyperpigmentation has visibly lightened. The natural read, and the one most skincare products encourage, is that the azelaic acid worked. But she also started a new sleep routine in week two, switched cleansers in week three, and entered her follicular phase. Skin tends to read brighter in follicular for most cycle-tracking users regardless of what's on the surface. Four things changed. One thing got better. Which of the four was responsible?
Most consumer skincare claims do not answer this question. They cannot. The data they have are reports of association: users who tried this serum saw improvement. There is no procedure for separating cause from coincidence in that statement. The claim is rhetorical, not inferential. It works because consumers do not generally demand the distinction.
Mela treats the distinction as the central methodological problem of personalized skincare. The procedure has three parts: constraint-based causal discovery run on a user's longitudinal data, bootstrap resampling to filter out edges that look strong but aren't stable, and an explicit naming convention that refuses to call observational patterns "cause" even when they survive every filter. The framework comes from Spirtes, Glymour, and Scheines' work on constraint-based causal structure learning (Spirtes et al., 2000), and from Pearl's foundational distinction between association and causation in observational data (Pearl, 2009). Neither is new to statistics. Neither, to Mela's knowledge, has been applied openly inside a consumer-facing skincare product. If one has, this publication will issue a correction.
What follows walks the procedure through one specific synthetic test case. The case is constructed, not observed. Mela's user base is not yet large enough to publish observed causal patterns, and those belong in future Field Observations pieces. What this case can demonstrate is what the procedure does, what gets surfaced to the user, and where the limits sit.
Why this matters
The cost of confident-wrong attribution in skincare is not theoretical. Users build month-long routines on the belief that "this product cleared my skin," and stay on those routines for years (sometimes decades) because no procedure ever sorts out whether the product was responsible. The opposite cost is also real. Refuse to surface any pattern at all and users abandon routines that were actually helping, because they cannot tell which moving piece was doing the work.
A computational engine that produces individual-level recommendations has to navigate this trade-off explicitly. Refusing to navigate it, by publishing association-grade findings as if they were causal, is what made skincare's claim culture untrustworthy in the first place. The diagnosis is not Mela's alone. Dermatologists writing inside their own literature have observed the same gap: the industry expands faster than empirical evidence of efficacy and safety can be acquired, and many products make therapeutic claims while avoiding the regulatory framework of pharmaceuticals (Glass, 2020). Even for retinoids, well-studied as isolated actives, a recent review of their use in cosmeceutical formulations concluded there is a lack of evidence from properly designed clinical trials to support the claimed efficacy of the most commonly used products (Milosheska & Roškar, 2022). The gap is between what is known about the molecule and what is sold in the bottle. Methodological rigor in consumer health is rare. The failure to build it is a choice, not a constraint.
Background: where the field is
Dermatology research handles causal claims primarily through randomized controlled trials. RCTs are the discipline's gold standard, and the body of work is substantial: a recent 23-RCT network meta-analysis of topical retinoids for photoaging is one example among many (Lin et al., 2025); a 179-RCT network meta-analysis anchors the modern acne hierarchy (Mavranezouli et al., 2022).
What RCT evidence establishes is population-level efficacy. Under controlled conditions, this ingredient produces this effect on average across the studied population. What it does not establish is whether the ingredient will work for any particular individual. This is the N-of-1 problem. Outside dermatology, the medical literature has begun developing single-patient trial designs as a methodological response. N-of-1 trials, patient-centric studies, and adaptive designs appear in precision oncology and pediatric hypertension among other domains (Fountzilas et al., 2022). A 2023 randomized trial in pediatric hypertension found a 69–74% probability that treatment choices informed by an N-of-1 trial improved blood pressure control compared to usual care (Samuel et al., 2023). Dermatology has been slower to adopt these methods. The dominant heuristic, matching products to declared skin type, predates modern computational measurement and has poor predictive validity for individual response.
The literature on formal causal inference applied to dermatology is thin. Causal frameworks have been developed extensively for epidemiology, for econometrics, and increasingly for healthcare outcomes research. Hernán and Robins' target-trial-emulation paradigm, which formalizes how to draw causal inferences from observational data by explicitly modeling the randomized experiment one would have run, is now widely applied across cardiology, oncology, and infectious disease (Hernán & Robins, 2016; for a 2023 scoping review documenting adoption across 96 papers, see Zuo et al., 2023). The application to consumer skin data with longitudinal individual observations is largely undeveloped. The procedure Mela uses is not new to statistics. It is new to dermatology in the sense that, based on the literature Mela has surveyed and the consumer products Mela has examined, no consumer skincare product currently surfaces its causal-inference machinery openly to users. If counter-examples exist, this publication welcomes correction.
The state of the question: consumer skincare publishes association without procedure, dermatology research publishes population RCT without individual modeling, and the methodological gap between them is the territory Mela's engine is built in.
The procedure
The synthetic test case constructs 30 days of metric readings on seven skin axes (clarity, glow, texture, pigmentation, redness, barrier, oil), introduces retinol on day 10, and engineers texture to improve nonlinearly from that day onward. The other six metrics are pure noise. Ground truth is therefore known: one ingredient, one true causal edge, six metrics that should be uncorrelated with retinol.
The engine does not know this. It receives 30 daily snapshots plus the intervention indicator and runs the procedure.
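A test case of this shape can be sketched in a few lines. This is an illustration, not Mela's generator: the function name `make_synthetic_case`, the shared 7.0 baseline for the noise metrics, the saturating exponential curve, and the zero-indexed day count are all assumptions made for the example.

```python
import math
import random

METRICS = ["clarity", "glow", "texture", "pigmentation", "redness", "barrier", "oil"]

def make_synthetic_case(n_days=30, start_day=10, seed=0):
    """Return one dict per day: seven metrics plus a retinol indicator."""
    rng = random.Random(seed)
    rows = []
    for day in range(n_days):
        # Noise metrics: Gaussian around an assumed baseline of 7.0, sd 0.2.
        row = {m: rng.gauss(7.0, 0.2) for m in METRICS}
        # Texture tracks roughness (lower is better). It drops along an
        # assumed saturating curve once retinol starts, plus the same noise.
        days_on = max(0, day - start_day + 1)
        effect = 1.5 * (1 - math.exp(-days_on / 7.0))
        row["texture"] = 7.0 - effect + rng.gauss(0, 0.2)
        # Indicator: zero before start_day, one from start_day forward.
        row["retinol"] = 1 if day >= start_day else 0
        row["day"] = day
        rows.append(row)
    return rows

rows = make_synthetic_case()
print(len(rows), sum(r["retinol"] for r in rows))
```

Because the generator is seeded, the same input can be replayed through the procedure as many times as needed, which is what makes a known-ground-truth check possible.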
Step 1: Variable assembly
The engine builds a data matrix with eight columns: seven metric time series plus a retinol indicator (zero before day 10, one from day 10 forward). Ingredients started before day zero of tracking, whose indicator column would be constant, are dropped before the procedure runs. A constant column has zero variance, and the conditional independence tests at the core of the PC algorithm cannot run on it. This is a real limitation, not a workaround. Ingredients the user was already using when Mela first started watching are invisible to causal discovery until they pause and restart. The methodology document flags this. Pretending otherwise would be the wrong move.
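The constant-column rule is easy to show directly. The sketch below assumes snapshots arrive as dictionaries; the function name `assemble_columns` and the always-on `niacinamide` indicator are hypothetical and exist only to show a column being dropped.

```python
from statistics import pvariance

def assemble_columns(snapshots, metric_names, ingredient_names):
    """Build {column_name: values} and drop zero-variance ingredient indicators."""
    columns = {m: [float(s[m]) for s in snapshots] for m in metric_names}
    dropped = []
    for ing in ingredient_names:
        indicator = [float(s.get(ing, 0)) for s in snapshots]
        # A constant indicator has zero variance, so no correlation
        # (and therefore no independence test) can be computed on it.
        if pvariance(indicator) == 0.0:
            dropped.append(ing)
            continue
        columns[ing] = indicator
    return columns, dropped

# Five toy days: retinol starts on day 2, niacinamide was in use throughout.
snapshots = [
    {"texture": 7.0 - 0.1 * d, "retinol": int(d >= 2), "niacinamide": 1}
    for d in range(5)
]
columns, dropped = assemble_columns(snapshots, ["texture"], ["retinol", "niacinamide"])
print(sorted(columns))  # ['retinol', 'texture']
print(dropped)          # ['niacinamide']
```

Returning the dropped names alongside the matrix keeps the limitation visible: the caller can tell the user which ingredients were excluded rather than silently ignoring them.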
Step 2: The PC algorithm
The algorithm comes from Peter Spirtes, Clark Glymour, and Richard Scheines' Causation, Prediction, and Search. It is constraint-based: start from a complete undirected graph over all variables, then prune edges by testing whether each pair is conditionally independent given some subset of the other variables. An edge stays only if every test it is put through rejects independence. An edge is removed as soon as any one test finds its endpoints independent.
The conditional independence test is Fisher's Z. Pearson correlation is computed (partial correlation when conditioning on a single variable), Fisher-transformed to a Z statistic, and the resulting two-tailed p-value is compared against α = 0.05. Independence is declared when p > α; the edge drops.
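The test described above fits in a few functions. This is a minimal standard-library sketch, not the engine's code; the function names are assumptions, and the partial correlation uses the textbook first-order formula for a single conditioning variable.

```python
import math
from statistics import NormalDist

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y given one variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def fisher_z_pvalue(r, n, k=0):
    """Two-tailed p-value for correlation r with n rows and k conditioning variables."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite at |r| = 1
    z = math.atanh(r) * math.sqrt(n - k - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def independent(r, n, k=0, alpha=0.05):
    """Independence is declared when p exceeds alpha, and the edge drops."""
    return fisher_z_pvalue(r, n, k) > alpha

# Rounded |r| for retinol and texture from the test case, n = 30, no conditioning.
print(fisher_z_pvalue(0.661, 30))
print(independent(0.661, 30))  # False
```

With n = 30 the scaling factor is only the square root of 27, which is why modest chance correlations can still clear α = 0.05 at this sample size.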
After pruning, the algorithm orients what's left. Two rules: v-structures (when a chain x – z – y exists, x and y are not adjacent, and z is not in the separating set that disconnected them, orient as x → z ← y) and Meek's rules (iteratively propagate orientation under the constraint that no new v-structures or cycles emerge).
The output is a partially directed acyclic graph. It represents the equivalence class of DAGs consistent with the observed conditional independence pattern. Some edges have unambiguous direction. Others remain undirected because the data alone cannot distinguish between directions that would produce the same independence pattern. The algorithm does not pretend to know more than the data supports. That is the honesty built into the output structure.
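The v-structure rule can be sketched on its own, given a pruned skeleton and the separating sets recorded during pruning. The data layout and the name `orient_v_structures` are assumptions for illustration, and Meek's rules are deliberately left out. The toy skeleton mirrors the shape of the test case: two variables that each touch texture but not each other.

```python
from itertools import combinations

def orient_v_structures(adjacent, sepset):
    """
    adjacent: dict mapping node -> set of neighbours in the undirected skeleton.
    sepset:   dict mapping frozenset({x, y}) -> set of nodes that separated x and y.
    Returns the set of directed edges (tail, head) implied by v-structures.
    """
    directed = set()
    for z in adjacent:
        for x, y in combinations(sorted(adjacent[z]), 2):
            if y in adjacent[x]:
                continue  # x and y are adjacent, so x - z - y is not a candidate
            if z not in sepset.get(frozenset((x, y)), set()):
                # z did not separate x from y, so z is treated as a collider.
                directed.add((x, z))
                directed.add((y, z))
    return directed

skeleton = {
    "retinol": {"texture"},
    "barrier": {"texture"},
    "texture": {"retinol", "barrier"},
}
# retinol and barrier were found independent given the empty set.
sepset = {frozenset(("retinol", "barrier")): set()}
print(sorted(orient_v_structures(skeleton, sepset)))
# [('barrier', 'texture'), ('retinol', 'texture')]
```

The rule orients both arms of the triple at once. It has no notion of which arm reflects a real mechanism, only that the independence pattern is consistent with a collider.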
Step 3: Bootstrap stability
One run of the PC algorithm on 30 data points is not enough. Statistical tests at small sample sizes are noisy. An edge that appears in one run may not appear in another run on slightly different data. The engine handles this with bootstrap resampling: 20 subsamples of 80% of the rows (with replacement), the full PC procedure re-run on each, and a count of how often each edge appears across the 20 runs.
Stability is count divided by 20. An edge that appears in all 20 runs scores 1.0. An edge that appears in 7 of 20 scores 0.35. Mela's threshold for treating an edge as worth surfacing is bootstrap stability ≥ 0.35 combined with raw correlation strength > 0.3.
The threshold was originally 0.5 and was relaxed to 0.35 earlier this year. The reasoning, preserved in the source code: at sample sizes between 15 and 25, even genuine causal edges have approximately 40 to 50 percent bootstrap recovery. Setting the bar at 0.5 was filtering out real signal alongside noise. Lowering to 0.35 admits real signal at the cost of admitting some noise. Confidence is weighted by mean stability, so weaker edges contribute less.
This is a choice with downstream consequences. A higher threshold produces fewer false positives and more false negatives. The current threshold trades some specificity for sensitivity. That is the right trade for individual modeling at Mela's current data scales. It will be revisited if observed behavior on real user data suggests otherwise.
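The resampling and the two-part filter from this step can be sketched as follows. The edge-discovery routine is passed in as a callable so the sketch stays independent of any particular PC implementation; `toy_discover` is a deliberately crude stand-in (a comparison of group means), not the PC algorithm, and all function names here are assumptions.

```python
import random
from collections import Counter

def bootstrap_stability(rows, discover_edges, n_boot=20, frac=0.8, seed=0):
    """Fraction of resampled runs in which each edge appears."""
    rng = random.Random(seed)
    size = max(1, int(round(frac * len(rows))))
    counts = Counter()
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in range(size)]  # rows drawn with replacement
        counts.update(set(discover_edges(sample)))        # count each edge once per run
    return {edge: c / n_boot for edge, c in counts.items()}

def passes_filters(stability, strength, min_stability=0.35, min_strength=0.3):
    """Both conditions must hold: stability >= 0.35 and |r| > 0.3."""
    return stability >= min_stability and abs(strength) > min_strength

def toy_discover(sample):
    # Stand-in for a full discovery run: report one edge when treated rows
    # average lower texture than untreated rows in this resample.
    on = [r["texture"] for r in sample if r["retinol"] == 1]
    off = [r["texture"] for r in sample if r["retinol"] == 0]
    if on and off and sum(on) / len(on) < sum(off) / len(off):
        return [("retinol", "texture")]
    return []

rows = [{"texture": 7.0, "retinol": 0}] * 10 + [{"texture": 6.0, "retinol": 1}] * 20
print(bootstrap_stability(rows, toy_discover))
print(passes_filters(0.35, 0.661), passes_filters(0.30, 0.661))  # True False
```

An edge seen in 7 of 20 resamples scores exactly 0.35 and passes the stability half of the filter; one seen in 6 of 20 scores 0.30 and does not.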
Step 4: The output
The engine runs all of the above on the synthetic test case. The output:
Strong causal edges (passing both filters):
- retinol → texture (strength 0.661, p = 3.7×10⁻⁵, bootstrap 0.65)
- barrier → texture (strength 0.562, p = 9.6×10⁻⁴, bootstrap 0.75)
Tier assigned: observed (n = 30 ≥ 25 threshold; confidence 0.87 ≥ 0.60 threshold).
Surface text generated for the user:
"In your data, your retinol looks like it's smoothing your texture. Doesn't prove cause, but the pattern is strong."
Bootstrap variability across reruns: the procedure is stochastic. Bootstrap resampling draws different subsamples on different runs, so the stability scores fluctuate. Across five runs of the same 30-day input with different random seeds, retinol → texture stability ranged from 0.60 to 0.70, barrier → texture from 0.40 to 0.70, and confidence from 0.62 to 0.84. The tier ("observed") was stable across all five runs. This is consistent with the 0.35 threshold being calibrated to absorb single-run jitter while still rejecting low-recovery noise. Any threshold above 0.60 would have rejected the engineered ground truth on at least one of those runs.
Two things to notice in the output.
First, the engineered ground-truth edge was recovered. The algorithm did the job it was designed to do.
Second, and more interesting, a spurious directed edge appears in the output. The engine reports barrier → texture with bootstrap stability 0.75, above threshold, in the strong-causal set. This edge was not engineered. Barrier was generated as pure noise with mean 7.0 and standard deviation 0.2. Its appearance as a directed edge is an artifact of how the PC orientation step handled the test case. With retinol genuinely connected to texture, and barrier correlating with texture by chance at p = 0.001, the orientation step produced a v-structure that pointed barrier toward texture.
The algorithm is correct that something is statistically separable in the barrier-texture relationship. The algorithm is also wrong, in a sense that matters, about what the relationship means. Barrier did not cause texture in the test case. It correlated with texture because both vary, and at small samples some chance correlations rise above the noise floor. The orientation step has no way to know this. It does what the data structure permits.
What the engine does next is the part that matters for what users see. The user-facing surface text reports only the top ingredient edge, retinol → texture, which has the strongest correlation in the output (|r| = 0.661). It does not have the highest bootstrap stability: the spurious barrier → texture edge scored 0.75 against its 0.65. That spurious edge stays in the full graph output (visible in the methodology drawer if a user wants to look) but is not surfaced as a user-facing claim. The user is told what the engine is most confident about among ingredient edges, with the hedge that Mela's behavioral commitments require.
The hedge is the load-bearing part. "Doesn't prove cause, but the pattern is strong." Not because anyone wrote a brief asking for honesty. Because Pearl's framework is explicit: observational data on their own can establish association, sometimes strong association, but not cause. Establishing causation takes intervention or assumptions the data cannot check, and Mela cannot run interventions on its users. What Mela can do is observe, weigh, and report what observation supports. Labeled honestly.
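The surfacing step can be sketched as a filter, a ranking, and a fixed hedge. The edge values are the two from the output above. The ranking rule (largest |r| among ingredient edges), the dictionary layout, and the verb table are assumptions for illustration, since only the texture phrasing appears in this piece.

```python
EDGES = [
    {"src": "retinol", "dst": "texture", "strength": 0.661, "stability": 0.65},
    {"src": "barrier", "dst": "texture", "strength": 0.562, "stability": 0.75},
]
INGREDIENTS = {"retinol"}  # barrier is a skin metric, not an ingredient
EFFECT_VERBS = {"texture": "smoothing"}  # hypothetical lookup table
HEDGE = "Doesn't prove cause, but the pattern is strong."

def surface_text(edges, ingredients, min_stability=0.35, min_strength=0.3):
    """Return hedged copy for the top ingredient edge, or None to stay silent."""
    candidates = [
        e for e in edges
        if e["src"] in ingredients
        and e["stability"] >= min_stability
        and abs(e["strength"]) > min_strength
    ]
    if not candidates:
        return None  # nothing passes both filters: say nothing
    top = max(candidates, key=lambda e: abs(e["strength"]))
    verb = EFFECT_VERBS.get(top["dst"], "affecting")
    return f"In your data, your {top['src']} looks like it's {verb} your {top['dst']}. {HEDGE}"

print(surface_text(EDGES, INGREDIENTS))
```

The hedge is a constant appended to every surfaced statement rather than something the ranking can switch off, so a stronger edge never earns a stronger causal claim.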
Full edge data (for readers who want the raw output, including the weak edges the engine filtered out):
| from | to | type | direction | strength (\|r\|) | p-value | bootstrap | causality class |
|---|---|---|---|---|---|---|---|
| retinol | texture | directed | negative | 0.661 | 3.7×10⁻⁵ | 0.65 | sprt_confirmed |
| barrier | texture | directed | negative | 0.562 | 9.6×10⁻⁴ | 0.75 | sprt_confirmed |
| texture | retinol | undirected | negative | 0.661 | 3.7×10⁻⁵ | 0.45 | correlational |
| texture | barrier | undirected | negative | 0.562 | 9.6×10⁻⁴ | 0.15 | correlational |
| glow | retinol | undirected | positive | 0.420 | 2.0×10⁻² | 0.15 | correlational |
(Negative direction here means the raw value of the metric goes down. Texture is an inverted-scale metric in Mela's framework, where lower is better, since "texture" tracks roughness. A negative correlation with retinol presence indicates improvement.)
What this means
The procedure above is what produces the difference between Mela saying "your azelaic acid works" and Mela saying "in your data, your azelaic acid looks like it's fading your pigmentation. Doesn't prove cause, but the pattern is strong." The first is the claim consumer skincare typically makes. The second is the claim the procedure actually supports.
The difference is not a marketing decision. It is a methodological consequence. The first asserts something the data cannot establish. The second reports what the data supports under the conditions of observation. A user who builds her routine on the first has been given an inferential gift Mela cannot honestly deliver. A user who builds on the second has been given what she actually has: a pattern strong enough to act on, hedged appropriately, with the underlying procedure available for inspection.
This is not a claim that observational causal inference is sufficient for medical claims. It isn't. Mela does not make medical claims. The claim is narrower. For the kind of statement Mela does make ("in your data, this ingredient appears to be doing this thing"), the procedure above is what allows the statement to be made honestly, with calibrated uncertainty, on individual longitudinal data.
What the approach does not eliminate: unobserved confounders. PC's conditional independence machinery rules out common causes among the observed variables. If something not in the variable set is driving both the ingredient adherence and the outcome (a seasonal effect, a stress-correlated lifestyle shift, a coincidence of timing with another routine change the user didn't report), the algorithm cannot see it. The procedure does as well as observational data can do. Observational data has limits that observational machinery cannot resolve.
What remains open: the threshold choices documented above (0.35 bootstrap stability, 0.3 minimum strength, α = 0.05) are defensible at current sample sizes but are calibrated against synthetic data and small early cohorts. As Mela's user base grows and the engine accumulates real causal-discovery runs at larger sample sizes, these thresholds will be revisited. Quarterly audits will check whether observed false-positive and false-negative rates remain calibrated. Corrections will be issued if they don't.
A specific takeaway
A pattern that survives conditional independence testing, survives bootstrap resampling, and survives the correlation-strength filter is worth surfacing to a user, labeled honestly, with the procedure that produced it available for inspection. A pattern that fails any of those filters is not. Users are owed both the surfacing of strong patterns and the silence on weak ones.
The procedure described here will inform future Field Observations pieces that document what patterns the engine is actually finding on real anonymized user data, what fraction survive the filters, and what the field can learn from running this kind of inference at scale. Those pieces will use real engine output rather than synthetic test cases. They will report results that may include negative findings: patterns the engine expected to find and didn't, patterns the procedure rejected that field intuition would have endorsed. Mela Field Notes commits to publishing negative results alongside positive ones.
This procedure is one component of a longer methodological agenda Mela's engine is built around. Future Methodology lane pieces will document the surrounding components: how Mela handles cold-start when individual data is sparse, how confidence calibration sharpens over the first sixty days of a user's data, how the engine separates a true regime change in a user's skin from an ordinary daily fluctuation. Each is a methodological choice with downstream consequences for what gets said to users. Each is worth documenting publicly. The alternative is methodology behind a curtain, which is what made skincare's claim culture untrustworthy in the first place.
References
- Fountzilas, E., Tsimberidou, A. M., Vo, H. H., & Kurzrock, R. (2022). Clinical trial design in the era of precision medicine. Genome Medicine, 14(1), 101. https://doi.org/10.1186/s13073-022-01102-1
- Glass, G. E. (2020). Cosmeceuticals: The principles and practice of skin rejuvenation by nonprescription topical therapy. Aesthetic Surgery Journal Open Forum, 2(4), ojaa038. https://doi.org/10.1093/asjof/ojaa038
- Hernán, M. A., & Robins, J. M. (2016). Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology, 183(8), 758–764. https://doi.org/10.1093/aje/kwv254
- Lin, S. et al. (2025). Network meta-analysis of randomized controlled trials on topical retinoids for photoaging. Scientific Reports. https://doi.org/10.1038/s41598-025-12597-0
- Mavranezouli, I. et al. (2022). A systematic review and network meta-analysis of topical pharmacological, oral pharmacological, physical and combined treatments for acne vulgaris. British Journal of Dermatology, 187(5), 639–649. https://doi.org/10.1111/bjd.21739
- Milosheska, D., & Roškar, R. (2022). Use of retinoids in topical antiaging treatments: A focused review of clinical evidence for conventional and nanoformulations. Advances in Therapy, 39(12), 5351–5375. https://doi.org/10.1007/s12325-022-02319-7
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press. https://www.cambridge.org/9780521895606
- Samuel, J. P. et al. (2023). N-of-1 trials vs. usual care in children with hypertension: A pilot randomized clinical trial. American Journal of Hypertension, 36(2), 126–132. https://doi.org/10.1093/ajh/hpac117
- Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search (2nd ed.). MIT Press. https://mitpress.mit.edu/9780262194402/causation-prediction-and-search/
- Zuo, H., Yu, L., Campbell, S. M., Yamamoto, S. S., & Yuan, Y. (2023). The implementation of target trial emulation for causal inference: A scoping review. Journal of Clinical Epidemiology, 162, 29–37. https://doi.org/10.1016/j.jclinepi.2023.08.003
Bibliographic data for all PubMed-indexed citations was verified against the National Library of Medicine's PubMed database at the time of authorship. If any citation here does not match what can be verified on PubMed, that's a bug, not a feature; corrections are welcomed at the email address above.
Educational information, not medical advice.