Glass Box UMAP: A Python package for interpretable UMAP via exact feature contributions

James R. Golden; Evan Kiefl; Ryan York

doi:10.57844/arcadia-4ye8-8tun

Purpose

Uniform manifold approximation and projection (UMAP) is widely used to embed high-dimensional data into lower-dimensional manifolds. However, the nonlinear mapping learned by UMAP makes it difficult to understand which input features are responsible for the position of any particular point, limiting its analytical utility.

We built a Python package called Glass Box UMAP (“glass-box-umap” on PyPI) that enables interpretation of the exact feature contributions within UMAP embeddings. We're sharing it for scientists and data practitioners who use UMAP to explore high-dimensional datasets and want to understand which measured variables organize the structure they see.

Statement of need

UMAP is used to visualize high-dimensional data on a manifold, revealing clusters, gradients, neighborhoods, and outliers associated with patterns of variation 1. Unfortunately, the nonlinear mapping it learns makes the embeddings difficult to interpret. A researcher may see separated clusters or local substructure, but still not know which measured variables placed a point where it appears. Linear methods like PCA have the opposite trade-off: they're interpretable but can't separate complex nonlinear structures. Methods that combine the interpretability of linear approaches with the expressive power of nonlinear ones would therefore be valuable.

One approach is to explain model outputs post hoc with attribution methods such as SHAP 2, LIME 3, and Grad-CAM 4, as well as tools for dimensionality reduction like MCML 5. These methods are valuable but produce linear approximations: attributions come from local sampling or surrogate models rather than from the embedding function itself, and they aren't guaranteed to sum to the coordinate they aim to explain. The nonlinear components that often make the representation useful are simply discarded.

We created Glass Box UMAP to preserve these nonlinear components while making their relationships with input features interpretable. In a previous pub 6, we introduced the fundamental principles behind the method. It uses parametric UMAP 7 to train a neural network that maps inputs to embedding coordinates. The network architecture is constrained to be homogeneous of order 1 at inference, meaning that scaling an input by a positive constant scales its embedding by the same constant, so that the learned embedding function can be decomposed exactly 8–10. The embedding of each sample can then be written as a sum of feature-level contributions:

$$z_i = \sum_{j=1}^{p} c_{ij}$$

where $z_i$ is the embedding of sample $i$, $p$ is the number of input features, and $c_{ij}$ is the contribution of feature $j$ to that sample's embedding coordinates. Equivalently, for each embedding dimension $k$, $z_{ik} = \sum_{j=1}^{p} c_{ijk}$.
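One standard route to a decomposition of this kind is Euler's identity for functions that are homogeneous of order 1: wherever the function $f$ is differentiable, $f(x) = J(x)\,x$, with $J(x)$ the Jacobian at $x$, so that $c_{ijk} = J_{kj}(x_i)\,x_{ij}$. The sketch below checks this numerically on a toy bias-free ReLU network. The weights, layer sizes, and function names are illustrative assumptions, and the package's own encoder and attribution code may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bias-free ReLU encoder (an assumption for illustration):
# 5 input features -> 8 hidden units -> 2 embedding dimensions.
# With no bias terms, encode(a * x) == a * encode(x) for any a > 0.
W1 = rng.normal(size=(8, 5))
W2 = rng.normal(size=(2, 8))


def encode(x):
    """Forward pass of the toy encoder for one sample x of shape (5,)."""
    return W2 @ np.maximum(W1 @ x, 0.0)


def local_jacobian(x):
    """Jacobian of encode at x, shape (2, 5).

    ReLU acts as a 0/1 mask on the hidden units, so on the linear region
    containing x the network equals the matrix W2 @ diag(mask) @ W1.
    """
    mask = (W1 @ x > 0).astype(float)
    return (W2 * mask) @ W1


x = rng.normal(size=5)
z = encode(x)

# contributions[k, j] = J[k, j] * x[j]: the share of embedding
# dimension k attributable to input feature j.
contributions = local_jacobian(x) * x

# Summing over features should recover the embedding up to rounding error.
reconstruction_error = np.max(np.abs(contributions.sum(axis=1) - z))

# Scaling the input by 3 should scale the embedding by 3.
homogeneity_error = np.max(np.abs(encode(3.0 * x) - 3.0 * z))
```

Adding a bias term to either layer would break both checks, which is why the constraint on the architecture matters.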

To illustrate the theory, we previously analyzed a single-cell gene expression dataset and showed that exact feature contributions revealed which genes shaped the positions of individual cells in the embedding, offering a direct view of what UMAP had learned 6. However, our analysis was written in a bespoke Jupyter Notebook, with no stable, documented, reusable interface. Thus, until now, there's been no general-purpose software package that allows users to fit glass-box UMAP models and calculate exact feature contributions in their own datasets.

We created Glass Box UMAP to fill this need by providing an installable Python package that fits interpretable UMAP-like embeddings and computes exact per-feature contributions. It’s designed for researchers who want interpretable, nonlinear embeddings.

Software design

Following the lead of umap-learn 11, the authoritative Python implementation of UMAP, Glass Box UMAP adopts an API convention borrowed from scikit-learn 12, in which models are represented as classes with standard methods like fit and transform. The primary class in the package is GlassBoxUMAP, which, in simple cases, is a drop-in replacement for the UMAP object in umap-learn.
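For readers unfamiliar with the scikit-learn convention, the sketch below shows its general shape with a hypothetical stand-in class, ToyLinearEmbedder, which simply projects onto principal components. It is not GlassBoxUMAP and makes no claim about that class's constructor arguments or internals. It only illustrates the pattern of fitting once and then transforming new data.

```python
import numpy as np


class ToyLinearEmbedder:
    """Hypothetical stand-in that follows the fit/transform convention.

    It projects data onto the top principal components. It shares
    nothing with GlassBoxUMAP beyond the method names.
    """

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # Rows of vt are principal directions, ordered by explained variance.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self  # returning self allows chained calls

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) @ self.components_.T

    def fit_transform(self, X):
        return self.fit(X).transform(X)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 13))  # synthetic data, 13 features
X_new = rng.normal(size=(10, 13))

model = ToyLinearEmbedder(n_components=2).fit(X_train)
train_embedding = model.transform(X_train)
new_embedding = model.transform(X_new)  # reuses the fitted mapping on unseen data
```

Because the method names are shared across estimators, code written against this pattern usually needs only the class name changed when one estimator is swapped for another.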

The encoder stack is built with PyTorch 13 and PyTorch Lightning 14. PyTorch Lightning handles the training loop, device placement, checkpointing, and custom callbacks. Users can take advantage of GPU acceleration when available without changing the API.

The package separates three conceptual stages: embedding, attribution, and plotting. The embedding stage trains a parametric UMAP model using a neural network architecture designed to support exact decomposition into local linear representations. The attribution stage computes per-sample feature contributions using the model's local linear mapping, returning a feature contribution array of shape (n_samples, n_embedding_dimensions, n_features) that can be used however the user sees fit. Summing across the feature axis recovers the embedding coordinates exactly, so the reconstruction property is easy to verify on any fit. The package provides visualization utilities as an optional layer for users who want to explore embeddings using out-of-the-box interactive solutions (Figure 1). We designed these plotting utilities as opt-in in order to maintain a minimal core dependency set.
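The reconstruction property can be checked with a few lines of NumPy once a contribution array of the shape described above is in hand. The sketch below builds such an array from a random bias-free ReLU network standing in for a fitted encoder. All names and sizes here are assumptions for illustration, and nothing is taken from the package's API.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features, n_hidden, n_dims = 6, 4, 16, 2

# Stand-in for a fitted encoder: a random bias-free ReLU network.
W1 = rng.normal(size=(n_hidden, n_features))
W2 = rng.normal(size=(n_dims, n_hidden))
X = rng.normal(size=(n_samples, n_features))

pre_activation = X @ W1.T                          # (n_samples, n_hidden)
embedding = np.maximum(pre_activation, 0.0) @ W2.T  # (n_samples, n_dims)

# Per-sample local linear map: J[i] = W2 @ diag(mask[i]) @ W1.
mask = (pre_activation > 0).astype(float)
jacobians = np.einsum("kh,ih,hj->ikj", W2, mask, W1)

# Scale each Jacobian column by the matching input value.
# Result has shape (n_samples, n_embedding_dimensions, n_features).
contributions = jacobians * X[:, None, :]

# Reconstruction check: summing over the feature axis should give
# back the embedding coordinates up to rounding error.
max_error = np.abs(contributions.sum(axis=-1) - embedding).max()

# One possible per-sample summary: the feature with the largest total
# absolute contribution across embedding dimensions.
strength = np.abs(contributions).sum(axis=1)        # (n_samples, n_features)
top_feature = strength.argmax(axis=1)               # (n_samples,)
```

The absolute-value ranking is only one choice of summary. Signed contributions carry direction as well as magnitude, so other summaries may suit other questions.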

Example application

Since feature contributions are still an emerging object of analysis, the package doesn't prescribe a single interpretation workflow. Instead, the documentation presents a range of example datasets that span different structures and domains, including tabular, image-derived, synthetic-manifold, and gene-expression examples. These examples show how users can reason about contribution tensors in different contexts while making clear that best practices for interpreting Glass Box UMAP outputs will continue to evolve with community use.

In Figure 1, we include one particular dataset as a compact example. The dataset contains 178 wines produced from three cultivars, with 13 chemical measurements for each wine. The features in this dataset are few enough and familiar enough to interpret directly, including chemical measurements such as proline, flavonoids, color intensity, ash, alcohol, and magnesium.

Figure 1. Interactive exploration of feature contributions in a Glass Box UMAP embedding of the UCI Wine dataset.

Points represent wines colored by cultivar. The linked bar chart summarizes which chemical features contribute most to the selected points' positions in the embedding. In this example, proline helps hold cultivar 0 together, flavonoids act as a signed polarity feature pushing cultivar 0 toward one end of the embedding and cultivar 2 toward the other, color intensity is prominent in cultivar 2, and ash explains local substructure within the rightmost portion of cultivar 0. The full walkthrough is available in the package documentation.
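A group-level summary like the one in the linked bar chart can be approximated by averaging signed contributions over the selected points. The sketch below uses a synthetic contribution array and made-up feature names, not the wine data. The helper summarize_selection is hypothetical, and the aggregation used by the package's plotting utilities may differ.

```python
import numpy as np


def summarize_selection(contributions, selected, feature_names, dim=0):
    """Mean signed contribution of each feature to one embedding dimension.

    contributions: array of shape (n_samples, n_dims, n_features)
    selected: boolean mask of shape (n_samples,)
    Returns (name, mean contribution) pairs sorted by absolute magnitude.
    """
    mean_signed = contributions[selected, dim, :].mean(axis=0)
    order = np.argsort(-np.abs(mean_signed))
    return [(feature_names[j], float(mean_signed[j])) for j in order]


rng = np.random.default_rng(2)
feature_names = ["feature_a", "feature_b", "feature_c"]  # made-up names
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Synthetic contributions: small noise everywhere, plus one feature that
# pushes the two groups in opposite directions along dimension 0.
contributions = 0.1 * rng.normal(size=(8, 2, 3))
contributions[labels == 0, 0, 1] += 2.0
contributions[labels == 1, 0, 1] -= 2.0

summary_group_0 = summarize_selection(contributions, labels == 0, feature_names)
summary_group_1 = summarize_selection(contributions, labels == 1, feature_names)
```

In this synthetic case, feature_b ranks first for both groups but with opposite signs, which is the pattern described above as a signed polarity feature.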

AI usage

We used Claude (Opus 4.6 and 4.7) to help write code, clean up code, comment our code, and write text that we edited.

Key takeaways

UMAP is a popular method for visualizing high-dimensional data in two dimensions, but it's historically been a black box: you can see clusters and patterns, but you don't know why. Glass Box UMAP addresses this by producing UMAP-like embeddings where every point's position can be broken down into exact contributions from each input feature. If you work with high-dimensional data and use UMAP to explore it, this package lets you go beyond "these points cluster together" to "these points cluster together because of these specific features." The theoretical foundations are described in our previous pub 6, and the package itself is designed as a near drop-in replacement for the standard UMAP Python library, so it's straightforward to integrate into existing workflows.

Glass Box UMAP is and will always be open source. The documentation is hosted on Read the Docs and the codebase is hosted on GitHub (DOI: 10.5281/zenodo.20220755).

Glass Box UMAP transforms UMAP into an interpretable embedding workflow by decomposing each point's coordinates into exact contributions from the original input features, letting users identify which variables drive clusters, gradients, and outliers in high-dimensional data.
