The CUNY Data Science and Applied Topology Reading Group is joint between the Mathematics and Computer Science programmes. We meet Fridays 11.45 -- 12.45 in GC 4419. You can contact us at cunygc@appliedtopology.nyc.

Our plan is to primarily read and discuss seminal papers in data science, in applied topology and in topological data analysis. Each seminar one participant takes the responsibility to present a paper and prepare items for discussion. We expect occasionally to be able to invite external speakers.

## Schedule

Current schedule can be found here.

We will be sending out announcements through a mailing list; you can subscribe here.

## Organizers

- Mikael Vejdemo-Johansson, Computer Science Programme, CUNY Graduate Center; Department of Mathematics, CUNY College of Staten Island
- Azita Mayeli, Mathematics Programme, CUNY Graduate Center; Department of Mathematics, CUNY Queensborough Community College

## Suggested papers

We have compiled a list of papers that might be interesting to present.

# Schedule

### Semester Introduction

We will start off this semester with a light introduction to the basic techniques and ideas in Topological Data Analysis, and a discussion of possible content for this semester's seminars.

### Piecewise Linear Manifold Clustering

Clustering is commonly viewed as an ill-defined problem because there is not clear performance criteria which can tell about relevant element arrangements withing the clustering. Devising such evaluation criteria usually based on geometrical and/or structural composition of the original data. We present a novel algorithm that incorporates a topological structure and information-theoretical description of the clustered data to recognize and evaluate stable nonlinear structures in form of a piecewise linear manifolds as a meta-extension to common statistical clustering algorithms.

### Multiple hypothesis testing with persistent homology

Based on joint work with Sayan Mukherjee, we will talk about the need for multiple hypothesis corrections - both for Family-Wise Error Rates (decrease the risk of even a single false positive) and for False Discovery Rates (fix the rate of false positives at a chosen level) and how to correct both types in hypothesis testing using persistent homology.

### Topology and the Machine Mind

Using standard tools from topological data analysis and basic notions from topology, we explore new ways of understanding how a machine-learning algorithm utilizes the data it is given. This understanding can aid in the computability of a model and offers insight into additional, targeted input for improving its efficacy.

### Computing Minimal Presentations and Bigraded Betti Numbers of 2-Parameter Persistent Homology

Motivated by applications to topological data analysis, we give an efficient algorithm for computing a (minimal) presentation of a bigraded $K[x,y]$-module $M$, where $K$ is a field. The algorithm takes as input a short chain complex of free modules $\displaystyle F_2\xrightarrow{\partial_2} F_1 \xrightarrow{\partial_1} F_0$ such that $M\cong \ker{\partial_1}/\im{\partial_2}$. It runs in time $O(\sum_i |F_i|^3)$ and requires $O(\sum_i |F_i|^2)$ memory, where $|F_i|$ denotes the size of a basis of $F_i$. We observe that, given the presentation computed by our algorithm, the bigraded Betti numbers of $M$ are readily computed. These algorithms have been implemented in RIVET, a software tool for the visualization and analysis of two-parameter persistent homology. In experiments on topological data analysis problems, our approach outperforms the standard computational commutative algebra packages Singular and Macaulay2 by a wide margin.

### Cellular Sheaves, Discrete Hodge Theory, and Applications

Cellular sheaves are a discrete and computable instantiation of the theory of sheaves over cell complexes, and naturally model many situations where data of varying types is parameterized by a space. Assigning inner products to the stalks of a cellular sheaf allows us to apply the constructions of discrete Hodge theory to its cochain complex. We obtain generalizations of the discrete Hodge Laplacians, including the ubiquitous graph Laplacian. These operators have interesting topological and geometric properties, and allow us to use the local-to-global structure of cellular sheaves in real-world systems where robustness is essential. This talk will outline the construction of cellular sheaves and their Laplacians, and sketch various avenues for their application to realistic engineering and scientific problems.

### Singularity in Data Analysis

Statistical data by their very nature are indeterminate in the sense that if one repeated the process of collecting the data the new data set would be somewhat different from the original. Therefore, a statistical method, f, taking a data set x to a point in some space F, should be stable at x: Small perturbations in x should result in a small change in f(x). Otherwise, f is useless at x or -- and this is important -- near x. So one doesn't want f to have "singularities", a data set x such that the the limit of f(y) as y approaches x doesn't exist. (Yes, the same issue arises elsewhere in applied math.)

However, broad classes of statistical methods have topological obstructions of continuity: They must have singularities. In this talk I will show why and give lower bounds on the Hausdorff dimension, even Hausdorff measure, of the set of singularities of such methods. I will give examples.

### No Meeting

No Meeting today - relax after Halloween.

### Coding and Generative Design for 3D Printing

3D printing and design allows us to physically experience complex mathematical objects. In this talk we’ll take a 3D-printed tour of mathematical knots, tessellations, fractals, and polyhedra. Using code and generative design we can create parametric models that leverage randomness to achieve structural variety or even organic-looking behavior. We’ll also talk about iterative design, the ability to “learn by failing,” and the importance of being open to sharing that process, both in the 3D design process and in mathematical exploration.

(I also have two "homework" assignments for anyone that wants to try things themselves afterwards)

Bio: Laura Taalman is a Professor of Mathematics at James Madison University whose research has included algebraic geometry, knot theory, and games. Also known as “mathgrrl”, Dr. Taalman is a computational designer who leverages a diverse toolbox of 3D design software and technical materials to create elegant and aesthetic realizations of idealized mathematical objects. She is a Project NExT Fellow, a recipient of the Alder Award, Trevor Evans Award, and SCHEV Outstanding Faculty Award, and has been featured on Thingiverse, Adafruit, and Science Friday.

### Iterated Integrals and Paths of Persistence Diagrams

Path signatures are a reparametrization-invariant characterization of paths, which are defined via iterated integrals. More generally, path signatures form the 0-cochains of Kuo-Tsai Chen's iterated integral cochain model of path spaces. By considering multivariate time series as a path through Euclidean space, we can leverage the path signatures as a reparametrization-invariant feature set for time series. In this talk, we will introduce the path signature, consider its application to studying paths of persistence diagrams (persistence vineyards), and briefly discuss how Chen's perspective can lead to generalizations.

### A new data structure for the epsilon-approximate range emptiness problem

Constructing space-efficient data structures for answering approximate membership queries is a well-studied problem and one that is increasingly important in the era of big data. That is, given a set $S$ of $n$ elements from a large universe $[U] = { 0,2, ..., U -1 }$, preprocess $S$ so as to answer queries of the form, is $x \in S$? Since $|S|$ is prohibitively large to fit directly into RAM, the approach is to give a sketch, or succinct summary, of the set that, while removing elements from S, gives good answers to specified queries. Hence, we allow such data structures to return a small fraction false positives in return for significant savings in space.

This talk will review a new data structure, along with it techniques and lower bounds that generalizes the functionality from single point queries to 1-D queries of intervals of length $L$. Known as the $\epsilon$-approximate range emptiness problem, the paper by Mayank Goswami, Allan Grølund, Kasper Green Larsen, and Rasmus Pagh show that the space/error trade-off of a naive approach of querying a traditional Bloom filter $L$ times cannot be improved asymptotically: Any data structure for answering approximate range emptiness queries on intervals of length up to $L$ with false positive probability $\epsilon$, must use space $\Omega(n \lg(L/\epsilon)) − O(n)$ bits. While this does seem unfortunate, their data structure does answer such queries in $O(1)$ time.

### Planning: High Dimensional Probability: an introduction with application to data analysis

In the coming spring, the seminar will be reading "High Dimensional Probability: an introduction with application to data analysis" together.

The book will be introduced, and chapters assigned to anyone interested in speaking in that reading group.

### Tensor Networks and Language Models

Unsupervised machine learning algorithms based on tensor networks can provide an excellent inductive bias for generative language models. I’ll share some ideas about this, including a big-picture overview and a mathematical “look under the hood” at a training algorithm based on the density matrix renormalization group (DMRG) procedure, which helps to explain the performance of these models. I’ll start with some of this motivation, then describe an elementary passage from classical to quantum probability theory, and—after a brief introduction to tensor network diagrams—use the passage as inspiration for taking a deeper look at the DMRG training algorithm.