Working Group FDA
Please contact Fabian Scheipl if you’re interested in one of these BA or MA thesis topics or if you want to discuss related ideas of your own.
Last update: 2024-03-26
tidyfun is an R package for functional data analysis currently under
development. Some of the issues tracked on GitHub for this and its
underlying infrastructure package tf could also be good topics for
theses.
For BA theses, we would keep the focus on refactoring, evaluating or
describing existing implementations or applying them to real data. For
MA theses, more novel developments and detailed theory along with clean
and performant implementations would be expected as well.
The functional data literature contains many possible definitions of
“function-valued quantiles”. We would pick out some of the most
relevant/interesting of these, summarize the relevant theory behind
them, implement them for use within tidyfun, and perform a comparison
based on real and/or synthetic data sets.
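To illustrate why the choice of definition matters, here is a minimal sketch (plain Python for illustration, not tidyfun code) that contrasts two common notions of a "functional median": the pointwise median and the deepest observed curve under the modified band depth with bands formed by pairs of curves. All function names are hypothetical, and curves are assumed to be evaluated on a common grid.

```python
from math import comb
from statistics import median


def pointwise_median(curves):
    """Median of the observed values at each grid point."""
    return [median(values) for values in zip(*curves)]


def modified_band_depth(curves):
    """Modified band depth of each curve, using bands spanned by pairs of curves."""
    n = len(curves)
    n_grid = len(curves[0])
    depths = [0.0] * n
    for values in zip(*curves):  # loop over grid points
        for i, v in enumerate(values):
            below = sum(w < v for w in values)
            above = sum(w > v for w in values)
            # A pair's band [min, max] misses v only if both curves lie
            # strictly below or both strictly above v at this grid point.
            inside = comb(n, 2) - comb(below, 2) - comb(above, 2)
            depths[i] += inside / comb(n, 2)
    return [d / n_grid for d in depths]


def depth_median(curves):
    """The observed curve with the largest modified band depth."""
    depths = modified_band_depth(curves)
    return curves[max(range(len(curves)), key=depths.__getitem__)]


# Three crossing curves on [0, 1]: rising, falling, and constant.
grid = [k / 20 for k in range(21)]
curves = [[t for t in grid], [1 - t for t in grid], [0.42 for _ in grid]]
```

For these crossing curves, the depth-based median is one of the observed functions (the constant one), whereas the pointwise median is pieced together from several curves and is not itself an observation.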
A minimal BA thesis in this topic area would be re-implementing,
documenting and validating (most of) the methods in the rainbow
package integrated into / as an add-on package for tf & tidyfun.
Functional data contains both vertical (amplitude - how large is the peak/valley) and horizontal (phase - where is the peak/valley) variability. The latter requires more sophisticated mathematical theory and complex algorithms to deal with. Potential tasks here include:
- defining & implementing additional data structures, classes & methods to represent & visualize aligned functions along with their corresponding warping functions
- writing glue code for using registration packages like fdasrvf, registr, DTW methods with tf vectors
- … or (re-)implementing (simpler) alignment methods (like fda’s landmark alignment or alignment based on FPC 1 (“continuous registration”))
- implementation of summary statistics, visualizations, diagnostics etc for the results of registration/alignment procedures
Stretch goals here include implementing methods for noisy and/or sparse and/or non-Gaussian/discrete functional data and accommodating functional fragments/unequal domains with functions of different observed lengths. An excellent review of (mostly) the SRVF framework: Wu et al. (2023, ch. 3 f.)
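As a toy illustration of the amplitude/phase distinction, the following sketch (plain Python, not based on tf or any registration package) applies an amplitude change and a warping function to the same template curve. The helper names, the template and the warping $h(t) = t^2$ are assumptions made up for this example.

```python
import math


def interpolate(grid, values, t):
    """Linear interpolation of the points (grid, values) at a single location t."""
    for k in range(len(grid) - 1):
        t0, t1 = grid[k], grid[k + 1]
        if t0 <= t <= t1:
            v0, v1 = values[k], values[k + 1]
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t lies outside the grid")


def warp(grid, values, warping):
    """Evaluate f(h(t)) on the grid for a warping function h: [0, 1] -> [0, 1]."""
    return [interpolate(grid, values, warping(t)) for t in grid]


def peak(grid, values):
    """Location and height of the largest observed value."""
    k = max(range(len(values)), key=values.__getitem__)
    return grid[k], values[k]


grid = [k / 200 for k in range(201)]
template = [math.exp(-((t - 0.5) / 0.1) ** 2) for t in grid]  # bump at t = 0.5

amplitude_changed = [1.5 * v for v in template]           # vertical variability
phase_changed = warp(grid, template, lambda t: t ** 2)    # horizontal variability

print(peak(grid, template))
print(peak(grid, amplitude_changed))
print(peak(grid, phase_changed))
```

The amplitude change moves the peak height but not its location, while the warping moves the peak location (to roughly $\sqrt{0.5}$) and leaves its height essentially unchanged. A data structure for registered functions needs to store both the aligned curve and its warping function.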
Extend tf-classes and methods for
- multivariate functions with vector outputs ($f:\mathbb R \to \mathbb R^d$ for $d>1$)
- scalar fields ($f:\mathbb R^q \to \mathbb R$ for $q>1$)
This is a large SWE task - scope would probably be one of the above,
limited to either extending tfd or tfb, and may require major
refactoring of tf to make such an extension work smoothly and
consistently (e.g. it probably requires defining new classes and logic
for arg-“vectors” and function domains).
The Bayes space paradigm developed by v.d. Boogaart, Hron, Egozcue and
others (e.g. v.d. Boogaart et al. (2014), Hron et al. (2016))
provides a way to represent probability measures so that their addition
and scalar multiplication are well defined, enabling simple summary
statistics (means etc) as well as methods such as PCA or linear
regression for probability-density-valued data – i.e. the unit of
observation is represented by an entire probability distribution, not a
single value, and the inferential goal is typically to understand how
other covariates are associated with changes in these distributions.
This has many interesting applications, for example see Meier et al.
(2021) for differential effects of family formation on gender-specific
income distributions in East and West Germany or Menafoglio et al.
(2021) for an application to groundwater monitoring.
A thesis on this topic would
- summarize the necessary theoretical background and literature
- implement functionality for tf and tidyfun that represents density data and performs arithmetic operations as well as basic statistics in Bayes space,
- apply this to an interesting real-world data set (or: replicate a published analysis in this context with the new implementation).
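A rough sketch of what "arithmetic in Bayes space" means for densities on a bounded interval, in plain Python for illustration only: addition is the renormalized pointwise product ("perturbation"), scalar multiplication is the renormalized pointwise power ("powering"), and the centered log-ratio (clr) transform maps densities to ordinary functions in which means can be computed as usual. The function names and the simple Riemann-sum normalization on an equidistant grid are assumptions of this sketch, not an existing tf/tidyfun interface.

```python
import math


def normalize(values, step):
    """Rescale positive grid values so that their Riemann sum equals 1."""
    total = sum(values) * step
    return [v / total for v in values]


def perturb(f, g, step):
    """Bayes-space addition: pointwise product, renormalized."""
    return normalize([a * b for a, b in zip(f, g)], step)


def power(alpha, f, step):
    """Bayes-space scalar multiplication: pointwise power, renormalized."""
    return normalize([a ** alpha for a in f], step)


def clr(f):
    """Centered log-ratio transform: log-density minus its average over the grid."""
    logs = [math.log(a) for a in f]
    center = sum(logs) / len(logs)
    return [v - center for v in logs]


def clr_inverse(z, step):
    """Map a clr-transformed function back to a density."""
    return normalize([math.exp(v) for v in z], step)


def bayes_mean(densities, step):
    """Mean density in Bayes space: pointwise mean of the clr transforms."""
    transformed = [clr(f) for f in densities]
    averaged = [sum(column) / len(column) for column in zip(*transformed)]
    return clr_inverse(averaged, step)


# Two densities on [0, 1], evaluated at interval midpoints.
n = 200
step = 1 / n
grid = [(k + 0.5) * step for k in range(n)]
f = normalize([math.exp(2 * x) for x in grid], step)
g = normalize([math.exp(-x) for x in grid], step)
mean_density = bayes_mean([f, g], step)
```

In this example the Bayes-space mean of `f` and `g` is proportional to $\exp(0.5x)$, i.e. a normalized geometric mean rather than the pointwise arithmetic mean of the two densities, and the uniform density acts as the neutral element of the addition.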
Topics: Write tidyfun scripts for Crainiceanu et al.’s “Functional Data Analysis with R” / Ramsay et al.’s “Functional Data Analysis with R and MATLAB”
Both of these books contain many chapters, data sets and case studies
that could also be done (mostly) using tidyfun and/or refund.
We’ll select some of them, you’ll identify and implement missing
functionality in tidyfun with my help, and write them up with all the
necessary theoretical background and some extensions, in an online
document / as vignettes for tidyfun.
Books: Crainiceanu et al. (2024), Ramsay et al. (2009)
Summarize, implement & evaluate SRVF-based function registration using the peak-persistence diagrams of Kim, Dasgupta, Srivastava (2023). This topic would involve some more advanced and interesting maths and algorithms like differential geometry, topology, dynamic programming optimization. The paper to implement is bleeding edge state of the art, so this makes an excellent topic for people considering a PhD and looking for a thesis topic that might turn into something publishable. Potential tasks would include:
- summarizing the maths behind these methods
- implementation of the algorithms and visualizations from the paper in R, preferably using infrastructure of / integrated into tf/tidyfun
- benchmarking against other registration approaches available in R
- application to real world datasets (e.g. mouse brain stem audiograms, bodyweight fitness movement patterns, …)
Stretch goals would include extending this to either non-Gaussian/discrete functional data or accommodating functional fragments/unequal domains with functions of different observed lengths, based on ideas we’d develop together.
manifun is a small, unpublished R package for dimension reduction and
embedding visualization (primarily) for functional data. Possible tasks
include implementing suitable interfaces to mlr3 and/or tidyfun.
Implementing the AUMVC framework could be included in this topic area
as well.
The central goal of the project is to improve existing and implement new
embedding (i.e. dimension reduction) visualization approaches. Fairly
flexible, interactive versions of such visualizations, which include
e.g. tooltips/interactive highlighting when hovering over specific data
points, brushing for selecting and highlighting specific embedding
regions or curves, etc., have already been implemented in a previous MA
thesis (EmbedIt, Jennert 2023).
Thesis goals could include:
- Re-implementing EmbedIt based on more performant software like D3.js or refactoring it for better responsiveness etc
- Adding interactive 3D visualizations
- Implementing the “grand tour” and other classic multivariate exploration tools (cf. tourr)
- Adding pre-processing and embedding steps to the existing app
Beyond the methodological/theoretical topics below, we could develop more applied thesis topics in this context together with external partners that deal with large functional data sets such as the German Mouse Clinic (e.g. auditory brain stem response curves) or with (partners of) Prof. Christian Müller’s group at the Institute of Statistics.
Realistic evaluation of outlier detection should use real datasets with real outliers. Usually, this is done by selecting all majority class observations from a labeled dataset and contaminating them with a few randomly sampled instances from other minority classes. This approach yields “false” negatives/positives unless the minority class is really sufficiently and consistently different from the majority class observations. The goal of this project is to investigate under which circumstances this “unless” applies by comparing two approaches:
- use only datasets from the mlr-fda classification benchmark (pdf) that were predicted very accurately to generate outlier detection benchmark data
- for the generated benchmark datasets, use detailed observation-level mlr-fda benchmark results to pick only those minority (and maybe also majority?) class observations that were consistently labeled correctly
Additionally, we are interested in how these results are affected by measures of dataset structure like separability (pdf) and intrinsic dimensionality (pdf, CRAN).
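The contamination scheme described above can be sketched as follows (plain Python, for illustration only). The function name and the `eligible` argument are hypothetical: `eligible` stands in for whatever observation-level filter is derived from the benchmark results, e.g. the set of observations that were consistently labeled correctly.

```python
import random


def make_outlier_benchmark(labels, majority, n_outliers, eligible=None, seed=0):
    """Return (indices, is_outlier) for a contaminated majority-class sample.

    All majority-class observations are kept as inliers; `n_outliers`
    observations are drawn at random from the other classes. If `eligible`
    is given, only observations whose index is in that set can be drawn.
    """
    rng = random.Random(seed)
    inliers = [i for i, y in enumerate(labels) if y == majority]
    candidates = [
        i for i, y in enumerate(labels)
        if y != majority and (eligible is None or i in eligible)
    ]
    outliers = rng.sample(candidates, n_outliers)
    indices = inliers + outliers
    is_outlier = [False] * len(inliers) + [True] * len(outliers)
    return indices, is_outlier


# Toy labels: 20 majority-class and 10 minority-class observations.
labels = ["a"] * 20 + ["b"] * 5 + ["c"] * 5
indices, is_outlier = make_outlier_benchmark(
    labels, majority="a", n_outliers=2, eligible={20, 21, 25}
)
```

Comparing detection performance on benchmarks generated with and without the `eligible` filter is one way to quantify how much "false" outliers distort the evaluation.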
The area-under-the-mass-volume-curve (AUMVC, pdf) can be used to tune
outlier detection algorithms. Yet, a major caveat is that it relies on
MC simulations for approximating integrals and is thus not applicable to
high-dimensional settings. Combining it with dimension reduction and
manifold learning may solve this issue.
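To make the Monte Carlo step concrete, here is a minimal sketch of one way to estimate the area under a mass-volume curve (plain Python; not the manifun implementation and not necessarily identical to the procedure in the linked paper). It assumes a scoring function where higher values mean "more normal", estimates level-set volumes by uniform sampling from the bounding box of the data, and integrates over a grid of mass levels with the trapezoidal rule. The function name and all defaults are made up for this example.

```python
import random


def aumvc(score, data, alphas, n_mc=20000, seed=0):
    """Monte Carlo estimate of the area under the mass-volume curve.

    For each mass level alpha, the volume of the score level set that
    contains (about) a fraction alpha of the data is estimated by the share
    of uniform draws from the data's bounding box that fall into it.
    """
    rng = random.Random(seed)
    bounds = [(min(column), max(column)) for column in zip(*data)]
    box_volume = 1.0
    for low, high in bounds:
        box_volume *= high - low
    uniform_scores = [
        score([rng.uniform(low, high) for low, high in bounds])
        for _ in range(n_mc)
    ]
    data_scores = sorted((score(x) for x in data), reverse=True)
    volumes = []
    for alpha in alphas:
        # Score threshold exceeded by (about) a fraction alpha of the data.
        k = max(1, int(alpha * len(data_scores)))
        threshold = data_scores[k - 1]
        share = sum(s >= threshold for s in uniform_scores) / n_mc
        volumes.append(share * box_volume)
    # Trapezoidal rule over the mass levels.
    return sum(
        (a1 - a0) * (v0 + v1) / 2
        for a0, a1, v0, v1 in zip(alphas, alphas[1:], volumes, volumes[1:])
    )


# 2D Gaussian data, scored once sensibly and once ignoring the 2nd coordinate.
rng = random.Random(1)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
alphas = [k / 100 for k in range(90, 100)]
good = aumvc(lambda x: -(x[0] ** 2 + x[1] ** 2), data, alphas)
poor = aumvc(lambda x: -abs(x[0]), data, alphas)
```

A smaller value indicates a scoring function whose level sets capture the data more tightly, so `good` should come out below `poor` here. The dimensionality problem is visible in the `share` term: in high dimensions almost no uniform draws land in the level sets, so the volume estimates degenerate.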
The goal of this project is to implement the AUMVC framework in
manifun and to conduct initial experiments. The central questions to
be answered are:
- How robust is the AUMVC approach to the ambient dimensionality of the data? This could be assessed by comparing results on image and functional data.
- A further question, which may be investigated: are there any distance measures for complex data such as images, which can be used to induce suitable bias for AUMVC to work on (embeddings of) such data?
- stretch goal: can AUMVC be adapted to sample only from the relevant space so it scales better to (nominally) high-dimensional data? e.g. by simulating data uniformly from the (convex hull) of the observed data or from lower dimensional (but: almost losslessly compressed) representations/embeddings of the data?
Multi-dimensional scaling (MDS) can be used to represent the outlier
structure of functional datasets
(pdf). However, MDS embeddings
represent the entire data structure (… hopefully, at least), not
just structural outlyingness. Since MDS embedding dimensions are
sorted by decreasing amount of “explained structure”, this might lead to
components of structural outlyingness being represented only in “late”
embedding dimensions in datasets with few outliers and complex
structured variation of high rank.
You would develop and evaluate a procedure for identifying embedding
dimensions (or 2D-subspaces of such embeddings) in which structural
outlyingness is reflected, i.e. the goal is to find relevant
(combinations of) embedding dimensions for outlier
visualization/detection (and, possibly, for tuning AUMVC if embeddings
are too high-dimensional, see below). Possible approaches:
- use something like HiCS (Link; this is computationally heavy)
- run Local Outlier Factor (LOF) or similar methods on each 2D-subspace and pick the ones where upper-tail (!) dependencies with global LOF scores are maximal – “upper-tail” only because the strength of association between global LOF scores and corresponding subspace LOF scores for low/intermediate values is irrelevant for this issue.
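The second approach could be prototyped along the following lines (plain Python, for illustration only). Two simplifications are assumptions of this sketch: the mean distance to the k nearest neighbours is used as a simple stand-in for LOF, and "upper-tail dependence" is measured crudely as the overlap between the top-q sets of global and subspace scores. All function names are hypothetical.

```python
import itertools
import math
import random


def knn_score(points, k=5):
    """Mean distance to the k nearest neighbours (a simple stand-in for LOF)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def upper_tail_overlap(a, b, q=0.1):
    """Share of the top-q observations by score `a` that are also top-q by `b`."""
    m = max(1, int(round(q * len(a))))
    top_a = set(sorted(range(len(a)), key=a.__getitem__, reverse=True)[:m])
    top_b = set(sorted(range(len(b)), key=b.__getitem__, reverse=True)[:m])
    return len(top_a & top_b) / m


def rank_subspaces(embedding, k=5, q=0.1):
    """Rank all 2D subspaces by upper-tail agreement with the global scores."""
    global_scores = knn_score(embedding, k)
    results = []
    for i, j in itertools.combinations(range(len(embedding[0])), 2):
        subspace = [(x[i], x[j]) for x in embedding]
        overlap = upper_tail_overlap(global_scores, knn_score(subspace, k), q)
        results.append(((i, j), overlap))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Toy embedding: outliers deviate only in the "late" dimensions 2 and 3.
rng = random.Random(2)
inliers = [
    [rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 0.1), rng.gauss(0, 0.1)]
    for _ in range(95)
]
outliers = [[0.0, 0.0, a, b] for a, b in [(6, 0), (-6, 0), (0, 6), (0, -6), (5, 5)]]
ranking = rank_subspaces(inliers + outliers, k=5, q=0.05)
```

In this toy setting the subspace spanned by dimensions 2 and 3 recovers all globally outlying observations, while the leading subspace (0, 1) does not, which is the situation described above for datasets with few outliers and complex structured variation.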
Replicate simulation study and application examples of Aléman-Gomez et al. (2021) with our geometric-topological approach and compare results.
Both UMAP and t-SNE, state-of-the-art manifold learning methods, can be
used to detect and represent the cluster structure of complex,
high-dimensional data. However, which method is more suitable for the
task (or specific aspects of the task) under which conditions remains an
open question. Answering this question is complicated since several
hyperparameters need to be tuned for both methods and the underlying
task is unsupervised (i.e., tuning is hard…).
While there are indications that UMAP leads to “better” clusterings
(smaller intra-cluster distances, larger inter-cluster distances) when
high-dimensional data consists of clearly separate clusters, the
situation is less clear in situations with very close or even
overlapping clusters. The focus of this project is to assess the latter
setting and tasks include the following:
- set up extensive synthetic experiments to obtain an initial
understanding of the problem. Possible factors to assess:
- Overlap/Separability as a function of mean and variance of the underlying data generating processes
- Parameter sensitivity w.r.t. dimensionality and number of observations
- Relevant parameters for improving separability
- investigate whether measures of separability (pdf) can be used to reliably infer the structure of a data set
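A starting point for such synthetic experiments might look like the sketch below (plain Python, for illustration only). It generates two spherical Gaussian clusters whose overlap is controlled by a mean shift and a common standard deviation, and summarizes separability by the ratio of mean intra-cluster to mean inter-cluster distance. This ratio is a crude stand-in chosen for this example, not one of the separability measures from the linked paper, and all names are hypothetical.

```python
import math
import random


def two_clusters(n, dim, shift, sd, seed=0):
    """Two spherical Gaussian clusters whose means differ by `shift` in dimension 0."""
    rng = random.Random(seed)
    a = [[rng.gauss(0, sd) for _ in range(dim)] for _ in range(n)]
    b = [[rng.gauss(shift if j == 0 else 0, sd) for j in range(dim)] for _ in range(n)]
    return a, b


def mean_distance(xs, ys):
    """Mean Euclidean distance over all pairs of distinct points from xs and ys."""
    dists = [math.dist(x, y) for x in xs for y in ys if x is not y]
    return sum(dists) / len(dists)


def distance_ratio(a, b):
    """Mean intra-cluster distance divided by mean inter-cluster distance.

    Values close to 1 indicate heavy overlap, smaller values better separation.
    """
    intra = (mean_distance(a, a) + mean_distance(b, b)) / 2
    return intra / mean_distance(a, b)


# Overlap as a function of the mean shift, for fixed variance and dimension.
for shift in (0.5, 2.0, 8.0):
    a, b = two_clusters(n=40, dim=10, shift=shift, sd=1.0)
    print(shift, round(distance_ratio(a, b), 3))
```

The same ratio can be computed on the UMAP or t-SNE embeddings of these data to check how well each method preserves or exaggerates the separation as the clusters move closer together.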
These fairly old and rather badly written functions implement very general classes of penalized regression models (GAMs and GAMMs) for functional responses and/or predictors. Your thesis would be to re-write them from scratch with my help, using best practices for R programming like proper unit tests, input validation, and extensive documentation. This could also include developing a more streamlined, consistent formula interface and better methods to deal with factor covariates and interaction effects, as well as writing up some interesting case studies to be published as a vignette accompanying the package. See Scheipl & Greven (2017) for a review of the underlying methodology.