|
Abstract:
In recent years, technological advances have rapidly expanded our data acquisition capabilities. As a result, the typical modern scientific dataset has grown dramatically in both size and complexity, and the difficulties of analyzing such datasets effectively have multiplied. This thesis considers a few of the most prominent analysis challenges posed by this new wealth of data. We start with the problem of reconciling and fusing information from multiple acquisition modalities. We study current methods that have shown great experimental success in order to identify their most critical properties. We then propose a new, larger class of metrics for this purpose that share these properties. We verify that the new metrics match the experimental success of the methods we studied while also incorporating additional desirable properties, such as greater computational simplicity.
We then move to the problem of automatically finding underlying structure in very large datasets. In particular, we focus on finding underlying manifold structure in high-dimensional data that can be effectively parameterized by a small number of latent variables. We draw a previously unnoticed connection between a recent wave of spectral methods for this problem and earlier criteria motivated by more physical concerns. Using this connection, we identify the source of several typical shortcomings in the output of spectral methods and show how to correct them. We also apply these methods to fMRI datasets in an attempt to characterize the brain's representation of spatial location. However, our experiments reveal several challenges posed by fMRI datasets, which we detail and discuss, that current manifold learning methods cannot yet handle.
Finally, we look at the problem of characterizing an unknown or poorly understood process that has generated a dataset. For this, we turn to a dataset of high-resolution scans of paintings by Vincent van Gogh and his contemporaries, and we ask what distinguishes a painting produced by Van Gogh's artistic process from one by another artist. In this vein, we develop methods that achieve at least 80% accuracy in distinguishing Van Gogh paintings from those of other artists. We also develop further methods that appear to detect the hesitation of a forger or copyist in a painting's digital image.
|