Big Data, Missing Data
and everything in between
11 January 2023 - 12h15-14h00
Online
Registration mandatory - via this link
This century is undoubtedly the century of massive data. The research community, companies and administrations are producing an unprecedented volume of data. Enabled by major computational advances, analytical methods have themselves evolved to address the challenges of processing and analysing these increasingly large data sets. However, to reduce data science to Big Data would be to see only part of the big picture. At the other end of the spectrum, a great deal of research has to deal with a scarcity of data, or even its total absence. Here too, fascinating methodological innovations have been developed to make the most of existing data or to capture the nature and scope of missing data as accurately as possible. Cross-fertilisation between these approaches is all the more enriching for research because the boundaries between them are porous: it is not uncommon for research based on large data sets to also have to contend with missing data.
Through concrete examples from their research, the speakers of this seminar will present how they deal with Big Data and Missing Data, and sometimes both at the same time.
Program
Big Data from the Gaia Space Mission
Laurent Eyer, Faculty of Science, Department of Astronomy
Gaia is an ambitious mission of the European Space Agency that has met with enormous success in the scientific community. Early in the mission preparation, it was recognised that the Gaia data processing would be a challenge of the highest order. The University of Geneva leads and is responsible for the time-series analysis of the more than 2 billion sources that Gaia measures repeatedly, with different instruments nearly simultaneously. The Gaia Data Processing Centre, hosted at the University of Geneva, is one of five processing centres of the Gaia consortium: it drives the software development and operations of the Variability Studies and handles the hardware, storage and processing of this petabyte-scale project. We coordinate an effort of 70 scientists and developers across Europe who contribute to method development, coding and analysis validation. We use machine-learning methods and purpose-built tools to analyse the data. The results are delivered iteratively to the community through successive data releases: each release provides more sources to the public, with improved quality and a more diverse set of data products. This has resulted in the largest all-sky collection of variable sources in our Milky Way.
Long-term intracranial recordings: epileptic seizures are rare and recording sites are sparse
Nicolas Roehri, Faculty of Medicine, Basic Neurosciences
One percent of the world’s population suffers from epilepsy. Anti-seizure medications exist, but one third of patients are drug resistant. These patients can be admitted for presurgical evaluation for resective surgery, which consists of localising and resecting the part of the brain involved in generating seizures, namely the epileptogenic zone. During the presurgical evaluation, patients are first monitored non-invasively (e.g., scalp EEG, MRI). If the localisation is conclusive, surgery is proposed to the patient. Otherwise, a second phase with invasive monitoring is needed, in which electrodes are implanted directly into the patient’s brain. If the localisation is then conclusive, surgery is proposed. During the invasive phase, patients are continuously monitored for several weeks to record as many seizures as possible, as seizures are the most informative manifestation of the disease. Unfortunately, for some patients, only a few seizures are recorded (one or two). Would it be possible to take advantage of the large amount of recorded data even without seizures? Moreover, due to the invasive nature of the recording, only a small number of brain regions are recorded, making these recordings spatially biased. I will present studies that either take advantage of the large amount of seizure-free data or mitigate the spatial bias by building a multicentre intracranial whole-brain atlas.
The Handling of Data Missingness in Biological Anthropological Research
Jessica Ryan-Despraz, Department of Physical Anthropology, Institute of Forensic Medicine, University of Bern
Missing data in biological anthropology is inevitable given the circumstances commonly surrounding skeletal preservation. In recent years, the increased accessibility of statistical analysis software has inspired anthropologists to reconsider how they analyse data. In particular, the application of imputation models to both quantitative bone measurements and qualitative development scores has opened the field to new areas of data analysis. However, questions remain concerning the best-adapted imputation models as well as the acceptable amount of missingness. This study explores both questions through the application of the R packages MICE and missMethods to a well-documented osteological dataset.
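To give a flavour of this kind of workflow, here is a minimal sketch, not taken from the study itself: it assumes a hypothetical data frame of bone measurements, uses missMethods to delete values completely at random, and then imputes them with MICE (predictive mean matching).

# Minimal sketch with hypothetical data: simulate missingness, then impute with MICE
library(mice)
library(missMethods)

# Hypothetical complete reference dataset of long-bone measurements (mm)
set.seed(123)
bones <- data.frame(
  femur_length   = rnorm(100, mean = 450, sd = 25),
  humerus_length = rnorm(100, mean = 320, sd = 20),
  tibia_length   = rnorm(100, mean = 370, sd = 22)
)

# Remove 30% of values completely at random to mimic poor preservation
bones_mcar <- delete_MCAR(bones, p = 0.3)

# Multiple imputation by chained equations, predictive mean matching
imp <- mice(bones_mcar, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Take one completed dataset and compare imputed values to the known originals
completed <- complete(imp, 1)
summary(completed$femur_length - bones$femur_length)

Repeating such an experiment at different missingness proportions and with different imputation methods is one way to probe how much missingness an analysis can tolerate before the imputed values drift too far from the truth.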