Data Science Clinics

12 October 2020 - 12h15-14h00

Uni Dufour - U159

Powerpoint Stéphane Guerrier & Maria-Pia Victoria-Feser

12h15-12h30: Stéphane Guerrier and Maria-Pia Victoria-Feser, Geneva School of Economics and Management, Research Center for Statistics.

"Data owners and methodologists: towards fruitful collaborations"

Some researchers produce or collect data in order to increase or confirm their knowledge of subjects of interest within or across their disciplines.
Others develop data analysis methodologies for exploring the (big) data at hand, such as machine learning methods, and still others (sometimes the
same people) develop mathematical/probabilistic tools for acquiring knowledge from the data that can be used to answer questions on a more general
basis (that is, about what is called the population).

Collaboration between data owners and methodologists can be very fruitful, since the former know the problem under study very well and the latter can
translate it into a formal framework from which new data analysis methodologies may be developed. When these associations work, all sides win and
science can progress in several directions. In other words, interdisciplinary research makes it possible to combine knowledge and deliver
higher-quality research than if the work were done independently in the respective domains.

In the presentation, a non-exhaustive set of examples of such collaborations will be presented, focusing on the fundamental ideas rather than the
technical aspects. These examples concern bioequivalence (pharmaceutical sciences), dynamic gene networks (biology), breast cancer omics analysis
(medical sciences), career choice of medical students (education sciences), targeting hallmarks of cancer with a food system-based approach
(nutrition), COVID-19 prevalence estimation (public health), improved navigation for small unmanned aerial vehicles (aerospace engineering), energy
consumption modelling (architecture/civil engineering), natural disaster modelling (environmental sciences), risk management (actuarial sciences),
and statistical analysis in economic inequality (economics).

12h30-12h55: Hy Dao and Jacques Michelet, Faculté des Sciences de la société, Institut de gouvernance de l'environnement et développement territorial.

"Big data in territorial development: the ESPON experience"

In the ESPON programme (applied research for European territorial cohesion), in which UNIGE has been heavily involved for about ten years, the use of new web data sources is becoming strategic. Prof. Hy Dao and Dr Jacques Michelet, together with the partners of the Certificat de géomatique, have built up expertise in geographic information (production, storage, processing, and interpretation for policy makers), but mastering new web data sources requires a broader combination of expertise, hence the idea of creating a network that draws on the diversity of skills present at UNIGE.

The challenge can be summed up as follows: the territorial statistics on which ESPON depends are still essentially determined by administrative units and official statistics. Yet current requests show a need to go further, finer, and faster in order to cover themes that these "traditional" data address poorly. To meet such requests, we have so far worked in a network with European partners on limited outputs. In the future, the aim would be to handle systematically in Geneva a larger share, or even all, of this production chain for new geographic information.

We are taking the opportunity offered by the Data Science Clinics to identify skills at UNIGE that we lack, in particular:

o computer scientists for "web content mining"

o physicists/mathematicians for large-scale information processing, in Baobab for example

o central services for developing public-private partnerships

o legal scholars to devise a confidentiality framework for the use of these data

o …

Pooling these skills would in turn allow each partner to become more competitive in its projects in the era of "Big information". UNIGE would gain expert capacity and position itself in the field of newly emerging data.

12h55-13h20: Yi-Tang Lin, Faculté des Lettres, Département d'histoire générale.

"The Rockefeller Foundation fellows database: how to analyse international networks over time?"

The presentation is divided into two parts. The first covers the database of Rockefeller Foundation fellows and fellowships, 1914-1970s. During this period, the American foundation awarded nearly 15,000 fellowships to experts in 80 countries, enabling them to train or travel in several countries for one to two years to improve their expertise. The training subjects were diverse, including natural sciences, public health administration, and arts and culture. By cross-referencing historical sources, the database documents the fellows' demographic profiles and trajectories. In the second part, we discuss our current questions about potential ways to analyse the data and handle missing information.

13h20-13h45: Luca Caricchi, Faculté des Sciences, Département des sciences de la terre.

"Machine-learning in volcanology"

The chemistry of minerals can be used to estimate depth and temperature within magma chambers from analyses of magma erupted at the surface. Recently, we proposed a machine learning approach to do this, but various issues have arisen that could be interesting to discuss.

1. Since few experiments are available to calibrate our method, "data augmentation" is a salient issue.

2. The range of pressure and temperature is large, but when we study one or two minerals they might be stable only within specific ranges. What is the best approach to calibrate algorithms over the entire pressure and temperature range?

3. We use a random forest approach and split the data into train-validation and test datasets. Because of the paucity of data, each repetition yields a different estimate of model performance, owing to differences between the pressure and/or temperature distributions of the train and validation datasets. Is this issue addressed by data augmentation, or can we use the "best" models within the range of the validation dataset?
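
As a discussion aid for point 3, here is a minimal sketch of one possible mitigation: stratifying the train/validation split on binned pressure (or temperature), so that every repetition covers the full range in both subsets. This is an illustration under stated assumptions, not the method used by the speaker. The function names (stratified_split, random_split, range_coverage), the number of bins, and the pressure values are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(targets, n_bins=4, val_fraction=0.25, seed=0):
    """Split sample indices into train/validation so that each target bin
    (e.g. a pressure interval) is represented in both subsets."""
    rng = random.Random(seed)
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all targets are equal
    bins = defaultdict(list)
    for i, t in enumerate(targets):
        # Clamp so that the maximum value falls in the last bin.
        b = min(int((t - lo) / width), n_bins - 1)
        bins[b].append(i)
    train, val = [], []
    for b in sorted(bins):
        idx = bins[b]
        rng.shuffle(idx)
        # Bins with a single sample stay entirely in the training set.
        n_val = max(1, round(len(idx) * val_fraction)) if len(idx) > 1 else 0
        val.extend(idx[:n_val])
        train.extend(idx[n_val:])
    return sorted(train), sorted(val)


def random_split(n, val_fraction=0.25, seed=0):
    """Plain random split of n sample indices, for comparison."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_val = round(n * val_fraction)
    return sorted(idx[n_val:]), sorted(idx[:n_val])


def range_coverage(targets, subset):
    """Fraction of the full target range spanned by the given subset."""
    full = max(targets) - min(targets)
    vals = [targets[i] for i in subset]
    return (max(vals) - min(vals)) / full


# Hypothetical pressures (arbitrary units) for 24 calibration experiments.
pressures = [0.5 + 0.8 * i for i in range(24)]

# Repeat both kinds of split and record how much of the pressure range
# the validation set covers each time.
plain = [range_coverage(pressures, random_split(len(pressures), seed=s)[1])
         for s in range(200)]
strat = [range_coverage(pressures, stratified_split(pressures, seed=s)[1])
         for s in range(200)]

print("worst validation coverage, plain split:     ", round(min(plain), 2))
print("worst validation coverage, stratified split:", round(min(strat), 2))
```

In this example the stratified validation set always contains samples from both the lowest and the highest pressure bin, so performance estimates depend less on which samples happen to be drawn. Stratification does not address the small sample size itself, which remains a question for data augmentation.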

More general issues of interest for other projects within our research group:

1. How to deal with mixed datasets that combine images and chemical data?

2. We often have very high-quality data in small numbers and lower-quality data in abundance. What is the best approach to obtain the most information from combining these kinds of mixed datasets?