ISFA-COLUMBIA Workshop program

Monday June 27, 2016

Welcome coffee

Welcome address: José Blanchet (Columbia) & Stéphane Loisel (Lyon)

Nicole El Karoui & Stéphane Loisel (Paris 6 & ISFA)
Short presentation of LoLitA project & associated big data challenges

Nicholas Tatonetti (Columbia University)
Birth Month Affects Lifetime Disease Risk: A Phenome-Wide Method

Edouard Debonneuil (Lyon)
Cancer Baseline and other big data approaches to extend life

Coffee Break

Alexander Zhavoronkov & Evgeny Putin (In Silico)
Applications of deep learning in biomedicine, part 1

Lunch break & discussions

Julien Velcin (Lyon)
Structuring temporal sparse data with application to opinion mining

Xavier Milhaud (Lyon)
Tree-based censored regression with applications to insurance

Refreshment Break

Research discussions with DAMI chair team (chairs C. Robert & F. Planchet)
- Presentation of the research chair Data Analytics and Models in Insurance
- Presentation of a research project by Christian Robert
- Feedback on Cardif Kaggle competition by Lam Dang (Cardif)

Research discussions with LoLitA team (chairs N. El Karoui, S. Loisel & O. Lopez)

Tuesday June 28, 2016

Welcome coffee

JingChen Liu (Columbia University)
Latent and Network Models with Applications to Finance

Coffee break

Alexander Zhavoronkov & Evgeny Putin (In Silico)
Applications of deep learning in biomedicine, part 2

Lunch break & discussions

Karthyek Murthy (Columbia)
Distributional robustness and Regularisation in machine learning: Two sides of the same coin

Christophe Geissler (BNP Paribas Cardif chair)
SRI investment: applying a supervised learning algorithm to environmental, social and governance historical scores

Refreshment break

José Blanchet (Columbia)
On Robust Risk Analysis

Nabil Kazi-Tani (Lyon)
Overreacting random walks tend to become amnesic


Ceremony in honour of E.J. Gumbel
- Emil J. Gumbel - An Extreme Statistician by Matthias Scherer (Munich)
- Additional remarks by Officials from Lyon
- Official renaming of Lecture Hall G3 after the name of Gumbel



Birth Month Affects Lifetime Disease Risk: A Phenome-Wide Method
An individual’s birth month has a significant impact on the diseases they develop during their lifetime. Previous studies reveal relationships between birth month and several diseases including atherothrombosis, asthma, attention deficit hyperactivity disorder, and myopia, leaving most diseases completely unexplored. This retrospective population study systematically explores the relationship between seasonal effects at birth and lifetime disease risk for 1688 conditions. We developed a hypothesis-free method that minimizes publication and disease selection biases by systematically investigating disease-birth month patterns across all conditions. Our dataset includes 1 749 400 individuals with records at New York-Presbyterian/Columbia University Medical Center born between 1900 and 2000 inclusive. We modeled associations between birth month and 1688 diseases using logistic regression. Significance was tested using a chi-squared test with multiplicity correction. We found 55 diseases that were significantly dependent on birth month. Of these, 19 were previously reported in the literature (P < .001), 20 were for conditions with close relationships to those reported, and 16 were previously unreported. We found distinct incidence patterns across disease categories. Lifetime disease risk is affected by birth month. Seasonally dependent early developmental mechanisms may play a role in increasing lifetime risk of disease.
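
As a rough illustration of the testing step described above, the following sketch checks whether the birth months of patients with a given condition deviate from the birth-month distribution of the whole population, then applies a multiplicity correction. It is a simplified stand-in, not the authors' pipeline: it uses a chi-squared goodness-of-fit test instead of logistic regression, it assumes a Bonferroni correction (the abstract does not say which correction was used), and all counts and function names are made up for the example.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^r / r!.
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite correction sum.
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    phi = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return math.erfc(chi / math.sqrt(2)) + 2 * phi * total


def birth_month_test(case_counts, population_counts):
    """Goodness-of-fit of case birth months against population birth-month shares."""
    n_cases = sum(case_counts)
    n_pop = sum(population_counts)
    stat = 0.0
    for observed, pop in zip(case_counts, population_counts):
        expected = n_cases * pop / n_pop
        stat += (observed - expected) ** 2 / expected
    return stat, chi2_sf(stat, len(case_counts) - 1)


def bonferroni(p_values, alpha=0.05):
    """Flag the tests that stay significant after a Bonferroni correction (assumed)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical monthly counts (January to December), invented for illustration.
population = [1000] * 12
flat_condition = [50] * 12
seasonal_condition = [90, 80, 50, 40, 40, 40, 40, 40, 40, 50, 40, 50]

results = [birth_month_test(c, population) for c in (flat_condition, seasonal_condition)]
significant = bonferroni([p for _, p in results])
```

In a study with 1688 conditions, the same correction would be applied across all 1688 p-values at once, which is why only strong seasonal patterns survive.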

Cancer Baseline and other big data approaches to extend life
The era of big data brings an unprecedented promise for human health and longevity: we will be able to learn from the individual lives of billions of people, to better know what is good and bad and to pilot our health much more effectively. Here, such a feedback loop is already presented with aggregated data about 4 billion people -- a project called "Baseline", on which more than 200 people have joined forces in the last 6 months -- and approaches with large-scale individual data are introduced.

Applications of deep learning in biomedicine
Alexander ZHAVORONKOV & Evgeny PUTIN
The availability of big data, coupled with advances in highly parallel high-performance computing, has led to a renaissance in artificial neural networks, resulting in trained algorithms surpassing human performance in image and voice recognition, autonomous driving and many other tasks. While the adoption of deep learning in biomedicine has been relatively slow, it is reasonable to expect major advances in personalized medicine, pharmaceutical R&D and other areas of healthcare. Here we present new results in developing aging biomarkers using blood biochemistry (www.Aging.AI) and transcriptomic data, and demonstrate the applications of deep learning to drug discovery and drug repurposing using large data sets of transcriptional response data. We also discuss recent advances in deep learning and possible applications to actuarial science.

Feedback on Kaggle competition launched by BNP Paribas Cardif on claim process optimization
BNP Paribas Cardif launched its first challenge on Kaggle, which ran from February to April 2016. The goal was to predict, early in the current claim process, which claims BNP Paribas Cardif is willing to accept much more quickly, thereby improving customer satisfaction.
The challenge was a success in terms of both participation (2926 teams) and the solutions provided by the top three winners.
In this presentation, Sébastien Conort (BNP Paribas Cardif's Chief Data Scientist) will talk about what was learned from the winners' solutions.

Latent and Network Models with Applications to Finance
JingChen LIU
One of the main tasks of statistical models is to characterize the dependence structures of multi-dimensional distributions. Latent variable models take advantage of the fact that the dependence of a high-dimensional random vector is often induced by just a few latent (unobserved) factors. In this talk, we present several problems regarding latent variable models. When the dimension grows higher and the dependence structure becomes more complicated, it is hardly possible to find a low-dimensional parametric latent variable model that fits well. We further enrich the model by including a network structure on top of the latent structure. Thus, the main variation of the random vector remains governed by latent variables and the network captures the remaining dependence.

Structuring temporal sparse data with application to opinion mining
Messages posted on social media only partially reflect users' opinions. To gather these traces scattered across the Web, evolutionary clustering techniques look highly promising. In this talk, I will present two possible probabilistic models that address this issue. Our proposals extend the classic multinomial mixture to handle the temporal dimension so that we can capture opinions over time. I will illustrate the two models with recent experiments performed within the ImagiWeb project, which aims at studying the image (representation) of entities populating the social Web. More specifically, I will use the image of two French politicians during the last presidential elections as a case study.

Tree-based censored regression with applications to insurance
In this paper, we propose a regression tree procedure to estimate the conditional distribution of a variable which is not directly observed due to censoring. The model that we consider is motivated by applications in insurance, including the analysis of guarantees that involve durations, and claim reserving. We derive consistency results for our procedure, and for the selection of an optimal subtree using a pruning strategy. These theoretical results are supported by a simulation study, and an application to insurance datasets.

Distributional robustness and Regularisation in machine learning: Two sides of the same coin
Karthyek MURTHY

Finding the best fit given available data is a common theme encountered in various problems in machine learning. When the number of samples available for training is smaller than the ambient dimension of the problem, usual empirical risk minimisation may not be enough. We introduce RWPI (Robust Wasserstein Profile Based Inference), a novel machine learning methodology that is aimed at enhancing out-of-sample performance in such settings. RWPI exploits the relationship between a suitably defined distributionally robust optimization problem and the Wasserstein profile function. On the one hand, a judicious choice of the distributional uncertainty can be used to build a wide range of regularisation procedures (we recover generalized Lasso, support vector machines and regularised logistic regression as particular cases) and to introduce new families. On the other hand, an asymptotic analysis of the Wasserstein profile function allows the regularisation parameter to be selected optimally. We shall discuss this optimality in the context of a popular regularised linear regression algorithm called generalized Lasso.
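
The link between distributional robustness and regularisation can be illustrated on linear regression with squared loss. The identity below is the form usually quoted in the distributionally robust optimisation literature. It is given here only as a sketch, and the notation is assumed rather than taken from the abstract: P_n is the empirical measure of the training sample, D_c is the optimal transport cost induced by the ground cost c, delta is the size of the uncertainty set, and p and q are conjugate exponents.

```latex
\min_{\beta} \; \sup_{P \,:\, D_c(P, P_n) \le \delta}
  E_P\!\left[ (Y - \beta^{\top} X)^2 \right]
= \min_{\beta} \left( \sqrt{ E_{P_n}\!\left[ (Y - \beta^{\top} X)^2 \right] }
  + \sqrt{\delta}\, \|\beta\|_p \right)^{2},
\qquad
c\big((x, y), (x', y')\big) =
\begin{cases}
  \|x - x'\|_q^2 & \text{if } y = y', \\
  +\infty        & \text{otherwise,}
\end{cases}
\qquad
\frac{1}{p} + \frac{1}{q} = 1 .
```

With p = 1 the right-hand side carries an l1 penalty of the square-root Lasso type, and the square root of delta plays the role of the regularisation parameter, which is why choosing delta amounts to choosing the amount of regularisation.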

SRI investment: applying a supervised learning algorithm to environmental, social and governance historical scores
Christophe GEISSLER
Socially Responsible Investment (SRI) relies on the inclusion of non-financial criteria, mainly environmental (E), social (S) and governance-related (G) scores attributed to corporations by specialized agencies. These scores are used to build a variety of quasi-static investment strategies like best-in-class or sector exclusion. The question of whether these strategies are associated with a positive or negative performance bias remains largely open.
We present here an attempt to apply supervised learning to a large set of covariates comprising various derivations of initial score time series, in order to link future returns with observable variables. User requirements impose the traceability of the predictors, as well as their eligibility with respect to ESG fundamentals. An overview of the predictors provided by this approach, as well as a simulation of their systematic implementation, is presented.

On Robust Risk Analysis
We consider the problem of maximizing an expectation over all probability measures which are within a given Wasserstein distance of a target measure. This problem is solved explicitly for a large class of expectations of interest in great generality, and we show that the solution has a natural practical interpretation in terms of stress testing. Moreover, when the underlying distribution is supported in R^{d}, we show a weak limit for the asymptotic distribution of the empirical version of the problem. The limit laws are non-conventional and qualitative differences arise depending on the value of d (d=1, d=2, and d=3 give rise to three different mathematical developments). These results provide the foundation for a data-driven approach to robust stress testing. This talk is based on joint work with K. Murthy and Y. Kang.
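
For orientation, worst-case expectations of this kind are commonly handled through a duality of the following form, which holds under suitable regularity conditions on the objective and the cost. This is a sketch in assumed notation, not necessarily the exact statement given in the talk: P_0 is the target measure, f is the quantity whose expectation is maximized, c is the transport cost (for instance a power of a distance on R^{d}), and delta is the allowed transport budget.

```latex
\sup_{P \,:\, D_c(P, P_0) \le \delta} E_P\left[ f(X) \right]
= \inf_{\lambda \ge 0} \left\{ \lambda \delta
  + E_{P_0}\!\left[ \sup_{y} \big( f(y) - \lambda\, c(X, y) \big) \right] \right\}
```

The inner supremum reads as a stress test: each baseline scenario X is moved to the most adverse scenario y, with the move penalized by its transport cost at the rate lambda.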

Overreacting random walks tend to become amnesic
We are interested in a discrete random walk on the integers that takes the steps (1, +1) and (1, −1) with equal probability when it is not equal to 0. When it reaches the state 0, the behavior of the walk depends on whether it touched the x-axis coming from a positive or a negative excursion. The excursions of this walk do not form an i.i.d. sequence. We will give an explicit expression for the one-dimensional distribution of this walk. As a by-product, we obtain a new simple combinatorial interpretation of k-fold convolutions of Catalan numbers.
We can for instance model an overreaction phenomenon at 0, meaning that the walk has a high probability of crossing the x-axis when it reaches it. In that particular case, we will prove that in a discrete-time financial model where returns satisfy a symmetric overreaction property at 0, derivative prices are the same as in the model without overreaction.
We will also discuss the continuous-time limit of such discrete random walks and explain how they are linked with the skew Brownian motion.
This is joint work with Dylan Possamaï (Paris Dauphine University).
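
The kind of walk described above can be simulated in a few lines. The sketch below is only one possible reading of the model: it assumes a single parameter, here called cross_prob, giving the probability that the walk crosses the x-axis when it returns to 0, applied symmetrically to positive and negative excursions. The function name and this parameterisation are hypothetical and are not taken from the talk.

```python
import random


def simulate_walk(n_steps, cross_prob, rng):
    """Simulate a +/-1 walk whose step at 0 depends on the sign of the last excursion.

    cross_prob is the (assumed, symmetric) probability of crossing the x-axis
    when the walk comes back to 0. A value close to 1 mimics overreaction.
    """
    position = 0
    last_sign = 0  # sign of the most recent excursion, 0 before the first step
    path = [position]
    for _ in range(n_steps):
        if position != 0 or last_sign == 0:
            # Away from 0, or at the very first step: fair coin.
            step = 1 if rng.random() < 0.5 else -1
        else:
            # Back at 0: cross to the other side with probability cross_prob.
            crosses = rng.random() < cross_prob
            step = -last_sign if crosses else last_sign
        position += step
        if position != 0:
            last_sign = 1 if position > 0 else -1
        path.append(position)
    return path


# Example: a strongly overreacting walk with a fixed seed for reproducibility.
example_path = simulate_walk(1000, 0.9, random.Random(0))
```

In this symmetric sketch the absolute value of the walk evolves exactly like that of a simple symmetric random walk whatever the value of cross_prob, since the rule at 0 only affects the sign of the next excursion.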

Emil J. Gumbel - An Extreme Statistician
Matthias SCHERER

Important methods in quantitative risk management and actuarial science – particularly in the areas of extreme value theory and multivariate statistics – were developed and popularised by Emil J. Gumbel. The Gumbel distribution and the Gumbel copula bear witness to this. To mark his 125th birthday – he was born on 18th July 1891 in Munich – we are providing an insight into his mathematical legacy and taking a closer look at his life as an academic, publicist, witness to history and pacifist. In addition to his mathematical work, he published several books on politics and countless newspaper articles on political murders, the justice system and nationalist secret societies in the Weimar Republic. This cost him his post at the University of Heidelberg in 1932 and saw him included on the first list of people whose citizenship was revoked under the Third Reich in 1933. He emigrated to France in 1932, but had to flee to the USA in 1940 to escape the occupying German forces. He died on 10th September 1966 in New York.