Chapter 10 Prior Ratemaking

10.1 Introduction

In so-called a priori pricing, the idea is to divide contracts (and policyholders) into several categories so that, within each category, risks can be considered “equivalent.” The foundations of pricing in a segmented universe were laid out in Chapter 3 of Volume 1 (see Section 3.8).

As we saw in Section 3.7, heterogeneity within a portfolio poses numerous problems, particularly adverse selection. If the same premium is applied to the entire portfolio, “bad” risks will take out insurance (at a lower price than they should pay), while “good” risks may be discouraged by the excessive premium, which tends to worsen the insurer’s results. The natural idea developed in the early sections of this chapter is to partition the portfolio into sub-portfolios within which risks can be considered independent and identically distributed (i.i.d.). These are referred to as risk classes. Risk classification is said to be a priori when it is based only on information available before the contract takes effect (information about the insured, the insured property, etc.), and a posteriori when the policyholder’s claims history is also taken into account (as will be done in the following two chapters).

Chapters 3 and 4 of Volume 1 presented the general principles of premium calculation, primarily based on the calculation of the pure premium, which corresponds to the mathematical expectation of the annual cost of reported claims. However, this pure premium can be decomposed into two components: frequency and average cost. It should be noted that \[ \mathbb{E}\left[\sum_{i=1}^N X_i \right]=\mathbb{E}\left[N\right]\times\mathbb{E}\left[X_i\right], \] for i.i.d. claim amounts \(X_1,X_2,\ldots\) that are independent of the claim count \(N\) (see Property 3.2.11). It is essential to separate these two notions for several reasons: the explanatory factors are not always the same (in motor insurance, frequency is primarily related to the driver, while average cost is related to the vehicle); average cost is subject to inflation, while frequency follows more complex cycles; and while frequency can be known quickly, estimating cost can be time-consuming, especially for bodily injury claims (this will be discussed in detail in the chapter on loss reserving).
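To make the decomposition concrete, here is a minimal simulation sketch (not taken from the text) in which claim counts are Poisson and claim amounts are lognormal, with the count independent of the amounts; the parameter values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Illustrative assumptions: Poisson claim counts, lognormal claim amounts,
# with the count N independent of the amounts X_i.
lam = 0.135            # expected claim frequency E[N]
mu, sigma = 6.5, 1.0   # lognormal parameters of the claim severity X_i

n_policies = 100_000
counts = rng.poisson(lam, size=n_policies)
total_cost = np.array([rng.lognormal(mu, sigma, size=k).sum() for k in counts])

# Empirical pure premium versus the product E[N] * E[X_i]
print("mean annual cost per policy:", total_cost.mean())
print("E[N] * E[X_i]              :", lam * np.exp(mu + sigma**2 / 2))
```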

Example 10.1 Figure ?? compares the average cost of claims and the claim frequency for different population segments based on data from France. The segmentation takes into account gender, age (18-25, 25-40, 40-65, over 65), vehicle usage (commercial or personal), and vehicle power (graded from A to K, with K representing high-powered vehicles), as well as some socio-professional categories (liberal professions, public service, or agriculture). The two axes represent the average behavior across the entire population (frequency around 13.5% and average cost around €1,100, corresponding to the vertical and horizontal axes of the figure, respectively). It can be observed that the segment that stands out the most is young male drivers (aged 18 to 25, category H-18-25 in the top right), who have both more claims and more costly claims. Men aged 25 to 40 also stand out, especially when they have a powerful car or belong to a liberal profession and use their vehicle for work. Female liberal professionals are also highlighted, with a very high claim frequency but an average cost in the normal range. Conversely, elderly individuals or inactive men have a relatively low frequency for a standard average cost.

Furthermore, while direct modeling of the pure premium may appear more robust and faster, it is relatively complex to implement because it is difficult to find simple distributions that correctly model the pure premium. In contrast, simple distributions can be used to model frequencies and average costs (see Chapter 2).

10.2 Rating Variables

In insurance rating, for historical reasons and technical considerations, rating variables are generally qualitative variables. Continuous variables are typically grouped into classes.

In motor insurance, the premium can depend on the specific characteristics of the vehicle (power and top speed, vehicle age), its usage, the geographical area (traffic density), or certain specific traits of the regular driver.

The information that insurers try to obtain about the insured, in order to counter adverse selection, must, from a practical point of view, be verifiable. Some information that could reveal risky behavior (i.e., strongly correlated with claims) cannot be used because it would lead to significant fraud (a topic we will not address in this chapter). For example, the annual mileage of a vehicle is difficult to verify.

The explanatory variables comprising \(\mathbf{X}\) can be of different types. Some of them can be quantitative and continuous (such as the car’s power or the age of the insured, for example). Other explanatory variables that the insurer has about policyholders can be discrete quantitative variables (the number of children of the insured, for example). Others are qualitative or categorical (such as the gender of the insured). One can also use indices summarizing the characteristics of the insured’s residential area based on public data, such as those from a census. By condensing relevant information about claims available from a national statistical institute, one can obtain new explanatory variables whose predictive power (recognized by the model) is often very significant. We will revisit this point in Section ??.

Quantitative variables require no particular comments. Concerning categorical variables, it is customary to code any factor that partitions the population into \(k\) categories by integers \(0,1,\ldots,k-1\). Some factors are ordinal and derived from a quantitative variable (such as the vehicle’s power, coded into classes), others are ordinal without an underlying quantitative scale (such as the level of education), and others are purely qualitative, inducing no order at all (such as gender). A categorical variable with \(k\) levels is generally encoded by \(k-1\) binary variables, all of which are zero for the reference level.

Example 10.2 Most of the time, explanatory variables are all categorical in a commercial tariff. Consider, for example, an insurance company segmenting based on gender, the sportiness of the vehicle, and the age of the insured (3 age classes: under 30, 30-65, and over 65). A policyholder is represented by a binary vector indicating the values of the variables: \[\begin{eqnarray*} X_1&=&\left\{ \begin{array}{l} 0,\text{ if the insured is male},\\ 1,\text{ if the insured is female}, \end{array} \right. \\ X_2&=&\left\{ \begin{array}{l} 0,\text{ if the vehicle is not sporty},\\ 1,\text{ if the vehicle is sporty}, \end{array} \right. \\ X_3&=&\left\{ \begin{array}{l} 1,\text{ if the insured is under 30 years old},\\ 0,\text{ otherwise}, \end{array} \right. \\ X_4&=&\left\{ \begin{array}{l} 1,\text{ if the insured is over 65 years old},\\ 0,\text{ otherwise.} \end{array} \right. \end{eqnarray*}\] For each variable, the most represented category in the portfolio is chosen as the reference level (i.e., the one for which all binary variables used to code it simultaneously have a value of 0). The results are then interpreted as over- or under-representation relative to this reference class. Thus, the vector (0,1,1,0) represents a male policyholder under 30 years old driving a sporty vehicle.
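The binary coding of Example 10.2 can be reproduced, for instance, with pandas; the three-policyholder data frame below is hypothetical, and the reference levels are fixed explicitly so that the dropped column of each factor matches the example.

```python
import pandas as pd

# Hypothetical portfolio extract: one row per policyholder.
df = pd.DataFrame({
    "gender":    ["male", "female", "male"],
    "sporty":    ["no", "yes", "yes"],
    "age_class": ["30-65", "under 30", "over 65"],
})

# Fix the reference level of each factor (male, non-sporty, 30-65) so that
# it is the first category and therefore the one dropped by drop_first.
df["gender"]    = pd.Categorical(df["gender"], categories=["male", "female"])
df["sporty"]    = pd.Categorical(df["sporty"], categories=["no", "yes"])
df["age_class"] = pd.Categorical(df["age_class"],
                                 categories=["30-65", "under 30", "over 65"])

# Each factor with k levels becomes k-1 binary columns.
X = pd.get_dummies(df, drop_first=True, dtype=int)
print(X)
# A male policyholder under 30 driving a sporty vehicle is coded (0, 1, 1, 0).
```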

10.3 Basic Principles of Statistics

This section briefly recalls elementary statistical methods. Our presentation is intuitive and informal. For more details, we refer the reader, for example, to (Monfort 1982).

10.3.1 Empirical Cumulative Distribution Function

10.3.1.1 Ungrouped Data

Suppose we have observations \(x_1,x_2,\ldots, x_n\) and consider them as realizations of random variables \(X_1,X_2,\ldots, X_n\), which are independent and have the same cumulative distribution function \(F\). These observations could represent the claims cost reported to the company.

The empirical cumulative distribution function, denoted as \({\hat F}_n\), provides an idea of the shape of \(F\) based on the observations \(x_1\), \(x_2\), \(\ldots\), \(x_n\). It is obtained by assigning a probability mass of \(1/n\) to each of the \(x_i\), i.e., \[ {\hat F}_n(x)=\frac{\#\{x_i \text{ such that }x_i\leq x\}}{n},\hspace{2mm} x\in \mathbb{R}. \] In other words, \({\hat F}_n(x)\) is the proportion of observations in the sample that are less than or equal to \(x\in \mathbb{R}\). The function \(x\mapsto{\hat F}_n(x)\) is a “staircase” function, with a jump of size \(k/n\) at each value that appears \(k\) times in the sample.

The empirical approach involves using \({\hat F}_n\) instead of \(F\) for all actuarial calculations. This approach is justified by the Glivenko-Cantelli theorem, which ensures that \[ \Pr\Big[\sup_{x\in \mathbb{R}}|{\hat F}_n(x)-F(x)|\to 0 \text{ as }n\text{ tends to }+\infty\Big]=1. \] In other words, the graph of \({\hat F}_n\) fits the graph of \(F\) better as the number of observations \(n\), i.e., our information, increases; therefore, the graph of \({\hat F}_n\) should provide a good approximation of that of \(F\) for a “large” \(n\).

The distribution of \(n\widehat{F}_n(x)\) is \(\text{Binomial}(n, F(x))\). Thus, according to the central limit theorem, for any \(x\), \[ \sqrt{n}\Big(\widehat{F}_n(x)-F(x)\Big)\xrightarrow{\text{d}}\mathcal{N}(0,\sigma_x^2) \text{ as }n\to +\infty, \] where \(\sigma_x^2=F(x)(1-F(x))\). For a sufficiently large \(n\), this allows us to obtain a confidence interval for the unknown value \(F(x)\).
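As an illustration, the sketch below (with hypothetical claim amounts) computes \({\hat F}_n\) and a pointwise normal-approximation confidence interval for \(F(x)\); with so few observations the interval is of course only indicative.

```python
import numpy as np

def ecdf(sample):
    """Return a function x -> F_n(x), the empirical cdf of `sample`."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = xs.size
    return lambda x: np.searchsorted(xs, x, side="right") / n

# Hypothetical claim amounts.
claims = np.array([141, 16, 46, 40, 351, 259, 317, 1511, 107, 567], dtype=float)
F_n = ecdf(claims)

# Pointwise confidence interval for F(x0), based on n F_n(x0) ~ Bin(n, F(x0)).
x0, n, z = 300.0, claims.size, 1.96
p_hat = F_n(x0)
half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
print(f"F_n({x0}) = {p_hat:.2f}, approx. 95% CI: "
      f"[{max(0, p_hat - half_width):.2f}, {min(1, p_hat + half_width):.2f}]")
```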

Remark:

One can also “smooth” the empirical cumulative distribution function and derive an estimate of the underlying density. This leads to a kernel density estimator defined as follows: \[ \widehat{f}_K(x)=\frac{1}{nh}\sum_{i=1}^nK\left(\frac{x_i-x}{h}\right), \] where \(K\) is called the kernel, and \(h\) is a positive parameter (called the bandwidth) that determines the level of smoothing. As a first choice, one can take a Gaussian kernel, i.e., \(K(x)=\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)\). Note that \(\widehat{f}_K\) is a consistent estimator of the density as \(n\to +\infty\) and \(h\to 0\) at an adequate rate.
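A minimal implementation of this kernel estimator, assuming a Gaussian kernel and a bandwidth picked by hand, could look as follows.

```python
import numpy as np

def gaussian_kde(sample, h):
    """Kernel density estimate (1/(n h)) sum K((x_i - x) / h) with Gaussian K."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    def f_hat(x):
        u = (sample - x) / h
        return np.exp(-0.5 * u**2).sum() / (n * h * np.sqrt(2 * np.pi))
    return f_hat

claims = [141, 16, 46, 40, 351, 259, 317, 1511, 107, 567]
f_hat = gaussian_kde(claims, h=100.0)   # bandwidth chosen by eye, for illustration
print(f_hat(200.0))                     # estimated density at x = 200
```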

Grouped Data:

In many cases, the data available to the actuary is grouped into classes of varying widths. This implies that the actuary does not have individual observations \(x_1, x_2, \ldots, x_n\). Instead, if we denote \(c_0<c_1<c_2<\ldots<c_r\) as the boundaries of the \(r\) classes, it is known that there are \(n_j\) claims with amounts between \(c_{j-1}\) and \(c_j\), for \(j=1,2,\ldots,r\). Here, \(n_j\) is the frequency of class \(C_j=]c_{j-1},c_j]\). Often, \(c_0=0\), so the first class is something like “claims with amounts \(\leq c_1\).” Sometimes, the upper limit \(c_r\) of \(C_r\) is not specified, and the last class is “claims with amounts \(>c_{r-1}\),” assuming that \(c_r=+\infty\). Sometimes, the average \(m_j\) of the amounts of claims falling into class \(C_j\) is also provided.

Grouping data means that the empirical cumulative distribution function \({\hat F}_n\) is only precisely known at the class boundaries \(c_j\), where it takes the values \({\hat F}_n(c_0)=0\) and \[ {\hat F}_n(c_j)=\left\{ \begin{array}{l} 0,\text{ if }j=0\\ \frac{1}{n}\sum_{i=1}^jn_i,\hspace{2mm} j=1,2,\ldots,r. \end{array} \right. \] To approximate \({\hat F}_n\), linear interpolation is used on the segments \(]c_{j-1},c_j]\) to obtain \[ {\hat F}_n(x)= \left\{ \begin{array}{l} 0,\text{ if }x<c_0, \\ \frac{(c_j-x){\hat F}_n(c_{j-1})+(x-c_{j-1}){\hat F}_n(c_j)} {c_j-c_{j-1}},\text{ if }c_{j-1}\leq x<c_j, \\ 1,\text{ if }x\geq c_r. \end{array} \right. \] It is noteworthy that in the grouped case, \({\hat F}_n\) is not defined on \(]c_{r-1},+\infty[\) when \(c_r=+\infty\), unless \(n_r=0\).

Remark:

Since \({\hat F}_n\) is now a piecewise linear function, \({\hat F}_n\) is differentiable everywhere, except at the ends of the classes \(c_0,c_1,\ldots,c_r\), where the right and left derivatives still exist. This derivative estimates the density function \(f\) associated with \(F\). We will denote this estimator as \({\hat f}_n\), also called the histogram. The estimator \({\hat f}_n\) is given by \[ {\hat f}_n(x)= \left\{ \begin{array}{l} 0,\text{ if }x<c_0, \\ \frac{{\hat F}_n(c_{j})-{\hat F}_n(c_{j-1})} {c_j-c_{j-1}}=\frac{n_j}{n(c_j-c_{j-1})},\text{ if }c_{j-1}\leq x<c_j, \\ 0,\text{ if }x\geq c_r. \end{array} \right. \] The graph of the function \(x\mapsto{\hat f}_n(x)\) appears as a sequence of blocks this time. The area under the graph of \({\hat f}_n\) is 1 by construction, except when \(c_r=+\infty\), in which case we cannot represent the probability of class \(C_r\). It is important to note that the area, not the height, of the blocks is proportional to the number of observations in each class. This accounts for the length of the classes in the calculation of relative frequencies.
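The sketch below, based on hypothetical class boundaries and counts, implements the interpolated \({\hat F}_n\) and the histogram estimator \({\hat f}_n\) defined above.

```python
import numpy as np

# Hypothetical grouped claim data: boundaries c_0 < ... < c_r and counts n_j.
bounds = np.array([0, 250, 500, 1000, 2000], dtype=float)
counts = np.array([6, 2, 1, 1], dtype=float)
n = counts.sum()

F_at_bounds = np.concatenate(([0.0], np.cumsum(counts) / n))   # F_n(c_j)

def F_grouped(x):
    """Linearly interpolated empirical cdf built from the grouped data."""
    return np.interp(x, bounds, F_at_bounds)

def f_grouped(x):
    """Histogram density: n_j / (n (c_j - c_{j-1})) on the class containing x."""
    j = np.searchsorted(bounds, x, side="right") - 1
    if j < 0 or j >= counts.size:
        return 0.0
    return counts[j] / (n * (bounds[j + 1] - bounds[j]))

print(F_grouped(750.0), f_grouped(750.0))
```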

In general, for the sake of precision, the actuary will keep the raw data \(x_1, x_2, \ldots, x_n\) for calculations but will often use grouping for presenting results.

10.3.1.2 Estimation of Main Parameters

Once we have \({\hat F}_n\), we can easily estimate the main parameters of \(F\), namely, the mean, variance, coefficient of variation, and various quantiles. In each case, it is sufficient to replace the unknown cumulative distribution function \(F\) with its empirical counterpart \({\hat F}_n\).

The Mean

To estimate the mean, we naturally use the sample mean: \[ {\hat\mu}_1=\int_{x\in \mathbb{R}}xd{\hat F}_n(x) =\left\{ \begin{array}{l} \frac{1}{n}\sum_{i=1}^nx_i\text{ (ungrouped case)}\\ \sum_{j=1}^r\frac{n_j(c_j+c_{j-1})}{2n}\text{ (grouped case)}. \end{array} \right. \] Often, we use the notation \(\overline{x}\) instead of \({\hat\mu}_1\). The sample mean \(\overline{x}\) represents the center of gravity of the data and is highly sensitive to extreme values.

The Variance

Denoting \[ {\hat\mu}_2=\int_{x\in \mathbb{R}}x^2d{\hat F}_n(x) =\left\{ \begin{array}{l} \frac{1}{n}\sum_{i=1}^nx_i^2\text{ (ungrouped case)}\\ \sum_{j=1}^r\frac{n_j(c_j^3-c_{j-1}^3)}{3n(c_j-c_{j-1})} \text{ (grouped case)} \end{array} \right. \] the natural estimator for the variance is then given by \(s^2={\hat\mu}_2-{\hat\mu}_1^2\). The observed standard deviation is the positive square root of the sample variance \(s^2\), denoted as \(s\).

The Coefficient of Variation

Often, actuaries use the coefficient of variation \(cv\), defined as the ratio of the sample standard deviation to the sample mean, i.e., \(cv=\frac{s}{\overline{x}}\). The coefficient of variation has the advantage of being dimensionless, making comparisons easier (e.g., when dealing with different currency units). It can be seen as a normalization of the standard deviation.

Quantiles

The empirical quantiles, denoted as \({\hat q}_p\), are simply obtained using the relation: \[ {\hat q}_p={\hat F}_n^{-1}(p)=\inf\{x\in\mathbb{R}|{\hat F}_n(x)\geq p\},\hspace{2mm}0<p<1. \] In the ungrouped case, \({\hat F}_n^{-1}\) maps \(p\), \(0<p<1\), to the smallest observation that leaves at least \(100p\%\) of the data to its left. If we denote \(x_{(1)},x_{(2)},\ldots,x_{(n)}\) as the observations arranged in ascending order, i.e., \(x_{(1)}\leq x_{(2)}\leq\ldots\leq x_{(n)}\), then \[ {\hat q}_p=x_{(i)}\text{ for }p\in\left]\frac{i-1}{n},\frac{i}{n} \right]. \] In the grouped case, the estimation of the \(p\)th quantile is done as follows. First, we determine \(j_0\) such that \[ {\hat F}_n(c_{j_0-1})\leq p<{\hat F}_n(c_{j_0}). \] Then, we have \[ {\hat F}_n({\hat q}_p)=\frac{(c_{j_0}-{\hat q}_p) {\hat F}_n(c_{j_0-1})+({\hat q}_p-c_{j_0-1}) {\hat F}_n(c_{j_0})}{c_{j_0}-c_{j_0-1}}=p, \] from which we obtain \[ {\hat q}_p=\frac{c_{j_0}-c_{j_0-1}}{{\hat F}_n(c_{j_0})- {\hat F}_n(c_{j_0-1})}p- \frac{c_{j_0}{\hat F}_n(c_{j_0-1})-c_{j_0-1} {\hat F}_n(c_{j_0})}{{\hat F}_n(c_{j_0})- {\hat F}_n(c_{j_0-1})}. \]
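Both quantile rules can be coded directly, as sketched below; the individual amounts and the grouped data are the same hypothetical figures used in the previous sketches.

```python
import numpy as np

def quantile_ungrouped(sample, p):
    """hat q_p = x_(i) for p in ]((i-1)/n, i/n]."""
    xs = np.sort(np.asarray(sample, dtype=float))
    i = int(np.ceil(p * xs.size))
    return xs[max(i, 1) - 1]

def quantile_grouped(bounds, counts, p):
    """Invert the linearly interpolated cdf built from class counts."""
    F = np.concatenate(([0.0], np.cumsum(counts) / np.sum(counts)))
    j = max(np.searchsorted(F, p, side="left"), 1)   # first j with F(c_j) >= p
    c_lo, c_hi, F_lo, F_hi = bounds[j - 1], bounds[j], F[j - 1], F[j]
    return c_lo + (p - F_lo) * (c_hi - c_lo) / (F_hi - F_lo)

claims = [141, 16, 46, 40, 351, 259, 317, 1511, 107, 567]
print(quantile_ungrouped(claims, 0.5))                         # empirical median
print(quantile_grouped([0, 250, 500, 1000, 2000], [6, 2, 1, 1], 0.5))
```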

10.3.2 The Parametric Approach

The empirical approach may not provide answers to all the questions actuaries have. To illustrate this point, consider the following example.

Example 10.3 An insurer offers policies with a mandatory deductible of 50 Euros per claim. The insurer has recorded the following claim costs (in Euros): 141, 16, 46, 40, 351, 259, 317, 1,511, 107, and 567. The average amount \(\overline{x}\) paid by the insurer per claim is 335.5 Euros. Now, suppose the insurer increases the mandatory deductible to 100 Euros. We can then estimate the average claim cost as \[ \frac{91+301+209+267+1461+57+517}{7}=414.71. \] The insurer, who previously paid 3,355 Euros to compensate policyholders, now only needs to pay 2,903 Euros.

The change from a mandatory deductible of 50 to 100 Euros would reduce costs by \[ \frac{3,355 - 2,903}{3,355} = 13.47\%. \] Now, let’s assume the insurer instead wants to remove the mandatory deductible clause, for example, to gain new market share. The empirical approach does not allow us to evaluate the additional cost incurred by this contractual change. The reason is that the claim amounts recorded will be increased by 50 Euros each, bringing the total cost to 3,855 Euros. However, it’s important to note that the claims with costs less than 50 Euros, for which we have no information (as they were not reported to the company), will now become eligible for indemnification. All the empirical approach can tell us is that the additional cost incurred by removing the mandatory deductible clause is at least \[ \frac{500}{3,355} = 14.9\%. \]

Now, suppose the company introduces an upper limit on indemnification, set at 1,500 Euros per claim, for example, in the policy’s terms and conditions. This would not lead to any change in the insurer’s average cost because no claims exceeding 1,500 Euros have been observed. However, any rational policyholder would most likely expect a premium reduction to offset the increased risk on their part or, at the very least, some form of benefit.

Next, suppose that claim amounts are subject to a 10% inflation. What can be said about the average claim cost? We can easily obtain inflation-adjusted claim amounts as follows: \[\begin{eqnarray*} 1.1(141+50)-50 & = & 160.1\\ 1.1(16+50)-50 & = & 22.6\\ \text{etc.} & &, \end{eqnarray*}\] which provides the new series 160.1, 22.6, 55.6, 49, 391.1, 289.9, 353.7, 1,667.1, 122.7, and 628.7. One might be tempted to claim that the average amount paid by the company per claim now amounts to 374.05 Euros. However, this would ignore the claims with original amounts between 45.45 and 50 Euros (since \(1.1\times 45.45\approx 50\)), which were not reported to the insurer due to the mandatory deductible of 50 Euros but are now covered due to inflation. Inflation not only leads to a modification of claim amounts but also to a change in the number of claims reported to the company.

The alternative to address the unanswered questions in the above example is the parametric approach. In this approach, it is assumed that the unknown cumulative distribution function \(F\) of the amount paid by the insurer following a claim is part of a family \(\mathcal{F}=\{F_{\boldsymbol{\theta}}, \boldsymbol{\theta}\in\Theta\}\) of cumulative distribution functions with known analytical forms but depending on one or more unknown parameters \(\boldsymbol{\theta}\). Here, \(\Theta\) represents the parameter space, i.e., the set of all permissible values for the parameter. Note that \(\boldsymbol{\theta}\) can be either one-dimensional or multidimensional, depending on the parametric family under consideration. Therefore, identifying \(F\) in \(\mathcal{F}\) means determining the value of \(\boldsymbol{\theta}\) such that \(F\equiv F_{\boldsymbol{\theta}}\).

To fully understand the complementarity of parametric and empirical approaches, let’s continue the study of the scenario presented above from a parametric perspective.

Example 10.4 (Continuation of Example 10.3) Suppose we can validly consider that the amount of a claim follows a negative exponential distribution with mean \(\theta\). Let \(X\) be the claim amount, and \(Y\) be the amount paid by the insurer. A convenient way to determine the value of \(\theta\) is to equate \(\mathbb{E}[Y]\) and \(\overline{x}\) because \[\begin{eqnarray*} \mathbb{E}[Y] & = & \mathbb{E}[X-50|X>50] \\ & = & \int_{x=50}^{+\infty}(x-50)\frac{\exp(-x/\theta)} {\theta\exp(-50/\theta)}dx=\theta. \end{eqnarray*}\] Thus, we choose \(\theta=335.5\). Following inflation, the average claim cost for the insurer becomes \[ \mathbb{E}[1.1X-50|1.1X>50]=369.05. \] Furthermore, the probability that a claim falls under the coverage was \[ \Pr[X>50]=0.86154 \] without inflation, whereas it becomes \[ \Pr[1.1X>50]=0.87329 \] after the effect of inflation.

As demonstrated by this very simple example, the parametric approach allows answering all the questions posed by the insurer.
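The figures in Example 10.4 are easy to reproduce; the sketch below simply evaluates the closed-form exponential expressions, using the memorylessness property \(\mathbb{E}[X-d\,|\,X>d]=\theta\).

```python
import numpy as np

theta = 335.5        # exponential mean, obtained by matching E[Y] to x-bar
d = 50.0             # mandatory deductible
inflation = 1.1

# Memorylessness: E[X - d | X > d] = theta for the exponential distribution.
mean_paid_before = theta
# 1.1 X is exponential with mean 1.1 theta, so E[1.1 X - d | 1.1 X > d] = 1.1 theta.
mean_paid_after = inflation * theta

p_report_before = np.exp(-d / theta)                 # Pr[X > 50]
p_report_after = np.exp(-d / (inflation * theta))    # Pr[1.1 X > 50]

print(mean_paid_after)                  # 369.05
print(p_report_before, p_report_after)  # 0.8615..., 0.8732...
```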

10.4 Fisher’s Information

10.4.1 Conditions on the Statistical Model

Consider a parametric statistical model \(\mathcal{F}=\{F_{\boldsymbol{\theta}},\hspace{2mm}\boldsymbol{\theta}\in \Theta\}\) with a common support \(\mathcal{S}\), where the parameter space \(\Theta\) is an open subset of \({\mathbb{R}}^p\). Let \(f_{\boldsymbol{\theta}}\) be the probability density (discrete or continuous) associated with \(F_{\boldsymbol{\theta}}\). We will now assume that the following three conditions are satisfied:

  1. \(f_{\boldsymbol{\theta}}(x)>0\) for all \(x\in\mathcal{S}\) and \({\boldsymbol{\theta}}\in\Theta\);
  2. \(\frac{\partial}{\partial{\boldsymbol{\theta}}}f_{\boldsymbol{\theta}}(x)\) exists for all \(x\in\mathcal{S}\) and \({\boldsymbol{\theta}}\in\Theta\);
  3. for any \({\boldsymbol{\theta}}\) in \(\Theta\), we can differentiate \(\int_{x\in A}dF_{\boldsymbol{\theta}}(x)\) under the integral sign with respect to the components of \({\boldsymbol{\theta}}\).

10.4.2 Definition

Consider a random variable \(X\) with distribution function \(F_{\boldsymbol{\theta}}\), for \(\boldsymbol{\theta}\in\Theta\). The Fisher information of the model is defined as the variance-covariance matrix, if it exists, of the random vector \[ \frac{\partial}{\partial\boldsymbol{\theta}}\ln f_{\boldsymbol{\theta}}(X)=\left( \frac{\partial}{\partial\theta_1}\ln f_{\boldsymbol{\theta}}(X),\ldots, \frac{\partial}{\partial\theta_p}\ln f_{\boldsymbol{\theta}}(X)\right)^\top; \] this matrix will be denoted as \(\mathcal{I}({\boldsymbol{\theta}})\).

Proposition 10.1 The vector \(\frac{\partial}{\partial\boldsymbol{\theta}}\ln f_{\boldsymbol{\theta}}(X)\) is centered, i.e. \[ \mathbb{E}\left[\frac{\partial}{\partial\boldsymbol{\theta}}\ln f_{\boldsymbol{\theta}}(X)\right]=\boldsymbol{0}. \]

Proof. Consider the continuous case. Starting from \[ \int_{x\in{\mathbb{R}}}f_{\boldsymbol{\theta}}(x)dx=1 \] we can differentiate both sides with respect to \(\theta_i\), \(i=1,2,\ldots,p\), and obtain \[\begin{eqnarray*} 0&=&\frac{\partial}{\partial\theta_i}\int_{x\in{\mathbb{R}}}f_{\boldsymbol{\theta}}(x)dx\\ &=&\int_{x\in{\mathbb{R}}}\frac{\partial}{\partial\theta_i}f_{\boldsymbol{\theta}}(x)dx\\ &=&\int_{x\in{\mathbb{R}}}\left\{\frac{\partial}{\partial\theta_i}\ln f_{\boldsymbol{\theta}}(x)\right\} f_{\boldsymbol{\theta}}(x)dx\\ &=& \mathbb{E}\left[\frac{\partial}{\partial\theta_i}\ln f_{\boldsymbol{\theta}}(X)\right]. \end{eqnarray*}\]

Therefore, Proposition 10.1 allows us to obtain the expression for \(\mathcal{I}({\boldsymbol{\theta}})\) since \[ \mathbb{C}\left[\frac{\partial}{\partial\theta_i}\ln f_{\boldsymbol{\theta}}(X), \frac{\partial}{\partial\theta_j}\ln f_{\boldsymbol{\theta}}(X)\right] =\mathbb{E}\left[\frac{\partial}{\partial\theta_i}\ln f_{\boldsymbol{\theta}}(X) \frac{\partial}{\partial\theta_j}\ln f_{\boldsymbol{\theta}}(X)\right]. \]

This can be expressed in matrix form as

\[ \mathcal{I}({\boldsymbol{\theta}})=\mathbb{E}\left[\frac{\partial}{\partial\boldsymbol{\theta}} \ln f_{\boldsymbol{\theta}}(X)\left(\frac{\partial}{\partial\boldsymbol{\theta}} \ln f_{\boldsymbol{\theta}}(X)\right)^\top\right]. \]

10.4.2.1 Alternative Expression

In general, as long as \(\int_{x\in{\mathbb{R}}}dF_{\boldsymbol{\theta}}(x)\) is twice differentiable under the integral sign, we obtain two equivalent expressions for \(\mathcal{I}({\boldsymbol{\theta}})\). Specifically, in the continuous case,

\[\begin{eqnarray*}
0&=&\frac{\partial^2}{\partial\theta_i\partial\theta_j}\int_{x\in{\mathbb{R}}}f_{\boldsymbol{\theta}}(x)dx\\
&=&\int_{x\in{\mathbb{R}}}\frac{\partial^2}{\partial\theta_i\partial\theta_j}f_{\boldsymbol{\theta}}(x)dx\\
&=&\int_{x\in{\mathbb{R}}}\frac{\frac{\partial^2}{\partial\theta_i\partial\theta_j}f_{\boldsymbol{\theta}}(x)}
{f_{\boldsymbol{\theta}}(x)}f_{\boldsymbol{\theta}}(x)dx\\
&=&\mathbb{E}\left[\frac{\frac{\partial^2}{\partial\theta_i\partial\theta_j}f_{\boldsymbol{\theta}}(X)}
{f_{\boldsymbol{\theta}}(X)}\right].
\end{eqnarray*}\]

This leads to

\[\begin{eqnarray*} \mathbb{E}\left[\frac{\partial^2}{\partial\theta_i\partial\theta_j}\ln f_{\boldsymbol{\theta}}(X)\right] &=&\mathbb{E}\left[\frac{\partial}{\partial\theta_i}\frac{\frac{\partial}{\partial\theta_j}f_{\boldsymbol{\theta}}(X)} {f_{\boldsymbol{\theta}}(X)}\right] \\ &=& \mathbb{E}\left[\frac{\frac{\partial^2}{\partial\theta_i\partial\theta_j}f_{\boldsymbol{\theta}}(X)} {f_{\boldsymbol{\theta}}(X)}\right]-\mathbb{E}\left[\frac{\frac{\partial}{\partial\theta_i}f_{\boldsymbol{\theta}}(X) \frac{\partial}{\partial\theta_j}f_{\boldsymbol{\theta}}(X)} {\{f_{\boldsymbol{\theta}}(X)\}^2}\right]\\ &=&-\mathbb{E}\left[\frac{\frac{\partial}{\partial\theta_i}f_{\boldsymbol{\theta}}(X) \frac{\partial}{\partial\theta_j}f_{\boldsymbol{\theta}}(X)} {\{f_{\boldsymbol{\theta}}(X)\}^2}\right]\\ &=&-\mathbb{E}\left[\frac{\partial}{\partial\theta_i}\ln f_{\boldsymbol{\theta}}(X) \frac{\partial}{\partial\theta_j}\ln f_{\boldsymbol{\theta}}(X)\right]. \end{eqnarray*}\]

We can then write this in matrix form as

\[\begin{eqnarray} \mathcal{I}(\boldsymbol{\theta})&=&\mathbb{E}\left[\frac{\partial}{\partial\boldsymbol{\theta}} \ln f_{\boldsymbol{\theta}}(X)\left(\frac{\partial}{\partial\boldsymbol{\theta}} \ln f_{\boldsymbol{\theta}}(X)\right)^\top\right]\nonumber\\ &=&-\mathbb{E}\left[\frac{\partial^2}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^\top} \ln f_{\boldsymbol{\theta}}(X)\right]. \tag{10.1} \end{eqnarray}\]

So, Fisher’s information \(\mathcal{I}(\boldsymbol{\theta})\) can be obtained either as the expectation of the product of the gradient vectors of \(\ln f_{\boldsymbol{\theta}}(X)\) or as the negative expectation of the Hessian matrix of \(\ln f_{\boldsymbol{\theta}}(X)\).
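As a numerical check of the two equivalent forms in (10.1), the following sketch uses the exponential distribution with mean \(\theta\) (a single parameter), for which \(\mathcal{I}(\theta)=1/\theta^2\); the Monte Carlo sample size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
theta = 335.5
x = rng.exponential(theta, size=200_000)

# Exponential with mean theta: ln f(x) = -ln(theta) - x / theta.
score = -1.0 / theta + x / theta**2             # d/dtheta ln f_theta(x)
hessian = 1.0 / theta**2 - 2.0 * x / theta**3   # d^2/dtheta^2 ln f_theta(x)

print("E[score]            :", score.mean())        # ~ 0 (Proposition 10.1)
print("E[score^2]          :", np.mean(score**2))   # ~ 1 / theta^2
print("-E[Hessian]         :", -hessian.mean())     # ~ 1 / theta^2
print("analytic 1/theta^2  :", 1.0 / theta**2)
```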

10.4.2.2 Kullback Information

Let \(\widetilde{\boldsymbol{\theta}}\) be the true parameter value, and define, for outcome \(x\), the discriminative power of \(x\) between the true value \(\widetilde{\boldsymbol{\theta}}\) of the parameter and another possible value \({\boldsymbol{\theta}}\) for the parameter as

\[\begin{equation} \ln\frac{f_{\widetilde{\boldsymbol{\theta}}}(x)}{f_{\boldsymbol{\theta}}(x)}. \tag{10.2} \end{equation}\]

We should interpret (10.2) as the logarithm of the ratio between the “likelihood” of observing \(x\) for the true parameter value \(\widetilde{\boldsymbol{\theta}}\) and the “likelihood” of observing \(x\) if the parameter is \({\boldsymbol{\theta}}\). For values of \(x\) such that \(f_{\boldsymbol{\theta}}(x)>f_{\widetilde{\boldsymbol{\theta}}}(x)\), the quantity (10.2) is negative, which is natural since, in such cases, the actuary would be more inclined to favor \({\boldsymbol{\theta}}\).

Kullback proposed defining an “average discriminative power” or “average discriminative information” as follows: the Kullback information from \(\widetilde{\boldsymbol{\theta}}\) to \({\boldsymbol{\theta}}\) is defined as

\[ \mathbb{E}_{\widetilde{\boldsymbol{\theta}}} \left[\ln\frac{f_{\widetilde{\boldsymbol{\theta}}}(X)}{f_{\boldsymbol{\theta}}(X)}\right] =\int_{x\in{\mathbb{R}}}\ln\frac{f_{\widetilde{\boldsymbol{\theta}}}(x)} {f_{\boldsymbol{\theta}}(x)}f_{\widetilde{\boldsymbol{\theta}}}(x)dx. \]

This is denoted as \(\mathcal{I}(\widetilde{\boldsymbol{\theta}}|{\boldsymbol{\theta}})\). We can easily see that \(\mathcal{I}(\widetilde{\boldsymbol{\theta}}|{\boldsymbol{\theta}})\geq 0\); indeed,

\[ \mathcal{I}(\widetilde{\boldsymbol{\theta}}|{\boldsymbol{\theta}})=- \mathbb{E}_{\widetilde{\boldsymbol{\theta}}} \left[\ln\frac{f_{\boldsymbol{\theta}}(X)}{f_{\widetilde{\boldsymbol{\theta}}}(X)}\right]\geq -\ln \mathbb{E}_{\widetilde{\boldsymbol{\theta}}} \left[\frac{f_{\boldsymbol{\theta}}(X)}{f_{\widetilde{\boldsymbol{\theta}}}(X)}\right] \] by Jensen’s inequality applied to the convex function \(x\mapsto-\ln x\). Furthermore, \[ \mathbb{E}_{\widetilde{\boldsymbol{\theta}}} \left[\frac{f_{\boldsymbol{\theta}}(X)}{f_{\widetilde{\boldsymbol{\theta}}}(X)}\right] =\int_{x\in{\mathbb{R}}}f_{\boldsymbol{\theta}}(x)dx=1, \] which completes the justification. Moreover, \[ \mathcal{I}(\widetilde{\boldsymbol{\theta}}|{\boldsymbol{\theta}})=0 \Leftrightarrow f_{\widetilde{\boldsymbol{\theta}}}\equiv f_{\boldsymbol{\theta}}. \] Note, however, that \(\mathcal{I}(\widetilde{\boldsymbol{\theta}}|{\boldsymbol{\theta}})\) is generally not a metric because it is not necessarily symmetric and does not always satisfy the triangle inequality.

10.4.3 Parameter Estimation Using Maximum Likelihood Method

10.4.3.1 Likelihood Function

From now on, we will consider identifiable models, i.e., models for which the mapping \({\boldsymbol{\theta}}\mapsto F_{\boldsymbol{\theta}}\) is injective. Suppose we have selected the parametric family \(\{F_{\boldsymbol{\theta}},\hspace{2mm}\boldsymbol{\theta}\in\Theta\subseteq\mathbb{R}^p\}\). We seek to determine the most plausible value of \({\boldsymbol{\theta}}\) given the observations \(x_1,x_2,\ldots,x_n\) we have; we will denote this value as \(\hat{\boldsymbol{\theta}}\), which is the estimation of the parameter of interest.

The maximum likelihood method requires the definition of the likelihood function, denoted as \(\mathcal{L}({\boldsymbol{\theta}}|\boldsymbol{x})\), which is given by

  1. In the ungrouped case: \[ \mathcal{L}({\boldsymbol{\theta}}|\boldsymbol{x})=\prod_{j=1}^nf_{\boldsymbol{\theta}}(x_j), \] where \(f_{\boldsymbol{\theta}}\) is the probability density (discrete or continuous) associated with \(F_{\boldsymbol{\theta}}\);
  2. In the grouped case: \[ \mathcal{L}({\boldsymbol{\theta}}|\boldsymbol{x})=\prod_{j=1}^r(F_{\boldsymbol{\theta}}(c_j) -F_{\boldsymbol{\theta}}(c_{j-1}))^{n_j}. \]

Intuitively, \(\mathcal{L}(\boldsymbol{\theta}|\boldsymbol{x})\) should be considered as the “chance” of observing the values \(x_1,x_2, \ldots,x_n\) for the value \(\boldsymbol{\theta}\) of the parameter. It is very important to note that \(\mathcal{L}(\boldsymbol{\theta}|\boldsymbol{x})\) is a function of \(\boldsymbol{\theta}\), given that the observations \(\boldsymbol{x}=(x_1,x_2,\ldots,x_n)^\top\) are fixed, while the joint density of observations \(f_{\boldsymbol{\theta}}(\boldsymbol{x})=\prod_{i=1}^nf_{\boldsymbol{\theta}}(x_i)\) is a function of the observations \(x_1,x_2,\ldots,x_n\), parametrized by \(\boldsymbol{\theta}\).

10.4.3.2 Estimation Method

The maximum likelihood estimator \(\widehat{\boldsymbol{\theta}}\) of \(\boldsymbol{\theta}\) is obtained by maximizing the “chance” of observing \(x_1,x_2,\ldots,x_n\), i.e.

\[\begin{equation} \widehat{\boldsymbol{\theta}}=\arg\max_{\boldsymbol{\theta}\in\Theta} \mathcal{L}(\boldsymbol{\theta}|\boldsymbol{x}). \tag{10.4} \end{equation}\]

In essence, this is a fairly intuitive method when you have a reliable sample: you estimate the parameter \(\boldsymbol{\theta}\) as the value \(\widehat{\boldsymbol{\theta}}\) that maximizes the probability of obtaining the observations \(x_1,x_2,\ldots,x_n\) from the sample. In practice, it is often easier to work with the logarithm before the maximization step. Thus, we define the log-likelihood \(L(\boldsymbol{\theta}|\boldsymbol{x})\) associated with the sample, given by

\[ L(\boldsymbol{\theta}|\boldsymbol{x})=\ln\mathcal{L}(\boldsymbol{\theta}|\boldsymbol{x}). \]

The maximum likelihood estimator \(\widehat{\boldsymbol{\theta}}\) of \(\boldsymbol{\theta}\) is obtained through the maximization program

\[\begin{equation} \widehat{\boldsymbol{\theta}}=\arg\max_{\boldsymbol{\theta}\in\Theta} L(\boldsymbol{\theta}|\boldsymbol{x}). \tag{10.5} \end{equation}\]

Remark (Censoring). In some cases, the likelihood function can take forms other than those mentioned above. In most branches of insurance, the amount of compensation paid by the insurer does not exactly correspond to the loss suffered by the insured, due to the introduction of deductibles or compulsory excess clauses in the policy terms, or the specification of a coverage limit. This has a significant impact on estimation methods. Indeed, the actuary wishes to model the amount of the loss, not the insurer’s payout. For example, if the insurer has introduced a coverage limit, set at \(\omega\) for example, and the actuary has data \(x_1\), \(x_2\), \(\ldots\), \(x_k\), \(\omega\), \(\omega\), \(\ldots\), \(\omega\), then the likelihood \(\mathcal{L}({\boldsymbol{\theta}}|\boldsymbol{x})\) is written in the form

\[ \mathcal{L}({\boldsymbol{\theta}}|\boldsymbol{x})=\big(\overline{F}_{\boldsymbol{\theta}}(\omega)\big)^{n-k} \prod_{j=1}^kf_{\boldsymbol{\theta}}(x_j). \]

Indeed, an observation of \(\omega\) actually means that the amount of the loss was at least \(\omega\), but the insurer paid only a compensation of \(\omega\).
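For instance, under an exponential model for the loss, the censored likelihood above can be maximized numerically; the payments, the limit \(\omega\) and the use of scipy below are illustrative assumptions, and the exponential closed-form solution is printed as a check.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical payments: claims equal to the policy limit omega are censored.
omega = 1500.0
paid = np.array([141, 16, 46, 40, 351, 259, 317, 1500, 107, 567, 1500], dtype=float)
censored = paid >= omega
x, k, n = paid[~censored], int((~censored).sum()), paid.size

def neg_loglik(theta):
    # log L = sum_j ln f_theta(x_j) + (n - k) ln(1 - F_theta(omega)),
    # with f_theta(x) = exp(-x/theta)/theta and 1 - F_theta(w) = exp(-w/theta).
    return -(-k * np.log(theta) - x.sum() / theta - (n - k) * omega / theta)

res = minimize_scalar(neg_loglik, bounds=(1.0, 1e5), method="bounded")
print("numerical MLE :", res.x)
print("closed form   :", (x.sum() + (n - k) * omega) / k)   # exponential case only
```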

10.4.3.3 Properties

Maximum likelihood estimators have excellent theoretical properties. Under fairly general assumptions, they are asymptotically unbiased and asymptotically efficient (i.e., no regular estimator is more precise in large samples), and they make the best use of the information contained in the sample.

Let’s now examine the behavior of maximum likelihood estimators in large samples. Let \(\widehat{\boldsymbol{\theta}}_n\) be the maximum likelihood estimator of \({\boldsymbol{\theta}}\) obtained from a sample of size \(n\). When \(\Theta\) is an open subset of \({\mathbb{R}}^p\), and as long as the model is identifiable, let us show that when the number \(n\) of observations is sufficiently large, \(\widehat{\boldsymbol{\theta}}_n-\boldsymbol{\theta}\) is approximately multivariate normal with mean \(\boldsymbol{0}\) and covariance matrix equal to the inverse of the Fisher information matrix \(\mathcal{I}\), i.e.

\[\begin{equation} \widehat{\boldsymbol{\theta}}_n-\boldsymbol{\theta}\approx\mathcal{N} (\boldsymbol{0},\mathcal{I}^{-1}). \tag{10.6} \end{equation}\]

To justify this statement, suppose that there exists a unique maximum of the log-likelihood and that \(\widehat{\boldsymbol{\theta}}_n\) is close to the true value \(\boldsymbol{\theta}\) of the parameter. Let \(\boldsymbol{U}\) be the gradient vector of the log-likelihood, and \(\boldsymbol{H}\) the corresponding Hessian matrix. A Taylor expansion to the first order then gives

\[ {\boldsymbol{U}}(\boldsymbol{\theta})\approx \underbrace{{\boldsymbol{U}}(\widehat{\boldsymbol{\theta}}_n)}_{=0\text{ by definition of } \widehat{\boldsymbol{\theta}}_n}+{\boldsymbol{H}}(\widehat{\boldsymbol{\theta}}_n) \Big(\boldsymbol{\theta}-\widehat{\boldsymbol{\theta}}_n\Big). \]

Asymptotically, \({\boldsymbol{H}}\) is equal to its mean value \(-\mathcal{I}\), so

\[ {\boldsymbol{U}}(\boldsymbol{\theta})\approx -\mathcal{I} \Big(\boldsymbol{\theta}-\widehat{\boldsymbol{\theta}}_n\Big)=\mathcal{I} \Big(\widehat{\boldsymbol{\theta}}_n-{\boldsymbol{\theta}}\Big) \]

\[ \Rightarrow \widehat{\boldsymbol{\theta}}_n-{\boldsymbol{\theta}}\approx \mathcal{I}^{-1}\boldsymbol{U}(\boldsymbol{\theta}). \]

This last relation allows us to obtain the asymptotic covariance matrix of the maximum likelihood estimator \(\widehat{\boldsymbol{\theta}}_n\) of \(\boldsymbol{\theta}\), which is given by

\[ \mathbb{E}\big[(\widehat{\boldsymbol{\theta}}_n-{\boldsymbol{\theta}})(\widehat{\boldsymbol{\theta}}_n-{\boldsymbol{\theta}})^\top\big] \approx \mathcal{I}^{-1}\underbrace{\mathbb{E}[\boldsymbol{U}\boldsymbol{U}^\top]}_{=\mathcal{I}}\mathcal{I}^{-1}=\mathcal{I}^{-1}. \]

The central limit theorem ensures that \(\boldsymbol{U}(\boldsymbol{\theta})\) is approximately Gaussian (as the sum of \(n\) independent random variables), so, in a large sample, we indeed have (10.6).
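As a simple illustration of (10.6), for an exponential model with mean \(\theta\) the Fisher information of a sample of size \(n\) is \(n/\theta^2\), so an approximate 95% confidence interval for \(\theta\) is \(\widehat{\theta}_n\pm 1.96\,\widehat{\theta}_n/\sqrt{n}\); the data below are hypothetical.

```python
import numpy as np

x = np.array([141, 16, 46, 40, 351, 259, 317, 1511, 107, 567], dtype=float)
n = x.size

theta_hat = x.mean()            # maximum likelihood estimator of the mean theta
se = theta_hat / np.sqrt(n)     # sqrt of I_n(theta_hat)^(-1) = theta_hat^2 / n
print(f"theta_hat = {theta_hat:.1f}, approx. 95% CI: "
      f"[{theta_hat - 1.96 * se:.1f}, {theta_hat + 1.96 * se:.1f}]")
```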

10.4.4 Likelihood Ratio Test

This test is often very useful for answering certain questions regarding the parameters. Suppose that \({\boldsymbol{\theta}}\) is \(p\)-dimensional and that the null hypothesis \(H_0\) imposes \(j\) restrictions on the parameter domain, of the form \(R_i({\boldsymbol{\theta}})=0\), \(i=1,2,\ldots,j\), where each function \(R_i\) has continuous first partial derivatives with respect to the components of \({\boldsymbol{\theta}}\), while the alternative \(H_1\) states that there are no such restrictions. We then compute the constrained maximum likelihood estimator (under \(H_0\)), denoted as \(\widetilde{\boldsymbol{\theta}}\), and the unconstrained maximum likelihood estimator, denoted as \(\widehat{\boldsymbol{\theta}}\). The test statistic \[ \mathcal{RV}_n=2\Big\{L(\widehat{\boldsymbol{\theta}}|\boldsymbol{X})- L(\widetilde{\boldsymbol{\theta}}|\boldsymbol{X})\Big\} \] is approximately chi-squared distributed with \(j\) degrees of freedom (for sufficiently large \(n\)). We reject \(H_0\) if \(\mathcal{RV}_n\) is “too large,” i.e., if \(\mathcal{RV}_n>\chi_{j;1-\alpha}^2\).
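A concrete use of the test, under an exponential model with the single restriction \(H_0:\theta=\theta_0\) (so \(j=1\)), might look as follows; the data and the value of \(\theta_0\) are illustrative.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical claim amounts; H0: exponential mean theta = 250 (one restriction).
x = np.array([141, 16, 46, 40, 351, 259, 317, 1511, 107, 567], dtype=float)
n, theta0 = x.size, 250.0

def loglik(theta):
    return -n * np.log(theta) - x.sum() / theta   # exponential log-likelihood

theta_hat = x.mean()                              # unconstrained MLE
RV = 2.0 * (loglik(theta_hat) - loglik(theta0))   # likelihood ratio statistic
p_value = chi2.sf(RV, df=1)                       # j = 1 restriction
print(f"RV_n = {RV:.3f}, p-value = {p_value:.3f}")
```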

10.4.5 Other Estimation Methods

As we will see later, likelihood equations often do not have explicit solutions. Therefore, numerical methods are used, which proceed through iteration. Thus, an initial value as precise as possible for the parameters is required, which can be obtained using “ad hoc” methods.

“Ad hoc” methods represent a set of widely used practical methods, often without a real theoretical basis, which involve equating a number of sample values (i.e., calculated based on \({\hat F}_n\)) to their population counterparts (i.e., calculated based on \(F_{\boldsymbol{\theta}}\)). The choice of these quantities is guided by practical considerations: they are the ones that the actuary wishes to emphasize because their importance is crucial to the problem being addressed. Among this set of methods, we distinguish the method of moments and the method of quantiles.

10.4.5.1 Method of Moments

Suppose that \({\boldsymbol{\theta}}\) is \(p\)-dimensional. The method of moments consists of equating the first \(p\) observed moments to their theoretical counterparts, i.e., \(\widehat{\boldsymbol{\theta}}\) is the solution of the system \[\begin{equation} \int_{x\in\mathbb{R}}x^j d{\hat F}_n(x)= \int_{x\in\mathbb{R}}x^j dF_{\boldsymbol{\theta}}(x),\hspace{2mm} j=1,2,\ldots,p. \tag{10.7} \end{equation}\] Note that there is no guarantee that the solution of the system will be in \(\Theta\).

10.4.5.2 Method of Quantiles

The method of quantiles, on the other hand, involves selecting a number of observed quantiles, \({\hat q}_{\pi_1}\), \({\hat q}_{\pi_2}\), \(\ldots\), \({\hat q}_{\pi_p}\), say, obtained by \({\hat F}_n^{-1}(\pi_i)={\hat q}_{\pi_i}\), \(i=1,2,\ldots,p\), and then taking \(\widehat{\boldsymbol{\theta}}\) as the solution of the system \[\begin{equation} F_{\boldsymbol{\theta}}({\hat q}_{\pi_i})=\pi_i,\hspace{2mm} i=1,2,\ldots,p. \tag{10.8} \end{equation}\]

Of course, the method of moments and the method of quantiles can be combined by requiring that equations (10.7) be satisfied for \(j=1,2,\ldots,\ell\) and those of (10.8) for \(\ell+1,\ldots,p\).
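The sketch below applies the method of moments to a Gamma model and the method of quantiles (through the median) to an exponential model; both model choices and the data are purely illustrative, and np.quantile uses a slightly different interpolation rule than the definition of \({\hat q}_p\) given earlier.

```python
import numpy as np

x = np.array([141, 16, 46, 40, 351, 259, 317, 1511, 107, 567], dtype=float)

# Method of moments for a Gamma(alpha, scale) model (p = 2 parameters):
# E[X] = alpha * scale and Var[X] = alpha * scale^2.
m1, m2 = x.mean(), np.mean(x**2)
scale = (m2 - m1**2) / m1
alpha = m1 / scale
print("gamma by moments      :", alpha, scale)

# Method of quantiles for an exponential model (p = 1 parameter):
# solve F_theta(hat q_0.5) = 0.5, i.e. 1 - exp(-q / theta) = 0.5.
q50 = np.quantile(x, 0.5)
theta = -q50 / np.log(0.5)
print("exponential by median :", theta)
```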

Finally, it should be noted that “ad hoc” methods should be used with caution. Their undeniable advantage is to provide initial values for iterative algorithms that obtain maximum likelihood estimates.

10.4.5.3 Minimum Distance Type Method

This is another type of approach, based on probabilistic distances: this class of methods consists of choosing \({\boldsymbol{\theta}}\) to minimize a “distance” between \(F_{\boldsymbol{\theta}}\) and \({\hat F}_n\). Here are some interesting special cases in the grouped data context:

  1. Cramér-Von Mises type method: \[ {\hat\theta}=\arg\min_{{\boldsymbol{\theta}}\in\Theta} \sum_{j=1}^rw_j\left(F_{\boldsymbol{\theta}}(c_j)-{\hat F}_n(c_j) \right)^2, \] where the weights \(w_1,w_2,\ldots,w_r\) are selected by the actuary to emphasize certain regions where the quality of the fit is crucial (this is equivalent to fitting the nonlinear function \(F_{\boldsymbol{\theta}}\) to the data points \((c_j,{\hat F}_n(c_j))\) by weighted least squares);
  2. \(\chi^2\) type method: \[ \widehat{\boldsymbol{\theta}}=\arg\min_{{\boldsymbol{\theta}}\in\Theta} \sum_{j=1}^r\frac{\left(F_{\boldsymbol{\theta}}(c_j)-F_{\boldsymbol{\theta}}(c_{j-1}) -{\hat F}_n(c_j)+{\hat F}_n(c_{j-1})\right)^2} {F_{\boldsymbol{\theta}}(c_j)-F_{\boldsymbol{\theta}}(c_{j-1})}. \]

Most of the time, \(\widehat{\boldsymbol{\theta}}\) can only be obtained through an iterative algorithm (such as the simplex method). It is strongly recommended to verify the solution proposed by the numerical method by evaluating the objective function at a few points in the vicinity of the solution. Furthermore, the actuary should keep in mind that the proposed solution may correspond only to a local minimum.
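A Cramér-von Mises type fit of an exponential model to hypothetical grouped data can be carried out with a generic optimizer, as sketched below; the final loop performs the neighbourhood check recommended above.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical grouped data: class boundaries c_j and counts n_j.
bounds = np.array([0, 250, 500, 1000, 2000], dtype=float)
counts = np.array([6, 2, 1, 1], dtype=float)
F_emp = np.cumsum(counts) / counts.sum()        # hat F_n(c_j), j = 1, ..., r
weights = np.ones_like(F_emp)                   # w_j, chosen by the actuary

def F_model(c, theta):                          # exponential model cdf
    return 1.0 - np.exp(-c / theta)

def cvm_objective(log_theta):
    theta = np.exp(log_theta)                   # reparametrize to keep theta > 0
    return np.sum(weights * (F_model(bounds[1:], theta) - F_emp) ** 2)

res = minimize(cvm_objective, x0=np.log(300.0), method="Nelder-Mead")
theta_hat = float(np.exp(res.x[0]))
print("Cramer-von Mises type estimate:", theta_hat)

# Check a few neighbouring values of theta, as advised above.
for t in (0.9 * theta_hat, theta_hat, 1.1 * theta_hat):
    print(t, cvm_objective(np.log(t)))
```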

10.5 Data Analysis

10.5.1 Principle

Actuaries often have vast amounts of data to analyze. Before opting for a parametric model \(\mathcal{F}\), it is often useful to analyze the data without making any assumptions about them. There are several types of methods for data analysis (in multivariate statistics): factorial methods, which involve projecting the point cloud onto a subspace while retaining as much information as possible, and classification methods, which attempt to group the points.

Among factorial methods, three groups of techniques are generally distinguished: Principal Component Analysis (PCA, based on multiple quantitative variables, ideally continuous), Correspondence Analysis for binary data (CA, for two qualitative variables represented by a contingency table), and Multiple Correspondence Analysis (MCA, for more than two qualitative variables and no quantitative variables). This section briefly recalls the principles of data analysis. For more details, we refer the reader to (Lebart, Morineau, and Piron 1995).

10.5.2 Principal Component Analysis (PCA)

10.5.2.1 Variables and Individuals

PCA provides representations and reductions of the information contained in large numerical data tables \(\boldsymbol{X}\). The element \(x_{ij}\) of the \(n \times p\) matrix \({\boldsymbol{X}}\) represents the numerical value of the \(j\)th variable for the \(i\)th individual (\(j=1,2,\ldots, p\), and \(i=1,2,\ldots, n\)). The data matrix \({\boldsymbol{X}}\) thus has \(np\) elements, which is generally very large.

The natural space for statisticians to represent data is the Euclidean space \(\mathbb{R}^p\), in which the sample takes the form of a cloud of \(n\) points (each point corresponding to an individual). We call this the space of variables, and point \(i\) is the \(p\)-dimensional vector \(\boldsymbol{x}_i^{\text{v}}\) defined as the \(i\)th row of \({\boldsymbol{X}}\), i.e. \[ \boldsymbol{x}_i^{\text{v}}=(x_{i1}, \ldots, x_{ip})^\top, \quad i=1,\ldots, n. \] This provides a first cloud of points in \({\mathbb{R}}^p\), called the cloud of individuals or cloud of row points. Each of the \(n\) points in this cloud corresponds to a row of \(\boldsymbol{X}\) and thus summarizes the measurements of the \(p\) variables for one of the \(n\) individuals.

One of the peculiarities of data analysis is to consider an additional space called the space of observations, in which there is a cloud of \(p\) points representing the column vectors of \({\boldsymbol{X}}\), i.e., the vectors \[\begin{equation} \boldsymbol{x}_j^{\text{o}}=(x_{1j},\ldots, x_{nj})^\top, \quad j=1,\ldots,p. \tag{10.9} \end{equation}\] This provides a second cloud of points, in \({\mathbb{R}}^n\) this time, called the cloud of variables or cloud of column points. Each of the \(p\) points in this cloud corresponds to a column of \(\boldsymbol{X}\), and thus, it summarizes the measurements of the same variable taken for all \(n\) individuals.

The interpretations of each of these spaces are simple: in the space of variables \(\mathbb{R}^p\), \(\boldsymbol{x}_i^{\text{v}}\) represents the \(p\) characteristics or variables measured on the \(i\)th individual, and in the space of observations \({\mathbb{R}}^n\), \(\boldsymbol{x}_j^{\text{o}}\) represents the values taken by the \(j\)th variable over all \(n\) individuals. The points in the space of variables \({\mathbb{R}}^p\) thus represent individuals, and those in the space of observations \({\mathbb{R}}^n\) represent variables.

The geometric proximities between row points and between column points actually reflect statistical associations, either between individuals or between variables. Thus, proximity between two individual points in \(\mathbb{R}^p\) means that these two individuals have a similar behavior with respect to all \(p\) variables; proximity between two variable points in \(\mathbb{R}^n\) means that the \(n\) individuals have a similar behavior with respect to these two variables.

10.5.2.2 Adjusting the Cloud of Individuals in the Space of Variables

A simple way to visually understand the shape of a point cloud is to project it onto lines, or even better, onto planes, while minimizing the distortions that the projection implies.

Given the cloud of \(n\) points in \(\mathbb{R}^p\), we seek the line passing through the origin and determined by the unit direction vector \(\boldsymbol{u}\), which provides the best fit in the least squares sense, i.e., minimizing \(\sum^n_{i=1} d^2_i(\boldsymbol{u})\), where \(d_i(\boldsymbol{u})\) represents the distance from point \(i\) to the line determined by the vector \(\boldsymbol{u}\). If \(p_i(\boldsymbol{u})\) denotes the projection of the vector \(\boldsymbol{x}_i^{\text{v}}\) onto the line determined by the vector \(\boldsymbol{u}\), it follows from the Pythagorean theorem that this is equivalent to maximizing \[ \sum^n_{i=1} p^2_i (\boldsymbol{u}) \] over all normalized vectors \(\boldsymbol{u}\). By expressing \(p_i(\boldsymbol{u})\) using the dot product, we have \(p_i(\boldsymbol{u})=\boldsymbol{u}^\top\boldsymbol{x}_i^{\text{v}}\). Using the Euclidean norm notation \(||\cdot||\), this maximization problem is equivalent to \[\begin{eqnarray*} \max_{\boldsymbol{u}} \sum^n_{i=1} ||\boldsymbol{u}^\top \boldsymbol{x}_i^{\text{v}}||^2&\Leftrightarrow&\max_{\boldsymbol{u}} \sum^n_{i=1} \boldsymbol{u}^\top \boldsymbol{x}_i^{\text{v}} \boldsymbol{x}_i^{\text{v}\top} \boldsymbol{u}\\ &\Leftrightarrow&\max_{\boldsymbol{u}} \boldsymbol{u}^\top \left(\sum^n_{i=1} \boldsymbol{x}_i^{\text{v}} \boldsymbol{x}_i^{\text{v}\top}\right) \boldsymbol{u}\\ &\Leftrightarrow&\max_{\boldsymbol{u}} \boldsymbol{u}^\top {\boldsymbol{X}}^\top{\boldsymbol{X}} \; \boldsymbol{u} \end{eqnarray*}\] subject to the constraint \(\boldsymbol{u}^\top\boldsymbol{u}=1\). The problem to solve is thus stated as follows: \[\begin{equation} \max_{\boldsymbol{u}} \boldsymbol{u}^\top{\boldsymbol{X}}^\top{\boldsymbol{X}} \; \boldsymbol{u}\text{ subject to the constraint } \boldsymbol{u}^\top\boldsymbol{u}=1. \tag{10.10} \end{equation}\]

This is a classic problem in differential calculus that is solved by introducing the Lagrangian function \[ \Psi (\boldsymbol{u}; \lambda)=\boldsymbol{u}^\top \; {\boldsymbol{X}}^\top{\boldsymbol{X}} \; \boldsymbol{u}- \lambda (\boldsymbol{u}^\top\boldsymbol{u}-1) \] which is a function of \((p+1)\) variables, the \(p\) coordinates of \(\boldsymbol{u}\) and \(\lambda\). By setting the partial derivative of \(\Psi\) with respect to \(\lambda\) to zero, we obviously obtain the constraint \(\boldsymbol{u}^\top\boldsymbol{u}=1\); setting the partial derivatives of \(\Psi\) with respect to the coordinates of \(\boldsymbol{u}\) to zero leads, after some manipulation of matrix calculations, to the system \[ {\boldsymbol{X}}^\top{\boldsymbol{X}} \; \boldsymbol{u}- \lambda \boldsymbol{u}=0 \Leftrightarrow {\boldsymbol{X}}^\top{\boldsymbol{X}}\boldsymbol{u}=\lambda \boldsymbol{u}, \] where it can be seen that solving (10.10) is identical to finding the eigenvalues and eigenvectors of the symmetric matrix \({\boldsymbol{X}}^\top{\boldsymbol{X}}\) of dimension \(p\times p\). To determine which eigenpair is appropriate, it is sufficient to pre-multiply both sides of this eigenvalue equation by \(\boldsymbol{u}^\top\), \[ \boldsymbol{u}^\top\; {\boldsymbol{X}}^\top{\boldsymbol{X}} \; \boldsymbol{u}= \lambda \boldsymbol{u}^\top\boldsymbol{u}=\lambda. \] Returning to (10.10), we can conclude that the appropriate eigenpair corresponds to the largest eigenvalue of \({\boldsymbol{X}}^\top{\boldsymbol{X}}\).

Linear algebra teaches us that all eigenvalues of \({\boldsymbol{X}}^\top{\boldsymbol{X}}\) are non-negative, and the number of strictly positive eigenvalues is given by the rank \(r\) of \({\boldsymbol{X}}\) (with \(r \leq \min \{n,p\}\)). Furthermore, if \(\lambda\) has multiplicity order \(k\), there exist \(k\) orthogonal eigenvectors associated with that eigenvalue; finally, the set of \(p\) eigenvectors of \({\boldsymbol{X}}^\top{\boldsymbol{X}}\) is orthogonal.

Let \(\lambda_1, \ldots, \lambda_r\) be the nonzero eigenvalues of \({\boldsymbol{X}}^\top{\boldsymbol{X}}\) arranged in descending order; in most applications, they are distinct, i.e., \(\lambda_1 > \lambda_2 > \ldots > \lambda_r\). Let \(\boldsymbol{u}_1, \boldsymbol{u}_2, \ldots, \boldsymbol{u}_r\) be the normalized eigenvectors corresponding to these \(r\) eigenvalues. Thus, \(\boldsymbol{u}_1\) determines the sought-after line, the solution to problem (10.10), called the first factorial axis and denoted by \(F_1\).

If one wants to fit the cloud of \(n\) points in \(\mathbb{R}^p\) with a hyperplane, optimized in the least squares sense, the same formalism as that used for the line shows that the hyperplane sought is the one spanned by the eigenvectors \(\boldsymbol{u}_1\) and \(\boldsymbol{u}_2\) of \({\boldsymbol{X}}^\top{\boldsymbol{X}}\) corresponding to the eigenvalues \(\lambda_1\) and \(\lambda_2\); it is thus spanned by the lines determined by \(\boldsymbol{u}_1\) and \(\boldsymbol{u}_2\) (i.e., the first two factorial axes \(F_1\) and \(F_2\)).
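In practice the factorial axes are obtained from an eigen-decomposition of \({\boldsymbol{X}}^\top{\boldsymbol{X}}\); the sketch below does this with numpy on a simulated data matrix, assuming (as is customary in PCA, although not stated explicitly above) that the columns of \({\boldsymbol{X}}\) have been centred and scaled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data matrix X: n individuals, p correlated variables.
n, p = 200, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)        # centre and scale the columns

# Factorial axes = eigenvectors of X^T X, ordered by decreasing eigenvalue.
eigval, eigvec = np.linalg.eigh(X.T @ X)        # eigh: symmetric matrix
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

u1, u2 = eigvec[:, 0], eigvec[:, 1]   # first two factorial axes F1 and F2
coords = X @ eigvec[:, :2]            # coordinates of the individuals on (F1, F2)
print("share of total inertia carried by (F1, F2):",
      eigval[:2].sum() / eigval.sum())
```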

Example 10.5 Figure ?? shows a projection that emphasizes the large dispersion of the cloud, and another for which the projection provides little information.

10.5.2.3 Adjustment of the Cloud of Variables in the Space of Observations

Now, let’s consider the space of observations \({\mathbb{R}}^n\), where the table \(\boldsymbol{X}\) is represented by a cloud of points-variables, with the \(n\) coordinates representing the columns of \(\boldsymbol{X}\). By analogy, the best fit by the least squares method of the cloud of \(p\) points by a line determined by the normalized vector \(\boldsymbol{v}\) leads to the problem: \[\begin{eqnarray*} \max_{\boldsymbol{v}} \sum^p_{j=1} ||\boldsymbol{v}^\top \boldsymbol{x}_j^{\text{o}}||^2 &\Leftrightarrow&\max_{\boldsymbol{v}}\boldsymbol{v}^\top \left(\sum^p_{j=1} \boldsymbol{x}_j^{\text{o}} \boldsymbol{x}_j^{\text{o}\top} \right)\boldsymbol{v}\\ &\Leftrightarrow&\max_{\boldsymbol{v}}\boldsymbol{v}^\top {\boldsymbol{X}}{\boldsymbol{X}}^\top \boldsymbol{v} \end{eqnarray*}\] subject to the constraint \(\boldsymbol{v}^\top\boldsymbol{v}=1\), where \({\boldsymbol{X}}{\boldsymbol{X}}^\top\) is, this time, a symmetric matrix of dimension \(n\times n\).

By analogy with the problem (10.10), it can be found that \(\boldsymbol{v}\) is a solution of \[ {\boldsymbol{X}}{\boldsymbol{X}}^\top\boldsymbol{v}=\mu \boldsymbol{v}. \] In other words, \(\boldsymbol{v}\) is a normalized eigenvector of \({\boldsymbol{X}}{\boldsymbol{X}}^\top\) corresponding to the largest eigenvalue \(\mu_1\) of \({\boldsymbol{X}}{\boldsymbol{X}}^\top\).

Remark. It is easy to see that any eigenvalue \(\lambda\) of \({\boldsymbol{X}}^\top{\boldsymbol{X}}\) is also an eigenvalue of \({\boldsymbol{X}}{\boldsymbol{X}}^\top\), and vice versa. For example, suppose \({\boldsymbol{X}}^\top{\boldsymbol{X}} \boldsymbol{u}=\lambda \boldsymbol{u}\). By pre-multiplying by \({\boldsymbol{X}}\), we have \(({\boldsymbol{X}}{\boldsymbol{X}}^\top){\boldsymbol{X}}\boldsymbol{u}= \lambda {\boldsymbol{X}} \boldsymbol{u}\), which proves that \({\boldsymbol{X}}\boldsymbol{u}\) is an eigenvector of \({\boldsymbol{X}}{\boldsymbol{X}}^\top\) corresponding to the eigenvalue \(\lambda\). Conversely, if \({\boldsymbol{X}}{\boldsymbol{X}}^\top\boldsymbol{v}=\mu \boldsymbol{v}\), by pre-multiplying by \({\boldsymbol{X}}^\top\), we get \(({\boldsymbol{X}}^\top{\boldsymbol{X}}){\boldsymbol{X}}^\top\boldsymbol{v}= \mu {\boldsymbol{X}}^\top \boldsymbol{v}\), from which we see that \({\boldsymbol{X}}^\top \boldsymbol{v}\) is an eigenvector of \({\boldsymbol{X}}^\top{\boldsymbol{X}}\) corresponding to the eigenvalue \(\mu\). This implies that the \(n\) eigenvalues (in descending order) of \({\boldsymbol{X}}{\boldsymbol{X}}^\top\) are \(\lambda_1, \ldots, \lambda_r\), with the remaining \(n-r\) being zero.

Remark (Additional Elements). The variables and individuals used to construct the optimal subspaces for representing proximities are called active elements. However, there’s nothing preventing the actuary from placing additional elements (row or column points) in these subspaces that did not participate in the analysis. These are called supplementary or illustrative elements.

Supplementary elements come into play after the analysis to enrich the interpretation of the factors. They do not participate in the adjustment calculations and do not contribute to the formation of the factor axes. They are positioned in either the cloud of individuals or the cloud of variables by computing their coordinates on the factor axes after the analysis. One may also want to represent an additional nominal variable. To do this, one creates as many groups of individuals as there are levels of this variable and computes their centroids. These centroid points are then positioned among the individual points as supplementary elements.

10.5.2.4 Application to Index Construction

Significant amounts of information are available, especially from national statistical institutes (INSEE in France, INS in Belgium), law enforcement agencies, national banks, as well as private polling or marketing organizations. Incorporating such data into a pricing model can sometimes lead to significant improvements in the accuracy of pure premium calculations.

The following examples illustrate how the incorporation of such statistics into the pricing scheme can lead to a better risk assessment:

  1. In home theft insurance, information about the neighborhood of the residence can be very useful: the type of housing (isolated villas, single-family houses, apartment buildings, etc.), the socio-economic profile of residents, and so on can all influence the covered risk.
  2. In car theft insurance, actuaries can obtain statistics from police services identifying the most frequently stolen car models and the most exposed areas.
  3. In automobile insurance, it is useful to leverage the technical characteristics of the vehicle (available from manufacturers), especially for assessing its sportiness (and hence how difficult it may be to control). See (Ingenbleek and Lemaire 1988) for an application of PCA to the construction of an index reflecting the mechanical performance of automobiles.

However, it is not sufficient to incorporate such information directly into the pricing models that we will develop later in this chapter. The characteristics of vehicles or statistical sectors are contained in several hundred, or even thousands, of variables, whereas the characteristics of the insured and the covered risk are often summarized in a few dozen variables. Before incorporating them into a pricing model, the information must be summarized into relevant indices obtained using techniques like PCA.

10.5.3 Multiple Correspondence Analysis (MCA)

10.5.3.1 Descriptive Analysis of Large Qualitative Data Sets

Multiple Correspondence Analysis (MCA) is a powerful technique for the description of large qualitative data sets. MCA can be seen as the analogue of PCA for qualitative variables. Like PCA, the results are presented graphically (represented in factorial planes).

Here, we consider \(N\) individuals described by \(Q\) qualitative variables with \(J_1\), \(J_2\), \(\ldots\), \(J_Q\) categories. We denote \(J=\sum_{q=1}^Q J_q\) as the total number of categories across all variables.

10.5.3.2 Complete Disjunctive Table

The starting table is a cross-tabulation of qualitative variables and individuals. Each individual is described by the categories of the \(Q\) variables to which they belong. These raw data are therefore presented in the form of a table with \(N\) rows and \(Q\) columns.

Remark. In practice, both categorical and quantitative variables are often available simultaneously. From now on, we assume that the quantitative variables have been made categorical. To do this, we distinguish between:

  1. Variables that only take a few integer values (e.g., the number of dependents), which are made categorical by grouping values into meaningful classes (e.g., 0 dependents, 1-2 dependents, and 3 or more dependents).
  2. Continuous variables (e.g., vehicle horsepower), which are made categorical by choosing quantiles as class boundaries (a partition into 4 or 5 classes is usually sufficient in practice).

Each variable \(q\) corresponds to a table \(\boldsymbol{Z}_q\) with \(N\) rows and \(J_q\) columns. This table is such that its \(i\)th row contains \(J_q-1\) zeros and one 1 (in the column corresponding to the category of variable \(q\) for individual \(i\)).

The table \(\boldsymbol{Z}\) with \(N\) rows and \(J\) columns describing the \(Q\) characteristics of the \(N\) individuals using binary coding is obtained by juxtaposing \(\boldsymbol{Z}_1,\boldsymbol{Z}_2,\ldots,\boldsymbol{Z}_Q\). \(\boldsymbol{Z}\) is called the complete disjunctive table.
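
As a small illustration (the variables and categories below are made up), the complete disjunctive table \(\boldsymbol{Z}\) is simply the one-hot (binary) coding of the \(Q\) qualitative variables:

```python
import pandas as pd

# Three policyholders described by Q = 2 categorical variables (made-up example).
raw = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "use":    ["private", "professional", "private"],
})

# Complete disjunctive table Z: one column per category, a single 1 per variable and row.
Z = pd.get_dummies(raw, prefix_sep="=").astype(int)
print(Z)
# Each row sums to Q = 2: exactly one category is selected per variable.
print(Z.sum(axis=1))
```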

10.5.3.3 Burt’s Table

The complete disjunctive table \(\boldsymbol{Z}\) is then transformed into a multiple contingency table \(\boldsymbol{B}\) (also known as Burt’s table) to make it usable for MCA. This table is obtained from \(\boldsymbol{Z}\) through \(\boldsymbol{B}=\boldsymbol{Z}^\top\boldsymbol{Z}\); \(\boldsymbol{B}\) appears as a juxtaposition of contingency tables. More precisely, this table is formed by the juxtaposition of \(Q^2\) blocks where we distinguish between:

  1. The block \(\boldsymbol{Z}_q^\top\boldsymbol{Z}_{q'}\) indexed by \((q,q')\) of size \(J_q\times J_{q'}\), which is nothing but the contingency table crossing the modalities of variables \(q\) and \(q'\).
  2. The \(q\)-th square block \(\boldsymbol{Z}_q^\top\boldsymbol{Z}_q\), which appears as a diagonal matrix of size \(J_q\times J_q\) (since two modalities of the same variable cannot be chosen simultaneously). The diagonal terms are the frequencies of the \(J_q\) modalities of variable \(q\).

MCA is the analysis of correspondences in a complete disjunctive table or, equivalently, in the corresponding Burt’s table.
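
Starting from a disjunctive table built as above, Burt's table is obtained by a single matrix product; a minimal sketch with fictitious data:

```python
import pandas as pd

raw = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "use":    ["private", "professional", "private", "private"],
})
Z = pd.get_dummies(raw, prefix_sep="=").astype(int)

# Burt's table B = Z'Z: diagonal blocks give the category frequencies,
# off-diagonal blocks are the pairwise contingency tables.
B = pd.DataFrame(Z.values.T @ Z.values, index=Z.columns, columns=Z.columns)
print(B)
```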

10.5.3.4 Binary Correspondence Analysis

Since the Burt’s table can be seen as a juxtaposition of contingency tables crossing variables pairwise, the goal is to be able to analyze a contingency table. This is the subject of Binary Correspondence Analysis, or BCA.

Let’s assume that observations have been made on a set of \(N\) individuals for two characters denoted \(I\) and \(J\), taking \(n\) and \(p\) possible values, respectively. These characters can be either quantitative (in which case grouping has been done into \(n\) and \(p\) groups) or qualitative.

The result of observing the \(N\) individuals can be represented as a table with \(n\) rows and \(p\) columns crossing the two characters in such a way that the intersection of the \(i\)th row and \(j\)th column contains the total number of times \(N_{ij}\) that both the value \(i\) of \(I\) and the value \(j\) of \(J\) were observed. This table is called a “contingency table.”

By dividing all \(N_{ij}\) by \(N\), we obtain the relative frequencies \(f_{ij}\) defined as:

\[\begin{equation} f_{ij}=\frac{N_{ij}}{N} (\#eq:1.69) \end{equation}\]

The table of \(f_{ij}\) is called the “frequency table.” It is supplemented by an additional row and column providing the marginal frequencies associated with the \(n\) levels of \(I\) and the \(p\) levels of \(J\):

\[\begin{equation} f_{i \bullet}=\sum_j f_{ij},\quad i=1,\ldots,n, (\#eq:1.70) \end{equation}\]

\[\begin{equation} f_{\bullet j}=\sum_i f_{ij}, \quad j=1,\ldots, p. (\#eq:1.71) \end{equation}\]

BCA essentially involves performing PCA on the matrix \(\boldsymbol{X}\), where the element \(x_{ij}\) is defined as: \[ x_{ij}=\frac{f_{ij}}{f_{i\bullet}\sqrt{f_{\bullet j}}}. \]
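
A minimal numerical sketch of this construction (the contingency table below is arbitrary): we form the frequencies, the margins and the matrix \(\boldsymbol{X}\), and extract its principal directions through a singular value decomposition.

```python
import numpy as np

# Arbitrary contingency table N_ij crossing two characters (n = 3 rows, p = 4 columns).
N = np.array([[20, 35, 10,  5],
              [15, 25, 30, 10],
              [ 5, 10, 25, 30]], dtype=float)

f = N / N.sum()       # relative frequencies f_ij
f_i = f.sum(axis=1)   # row margins f_i.
f_j = f.sum(axis=0)   # column margins f_.j

# Matrix analyzed by BCA: x_ij = f_ij / (f_i. * sqrt(f_.j)).
X = f / (f_i[:, None] * np.sqrt(f_j)[None, :])

# Principal directions via a singular value decomposition (the PCA step).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print("singular values:", np.round(s, 3))
```

The leading singular vectors then provide the coordinates used to position the row and column categories in the factorial planes.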

10.6 Scoring Methods

10.6.1 Classification Methods

Linear regression, which will be discussed in Section 10.7, is used to predict a continuous variable based on explanatory variables, involving a linear combination of these explanatory variables. However, this method cannot directly model a dichotomous variable of the good-client/bad-client type (an indicator of whether or not a claim occurred during the year, for example).

In a simplified view, we aim to model such a dichotomous variable (denoted as \(Y\)) using two quantitative variables, \(X_1\) and \(X_2\). In a first approach, we can treat the qualitative variable as a quantitative variable (taking two values, \(0\) if the person had no claims and \(1\) if they had at least one) and use linear regression with the other two variables. In this case, we use the model:

\[ Y=\beta_0+\beta_1X_1+\beta_2X_2+\varepsilon, \]

where \(\varepsilon\) represents the error term.

This technique allows us to separate the space \((X_1,X_2)\) into two parts using a hyperplane (in this case, a line). More formally, we aim to construct a function \(y=f(x_1,x_2)\) such that the set of points \(\{(x_1,x_2)\in\mathbb{R}^2|f(x_1,x_2)<1/2\}\) corresponds to non-claimants (those for whom \(Y=0\)), and \(\{(x_1,x_2)\in\mathbb{R}^2|f(x_1,x_2)>1/2\}\) corresponds to claimants (those for whom \(Y=1\)). The boundary is defined as \(\{(x_1,x_2)\in\mathbb{R}^2|f(x_1,x_2)=1/2\}\). Note that the choice of the factor \(1/2\) is arbitrary and corresponds to the natural boundary point separating \(0\) and \(1\). Choosing a value closer to \(1\) allows for better isolation of individuals with \(Y=1\), but many claimants may then be outside the correct region of the space. Linear regression corresponds to the case where \(f\) is an affine function, and we separate the space using the line:

\[ f(x_1,x_2) = \widehat{\beta}_0+\widehat{\beta}_1 x_1+\widehat{\beta}_2 x_2 = 1/2, \]

where \(\widehat{\beta}_i\) are the estimators obtained by least squares. More generally, we can consider a curvilinear regression by regressing \(Y\) not only on \(X_1\) and \(X_2\) but also on \(X_1\cdot X_2\), \(X_1^2\), and \(X_2^2\). In this case, the separation is given by:

\[ f(x_1,x_2) = \widehat{\beta}_0+\widehat{\beta}_1 x_1+\widehat{\beta}_2 x_2 +\widehat{\beta}_{1,1} x_1^2+\widehat{\beta}_{1,2} x_1\cdot x_2+\widehat{\beta}_{2,2} x_2^2= 1/2. \]

Figure ?? illustrates these two methods, where points with \(Y=0\) are represented in white, and points with \(Y=1\) are in black.

Another method that can be considered is the “nearest neighbors” method. For a given point \((X_1,X_2)\), we calculate the average value of \(Y\) within a neighborhood and then round it to \(0\) or \(1\). The result is shown in Figure ??.

More formally, we choose a distance metric in the space of explanatory variables (\(X_1,X_2\)). The Euclidean distance is a common choice, where the distance between two individuals, say \(i\) and \(j\), is given by:

\[ d(X^i,X^j)=\sqrt{(X_1^i-X_1^j)^2+(X_2^i-X_2^j)^2}. \]

For a point \((x_1,x_2)\) in the space, the classification function \(f(x_1,x_2)\) is the average of \(Y\) obtained from the \(k\) nearest neighbors of point \((x_1,x_2)\) within the sample.
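
The sketch below implements this rule directly (simulated data; in practice the number of neighbors \(k\) is a tuning parameter, often chosen by cross-validation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated portfolio: Y = 1 (claim) is more likely in the upper-right of the (X1, X2) plane.
X = rng.uniform(0, 1, size=(300, 2))
y = (rng.uniform(size=300) < 0.2 + 0.6 * X.mean(axis=1)).astype(int)

def knn_classify(x_new, X, y, k=15):
    """Average of Y over the k nearest neighbours (Euclidean distance), rounded to 0 or 1."""
    d = np.sqrt(((X - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    return int(y[nearest].mean() >= 0.5)

print(knn_classify(np.array([0.2, 0.3]), X, y))  # expected to fall in the "good" region
print(knn_classify(np.array([0.9, 0.8]), X, y))  # expected to fall in the "bad" region
```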

Therefore, if two customer classes exist within the population (the good and the bad), then based on information about the individual (corresponding to the values of the rating variables \(X_1\) and \(X_2\)), the individual will be classified in the region of good clients (lower left part in the plots, dominated by mostly white points) or in the region of bad clients (upper right part).

10.6.2 Definition of a Score

A score corresponds to a ranking of individuals based on their characteristics. These techniques are used by banks for credit approval (the score reflecting the probability of not being able to repay a loan). Insurers can also use these techniques by modeling the probability of having an accident: the lower this probability, the better the client.

As explained in Volume 1, the production cycle is reversed in insurance. Indeed, insurers promise benefits in case of a claim, while premiums are set (and collected) in advance. Insurers must, therefore, carefully consider their acceptance policy. Once the insurance proposal is completed, the company would like to predict whether the insured will report claims or not, given their observable characteristics. Several methodologies are possible: file inspection, interviews, additional information searches, etc. However, due to costs, the number of cases to be studied, and the level of subjectivity introduced in the case-by-case examination, statistical classification methods are increasingly used for mass risks, as they are less costly, faster, and more systematic.

Remark. Although the rest of this section will be devoted to the acceptance problem, the methods described here can be applied to solve many other problems that insurance companies face, including the detection of policyholders who are at risk of canceling their policies at the next renewal, for example. This can be done by analyzing the file containing the portfolio policies with the dependent variable being the indicator of the event “policy \(i\) leaves the portfolio at the end of the period.” Make sure to include crucial information in this file, such as the number of years the policy has been in the portfolio, claims history, the number of modifications made to the contract, etc.

Let’s define the indicator variable \(Y\), taking values \(0\) or \(1\) depending on whether the risk is good or not (typically, a good risk is an insured who reports no claims over the period). As with classification methods, the decision rule is based on a set of explanatory variables \(\mathbf{X}\): we consider the insured as a good risk if \(\mathbf{X} \in \mathcal{A}\), and a bad risk otherwise, where \(\mathcal{A}\) represents an acceptability domain. Formally, ranking prospective policyholders into acceptance and rejection amounts to choosing a partition of \({\mathbb{R}}^p\) into an acceptance zone \(\mathcal{A}\) and a rejection zone \(\overline{\mathcal{A}}={\mathbb{R}}^p\setminus \mathcal{A}\). Thus, for an insured whose characteristics are summarized in the vector \(\mathbf{x}\):

\[\begin{eqnarray*} \mathbf{x} \in \mathcal{A} &\Rightarrow& \text{acceptance} \\ \mathbf{x} \in \overline{\mathcal{A}} &\Rightarrow& \text{rejection}. \end{eqnarray*}\]

The simplest partition is the one generated by a hyperplane, but much more complex sets can be imagined. So, there are as many classifiers as partitions of \({\mathbb{R}}^p\) into \(\mathcal{A}\) and \(\overline{\mathcal{A}}\).

Since qualitative variables can always be replaced by numerical codes as described above, \(\mathbf{X}\) takes values in \({\mathbb{R}}^p\). We will start by assuming that \(\mathbf{X}\) is continuous, and later we will see how to handle cases where some components of \(\mathbf{X}\) are discrete or categorical.

10.6.3 Principle of Scoring

The canonical score is the score that classifies individuals based on the probability of being a good client or not. It is the function that associates the tariff variables \(\mathbf{X}\) with the score \(S^*\) defined as: \[ S^*(\mathbf{x}) = \Pr[Y=1|\mathbf{X}=\mathbf{x}]. \]

We can then define two classes: the good and the bad risks. The good risks correspond to individuals who have obtained a score lower than a threshold \(s\), set in advance. It should be noted that with \(s\) fixed, \(\Pr[S\leq s]\) is the proportion of individuals retained, and \(\Pr[Y=0|S\leq s]\) represents the proportion of good individuals among those retained.

In general, let’s denote \(p_{1}=\Pr[Y=1]\) and \(p_{0}=1-p_{1}\) as the unknown marginal probabilities (which can, however, be estimated from historical data). Assuming that the covariates are continuous and follow a density \(f(\mathbf{x})\), we can also denote \(f_{0}\) and \(f_{1}\) as the conditional densities: \[ f_{j}(\mathbf{x}) = f(\mathbf{x}|Y=j), \text{ where } j=0,1. \] The discriminatory (or segmentation) power will be stronger the more different these densities are. Finally, we denote \(p_{j}(\mathbf{x})\) as the probabilities of \(Y=j\) given \(\mathbf{X}=\mathbf{x}\): \[ p_{j}(\mathbf{x}) = \Pr \left[ Y=j|\mathbf{X}=\mathbf{x}\right], \text{ where } j=0,1. \] Using Bayes’ theorem (as recalled in Section 2.2.8), we can relate these quantities as follows: \[ p_{j}(\mathbf{x}) = \frac{p_{j}f_{j}(\mathbf{x})}{f(\mathbf{x})}, \text{ with } f(\mathbf{x}) = p_{0}f_{0}(\mathbf{x}) + p_{1}f_{1}(\mathbf{x}). \]

10.6.4 Optimal Classification and Threshold Selection

Several methods can be used to determine the threshold \(s\). For this, let’s assume that the insurer can refuse to insure bad clients. Let \(g\) be the insurer’s gain in case of a correct decision: accepting to insure a good client (which happens with probability \(\Pr [\mathbf{X}\in \mathcal{A},Y=0]\)). Let \(c_{0}\) be the cost associated with insuring a bad risk (with probability \(\Pr [\mathbf{X}\in \mathcal{A},Y=1]\)), and let \(c_{1}\) be the loss in revenue from refusing to insure a good client (with probability \(\Pr [\mathbf{X}\notin \mathcal{A},Y=0]\)).

It is essential to consider the very different consequences of misclassifications for the company. Indeed, placing a bad policyholder among the good ones means accepting to cover a policyholder who will cause a claim (or multiple claims), exposing the company to heavy losses. Conversely, refusing a good risk costs little to the company (but could have disastrous commercial consequences if it happened too often). It is essential to introduce this distinction into the model.

By defining the profitability of the insurer as: \[\begin{eqnarray*} R(\mathcal{A}) &=& g\Pr [\mathbf{X}\in \mathcal{A},Y=0] - c_{0}\Pr [\mathbf{X}\in \mathcal{A},Y=1] \\ && - c_{1}\Pr [\mathbf{X}\notin \mathcal{A},Y=0], \end{eqnarray*}\] the optimal acceptance region is given by: \[ \mathcal{A}^{*}=\underset{\mathcal{A}}{\text{argmax}}\{ R(\mathcal{A}) \}. \] It can be noted that the insurer’s profitability can be rewritten as: \[ R(\mathcal{A}) = \int_{\mathbf{x}\in \mathcal{A}}\Big( \left( g+c_{1}\right) p_{0}f_{0}(\mathbf{x}) - c_{0}p_{1}f_{1}(\mathbf{x}) \Big)d\mathbf{x} - c_{1}p_{0}. \] The optimal acceptance region \(\mathcal{A}^*\) is the one for which the integrand is always positive, i.e.: \[ \mathcal{A}^{*} = \{ \mathbf{x}|\frac{f_{1}(\mathbf{x})}{f_{0}(\mathbf{x})} \leq \frac{(g+c_{1})p_0}{c_0 p_1} \}, \] or using Bayes’ formula with \(p_{j}(\mathbf{x}) = p_{j}f_{j}(\mathbf{x})/f(\mathbf{x})\): \[ \mathcal{A}^{*} = \{ \mathbf{x}|p_{1}(\mathbf{x}) \leq \frac{g+c_{1}}{g+c_{0}+c_{1}} \}. \] One can then select policyholders using either the conditional laws of \(\mathbf{X}\) given \(Y\) (i.e., \(f_0\) and \(f_1\)), or the conditional laws of \(Y\) given \(\mathbf{X}\) (i.e., \(p_0(\mathbf{X})\) and \(p_1(\mathbf{X})\)). However, it should be noted that while these two approaches are equivalent from a mathematical point of view, they reflect two substantially different perspectives:

  • The law of \(\mathbf{X}|Y\) is used in discriminant analysis (the question is whether an individual, whose status as a good or bad risk is known, will be classified among the good or the bad clients).
  • The law of \(Y|\mathbf{X}\) is used in a predictive context (to which population - the good or the bad clients - does an individual, whose characteristics \(\mathbf{X}\) are known, have the highest chance of belonging?).

Remark. It is also possible to imagine that the costs of misclassification \(c_0\) and \(c_1\), as well as the gain \(g\), depend on the characteristics of the insured individual, i.e. \(c_0=c_0(\mathbf{x})\), \(c_1=c_1(\mathbf{x})\), and \(g=g(\mathbf{x})\). The reasoning is very similar to the one followed above.
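
As a small numerical illustration (the gain and cost figures below are arbitrary), the acceptance rule derived above only requires comparing \(p_1(\mathbf{x})\) to a fixed cut-off:

```python
# Arbitrary economics of the decision: gain g for a good risk accepted,
# cost c0 for a bad risk accepted, lost revenue c1 for a good risk rejected.
g, c0, c1 = 100.0, 800.0, 40.0

threshold = (g + c1) / (g + c0 + c1)
print(f"accept whenever p1(x) <= {threshold:.3f}")

# Example: a prospect with estimated claim probability p1(x) = 0.08 is accepted here.
p1_x = 0.08
print("accept" if p1_x <= threshold else "reject")
```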

10.6.5 Practical Construction of a Score

Five main steps are fundamental when constructing a score:

  1. The choice of the dichotomous criterion to model: not having a claim in the year, not having a severe claim in the year, not having a responsible claim in the year, …
  2. The choice of the population: the difficulty arises from the fact that the population of insureds in the portfolio can be substantially different from the population applying for insurance (due to selection using a score, precisely). A portion of the insured population is used to calculate the score, while the rest is used to test its performance.
  3. The choice of covariates \(\mathbf{X}\).
  4. Model estimation: schematically, it is possible to use logistic or probit models and estimate the parameters using maximum likelihood.
  5. Performance analysis: a number of tests and criteria can be used to assess the quality of the score’s discrimination (performance, selection, discrimination curves, for example).

Let \(\mathcal{H}\) designate the historical data used to construct the classifier (i.e., the observations \((\mathbf{x},Y)\) recorded by the company in the past for a large number of individuals). This set will be partitioned into two subsets (most often randomly) that will be used to respectively estimate the parameters and evaluate the classifier, namely:

The “training set”

This is the subset of \(\mathcal{H}\) used to determine the parameters. Most often, the training set has the form: \[\begin{equation} \left\{ (\mathbf{x}_{0,k},0),\ k=1,\ldots,n_0;\ (\mathbf{x}_{1,k},1),\ k=1,\ldots,n_1\right\} (\#eq:2.12) \end{equation}\] where the \(\mathbf{x}_{0,k}\), numbering \(n_0\), correspond to policyholders classified as good risks, and the \(\mathbf{x}_{1,k}\), numbering \(n_1\), correspond to policyholders classified as bad risks.

The “test set”

This is the subset of \(\mathcal{H}\) used to assess the classifier’s performance.

If a parametric model is selected, we estimate the densities of observable characteristics for good and bad policyholders by: \[ \widehat{f_1(\mathbf{x})} = f_1(\mathbf{x};\widehat{\boldsymbol{\theta}}) \text{ and } \widehat{f_0(\mathbf{x})} = f_0(\mathbf{x};\widehat{\boldsymbol{\theta}}), \] where \(\widehat{\boldsymbol{\theta}}\) is the maximum likelihood estimator of \(\boldsymbol{\theta}.\) The likelihood is given by: \[\begin{equation} \mathcal{L}(\boldsymbol{\theta})=\prod_{j=1}^{n_0}\left(f_0(\mathbf{x}_{0,j},\boldsymbol{\theta}) p_0\right)\prod_{j=1}^{n_1}\left(f_1(\mathbf{x}_{1,j},\boldsymbol{\theta})p_1\right). (\#eq:2.13) \end{equation}\] To estimate the vector of parameters \((\boldsymbol{\theta},p_0,p_1)\), we write the log-likelihood: \[\begin{equation} L(\boldsymbol{\theta},\boldsymbol{p})=\sum_{j=1}^{n_0 }\ln f_0(\mathbf{x}_{0,j};\boldsymbol{\theta}) +\sum_{j=1}^{n_1 }\ln f_1(\mathbf{x}_{1,j};\boldsymbol{\theta})+n_0\ln p_0+n_1\ln p_1. (\#eq:2.14) \end{equation}\] We first maximize \(n_0\ln p_0+n_1\ln p_1\) to obtain: \[ \hat{p}_0=\frac{n_0}{n_0+n_1} \text{ and } \hat{p}_1=1-\hat{p}_0=\frac{n_1}{n_0+n_1}. \] This is the natural estimator of \(p_0\), i.e., the proportion of good clients in the training set. It is important to realize that we assume that the target population for new business is similar to the individuals in \(\mathcal{H}\).

We then maximize: \[\begin{equation} \sum_{j=1}^{n_0 }\ln f_0(\mathbf{x}_{0,j};\boldsymbol{\theta}) +\sum_{j=1}^{n_1 }\ln f_1(\mathbf{x}_{1,j};\boldsymbol{\theta}) (\#eq:2.14bis) \end{equation}\] over \(\boldsymbol{\theta}\) to obtain the maximum likelihood estimator of \(\boldsymbol{\theta}\).
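
A minimal sketch of this two-step estimation, assuming (as in Example 10.6 below) Gaussian conditional densities with a common covariance matrix, on simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated training set: n0 good risks (Y=0) and n1 bad risks (Y=1), two covariates.
n0, n1 = 400, 100
X0 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=n0)
X1 = rng.multivariate_normal([1.0, 0.5], np.eye(2), size=n1)

# Step 1: marginal probabilities, estimated by the observed proportions.
p0_hat = n0 / (n0 + n1)
p1_hat = 1 - p0_hat

# Step 2: ML estimates of theta under a Gaussian model with a common covariance matrix.
mu0_hat, mu1_hat = X0.mean(axis=0), X1.mean(axis=0)
S0 = (X0 - mu0_hat).T @ (X0 - mu0_hat)
S1 = (X1 - mu1_hat).T @ (X1 - mu1_hat)
Sigma_hat = (S0 + S1) / (n0 + n1)  # pooled ML estimate (no degrees-of-freedom correction)

print(p0_hat, p1_hat)
print(mu0_hat, mu1_hat)
print(Sigma_hat)
```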

10.6.6 (Linear) Discriminant Analysis

The objective here is to describe an individual’s membership in a predefined class based on their characteristics. We are interested in the law of \(\mathbf{X}|Y\) described by the probability density functions \(f_0\) and \(f_1\). Recall that the canonical score function \(S^*\) leads to a classification rule based on the ratio \(f_1/f_0\).

Example 10.6 Suppose that, conditional on \(Y=j\), \(\mathbf{X}\sim\mathcal{N}or_p(\boldsymbol{\mu}_j,\boldsymbol{\Sigma})\). Note that the variance-covariance matrix does not depend on the value of \(j\) for \(Y\). The ratio of densities is then an increasing function of: \[ \left( \mathbf{X}-\boldsymbol{\mu}_0\right)^\top \boldsymbol{\Sigma}^{-1}\left( \mathbf{X}-\boldsymbol{\mu}_0\right)-\left( \mathbf{X}-\boldsymbol{\mu}_1\right)^\top \boldsymbol{\Sigma}^{-1}\left( \mathbf{X}-\boldsymbol{\mu}_1\right). \] We define the score as: \[ S(\mathbf{X}) = \mathbf{X}^\top \boldsymbol{\Sigma}^{-1}\left( \boldsymbol{\mu}_1-\boldsymbol{\mu}_0\right). \]

If the variance-covariance matrices are not equal, the score becomes: \[ S(\mathbf{X}) = \left( \mathbf{X}-\boldsymbol{\mu}_0\right)^\top \boldsymbol{\Sigma}_0^{-1}\left( \mathbf{X}-\boldsymbol{\mu}_0\right)-\left( \mathbf{X}-\boldsymbol{\mu}_1\right)^\top \boldsymbol{\Sigma}_1^{-1}\left( \mathbf{X}-\boldsymbol{\mu}_1\right), \] where \(\boldsymbol{\Sigma}_0\) is the variance-covariance matrix of \(\mathbf{X}\) if \(Y=0\), and \(\boldsymbol{\Sigma}_1\) is that of \(\mathbf{X}\) if \(Y=1\). In this case, we no longer have a linear criterion (as in the previous case), but a quadratic one.
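
With given (or estimated) parameters, both the linear and the quadratic scores are immediate to evaluate; a short sketch with illustrative parameter values:

```python
import numpy as np

# Assumed (illustrative) parameters of the two Gaussian populations.
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma0, Sigma1 = Sigma, np.array([[1.5, 0.0], [0.0, 0.8]])

def linear_score(x):
    # S(x) = x' Sigma^{-1} (mu1 - mu0): equal covariance matrices.
    return x @ np.linalg.solve(Sigma, mu1 - mu0)

def quadratic_score(x):
    # Difference of Mahalanobis-type distances when the covariance matrices differ.
    d0 = (x - mu0) @ np.linalg.solve(Sigma0, x - mu0)
    d1 = (x - mu1) @ np.linalg.solve(Sigma1, x - mu1)
    return d0 - d1

x = np.array([0.8, 0.2])
print(linear_score(x), quadratic_score(x))
```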

10.6.8 The DISQUAL Method

The discriminant analysis methods we have presented seem to be particularly well-suited for solving the acceptance problem. However, there is a significant challenge that needs to be overcome. In practice, the information available to the actuary about future policyholders is mostly composed of qualitative variables (binary, such as gender, or with multiple categories, such as socio-professional category), and integer variables that only take a few distinct values (such as the number of dependent children or the number of household vehicles, for example).

The joint density of these variables is far from resembling that of a multivariate normal distribution. To get closer to the assumptions of validity of discriminant analysis, we can first transform the qualitative variables into continuous and uncorrelated variables by performing a Multiple Correspondence Analysis (MCA) on all qualitative variables. We will then work with the coordinates of individuals on the factorial axes.

MCA allows us to replace the original qualitative characteristics, which may not always be suitable for scoring, with continuous variables. Then, a discriminant analysis is performed, and it is straightforward to return to the original variables to derive a score.

This strategy can be represented schematically as follows: qualitative variables \(\rightarrow\) MCA \(\rightarrow\) coordinates of the individuals on the factorial axes \(\rightarrow\) discriminant analysis on these coordinates \(\rightarrow\) score, which can finally be re-expressed in terms of the original categories.

10.6.9 The Probit Model

The fact that \(Y\) takes its values in \(\{0,1\}\) makes any linear modeling approach outlined in Section 10.6 inappropriate. The use of a latent variable, as we will see, is much more appropriate.

Here, we assume that there exists a latent variable \(Y^*\) such that: \[ Y^* = \mathbf{X}^\top \boldsymbol{\beta} + \varepsilon, \text{ and } Y = \mathbb{I}[Y^* \geq 0], \] where \(\mathbb{I}[A]\) is the indicator of the event \(A\), equal to 1 if the event occurred and 0 otherwise, and where \(\varepsilon \sim \mathcal{N}or(0,1)\). Therefore, the score associated with this model is given by: \[ S(\mathbf{x}) = \Pr[Y=1|\mathbf{X}=\mathbf{x}] = \Pr[Y^* \geq 0|\mathbf{X}=\mathbf{x}] = \Pr[\mathbf{x}^\top \boldsymbol{\beta} + \varepsilon \geq 0] = 1 - \Phi(-\mathbf{x}^\top \boldsymbol{\beta}) = \Phi(\mathbf{x}^\top \boldsymbol{\beta}), \] where \(\Phi\) denotes the cumulative distribution function of the \(\mathcal{N}or(0,1)\) distribution.

For this model, the estimation of the parameters \(\boldsymbol{\beta}\) is done by maximum likelihood. Given observations \((y_i, \mathbf{x}_i)\), \(i=1,2,\ldots,n\), this amounts to maximizing: \[ \mathcal{L} = \prod_{i=1}^n \left(\Phi(\mathbf{x}_i^\top\boldsymbol{\beta})\right)^{y_i} \left(1-\Phi(\mathbf{x}_i^\top\boldsymbol{\beta})\right)^{1-y_i}. \] In fact, it is sufficient to solve the system of likelihood equations obtained by setting the gradient of the log-likelihood to zero: \[ \frac{\partial}{\partial \boldsymbol{\beta}} \ln \mathcal{L} = \sum_{i=1}^n \left(y_i \frac{\mathbf{x}_i \phi(\mathbf{x}_i^\top\boldsymbol{\beta})}{\Phi(\mathbf{x}_i^\top\boldsymbol{\beta})} - (1-y_i) \frac{\mathbf{x}_i \phi(\mathbf{x}_i^\top\boldsymbol{\beta})}{1-\Phi(\mathbf{x}_i^\top\boldsymbol{\beta})}\right) = \boldsymbol{0}, \] where \(\phi\) is the probability density function associated with the \(\mathcal{N}or(0,1)\) distribution.

This system does not have an explicit solution. We will return more extensively to this nonlinear regression model in Section 10.9.
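
In practice the likelihood equations are solved numerically. The sketch below, on simulated data, simply maximizes the log-likelihood with a general-purpose optimizer; it is an illustration, not a reference implementation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)

# Simulated data from a probit model: Y = 1{x'beta + eps >= 0}, eps ~ N(0,1).
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 covariates
beta_true = np.array([-0.5, 1.0, -0.7])
y = (X @ beta_true + rng.normal(size=n) >= 0).astype(float)

def neg_log_lik(beta):
    p = norm.cdf(X @ beta)
    p = np.clip(p, 1e-12, 1 - 1e-12)  # numerical safety
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = minimize(neg_log_lik, x0=np.zeros(3), method="BFGS")
print("estimated beta:", np.round(fit.x, 3))  # should be close to beta_true
```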

10.6.10 The Logit Model

The idea here is also to use a latent variable \(Y^*\), but we assume that \(\varepsilon\) follows a logistic distribution, with cumulative distribution function: \[ F(x) = \frac{1}{1+\exp(-x)}, \quad x \in \mathbb{R}. \]

In this case, the score is given by: \[ S^*(\mathbf{x}) = \Pr[Y=1|\mathbf{X}=\mathbf{x}] = F(\mathbf{x}^\top\boldsymbol{\beta}) = \frac{1}{1+\exp(-\mathbf{x}^\top\boldsymbol{\beta})}. \]

Additionally: \[ \Pr[Y=0|\mathbf{X}=\mathbf{x}] = F(-\mathbf{x}^\top\boldsymbol{\beta}) = \frac{1}{1+\exp(\mathbf{x}^\top\boldsymbol{\beta})}. \]

Note that the ratio \(\Pr[Y=1|\boldsymbol{X}=\boldsymbol{x}] / \Pr[Y=0|\boldsymbol{X}=\boldsymbol{x}]\) is called the odds ratio. For horse racing enthusiasts, odds of 5 to 1 mean that the probability of losing is 5 times greater than the probability of winning. So, when a horse is rated at 5 to 1, it means that 5 bettors have placed losing bets against 1 winning bet. In other words, there are 5 bettors out of 6 who have placed losing bets, resulting in odds of \(\frac{5/6}{1/6}=5\).

We will delve into the details of this nonlinear regression model in Section 10.9.

10.6.11 Duality of Approaches

The approaches of discriminant analysis and logistic regression are, in fact, dual approaches; see (Celeux and Nakache 1994), (Gourieroux 1992) and (Gouriéroux 1999). Discriminant analysis is based on the specification of the conditional distributions of \({\boldsymbol{X}}|Y\), i.e., \(f_{0}\) and \(f_{1}\). Thus, the canonical score can be expressed as \[ S^{\ast }\left( {\boldsymbol{x}}\right) =\Pr \left[ Y=1|{\boldsymbol{X}}={\boldsymbol{x}}\right] =\frac{p_{1}f_{1}\left( {\boldsymbol{x}}\right) }{p_{0}f_{0}\left( {\boldsymbol{x}}\right) +p_{1}f_{1}\left( {\boldsymbol{x}}\right) }. \] Under the assumptions of Example 10.6 (Gaussian conditional distributions with the same covariance matrix), this canonical score can be written after simplifications as \[ S^{\ast }\left( {\boldsymbol{x}}\right) =\frac{1}{1+\exp\big(-\Delta(\boldsymbol{x})\big)} \] where \[\begin{eqnarray*} \Delta(\boldsymbol{x})&=& \boldsymbol{x}^\top\boldsymbol{\Sigma}^{-1}\left( \boldsymbol{\mu }_{1}-\boldsymbol{\mu }_{0}\right)-\ln \left( \frac{1-p_{1}}{p_{1}}\right)\\ &&-\frac{1}{2}\Big( \boldsymbol{\mu}_{1}^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu }_{1}-\boldsymbol{\mu }_{0}^\top \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu }_{0}\Big). \end{eqnarray*}\] This corresponds to a logistic model with a constant term, i.e., \(\Delta(\boldsymbol{x})=\beta_{0}+\boldsymbol{\beta}^\top\boldsymbol{x}\), where \(\boldsymbol{\beta}=\boldsymbol{\Sigma}^{-1}\left( \boldsymbol{\mu }_{1}-\boldsymbol{\mu }_{0}\right)\) and \[ \beta _{0}=-\ln\left( \frac{1-p_{1}}{p_{1}}\right) -\frac{1}{2}\Big( \boldsymbol{\mu }_{1}^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu }_{1}-\boldsymbol{\mu }_{0}^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu }_{0}\Big) . \] In other words, the linear discriminant analysis described in Example 10.6 is a special case of the LOGIT model. Equivalently, the quadratic discriminant analysis obtained with different covariance matrices (second part of Example 10.6) appears as a specific case of the LOGIT model when the explanatory variables include quadratic transformations of the components of \(\boldsymbol{X}\).
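
This identity is easy to check numerically; the following sketch (with arbitrary parameter values) compares the density-based canonical score with its logistic rewriting at a few points:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary parameters of the two Gaussian populations and the prior p1.
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
p1 = 0.2
p0 = 1 - p1

Sigma_inv = np.linalg.inv(Sigma)
beta = Sigma_inv @ (mu1 - mu0)
beta0 = -np.log((1 - p1) / p1) - 0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0)

rng = np.random.default_rng(4)
for x in rng.normal(size=(3, 2)):
    f0 = multivariate_normal(mu0, Sigma).pdf(x)
    f1 = multivariate_normal(mu1, Sigma).pdf(x)
    bayes = p1 * f1 / (p0 * f0 + p1 * f1)               # canonical score from the densities
    logit = 1.0 / (1.0 + np.exp(-(beta0 + x @ beta)))   # logistic rewriting
    print(round(bayes, 6), round(logit, 6))             # the two columns coincide
```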

10.6.12 Performance and Selection Curves

Suppose the score \(S\) is used to discriminate between two subpopulations, good and bad clients, using a threshold \(s\). The idea of (Gourieroux 1992) is then to represent the performance of the score by the performance curve \[ \mathcal{P}=\big\{(x(s),y(s))\big| s\in[0,1]\big\}, \] where \[ x(s)=\Pr[S\leq s]\text{ and }y(s)=\frac{\Pr[Y=0|S\leq s]}{\Pr[Y=0]}, \] whose explicit equation is \(y=\mathcal{P}(x)\). We define the selection curve as \[ \mathcal{S}=\big\{(x(s),y(s))\big| s\in[0,1]\big\}, \] where \[ x(s)=\Pr[S\leq s]\text{ and }y(s)=\Pr[S\leq s|Y=0], \] whose explicit equation is \(y=\mathcal{S}(x)\). The performance curve is necessarily increasing, while the selection curve of a canonical score is always increasing and convex. Note also that the two curves are related by the equation \(\mathcal{S}(x)=x\mathcal{P}(x)\). Both of these curves are depicted in Figure ??.

Performance curves are invariant under strictly increasing transformations of the score: let \(h\) be a strictly increasing transformation, \[ x_h(s)=\Pr[h(S)\leq s]=\Pr[S\leq h^{-1}(s)]=x(h^{-1}(s)), \] and \[ y_h(s)=\frac{\Pr[Y=0|h(S)\leq s]}{\Pr[Y=0]}=y(h^{-1}(s)). \] In other words, these curves do not take into account the value of the score, but only the order it establishes.
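
Empirically, both curves can be traced by sweeping the threshold \(s\) over the observed scores; a minimal sketch with simulated scores and outcomes:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated scores S in [0,1] and outcomes Y (higher score = higher claim probability).
n = 5000
S = rng.uniform(size=n)
Y = (rng.uniform(size=n) < S).astype(int)

thresholds = np.quantile(S, np.linspace(0.01, 1.0, 100))
p_good = (Y == 0).mean()                                  # Pr[Y=0]

x_perf = np.array([(S <= s).mean() for s in thresholds])  # Pr[S <= s]
y_perf = np.array([(Y[S <= s] == 0).mean() / p_good for s in thresholds])       # performance
y_sel = np.array([((S <= s) & (Y == 0)).mean() / p_good for s in thresholds])   # Pr[S<=s | Y=0]

# Check of the relation S(x) = x * P(x) at each threshold.
print(np.allclose(y_sel, x_perf * y_perf))
```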

10.6.13 Desirable Properties of a Score

For any score \(S\) (not necessarily canonical), it is desirable that it strongly depends on \(Y\). In particular, it is possible to show that the random variables \(Y\) and \(S\) are associated (this notion was presented in Section 8.5.3 of Volume 1) if, and only if, the performance curve \(\mathcal{P}\) is below the line \(y=1\), or equivalently, if, and only if, the selection curve \(\mathcal{S}\) is below the first bisector.

Furthermore, if the selection curve is increasing and convex, then the selection curve can be viewed as the Lorenz curve associated with \(\Pr[Y=0|S]\).

10.6.14 Score Comparison

Performance curves can be used to compare scores with each other. In particular, a score \(S_1\) is considered more effective than a score \(S_2\) if, and only if, its performance curve is below that of \(S_2\).

We can also define the so-called score discrimination curve: if we denote \(G_0(s)=\Pr[S\leq s|Y=0]\) and \(G_1(s)=\Pr[S\leq s|Y=1]\), then the score discrimination curve is the function \([0,1]\rightarrow[0,1]\) defined as \[ \mathcal{D}(x)=G_1\circ G_0^{-1}(x), \hspace{2mm}x\in[0,1]. \] Note that this function is increasing and invariant under strictly increasing transformations of the score. Also, the concavity of this function is equivalent to having \(\Pr[Y=1|S=s]\) increasing in \(s\).

The term “discrimination” in this curve comes from the following property: if the score is not discriminative, and thus \(G_0=G_1\) (Y and S are independent), then the curve \(\mathcal{D}\) coincides with the first bisector. On the other hand, for a population partitioned into two subsets, where \(G_0\) and \(G_1\) would then be concentrated at two points, the discrimination curve would be the curve \(y=0\) on \([0,1[\). Between these two extreme cases, it is possible to consider the following preorder: score \(S_1\) is more discriminative than score \(S_2\) if, and only if, its discrimination curve is below that of \(S_2\).
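
Empirically, \(\mathcal{D}\) is obtained by composing the empirical distribution function of the scores of the bad risks with the empirical quantile function of the scores of the good risks; a minimal sketch with simulated scores:

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated scores: bad risks (Y=1) tend to receive higher scores than good risks (Y=0).
s_good = rng.beta(2, 5, size=3000)  # scores of the Y=0 sub-population
s_bad = rng.beta(5, 2, size=1000)   # scores of the Y=1 sub-population

def discrimination_curve(x, s_good, s_bad):
    """D(x) = G1(G0^{-1}(x)) with G0, G1 the empirical cdfs of the two score samples."""
    s = np.quantile(s_good, x)      # empirical G0^{-1}(x)
    return (s_bad <= s).mean()      # empirical G1 evaluated at that point

xs = np.linspace(0.05, 0.95, 10)
print(np.round([discrimination_curve(x, s_good, s_bad) for x in xs], 3))
# A curve well below the first bisector indicates a discriminating score.
```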

10.7 Linear Model

10.7.1 Definition

Since generalized linear models extend the Gaussian linear model, it seems natural to begin by briefly recalling the main results related to the classic linear regression approach. We strongly recommend the interested reader to delve deeper into this subject by consulting the excellent work of (Cornillon and Matzner-Løber 2007).

For a long time, models used to explain the variations of continuous variables \(Y_1, Y_2, \ldots, Y_n\) in the presence of explanatory variables summarized in vectors \(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_n\) took the form \[ Y_i=\beta_0+\sum_{j=1}^p\beta_jx_{ij}+\epsilon_i\text{ with }\epsilon_i\sim\mathcal{N}or(0,\sigma^2). \] Equivalently, we have \[ Y_i\sim\mathcal{N}or\left(\beta_0+\sum_{j=1}^p\beta_jx_{ij},\sigma^2\right),\hspace{2mm}i=1,2,\ldots,n. \] The observations \(Y_i\) are assumed to be normally distributed with a mean of \(\beta_0+\sum_{j=1}^p\beta_jx_{ij}\), an affine function of the explanatory variables, and constant variance \(\sigma^2\). The linear combination of the explanatory variables \(\beta_0+\sum_{j=1}^p\beta_jx_{ij}\) that yields \(\mathbb{E}[Y_i]\) is called the score (or linear predictor) and will be denoted as \(\eta_i\) subsequently.

Even though the linear model imposes serious limitations and its realism is questionable in many problems faced by actuaries, it remains highly significant because most more realistic models (including generalized linear models to be discussed in the next section) borrow many techniques from it.

10.7.2 Matrix Formalism

Matrix formalism is quite useful for analyzing the generalized linear model. We can easily see that the linear regression model can be vectorized as follows: \[\begin{equation} {\boldsymbol{Y}}={\boldsymbol{X}}\boldsymbol{\beta}+\boldsymbol{\epsilon} \tag{10.11} \end{equation}\] where \({\boldsymbol{Y}}=\left(Y_1,Y_2,\ldots, Y_n\right)^\top\) is an \(n\times 1\) vector containing the variables to be explained, \(\boldsymbol{\beta}=\left(\beta_0,\beta_1, \ldots,\beta_p\right)^\top\) is a \((p+1)\times 1\) vector of parameters, \[ {\boldsymbol{X}}=\left( \begin{array}{cc} 1&\boldsymbol{x}_1^\top \\ 1&\boldsymbol{x}_2^\top \\ \vdots&\vdots \\ 1&\boldsymbol{x}_n^\top \end{array} \right)=\left( \begin{array}{ccccc} 1&x_{11} & x_{12} & \ldots & x_{1p} \\ 1&x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots&\vdots & \vdots & \ddots & \vdots \\ 1&x_{n1} & x_{n2} & \ldots & x_{np} \end{array} \right) \] is an \(n\times (p+1)\) matrix containing the explanatory variables, and \(\boldsymbol{\epsilon}=\left(\epsilon_1,\epsilon_2,\ldots,\epsilon_n\right)^\top\sim\mathcal{N}or_n(\boldsymbol{0},\sigma^2\boldsymbol{I})\) is an \(n\times 1\) vector representing the errors. The matrix \({\boldsymbol{X}}\) is assumed to have rank \(p+1\), i.e., the square matrix \({\boldsymbol{X}}^\top{\boldsymbol{X}}\) of dimension \((p+1)\times (p+1)\) is assumed to be invertible.

With these notations, the observed value of the random vector \({\boldsymbol{Y}}\) is the sum of a deterministic component \({\boldsymbol{X}}\boldsymbol{\beta}\) and a random component \(\boldsymbol{\epsilon}\) that models the noise. The deterministic component of the vector \({\boldsymbol{Y}}\) represents the observations that would have been made in the absence of noise. The assumption \(\mathbb{E}[\boldsymbol{\epsilon}]=\boldsymbol{0}\) means that the deterministic component of the vector \({\boldsymbol{Y}}\) is its mean.

10.7.3 Parameter Estimation

Suppose we have realizations \(y_1,y_2,\ldots,y_n\) of the variables \(Y_1,Y_2,\ldots,Y_n\). The likelihood function associated with the observations \(y_1,y_2,\ldots,y_n\) is \[\begin{eqnarray*} \mathcal{L}(\boldsymbol{\beta},\sigma|{\boldsymbol{y}})&=&\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \prod_{i=1}^n\exp\left(-\frac{1}{2\sigma^2}(y_i-{\boldsymbol{x}}_i^\top\boldsymbol{\beta})^2\right)\\ &=&\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left(-\frac{1}{2\sigma^2}({\boldsymbol{y}}-{\boldsymbol{X}}\boldsymbol{\beta})^\top ({\boldsymbol{y}}-{\boldsymbol{X}}\boldsymbol{\beta})\right). \end{eqnarray*}\] As explained above, the maximum likelihood estimator of \(\boldsymbol{\beta}\) is the value of \(\boldsymbol{\beta}\) that maximizes \(\mathcal{L}(\boldsymbol{\beta},\sigma|{\boldsymbol{y}})\). Intuitively, this is the value of \(\boldsymbol{\beta}\) that makes the observations \(y_1,y_2,\ldots,y_n\) most likely in the model (10.11). This estimator is obtained as follows.

Proposition 10.2 The maximum likelihood estimator \(\widehat{\boldsymbol{\beta}}\) of \(\boldsymbol{\beta}\) is a solution of the normal equations \[ {\boldsymbol{X}}^\top{\boldsymbol{X}}\widehat{\boldsymbol{\beta}}-{\boldsymbol{X}}^\top{\boldsymbol{Y}}=\boldsymbol{0} \Leftrightarrow {\boldsymbol{X}}^\top\Big({\boldsymbol{X}}\widehat{\boldsymbol{\beta}}-{\boldsymbol{Y}}\Big)=\boldsymbol{0}. \] Since the matrix \({\boldsymbol{X}}^\top{\boldsymbol{X}}\) has been assumed to be invertible, the system of normal equations has a unique solution given by \[\begin{equation} \widehat{\boldsymbol{\beta}}=\big({\boldsymbol{X}}^\top{\boldsymbol{X}}\big)^{-1}{\boldsymbol{X}}^\top{\boldsymbol{Y}}=\left(\sum_{i=1}^n\boldsymbol{x}_i\boldsymbol{x}_i^\top\right)^{-1} \sum_{i=1}^n\boldsymbol{x}_iy_i, \tag{10.12} \end{equation}\] which defines the maximum likelihood estimator of \(\boldsymbol{\beta}\).

Proof. The log-likelihood is given by \[ L(\boldsymbol{\beta},\sigma|{\boldsymbol{y}})= -\frac{n}{2}\ln\sigma^2-\frac{1}{2\sigma^2}({\boldsymbol{y}}-{\boldsymbol{X}}\boldsymbol{\beta})^\top ({\boldsymbol{y}}-{\boldsymbol{X}}\boldsymbol{\beta})-\frac{n}{2}\ln(2\pi). \] Now, \[ \sup_{(\boldsymbol{\beta},\sigma^2)}L(\boldsymbol{\beta},\sigma|{\boldsymbol{y}}) =\sup_{\sigma^2}\sup_{\boldsymbol{\beta}}L(\boldsymbol{\beta},\sigma|{\boldsymbol{y}}). \] So, regardless of the value of \(\sigma^2\), maximizing \(L(\boldsymbol{\beta},\sigma|{\boldsymbol{y}})\) with respect to \(\boldsymbol{\beta}\) is equivalent to minimizing \[ S_2(\boldsymbol{\beta})= ({\boldsymbol{y}}-{\boldsymbol{X}}\boldsymbol{\beta})^\top ({\boldsymbol{y}}-{\boldsymbol{X}}\boldsymbol{\beta}). \] Any minimizer of \(S_2\) must be a stationary point of this expression; it is therefore obtained by differentiating \(S_2\) with respect to \(\boldsymbol{\beta}\) and setting the gradient to \(\boldsymbol{0}\). Since \[ S_2(\boldsymbol{\beta})={\boldsymbol{y}}^\top{\boldsymbol{y}}-2{\boldsymbol{y}}^\top{\boldsymbol{X}}\boldsymbol{\beta} +\boldsymbol{\beta}^\top{\boldsymbol{X}}^\top{\boldsymbol{X}}\boldsymbol{\beta}, \] it follows that \[ \frac{\partial S_2(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} =-2{\boldsymbol{X}}^\top{\boldsymbol{Y}}+2{\boldsymbol{X}}^\top{\boldsymbol{X}}\boldsymbol{\beta}, \] yielding the system of normal equations. Also, note that any solution to the normal equations corresponds to a minimum of the function \(S_2\) because the Hessian matrix is the positive definite matrix \(2{\boldsymbol{X}}^\top{\boldsymbol{X}}\).

If the matrix \({\boldsymbol{X}}^\top{\boldsymbol{X}}\) is not invertible, the system of normal equations may have more than one solution, and the concept of the generalized inverse matrix is used (note, however, that the estimator of the mean \({\boldsymbol{X}}\widehat{\boldsymbol{\beta}}\) is unique). This occurs if the columns of \(\boldsymbol{X}\) are linearly dependent. This can happen, for example, if for each individual, the number of pre-university study years, the number of higher education study years, and the total number of study years are measured. In practice, it is sufficient to review the variables and eliminate any redundancy of information. Often, problems are caused by the determinant of \(\boldsymbol{X}^\top\boldsymbol{X}\) being very close to zero, causing numerical instability when estimating \(\boldsymbol{\beta}\) and \(\sigma^2\).

The normal equations can also be written as \[ \sum_{i=1}^n\boldsymbol{x}_i(y_i-\widehat{\boldsymbol{\beta}}^\top\boldsymbol{x}_i)=\boldsymbol{0}. \] Written in this way, they have a very simple and intuitive interpretation: \(y_i-\widehat{\boldsymbol{\beta}}^\top\boldsymbol{x}_i\) is the residual associated with observation \(i\), and the normal equations thus impose orthogonality between the residual vector and the columns of the matrix of explanatory variables. This orthogonality intuitively means that there is nothing left in the explanatory variables that can provide information about the residuals. We will see in the next section that this interpretation is preserved in generalized linear models.

By calculating the partial derivative of the log-likelihood \(L(\boldsymbol{\beta},\sigma|{\boldsymbol{y}})\) with respect to \(\sigma^2\), we verify that the maximum likelihood estimator of \(\sigma^2\) is \[ \widetilde{\sigma}^2=\sigma^2(\widehat{\boldsymbol{\beta}})= \frac{S_2(\widehat{\boldsymbol{\beta}})}{n}=\frac{R_0^2}{n}. \] We will see later that we often prefer an unbiased estimator of \(\sigma^2\) over \(\widetilde{\sigma}^2\), which will be defined in (10.13).
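
A short numerical sketch (simulated design matrix) of the estimators introduced so far, computed directly from the normal equations:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated linear model with p = 2 explanatory variables plus an intercept.
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.5])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

# Maximum likelihood / least squares estimator from the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

residuals = y - X @ beta_hat
sigma2_ml = residuals @ residuals / n                     # ML estimator of sigma^2
sigma2_unbiased = residuals @ residuals / (n - p - 1)     # unbiased version (10.13)

print(np.round(beta_hat, 3), round(sigma2_ml, 3), round(sigma2_unbiased, 3))
```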

10.7.4 Prediction Matrix

Having obtained the estimate of the vector \(\boldsymbol{\beta}\), we can define an estimator \(\widehat{{\boldsymbol{Y}}}={\boldsymbol{X}}\widehat{\boldsymbol{\beta}}\) for the mean of the vector \({\boldsymbol{Y}}\) and an estimator \(\widehat{\boldsymbol{\epsilon}}={\boldsymbol{Y}}-\widehat{{\boldsymbol{Y}}}\) for the unobservable vector \(\boldsymbol{\epsilon}\), which is called the residual vector. Note that we can still write \[ \widehat{{\boldsymbol{Y}}}={\boldsymbol{X}}\big({\boldsymbol{X}}^\top{\boldsymbol{X}}\big)^{-1}{\boldsymbol{X}}^\top{\boldsymbol{Y}}, \] which leads to the following definition.

Definition 10.1 The projection matrix (hat matrix) associated with a data matrix \({\boldsymbol{X}}\) is the square \(n\times n\) matrix \({\boldsymbol{H}}\) defined as \[ {\boldsymbol{H}}={\boldsymbol{X}}\big({\boldsymbol{X}}^\top{\boldsymbol{X}}\big)^{-1}{\boldsymbol{X}}^\top. \]

This matrix transforms \(\boldsymbol{Y}\) into \(\widehat{{\boldsymbol{Y}}}\) since \(\widehat{\boldsymbol{Y}}=\boldsymbol{X}\widehat{\boldsymbol{\beta}}={\boldsymbol{H}}\boldsymbol{Y}\). That’s why this matrix is also called the prediction matrix. Consequently, \[ \widehat{\boldsymbol{\epsilon}}={\boldsymbol{Y}}-{\boldsymbol{X}}\widehat{\boldsymbol{\beta}} =({\boldsymbol{I}}-{\boldsymbol{H}}){\boldsymbol{Y}}. \]

The projection matrix \({\boldsymbol{H}}\) associated with \({\boldsymbol{X}}\) has the following properties.

Proposition 10.3 Let \({\boldsymbol{X}}\) be a real matrix of dimension \(n \times (p+1)\), with rank \(p+1\), and \({\boldsymbol{H}}\) be the associated prediction matrix. Then:

  1. \(\sum \limits ^n _ {i=1} h_{ii} = p+1,\) so the trace of \(\boldsymbol{H}\) is equal to the number of regression coefficients,
  2. \(\sum \limits ^n _{i=1} \sum \limits ^n _ {j=1} h^2 _{ij} = p+1,\)
  3. \(0 \leq h_{ii} \leq 1\) for all \(i\),
  4. \(-1/2 \leq h_{ij} \leq 1/2\) for any \(j \neq i\),
  5. if \(h_{ii} = 1\) or \(h_{ii} = 0\), then \(h_{ij} = 0\) for \(j \neq i\),
  6. \((1-h_{ii})(1-h_{jj}) - h^2_{ij} \geq 0\),
  7. \(h_{ii}\,h_{jj} - h^2_{ij} \geq 0\).
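
These properties are easy to check numerically on any design matrix of full rank; a minimal sketch with a simulated \(\boldsymbol{X}\):

```python
import numpy as np

rng = np.random.default_rng(8)

n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # rank p+1 with probability 1

H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix H = X (X'X)^{-1} X'

print(round(np.trace(H), 6))           # property 1: trace(H) = p + 1 = 4
print(round((H ** 2).sum(), 6))        # property 2: sum of squared entries = p + 1
print(np.allclose(H, H @ H))           # H is a projection matrix (idempotent)
print(H.diagonal().min() >= 0, H.diagonal().max() <= 1)   # property 3: 0 <= h_ii <= 1
```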

10.7.5 Estimation of Means and Variance

The following result gives the properties of the estimators \(\widehat{\boldsymbol{Y}}\) for the mean of \(\boldsymbol{Y}\) and \(\widehat{\boldsymbol{\varepsilon}}\) for the vector of residuals.

Proposition 10.4 The random vector \(\widehat{{\boldsymbol{Y}}}\) is an unbiased estimator of the mean of \({\boldsymbol{Y}}\) with a variance-covariance matrix of \(\sigma^2{\boldsymbol{H}}\). The vector \(\widehat{\boldsymbol{\epsilon}}\) of estimated residuals is centered, and its variance-covariance matrix is \(\sigma^2({\boldsymbol{I}}-{\boldsymbol{H}})\). Furthermore, these two vectors are uncorrelated.

The components of the vector \(\widehat{\boldsymbol{\epsilon}}\) are generally correlated; their correlation depends on the design matrix \({\boldsymbol{X}}\).

Since \[\begin{eqnarray*} \mathbb{E}\left[\sum_{i=1}^n\widehat{\epsilon}_i^2\right]&=&\mathbb{E}\left[{\boldsymbol{Y}}^\top{\boldsymbol{Y}}- \widehat{\boldsymbol{\beta}}^\top{\boldsymbol{X}}^\top{\boldsymbol{X}}\widehat{\boldsymbol{\beta}}\right]\\ &=&\mathbb{E}\left[ {\boldsymbol{Y}}^\top({\boldsymbol{I}}-{\boldsymbol{H}}){\boldsymbol{Y}}\right]\\ &=&\text{Trace}\big(\sigma^2({\boldsymbol{I}}-{\boldsymbol{H}})\big)= \sigma^2(n-p-1), \end{eqnarray*}\] we only need to consider \[\begin{equation} \widehat{\sigma}^2=\frac{1}{n-p-1}\sum_{i=1}^n\widehat{\epsilon}_i^2. \tag{10.13} \end{equation}\]

10.7.6 Measurement of Fit Quality: The Coefficient of Determination

To assess the quality of the fit provided by the model, the coefficient of determination is generally used, or the percentage of variance explained by the model, defined as \[ R^2=1-\frac{\sum_{i=1}^n\big(\widehat{y}_i-y_i\big)^2} {\sum_{i=1}^n\big(y_i-\overline{y}\big)^2}= \frac{\sum_{i=1}^n\big(\widehat{y}_i-\overline{y}\big)^2} {\sum_{i=1}^n\big(y_i-\overline{y}\big)^2}. \] The value of \(R^2\) is between 0 and 1, with the model being better as \(R^2\) is closer to 1. The coefficient of determination measures “the proportion of variability in \(Y\) due to its linear regression by least squares on the explanatory variables \(X_i\).” In the case where there is only one explanatory variable (i.e., \(p=1\)), \(R^2\) is the square of the linear correlation coefficient between \(Y\) and \(X_1\). It should be noted that \(R^2\) is only useful if the model includes an intercept term \(\beta_0\).

10.7.7 Standardized Residuals

Unlike the theoretical residuals \(\epsilon_i\) (unobservable), the estimated residuals \(\widehat{\epsilon}_i\) do not have a constant variance and are generally correlated. Therefore, standardized residuals are preferred, given by \[ T_i=\frac{\widehat{\epsilon}_i}{\widehat{\sigma}\sqrt{1-h_{ii}}} \] which remedies these inconveniences.
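
Continuing in the same spirit (simulated data again), the coefficient of determination and the standardized residuals can be computed as follows:

```python
import numpy as np

rng = np.random.default_rng(9)

n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(scale=0.8, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
resid = y - y_hat

# Coefficient of determination R^2.
R2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Standardized residuals T_i = eps_hat_i / (sigma_hat * sqrt(1 - h_ii)).
H = X @ np.linalg.solve(X.T @ X, X.T)
sigma_hat = np.sqrt(resid @ resid / (n - p - 1))
T = resid / (sigma_hat * np.sqrt(1 - H.diagonal()))

print(round(R2, 3), np.round(T[:5], 2))
```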

10.7.8 Inferential Results for Parameters

Let’s continue the statistical analysis from before by completing the properties of the estimators \(\widehat{\boldsymbol{\beta}}\) and \(\widehat{\sigma}^2\). The following results are fundamental.

Proposition 10.5

  1. The estimator \(\widehat{\boldsymbol{\beta}}\) given in (10.12) has a distribution of \(\mathcal{N}or_{p+1}\left(\boldsymbol{\beta},\sigma^2\big({\boldsymbol{X}}^\top{\boldsymbol{X}}\big)^{-1}\right)\).
  2. The estimator \(\widehat{\sigma}^2\) given in (10.13) is such that \[ \frac{(n-p-1)\widehat{\sigma}^2}{\sigma^2} \] follows a chi-squared distribution with \(n-p-1\) degrees of freedom.
  3. The estimators \(\widehat{\boldsymbol{\beta}}\) and \(\widehat{\sigma}^2\) are independent.

10.7.9 Testing a Simple Hypothesis

Let \(\boldsymbol{\beta}_0\) be a fixed value of the parameter vector. To test the null hypothesis \(H_0:\boldsymbol{\beta}=\boldsymbol{\beta}_0\), against its negation, we will apply the likelihood ratio test.

Proposition 10.6 Let \(\boldsymbol{\beta}_0\) be a fixed value of the parameter. We reject \(H_0\) at the significance level \(\alpha\) when \[ \sum_{i=1}^n(y_i-{\boldsymbol{x}}_i^\top\boldsymbol{\beta}_0)^2- \sum_{i=1}^n(y_i-{\boldsymbol{x}}_i^\top\widehat{\boldsymbol{\beta}})^2\geq(p+1)\,\widehat{\sigma}^2 F_{p+1,n-p-1;1-\alpha}, \] where \(F_{p+1,n-p-1;1-\alpha}\) is the \((1-\alpha)\) quantile of the Fisher-Snedecor distribution with \(p+1\) and \(n-p-1\) degrees of freedom.

10.7.10 Comparison of Nested Models

Suppose we have to choose between a model \(M_0\) and another model \(M_1\), such that \(M_0\) involves only a subset of the explanatory variables included in \(M_1\). Formally, \(M_0\) is obtained by setting \(\beta_j=0\) for a set of indices \(j\in \mathcal{E}_0\). In this case, we have nested models.

The choice between \(M_0\) and \(M_1\) is therefore equivalent to testing the nullity of the \(\beta_j\) for \(j\in \mathcal{E}_0\). We will base our decision on the likelihood ratio statistic \[ \frac{\max_{(\boldsymbol{\beta},\sigma^2)\in M_1}\mathcal{L}(\boldsymbol{\beta},\sigma^2|\boldsymbol{y})} {\max_{(\boldsymbol{\beta},\sigma^2)\in M_0}\mathcal{L}(\boldsymbol{\beta},\sigma^2|\boldsymbol{y})} \] and reject \(M_0\) in favor of \(M_1\) if this ratio is sufficiently large.

By denoting \(\widehat{\boldsymbol{\beta}}_1\) and \(\widehat{\sigma}_1^2\) as the maximum likelihood estimators in model \(M_1\), and \(\widehat{\boldsymbol{\beta}}_0\) and \(\widehat{\sigma}_0^2\) as the maximum likelihood estimators in model \(M_0\), we can still use the test statistic \[ \frac{\sum_{i=1}^n\big(y_i-\widehat{\boldsymbol{\beta}}_0^\top\boldsymbol{x}_i\big)^2 -\sum_{i=1}^n\big(y_i-\widehat{\boldsymbol{\beta}}_1^\top\boldsymbol{x}_i\big)^2} {\sum_{i=1}^n\big(y_i-\widehat{\boldsymbol{\beta}}_1^\top\boldsymbol{x}_i\big)^2}. \] If model \(M_0\) has \(p_0+1\) parameters and model \(M_1\) has \(p_1+1\) parameters, this statistic, multiplied by \((n-p_1-1)/(p_1-p_0)\), follows the Fisher distribution with parameters \(p_1-p_0\) and \(n-p_1-1\).

::: {.example}[Test of Linear Constraints on Parameters]

Suppose that, in the context of credit risk analysis, we have data on a married couple’s individual income levels, one for the husband and one for the wife. We may wonder if it is relevant to include both of these variables as explanatory factors in the model, i.e., to use a model of the form: \[ \ldots+\beta_j\times\text{ husband's income }+\beta_{j+1}\times\text{ wife's income }+\ldots \] or, on the contrary, if only the total household income matters. In this case, we can test the hypothesis \(H_0:\beta_j=\beta_{j+1}\). We will proceed as above with model \(M_0\) defined by the constraint \(\beta_j=\beta_{j+1}\) while model \(M_1\) allows these two parameters to differ.

:::

10.7.11 Confidence Regions

Confidence regions are defined by the set of parameters \(\boldsymbol{\beta}\) that are not rejected, at the given level, as possible values of the parameter based on the maximum likelihood ratio test. These are regions in the parameter space that appear reasonable given the collected observations.

Proposition 10.7 The confidence region for \(\boldsymbol{\beta}\) at the \(1-\alpha\) level is defined by the set of values of \(\boldsymbol{\beta}\) such that \[ (\boldsymbol{\beta}-\widehat{\boldsymbol{\beta}})^\top{\boldsymbol{X}}^\top{\boldsymbol{X}} (\boldsymbol{\beta}-\widehat{\boldsymbol{\beta}})\leq (p+1)\,\widehat{\sigma}^2 F_{p+1,n-p-1;1-\alpha}. \]

10.7.12 Confidence Intervals

Rather than confidence regions, most commercial software often presents confidence intervals for individual parameters. These intervals are typically obtained by noting that the \(j\)-th component of \(\widehat{\boldsymbol{\beta}}\), \(\widehat{\beta}_j\), follows a normal distribution with mean \(\beta_j\) and variance \(\sigma^2\big({\boldsymbol{X}}^\top{\boldsymbol{X}}\big)_{jj}^{-1}\), where \(\big({\boldsymbol{X}}^\top{\boldsymbol{X}}\big)_{jj}^{-1}\) is the \(j\)-th diagonal element of the inverse of the matrix \({\boldsymbol{X}}^\top{\boldsymbol{X}}\).

When \(\sigma^2\) is known, the confidence interval for \(\beta_j\) is formed by the values \(\xi\in\mathbb{R}\) that satisfy \[ |\widehat{\beta}_j-\xi|\leq\sigma\sqrt{\big({\boldsymbol{X}}^\top{\boldsymbol{X}}\big)_{jj}^{-1}}z_{\alpha/2}, \] where \(z_{\alpha/2}\) is the \((1-\alpha/2)\) quantile associated with the \(\mathcal{N}or(0,1)\) distribution. When \(\sigma^2\) is unknown, a \(1-\alpha\) confidence interval for \(\beta_j\) is given by the set of values \(\xi\in\mathbb{R}\) that satisfy \[ |\widehat{\beta}_j-\xi|\leq\widehat{\sigma} \sqrt{\big({\boldsymbol{X}}^\top{\boldsymbol{X}}\big)_{jj}^{-1}}t_{n-p-1;1-\alpha/2}, \] where \(t_{n-p-1;1-\alpha/2}\) is the \((1-\alpha/2)\) quantile associated with the Student’s t-distribution with \(n-p-1\) degrees of freedom.
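
A sketch of the usual interval when \(\sigma^2\) is unknown, on simulated data; the Student quantile is taken from scipy:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(10)

n, p, alpha = 200, 2, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(scale=0.8, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - p - 1))

quantile = t.ppf(1 - alpha / 2, df=n - p - 1)
half_width = sigma_hat * np.sqrt(np.diag(XtX_inv)) * quantile

for j in range(p + 1):
    print(f"beta_{j}: [{beta_hat[j] - half_width[j]:.3f}, {beta_hat[j] + half_width[j]:.3f}]")
```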

Remark. The confidence intervals defined above are not adequate if you want to consider multiple parameters simultaneously because they do not account for the parameter dependencies. When the parameter \(\boldsymbol{\beta}\) has two components \(\beta_1\) and \(\beta_2\), the confidence region, which in this case is an ellipse, consists of pairs \((\beta_1, \beta_2)\) that reasonably explain the observations, taking into account the correlation between \(\hat\beta_1\) and \(\hat\beta_2\). In contrast, the separate confidence intervals for parameters \(\beta_1\) and \(\beta_2\) locate the value of one component without considering the value taken by the other.

10.7.13 Measures of Influence

10.7.13.1 Principle

The results of the least squares fitting of a linear model to a set of observations can be significantly altered by the removal or perturbation of certain data points. Various statistics have been defined to quantify the influence of each observation on the model’s fit. These statistics are primarily based on the residuals \(\hat{\epsilon}_i\) and the projection matrix \(\boldsymbol{H}\).

Here, we present the approach based on omission, primarily relying on the comparison between the results obtained when fitting the model to the entire dataset and those obtained after fitting the model with one or more observations omitted.

10.7.13.2 Effect of Omitting an Observation

The characteristics obtained after omitting observation \(i\) will be denoted with the subscript \((i)\). Thus, \(\widehat{\boldsymbol{\beta}}_{(i)}\) is the estimated regression vector based on the remaining \(n-1\) observations, namely \(y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n\). We can study the effect of each observation on the estimations in a regression model.

Proposition 10.8 After omitting the \(i\)th observation, the least squares estimators \(\widehat{\boldsymbol{\beta}}_{(i)}\) and \(\hat{\sigma}^2_{(i)}\) for the parameters \(\boldsymbol{\beta}\) and \(\sigma^2\) of the linear model satisfy the following equations:

\[ \widehat{\boldsymbol{\beta}}_{(i)} = \widehat{\boldsymbol{\beta}} - (\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{x}_i \cdot \frac{\widehat{\epsilon}_i}{1-h_{ii}} \]

and

\[ (n-p-2) \hat{\sigma}^2_{(i)} = (n-p-1) \hat{\sigma}^2 - \frac{\widehat{\epsilon}_i^2}{1-h_{ii}} \]

The two preceding formulas demonstrate that the estimators \(\widehat{\boldsymbol{\beta}}_{(i)}\) and \(\hat{\sigma}^2_{(i)}\) depend solely on \(\widehat{\boldsymbol{\beta}}\), \(\hat{\sigma}^2\), the residual \(\widehat{\epsilon}_i\), and the \(i\)th diagonal element of \(\boldsymbol{H}\). Therefore, when omitting an observation, it is not necessary to perform the model fitting again to compute these values.
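
The updating formulas of Proposition 10.8 can be checked numerically by comparing them with an explicit refit on the \(n-1\) remaining observations; a short sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(11)

n, p, i = 60, 2, 7                       # i: index of the omitted observation
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(scale=0.8, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
h = (X @ XtX_inv * X).sum(axis=1)        # leverages h_ii

# Updating formula of Proposition 10.8 ...
beta_i_formula = beta_hat - XtX_inv @ X[i] * (resid[i] / (1 - h[i]))

# ... versus an explicit refit on the n-1 remaining observations.
mask = np.arange(n) != i
X_i, y_i = X[mask], y[mask]
beta_i_refit = np.linalg.solve(X_i.T @ X_i, X_i.T @ y_i)

print(np.allclose(beta_i_formula, beta_i_refit))   # True: the two coincide
```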

10.7.13.3 Diagnostic Measures Based on Residuals

To assess the suitability of an observation for the model, we can examine whether its omission has an impact on its prediction. Specifically, we want to determine if observation \(y_i\) is close enough to its predicted value \(\hat{y}_{(i)}\) obtained by excluding the \(i\)th observation from the calculation. Since \(y_i\) is not used in the calculation of \(\hat{y}_{(i)}\), the random variables \(y_i\) and \(\hat{y}_{(i)}\) are uncorrelated. Their difference \(y_i - \hat{y}_{(i)}\) has a variance given by:

\[ \sigma^2\left(1+\boldsymbol{x}_i^\top (\boldsymbol{X}_{(i)}^\top \boldsymbol{X}_{(i)})^{-1}\boldsymbol{x}_i\right) \]

When the parameter \(\sigma^2\) is unknown, it is estimated by the residual variance \(\hat{\sigma}^2_{(i)}\) obtained from the regression equation after removing the \(i\)th observation. This estimator is independent of \(Y_i\). This leads to the definition of the following statistics:

\[\begin{eqnarray*} T^\ast_i &=& \frac{Y_i - \hat{Y}_{(i)}}{\hat{\sigma}_{(i)} \sqrt{1+\boldsymbol{x}_i^\top (\boldsymbol{X}_{(i)}^\top \boldsymbol{X}_{(i)})^{-1}\boldsymbol{x}_i}} \\ &=& \frac{\hat{\epsilon}_i}{\hat{\sigma}_{(i)} \sqrt{1-h_{ii}}}, \quad i=1,2,\ldots,n, \end{eqnarray*}\]

called cross-validation residuals. The probability distribution of \(T^\ast_i\) is given by the following theorem.

Proposition 10.9 Assuming that the matrix \(\boldsymbol{X}\) has rank \(p+1\), if the removal of the \(i\)th row of \(\boldsymbol{X}\) does not change its rank, then the cross-validation residuals \(T^\ast_i\), \(i=1, \ldots, n\), follow the Student’s t-distribution with \(n-p-2\) degrees of freedom.

Empirical evidence shows that to detect “outlier” observations, standardized residuals \(T_i\) and cross-validation residuals \(T^\ast_i\) are equivalent. Nevertheless, several authors prefer \(T^\ast_i\) over \(T_i\) for the following reasons:

  1. The \(T^\ast_i\), \(i=1,2, \ldots, n\), are identically distributed and follow the Student’s t-distribution with \((n-p-2)\) degrees of freedom.
  2. A simple calculation shows that \(T^\ast_i = T_i \sqrt{\frac{n-p-2}{n-p-1-T_i^2}}\). This relationship demonstrates that \(T^\ast_i\) is a monotonic function of \(T_i\) and is more sensitive to observations with large residuals (see the sketch after this list).
  3. Since \(\hat{\sigma}_{(i)}\) is independent of \(y_i\), this estimator is robust to gross errors in the \(i\)th observation, which can occur during data acquisition.
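
As announced in point 2 above, here is a short sketch (simulated data, our own notation) that computes both types of residuals from the full fit only and checks the relation between them:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
resid = y - H @ y
sigma2 = resid @ resid / (n - p - 1)

T = resid / np.sqrt(sigma2 * (1 - h))                        # standardized residuals
sigma2_loo = ((n - p - 1) * sigma2 - resid**2 / (1 - h)) / (n - p - 2)
T_star = resid / np.sqrt(sigma2_loo * (1 - h))               # cross-validation residuals

# Relation between the two types of residuals
print(np.allclose(T_star, T * np.sqrt((n - p - 2) / (n - p - 1 - T**2))))  # True
```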

10.7.13.4 Outlier Observations

We now aim to clarify what is meant by an outlier observation in a linear model.

Definition 10.2 An outlier data point is a point \((\boldsymbol{x}_i^\top, y_i)\) for which the associated value \(t^\ast_i\) of \(T^\ast_i\) is high (compared to the threshold given by the Student’s t-distribution).

Outlier data points are typically detected by plotting the \(t^\ast_i\) (or the \(t_i\) or the partial residuals) sequentially or as a function of other variables such as \(y_i\) or \(\boldsymbol{x}_i\). Detecting outlier data depends solely on the magnitude of the residuals.

Graphical representations of residuals often allow not only the detection of outlier data but also the verification of the model’s validity. We will recall some of the most common representations.

Empirical Distribution Representation

One way to represent residuals is by plotting the histogram, smoothed density, etc., of the empirical distribution of residuals. The use of such representations to check the assumption of normality of data makes sense only for large samples.

Representations as a Function of Fitted Values

Residuals can also be plotted as a function of fitted values. We have seen that when the assumptions associated with the model are correct, residuals and predicted values are uncorrelated. Therefore, the plot of points should not exhibit any particular structure. This type of plot provides insights into the validity of linearity assumptions and the homogeneity of error variance. For example, curvature in the shape of residuals suggests that the linearity assumption may not be appropriate, while a monotonic behavior of residual variability with \(\hat{y}_i\) indicates non-constant error variance.

10.7.13.5 Diagnostic Measures Based on the Projection Matrix

The arrangement of points in the space of regressors plays an important role. The projection matrix \(\boldsymbol{H}\) partially evaluates this influence. Recall that the components of \(\widehat{\boldsymbol{\epsilon}}\) are centered variables with variance equal to \(\sigma^2(1 - h_{ii})\), where

\[ h_{ii} = \boldsymbol{x}_i^\top (\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{x}_i, \quad i=1,2, \ldots, n. \]

The variance of the residual \(\hat{\epsilon}_i\) is therefore smaller when \(h_{ii}\) is larger, and the value of \(h_{ii}\) measures the influence of observation \(y_i\) on the fitted value \(\hat{y}_i\). The matrix \(\boldsymbol{H}\) is symmetric and idempotent, as it projects \(\boldsymbol{Y}\) onto the subspace spanned by the column vectors of the matrix \(\boldsymbol{X}\).

The \(i\)th component of \(\widehat{\boldsymbol{Y}}\) is:

\[ \hat{Y}_i = \sum_{j=1}^n h_{ij} Y_j = h_{ii} Y_i + \sum_{j \neq i} h_{ij} Y_j, \quad i=1,2, \ldots, n, \]

and thus, \(h_{ii}\) represents the “weight” of observation \(y_i\) in determining the prediction \(\hat{y}_i\). In particular, if \(h_{ii} = 1\), \(\hat{y}_i\) is determined solely by observation \(y_i\). Furthermore, if \(h_{ii} = 0\), observation \(y_i\) has no influence on \(\hat{y}_i\).

Since the trace of \(\boldsymbol{H}\) is equal to \(p+1\), the average of the \(h_{ii}\) is \((p+1)/n\). When \(h_{ii}\) is “large,” as we have seen, the corresponding observation is influential. This has led to labeling points for which \(h_{ii} > 2(p+1)/n\) as influential. Other authors prefer to work with the bounds 0.2 and 0.5: values of \(h_{ii} \leq 0.2\) are considered normal, values between 0.2 and 0.5 call for caution, and values above 0.5 signal an observation that largely determines its own fitted value.
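
As an illustration, a minimal sketch (simulated design with one isolated point; the flagging rule \(h_{ii} > 2(p+1)/n\) is the one given above) computing the leverages might look as follows:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
X[0, 1] = 8.0                                   # an isolated point in the space of regressors

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages h_ii
print(h.sum())                                  # trace of H, equal to p + 1
threshold = 2 * (p + 1) / n
print(np.where(h > threshold)[0])               # observations flagged as influential
```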

10.7.13.6 Cook’s Distance

Under the assumption of normality of observations and when \(\sigma\) is unknown, the confidence region for the vector \(\boldsymbol{\beta}\) of the linear model coefficients is given by: \[ C_{\alpha} = \left\{\boldsymbol{\beta}\in\mathbb{R}^{p+1} \, \middle| \, (\boldsymbol{\beta} - \widehat{\boldsymbol{\beta}})^\top\boldsymbol{X}^\top\boldsymbol{X} (\boldsymbol{\beta} - \widehat{\boldsymbol{\beta}}) \leq \hat \sigma^2 (p+1) F_{p+1, n-p-1; 1-\alpha}\right\}. \] This inequality defines an ellipsoid centered at the point \(\widehat{\boldsymbol{\beta}}\). The influence of the \(i\)th observation can be measured by the displacement of this ellipsoid when the \(i\)th observation is omitted.

Cook introduced the statistic: \[ C_i = \frac{(\widehat{\boldsymbol{\beta}}_{(i)} - \widehat{\boldsymbol{\beta}})^\top\boldsymbol{X}^\top\boldsymbol{X} (\widehat{\boldsymbol{\beta}}_{(i)} - \widehat{\boldsymbol{\beta}})}{\hat \sigma^2 (p+1)}, \] to detect the influence of the \(i\)th observation on the regression coefficients. This statistic is called Cook’s distance. One can consider it as a weighted distance between \(\widehat{\boldsymbol{\beta}}\) and \(\widehat{\boldsymbol{\beta}}_{(i)}\). At first glance, it seems that to calculate Cook’s distance for all observations, one would need to perform \(n+1\) regressions (one using all the data and \(n\) using reduced data). However, it can be shown that \[ C_i = \frac{h_{ii}}{(p+1)(1-h_{ii})} T^2_i. \] Thus, \(C_i\) can be computed using quantities already calculated during the fitting of the full model. Furthermore, this relationship shows that Cook’s distance \(C_i\) is an increasing function of the square of the standardized residual and \(h_{ii}\). When \(C_i\) is large, the corresponding observation has a simultaneous influence on all model parameters. Cook suggests comparing each \(C_i\) to the quantiles of the Fisher-Snedecor distribution with \(p+1\) and \(n-p-1\) degrees of freedom, even though the \(C_i\) do not exactly follow such a distribution. This is not a rigorous test.
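
A short sketch (simulated data with one contaminated response, our own variable names) computing Cook's distance through this closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)
y[5] += 6.0                                     # contaminate one response

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
resid = y - H @ y
sigma2 = resid @ resid / (n - p - 1)
T = resid / np.sqrt(sigma2 * (1 - h))           # standardized residuals

cook = h / ((p + 1) * (1 - h)) * T**2           # Cook's distance via the closed form
print(cook.argmax(), cook.max())                # the contaminated observation should stand out
```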

10.7.13.7 Likelihood Distance

If \(L (\boldsymbol{\beta}, \sigma^2)\) denotes the log-likelihood for \(\boldsymbol{\beta}\) and \(\sigma^2\), then the statistic: \[ LD_i = 2 \cdot \left\{ L(\widehat{\boldsymbol{\beta}}, \widehat{\sigma}^2) - L (\widehat{\boldsymbol{\beta}}_{(i)}, \widehat \sigma^2_{(i)})\right\}, \] is called the likelihood distance. It can also be shown that \[ LD_i = n \ln \left(\frac{n(n-p-1-T^2_i)}{(n-1)(n-p-1)}\right)+ \frac{(n-1) T^2_i}{(1-h_{ii})(n-p-1-T^2_i)}-1. \] This measure is useful when considering the joint influence of observation \(i\) on the estimators of \(\boldsymbol{\beta}\) and \(\sigma^2\). The likelihood distance is compared to the \(1-\alpha\) percentile of the chi-squared distribution with \(p+1\) degrees of freedom.

10.7.14 Weighted Least Squares

10.7.14.1 Definition

Let’s now consider the model where the observations \(Y_1, Y_2, \ldots, Y_n\) have the representation: \[ Y_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} + \epsilon_i \text{ with } \epsilon_i \sim \mathcal{N}or(0, \sigma^2/w_i), \] where \(w_i\) is a weight associated with observation \(i\). Typically, this weight arises when \(Y_i\) is the average of \(w_i\) observations. Note that a high weight is associated with low variance. In other words, deviations from the mean \(\beta_0 + \sum_{j=1}^p \beta_j x_{ij}\) will be less tolerated as the weights increase. Equivalently, we have: \[ Y_i \sim \mathcal{N}or\left(\beta_0 + \sum_{j=1}^p \beta_j x_{ij}, \frac{\sigma^2}{w_i}\right), \hspace{2mm} i=1,2,\ldots,n. \]

The matrix formalism is also useful for analyzing this model. It is easy to see that the linear regression model can be rewritten vectorially as in (10.11), where \({\boldsymbol{Y}}\), \(\boldsymbol{\beta}\), \({\boldsymbol{X}}\) are as defined previously, and \(\boldsymbol{\epsilon} \sim \mathcal{N}or_n(\boldsymbol{0}, \sigma^2\boldsymbol{W})\), with: \[ \boldsymbol{W}= \begin{pmatrix} 1/w_1 & 0 & \cdots & 0 \\ 0 & 1/w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1/w_n \end{pmatrix}. \]

10.7.14.2 Parameter Estimation

The likelihood function for the observations \(y_1, y_2, \ldots, y_n\) is: \[\begin{eqnarray*} \mathcal{L}(\boldsymbol{\beta},\sigma|\boldsymbol{y}) &=& \frac{1}{(2\pi)^{n/2}\sigma^n|\boldsymbol{W}|^{1/2}} \exp\left(-\frac{1}{2\sigma^2}(\boldsymbol{y}-{\boldsymbol{X}}\boldsymbol{\beta})^\top\boldsymbol{W}^{-1} (\boldsymbol{y}-{\boldsymbol{X}}\boldsymbol{\beta})\right). \end{eqnarray*}\] Regardless of the value of \(\sigma^2\), maximizing \(L(\boldsymbol{\beta},\sigma|\boldsymbol{y}) = \ln \mathcal{L}(\boldsymbol{\beta},\sigma|\boldsymbol{y})\) with respect to \(\boldsymbol{\beta}\) is equivalent to minimizing: \[\begin{eqnarray*} S_2(\boldsymbol{\beta}) &=& ({\boldsymbol{y}}-{\boldsymbol{X}}\boldsymbol{\beta})^\top\boldsymbol{W}^{-1} ({\boldsymbol{y}}-{\boldsymbol{X}}\boldsymbol{\beta})\\ &=& \sum_{i=1}^n w_i(y_i-\boldsymbol{x}_i^\top\boldsymbol{\beta})^2, \end{eqnarray*}\] where the column vector \({\boldsymbol{x}}_i\) contains the elements of the \(i\)th row of matrix \({\boldsymbol{X}}\). This means minimizing a weighted sum of squares of the differences between the response \(y_i\) and the linear predictor \(\boldsymbol{x}_i^\top\boldsymbol{\beta}\). Observations for which the weight \(w_i\) is high weigh more heavily in the minimization of \(S_2(\boldsymbol{\beta})\): less deviation between \(y_i\) and \(\boldsymbol{x}_i^\top\boldsymbol{\beta}\) will be tolerated for these indices.

To find \(\widehat{\boldsymbol{\beta}}\) that minimizes \(S_2\), we look for the stationary point of this expression, obtained by differentiating \(S_2\) with respect to \(\boldsymbol{\beta}\) and setting the gradient equal to \(\boldsymbol{0}\). The maximum likelihood estimator \(\widehat{\boldsymbol{\beta}}\) of \(\boldsymbol{\beta}\) satisfies the normal equations: \[ {\boldsymbol{X}}^\top\boldsymbol{W}^{-1}{\boldsymbol{X}}\widehat{\boldsymbol{\beta}}-{\boldsymbol{X}}^\top\boldsymbol{W}^{-1}{\boldsymbol{Y}}=\boldsymbol{0}, \] which gives: \[\begin{equation} \widehat{\boldsymbol{\beta}} = \big({\boldsymbol{X}}^\top\boldsymbol{W}^{-1}{\boldsymbol{X}}\big)^{-1}{\boldsymbol{X}}^\top\boldsymbol{W}^{-1}{\boldsymbol{Y}}, \tag{10.12} \end{equation}\] defining the maximum likelihood estimator of \(\boldsymbol{\beta}\); here \(\boldsymbol{W}^{-1}\) is simply the diagonal matrix of the weights \(w_1,\ldots,w_n\).
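
To illustrate formula (10.12), here is a minimal sketch on simulated data (our own variable names). With the definition of \(\boldsymbol{W}\) given above, \(\boldsymbol{W}^{-1}=\mathrm{diag}(w_1,\ldots,w_n)\), and weighted least squares amounts to ordinary least squares after rescaling each row by \(\sqrt{w_i}\):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 2
w = rng.integers(1, 10, size=n).astype(float)        # e.g. y_i is an average of w_i observations
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([2.0, 1.0, -0.5])
y = X @ beta_true + rng.normal(size=n) / np.sqrt(w)   # Var(eps_i) = sigma^2 / w_i with sigma = 1

Winv = np.diag(w)                                     # W^{-1} = diag(w_1, ..., w_n)
beta_wls = np.linalg.solve(X.T @ Winv @ X, X.T @ Winv @ y)   # formula (10.12)

# Equivalent computation: rescale each row by sqrt(w_i), then ordinary least squares
Xs, ys = X * np.sqrt(w)[:, None], y * np.sqrt(w)
beta_ols = np.linalg.lstsq(Xs, ys, rcond=None)[0]
print(np.allclose(beta_wls, beta_ols))                # True
```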

10.7.14.3 Projection Matrix

Having obtained the estimate of the vector \(\boldsymbol{\beta}\), we can define an estimator \(\widehat{{\boldsymbol{Y}}}={\boldsymbol{X}}\widehat{\boldsymbol{\beta}}\) for the mean of the vector \({\boldsymbol{Y}}\) and the vector of residuals \(\widehat{\boldsymbol{\epsilon}}={\boldsymbol{Y}}-\widehat{{\boldsymbol{Y}}}\). Note that we can still write: \[ \widehat{{\boldsymbol{Y}}}={\boldsymbol{X}}\big({\boldsymbol{X}}^\top\boldsymbol{W}^{-1}{\boldsymbol{X}}\big)^{-1}{\boldsymbol{X}}^\top\boldsymbol{W}^{-1}{\boldsymbol{Y}}. \] By defining the projection matrix (hat matrix) as the square \(n\times n\) matrix \(\boldsymbol{H}\): \[ {\boldsymbol{H}}={\boldsymbol{X}}\big({\boldsymbol{X}}^\top\boldsymbol{W}^{-1}{\boldsymbol{X}}\big)^{-1}{\boldsymbol{X}}^\top\boldsymbol{W}^{-1}, \] we can see that this matrix maps \({\boldsymbol{Y}}\) to \(\widehat{{\boldsymbol{Y}}}\) since \(\widehat{\boldsymbol{Y}}=\boldsymbol{X}\widehat{\boldsymbol{\beta}}={\boldsymbol{H}}\boldsymbol{Y}\). Note that \(\boldsymbol{H}\) does not depend on \(\boldsymbol{Y}\) but only on the weights and the regressors \(\boldsymbol{X}\).

10.8 Additive Models

10.8.1 Principle

There is no guarantee that continuous variables enter the score \(\eta_i\) linearly. Thus, if the first variable is continuous (think, for example, of the age of the insured person), it might be advantageous to consider a model of the form: \[ Y_i \sim \mathcal{N}or(\eta_i, \sigma^2) \text{ where } \eta_i = \beta_0 + f(x_{i1}) + \sum_{j=2}^p \beta_j x_{ij}. \] The function \(f\), to be estimated based on the data, will express the relationship between the score \(\eta_i\) and \(x_{i1}\), taking into account the other variables \(x_{i2},\ldots,x_{ip}\). This is precisely the advantage of this approach compared to the prior determination of classes for a quantitative variable: when this approach is part of a regression model, it corrects for the other explanatory variables.

Remark. Another, more traditional approach is to substitute various transformations of the variable (e.g., using polynomials, sinusoids, etc.) to capture non-linear influence. However, this approach is much less convincing than estimating \(f\) directly. It leads to an increase in the number of parameters to estimate and imposes an arbitrary choice of transformations for the explanatory variables (which can be erroneous).

The innovation of additive models is to allow the score to be a non-linear additive function of the covariates. Specifically, we will consider the form: \[ \eta = c + \sum_{j=1}^p f_j(x_j) \] for the score, where the functions \(f_j(\cdot)\), \(j=1,\ldots,p\), assumed to be smooth, reflect the influence of the explanatory variables \(x_1,\ldots,x_p\) on the response \(y\). If all functions \(f_j\) are linear, we are reduced to the linear model.

To explain how an additive model is fitted, we will proceed in two successive steps. We will start by addressing the simple model \(y=f(x)+\epsilon\) where the error \(\epsilon\) is normally distributed. We will see how it is possible to estimate the unknown function \(f(\cdot)\), which reflects the influence of \(x\) on \(y\), using smoothing or local linear fitting techniques. Then, we will move on to the case of multiple explanatory variables \(y=\sum_{j=1}^p f_j(x_j)+\epsilon\).

10.8.2 Single Regressor Case

Let’s begin by considering the following elementary model. We have \(n\) observations \((x_i,y_i)\), \(i=1,\ldots,n\), where the response variable \(y_i\) and the regressor \(x_i\) are continuous. The influence of \(x_i\) on \(y_i\) is modeled using a function \(f(\cdot)\), assumed to be smooth, plus additive noise, i.e.: \[\begin{equation} y_i = f(x_i) + \epsilon_i \tag{10.14} \end{equation}\] where the errors \(\epsilon_i\) are assumed to be independent with a \(\mathcal{N}or(0,\sigma^2)\) distribution. The specification (10.14) allows us to free ourselves from the linearity constraint imposed in classical regression (where \(f(x_i)=\beta_0+\beta_1x_i\)). The estimation of \(f(\cdot)\) will be done using smoothing techniques, and we will focus specifically on linear smoothers, in the sense that \(\widehat{f}(x_i)\) can be expressed as a linear combination of the values \(y_1,\ldots,y_n\), i.e.: \[\begin{equation} \widehat{f}(x_i) = \sum_{j=1}^n h_{ij} y_j \tag{10.15} \end{equation}\] where the weights \(h_{ij}=h(x_i,x_j)\) depend on the location \(x_i\) where the response \(f(x_i)\) needs to be estimated. If we define the vector \(\boldsymbol{f}=(f(x_1),\ldots,f(x_n))^\top\), (10.15) can be rewritten as \(\widehat{\boldsymbol{f}}=\boldsymbol{H}\boldsymbol{y}\).

10.8.2.1 Loess Method

Principle: Weighted Least Squares

This method proposed by (Cleveland 1979) is part of the family of local polynomial regressions. It consists of locally approximating \(f(\cdot)\) with a line (note that it is also possible to locally approximate \(f\) with a constant, in which case, we obtain weighted moving averages, or with a polynomial of degree 2. Here, we consider only local linear fitting). This approach was developed by (Cleveland and Devlin 1988) and (Cleveland and Grosse 1991). The idea is to use the \(\lambda\) nearest neighbors of \(x\) to estimate \(f(x)\). The neighborhood is defined based on the explanatory variables: we use the \(\lambda\) observations with explanatory variables closest to \(x\) to estimate the response \(f(x)\).

The Loess method can be broken down as follows: given the \(n\) observations \((x_i,y_i)\), \(i=1,2,\ldots,n\),

  1. Identify the \(\lambda\) nearest neighbors of \(x\) (let \(\mathcal{V}(x)\) be the set of these neighbors).
  2. Calculate the distance \(\Delta(x)\) between \(x\) and the furthest of its \(\lambda\) nearest neighbors as \[ \Delta(x) = \max_{i\in\mathcal{V}(x)} |x-x_i|; \]
  3. Assign weights \[ w_i(x) = K\left(\frac{|x-x_i|}{\Delta(x)}\right) \] to each element in \(\mathcal{V}(x)\). The function \(K(\cdot)\) assigning weights \(w_i(x)\) to the observations in \(\mathcal{V}(x)\) should have the following properties:
  • \(K(u) \geq 0\) for all \(u\)
  • \(K(u) = 0\) for \(u > 1\)
  • \(K\) is non-decreasing on (0,1). As suggested by (Cleveland 1979), we will use the function \(K(\cdot)\) given by \[ K(u) = \left\{ \begin{array}{l} (1-u^3)^3\text{ for }0\leq u < 1\\ 0\text{ otherwise}. \end{array} \right. \] This way, more weight is given to points in the nearest neighborhood of \(x\).
  4. \(\widehat{f}(x)\) is obtained by regressing the \(y_i\), \(i\in\mathcal{V}(x)\), on the corresponding \(x_i\) using weighted least squares, and then using the regression line to predict the response corresponding to \(x\).

This approach provides a response in the form of (10.15).

In the case we are dealing with here, that is, with a single regressor, the fitted value \(\widehat{y}_i=\widehat{\beta}_0(x_i) +\widehat{\beta}_1(x_i)x_i\) is obtained by determining \(\widehat{\beta}_0(x_i)\) and \(\widehat{\beta}_1(x_i)\) in a way that minimizes \[ \sum_{k\in\mathcal{V}(x_i)}w_k(x_i)\big(y_k-{\beta}_0(x_i)-{\beta}_1(x_i)x_k\big)^2 \] which yields \[ \widehat{y}_i=\sum_{k=1}^nh_k(x_i)y_k, \] where \(h_k(x_i)\) does not depend on the \(y_j\), \(j=1,\ldots,n\) (\(h_k(x_i)\) depends only on the regressors).

If we are interested in the response for an unobserved value \(x\), the model used to estimate \(f(x)\) is therefore \[ y_i=\beta_0(x)+\beta_1(x)x_i+\epsilon_i\text{ for }i\in\mathcal{V}(x) \] where the estimates \(\widehat{\beta}_0(x)\) and \(\widehat{\beta}_1(x)\) of the parameters \(\beta_0(x)\) and \(\beta_1(x)\) are obtained by minimizing \[ \sum_{k\in\mathcal{V}(x)}w_k(x)\big(y_k-{\beta}_0(x)-{\beta}_1(x)x_k\big)^2. \] This ultimately gives \(\widehat{f}(x)=\widehat{\beta}_0(x) +\widehat{\beta}_1(x)x\).
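
A minimal sketch of this local linear step at a single point (tricube weights, \(\lambda\) nearest neighbours; simulated data and function names of our own, not a full Loess implementation):

```python
import numpy as np

def loess_point(x0, x, y, lam):
    """Local linear estimate of f(x0) using the lam nearest neighbours and tricube weights."""
    d = np.abs(x - x0)
    idx = np.argsort(d)[:lam]                     # the lam nearest neighbours V(x0)
    delta = d[idx].max()                          # distance to the furthest neighbour
    u = d[idx] / delta
    w = np.where(u < 1, (1 - u**3)**3, 0.0)       # tricube kernel K(u)
    # weighted least squares fit of y on (1, x) restricted to the neighbourhood
    A = np.column_stack([np.ones(lam), x[idx]])
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y[idx])
    return beta[0] + beta[1] * x0                 # estimate of f(x0)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)
grid = np.linspace(0.5, 9.5, 5)
print([round(loess_point(g, x, y, lam=40), 3) for g in grid])
```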

Confidence Intervals

If we denote \(\widehat{\boldsymbol{y}}\) as the vector of fitted values \((\widehat{f}(x_1),\ldots,\widehat{f}(x_n))^\top\) and \(\widehat{\boldsymbol{\varepsilon}}\) as the vector of residuals, we have \[ \widehat{\boldsymbol{y}} = \boldsymbol{H}\boldsymbol{y}\text{ and }\widehat{\boldsymbol{\varepsilon}} = (\boldsymbol{I}-\boldsymbol{H})\boldsymbol{y}. \] Therefore, both \(\widehat{\boldsymbol{y}}\) and \(\widehat{\boldsymbol{\varepsilon}}\) follow multivariate normal distributions with variance-covariance matrices \(\sigma^2\boldsymbol{H}\boldsymbol{H}^\top\) and \(\sigma^2(\boldsymbol{I}-\boldsymbol{H})(\boldsymbol{I}-\boldsymbol{H})^\top\), respectively. This allows us to obtain confidence intervals for \(f(x)\). If we define \[ \delta_k = \text{Trace}\Big(\big((\boldsymbol{I}-\boldsymbol{H})(\boldsymbol{I}-\boldsymbol{H})^\top\big)^k\Big)\text{ for }k=1,2, \] we can easily see that \[ \mathbb{E}\left[\sum_{i=1}^n\widehat{\epsilon}_i^2\right] = \sigma^2\delta_1 \] so that \[ \widehat{\sigma^2} = \frac{1}{\delta_1}\sum_{i=1}^n\widehat{\epsilon}_i^2 \] is an unbiased estimator of \(\sigma^2\). Furthermore, \[ \frac{\delta_1^2}{\delta_2}\times\frac{\widehat{\sigma}^2}{\sigma^2} \] approximately follows the chi-squared distribution with \(\delta_1^2/\delta_2\) degrees of freedom (where \(\delta_1^2/\delta_2\) is rounded to the nearest integer). Therefore, \[ \frac{\widehat{f}(x)-f(x)}{\widehat{\sigma}\sqrt{\sum_{i=1}^nh_i^2(x)}} \] approximately follows the Student’s t-distribution with \(\delta_1^2/\delta_2\) degrees of freedom. This allows for obtaining confidence intervals for the response \(f(\cdot)\) at different points \(x\).

Smoothing Parameter

As the reader may have noticed, the Loess approach depends on the number \(\lambda\) of points contained in the neighborhood \(\mathcal{V}(x_0)\) of the considered point \(x_0\). The number of nearest neighbors, most often expressed as a percentage of the dataset size, acts as the smoothing parameter. The selection of an optimal value for \(\lambda\) is discussed below.

When comparing models, it is helpful to have a measure of their complexity. This can be done using a degree of freedom associated with the smoothers satisfying (10.15). The number of degrees of freedom \(DF_\lambda\) is provided by the trace of the matrix \(\boldsymbol{H}_\lambda\), whose elements are the \(h_{ij}\) involved in (10.15). This choice stems from the fact that in the classical linear regression model, as seen earlier, the trace of the matrix that maps observations \(y_i\) to fitted values \(\widehat{y}_i=\widehat{\boldsymbol{\beta}}^\top\boldsymbol{x}_i\) equals the number of parameters (i.e., the dimension of \(\boldsymbol{\beta}\)).

Fit Quality Measure

Measuring the quality of fit is more complex for models that aim to estimate a function than for classical parametric models. Most criteria involve both a measure of fit quality and a measure of model complexity. Minimizing such criteria also helps in selecting the smoothing parameter.

Let \(\widehat{\sigma}^2\) be the sum of squared residuals. (Hurvich, Simonoff, and Tsai 1998) proposed the following criterion: \[ AICC1=n\ln\widehat{\sigma}^2+n\frac{\delta_1/\delta_2(n+DF_\lambda)}{\delta_1^2/\delta_2-2} \] called AIC corrected. The first term in AICC1 measures the quality of the fit, while the second term assesses the complexity of the model. This criterion helps select the optimal value of \(\lambda\) (i.e., the one minimizing AICC1).

Extension to Multiple Regressors

Extending the Loess method to more than one regressor is straightforward. In this case, we consider a model of the form \[ y_i=f(x_{i1},x_{i2},\ldots,x_{ip})+\epsilon_i. \] Note that this model is not additive. The approach involves locally approximating the function \(f\) of the explanatory variables by a hyperplane and then proceeding as explained above for the single regressor case. It is important in this case that the different regressors have comparable values (so that multidimensional neighborhoods are not determined solely based on the variable with the largest values). To achieve this, the explanatory variables are often normalized (e.g., by dividing them by the interquartile range). The distance used is typically the Euclidean distance in \({\mathbb{R}}^p\).

Remark. Depending on the size of the dataset, one can consider using “k-d trees” to define multidimensional neighborhoods.

10.8.2.2 Penalized Maximum Likelihood and Cubic Splines

Principle: Penalized Least Squares

An ingenious way to estimate the function \(f(\cdot)\) in (10.14) is to minimize the objective function \[\begin{equation} \mathcal{O}(f)=\sum_{i=1}^n\Big(y_i-f(x_i)\Big)^2+\lambda\int_{u\in{\mathbb{R}}}\big(f''(u)\big)^2du. \tag{10.16} \end{equation}\] The first term in \(\mathcal{O}(f)\) ensures that \(f(\cdot)\) fits the data as closely as possible, while the second term penalizes excessive irregularity in the fit. This technique assumes that \(f(\cdot)\) is twice continuously differentiable and that \(f''(\cdot)\) is square-integrable. The integral in (10.16) measures the irregularity of the function \(f(\cdot)\). Note that two functions differing only by a linear term will have the same second derivatives and, consequently, the same level of irregularity as measured in (10.16). This property is particularly useful in a regression context.

The objective function (10.16) can be seen as a penalized normal log-likelihood (PML) approach: a penalty term for the irregularity of the estimator is added to the log-likelihood before maximizing it. This approach dates back to the work of actuary E. Whittaker, who used a similar technique to smooth mortality tables as early as 1923. For more details, see, for example, (Hastie and Tibshirani 1990).

When \(\lambda\to +\infty\), the term penalizing the irregularity of \(f(\cdot)\) dominates, forcing the second derivative of \(f(\cdot)\) to vanish everywhere, resulting in a straight line as the solution. For large values of \(\lambda\), the integral dominates in (10.16), and the resulting estimator will have very low curvature. Conversely, when \(\lambda\to 0\), the penalty disappears, and a perfect interpolation is obtained (when the \(x_i\) are distinct).

Suppose first that \(x_1<x_2<\ldots <x_n\). The solution \(\widehat{f}_\lambda\) of the minimization of (10.16) is a cubic spline with knots at \(x_1,x_2,\ldots,x_n\) (meaning that \(\widehat{f}_\lambda\) coincides with a cubic polynomial on each interval \((x_i,x_{i+1})\) and has continuous first and second derivatives at each \(x_i\)). This allows us to reduce the minimization of (10.16) to that of \[\begin{equation} (\boldsymbol{y}-\boldsymbol{f})^\top(\boldsymbol{y}-\boldsymbol{f})+\lambda\boldsymbol{f}^\top\boldsymbol{K}\boldsymbol{f} \tag{10.17} \end{equation}\] where \(\boldsymbol{y}^\top=(y_1,\cdots,y_n)\), \(\boldsymbol{f}^\top=(f(x_1),\cdots,f(x_n))\), and \(\boldsymbol{K}=\boldsymbol{D}^\top\boldsymbol{C}^{-1}\boldsymbol{D}\), with \(\boldsymbol{D}\) a band matrix of size \((n-2)\times n\), with three non-zero entries per row, given by \[ \boldsymbol{D}=\left( \begin{array}{ccccccc} \frac{1}{\Delta_1}&-\left(\frac{1}{\Delta_1}+\frac{1}{\Delta_2}\right) & \frac{1}{\Delta_2} & 0 & \cdots & 0 & 0 \\ 0 & \frac{1}{\Delta_2} & -\left(\frac{1}{\Delta_2}+\frac{1}{\Delta_3}\right) & \frac{1}{\Delta_3} & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \frac{1}{\Delta_{n-2}} & -\left(\frac{1}{\Delta_{n-2}}+\frac{1}{\Delta_{n-1}}\right) & \frac{1}{\Delta_{n-1}} \end{array} \right), \] where \(\Delta_i=x_{i+1}-x_i\), and \(\boldsymbol{C}\) is a symmetric tri-diagonal matrix of size \((n-2)\times (n-2)\) given by \[ \boldsymbol{C}=\frac{1}{6}\left( \begin{array}{cccccc} 2\left(\Delta_1+\Delta_2\right) & \Delta_2 & 0 & \cdots & 0&0 \\ \Delta_2 & 2(\Delta_2+\Delta_3) & \Delta_3 & \cdots & 0&0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0&0&0&\cdots & 2(\Delta_{n-3}+\Delta_{n-2})&\Delta_{n-2}\\ 0&0&0&\cdots & \Delta_{n-2} & 2(\Delta_{n-2}+\Delta_{n-1}) \end{array} \right). \] The solution \(\widehat{f}_\lambda\) can then be obtained by setting the gradient \(-2(\boldsymbol{y}-\boldsymbol{f})+2\lambda\boldsymbol{K}\boldsymbol{f}\) of the objective function (10.17) to zero, which gives \[ \widehat{\boldsymbol{f}}_\lambda=(\boldsymbol{I}+\lambda\boldsymbol{K})^{-1}\boldsymbol{y} \] which has the form (10.15) with \(\boldsymbol{H}=(\boldsymbol{I}+\lambda\boldsymbol{K})^{-1}\). Note that it is not necessary to explicitly invert \(\boldsymbol{I}+\lambda\boldsymbol{K}\) to obtain \(\widehat{f}_\lambda\); it is more efficient to use numerical techniques such as the Reinsch algorithm.
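
A minimal sketch (our own function names, dense linear algebra rather than the banded Reinsch algorithm) constructing \(\boldsymbol{D}\), \(\boldsymbol{C}\), \(\boldsymbol{K}\) and the smoother \((\boldsymbol{I}+\lambda\boldsymbol{K})^{-1}\) on simulated data:

```python
import numpy as np

def smoothing_spline_fit(x, y, lam):
    """Evaluate the cubic smoothing spline at the knots x (assumed sorted and distinct)."""
    n = len(x)
    dx = np.diff(x)                               # Delta_i = x_{i+1} - x_i
    D = np.zeros((n - 2, n))
    C = np.zeros((n - 2, n - 2))
    for i in range(n - 2):
        D[i, i]     = 1 / dx[i]
        D[i, i + 1] = -(1 / dx[i] + 1 / dx[i + 1])
        D[i, i + 2] = 1 / dx[i + 1]
        C[i, i] = (dx[i] + dx[i + 1]) / 3
        if i + 1 < n - 2:
            C[i, i + 1] = C[i + 1, i] = dx[i + 1] / 6
    K = D.T @ np.linalg.solve(C, D)               # K = D' C^{-1} D
    H = np.linalg.inv(np.eye(n) + lam * K)        # smoother matrix (I + lambda K)^{-1}
    return H @ y, H

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, 100))
y = np.sin(x) + rng.normal(scale=0.3, size=100)
f_hat, H = smoothing_spline_fit(x, y, lam=1.0)
print(round(np.trace(H), 2))                      # equivalent degrees of freedom DF_lambda
```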

Remark. In the case where each observation \((x_i,y_i)\) is weighted with \(w_i\), the objective function becomes \[ \mathcal{O}_w(f)=\sum_{i=1}^nw_i\Big(y_i-f(x_i)\Big)^2+\lambda\int_{u\in{\mathbb{R}}}\big(f''(u)\big)^2du. \] Again, the minimum of \(\mathcal{O}_w(f)\) is obtained by using a cubic spline for \(f\), which allows us to write \[ \mathcal{O}_w(f)=(\boldsymbol{y}-\boldsymbol{f})^\top\boldsymbol{W}(\boldsymbol{y}-\boldsymbol{f})+\lambda\boldsymbol{f}^\top\boldsymbol{K}\boldsymbol{f} \] where the diagonal matrix \(\boldsymbol{W}\) contains the weights. The solution is then given by \[ \widehat{\boldsymbol{f}}_\lambda=(\boldsymbol{W}+\lambda\boldsymbol{K})^{-1}\boldsymbol{W}\boldsymbol{y}. \] The weight \(w_i\) can, for example, represent the number of observations \(y_i\) in the sample corresponding to the same value \(x_i\).

Smoothing Parameter

Measuring the complexity of the model is done exactly as in Loess. The choice of \(\lambda\) is often made using the cross-validation criterion: \[ CV(\lambda)=\frac{1}{n}\sum_{i=1}^n\left(y_i-\widehat{f}_\lambda^{-i}(x_i)\right)^2 \] where \(\widehat{f}_\lambda^{-i}(x_i)\) is the estimate of \(f(x_i)\) obtained using the sample \(\{(y_j,x_j),\hspace{2mm} j\neq i\}\), of size \(n-1\). The cross-validation criterion favors the predictive power of the selected model (while the sum of squares of residuals favors the quality of the fit to a given set of observations). In fact, \(\widehat{f}_\lambda^{-i}(x_i)\) is the prediction of \(y_i\) provided by the data excluding the \(i\)-th observation, so the difference \(y_i-\widehat{f}_\lambda^{-i}(x_i)\) measures the quality of the prediction provided by the model.

The value of \(\widehat{f}_\lambda^{-i}(x_i)\) for cubic splines is given by: \[ \widehat{f}_\lambda^{-i}(x_i)=\sum_{j\neq i}\frac{h_{ij}}{1-h_{ii}}y_j \] where the weights \(\frac{h_{ij}}{1-h_{ii}}\) sum to 1. The idea is straightforward: zero weight is assigned to observation \(i\), and the weights are normalized (to have a sum equal to 1). Therefore: \[ \widehat{f}_\lambda^{-i}(x_i)=\frac{1}{1-h_{ii}}\widehat{f}_\lambda(x_i)-\frac{h_{ii}}{1-h_{ii}}y_i \] so that: \[ CV(\lambda)=\frac{1}{n}\sum_{i=1}^n\left(\frac{y_i-\widehat{f}_\lambda(x_i)}{1-h_{ii}}\right)^2 \]

Remark. Sometimes, the generalized cross-validation criterion is also used, replacing \(h_{ii}\) with the average \(\frac{1}{n}\sum_{i=1}^nh_{ii}\), which results in: \[\begin{eqnarray*} GCV(\lambda)&=&\frac{1}{n\left(1-\frac{1}{n}\sum_{i=1}^nh_{ii}\right)^2} \sum_{i=1}^n\left(y_i-\widehat{f}_\lambda(x_i)\right)^2\\ &=&\frac{1}{n\left(1-\frac{DF_\lambda}{n}\right)^2} \sum_{i=1}^n\left(y_i-\widehat{f}_\lambda(x_i)\right)^2. \end{eqnarray*}\]
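
The two criteria above apply to any linear smoother of the form (10.15) whose weights sum to one in each row, under the same leave-one-out convention as described in the text. As an illustration, here is a sketch (using a simple Gaussian kernel smoother of our own choosing, not the cubic spline) computing \(CV\) and \(GCV\) over a few smoothing parameters:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 150
x = np.sort(rng.uniform(0, 10, n))
y = np.sin(x) + rng.normal(scale=0.3, size=n)

def hat_matrix(x, bandwidth):
    """Gaussian kernel smoother: a linear smoother of the form (10.15), rows summing to 1."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return K / K.sum(axis=1, keepdims=True)

for bw in [0.1, 0.3, 0.6, 1.0, 2.0]:
    H = hat_matrix(x, bw)
    f_hat = H @ y
    h = np.diag(H)
    cv = np.mean(((y - f_hat) / (1 - h)) ** 2)                  # leave-one-out formula
    gcv = np.mean((y - f_hat) ** 2) / (1 - np.trace(H) / n) ** 2
    print(f"bw={bw:3.1f}  DF={np.trace(H):5.1f}  CV={cv:.4f}  GCV={gcv:.4f}")
```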

10.8.3 Estimation with Multiple Regressors: Backfitting

Now, suppose that we have observations \((y_i,\boldsymbol{x}_i)\), \(i=1,2,\ldots,n\), where \(\boldsymbol{x}_i^\top=(x_{i1},\ldots,x_{ip})\) are \(p\) continuous explanatory variables. We consider the model: \[\begin{equation} y_i=c+\sum_{j=1}^pf_j(x_{ij})+\epsilon_i \tag{10.18} \end{equation}\] where the errors \(\epsilon_i\) are assumed to be independent with a \(\mathcal{N}ormal(0,\sigma^2)\) distribution. Our goal is to estimate the functions \(f_1(\cdot),\ldots,f_p(\cdot)\) representing the effect of the explanatory variables on the response. To achieve this, we will use the backfitting algorithm, which will successively re-adjust the partial residuals for each of the explanatory variables.

Let’s show how model (10.18) can be handled based on the simple case (10.14). The principle is the following: given a first estimate \(\widehat{f_k}(\cdot)\) of \(f_k(\cdot)\), \(k=1,2,\ldots,p\), we re-estimate \(f_j(\cdot)\) by smoothing the residuals obtained from the \(\widehat{f_k}(\cdot)\), \(k\neq j\). These residuals, denoted as \(r_i^{(j)}\), are given by: \[ r_i^{(j)}=y_i-\widehat{c}-\sum_{k\neq j}\widehat{f_k}(x_{ik}),\hspace{2mm}i=1,2,\ldots,n, \] and are fitted against the values of the \(j\)-th regressor \(x_{ij}\). This process continues until the results stabilize. The key idea behind the backfitting algorithm is that, for any \(j\): \[ \mathbb{E}\left[Y-c-\sum_{k\neq j}f_k(X_k)\Big|X_j\right]=f_j(X_j), \] so the residuals \(r_i^{(j)}\), \(i=1,\ldots,n\), reflect the part of the behavior of the dependent variable attributable to the \(j\)-th regressor.

From now on, let’s denote by \(\boldsymbol{f}_j^\top=(f_j(x_{1j}),\cdots,f_j(x_{nj}))\) the vector of evaluations of \(f_j(\cdot)\) at the observed values of the \(j\)-th regressor. As it is obviously possible to absorb the constant \(c\) into any of the functions \(f_1,\ldots,f_p\), we will take \(\widehat{c}=\overline{y}\) for identifiability purposes. The main idea of the method is to define residuals \(r_1^{(j)},\cdots,r_n^{(j)}\) that will be explained using an additive model with the \(j\)-th regressor. More precisely, the algorithm proceeds as follows:

  • Initialization: \(\widehat{c}\gets\overline{y}\), \(\widehat{\boldsymbol{f}_j}^{(0)}\gets \boldsymbol{0}\), \(j=1,2,\ldots,p\).
  • Cycle: For \(r=0,1,2,\ldots\) and \(j=1,2,\ldots,p\), compute \(\widehat{\boldsymbol{f}}_j^{(r+1)}\) as follows: \[ \widehat{\boldsymbol{f}}_j^{(r+1)}\gets \boldsymbol{H}_{\lambda_j}\left(\boldsymbol{y}-(\overline{y},\ldots,\overline{y})^\top-\sum_{k<j}\widehat{\boldsymbol{f}}_k^{(r+1)}- \sum_{k>j}\widehat{\boldsymbol{f}}_k^{(r)}\right) \] where \(\boldsymbol{H}_{\lambda_j}\) is the smoothing matrix applied to the partial residual obtained by subtracting from \(\boldsymbol{y}\) the prediction based on all regressors except the \(j\)-th one.
  • Stopping criterion: Iterate the above step and stop when the sum of squared residuals \[ \left(\boldsymbol{y}-(\overline{y},\ldots,\overline{y})^\top-\sum_{j=1}^p\widehat{\boldsymbol{f}}_j^{(r+1)}\right)^\top \left(\boldsymbol{y}-(\overline{y},\ldots,\overline{y})^\top-\sum_{j=1}^p\widehat{\boldsymbol{f}}_j^{(r+1)}\right) \] ceases to decrease.

Note that the influence of each of the regressors can be estimated using a different smoothing parameter.
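
A minimal sketch of the backfitting cycle above (two regressors, simple Gaussian kernel smoothers of our own choosing instead of splines, and a recentering of each component at every pass for identifiability, a common practical refinement):

```python
import numpy as np

def kernel_smoother(x, bandwidth):
    """Row-normalized Gaussian kernel smoother matrix H_lambda for one regressor."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return K / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(8)
n = 300
x1, x2 = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
y = 1.0 + np.sin(2 * x1) + x2**2 + rng.normal(scale=0.3, size=n)

H = [kernel_smoother(x1, 0.3), kernel_smoother(x2, 0.3)]
c_hat = y.mean()
f = [np.zeros(n), np.zeros(n)]                     # initialization: f_j^(0) = 0

for _ in range(20):                                # backfitting cycles
    for j in range(2):
        partial_resid = y - c_hat - sum(f[k] for k in range(2) if k != j)
        f[j] = H[j] @ partial_resid                # smooth the partial residuals
        f[j] -= f[j].mean()                        # recenter each component (identifiability)
    rss = np.sum((y - c_hat - f[0] - f[1]) ** 2)
print(round(rss / n, 4))
```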

10.8.4 Comparison of Different Approaches

Let’s consider simulations with 50 observations following a model \(Y_i=(X_i+1)^2-1+\epsilon_i\), where the errors \(\epsilon_i\sim\mathcal{N}or(0,1/2)\) are independent, and where \(X_i\sim\mathcal{N}or(0,1)\).

Figure ?? shows the least squares estimation of \(Y\) on \(X\) on the left, and \(X\) on \(Y\) on the right. As shown on the left-hand side, the least squares principle consists of minimizing the sum of squares of distances between \(Y_i\) and \(\widehat{Y}_i\) (represented vertically).

Figure ?? shows the result of PCA of the cloud \((X_1,Y_1),...,(X_n,Y_n)\), where the first axis is obtained by minimizing the sum of squares of distances from \((X_i,Y_i)\) to the line (corresponding to the orthogonal projection of \((X_i,Y_i)\) onto the line). Note that the obtained line is relatively close to that obtained by linear regression.

10.9 Generalized Linear Models (GLM)

10.9.1 A Brief History of Actuarial Applications of Regression Models

For a long time, actuaries limited themselves to the Gaussian linear model when quantifying the impact of explanatory variables on a phenomenon of interest (claim frequency or claim cost, probability of an insured event occurring, etc.). Now that the complexity of the statistical problems faced by actuaries has greatly increased, it is crucial to turn to models that better account for the reality of insurance than the linear model does. The linear model imposes a series of constraints that are hardly compatible with the reality of claim counts or claim costs: an approximately Gaussian distribution for the response, linearity of the score, and homoscedasticity. Even though it is possible to overcome some of these constraints by transforming the response variable using well-chosen functions, the linear approach comes with many disadvantages (working on an artificial scale, difficulties in returning to the original quantities, etc.).

A first step in using models more appropriate to the reality of insurance was taken in actuarial science at the end of the 20th century by actuaries in London’s City University, who applied Generalized Linear Models (GLMs). These models, introduced in statistics by (Nelder and Wedderburn 1972), allow for the abandonment of the normality assumption by treating responses whose distribution is part of the exponential family in a unified way. This family includes, in addition to the normal distribution, the Poisson, binomial, Gamma, and Inverse Gaussian distributions. Notable works in this area include (Gourieroux, Monfort, and Trognon 1984).

Poisson regression (and related models such as negative binomial regression) is now a tool of choice for developing automobile insurance pricing, largely replacing the general linear model and logistic regression for analyzing claim counts. The breakthrough of this method within insurance companies dates back to the inclusion of procedures in widely-used statistical software (with SAS leading the way) that allow for the application of this technique (specifically, the GENMOD procedure). In addition to the maximum likelihood approach, GLM techniques allow the analysis of a wide range of phenomena from a quasi-likelihood perspective by specifying only the mean-variance structure. French econometricians have proven fundamental results regarding the convergence of estimators obtained in this way. See, for example, (Gourieroux, Monfort, and Trognon 1984).

Recently, GLM techniques have been successfully applied to life insurance problems (establishing mortality tables, estimating demographic indicators, mortality projection, etc.). See (Delwarde and Denuit 2005) for numerous examples.

This section is based on (McCullagh and Nelder 1989), (Dobson 2001), and (Fahrmeir et al. 1994). We emphasize that maximum likelihood estimators can be obtained using repeated weighted least squares adjustments by defining appropriate pseudo-responses. This allows us to relax the assumption of linearity of the score by transitioning to generalized additive models.

To better understand the generalizations of the Gaussian linear model discussed in this section, let us recall that within the framework of this model, we seek to model a variable \(Y\) using a set of explanatory variables \(\mathbf{X}=(X_1,\ldots,X_p)^\top\). Naturally, linear regression assumes that \[ Y \sim \mathcal{N}(\mu, \sigma^2)\text{ where }\mu = \mathbf{X}^\top\boldsymbol{\beta}. \] This model proposed by Legendre and Gauss in the early 19th century, and extensively studied by Fisher in the 1920s, has become established in econometrics but is challenging to use in insurance.

The variables we seek to model in insurance are costs (taking values in \(\mathbb{R}^+\)), numbers of claims (taking values in \(\mathbb{N}\)), or indicators of being a claimant in a given year (taking values in \(\{0,1\}\)). In the latter case, we saw that latent variables could be an interesting solution. Specifically, we considered models of the form \[ Y \sim \text{Binomial}(1, \mu)\text{ where }\mu = \mathbb{E}[Y] = F(\mathbf{X}^\top\boldsymbol{\beta}), \] where \(F\) denotes the cumulative distribution function associated with the logistic distribution (for LOGIT models) or the centered and scaled Gaussian distribution (for PROBIT models).

In general, we want to retain the linear structure of the score with respect to \(\boldsymbol{\beta}\) and consider that the expectation of \(Y\) is a transformation of this linear combination. More precisely, we now wish to move to regression models of the form \[ Y \sim \text{Distribution}(\mu)\text{ where }\mu = \mathbb{E}[Y] = g^{-1}(\mathbf{X}^\top\boldsymbol{\beta}), \] where \(g^{-1}\) is a “well-chosen” function, and \(\text{Distribution}\) denotes a parametric distribution that properly models our variable of interest.

This type of approach is the basis of so-called “generalized linear models,” which extend the Gaussian model to a specific family of distributions known as the exponential (natural) family.

10.9.2 Definition

In this section, we focus on generalized linear models, which rely on the exponential (dispersion) family of distributions. In addition to the normal distribution, this family includes probability distributions with two parameters \(\theta\) and \(\phi\), whose density (discrete or continuous) can be expressed in the form \[\begin{equation} f(y|\theta,\phi) = \exp\left(\frac{y\theta - b(\theta)}{\phi} + c(y,\phi)\right),\quad y \in \mathcal{S}, \tag{10.19} \end{equation}\] where the support \(\mathcal{S}\) is a subset of \(\mathbb{N}\) or \(\mathbb{R}\). The parameter \(\theta\) is called the natural parameter, and \(\phi\) is the dispersion parameter. Often, weighting is required, and \(\phi\) is then replaced by \(\phi/\omega\), where \(\omega\) is a known a priori weight.

Let’s examine some examples of common distributions whose density can be expressed in the form (10.19).

::: {.example}[Normal Distribution]

The normal distribution \(\mathcal{N}(\mu, \sigma^2)\) has a density that can be expressed in the form (10.19), with \(\mathcal{S} = \mathbb{R}\), \(\theta = \mu\), \(b(\theta) = \theta^2/2\), \(\phi = \sigma^2\), and \[ c(y,\phi) = -\frac{1}{2}\left(\frac{y^2}{\sigma^2} + \ln(2\pi\sigma^2)\right). \] :::

::: {.example}Poisson Distribution

For the Poisson distribution \(\mathcal{P}oi(\lambda)\), we have \[ f(y|\lambda) = \exp(-\lambda)\frac{\lambda^y}{y!} = \exp\left(y\ln\lambda - \lambda - \ln y!\right), \quad y \in \mathbb{N}, \] where \(\mathcal{S} = \mathbb{N}\), \(\theta = \ln\lambda\), \(\phi = 1\), \(b(\theta) = \exp(\theta) = \lambda\), and \(c(y,\phi) = -\ln y!\). :::

::: {.example}Binomial Distribution

The \(\mathcal{B}in(n,p)\) distribution has a density that can be expressed in the form (10.19) with \(\mathcal{S} = \{0,1,\ldots,n\}\), \(\theta = \ln\{p/(1-p)\}\), \(b(\theta) = n\ln(1+\exp(\theta))\), \(\phi = 1\), and \(c(y,\phi) = \ln\left(\frac{n!}{y!(n-y)!}\right)\). :::

::: {.example}Gamma Distribution

The density associated with the Gamma distribution can be rewritten as \[ \frac{1}{\Gamma(\nu)}\left(\frac{\nu}{\mu}\right)^\nu y^{\nu-1}\exp\left(-\frac{\nu}{\mu}y\right), \] which can be expressed in the form (10.19) with \(\mathcal{S} = \mathbb{R}^+\), \(\theta = -\frac{1}{\mu}\), \(b(\theta) = -\ln(-\theta)\), and \(\phi = \nu^{-1}\). :::

Not all probability distributions whose density can be expressed in the form (10.19) have dispersion parameters \(\phi\). Thus, the examples above teach us, for instance, that for the Poisson distribution, \(\phi = 1\). For distributions with a dispersion parameter \(\phi\), it controls the variance, as we will see later. The pure premium depends only on the natural parameter \(\theta\). Therefore, when actuaries are interested only in the pure premium, the parameter \(\theta\) is the parameter of interest, while \(\phi\) is considered a nuisance parameter. However, the parameter \(\phi\) is also crucial as it controls the dispersion (and hence the risk).

10.9.3 Mean and Variance

For a random variable \(Y\) whose density can be expressed in the form (10.19), the first two moments of \(Y\) can be expressed using the function \(b\) and the dispersion parameter. To do this, let \[ U = \frac{\partial}{\partial\theta}\ln f(Y|\theta,\phi) \] and \[ U' = \frac{\partial^2}{\partial\theta^2} \ln f(Y|\theta,\phi), \] so that the Fisher information is \(\text{Var}[U] = -\mathbb{E}[U']\) by (10.1).

Proposition 10.10 For a random variable \(Y\) whose density is of the form (10.19), we have \[ \mathbb{E}[Y] = b'(\theta)\text{ and } \mathbb{V}[Y] = \frac{b''(\theta)\phi}{\omega}, \] where \('\) and \(''\) denote the first and second derivatives with respect to \(\theta\).

Proof. We know from Proposition 10.1 that \(\mathbb{E}[U] = 0\). It suffices to observe that \[ \frac{d}{d\theta}\ln f(y|\theta,\phi) = \frac{\partial}{\partial\theta}\left(\frac{y\theta - b(\theta)}{\phi/\omega} + c(y,\phi)\right) = \frac{y - b'(\theta)}{\phi/\omega}, \] which gives \[ \mathbb{E}[U] = \frac{\mathbb{E}[Y] - b'(\theta)}{\phi/\omega} = 0, \] yielding the announced expression for the mean of \(Y\). Furthermore, since \(\mathbb{E}[U] = 0\), \[ \mathbb{V}[U] = \mathbb{E}[U^2] = \mathbb{E}\left[\left(\frac{Y - b'(\theta)}{\phi/\omega}\right)^2\right] = \frac{\mathbb{V}[Y]}{(\phi/\omega)^2} \] and \[\begin{eqnarray*} \mathbb{E}[U^2] &=& \int_{y\in\mathcal{S}}\left(\frac{\partial}{\partial\theta}\ln f(y|\theta,\phi)\right)^2f(y|\theta,\phi)dy\\ &=& \int_{y\in\mathcal{S}}\frac{\partial}{\partial\theta}\ln f(y|\theta,\phi)\frac{\partial}{\partial\theta}f(y|\theta,\phi)dy\\ &=&\mathbb{E}\left[- \frac{\partial^2}{\partial\theta^2}\ln f(Y|\theta,\phi)\right]=\frac{b''(\theta)}{\phi/\omega}. \end{eqnarray*}\] Thus, \[ \mathbb{V}[U] = \mathbb{E}[-U'] = \frac{b''(\theta)}{\phi/\omega}. \] Combining the last two equalities gives the desired result.

Therefore, the variance of \(Y\) appears as the product of two functions:

  1. the first one, \(b''(\theta)\), which depends solely on the parameter \(\theta\), is called the variance function;
  2. the second one is independent of \(\theta\) and depends solely on \(\phi\).

Denoting \(\mu = \mathbb{E}[Y]\), we see that the parameter \(\theta\) is related to the mean \(\mu\) as indicated in Proposition 10.10. The variance function can thus be defined in terms of \(\mu\); we will henceforth denote it as \(V(\mu)\).

The variance function is very important in various models, as can be seen in Table 10.1. It is important to note that, except for the normal distribution case, the variance of \(Y\) is always a function of the mean and increases with the mean for Poisson, Gamma, and Inverse Gaussian distributions (with a fixed \(\phi\) parameter).

Table 10.1: Variance functions associated with common probability distributions

| Probability Distribution | \(V(\mu)\) |
|--------------------------|----------------|
| Normal                   | \(1\)          |
| Poisson                  | \(\mu\)        |
| Gamma                    | \(\mu^2\)      |
| Binomial                 | \(\mu(1-\mu)\) |

10.9.4 Regression Model

Consider independent but not identically distributed random variables \(Y_1,Y_2,\ldots,Y_n\) whose densities are of the form (10.19). More precisely, suppose that the probability density of \(Y_i\) is \[\begin{equation} f(y_i|\theta_i,\phi)=\exp\left(\frac{y_i\theta_i-b(\theta_i)}{\phi/\omega_i}+c(y_i,\phi)\right), \hspace{2mm}y_i\in\mathcal{S}. \tag{10.20} \end{equation}\] Then, the joint density of \(Y_1,Y_2,\ldots, Y_n\) is \[\begin{eqnarray*} f(\boldsymbol{y}|\boldsymbol{\theta},\phi)&=&\prod_{i=1}^nf(y_i|\theta_i,\phi)\\ &=&\exp\left(\sum_{i=1}^n\frac{y_i\theta_i-b(\theta_i)}{\phi/\omega_i}+ \sum_{i=1}^nc(y_i,\phi)\right). \end{eqnarray*}\] Of course, the likelihood is \(\mathcal{L}(\boldsymbol{\theta},\phi|\boldsymbol{y})=f(\boldsymbol{y}|\boldsymbol{\theta},\phi)\). We assume that the \(\theta_i\) are functions of a set of \(p+1\) parameters \(\beta_0,\beta_1,\ldots,\beta_p\). More precisely, denoting \(\mu_i\) as the mean of \(Y_i\), we assume that \[ g(\mu_i)=\beta_0+\sum_{j=1}^p\beta_jx_{ij}=\boldsymbol{x}_i^\top\boldsymbol{\beta}=\eta_i \] where the monotone and differentiable function \(g\) is called the link function, the vector \(\boldsymbol{x}_i\) contains explanatory variables related to individual \(i\), and the vector \(\boldsymbol{\beta}\) contains the \(p+1\) parameters.

Thus, a generalized linear model consists of three elements:

  1. the random variables to be explained \(Y_1,Y_2,\ldots,Y_n\) whose densities are of the form (10.20);
  2. a set of parameters \(\boldsymbol{\beta}=(\beta_0,\beta_1,\ldots,\beta_p)^\top\) belonging to a non-empty open set in \({\mathbb{R}}^{p+1}\) and explanatory variables \({\boldsymbol{X}}=(\boldsymbol{x}_1,\boldsymbol{x}_2,\ldots, \boldsymbol{x}_n)^\top\): the \(n\times (p+1)\) matrix \({\boldsymbol{X}}\) is assumed to have rank \(p+1\), i.e., the square matrix \({\boldsymbol{X}}^\top{\boldsymbol{X}}\) of dimension \((p+1)\times (p+1)\) is invertible;
  3. a link function \(g\) such that \[ g(\mu_i)=\boldsymbol{x}_i^\top\boldsymbol{\beta}\text{ where } \mu_i=\mathbb{E}[Y_i] \] which relates the linear predictor \(\eta_i=\boldsymbol{x}_i^\top\boldsymbol{\beta}\) to the mean \(\mu_i\) of \(Y_i\).

In insurance ratemaking, the explanatory variables are most often all categorical. Consider the example given in Section 10.2 of an insurance company segmenting based on gender, the sportiness of the vehicle, and the age of the insured (3 age classes, namely less than 30 years old, 30-65 years old, and over 65 years old). An insured individual is then represented by a binary vector that encodes his or her characteristics.

We choose as the reference level (i.e., the one for which all indicator variables equal 0) the modalities most represented in the portfolio. The results are then interpreted as over- or under-risk compared to this reference class. Thus, the vector (0,1,1,0) represents a male insured under 30 years old driving a sporty vehicle. The linear predictor (or score) will be of the form \(\beta_0+\sum_{j=1}^4\beta_jX_j\), and the claim count or average claim cost is generally a non-decreasing function of the score. The intercept, \(\beta_0\), represents the score associated with the reference class (i.e., men between 30 and 65 years old with non-sporty vehicles); \(\beta_j > 0\) indicates that presenting the modality encoded by \(X_j\) aggravates the risk compared to that of the reference individual, while \(\beta_j < 0\) indicates classes of insured individuals less risky than the reference individuals.
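
A small sketch of this binary coding and of the resulting score (the coefficients below are purely illustrative, not estimated values):

```python
import numpy as np

def encode(gender, sporty, age):
    """Binary coding, reference class: male, non-sporty vehicle, age 30-65 (all indicators 0)."""
    x1 = 1 if gender == "F" else 0          # female
    x2 = 1 if sporty else 0                 # sporty vehicle
    x3 = 1 if age < 30 else 0               # under 30
    x4 = 1 if age > 65 else 0               # over 65
    return np.array([x1, x2, x3, x4])

beta0, beta = -2.0, np.array([-0.1, 0.4, 0.6, 0.1])   # illustrative coefficients only
x = encode("M", True, 24)                   # male, sporty car, under 30
print(x, beta0 + beta @ x)                  # (0, 1, 1, 0) and the corresponding score
```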

Let’s now examine some examples.

::: {.example}[Gaussian Regression]

Gaussian regression is simply the classical linear model, where \(Y_i\sim\mathcal{N}or(\mu_i,\sigma^2)\) with \(\mu_i=\boldsymbol{x}_i^\top\boldsymbol{\beta}\). The identity function serves as the link function here. This model has been studied in detail in Section 10.7. :::

Example 10.7 (Binomial Regression) Binomial regression is obtained by considering \(Y_i\sim\mathcal{B}in(1,q_i)\). The quantity to be explained, \(q_i\), can be, for example, the probability that policy \(i\) produces at least one claim. Since \(q_i\in[0,1]\), modeling is done as \(q_i=F(\boldsymbol{x}_i^\top\boldsymbol{\beta})\), where \(F\) is a distribution function, or equivalently, \(\boldsymbol{x}_i^\top\boldsymbol{\beta}=F^{-1}(q_i)\). Although theoretically any distribution function \(F\) could be used as a link function, one of the following three functions is usually used:

  1. The logit model in which \[\begin{eqnarray*} &&\text{logit}(q_i)=\ln\frac{q_i}{1-q_i}=\boldsymbol{x}_i^\top\boldsymbol{\beta}\\ & \Leftrightarrow & q_i=\frac{\exp(\boldsymbol{x}_i^\top\boldsymbol{\beta})}{1+\exp(\boldsymbol{x}_i^\top\boldsymbol{\beta})} =\frac{\exp\eta_i}{1+\exp\eta_i}. \end{eqnarray*}\]
  2. The probit model in which \[ \text{probit}(q_i)=\Phi^{-1}(q_i)=\boldsymbol{x}_i^\top\boldsymbol{\beta} \] \[ \Leftrightarrow q_i=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\boldsymbol{x}_i^\top\boldsymbol{\beta}}\exp\left(-\frac{z^2}{2}\right)dz. \]
  3. The complementary log-log model in which \[ \text{cloglog}(q_i)=\ln\Big(-\ln(1-q_i)\Big)=\boldsymbol{x}_i^\top\boldsymbol{\beta} \] \[ \Leftrightarrow q_i=1-\exp\big(-\exp(\boldsymbol{x}_i^\top\boldsymbol{\beta})\big). \]

Unlike the logit and probit functions, the complementary log-log function is not symmetric around 0.5. For small values of \(q\) (as is often the case in actuarial science), there is practically no difference between the logistic and complementary log-log transformations. The use of the complementary log-log transformation assumes that the probabilities of success and failure should be treated differently.

You can see the shapes of the three functions described above in Figure ??. However, this graph does not provide a perfect indication of the potential differences between the three regression models. It should be noted that the score is linear in \(\boldsymbol{\beta}\). Therefore, if we replace the link function \(F\) with its standardized version \[ F_{\text{stand}}(z)=F\left(\frac{z-\mu}{\sigma}\right) \] where \(\mu\) and \(\sigma^2\) are the mean and variance associated with \(F\), respectively, then \[ q_i=F(\boldsymbol{x}_i^\top\boldsymbol{\beta})=F_{\text{stand}}(\boldsymbol{x}_i^\top\widetilde{\boldsymbol{\beta}}) \] where \(\widetilde{\beta}_0=\mu+\sigma\beta_0\) and \(\widetilde{\beta}_j=\sigma\beta_j\), \(j=1,2\ldots,p\). Consequently, the three models presented above can only be compared on appropriate scales for the score, i.e., after centering and scaling. The expectations associated with the normal, logistic, and Gumbel distributions involved in the probit, logit, and cloglog links are 0, 0, and \(\Gamma '(1)=-0.5772\), respectively, where \(\Gamma '\) denotes the first derivative of the Gamma function. The associated variances are 1, \(\frac{\pi^2}{3}\), and \(\frac{\pi^2}{6}\). After standardization, the functions associated with the logit and probit models are almost identical. However, in the cloglog model, the probability tends to zero and one more quickly as the score diverges to \(-\infty\) or \(+\infty\).

Example 10.8 (Poisson Regression) The log-linear Poisson regression is obtained by considering \(Y_i\sim\mathcal{P}oi(\lambda_i)\), with the link function induced by the natural parameter, i.e., \[ \ln\lambda_i=\boldsymbol{x}_i^\top\boldsymbol{\beta}\Leftrightarrow \lambda_i=\exp\big(\boldsymbol{x}_i^\top\boldsymbol{\beta}\big). \] Most of the time, we have a measure of exposure to risk, and we consider \(Y_i\sim\mathcal{P}oi(d_i\lambda_i)\), where \(d_i\) is the duration of coverage granted to insured individual \(i\) (this duration multiplies the annual frequency \(\lambda_i\) under the assumption of a Poisson process governing the occurrence of claims).
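
A minimal sketch of such a log-linear Poisson fit with exposure, assuming the statsmodels package is available (simulated portfolio and illustrative coefficients of our own):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 5000
x1 = rng.integers(0, 2, n)                      # e.g. indicator of a young driver
x2 = rng.integers(0, 2, n)                      # e.g. indicator of a sporty vehicle
d = rng.uniform(0.25, 1.0, n)                   # exposure: fraction of the year covered
lam = np.exp(-2.0 + 0.5 * x1 + 0.3 * x2)        # annual claim frequency
y = rng.poisson(d * lam)                        # observed claim counts

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.GLM(y, X, family=sm.families.Poisson(), exposure=d)   # log(d) enters as an offset
print(model.fit().params)                        # should be close to (-2.0, 0.5, 0.3)
```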

10.9.6 Likelihood Equations

In practice, the regression coefficients \(\beta_0, \beta_1, \ldots, \beta_p\), and the dispersion parameter \(\phi\) are unknown and must be estimated from the data. In this section, we focus on estimating the regression coefficients \(\boldsymbol{\beta}\) using the maximum likelihood method. This involves maximizing the log-likelihood:

\[\begin{eqnarray*} L\big(\boldsymbol{\theta}(\boldsymbol{\beta})|\boldsymbol{y},\phi\big) &=& \sum_{i=1}^n \ln f(y_i|\theta_i,\phi) \\ &=& \sum_{i=1}^n \frac{y_i\theta_i - b(\theta_i)}{\phi/\omega_i} + \sum_{i=1}^n c(y_i,\phi) \end{eqnarray*}\]

where \(\mathbb{E}[Y_i] = b'(\theta_i) = \mu_i\) and \(g(\mu_i) = \boldsymbol{x}_i^\top\boldsymbol{\beta} = \eta_i\), with \(g\) being a monotone and differentiable function. Finding the maximum likelihood estimators of \(\beta_0, \beta_1, \ldots, \beta_p\) involves solving the equations: \[\begin{equation} U_j = 0 \text{ for } j = 0,1,\ldots,p, \tag{10.21} \end{equation}\] where \[\begin{eqnarray*} U_j &=& \frac{\partial L\big(\boldsymbol{\theta}(\boldsymbol{\beta})|\boldsymbol{y},\phi\big)}{\partial\beta_j} \\ &=& \sum_{i=1}^n \frac{(y_i-\mu_i)x_{ij}}{{\mathbb{V}}[Y_i]g'(\mu_i)}. \end{eqnarray*}\]

If we choose the canonical link function, the likelihood equations become:

\[\begin{equation} \sum_{i=1}^n \omega_i(y_i-\mu_i)x_{ij} = 0 \text{ for } j = 0,1,\ldots,p, \end{equation}\]

where \(\omega_i\) is the weight associated with each observation.

These equations can be interpreted as orthogonality relationships between the explanatory variables and the estimation residuals.

10.9.7 Solving Likelihood Equations

The maximum likelihood estimators \(\hat{\beta}_j\) of the parameters \(\beta_j\) are solutions to the system (10.21). In general, the equations that make up this system do not have explicit solutions and must be solved numerically. One common numerical method for this is the Newton-Raphson method, which we briefly explain below.

Let \({\boldsymbol{U}}(\boldsymbol{\beta})\) be the gradient vector of the log-likelihood, with its \(j\)-th component defined as \[ U_j(\boldsymbol{\beta}) = \frac{\partial}{\partial\beta_j}L(\boldsymbol{\beta}|\boldsymbol{y}), \] and let \({\boldsymbol{H}}(\boldsymbol{\beta})\) be the Hessian matrix of \(L(\boldsymbol{\beta}|\boldsymbol{y})\), with its \((j,k)\) element defined as \[ \frac{\partial^2}{\partial\beta_j\partial\beta_k}L(\boldsymbol{\beta}|\boldsymbol{y}). \] For \(\boldsymbol{\beta}^*\) close to \(\widehat{\boldsymbol{\beta}}\), a first-order Taylor expansion gives \[ 0 = {\boldsymbol{U}}(\widehat{\boldsymbol{\beta}}) \approx {\boldsymbol{U}}(\boldsymbol{\beta}^*) + {\boldsymbol{H}}(\boldsymbol{\beta}^*) \Big(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}^*\Big), \] which can be rewritten as \[\begin{equation} \widehat{\boldsymbol{\beta}} \approx \boldsymbol{\beta}^* - {\boldsymbol{H}}^{-1}(\boldsymbol{\beta}^*) {\boldsymbol{U}}(\boldsymbol{\beta}^*). \tag{10.22} \end{equation}\] This suggests an iterative procedure to obtain the maximum likelihood estimator \(\widehat{\boldsymbol{\beta}}\) of \(\boldsymbol{\beta}\): starting from an initial value \(\widehat{\boldsymbol{\beta}}^{(0)}\) that is expected to be close to \(\widehat{\boldsymbol{\beta}}\), we define the \((r+1)\)-th approximate value \(\widehat{\boldsymbol{\beta}}^{(r+1)}\) of \(\widehat{\boldsymbol{\beta}}\) from the \(r\)-th \(\widehat{\boldsymbol{\beta}}^{(r)}\) by \[\begin{equation} \widehat{\boldsymbol{\beta}}^{(r+1)} = \widehat{\boldsymbol{\beta}}^{(r)} - {\boldsymbol{H}}^{-1}(\widehat{\boldsymbol{\beta}}^{(r)}) {\boldsymbol{U}}(\widehat{\boldsymbol{\beta}}^{(r)}). \tag{10.23} \end{equation}\] This iterative procedure corresponds to the Newton-Raphson method.

Remark. There is a convenient iterative method for solving the likelihood equations, known as iteratively reweighted least squares. At each step \(r\), one only needs to minimize a weighted least squares criterion of the form \[ \sum_{k=1}^n w_k(z_k-\boldsymbol{x}_k^\top\boldsymbol{\beta})^2 \] where the pseudo-responses \(z_k\) are given by \[ z_k = \boldsymbol{x}_k^\top\boldsymbol{\beta}^{(r)} + (y_k-\mu_k)\frac{\partial\eta_k}{\partial\mu_k} \] and the weights are \[ w_k^{-1} = \left(\frac{\partial\eta_k}{\partial\mu_k}\right)^2 V(\mu_k). \] In these formulas, \(\mu_k\) and \(\eta_k\) are evaluated at the current value \(\boldsymbol{\beta}^{(r)}\) of the parameter \(\boldsymbol{\beta}\). The procedure is stopped when the difference between \(\boldsymbol{\beta}^{(r)}\) and \(\boldsymbol{\beta}^{(r-1)}\) is sufficiently small.

Example 10.9 (Binomial Regression) If the observations \(y_i\) follow a \(\mathcal{B}in(n_i,q_i)\) distribution, \(i=1,\ldots,n\), we have \[ U_j(\boldsymbol{\beta}) = \sum_{i=1}^n \frac{y_i-n_iq_i}{q_i(1-q_i)} \frac{x_{ij}}{g'(q_i)}. \]

In this case, \(\widehat{\boldsymbol{\beta}}^{(r+1)}\) is obtained by a weighted linear regression of the dependent variable \({\boldsymbol{z}}_r\), whose \(i\)-th element is \[ \widehat{\eta}_{ir} + \frac{(y_i-n_i\widehat{q}_{ir})g'(\widehat{q}_{ir})}{n_i}, \] on the \(p\) explanatory variables, using weights \(v_{ir}\) where \[ v_{ir} = \frac{n_i}{\widehat{q}_{ir}(1-\widehat{q}_{ir})\big(g'(\widehat{q}_{ir})\big)^2} \] and \(\widehat{q}_{ir}\) and \(\widehat{\eta}_{ir}\) are the success probabilities and linear predictors for observation \(i\), evaluated using the \(r\)-th iterate \(\widehat{\boldsymbol{\beta}}^{(r)}\).

To initiate the iterative process, you can use the initial estimates of \(q_i\) given by \(\widehat{q}_{i0} = \frac{y_i+0.5}{n_i+1}\), the initial weights \(v_{i0} = \frac{n_i}{\widehat{q}_{i0}(1-\widehat{q}_{i0})\big(g'(\widehat{q}_{i0})\big)^2}\), and the initial values of the pseudo-variables \({\boldsymbol{z}}\) given by \(z_{i0} = \widehat{\eta}_{i0} = g(\widehat{q}_{i0})\). The values \(z_{i0}\) are then regressed on the \(p\) explanatory variables using weighted least squares (with weights \(v_{i0}\)). The coefficients of the explanatory variables are the components of \(\widehat{\boldsymbol{\beta}}^{(1)}\). You can then obtain \[\begin{eqnarray*} \widehat{\eta}_{i1} & = & \boldsymbol{x}_i^\top\widehat{\boldsymbol{\beta}}^{(1)}\\ \widehat{q}_{i1} & = & g^{-1}(\widehat{\eta}_{i1})\\ v_{i1} & = & \frac{n_i}{\widehat{q}_{i1}(1-\widehat{q}_{i1})\big(g'(\widehat{q}_{i1})\big)^2}\\ z_{i1} & = & \widehat{\eta}_{i1} + \frac{(y_i-n_i\widehat{q}_{i1})g'(\widehat{q}_{i1})}{n_i}. \end{eqnarray*}\] Then, \(\widehat{\boldsymbol{\beta}}^{(2)}\) results from a regression of \(z_{i1}\) on the explanatory variables, taking into account the weights \(v_{i1}\), and so on. The process stops when the difference between \(\widehat{\boldsymbol{\beta}}^{(r)}\) and \(\widehat{\boldsymbol{\beta}}^{(r+1)}\) is sufficiently small.
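The following is a minimal NumPy sketch of these weighted least-squares iterations for a binomial GLM with logit link, using the starting values \(\widehat{q}_{i0}=(y_i+0.5)/(n_i+1)\) described above. The data, tolerance and function name are illustrative assumptions, not part of the original example.

```python
import numpy as np

def irls_binomial_logit(y, n, X, tol=1e-8, max_iter=50):
    """Weighted least-squares iterations for a binomial GLM with logit link.

    y : successes, n : numbers of trials, X : design matrix (intercept column included).
    For the logit link, g(q) = ln(q/(1-q)) and g'(q) = 1/(q(1-q)).
    """
    q = (y + 0.5) / (n + 1.0)          # initial success probabilities q_{i0}
    eta = np.log(q / (1 - q))          # eta_{i0} = g(q_{i0})
    z = eta.copy()                     # initial pseudo-responses z_{i0}
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        v = n * q * (1 - q)            # v_i = n_i/(q(1-q)g'(q)^2) reduces to n_i q(1-q) here
        WX = X * v[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)   # weighted least squares of z on X
        eta = X @ beta_new
        q = 1.0 / (1.0 + np.exp(-eta))
        z = eta + (y - n * q) / (n * q * (1 - q))        # updated pseudo-responses
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# illustrative grouped data (assumed): claim occurrence by age group
X = np.column_stack([np.ones(5), np.array([18.0, 25.0, 35.0, 50.0, 70.0])])
n = np.array([30.0, 50.0, 80.0, 60.0, 20.0])
y = np.array([9.0, 10.0, 12.0, 7.0, 3.0])
print(irls_binomial_logit(y, n, X))
```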

In the special case of logistic regression, the pseudo-observations are \[ z_i = \widehat{\eta}_i + \frac{y_i-n_i\widehat{q}_i}{n_i\widehat{q}_i(1-\widehat{q}_i)} \] and the weights are given by \[ v_i = n_iq_i(1-q_i). \] Note that the weights are simply the variances of \(Y_i\).

Example 10.10 (Poisson Regression) If the observations \(n_i\) follow a \(\mathcal{P}oi(\lambda_i)\) distribution, the gradient vector of \(L(\boldsymbol{\beta}|\boldsymbol{n})\), of dimension \(p+1\), is given by \[ \boldsymbol{U}(\boldsymbol{\beta}) = \sum_{i=1}^n \boldsymbol{x}_i(n_i-{\lambda}_i), \] where we’ve added an extra component \(x_{i0}=1\) to the vector \(\boldsymbol{x}_i\). The Hessian matrix, of dimension \((p+1)\times(p+1)\), is given by \[ \boldsymbol{H}(\boldsymbol{\beta}) = -\sum_{i=1}^n \boldsymbol{x}_i\boldsymbol{x}_i^\top{\lambda}_i = -\boldsymbol{X}^\top\text{diag}(\boldsymbol{\lambda})\boldsymbol{X}, \] where diag\((\boldsymbol{\lambda})\) denotes the diagonal matrix of dimension \(n\times n\) with main elements \(\lambda_1,\ldots,\lambda_n\).

The iterative procedure to obtain the maximum likelihood estimator \(\widehat{\boldsymbol{\beta}}\) of \(\boldsymbol{\beta}\) is as follows: starting from an initial value \(\widehat{\boldsymbol{\beta}}^{(0)}\) that is expected to be close to \(\widehat{\boldsymbol{\beta}}\), you define the \((r+1)\)-th approximation \(\widehat{\boldsymbol{\beta}}^{(r+1)}\) of \(\widehat{\boldsymbol{\beta}}\) from the \(r\)-th one \(\widehat{\boldsymbol{\beta}}^{(r)}\) by \[ \widehat{\boldsymbol{\beta}}^{(r+1)} =\widehat{\boldsymbol{\beta}}^{(r)}+\big(\boldsymbol{X}^\top\text{diag}(\widehat{\boldsymbol{\lambda}}^{(r)})\boldsymbol{X}\big)^{-1} \boldsymbol{X}^\top(\boldsymbol{n}-\widehat{\boldsymbol{\lambda}}^{(r)}). \] A good initial value \(\widehat{\boldsymbol{\beta}}^{(0)}\) can be obtained by taking \(\widehat{\beta}_0^{(0)}=\ln\overline{n}\), where \(\overline{n}\) is the average number of claims per policy, and \(\widehat{\beta}_j^{(0)}=0\) for \(j=1,\ldots,p\). Note that \(\widehat{\boldsymbol{\beta}}^{(0)}\) corresponds to the homogeneous Poisson model.
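As a complement, here is a minimal NumPy sketch of the recursion above for the Poisson model with log link, initialized at \(\widehat{\beta}_0^{(0)}=\ln\overline{n}\); the simulated data and the function name are illustrative assumptions.

```python
import numpy as np

def newton_raphson_poisson(X, n_claims, tol=1e-10, max_iter=25):
    """Newton-Raphson iterations for a Poisson regression with log link."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(n_claims.mean())      # homogeneous Poisson model as starting point
    for _ in range(max_iter):
        lam = np.exp(X @ beta)             # fitted means lambda_i
        score = X.T @ (n_claims - lam)     # U(beta)
        info = X.T @ (X * lam[:, None])    # -H(beta) = X' diag(lambda) X
        step = np.linalg.solve(info, score)
        beta = beta + step                 # beta_{r+1} = beta_r - H^{-1} U
        if np.max(np.abs(step)) < tol:
            break
    return beta

# illustrative data (assumed)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
n_claims = rng.poisson(np.exp(-2.0 + 0.3 * X[:, 1]))
beta_hat = newton_raphson_poisson(X, n_claims)
print(beta_hat)
```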

Example 10.11 (Iterative Algorithm for Poisson Model) The iterative algorithm providing the maximum likelihood estimator of \(\boldsymbol{\beta}\) in the Poisson model can also be written as follows: \[\begin{eqnarray*} \widehat{\boldsymbol{\beta}}^{(r+1)} & = & \widehat{\boldsymbol{\beta}}^{(r)} + \left(\sum_{i=1}^n\widehat{\lambda}_i^{(r)}\boldsymbol{x}_i\boldsymbol{x}_i^\top\right)^{-1} \sum_{i=1}^n\boldsymbol{x}_i(n_i-\widehat{\lambda}_i^{(r)})\\ & = & \left(\sum_{i=1}^n\left(\sqrt{\widehat{\lambda}_i^{(r)}}\boldsymbol{x}_i\right)\left(\sqrt{\widehat{\lambda}_i^{(r)}}\boldsymbol{x}_i\right)^\top\right)^{-1}\\ & & \sum_{i=1}^n\left(\sqrt{\widehat{\lambda}_i^{(r)}}\boldsymbol{x}_i\right)\left(\sqrt{\widehat{\lambda}_i^{(r)}} \frac{n_i-\widehat{\lambda}_i^{(r)}} {\widehat{\lambda}_i^{(r)}}+\sqrt{\widehat{\lambda}_i^{(r)}}\boldsymbol{x}_i^\top\widehat{\boldsymbol{\beta}}^{(r)}\right). \end{eqnarray*}\] Comparing the recurrence relation above to (10.12), we see that \(\widehat{\boldsymbol{\beta}}^{(r+1)}\) is nothing but the least squares estimator associated with the linear regression model \[ \sqrt{\widehat{\lambda}_i^{(r)}}\left(\frac{n_i-\widehat{\lambda}_i^{(r)}} {\widehat{\lambda}_i^{(r)}}+\boldsymbol{x}_i^\top\widehat{\boldsymbol{\beta}}^{(r)}\right) =\left(\sqrt{\widehat{\lambda}_i^{(r)}}\boldsymbol{x}_i\right)^\top\widehat{\boldsymbol{\beta}}^{(r+1)} +\epsilon_i \] where \(\epsilon_i\) is a centered Gaussian error term. The maximum likelihood estimator of the parameter \(\boldsymbol{\beta}\) can thus be obtained using an iterative least squares method.

Equivalently, \(\widehat{\boldsymbol{\beta}}^{(r+1)}\) can be obtained through a weighted least squares fit of the pseudo-variables \[ z_i^{(r)} = \frac{n_i-\widehat{\lambda}_i^{(r)}} {\widehat{\lambda}_i^{(r)}}+\boldsymbol{x}_i^\top\widehat{\boldsymbol{\beta}}^{(r)} \] on \(\boldsymbol{x}_i\), with weights \(\widehat{\lambda}_i^{(r)}\) that change at each iteration.

10.9.8 Fisher Information

The \((j,k)\) element of the Fisher information matrix \(\mathcal{I}\) is given by \((\mathcal{I})_{jk} = \mathbb{E}[U_jU_k]\). The contribution of each observation \(Y_i\) to \((\mathcal{I})_{jk}\) is \[\begin{eqnarray*} \mathbb{E}\left[\frac{\partial\ln f_{\theta_i}(Y_i)}{\partial\beta_j} \frac{\partial\ln f_{\theta_i}(Y_i)}{\partial\beta_k}\right] &=&\mathbb{E}\left[\frac{(Y_i-\mu_i)^2x_{ij}x_{ik}}{\left({\mathbb{V}}[Y_i]\right)^2} \left(\frac{\partial\mu_i}{\partial\eta_i}\right)^2\right]\\ &=&\frac{x_{ij}x_{ik}}{{\mathbb{V}}[Y_i]} \left(\frac{\partial\mu_i}{\partial\eta_i}\right)^2. \end{eqnarray*}\] Therefore, \[\begin{equation} (\mathcal{I})_{jk} =\sum_{i=1}^n\frac{x_{ij}x_{ik}}{{\mathbb{V}}[Y_i]} \left(\frac{\partial\mu_i}{\partial\eta_i}\right)^2. \tag{10.24} \end{equation}\]

::: {.example}[Binomial Regression]
If the observations \(y_i\) follow a \(\mathcal{B}in(n_i,q_i)\) distribution, \(i=1,\ldots,n\), the \((j,k)\) element of the Fisher information matrix is given by \[ \mathcal{I}_{jk}(\boldsymbol{\beta})= \mathbb{E}\left[\sum_{i=1}^n\frac{Y_i-n_iq_i}{q_i(1-q_i)}\frac{x_{ij}}{g'(q_i)} \sum_{\ell=1}^n\frac{Y_\ell-n_\ell q_\ell}{q_\ell(1-q_\ell)}\frac{x_{\ell k}}{g'(q_\ell)}\right]. \] Now, note that \[ \mathbb{E}\Big[(Y_i-n_iq_i)(Y_\ell-n_\ell q_\ell)\Big]=\mathbb{C}[Y_i,Y_\ell]=0 \text{ for }i\neq \ell \] since the observations are assumed to be independent, while for \(i=\ell\) we obtain \[ \mathbb{E}\Big[(Y_i-n_iq_i)^2\Big]=\mathbb{V}[Y_i]=n_iq_i(1-q_i), \] from which we derive \[ \mathcal{I}_{jk}(\boldsymbol{\beta})=\sum_{i=1}^n\frac{n_i}{q_i(1-q_i)\big(g'(q_i)\big)^2} x_{ij}x_{ik}. \]
:::

::: {.example}[Poisson Regression]
If the observations \(n_i\) follow a \(\mathcal{P}oi(\lambda_i)\) distribution, \[\begin{eqnarray*} \mathbb{C}[U_j,U_k]&=&\mathbb{E}[U_jU_k]\\ &=&\mathbb{E}\left[\sum_{i_1=1}^nx_{i_1j}(N_{i_1}-\lambda_{i_1})\sum_{i_2=1}^nx_{i_2k}(N_{i_2}-\lambda_{i_2})\right]\\ &=&\sum_{i_1=1}^n\sum_{i_2=1}^nx_{i_1j}x_{i_2k}\underbrace{\mathbb{C}[N_{i_1},N_{i_2}]}_{=0\text{ if }i_1\neq i_2}\\ &=&\sum_{i=1}^nx_{ij}x_{ik}\mathbb{V}[N_i]=\sum_{i=1}^nx_{ij}x_{ik}\lambda_i. \end{eqnarray*}\] The Fisher information matrix \(\mathcal{I}\) is thus given by \[ \mathcal{I}=\sum_{i=1}^n\boldsymbol{x}_i\boldsymbol{x}_i^\top\lambda_i. \] If the matrix \(\boldsymbol{X}\) has rank \(p+1\), then \(\boldsymbol{H}\) is non-singular and, moreover, negative definite. This ensures that the solution of the likelihood equations corresponds to a maximum of \(L(\boldsymbol{\beta})\).
:::

Remark. When the number of observations is large, \[ {\boldsymbol{H}}(\boldsymbol{\beta})\approx \mathbb{E}[{\boldsymbol{H}}(\boldsymbol{\beta})]= -\mathcal{I}(\boldsymbol{\beta}), \] so an alternative to (10.23) is given by \[\begin{equation} \widehat{\boldsymbol{\beta}}^{(r+1)}\approx\widehat{\boldsymbol{\beta}}^{(r)} +\mathcal{I}^{-1}(\widehat{\boldsymbol{\beta}}^{(r)}) {\boldsymbol{U}}(\widehat{\boldsymbol{\beta}}^{(r)}). \tag{10.25} \end{equation}\] This second iterative scheme is called Fisher's method of scoring.

Note that \(\mathbb{E}[\boldsymbol{H}]=-\mathbb{E}[\boldsymbol{U}\boldsymbol{U}^\top]=-\mathcal{I}\). This allows us to interpret \(\mathcal{I}\) as a quantity of information: if \(\mathcal{I}\) is small, the log-likelihood has little curvature around its maximum, and the maximum likelihood estimator is therefore determined less precisely.

10.9.9 Confidence Interval for Parameters

10.9.9.1 Likelihood Ratio Method

The likelihood ratio method is based on the profile likelihood, defined for the parameter \(\beta_j\) as the function \[ \mathcal{L}_j(\beta_j|\boldsymbol{y})=\max_{\beta_0,\ldots,\beta_{j-1},\beta_{j+1},\ldots,\beta_p}\mathcal{L}(\boldsymbol{\beta}|\boldsymbol{y}). \] If \(\widehat{\boldsymbol{\beta}}\) is the maximum likelihood estimator of \(\boldsymbol{\beta}\), then \(2\big\{L(\widehat{\boldsymbol{\beta}}|\boldsymbol{y})-L_j(\beta_j|\boldsymbol{y})\big\}\) is approximately chi-squared distributed with one degree of freedom when \(\beta_j\) is the true parameter value, where \(L_j(\beta_j|\boldsymbol{y})=\ln\mathcal{L}_j(\beta_j|\boldsymbol{y})\). Therefore, a \(1-\alpha\) confidence interval for \(\beta_j\) is given by the set of values \(\xi\) for which the difference \(L(\widehat{\boldsymbol{\beta}}|\boldsymbol{y})-L_j(\xi|\boldsymbol{y})\) is small enough, or equivalently such that \(2\big\{L(\widehat{\boldsymbol{\beta}}|\boldsymbol{y})-L_j(\xi|\boldsymbol{y})\big\}\leq\chi_{1;1-\alpha}^2\), i.e. \[ CI=\left\{\xi\in\mathbb{R}\,\Big|\,L_j(\xi|\boldsymbol{y})\geq L(\widehat{\boldsymbol{\beta}}|\boldsymbol{y})-\frac{1}{2}\chi_{1;1-\alpha}^2\right\}, \] where \(\chi_{1;1-\alpha}^2\) is the \(1-\alpha\) quantile of the chi-squared distribution with one degree of freedom.

The endpoints of this interval are obtained numerically by approximating the likelihood function with a second-degree surface. Specifically, we use the approximation \[ L(\boldsymbol{\beta}|\boldsymbol{y})\approx L(\boldsymbol{\beta}_0|\boldsymbol{y})+(\boldsymbol{\beta}-\boldsymbol{\beta}_0)^\top\boldsymbol{U}(\boldsymbol{\beta}_0)+\frac{1}{2} (\boldsymbol{\beta}-\boldsymbol{\beta}_0)^\top\boldsymbol{H}(\boldsymbol{\beta}_0)(\boldsymbol{\beta}-\boldsymbol{\beta}_0), \] which should be of good quality for \(\boldsymbol{\beta}_0\) sufficiently close to \(\boldsymbol{\beta}\). By approximating \(\boldsymbol{H}(\boldsymbol{\beta}_0)\) by its mathematical expectation \(-\mathcal{I}\), we obtain \[ L(\boldsymbol{\beta}|\boldsymbol{y})\approx L(\boldsymbol{\beta}_0|\boldsymbol{y})+(\boldsymbol{\beta}-\boldsymbol{\beta}_0)^\top\boldsymbol{U}(\boldsymbol{\beta}_0)-\frac{1}{2} (\boldsymbol{\beta}-\boldsymbol{\beta}_0)^\top\mathcal{I}(\boldsymbol{\beta}-\boldsymbol{\beta}_0). \]

10.9.9.2 Wald Method

Thanks to the normal approximation (10.6) for \(\widehat{\boldsymbol{\beta}}\), a confidence interval at the \(1-\alpha\) confidence level for \(\beta_j\) is given by \[ \left[\hat{{\beta}}_j\pm z_{\alpha/2}\sqrt{v_{jj}}\right] \] where \(v_{jj}\) is the diagonal element \((jj)\) of \(\mathcal{I}^{-1}\). This confidence interval is often called the Wald interval. The diagonal elements of \(\mathcal{I}^{-1}\) reflect the precision of the point estimates \(\hat{{\beta}}_j\), while the off-diagonal elements estimate the covariances between the estimators of the \(\beta_j\).
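For the Poisson model, the Wald interval can be computed directly from the Fisher information \(\mathcal{I}=\boldsymbol{X}^\top\text{diag}(\widehat{\boldsymbol{\lambda}})\boldsymbol{X}\). The sketch below, which assumes a design matrix X and an estimate beta_hat such as those of the Newton-Raphson sketch above, is illustrative only.

```python
import numpy as np
from scipy.stats import norm

def wald_intervals_poisson(X, beta_hat, alpha=0.05):
    """Wald confidence intervals for a Poisson GLM with log link."""
    lam_hat = np.exp(X @ beta_hat)
    info = X.T @ (X * lam_hat[:, None])      # Fisher information matrix
    cov = np.linalg.inv(info)                # asymptotic covariance of beta_hat
    se = np.sqrt(np.diag(cov))               # standard errors sqrt(v_jj)
    z = norm.ppf(1 - alpha / 2)
    return np.column_stack([beta_hat - z * se, beta_hat + z * se])

# e.g. wald_intervals_poisson(X, beta_hat) with the objects of the previous sketch
```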

10.9.10 Model Comparison

Assessing how well the model fits a dataset amounts to comparing the fitted means \(\widehat{\mu}_i\) with the observations \(y_i\). In general, \(y_i\neq\widehat{\mu}_i\), so we must ask whether the differences between \(y_i\) and \(\widehat{\mu}_i\) reflect a misspecification of the model or can be attributed to chance. To answer this question, several statistics are available.

10.9.10.1 Deviance

Here, we consider generalized linear models based on the same density (10.20) and having the same link function but differing in the number of parameters they use.

We assess the quality of a model by taking as a reference the saturated model, which has as many parameters as observations and thus reproduces the data perfectly. The saturated model is characterized by \(\widehat{\mu}_i=y_i\) for \(i=1,2,\ldots,n\); the likelihood associated with this model will be denoted \(\mathcal{L}(\boldsymbol{y}|\boldsymbol{y})\). In practice, this model is of no interest, as it merely reproduces the observations without summarizing them. Consider now a model whose parameter \(\boldsymbol{\beta}\) has dimension \(p+1<n\). The model fits the data well when \(\mathcal{L}(\widehat{\boldsymbol{\mu}}|\boldsymbol{y})\approx \mathcal{L}(\boldsymbol{y}|\boldsymbol{y})\) and poorly when \(\mathcal{L}(\widehat{\boldsymbol{\mu}}|\boldsymbol{y})\ll \mathcal{L}(\boldsymbol{y}|\boldsymbol{y})\). This suggests using the likelihood ratio \[ \Lambda=\frac{\mathcal{L}(\boldsymbol{y}|\boldsymbol{y})} {\mathcal{L}(\widehat{\boldsymbol{\mu}}|\boldsymbol{y})} \] as a measure of the quality of the fit provided by a model, or equivalently \[ \ln\Lambda=\ln \mathcal{L}(\boldsymbol{y}|\boldsymbol{y})- \ln \mathcal{L}(\widehat{\boldsymbol{\mu}}|\boldsymbol{y}). \] A large value of \(\ln\Lambda\) indicates a model of poor quality. The statistic \(D=2\ln\Lambda\) is called the scaled deviance in the context of generalized linear models, while the (unscaled) deviance is \(D^*=\phi D\).

The quality of a model’s fit to the data is often assessed by the deviance, mainly due to its close relationship with the likelihood ratio test. A small deviance value indicates a good fit since the likelihood of the model is close to that of the saturated model. Conversely, a large deviance value indicates a poor fit.

If the model describes the observed data well, \(D\) is approximately chi-squared distributed with \(n-p-1\) degrees of freedom. An observed value \(D_{\text{obs}}\) that is “too large” suggests a poor-quality model. In practice, we consider the model to be of poor quality if \[ D_{\text{obs}}>\chi_{n-p-1;1-\alpha}^2, \] where \(\chi_{n-p-1;1-\alpha}^2\) is the \(1-\alpha\) quantile of the chi-squared distribution with \(n-p-1\) degrees of freedom.

::: {.example}[Binomial Regression]

If the observations \(y_i\) follow a \(\mathcal{B}in(n_i,q_i)\) distribution, the log-likelihood is given by \[ \sum_{i=1}^n\left(\ln\left(\begin{array}{c}n_i \\y_i\end{array}\right)+y_i\ln \widehat{q}_i+(n_i-y_i)\ln(1-\widehat{q}_i) \right). \] In the saturated model, \(q_i\) is estimated as \(y_i/n_i\), so the log-likelihood is \[ \sum_{i=1}^n\left(\ln\left(\begin{array}{c}n_i \\y_i\end{array}\right)+y_i\ln \frac{y_i}{n_i}+(n_i-y_i)\ln\frac{n_i-y_i}{n_i} \right). \] Thus, the deviance is \[\begin{eqnarray*} D&=&2\sum_{i=1}^n\left(y_i\ln \frac{y_i/n_i}{\widehat{q}_i}+(n_i-y_i)\ln\frac{1-y_i/n_i}{1-\widehat{q}_i} \right)\\ &=&2\sum_{i=1}^n\left(y_i\ln \frac{y_i}{\widehat{y}_i}+(n_i-y_i) \ln\frac{n_i-y_i}{n_i-\widehat{y}_i}\right) \end{eqnarray*}\] where \(\widehat{y}_i=n_i\widehat{q}_i\) is the value predicted by the current model for observation \(i\).

It is important to note that the chi-squared approximation for \(D\) can be quite poor when some \(n_i\) are small and the fitted probabilities \(\widehat{q}_i\) are close to 0 or 1. :::

Remark. In the particular case where the observations \(y_i\) follow a \(\mathcal{B}er(q_i)\) distribution, the deviance does not provide a measure of model quality. In this case, the model's log-likelihood is \[ \sum_{i=1}^n\Big(y_i\ln \widehat{q}_i+(1-y_i)\ln(1-\widehat{q}_i) \Big). \] The saturated model is characterized by \(\widehat{q}_i=y_i\), which makes the associated log-likelihood zero because \(y_i=0\) or \(1\). The deviance is then given by \[\begin{eqnarray*} D&=&-2\sum_{i=1}^n\Big(y_i\ln \widehat{q}_i+(1-y_i)\ln(1-\widehat{q}_i)\Big)\\ &=&-2\sum_{i=1}^n\Big(y_i\ln \frac{\widehat{q}_i}{1-\widehat{q}_i} +\ln(1-\widehat{q}_i)\Big). \end{eqnarray*}\] By differentiating \[ L(\boldsymbol{\beta}|\boldsymbol{y})=\sum_{i=1}^n\Big(y_i\ln q_i+(1-y_i)\ln(1-q_i) \Big) \] with respect to \(\beta_j\), we obtain \[ \frac{\partial}{\partial\beta_j}L(\boldsymbol{\beta}|\boldsymbol{y})= \sum_{i=1}^n\left(\frac{y_i}{q_i}-\frac{1-y_i}{1-q_i}\right)q_i(1-q_i)x_{ij}, \] from which we derive \[\begin{eqnarray*} \sum_{j=0}^p\beta_j\frac{\partial}{\partial\beta_j}L(\boldsymbol{\beta}|\boldsymbol{y})&=& \sum_{i=1}^n(y_i-q_i)\sum_{j=0}^p\beta_jx_{ij}\\ &=&\sum_{i=1}^n(y_i-q_i)\ln\frac{q_i}{1-q_i}. \end{eqnarray*}\] The left-hand side, evaluated at the maximum likelihood estimator \(\widehat{\boldsymbol{\beta}}\) of \(\boldsymbol{\beta}\), vanishes, so that \[ \sum_{i=1}^n(y_i-\widehat{q}_i)\mbox{logit}(\widehat{q}_i)=0 \Leftrightarrow \sum_{i=1}^ny_i\mbox{logit}(\widehat{q}_i) = \sum_{i=1}^n\widehat{q}_i\mbox{logit}(\widehat{q}_i). \] Substituting this into the expression for \(D\) above, we get \[ D=-2\sum_{i=1}^n\Big\{\widehat{q}_i\mbox{logit}(\widehat{q}_i)+\ln(1-\widehat{q}_i)\Big\}. \] The deviance thus depends only on the fitted values \(\widehat{q}_i\) of the \(q_i\), and not on the observations \(y_i\); therefore, \(D\) provides no information about the quality of the fit to the observations and cannot be used to measure model adequacy.

::: {.example}[Poisson Regression]

Let \(L(\widehat{\boldsymbol{\lambda}}|\boldsymbol{n})\) denote the log-likelihood of the fitted model, where \(\widehat{\boldsymbol{\lambda}}=(\widehat{\lambda}_1,\widehat{\lambda}_2,\ldots,\widehat{\lambda}_n)\) with \(\widehat{\lambda}_i=\exp(\widehat{\boldsymbol{\beta}}^\top\boldsymbol{x}_i)\). Since \[ \exp(-\lambda)\frac{\lambda^k}{k!}\leq \exp(-k)\frac{k^k}{k!} \] for any \(k\) and \(\lambda\), the maximum log-likelihood that can be obtained in a model specifying that the \(N_i\) are independent Poisson variables is reached for \(N_i\sim\mathcal{P}oi(n_i)\). In this case, there are as many parameters as observations, namely \(n\). Let \(L(\boldsymbol{n}|\boldsymbol{n})\) denote the log-likelihood of this saturated model (which predicts \(n_i\) for the \(i\)th observation). The deviance is then given by \[\begin{eqnarray*} D(\boldsymbol{n},\widehat{\boldsymbol{\lambda}})&=&2\Big\{L(\boldsymbol{n}|\boldsymbol{n})-L(\widehat{\boldsymbol{\lambda}}|\boldsymbol{n})\Big\}\\ &=&2\ln\left\{\prod_{i=1}^n\exp(-n_i)\frac{n_i^{n_i}}{n_i!}\right\} -2\ln\left\{\prod_{i=1}^n\exp(-\widehat{\lambda}_i)\frac{\widehat{\lambda}_i^{n_i}}{n_i!}\right\}\\ & = & 2\sum_{i=1}^n\left\{n_i\ln\frac{n_i}{\widehat{\lambda}_i}-(n_i-\widehat{\lambda}_i)\right\} \tag{10.26} \end{eqnarray*}\] where we set \(y\ln y=0\) when \(y=0\). Since the inclusion of an intercept \(\beta_0\) ensures that \(\sum_{i=1}^n(n_i-\widehat{\lambda}_i)=0\), the deviance can in this case be written as \[\begin{equation} D(\boldsymbol{n},\widehat{\boldsymbol{\lambda}})=2\sum_{i=1}^nn_i\ln\frac{n_i}{\widehat{\lambda}_i}. \tag{10.27} \end{equation}\] A small computational sketch of this deviance is given after this example.

:::
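Here is a short sketch of the Poisson deviance (10.26), with the convention \(0\ln 0=0\), together with the chi-squared comparison of the previous subsection; the commented usage assumes objects such as X, n_claims and beta_hat from the earlier Newton-Raphson sketch.

```python
import numpy as np
from scipy.stats import chi2

def poisson_deviance(n_obs, lam_hat):
    """Deviance D(n, lambda_hat) of a Poisson GLM, with the convention 0*ln(0) = 0."""
    n_pos = np.where(n_obs > 0, n_obs, 1.0)    # replaces 0 by 1 so that the log term vanishes
    return 2.0 * np.sum(n_obs * np.log(n_pos / lam_hat) - (n_obs - lam_hat))

# illustrative usage (assumed objects from the Newton-Raphson sketch):
# D_obs = poisson_deviance(n_claims, np.exp(X @ beta_hat))
# poor_fit = D_obs > chi2.ppf(0.95, df=len(n_claims) - X.shape[1])
```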

Remark (Pseudo-\(R^2\) for Count Data). Once the model has been fitted (i.e., the relevant explanatory variables selected and the maximum likelihood estimator \(\widehat{\boldsymbol{\beta}}\) of \(\boldsymbol{\beta}\) obtained), it is crucial to evaluate its quality, i.e., its ability to describe the number of claims affecting the various policyholders in the portfolio. This assessment can be carried out using the deviance, as described above.

Suppose the observations \(N_i\) follow a \(\mathcal{P}oi(\lambda_i)\) distribution. The generalization of the usual \(R^2\) statistic from linear regression is based on the decomposition of deviance into \[ D(\boldsymbol{n},\overline{\boldsymbol{n}})=D(\boldsymbol{n},\widehat{\boldsymbol{\lambda}})+D(\widehat{\boldsymbol{\lambda}},\overline{\boldsymbol{n}}) \] where

  1. \(D(\boldsymbol{n},\overline{\boldsymbol{n}})\) is the deviance of the model containing only the intercept \(\beta_0\) (i.e., the one using no explanatory variables, for which \(\widehat{\lambda}_i=\overline{n} =\frac{1}{n}\sum_{j=1}^nn_j\), \(i=1,2,\ldots,n\)), given by \[ D(\boldsymbol{n},\overline{\boldsymbol{n}})=2\sum_{i=1}^nn_i\ln\frac{n_i}{\overline{n}}; \]
  2. \(D(\boldsymbol{n},\widehat{\boldsymbol{\lambda}})\) is the deviance of the considered model, i.e., the part left unexplained by the explanatory variables;
  3. \(D(\widehat{\boldsymbol{\lambda}},\overline{\boldsymbol{n}})\) is the deviance explained by the model, given by \[ D(\widehat{\boldsymbol{\lambda}},\overline{\boldsymbol{n}})=2\sum_{i=1}^n\left\{\widehat{\lambda}_i\ln\frac{\widehat{\lambda}_i}{\overline{n}} -(\widehat{\lambda}_i-\overline{n})\right\}. \]

This decomposition leads to the definition \[\begin{eqnarray*} R_D^2&=&1-\frac{D(\boldsymbol{n},\widehat{\boldsymbol{\lambda}})}{D(\boldsymbol{n},\overline{\boldsymbol{n}})} \;=\;\frac{D(\widehat{\boldsymbol{\lambda}},\overline{\boldsymbol{n}})}{D(\boldsymbol{n},\overline{\boldsymbol{n}})}\\ &=&\frac{\sum_{i=1}^n\left\{n_i\ln\frac{\widehat{\lambda}_i}{\overline{n}}-(n_i-\widehat{\lambda}_i)\right\}} {\sum_{i=1}^nn_i\ln\frac{n_i}{\overline{n}}} \end{eqnarray*}\] with the convention \(y\ln y=0\) when \(y=0\). The pseudo-\(R^2\) measures the reduction in deviance resulting from the inclusion of explanatory variables in the model. The statistic \(R_D^2\) has the following desirable properties: it always lies between 0 and 1, it increases when new variables are added to the model, and it can be interpreted in terms of information as the proportional reduction in the Kullback-Leibler contrast resulting from the inclusion of the new variables. Only the last property requires the correct specification of the distribution of the \(N_i\).
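A minimal sketch of this pseudo-\(R^2\) for a Poisson fit follows; it assumes that the fitted model contains an intercept (so that the deviance decomposition above applies) and that n_obs and lam_hat are the observed counts and fitted means.

```python
import numpy as np

def pseudo_r2_poisson(n_obs, lam_hat):
    """Deviance-based pseudo-R^2 for a Poisson GLM containing an intercept."""
    n_bar = n_obs.mean()
    n_pos = np.where(n_obs > 0, n_obs, 1.0)                   # convention 0*ln(0) = 0
    d_null = 2.0 * np.sum(n_obs * np.log(n_pos / n_bar))      # D(n, n_bar)
    d_model = 2.0 * np.sum(n_obs * np.log(n_pos / lam_hat)
                           - (n_obs - lam_hat))               # D(n, lambda_hat)
    return 1.0 - d_model / d_null
```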

10.9.10.2 Pearson’s Goodness-of-Fit Statistic

Pearson's chi-squared statistic \[ X^2=\sum_{i=1}^n\omega_i\frac{(y_i-\widehat{\mu}_i)^2}{V(\widehat{\mu}_i)}, \] where \(V(\cdot)\) is the variance function, is also used to measure the quality of a fit. Like the deviance \(D\), the statistic \(X^2\) follows an exact chi-squared distribution in the particular case of the Gaussian linear model; in other cases, this result holds only approximately.

10.9.11 Hypothesis Tests on Parameters

We want to test the hypothesis \(H_0:\boldsymbol{\beta}=\boldsymbol{\beta}_0=(\beta_0,\beta_1,\ldots,\beta_q)^\top\) against \(H_1:\boldsymbol{\beta}=\boldsymbol{\beta}_1=(\beta_0,\beta_1,\ldots,\beta_p)^\top\) where \(q<p<n\). This amounts to testing the joint nullity of \(\beta_{q+1},\ldots,\beta_p\). We then use the statistic \(\Delta\), which is the difference between the deviances of the two models, namely \[ \Delta=D_0-D_1 = 2\left(\ln L_{\hat{\boldsymbol{\beta}}_1}(\boldsymbol{y})- \ln L_{\hat{\boldsymbol{\beta}}_0}(\boldsymbol{y})\right)\geq 0. \] It can be shown that \(\Delta\) is approximately chi-squared distributed with \(p-q\) degrees of freedom. We reject \(H_0\) in favor of \(H_1\) when \[ \Delta_{obs}>\chi_{p-q;1-\alpha}^2, \] where \(\chi_{p-q;1-\alpha}^2\) is the \(1-\alpha\) quantile of the chi-squared distribution with \(p-q\) degrees of freedom.
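In practice the test reduces to comparing the observed difference of deviances with a chi-squared quantile, as in the following sketch (the deviances of the two nested fits are assumed to have been computed beforehand, e.g. with the Poisson deviance function above).

```python
from scipy.stats import chi2

def deviance_difference_test(D0, D1, q, p, alpha=0.05):
    """Test H0 (restricted model, q+1 parameters) against H1 (full model, p+1 parameters)."""
    delta = D0 - D1                           # difference of deviances, >= 0
    critical = chi2.ppf(1 - alpha, df=p - q)  # chi-squared quantile with p-q degrees of freedom
    p_value = chi2.sf(delta, df=p - q)
    return delta, critical, p_value           # reject H0 when delta > critical
```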

The interest of this type of test arises when the actuary wonders whether certain levels of a categorical variable should be grouped. Indeed, testing the nullity of a regression coefficient only indicates whether the level in question should be merged with the reference level. However, it could be the case that two levels of a categorical variable are statistically equivalent while both differ from the reference level. In such cases, you would want to perform a test like \(H_0:\beta_1=\beta_2\). Specifically, in the example used in Section 10.2 to illustrate the coding of categorical variables with binary variables, you could test \(H_0:\beta_3=0\) and \(H_0:\beta_4=0\), which would tell you whether policyholders under 30 or over 65 differ from those aged 30 to 65, but also \(H_0:\beta_3=\beta_4\) to see whether it makes sense to group those under 30 with those over 65.

10.9.12 Estimation of the Dispersion Parameter

The estimation of the dispersion parameter \(\phi\) can be based on the deviance. Since \(\mathbb{E}[D]\approx n-p-1\), i.e. \(\mathbb{E}[D^*]\approx \phi(n-p-1)\), we could estimate \(\phi\) by \[ \widetilde{\phi}=\frac{1}{n-p-1}D^*. \] However, this estimator is rarely used in practice because it is very unstable. To avoid this problem, a second-order Taylor expansion of the log-likelihood leads to \[ \widehat{\phi}=\frac{1}{n-p-1}(\boldsymbol{y}-\widehat{\boldsymbol{\mu}})^\top \mathcal{I}_n(\widehat{\boldsymbol{\mu}})(\boldsymbol{y}-\widehat{\boldsymbol{\mu}}), \] where \(\mathcal{I}_n(\widehat{\boldsymbol{\mu}})=\text{diag}\big(\omega_i/V(\widehat{\mu}_i)\big)\); this estimator is often called the Pearson \(X^2\) estimator, since it equals \(X^2/(n-p-1)\).
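The Pearson estimator of \(\phi\) can be written in a few lines; the sketch below assumes arrays of observations, fitted means and prior weights, together with the variance function of the chosen family (for instance \(V(\mu)=\mu^2\) for the Gamma family).

```python
import numpy as np

def pearson_dispersion(y, mu_hat, var_fun, weights, p):
    """Pearson X^2 estimate of the dispersion parameter phi.

    var_fun : variance function V(.), weights : prior weights omega_i,
    p : number of explanatory variables (p+1 estimated coefficients).
    """
    x2 = np.sum(weights * (y - mu_hat) ** 2 / var_fun(mu_hat))
    return x2 / (len(y) - p - 1)

# e.g. for a Gamma GLM: pearson_dispersion(y, mu_hat, lambda m: m**2, np.ones_like(y), p)
```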

10.9.13 Residual Analysis

The goodness-of-fit measures discussed above (deviance and Pearson’s statistic) provide global indications of model quality. However, a careful analysis of residuals allows us to discover where any discrepancy between the model and the observations may come from, and thus improve the initial model, if necessary.

Residuals are based on the difference between the observation \(y_i\) and the predicted value \(\mu_i\). They help check the adequacy of the model and also detect particular observations (often called “outliers”). These outliers can have a significant influence on parameter estimates, and it is important to re-estimate the model after removing these observations to ensure the stability of the results obtained.

There are two situations where a model is considered inadequate based on global measures such as deviance or Pearson’s statistic: either a small number of observations are poorly described by the model, or all observations show a systematic deviation from the model.

A graphical representation of residuals can detect deviations from the model for most types of dependent variables. However, when the dependent variable can only take a few values (as in logistic regression, for example), residual analysis may have limited utility.

Two types of residuals are commonly used in the context of generalized linear models: Pearson residuals and deviance residuals.

10.9.13.1 Pearson Residuals

They are defined as \[ r_i^P=\frac{\sqrt{\omega_i}(y_i-\mu_i)}{\sqrt{V(\mu_i)}}. \] The name of this first type of residual comes from the fact that \(r_i^P\) can be seen as the square root of the contribution of the \(i\)-th observation to the Pearson statistic, i.e. \[ \sum_{i=1}^n\{r_i^P\}^2=X^2. \]

::: {.example}[Binomial Regression]

If the observations \(y_i\) follow a \(\mathcal{B}in(n_i,q_i)\) distribution, the Pearson residual is \[ r_i^P=\frac{y_i-\widehat{y}_i}{\sqrt{n_i\widehat{q}_i(1-\widehat{q}_i)}}. \] :::

::: {.example}[Poisson Regression]

If the observations \(n_i\) follow a \(\mathcal{P}oi(\lambda_i)\) distribution, the Pearson residual is \[ r_i^P=\frac{n_i-\widehat{\lambda}_i}{\sqrt{\widehat{\lambda}_i}}. \] :::

10.9.13.2 Deviance Residuals

We have seen that the deviance is a measure of the goodness of fit provided by a model. We can consider that each observation \(y_i\) contributes an amount \(d_i\) to the deviance \(D\), i.e. \[ D=\sum_{i=1}^nd_i. \] Deviance residuals are then defined as the square root of the contribution \(d_i\) of the \(i\)-th observation to the deviance \(D\), multiplied by the sign of the raw residual \(y_i-\mu_i\), i.e. \[ r_i^D=\mbox{sign}(y_i-\mu_i)\sqrt{d_i}. \] Thus, \[ \sum_{i=1}^n\{r_i^D\}^2=D. \]

::: {.example}[Logistic Regression]

If the observations \(y_i\) follow a \(\mathcal{B}in(n_i,q_i)\) distribution, the deviance residual is \[ r_i^D = \mbox{sign}(y_i-\widehat{y}_i)\sqrt{2y_i\ln\left(\frac{y_i}{\widehat{y}_i}\right)+ 2(n_i-y_i)\ln\left(\frac{n_i-y_i}{n_i-\widehat{y}_i}\right)}. \]

:::

::: {.example}[Poisson Regression]

If the observations \(n_i\) follow a \(\mathcal{P}oi(\lambda_i)\) distribution, the deviance residual is \[ r_i^D=\mbox{sign}(n_i-\widehat{\lambda}_i)\sqrt{2\left\{n_i\ln\frac{n_i}{\widehat{\lambda}_i}-(n_i-\widehat{\lambda}_i)\right\}}, \] where \(y\ln y=0\) when \(y=0\); \(\{r_i^D\}^2\) is then the contribution of the \(i\)-th observation to the deviance statistic \(D\). A computational sketch of both types of residuals is given after this example.

:::
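Both types of residuals are straightforward to compute once the fitted means are available; here is a small sketch for the Poisson case, with the usual convention \(0\ln 0 = 0\) (the array names are assumptions).

```python
import numpy as np

def poisson_residuals(n_obs, lam_hat):
    """Pearson and deviance residuals of a Poisson GLM."""
    pearson = (n_obs - lam_hat) / np.sqrt(lam_hat)
    n_pos = np.where(n_obs > 0, n_obs, 1.0)                  # convention 0*ln(0) = 0
    d_i = 2.0 * (n_obs * np.log(n_pos / lam_hat) - (n_obs - lam_hat))
    deviance = np.sign(n_obs - lam_hat) * np.sqrt(np.maximum(d_i, 0.0))  # guard for rounding
    return pearson, deviance
```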

Remark. Another interesting quantity is obtained by taking the difference between the deviance obtained using the \(n\) observations and that obtained by removing the \(i\)-th observation from the data set (i.e., based on a data set with \(n-1\) observations). This allows measuring the overall influence of \(y_i\) on the model.

Note that the exact computation of these quantities takes time (since the model must be fitted \(n\) times using a data set of size \(n-1\)). To avoid excessive computation time, various approximations are used in practice for the difference in deviances obtained by omitting the observation \(y_i\).

10.9.13.3 Representation of Residuals

The residuals introduced above can be plotted against various statistics, each providing different information about deviations from the model. Residuals can be represented against the observation number (index plot), which helps identify observations that contribute to large residuals (and, consequently, to model inadequacy).

Residuals can also be represented against the predicted values \(\widehat{\mu}_i\) or the linear predictors \(\widehat{\eta}_i\). They can also be plotted against each of the explanatory variables.


10.9.13.4 Influential Observations

An observation \(y_i\) is said to have influence on a considered model when a small change in \(y_i\), or its omission, leads to significantly different parameter estimates. Such an observation has a substantial impact on the study's conclusions. However, an influential observation is not necessarily an outlier; it can be close to the other observations and have a small residual. The leverage effect of an observation \(y_i\) on the predicted value \(\widehat{y}_j\) is the derivative of \(\widehat{y}_j\) with respect to \(y_i\), indicating how the predicted values of the other observations vary with changes in \(y_i\). To measure this effect, we use the projection matrix \({\boldsymbol{H}}\), which maps the observations \(y_i\) to their predictions \(\widehat{y}_i\), i.e., \[ \widehat{\boldsymbol{y}}={\boldsymbol{H}}\boldsymbol{y}. \] In the case of generalized linear models, we have \[ {\boldsymbol{H}}={\boldsymbol{V}}^{1/2}{\boldsymbol{X}}\Big({\boldsymbol{X}}^\top{\boldsymbol{V}}{\boldsymbol{X}}\Big)^{-1} {\boldsymbol{X}}^\top{\boldsymbol{V}}^{1/2}, \] where \({\boldsymbol{V}}\) is the diagonal matrix of the iterative weights, \[ {\boldsymbol{V}}=\mbox{diag}\left(\frac{\omega_i}{\phi\,V(\mu_i)}\left(\frac{d\mu_i}{d\eta_i}\right)^2 \right). \]

The diagonal elements of \({\boldsymbol{H}}\) provide information about the influence of each observation. It is worth noting that this measure depends on both the explanatory variables and the estimates of the parameters \(\beta_0,\beta_1,\ldots,\beta_p\). Since the trace of \({\boldsymbol{H}}\) equals \(p+1\), the average value of the diagonal terms is \((p+1)/n\). Values corresponding to diagonal terms that exceed, say, twice the average value of \((p+1)/n\) should be carefully examined.

10.9.13.5 Cook’s Distance

Cook’s distance is used to identify observations that influence all parameters \(\beta_0,\beta_1,\ldots,\beta_p\). Similar to linear models, the idea is to determine the maximum likelihood estimator \(\widehat{\boldsymbol{\beta}}\) of the parameter \(\boldsymbol{\beta}\) and then calculate the same estimator \(\widehat{\boldsymbol{\beta}}_{(i)}\) obtained by removing the \(i\)-th observation from the dataset. Cook’s distance is then given by \[ C_i=\frac{1}{p+1}(\widehat{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{(i)})^\top {\boldsymbol{X}}^\top{\boldsymbol{V}}{\boldsymbol{X}}(\widehat{\boldsymbol{\beta}}-\widehat{\boldsymbol{\beta}}_{(i)}) \] which can be interpreted as a distance between \(\widehat{\boldsymbol{\beta}}\) and \(\widehat{\boldsymbol{\beta}}_{(i)}\). If \(C_i\) is large, it indicates that observation \(y_i\) has a significant influence on parameter estimation. However, the calculation of \(C_i\) is computationally expensive because all parameters must be re-estimated once \(y_i\) is removed. In actuarial science, this calculation is often infeasible due to a large number of observations. To avoid re-fitting the model \(n\) times, various approximations are often used in practice.
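A sketch for the Poisson model with canonical log link is given below: in that case the weight matrix is simply \(\text{diag}(\widehat{\boldsymbol{\lambda}})\), and the exact Cook's distances are replaced by a common one-step approximation based on the leverages and Pearson residuals (an assumption of this sketch, in the spirit of the approximations mentioned above).

```python
import numpy as np

def poisson_leverage_and_cook(X, n_obs, lam_hat):
    """Leverages h_ii and approximate Cook's distances for a Poisson GLM (log link)."""
    w_sqrt = np.sqrt(lam_hat)
    Xw = X * w_sqrt[:, None]                             # V^{1/2} X with V = diag(lambda_hat)
    inv_xtwx = np.linalg.inv(Xw.T @ Xw)                  # (X' V X)^{-1}
    h = np.einsum("ij,jk,ik->i", Xw, inv_xtwx, Xw)       # diagonal of the hat matrix
    r_pearson = (n_obs - lam_hat) / np.sqrt(lam_hat)
    p1 = X.shape[1]                                      # p + 1 estimated parameters
    cook = r_pearson ** 2 * h / (p1 * (1.0 - h) ** 2)    # one-step approximation, no refitting
    return h, cook
```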

10.9.14 The Practice of Generalized Linear Models

10.9.14.1 The Importance of Choosing the Exponential Subfamily

The few examples mentioned in the introduction are often sufficient in practice: modeling claim costs with a Gamma regression model and modeling counts with a Poisson regression model. However, the choice of the subfamily is not neutral when it comes to pricing.

Consider the following (simplified) example based on three observations:

Observation \(i\) 1 2 3
Response variable \(Y_i\) (claim cost) 1 2 8
Explanatory variable \(X_i\) (vehicle power) 1 2 3

We aim to fit a generalized linear model, i.e., \(g\left(\mathbb{E}[Y]\right) = \alpha + \beta X\) where \(g\) is a link function. Figure ?? illustrates the influence of the choice of the probability distribution (for the canonical link function), considering \(Y_i\) to follow a normal, a Poisson, and a Gamma distribution, respectively.

While the three distributions provide similar results at the edges (for values of \(X\) close to \(1\) or \(3\)), they exhibit significantly different behavior elsewhere. In particular, compared to the other two distributions, the Gamma distribution implies higher claim costs (at equal power) for low- and high-powered vehicles, while giving lower costs for vehicles of average power. Since this cost drives the premium, and if the true model is a Poisson distribution, the interpretation of this graph is as follows:

  • With a normal model, vehicles of average power pay for the risk of the others, facing a premium higher than their actual risk.
  • With a Gamma model, vehicles of high or very low power pay for the risk of the vehicles of average power, which are underpriced.

Therefore, even though this analysis should be nuanced by considering the impact of the link function, it’s essential to note that the choice of the distribution for the response variable has a substantial impact on the resulting pure premiums.
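The comparison can be reproduced in spirit with the statsmodels package (assumed to be available), fitting the three families with their canonical links to the three observations of the table; the sketch below is illustrative and does not claim to reproduce the exact figure of the book.

```python
import numpy as np
import statsmodels.api as sm

# toy data from the table above: claim cost versus vehicle power
y = np.array([1.0, 2.0, 8.0])
X = sm.add_constant(np.array([1.0, 2.0, 3.0]))

families = {
    "Normal (identity link)": sm.families.Gaussian(),
    "Poisson (log link)": sm.families.Poisson(),
    "Gamma (inverse link)": sm.families.Gamma(),
}
grid = sm.add_constant(np.linspace(1.0, 3.0, 9))
for name, family in families.items():
    fit = sm.GLM(y, X, family=family).fit()
    # fitted pure premiums along the power axis
    print(name, np.round(fit.predict(grid), 2))
```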

10.9.14.2 Total Cost Modeling: The Importance of Zero-Claim Contracts

Suppose the variable of interest \(Y\) represents the total annual claim amount of each policy in the portfolio. Since a large number of policies have no claims, the variable \(Y\) will be zero for most observations. A Gamma distribution (for example) cannot capture this type of behavior (see Chapter 6 of Volume 1 for the difference between collective and individual models).

The Tweedie distribution allows for modeling this behavior by mixing a point mass at \(0\) with a continuous distribution supported on \({\mathbb{R}}^+\). The distribution of \(Y\) is then a compound Poisson distribution, \[ Y \sim \mathcal{CP}oi\left(\frac{\mu^{2-\gamma}}{\phi(2-\gamma)}, \mathcal{G}am\left(-\frac{2-\gamma}{\phi(1-\gamma)},\phi(2-\gamma)\mu^{\gamma-1}\right)\right), \] where \(1<\gamma<2\). This results in a variance of the form \(\mathbb{V}[Y]=\phi \mu^\gamma\), i.e., a variance function \(V(\mu)=\mu^\gamma\). The Poisson model is recovered when \(\gamma \rightarrow 1\), and a Gamma distribution is obtained when \(\gamma \rightarrow 2\). It is even possible to obtain a much broader class of distributions, including cases where \(\gamma>2\), by considering stable distributions.

Remark. These models were used in automobile insurance in (Jørgensen and Paes De Souza 1994), where a value of \(\gamma\) close to 1.5 was obtained.
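With statsmodels (assumed available), a Tweedie GLM with a variance power close to 1.5 can be fitted as sketched below; the data are simulated crudely (one severity draw per policy multiplied by the claim count) purely to produce a response with many exact zeros.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
X = sm.add_constant(rng.normal(size=n))
counts = rng.poisson(np.exp(-1.5 + 0.4 * X[:, 1]))      # claim counts
severity = rng.gamma(shape=2.0, scale=500.0, size=n)    # rough severity proxy
y = counts * severity                                   # total cost: zero for most policies

# Tweedie family with variance power 1.5 and its default log link
fit = sm.GLM(y, X, family=sm.families.Tweedie(var_power=1.5)).fit()
print(fit.params)
```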

10.10 Generalized Additive Models (GAMs)

10.10.1 Principle

Just as additive models allowed for the incorporation of nonlinear effects of explanatory variables without specifying their form in advance when the response variables were Gaussian, Generalized Additive Models (GAMs) offer the same extension for Poisson, Binomial, and Gamma regression models.

As discussed earlier, the maximum likelihood estimates of the parameters \(\beta_0, \beta_1, \ldots, \beta_p\) involved in the linear score \(\eta\) of a GLM can be obtained by fitting a linear model to pseudo-observations (using weighted least squares). This same approach is used in GAMs. Specifically, we define pseudo-observations as:

\[ z_i^{(k)} = \widehat{\eta}_i^{(k)} + \frac{y_i - \widehat{\mu}_i^{(k)}}{\widehat{D}_i^{(k)}} \]

Where \(\widehat{\mu}_i^{(k)} = g^{-1}(\widehat{\eta}_i^{(k)})\) and \(\widehat{\eta}_i^{(k)} = \widehat{c} + \sum_{j=1}^p \widehat{f}_j^{(k)}(x_{ij})\) for \(i = 1, 2, \ldots, n\). Additionally:

\[ \widehat{D}_i^{(k)} = \frac{\partial}{\partial \eta} g^{-1}(\widehat{\eta}_i^{(k)}) \]

Each pseudo-observation \(z_i^{(k)}\) is associated with weights:

\[ \pi_i^{(k)} = \left(\frac{\widehat{D}_i^{(k)}}{\widehat{\sigma}_i^{(k)}}\right)^2 \]

Where:

\[ \widehat{\sigma}_i^{(k)} = \sqrt{\frac{\phi V\big(g^{-1}(\widehat{\eta}_i^{(k)})\big)}{\omega_i}} \]

The new estimation \(\widehat{f}_j^{(k+1)}\) of \(f_j\) is obtained by regressing \(z_i^{(k)}\) on \(\boldsymbol{x}_i\), taking into account the weights \(\pi_i^{(k)}\). The same technique used for estimating functions \(f_1(\cdot), \ldots, f_p(\cdot)\) in additive models is applied here.

Formalized as an algorithm, the estimation in GAMs is carried out as follows:

  • Initialization: \(\widehat{c} \gets g(\overline{y})\) and \(\widehat{\boldsymbol{f}}_j^{(0)} \gets 0\), for \(j = 1, 2, \ldots, p\).
  • Cycle: For \(k = 1, 2, \ldots\), we construct pseudo-observations \(z_i^{(k)}\) and associate them with weights \(\pi_i^{(k)}\). Then, we fit \(z_i^{(k)}\) to \(c + \sum_{j=1}^p f_j(x_{ij})\) using an additive model, as described earlier:
    • Initialize \(\widehat{c} \gets \overline{z}^{(k)}\) and \(\widehat{\boldsymbol{f}}_j^{(0)} \gets \widehat{\boldsymbol{f}}_j^{(k)}\)
    • Reevaluate, successively for \(j=1,2,\ldots,p\), \(\widehat{\boldsymbol{f}}_j^{(1)} \gets \boldsymbol{H}_{\lambda_j}\left(\boldsymbol{z}^{(k)}-(\overline{z}^{(k)},\ldots,\overline{z}^{(k)})^\top- \sum_{s<j}\widehat{\boldsymbol{f}}_s^{(1)}-\sum_{s>j}\widehat{\boldsymbol{f}}_s^{(0)}\right)\)
    • Iterate the previous step and stop when the sum of squared residuals ceases to decrease.
  • Stopping Criterion: Stop when variations in \(f_j\) become negligible.

As we can see, each step of the iterative algorithm leading to the estimates of the functions \(f_j(\cdot)\) requires a complete backfitting to review the estimation of these functions based on the pseudo-variables obtained at that stage.
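The sketch below illustrates the local scoring idea on a toy Poisson "GAM" with a single smooth term and log link, where the smoother \(\boldsymbol{H}_{\lambda_j}\) is replaced, for simplicity, by a crude weighted kernel smoother; everything in it (smoother, bandwidth, simulated data) is an illustrative assumption.

```python
import numpy as np

def kernel_smooth(x, z, weights, bandwidth=5.0):
    """Crude weighted Nadaraya-Watson smoother, standing in for a spline or loess smoother."""
    out = np.empty_like(z)
    for j, x0 in enumerate(x):
        k = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2) * weights
        out[j] = np.sum(k * z) / np.sum(k)
    return out

def poisson_gam_one_smooth(x, n_obs, n_iter=20, bandwidth=5.0):
    """Local scoring for a Poisson model with a single smooth term f(x) and log link.

    Pseudo-observations z = eta + (y - mu)/mu and weights pi = mu, since for the
    log link d mu / d eta = mu and V(mu) = mu.
    """
    eta = np.full_like(x, np.log(n_obs.mean()))      # initialization at the overall mean
    for _ in range(n_iter):
        mu = np.exp(eta)
        z = eta + (n_obs - mu) / mu                  # pseudo-observations
        eta = kernel_smooth(x, z, weights=mu, bandwidth=bandwidth)
    return np.exp(eta)                               # fitted claim frequencies

# illustrative data (assumed): claim counts by age of driver
rng = np.random.default_rng(2)
age = np.sort(rng.uniform(18.0, 80.0, size=500))
n_obs = rng.poisson(0.05 + 0.25 * np.exp(-0.03 * (age - 18.0))).astype(float)
print(poisson_gam_one_smooth(age, n_obs)[:5])
```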

10.10.1.1 Local Maximum Likelihood

Another approach is to directly extend the Loess method to Generalized Additive Models (GAMs). In this case, we will use local adjustments within GLMs based on maximum likelihood, similar to how the Loess method operated with Gaussian log-likelihood.

More precisely, given a point \(x\) where we want to estimate the response, we determine a neighborhood \(\mathcal{V}(x)\) and weights \(w_i(x)\) just like in Loess. The goal is to solve the likelihood equations:

\[ \boldsymbol{X}^\top\boldsymbol{\Omega}(x)\big(\boldsymbol{y}-\boldsymbol{\mu}(\widehat{\beta}(x))\big)=\boldsymbol{0} \]

Where the diagonal matrix \(\boldsymbol{\Omega}(x)\) includes the weights \(w_1(x), \ldots, w_n(x)\).

::: {.example}[Logistic Regression]

Consider observations \((y_i, x_i)\) where \(y_i\) follows a \(\mathcal{B}in(n_i, q_i)\) distribution, where \(q_i\) is a function of \(x_i\). If we want to estimate the \(q_i = q(x_i)\), we will determine the neighborhood \(\mathcal{V}(x)\) and estimate the function \(f\) in:

\[ q(x) = \frac{\exp\big(f(x)\big)}{1+\exp\big(f(x)\big)} \]

By maximizing:

\[ L\big(\beta_0(x), \beta_1(x)\big) = \sum_{\xi \in \mathcal{V}(x)} w_\xi(x)\ln\ell_\xi \big(\beta_0(x), \beta_1(x)\big) \]

Where:

\[ \ell_\xi\big(\beta_0(x), \beta_1(x)\big) = \left(\begin{array}{c}n_\xi \\y_\xi\end{array}\right)\big(q_\xi(x)\big)^{y_\xi}\big(1-q_\xi(x)\big)^{n_\xi-y_\xi} \]

And:

\[ q_\xi(x) = \frac{\exp\big(\beta_0(x)+\beta_1(x)\xi\big)}{1+\exp\big(\beta_0(x)+\beta_1(x)\xi\big)}. \]

Finally, the fitted value of \(q(x)\) is obtained as:

\[ \widehat{q(x)} = \frac{\exp\big(\widehat{f}(x)\big)}{1+\exp\big(\widehat{f}(x)\big)} \]

Where:

\[ \widehat{f}(x) = \widehat{\beta}_0(x) + \widehat{\beta}_1(x)x. \]

:::

10.10.2 In Practice…

Most often, actuaries have a large number of categorical variables and only a few continuous variables. Suppose the explanatory variables \(\boldsymbol{x}_i\) related to the insured individual \(i\) are reorganized as:

\[ \boldsymbol{x}_i=(1,x_{i1},\ldots,x_{if},x_{i,f+1},\ldots,x_{i,f+c}) \]

Where \(x_{i1}, \ldots, x_{if}\) are the \(f\) binary variables used to encode categorical variables describing the insured individual \(i\), and \(x_{i,f+1}, \ldots, x_{i,f+c}\) are the \(c\) continuous variables related to this individual. The linear predictor \(\eta_i\) for this individual will take the form:

\[ \eta_i=c+\sum_{j=1}^f\beta_jx_{ij}+\sum_{j=1}^cf_j(x_{i,f+j}). \]

One can go even further and introduce interaction effects between a categorical variable and a continuous variable (typically, between gender and age of the insured individual in the context of auto insurance).

10.11 Practical Case of Auto Insurance Pricing

10.11.1 Portfolio Description

In this section, we consider an automobile insurance portfolio, which we analyze using the widely used SAS software within insurance companies. The methods proposed can, of course, be adapted to other types of coverage (damage insurance for vehicles, fire insurance for buildings, theft insurance, travel cancellation insurance, etc.), taking into account specific characteristics of these contracts. The data comes from the portfolio of a large insurance company operating in Belgium. We have observations related to 158,061 policies observed during the year 1997. The variables included in the dataset are listed in Table 10.3.

Right from the start, let's clarify some important points. First of all, except for the variables NSIN, IND, and CTOT, which describe the claims, all the others are variables known a priori to the insurer (meaning the insurer can use them to personalize the premium amount charged to the policyholder for risk coverage). Clearly, risk characteristics of the insured that are unknown at the beginning of the period cannot be used in the a priori pricing grid. If necessary, the insurer will use them for other purposes (as we will see later). Among the explanatory variables listed in Table 10.3, we distinguish different types:

  1. Those related to the policyholder (AGES, SEXE, and AGGLOM in our example).
  2. Those related to the insured vehicle (CARB, KW, SPORT, and USAGE in our example).
  3. Those related to the coverage chosen by the policyholder (GARACCESS and FRAC in our example).

These variables are described in detail below. Note that information related to the insureds' past claims (such as the number and/or cost of claims, or a summary provided by the position on a bonus-malus scale or by a reduction-increase coefficient applied to the premium) should not normally be incorporated into the a priori pricing scheme. Indeed, the insurer will adjust the premium amount based on the claims reported by the insured using credibility theory or bonus-malus systems (described in the following chapters). Therefore, including past claims in the a priori pricing would result in a double penalty for policyholders who reported claims and an underpricing for those who did not report any claim to the company.

Table 10.3: Variables included in the dataset
Variable Description
AGES Age of the policyholder
AGGLOM Type of residential area of the policyholder
CARB Fuel type
CTOT Total cost of claims (in euros)
DUR Coverage duration (in days)
FRAC Premium fraction
GARACCESS Coverage extent
IND Claim code
KW Vehicle power
NSIN Number of claims
SEXE Policyholder’s gender
SPORT Sportiness of the vehicle
USAGE Vehicle usage

Before starting the modeling of the number and cost of claims, it is essential to have a good understanding of the portfolio you are working with. Therefore, take the time to describe in detail the different rating variables and examine the composition of the portfolio to be analyzed.

We are working here with a policy dataset. Such a dataset has as many rows as policies in the portfolio during the considered period. It summarizes the information available at the beginning of the period for each of the contracts and describes the claims related to them. In addition to this dataset, the insurer also has a claims dataset where all the characteristics of the claims made by the policies in the portfolio are recorded (circumstances in which these claims occurred, people involved, presence of bodily injuries, etc.). These two datasets are linked through the policy number.

Remark. The phase of building the database is crucial for the premium that will result from the analysis: how can you arrive at a correct premium using erroneous, incomplete, or outdated data? Therefore, it is essential for the actuary to be involved in building the database and not delegate the task without control to the IT department or a young intern fresh out of school. The data extraction and cleaning phase represents a significant portion of the study time.

It is important:

  1. That the data be homogeneous, meaning that it concerns policies from the same portfolio with similar conditions (if not, explanatory variables differentiating the categories of insureds or identifying the company that issued the policies should be introduced).
  2. When merging data from multiple sources, to systematically identify their origin (for example, when merging databases related to policies sold by brokers or directly, a variable indicating the distribution channel through which the policy was issued should be added).
  3. To carefully examine missing data and not ignore it under any circumstances. Often, omitted information reveals certain characteristics of the policy. Therefore, a level indicating that information is missing should be added to categorical variables. Only when you have ensured that the omissions are random can you neglect the policies with missing information.

Most companies are now aware of the need to have as many data as possible and of good quality. Building and maintaining databases are among the most important concerns of large financial groups. In this regard, the widespread use of electronic transmission of information (the broker or agent enters data to receive an online price quote) helps avoid encoding errors or missing values (since the quote is provided only if all fields are duly completed).

10.11.2 Variables Describing Claims

10.11.2.1 Number of Claims (NSIN)

This is the number of claims reported by the insured to the company, not the number of claims caused by the insured during the year. The insured may decide (rightly or wrongly) to compensate the injured third party directly in the case of a minor loss (such as a scratch on an old car door, for which a young owner might well prefer a 100 Euro note, to be spent on student festivities, rather than having the door repaired). It is obvious that the number of claims settled directly by the insured depends on the insurer's a posteriori pricing policy (understood as the way claims that are reported and lead to compensation by the company are penalized). Therefore, you should be particularly careful when considering changes to the a posteriori mechanisms for personalizing premium amounts. We will revisit these issues in the following chapters.

In auto liability insurance, NSIN deserves special attention, as claim costs often do not lend themselves to detailed segmentation. Additionally, NSIN will play a central role in the a posteriori personalization of premium amounts (most commercial systems, such as bonus-malus mechanisms, only incorporate the number of claims into the formula for reevaluating the premium during the contract).

The significant advantage of NSIN is that it is generally known with precision by the company. Except for claims occurring at the end of the period, which will probably only be reported to the insurer at the beginning of the next period, claims are usually reported promptly to the company, either because of a contractual clause requiring reporting within a certain timeframe under penalty of loss of coverage, or because the insured wishes to be compensated as quickly as possible.

The average frequency of claims for the portfolio is 12.45% per year. Table 10.4 shows that the maximum number of claims reported by a policyholder is 5. More specifically, 140,276 (or 88.75%) policyholders reported no claims, 16,085 (or 10.18%) reported 1 claim, 1,522 (or 0.96%) reported 2 claims, 159 (or 0.10%) policyholders reported 3 claims, 17 policyholders (or 0.01%) reported 4 claims, and 2 policyholders (less than 0.01%) reported 5 claims during the year 1997.

Table 10.4 describes the fit of the observed distribution of NSIN by a Poisson distribution with a parameter \(\lambda\) identical for all policies. The maximum likelihood estimator of the parameter is \(\widehat{\lambda} = 0.1245\). It is clear that the fit is very poor and is rejected without hesitation by a chi-squared test (the observed chi-squared statistic is 783.75, with a p-value less than \(10^{-4}\)). If you observe the sequence of signs:

\[ \widehat{\Pr[\texttt{NSIN}=k]} - \Pr[\mathcal{P}oi(\widehat{\lambda})=k],\hspace{2mm}k\in{\mathbb{N}}, \]

given in the last column of Table 10.4, you can clearly see the sequence +,-,+,+,… This sequence is not random. It is due to a very important property of Poisson mixtures known as Shaked's Two Crossings Theorem (see Property 3.7.6).

The sequence of signs thus supports the hypothesis of a mixed Poisson distribution for NSIN at the portfolio level. This indicates that the portfolio is heterogeneous and justifies the a priori differentiation of the policyholders.

Table 10.4: Observed claims in the portfolio and adjustment by a Poisson distribution
Number of Claims Number of Observed Policies Number of Predicted Policies Sign
0 140,276 139,553.33 +
1 16,085 17,379.16 -
2 1,522 1,082.15 +
3 159 44.92 +
4 17 1.40 +
5 2 0.04 +
6+ 0 0.00
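The fit reported in Table 10.4 can be checked with a few lines of code from the observed counts alone, as in the sketch below; note that the figures of the table are computed policy by policy and take exposures into account, so this crude portfolio-level recomputation only approximates them.

```python
import numpy as np
from scipy.stats import poisson, chi2

counts = np.array([140_276, 16_085, 1_522, 159, 17, 2])   # policies with 0, 1, ..., 5 claims
k = np.arange(len(counts))
n_pol = counts.sum()
lam_hat = (k * counts).sum() / n_pol                      # about 0.1245 claim per policy

expected = n_pol * poisson.pmf(k, lam_hat)                # predicted numbers of policies
signs = np.where(counts - expected >= 0, "+", "-")        # signs of the deviations
x2 = np.sum((counts - expected) ** 2 / expected)          # chi-squared statistic
print(np.round(expected, 2), signs, round(x2, 2), chi2.sf(x2, df=len(counts) - 2))
```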


10.11.2.2 Claims Occurrence (IND)

This is a binary variable indicating whether the insured has reported at least one claim during the year, i.e., \[ \texttt{IND} = \mathbb{I}[\texttt{NSIN} \geq 1] = \left\{ \begin{array}{l} 1, \text{ if }\texttt{NSIN} \geq 1,\\ 0, \text{ otherwise}. \end{array} \right. \]

Out of the 158,061 insureds in the portfolio, 140,276 (or 88.75%) did not report any claims, and 17,785 (or 11.25%) triggered the insurer’s coverage during the year.

10.11.2.3 Total Claim Cost CTOT

This represents the claims expense for the year 1997, i.e., the total cost (in euros) caused by the insured and borne by the company. It is the sum of payments, reserves, and claims-handling expenses for claims that occurred in 1997. The monetary risk therefore has two components: the occurrence component IND, which indicates whether the insured triggered the insurer's coverage at least once during the period under consideration, and the total cost of the reported claims (zero if \(\texttt{IND}=0\)). Therefore, CTOT can be represented as \[ \texttt{CTOT} = \texttt{IND} \times \texttt{CTOT}_+ \] where \(\texttt{CTOT}_+\) has the same distribution as \(\texttt{CTOT}\) given \(\texttt{CTOT}>0\), or equivalently as \(\texttt{CTOT}\) given \(\texttt{IND}=1\). The variable \(\texttt{CTOT}_+\) is strictly positive, whereas \(\texttt{CTOT}\) is mostly zero (zero for 88.75% of the insureds in our example).

If we consider the cost of each claim individually, we have \[ \texttt{CTOT} = \sum_{k=1}^{\texttt{NSIN}} C_k \] where \(C_k\) is the cost (assumed strictly positive) of the \(k\)th claim (with the convention \(\texttt{CTOT}=0\) if \(\texttt{NSIN}=0\)). The average cost \(\overline{C}\) of a claim affecting a policy is then given by \[ \overline{C} = \frac{\texttt{CTOT}_+}{\texttt{NSIN}}, \text{ if }\texttt{NSIN}>0, \] and \(\overline{C}=0\) if \(\texttt{NSIN}=0\). When studying claim costs, it is important to take into account the number of claims generated by the insured; therefore, \(\overline{C}\) is often analyzed by introducing \(\texttt{NSIN}\) as a weight.

Remark. Another variable often used is the S/P ratio by categories of policies (i.e., the total claims cost caused by this category of insureds divided by the collected premium amount). This variable depends on the tariff in force when the observations were collected. It introduces the old tariff into the development of the new one, and this usage is not recommended.

We know that only 17,785 insureds triggered the insurer's coverage during the year 1997. Let's focus on the claims experience of these insureds (i.e., on the variable \(\texttt{CTOT}_+\)). The skewness coefficient is 84.31, indicating a strongly right-skewed distribution. More precisely, 25% of the policies that led to indemnity payments by the company had claims with amounts less than 145.02 Euros, 50% had amounts less than 566.51 Euros, and 75% had amounts less than 1,450.67 Euros. The mean of \(\texttt{CTOT}_+\) is 1,807.46 Euros. Therefore, the empirical pure premium is \[ \widehat{\mathbb{E}[\texttt{CTOT}]} = \widehat{\mathbb{E}[\texttt{IND}]}\times\widehat{\mathbb{E}[\texttt{CTOT}_+]} = 203.38 \text{ Euros}. \] The coefficient of variation of \(\texttt{CTOT}_+\) is 982.61%, corresponding to a standard deviation of 17,760.39 Euros.

The four policies that caused the highest claim amounts cost 452,839 Euros, 499,601 Euros, 499,917 Euros, and 1,989,568 Euros. It is often necessary to cap claims in order to analyze their costs. A common approach is to cap them at a high quantile, for example \(q_{0.99}\) = 19,927.4 Euros.

10.11.3 Measuring Exposure to Risk

This represents the number of days the policy was in force during the year 1997. It will be used to measure exposure to risk. It allows us to account for the fact that a claim reported by a policy in force for 1 month is a worse sign for the insurer than a claim related to a policy in force for the entire year.

Remark. Note in passing that exposure to risk should ideally be measured by the number of kilometers driven rather than the number of days the policy was in force (the vehicle could well remain in the garage during certain periods, therefore not exposed to risk). However, measuring the annual mileage is extremely difficult, so most European insurers have abandoned the idea of introducing annual mileage into their pricing criteria and have preferred to use “proxies” such as vehicle usage (vehicles used for professional purposes likely cover more kilometers per year) or fuel type (the choice of diesel is often justified by covering longer distances).

In our portfolio, the average coverage duration is 323.93 days. Some policies were in force for only one day. Figure ?? gives an idea of the coverage periods for the different policies in the portfolio. We can see a majority of policies covered for the entire year and a more or less uniform distribution of coverage durations less than a year. Typically, policies with exposure to risk of less than a year are new policies and cancellations. The portfolio we are studying is relatively stable.

10.11.4 Characteristics of the Policyholder

10.11.4.1 Variable AGES

This is a quantitative variable with integer values representing the age of the policyholder (in completed years) as of January 1, 1997. More than age itself, it is driving experience that we hope to capture through this variable. An equivalent variable (with a high correlation between them) often more acceptable to clients is the length of time the driver’s license has been held.

Policies often include the concept of the habitual driver of the vehicle, and the personal characteristics determining the premium amount are those of the habitual driver mentioned in the specific conditions, not those of the policyholder. It is worth noting that the personal characteristics used here (such as AGES) refer to the policyholder, who is not necessarily the driver of the vehicle. Therefore, conclusions obtained from these data should be considered with caution.

Now let’s examine the age structure of insureds in our portfolio, described in Figure ??. Let’s calculate the observed claim frequency by age. This amounts to a Poisson regression model with a single explanatory variable, namely the age of the insured treated as a categorical variable. If \(\lambda_k\) represents the annual claim frequency of insureds aged \(k\) years, the likelihood associated with the data is \[ \mathcal{L} = \prod_{i=1}^n \exp(-d_i\lambda_{\texttt{AGES}_i}) \frac{(d_i\lambda_{\texttt{AGES}_i})^{n_i}}{n_i!} \] where \(\texttt{AGES}_i\) represents the age of insured \(i\), \(d_i\) is the exposure duration for policy \(i\), and \(n_i\) is the number of claims related to this policy. Maximizing the log-likelihood is equivalent to solving the system \[ \frac{\partial}{\partial\lambda_k}\left\{-\sum_{i|\texttt{AGES}_i=k}d_i\lambda_k+\sum_{i|\texttt{AGES}_i=k}n_i\ln\lambda_k\right\}=0, \] which provides the maximum likelihood estimators of the claim frequencies at each age, given by \[ \widehat{\lambda}_k = \frac{\sum_{i|\texttt{AGES}_i=k}n_i}{\sum_{i|\texttt{AGES}_i=k}d_i}. \] Therefore, it is necessary to divide the number of claims by the total exposure duration at each age, not by the number of policies. Figure ?? presents \(\widehat{\lambda}_k\) as a function of age \(k\). It is clear that young drivers have a higher claim frequency, approaching 30%. With age, the annual claim frequency tends to decrease, before increasing slightly among the oldest insureds.
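The estimator \(\widehat{\lambda}_k\) is just a claims-over-exposure ratio and can be computed directly; the short Python sketch below illustrates the point with invented data (the columns `age`, `n_claims` and `exposure` stand for \(\texttt{AGES}_i\), \(n_i\) and \(d_i\)).

```python
import pandas as pd

# Hypothetical policy-level data: age, number of claims, exposure (in years).
df = pd.DataFrame({
    "age":      [19, 19, 45, 45, 45, 70],
    "n_claims": [1,  0,  0,  1,  0,  0],
    "exposure": [0.5, 1.0, 1.0, 1.0, 0.8, 1.0],
})

# MLE of the claim frequency at each age: sum of claims / sum of exposures,
# NOT the number of claims divided by the number of policies.
agg = df.groupby("age")[["n_claims", "exposure"]].sum()
freq_by_age = agg["n_claims"] / agg["exposure"]
print(freq_by_age)
```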

Now, let’s see if AGES can also explain certain variations in claim costs. To do this, we restrict the study to insureds who reported at least one claim. The average claim cost by age is described in Figure ??. No particular structure is detected when examining the graph, suggesting that AGES may influence claim frequency but not claim costs.

It should be emphasized that the above analysis is purely marginal. The apparent influence of age on claim frequency could thus be caused by another variable strongly correlated with age (we refer the reader to the comments in Volume I).

10.11.4.2 Variable VOL

This binary qualitative variable indicates whether an antitheft device is installed in the vehicle:

\[ \texttt{VOL}= \left\{ \begin{array}{l} 0,\text{ if the vehicle has no antitheft device},\\ 1,\text{ if the vehicle has an antitheft device}. \end{array} \right. \]

In the portfolio, 122,869 policies (77.91%) do not have an antitheft device, while 34,192 policies (21.68%) have one. Figure ?? shows that policies with vehicles equipped with antitheft devices tend to have a lower claim frequency and lower average claim costs.

10.11.4.3 Variable PERS

This variable describes whether the policyholder has opted for personal insurance coverage, which provides additional coverage for personal injuries:

\[ \texttt{PERS}= \left\{ \begin{array}{l} 0,\text{ if personal insurance is not chosen},\\ 1,\text{ if personal insurance is chosen}. \end{array} \right. \]

In the portfolio, 107,389 policies (68.04%) do not include personal insurance, while 50,672 policies (31.96%) have it. Figure \(\ref{Pers}\) illustrates the influence of personal insurance on claim frequency and average claim costs. Policies with personal insurance coverage tend to have a slightly higher claim frequency but lower average claim costs.

10.11.4.4 Variable RES

This binary variable indicates whether the policyholder has had previous insurance coverage:

\[ \texttt{RES}= \left\{ \begin{array}{l} 0,\text{ if the policyholder has not had previous insurance},\\ 1,\text{ if the policyholder has had previous insurance}. \end{array} \right. \]

In the portfolio, 144,117 policies (91.27%) are for policyholders without previous insurance, while 13,944 policies (8.83%) are for policyholders with previous insurance coverage. Figure \(\ref{Res}\) shows that policyholders with previous insurance tend to have a slightly lower claim frequency but slightly higher average claim costs.

10.11.4.5 Variable PROF

This variable describes the policyholder’s profession, grouped into a number of categories (not detailed here). Policyholders in certain professions exhibit a higher claim frequency. Figure ?? illustrates the influence of the policyholder’s profession on claim frequency and average claim costs.

10.11.4.6 Variable VPP

This binary variable indicates whether the policyholder has opted for additional coverage for voluntary civil liability (VPP stands for “Volontaire de Protection Personnelle” in French):

\[ \texttt{VPP}= \left\{ \begin{array}{l} 0,\text{ if voluntary civil liability coverage is not chosen},\\ 1,\text{ if voluntary civil liability coverage is chosen}. \end{array} \right. \]

In the portfolio, 138,779 policies (87.91%) do not include voluntary civil liability coverage, while 19,282 policies (12.19%) have it. Figure \(\ref{Vpp}\) shows that policies with voluntary civil liability coverage tend to have a slightly higher claim frequency but slightly lower average claim costs.

10.11.5 Vehicle Characteristics

10.11.5.1 Variable MARQUE

This variable represents the car brand (the individual brands are not detailed here). The brand of the vehicle has a noticeable impact on both claim frequency and average claim costs, as illustrated in Figure \(\ref{Marque}\).

10.11.5.2 Variable CARAGE

This variable represents the age of the insured vehicle in completed years. Its influence on claim frequency and average claim costs is illustrated in Figure \(\ref{Carage}\).

10.11.5.3 Variable

This variable represents the driver’s bonus-malus class, which reflects the driver’s claims history and experience. Its influence on claim frequency and average claim costs is illustrated in Figure ??.

10.11.5.4 Variable PUISS

This variable represents the power of the insured vehicle, measured in horsepower. Its influence on claim frequency and average claim costs is illustrated in Figure ??.

10.11.5.5 Variable GARACCESS

This binary qualitative variable describes the extent of coverage chosen by the insured:

\[ \texttt{GARACCESS}= \left\{ \begin{array}{l} 0,\text{ if only liability insurance (RC) is subscribed by the insured},\\ 1,\text{ if, in addition to liability insurance, the insured has subscribed}\\ \hspace{10mm}\text{for additional coverage, such as}\\ \hspace{10mm}\text{property damage or theft, for example}. \end{array} \right. \]

Two types of behavior can be expected here:

  1. The theory of moral hazard applies, and more claims are observed from individuals with more coverage. Similarly, according to signaling theory, insured individuals who have subscribed only to liability insurance have done so knowingly, knowing that they are less risky. Therefore, the 0 category of GARACCESS should appear as a factor decreasing claims.
  2. The subscription to more extensive coverage reflects a stronger aversion to risk on the part of the individuals involved. This would result in particularly cautious driving behavior and, therefore, fewer claims. In this case, the 0 category of GARACCESS becomes an aggravating factor.

Actuaries, being pragmatic by nature, do not try to determine which mechanism prevails and limit themselves to observing, based on the data, the effect of the different levels of GARACCESS.

In our portfolio, 93,194 insured individuals have exclusively subscribed to liability insurance (representing 58.96% of the portfolio), while the other 64,867 (41.04% of the portfolio) have subscribed to additional coverage. In terms of the influence of the extent of coverage on claim frequency, Figure ?? shows the observed claim frequency and the average claim cost per claim as a function of the extent of coverage. It can be seen that both claim frequency and average claim cost per claim are higher for policies that only include liability insurance. This technically justifies the reductions in liability insurance premiums granted to insured individuals who subscribe to other optional coverages.

10.11.6 Interaction Between Rating Variables

Often, an interaction between the age and gender of the insured is observed. By interaction, it is meant here that the influence of the insured’s age on claim frequency varies depending on whether one is male or female. In our portfolio, if we distinguish between males and females, we obtain the claim frequencies by age described in Figure ??. It can be observed that young males have a slightly higher claim frequency compared to young females, and then the claim frequencies are quite similar. We do not continue the modeling of interactions here.

10.11.7 Initial Screening of Rating Variables

10.11.7.1 Chi-squared Test of Independence

An effective way to perform an initial screening among the variables at our disposal is to conduct chi-squared tests based on contingency tables. Because of the cell counts involved, it is better to work with IND than with NSIN. If we cross-tabulate NSIN with SEXE, many expected cell counts are less than 5, making a chi-squared test based on such a contingency table invalid. However, if we cross-tabulate IND and SEXE, we obtain the contingency table described in Table ?? (the expected cell counts under the independence hypothesis are indicated in parentheses), on which we can base the chi-squared test. The observed value of the chi-squared statistic for independence is 20.33 (for 1 degree of freedom), which gives a \(p\)-value less than \(10^{-4}\). Thus, there is a strong association between the gender of the insured and whether or not they have had a claim.

Table 10.5: Contingency table cross-tabulating IND and SEXE (expected counts under independence in parentheses)
IND Female Male Total
No claim 36,722 (36,972) 103,554 (103,304) 140,276
One or more claims 4,937 (4,687.5) 12,848 (13,098) 17,785
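The test can be reproduced from the observed counts of Table 10.5; a minimal sketch with `scipy`, where the continuity correction is switched off so as to recover the Pearson statistic of 20.33 quoted above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from Table 10.5: rows = (no claim, at least one claim),
# columns = (female, male).
observed = np.array([[36722, 103554],
                     [4937,  12848]])

# correction=False gives the classical Pearson statistic (no Yates correction).
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof, p_value)   # approximately 20.3, 1 degree of freedom, p < 1e-4
print(expected)             # expected counts under independence
```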

The rejection of the independence hypothesis indicates that the gender of the policyholder influences the variable IND and, therefore, also the variable NSIN. However, non-rejection would not allow us to conclude anything, partly for reasons related to the error of the second kind (which is not controlled), and partly because a lack of influence on IND does not imply a lack of influence on NSIN (since the explanatory variable can still affect how claims are distributed among the categories 1, 2, … claims).

By successively cross-tabulating all rating variables with IND, we obtain the results shown in Table 10.6. It can be seen that USAGE does not seem to influence IND. Therefore, we could exclude this variable from further analysis.

Table 10.6: Results of chi-squared tests on contingency tables cross-tabulating rating variables and IND
Variable Observed Value of Q Statistic Degrees of Freedom p-value
FRAC 375.47 3 \(<.0001\)
CARB 139.22 1 \(<.0001\)
SPORT 6.73 1 0.0094
GARACCESS 22.60 1 \(<.0001\)
AGGLOM 202.04 2 \(<.0001\)
USAGE 0.07 1 0.7956
KW 23.40 2 \(<.0001\)

10.11.7.2 True and Apparent Dependencies

Note that the rejection of the independence hypothesis between IND and SEXE indicates that the gender of the policyholder influences the probability of having at least one claim during the period. However, this influence can be either a true dependency, where the probability of occurrence genuinely depends on the gender of the policyholder, or an apparent dependency, where the probability of having at least one claim depends on a variable correlated with gender (whether hidden, like aggressiveness while driving, or observable, like the age of the policyholder, if age structures differ between males and females). In the latter case, the dependency between the gender of the policyholder and having at least one claim would disappear if the third variable were taken into account.

10.11.8 Analysis of Claim Frequencies

10.11.8.1 Poisson Regression Model for the Number of Claims

Let \(N_i\) be the number of claims reported by insured individual \(i\) during the year 1997, where \(i=1,2,\ldots,n\). We denote \(d_i\) as the exposure to risk for individual \(i\). The Poisson model assumes that the conditional distribution of \(N_i\) given \(\boldsymbol{x}_i\) follows a Poisson distribution. Therefore, we need to specify its mean \(\mathbb{E}[N_i|\boldsymbol{x}_i]\). Since the mean is strictly positive, it is typically modeled as an exponential linear form:

\[\begin{equation} \mathbb{E}[N_i|\boldsymbol{x}_i]=d_i\exp(\boldsymbol{\beta}^\top\boldsymbol{x}_i), \quad i=1,2,\ldots,n, \tag{10.28} \end{equation}\]

where \(\boldsymbol{\beta}\) is a vector of unknown regression coefficients. Thus, we work with the model:

\[\begin{equation} N_i\sim\mathcal{P}oi\left(d_i\exp(\boldsymbol{\beta}^\top\boldsymbol{x}_i)\right), \quad i=1,2,\ldots,n. \tag{10.29} \end{equation}\]

When all variables are categorical, each insured individual is represented by a vector \(\boldsymbol{x}_i\) where the components are either 0 or 1. In this case, the annual frequency \(\lambda_i=\exp(\boldsymbol{\beta}^\top\boldsymbol{x}_i)\) appears as a product of enhancement or reduction factors relative to the frequency of the reference individual in the portfolio. More precisely,

\[\begin{eqnarray*} \lambda_i&=&\exp(\boldsymbol{\beta}^\top\boldsymbol{x}_i)\\ &=&\exp(\beta_0)\prod_{j=1}^p\exp(\beta_jx_{ij})\\ &=&\exp(\beta_0)\prod_{j|x_{ij}=1}\exp(\beta_j). \end{eqnarray*}\]

Therefore, \(\exp(\beta_0)\) is the annual frequency of the reference individual in the portfolio, while each of the factors \(\exp(\beta_j)\) represents the influence of a segmentation criterion (if \(\beta_j>0\), individuals with this characteristic will experience a premium increase compared to the reference premium \(\exp(\beta_0)\), whereas \(\beta_j<0\) indicates a premium reduction).

The parameters have the following interpretation: considering the characteristic coded by the \(j\)-th binary variable,

\[ \frac{\mathbb{E}[N_i|\text{ characteristic present}]}{\mathbb{E}[N_i|\text{ characteristic absent}]}=\exp(\beta_j). \]

Based on this equation, \(\exp(\beta_j)\) is the factor by which we need to multiply the claim frequency of individuals not presenting the characteristic coded by the \(j\)-th binary variable to obtain the claim frequency of individuals presenting this characteristic.
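The fits reported below were obtained with a standard GLM procedure (SAS in this chapter); purely as an illustration, the following Python/`statsmodels` sketch fits model (10.28)-(10.29) on invented data, passing the log-exposure as an offset so that \(d_i\) multiplies the exponential score. All column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical policy-level data: claim count, exposure, two binary rating factors.
df = pd.DataFrame({
    "n_claims": [0, 1, 0, 2, 0, 1],
    "exposure": [1.0, 0.5, 1.0, 1.0, 0.3, 1.0],
    "diesel":   [1, 0, 1, 0, 0, 1],
    "urban":    [0, 1, 1, 1, 0, 0],
})

X = sm.add_constant(df[["diesel", "urban"]])            # design matrix with intercept
poisson_res = sm.GLM(df["n_claims"], X,
                     family=sm.families.Poisson(),
                     offset=np.log(df["exposure"])).fit()

# exp(beta_0) is the annual frequency of the reference insured;
# exp(beta_j) is the multiplicative relativity attached to the j-th characteristic.
print(np.exp(poisson_res.params))
```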

10.11.8.2 Treatment of Age with a Generalized Additive Model

We begin with a Poisson regression model of the GAM type to detect the influence of the variable AGES. Specifically, the effect of the policyholder’s age on the annual claim frequency is represented by a function \(f_{\texttt{AGES}}\) decomposed using cubic splines and then estimated from the available statistics. Table 10.7 shows the fit of the linear part of the model. We learn that the variable SPORT is not relevant (\(p\)-value of 0.6981). Similarly (\(p\)-value of 0.3612), the variable USAGE could be removed from the model. Table 10.8 compares the models with and without the nonlinear part in AGES. We can see that the influence of this continuous rating variable is clearly nonlinear (\(p\)-value less than 0.0001).

Table 10.7: Parameter estimation for the linear part of the Poisson GAM model
Parameter Estimate Std Error \(t\) value Pr \(>|t|\)
Intercept -1.42117 0.08396 -16.93 <.0001
CARB Diesel 0.17744 0.01582 11.21 <.0001
CARB Gasoline 0 . . .
FRAC Annual -0.20345 0.01475 -13.79 <.0001
FRAC Fractional 0 . . .
SPORT Non-sports -0.02708 0.06983 -0.39 0.6981
SPORT Sports 0 . . .
GARACCESS RC Only 0.09806 0.01494 6.56 <.0001
GARACCESS RC+Accessories 0 . . .
AGGLOM Rural -0.22783 0.01519 -15 <.0001
AGGLOM Urban 0 . . .
SEXE Female 0.05386 0.01646 3.27 0.0011
SEXE Male 0 . . .
KW Medium Displacement 0.09516 0.01852 5.14 <.0001
KW Large Displacement 0.20646 0.02393 8.63 <.0001
KW Small Displacement 0 . . .
USAGE Private -0.03167 0.03468 -0.91 0.3612
USAGE Professional 0 . . .
Linear(AGES) -0.01267 0.00050444 -25.11 <.0001
Table 10.8: Fit of the nonlinear part of the Poisson GAM
Spline Smoothing Parameter DF (Degrees of Freedom) GCV (Generalized Cross-Validation) Num. Unique Obs Chi-Square Pr > ChiSq
Spline(AGES) 0.999630 13.629696 0.6607040 78 511.6004 <.0001

Omitting the variables SPORT and USAGE does not significantly degrade the model (the deviance changes from 85,798.498 to 85,799.492). We can see the results of the fit in Table 10.9. Table 10.10 informs us that the nonlinear effect of AGES is still highly significant.

Table 10.9: Parameter estimation for the linear part of the Poisson GAM model without the SPORT and USAGE variables
Parameter Estimate Std Error \(t\) value Pr \(>|t|\)
Intercept -1.47918 0.03291 -44.95 <.0001
CARB Diesel 0.17828 0.01576 11.31 <.0001
CARB Gasoline 0 . . .
FRAC Annual -0.20299 0.01475 -13.76 <.0001
FRAC Fractional 0 . . .
GARACCESS RC Only 0.09752 0.01492 6.54 <.0001
GARACCESS RC+Accessories 0 . . .
AGGLOM Rural -0.22785 0.01519 -15 <.0001
AGGLOM Urban 0 . . .
SEXE Female 0.05378 0.01646 3.27 0.0011
SEXE Male 0 . . .
KW Medium Displacement 0.0955 0.01851 5.16 <.0001
KW Large Displacement 0.21003 0.02345 8.96 <.0001
KW Small Displacement 0 . . .
Linear(AGES) -0.01267 0.00050334 -25.17 <.0001
Table 10.10: Fit of the nonlinear part of the Poisson GAM without the SPORT and USAGE variables
Spline Smoothing Parameter DF (Degrees of Freedom) GCV (Generalized Cross-Validation) Num. Unique Obs Chi-Square Pr > ChiSq
Spline(AGES) 0.999646 13.440987 0.678901 78 486.5207 <.0001

Now let’s examine the influence of AGES on the claim frequency. Figure ?? provides information on this: it displays the coefficient by which the frequency of the reference class should be multiplied as a function of the age of the policyholder, i.e., \(\exp\left(\widehat{\beta}_{\texttt{AGES}}\texttt{AGES}+\widehat{f}_{\texttt{AGES}}(\texttt{AGES})\right)\).

Based on Figure ??, we decide to categorize the variable AGES as follows: \[ \texttt{AGESGROUP}= \left\{ \begin{array}{l} \text{Beginner, if }18-21\text{ years},\\ \text{Young, if }22-30\text{ years},\\ \text{Experienced, if }31-55\text{ years},\\ \text{Senior, if }56-75\text{ years},\\ \text{Elderly driver, if over }75\text{ years}. \end{array} \right. \] Of course, this choice has an arbitrary component. Let’s now examine the variable AGESGROUP. The portfolio contains 1,212 beginner drivers (0.77%), 22,685 young drivers (14.35%), 89,326 experienced drivers (56.51%), 40,199 seniors (25.43%), and 4,639 elderly drivers (2.93%). A chi-square independence test between AGESGROUP and IND leads to the rejection of independence (observed test statistic of 860.15 for 4 degrees of freedom, yielding a \(p\)-value less than \(0.0001\)). Figure ?? shows that the most represented class is that of experienced drivers. We also observe a U-shaped (“bowl-like”) pattern in both the number of claims and the average claim amounts: high values for the beginner and young driver classes, a clear decrease over the years, and a slight increase for the elderly driver class.
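This categorization can be implemented with a simple binning function; a minimal Python sketch, where the cut-points reproduce the age groups retained above and the data are invented:

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 24, 37, 60, 80], name="age")    # hypothetical ages

agesgroup = pd.cut(
    ages,
    bins=[17, 21, 30, 55, 75, np.inf],                # 18-21, 22-30, 31-55, 56-75, >75
    labels=["Beginner", "Young", "Experienced", "Senior", "Elderly driver"],
)
print(agesgroup)
```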

10.11.8.3 Estimation of Annual Claim Frequencies

Now that the variable AGES has been categorized, we revert to a Poisson regression model of the GLM type. The model fitting is given in Table 10.11, while Table 10.12 provides the \(p\)-values of the tests used to exclude each variable from the model one by one (Type 3 analysis in SAS jargon). The point estimates of \(\beta_j\) are provided in the third column of Table 10.11, with the first two columns identifying the level to which the regression coefficient refers. Rows with DF equal to 0 correspond to the reference levels of the different rating variables. The “Wald \(95\%\) Conf Limit” columns contain the lower and upper bounds of the \(95\%\) confidence intervals for the parameters, calculated using the formula \[ \text{Coeff}\, \beta_j \pm 1.96 \, \text{Std Error}\, \beta_j, \] where 1.96 is the \(97.5\%\) quantile of the standard normal distribution, and Std Error, given in the fourth column, is the square root of the \(j\)-th diagonal element of \(\widehat{\boldsymbol{\Sigma}}\).

The “Chi-Sq” column and the associated \(p\)-value in the “Pr\(>\)ChiSq” column are used to test whether the coefficient \(\beta_j\) differs significantly from 0. The test is based on the Wald statistic \[ \frac{(\text{Coeff}\, \beta_j)^2}{(\text{Std Error}\, \beta_j)^2}, \] which is approximately chi-squared with 1 degree of freedom. We reject the nullity of \(\beta_j\) when the \(p\)-value is less than \(5\%\).
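For a single coefficient, the Wald confidence interval and test statistic are straightforward to compute; a small sketch using, as an example, the figures of the CARB Diesel row of Table 10.11:

```python
from scipy.stats import chi2, norm

beta_hat, std_err = 0.1869, 0.0159           # CARB Diesel, Table 10.11

# Wald 95% confidence interval.
z = norm.ppf(0.975)                          # = 1.96
ci = (beta_hat - z * std_err, beta_hat + z * std_err)

# Wald chi-square statistic and p-value (1 degree of freedom).
wald = (beta_hat / std_err) ** 2
p_value = chi2.sf(wald, df=1)
print(ci, wald, p_value)
```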

The Type 3 analysis allows us to examine the contribution of each variable compared to a model without it. The “ChiSquare” column calculates, for each variable, twice the difference between the log-likelihood obtained for the model containing all variables and the log-likelihood of the model without the specific explanatory variable. This statistic is asymptotically distributed as a chi-squared with DF degrees of freedom, where DF is the number of parameters associated with the examined explanatory variable. The last column provides the \(p\)-value associated with the likelihood ratio test, allowing us to assess the contribution of this explanatory variable to modeling the studied phenomenon.

Table 10.11: Poisson regression model fitting
Parameter DF Estimate Standard Error Wald 95% Confidence Limits (Lower) Wald 95% Confidence Limits (Upper) Chi-Square Pr \(>\) ChiSq
Intercept 1 -1.9575 0.0176 -1.992 -1.923 12387.1 <.0001
CARB Diesel 1 0.1869 0.0159 0.1558 0.2179 138.68 <.0001
CARB Gasoline 0 0 0 0 0 . .
FRAC Annual 1 -0.277 0.0147 -0.3058 -0.2481 353.8 <.0001
FRAC Fractional 0 0 0 0 0 . .
SPORT Sports 1 0.0391 0.0698 -0.0978 0.176 0.31 0.5754
SPORT Non-sports 0 0 0 0 0 . .
GARACCESS RC+Accessories 1 -0.1342 0.0149 -0.1635 -0.1049 80.68 <.0001
GARACCESS RC Only 0 0 0 0 0 . .
AGGLOM Urban 1 0.2376 0.0152 0.2078 0.2674 244.51 <.0001
AGGLOM Rural 0 0 0 0 0 . .
SEXE Female 1 0.0701 0.0165 0.0378 0.1023 18.13 <.0001
SEXE Male 0 0 0 0 0 . .
KW Large Displacement 1 0.1274 0.0203 0.0876 0.1672 39.35 <.0001
KW Small Displacement 1 -0.0867 0.0185 -0.1231 -0.0504 21.91 <.0001
KW Medium Displacement 0 0 0 0 0 . .
AGESGRP Beginner 1 0.7902 0.0578 0.6769 0.9034 186.92 <.0001
AGESGRP Elderly driver 1 -0.189 0.0482 -0.2135 -0.0244 6.09 0.1360
AGESGRP Young 1 0.3891 0.0184 0.353 0.4252 446.29 <.0001
AGESGRP Senior 1 -0.232 0.0192 -0.2696 -0.1943 145.77 <.0001
AGESGRP Experienced 0 0 0 0 0 . .
USAGE Professional 1 0.0269 0.0347 -0.0411 0.0949 0.6 0.4386
USAGE Private 0 0 0 0 0 . .
Table 10.12: Likelihood ratio test statistics for Type 3 analysis
Source DF Chi-Square Pr > ChiSq
CARB 1 136.63 <.0001
FRAC 1 357.5 <.0001
SPORT 1 0.31 0.5775
GARACCESS 1 81.36 <.0001
AGGLOM 1 237.72 <.0001
SEXE 1 17.96 <.0001
KW 2 78.79 <.0001
AGESGRP 4 878.05 <.0001
USAGE 1 0.6 0.4404

We start by excluding the variable SPORT, which is the least significant (\(p\)-value of 57.75% in Table 10.12). The resulting model (results not provided) still yields a high \(p\)-value for the variable USAGE (44.28%). Therefore, we move directly to the next model by removing USAGE as well. Next, to simplify the pricing model, we consider whether it would be possible to group certain levels of the variable AGESGRP. To do this, we test the hypothesis of equality of coefficients taken two by two, as shown in Table 10.13. The \(p\)-value obtained for the levels “Senior” and “Elderly driver” clearly indicates that we can group these levels (into a single senior level) for the subsequent analysis. The final model, taking into account all these modifications, is presented in Tables 10.14 and 10.15.

Table 10.13: Tests of equality of coefficients taken two by two for the variable AGESGRP
Contrast DF Chi-Square Pr > ChiSq
Beginner-Young 1 41.21 \(<\).0001
Senior-Elderly driver 1 5.06 0.2450
Young-Experienced 1 421.67 \(<\).0001
Experienced-Senior 1 151.15 \(<\).0001
Table 10.14: Poisson regression model fit, final model
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Chi-Square Pr > ChiSq
Intercept 1 -1.9564 0.0176 (-1.9909, -1.922) 12398.00 <.0001
CARB Diesel 1 0.1856 0.0158 (0.1547, 0.2165) 138.28 <.0001
CARB Gasoline 0 0.0000 0.0000 (0, 0) 0.00 .
FRAC Annual 1 -0.2751 0.0147 (-0.3039, -0.2463) 350.38 <.0001
FRAC Fractional 0 0.0000 0.0000 (0, 0) 0.00 .
GARACCESS RC+Accessories 1 -0.1340 0.0149 (-0.1632, -0.1048) 80.71 <.0001
GARACCESS RC Only 0 0.0000 0.0000 (0, 0) 0.00 .
AGGLOM Urban 1 0.2377 0.0152 (0.2079, 0.2674) 244.69 <.0001
AGGLOM Rural 0 0.0000 0.0000 (0, 0) 0.00 .
SEXE Female 1 0.0691 0.0164 (0.0369, 0.1014) 17.67 <.0001
SEXE Male 0 0.0000 0.0000 (0, 0) 0.00 .
KW Large Displacement 1 0.1296 0.0198 (0.0907, 0.1684) 42.81 <.0001
KW Small Displacement 1 -0.0860 0.0185 (-0.1223, -0.0497) 21.57 <.0001
KW Medium Displacement 0 0.0000 0.0000 (0, 0) 0.00 .
AGESGRP Beginner 1 0.7894 0.0578 (0.6761, 0.9026) 186.65 <.0001
AGESGRP Young 1 0.3891 0.0184 (0.3531, 0.4252) 448.41 <.0001
AGESGRP Senior 1 -0.2213 0.0185 (-0.2576, -0.185) 142.67 <.0001
AGESGRP Experienced 0 0.0000 0.0000 (0, 0) 0.00 .
Table 10.15: Likelihood ratio statistics for Type 3 analysis, final model
Source DF Chi-Square Pr > ChiSq
CARB 1 136.18 <.0001
FRAC 1 353.97 <.0001
GARACCESS 1 81.39 <.0001
AGGLOM 1 237.89 <.0001
SEXE 1 17.51 <.0001
KW 2 83.70 <.0001
AGESGRP 3 877.65 <.0001

For the insured individual \(i\) characterized by a vector of explanatory variables \(\boldsymbol{x}_i\), the predicted annual frequency is \(\exp(\boldsymbol{x}_i^\top\widehat{\boldsymbol{\beta}})\). This will also be the case for new insured individuals with the same characteristics as insured individual \(i\) (the implicit assumption being that new policies are taken out by individuals who perfectly match the characteristics of the insured individuals that form the basis of the tariff construction; this assumes, among other things, that the company effectively manages adverse selection through a careful acceptance policy).

10.11.8.4 Overdispersion

The Poisson model imposes strong constraints on the dependence between the count variable \(N_i\) and the risk factors \(\boldsymbol{x}_i\), since \[\begin{equation} \mathbb{E}[N_i|\boldsymbol{x}_i]=\mathbb{V}[N_i|\boldsymbol{x}_i]=d_i\exp(\boldsymbol{\beta}^\top\boldsymbol{x}_i). \tag{10.30} \end{equation}\] This implies an equality between the mean number of claims and the variability of this number within each risk class. However, it should be noted that the convergence of the maximum likelihood pseudo-estimators obtained in the Poisson model allows their use even if the Poisson distribution is not appropriate (provided that the conditional mean is correctly specified).

In practice, to check the validity of constraint (10.30), we calculate the empirical mean and variance of the number of claims for each risk class, denoted \(\widehat{m}_k\) and \(\widehat{\sigma}_k^2\), respectively. Then, we plot the points \(\{(\widehat{m}_k, \widehat{\sigma}_k^2), k=1,2,\ldots\}\) in a graph, which allows us to see how the variance varies with the mean. When the points cluster around the first bisector (the line \(\widehat{\sigma}^2=\widehat{m}\)), it can be considered that the first two conditional moments are equal, supporting the Poisson model. Conversely, we often observe overdispersion, meaning that some classes have \(\widehat{\sigma}_k^2>\widehat{m}_k\). This phenomenon is often due to omitted variables.
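The diagnostic amounts to a simple group-by computation followed by a plot against the first bisector; a minimal Python sketch on invented data (the column names are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: one row per policy, with its risk class and claim count.
df = pd.DataFrame({
    "risk_class": ["A", "A", "A", "B", "B", "B", "B"],
    "n_claims":   [0,   1,   0,   0,   2,   0,   1],
})

stats = df.groupby("risk_class")["n_claims"].agg(mean="mean", var="var")

plt.scatter(stats["mean"], stats["var"])
lims = [0, stats[["mean", "var"]].max().max()]
plt.plot(lims, lims, linestyle="--")          # first bisector: variance = mean
plt.xlabel("empirical mean per class")
plt.ylabel("empirical variance per class")
plt.show()
```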

We can understand this phenomenon by considering two risk classes, \(C_1\) and \(C_2\), without overdispersion (\(\widehat{\sigma}_1^2=\widehat{m}_1\) and \(\widehat{\sigma}_2^2=\widehat{m}_2\)), but which have been mistakenly combined. In the class \(C_1\cup C_2\), the mean is given by \[ \widehat{m}=p_1\widehat{m}_1+p_2\widehat{m}_2, \] where \(p_1\) and \(p_2\) represent the relative weights of \(C_1\) and \(C_2\), respectively. The variance becomes \[ \widehat{\sigma}^2=p_1\widehat{\sigma}_1^2+p_2\widehat{\sigma}_2^2+p_1(\widehat{m}_1-\widehat{m})^2+p_2(\widehat{m}_2-\widehat{m})^2. \] Therefore, in \(C_1\cup C_2\), there is overdispersion because \(\widehat{\sigma}^2>\widehat{m}\), with equality only possible if \(\widehat{m}_1=\widehat{m}_2\). It is thus understandable that the omission of important explanatory variables can lead to overdispersion of observations within risk classes.

Based on our portfolio, the data points \(\{(\widehat{m}_k, \widehat{\sigma}_k^2), k=1,2,\ldots\}\) are represented in the graph in Figure ??. The effect of overdispersion is clearly visible.

10.11.8.5 Consequences of Specification Error in the Poisson Model

As mentioned earlier, the Poisson model is relatively restrictive as it assumes equidispersion of the data. Often, as in our dataset, this assumption is not met. It is interesting to examine what happens if the Poisson model is misspecified. Specifically, suppose that the true form of the conditional mean is \[\begin{equation} \mathbb{E}[N_i|\boldsymbol{x}_i]=d_i\exp(\boldsymbol{\beta}_0^\top\boldsymbol{x}_i), \tag{10.31} \end{equation}\] where \(\boldsymbol{\beta}_0\) is the true parameter value, but the conditional distribution of \(N_i\) given \(\boldsymbol{x}_i\) is not Poisson. The estimator \(\widehat{\boldsymbol{\beta}}\) obtained by solving the likelihood equations is no longer a maximum likelihood estimator, but rather an estimator calculated under specification error; it is referred to as a pseudo-maximum likelihood estimator. However, the pseudo-maximum likelihood estimator based on the Poisson model with mean \(d_i\exp(\boldsymbol{\beta}^\top\boldsymbol{x}_i)\) is consistent for the true value \(\boldsymbol{\beta}_0\) as long as the mean is of the form (10.31). This comes from the likelihood equations of the Poisson regression model, whose population counterpart \[ \mathbb{E}\Big[\big(N_i-d_i\exp(\boldsymbol{\beta}^\top\boldsymbol{x}_i)\big)\boldsymbol{x}_i\Big]=\boldsymbol{0} \] is satisfied at \(\boldsymbol{\beta}=\boldsymbol{\beta}_0\) whenever (10.31) holds, which ensures that \(\widehat{\boldsymbol{\beta}}\to_{\text{prob}}\boldsymbol{\beta}_0\) as the number of observations \(n\to +\infty\). Only the correct specification of \(\mathbb{E}[N_i|\boldsymbol{x}_i]\) is necessary to obtain consistent estimators.

Specification error does not affect the consistency of the estimator in large samples. However, it requires a modification of the calculation of the asymptotic variance, which is now given by the sandwich form \(\boldsymbol{H}^{-1}\mathcal{I}\boldsymbol{H}^{-1}\), where \[ \boldsymbol{H}=\sum_{i=1}^n\boldsymbol{x}_i\boldsymbol{x}_i^\top d_i\exp(\boldsymbol{\beta}_0^\top\boldsymbol{x}_i) \text{ and } \mathcal{I}=\sum_{i=1}^n\boldsymbol{x}_i\boldsymbol{x}_i^\top\mathbb{V}[N_i|\boldsymbol{x}_i]. \] In practice, the asymptotic variance-covariance matrix will be estimated as \(\widehat{\boldsymbol{H}}^{-1}\widehat{\mathcal{I}}\widehat{\boldsymbol{H}}^{-1}\) where \[ \widehat{\boldsymbol{H}}=\sum_{i=1}^n\boldsymbol{x}_i\boldsymbol{x}_i^\top d_i\exp(\widehat{\boldsymbol{\beta}}^\top\boldsymbol{x}_i) \text{ and } \widehat{\mathcal{I}}=\sum_{i=1}^n\boldsymbol{x}_i\boldsymbol{x}_i^\top(n_i-\widehat{\lambda}_i)^2, \] with \(\widehat{\lambda}_i=d_i\exp(\widehat{\boldsymbol{\beta}}^\top\boldsymbol{x}_i)\).
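The sandwich estimator \(\widehat{\boldsymbol{H}}^{-1}\widehat{\mathcal{I}}\widehat{\boldsymbol{H}}^{-1}\) is easy to compute once the Poisson fit is available; a minimal numpy sketch, written as a function taking the design matrix, exposures, observed counts and estimated coefficients as inputs (all names are illustrative):

```python
import numpy as np

def sandwich_covariance(X, d, n, beta_hat):
    """Robust (sandwich) covariance of the Poisson pseudo-maximum likelihood estimator.

    X        : (n_obs, p) design matrix
    d        : (n_obs,) exposures
    n        : (n_obs,) observed claim counts
    beta_hat : (p,) Poisson regression estimates
    """
    lam = d * np.exp(X @ beta_hat)                 # fitted means d_i exp(beta' x_i)
    H = (X * lam[:, None]).T @ X                   # sum_i x_i x_i' lambda_i
    I = (X * (n - lam)[:, None] ** 2).T @ X        # sum_i x_i x_i' (n_i - lambda_i)^2
    H_inv = np.linalg.inv(H)
    return H_inv @ I @ H_inv                       # H^{-1} I H^{-1}

# Robust standard errors: np.sqrt(np.diag(sandwich_covariance(X, d, n, beta_hat)))
```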

Overdispersion (indicating specification error) has little effect on point estimates in large samples. However, overdispersion leads to an underestimation of the variances of the estimators, resulting in overly narrow confidence intervals and overestimation of chi-squared statistics used to test the nullity of regression coefficients. Consequently, a variable deemed relevant in the Poisson model may no longer be relevant after overdispersion has been taken into account.

10.11.8.6 Negative Binomial Regression Model

A simple and effective technique to account for overdispersion is to introduce a random error term into the linear predictor (recognizing the heterogeneity of insureds within each tariff class: even though they are identical for the company, they still have relatively different risk profiles). If we denote \(\epsilon_i\) as this error term, we assume that \[ [N_i|\boldsymbol{x}_i,\epsilon_i]\sim\mathcal{P}oi\big(d_i\exp(\boldsymbol{\beta}^\top\boldsymbol{x}_i+\epsilon_i)\big). \] Let \(\Theta_i=\exp(\epsilon_i)\) and require \(\mathbb{E}[\Theta_i]=1\). Thus, \[ \mathbb{E}[N_i|\boldsymbol{x}_i]=d_i\exp(\boldsymbol{\beta}^\top\boldsymbol{x}_i)\mathbb{E}[\Theta_i]=d_i\exp(\boldsymbol{\beta}^\top\boldsymbol{x}_i)=\lambda_i; \] the addition of the error term \(\epsilon_i\) to the linear predictor does not change the prior mean of \(N_i\), conditioned on the observables \(\boldsymbol{x}_i\). This property ensures that the tariff is correct on average, and that the premium collected for a tariff class is sufficient to compensate for claims. As \[\begin{eqnarray*} \mathbb{V}[N_i|\boldsymbol{x}_i]&=&\mathbb{E}\Big[\mathbb{V}[N_i|\boldsymbol{x}_i,\Theta_i]\Big]+\mathbb{V}\Big[\mathbb{E}[N_i|\boldsymbol{x}_i,\Theta_i]\Big] \\ & = & \mathbb{E}[\lambda_i\Theta_i]+\mathbb{V}[\lambda_i\Theta_i]=\lambda_i\big\{1+\sigma^2\lambda_i\big\}>\lambda_i=\mathbb{E}[N_i|\boldsymbol{x}_i] \end{eqnarray*}\] where we have set \(\sigma^2=\mathbb{V}[\Theta_i]\), the introduction of a random error term to the linear predictor automatically leads to data overdispersion.

10.11.8.7 Justification for Introducing the Error \(\epsilon\)

Many relevant explanatory variables cannot be observed by the insurer (for legal or economic reasons). Of course, some hidden variables could be correlated with the observable variables \(\boldsymbol{X}\). For example, annual mileage is a hidden variable for the company, but it is likely to be correlated with the use of the vehicle (private-professional), which is an observable variable for the company. The idea is to represent the residual effect of hidden variables by an error term \(\epsilon\) that would be superimposed on the score. Technically, \(\epsilon\) is assumed to be independent of the vector \(\boldsymbol{X}\) of observable characteristics of the insured.

For an individual such that \(\boldsymbol{X}=\boldsymbol{x}\), the annual number of claims \(N\) follows a Poisson distribution with mean \(\exp(\boldsymbol{\beta}^\top\boldsymbol{x}+\epsilon)\), where \(\epsilon\) is a random variable representing the influence of omitted variables in the tariff. \(\epsilon\) is called a random effect and represents the residual heterogeneity of the portfolio. The annual number of claims \(N\) becomes a mixture of Poisson distributions, reflecting the fact that each risk class (defined by \(\boldsymbol{X}=\boldsymbol{x}\)) contains a mixture of insureds with different risk profiles on unobservable factors.

If we denote \(f_\Theta\) as the probability density function of \(\Theta_i\), we obtain \[\begin{eqnarray*} \Pr[N_i=k|\boldsymbol{x}_i]&=&\int_{\theta\in{\mathbb{R}}^+}\Pr[N_i=k|\boldsymbol{x}_i,\Theta_i=\theta]f_\Theta(\theta)d\theta\\ &=&\int_{\theta\in{\mathbb{R}}^+}\exp(-\lambda_i\theta)\frac{(\lambda_i\theta)^k}{k!}f_\Theta(\theta)d\theta. \end{eqnarray*}\] For most choices of \(f_\Theta\), this integral does not have an explicit expression. A notable exception is the density associated with the Gamma distribution, for which we obtain the Negative Binomial distribution for \([N_i|\boldsymbol{x}_i]\). Considering that the random effects \(\Theta_i=\exp(\epsilon_i)\), \(i=1,2,\ldots,n\), are independent and identically distributed as Gamma with mean 1 and variance \(1/a\), the density of \(\Theta_i\) is given by \[ f_\Theta(\theta)=\frac{1}{\Gamma(a)}a^a\theta^{a-1}\exp(-a\theta),\hspace{2mm}\theta\in{\mathbb{R}}^+. \] The choice of the density \(f_\Theta\) is justified by purely analytical considerations (the Gamma distribution being the conjugate prior to the Poisson distribution). Conditionally on the observable variables \(\boldsymbol{x}_i\) and the random effect \(\Theta_i\), the discrete probability density of \(N_i\) is given by \[ \Pr[N_i=n_i|\boldsymbol{x}_{i},\Theta_i=\theta_i]=\exp\Big(-\theta_id_i\exp(\eta_i)\Big)\frac{\big\{\theta_id_i\exp(\eta_i)\big\}^{n_i}}{n_i!}, \] where \(\eta_i=\boldsymbol{\beta}^\top\boldsymbol{x}_i\) denotes the linear predictor. Conditionally on the observables, \(N_i\) follows a Poisson distribution with mean \(d_i\exp(\eta_i)\Theta_i\). Unconditionally, \(N_i\) follows a Negative Binomial distribution, with probabilities \[ \Pr[N_i=n_i|\boldsymbol{x}_i] =\left(\begin{array}{c}a+n_i-1 \\n_i\end{array}\right)\left(\frac{d_i\exp(\eta_i)}{a+d_i\exp(\eta_i)}\right)^{n_i} \left(\frac{a}{a+d_i\exp(\eta_i)}\right)^a, \] for \(n_i\in{\mathbb{N}}\).

While the maximum likelihood estimator \(\widehat{\boldsymbol{\beta}}\) in the Poisson model does not coincide with that in the model with random effects, the estimates obtained can be considered the results of the generalized method of moments. Indeed, equation (??), for which the maximum likelihood estimator \(\widehat{\boldsymbol{\beta}}\) is a solution, can be viewed as the empirical counterpart of the equation \[\begin{equation} \mathbb{E}\left[\sum_{i=1}^n\big\{N_i-\lambda_i\big\}\boldsymbol{x}_i\right]=\boldsymbol{0}, \tag{10.32} \end{equation}\] which is valid in both the simple Poisson model and the model with random effects. Therefore, if one is willing to abandon the maximum likelihood method in favor of the generalized method of moments, the estimates obtained in the model without random effects can be retained.

It is relatively simple to obtain an estimation of \(\sigma^2=\mathbb{V}[\Theta_i]\) using the method of moments. To do this, note that \[\begin{eqnarray} \mathbb{V}[N_{i}] & = & \mathbb{E}\Big[\mathbb{V}[N_{i}|\Theta_i]\Big]+\mathbb{V}\Big[\mathbb{E}[N_{i}|\Theta_i]\Big]\nonumber\\ & = & \mathbb{E}\Big[\Theta_id_{i}\exp(\eta_{i})\Big]+\mathbb{V}\Big[\Theta_id_{i}\exp(\eta_{i})\Big]\nonumber\\ & = & d_{i}\exp(\eta_{i})+\Big\{d_{i}\exp(\eta_{i})\Big\}^2\sigma^2\nonumber\\ &=&\mathbb{E}[N_{i}]+\Big\{d_{i}\exp(\eta_{i})\Big\}^2\sigma^2. \tag{10.33} \end{eqnarray}\] Let’s write a similar relation to (10.32) for the variance, which is \[\begin{equation} \sum_{i=1}^n\left\{\Big(n_{i}-d_{i}\exp(\eta_{i})\Big)^2-n_{i} -\Big(d_{i}\exp(\eta_{i})\Big)^2\sigma^2\right\}=0. \tag{10.34} \end{equation}\] The estimators \(\widehat{\boldsymbol{\beta}}\) and \(\widehat{\sigma}\) are solutions of the system formed by (10.32) and (10.34); they converge in the model with random effects. Therefore, the maximum likelihood estimator \(\widehat{\boldsymbol{\beta}}\) of \(\boldsymbol{\beta}\), a solution of (10.32), converges in the model with random effects. The estimator of \(\sigma^2\) is given by \[ \widehat{\sigma}^2=\frac{\sum_{i=1}^n\left\{\Big(n_{i}-d_{i} \exp(\widehat{\eta}_{i})\Big)^2-n_{i}\right\}} {\sum_{i=1}^n\Big\{d_{i}\exp(\widehat{\eta}_{i})\Big\}^2} \] where \(\widehat{\eta}_{i}=\widehat{\boldsymbol{\beta}}^\top\boldsymbol{x}_{i}\), and \(\widehat{\boldsymbol{\beta}}\) is the maximum likelihood estimator of \(\boldsymbol{\beta}\) in the model without random effects.
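The moment estimator of \(\sigma^2\) requires only the fitted Poisson means; a short sketch along the same lines (again, the argument names are illustrative):

```python
import numpy as np

def sigma2_moment_estimate(X, d, n, beta_hat):
    """Method-of-moments estimator of the random-effect variance sigma^2."""
    lam = d * np.exp(X @ beta_hat)                       # fitted means d_i exp(eta_i)
    numerator = np.sum((n - lam) ** 2 - n)               # sum {(n_i - lambda_i)^2 - n_i}
    denominator = np.sum(lam ** 2)                       # sum {d_i exp(eta_i)}^2
    return numerator / denominator

# sigma2_hat = sigma2_moment_estimate(X, d, n, beta_hat)
```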

Instead of first estimating the parameters in a Poisson regression model (10.28)-(10.29) and then adding a random effect without questioning the estimation of \(\boldsymbol{\beta}\), one could also work directly in a Negative Binomial regression model.

If the Gamma distribution is chosen for \(\Theta_i\), the likelihood is given by \[\begin{eqnarray*} \mathcal{L}(\boldsymbol{\beta}|\boldsymbol{n})&=&\prod_{i=1}^n\frac{\lambda_{i}^{n_{i}}}{n_i!}\left(\frac{a}{a+\lambda_{i}}\right)^a \left(a+\lambda_{i}\right)^{-n_{i}}\frac{\Gamma\left(a+n_{i}\right)}{\Gamma(a)}. \end{eqnarray*}\]

The maximum likelihood estimators of the parameters \(\boldsymbol{\beta}\) and \(a\) satisfy, in particular, the system \[\begin{equation} \sum_{i=1}^n\boldsymbol{x}_{i}\left(n_{i}-\lambda_{i}\frac{a+n_{i}} {a+\lambda_{i}}\right)=\boldsymbol{0}. \tag{10.35} \end{equation}\] To interpret this relation, note that if a constant \(\beta_0\) is introduced into the score, the first equation of (10.35) guarantees that \[ \sum_{i=1}^nn_i=\sum_{i=1}^n\lambda_i\frac{a+n_{i}}{a+\lambda_{i}}. \] This means that the model reproduces the total number of claims while incorporating the information they provide. Indeed, \(\lambda_i\frac{a+n_{i}}{a+\lambda_{i}}\) is the expected number of claims in the following period for an insured with characteristics \(\boldsymbol{x}_i\) who has reported \(n_i\) claims during the current insurance period. If the first variable is 1 if the insured is male and 0 otherwise, then \[ \sum_{\text{males}}n_i=\sum_{\text{males}}\lambda_i\frac{a+n_{i}}{a+\lambda_{i}}. \] One can see this last relation as a guarantee of non-subsidization of males by females, and vice versa.

The likelihood equations (10.35) do not have an explicit solution; therefore, numerical solution methods will be used, using the estimates provided by the method of moments as initial values.
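As an illustration of this two-step strategy, the sketch below simulates a small mixed Poisson portfolio, fits the Poisson model, and then maximizes the negative binomial likelihood numerically starting from the Poisson estimates; it relies on the `NegativeBinomial` count model of `statsmodels` (NB2 parameterization), which appends the dispersion parameter to the coefficient vector. The data and coefficients are invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_obs = 2000
x1 = rng.integers(0, 2, n_obs)                        # hypothetical binary rating factor
d = rng.uniform(0.2, 1.0, n_obs)                      # exposures
theta = rng.gamma(shape=2.0, scale=0.5, size=n_obs)   # random effects: mean 1, variance 1/2
n = rng.poisson(d * np.exp(-2.0 + 0.4 * x1) * theta)  # mixed Poisson counts

X = sm.add_constant(x1)

# Poisson fit, then negative binomial (NB2) started from the Poisson estimates.
poisson_res = sm.GLM(n, X, family=sm.families.Poisson(), offset=np.log(d)).fit()
nb_model = sm.NegativeBinomial(n, X, offset=np.log(d))
nb_res = nb_model.fit(start_params=np.append(poisson_res.params, 0.5))

print(poisson_res.params, poisson_res.bse)   # point estimates close to the NB ones
print(nb_res.params, nb_res.bse)             # last parameter: dispersion alpha
```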

The adjustment of the negative binomial regression model is given in Tables 10.16-10.17. As expected, the point estimates are similar to those obtained by Poisson regression, but the confidence intervals are wider in the negative binomial model.

Table 10.16: Adjustment of the negative binomial regression model, final model
Parameter DF Estimate Standard Error Wald 95% Confidence Limits Chi-Square Pr > ChiSq
Intercept 1 -1.9555 0.0182 (-1.9913, -1.9198) 11486.30 <.0001
CARB Diesel 1 0.1867 0.0164 (0.1545, 0.2189) 129.24 <.0001
CARB Gasoline 0 0.0000 0.0000 (0, 0) 0.00 .
FRAC Annual 1 -0.2757 0.0152 (-0.3055, -0.2458) 326.86 <.0001
FRAC Fractional 0 0.0000 0.0000 (0, 0) 0.00 .
GARACCESS RC+Accessories 1 -0.1341 0.0155 (-0.1644, -0.1038) 75.07 <.0001
GARACCESS RC Only 0 0.0000 0.0000 (0, 0) 0.00 .
AGGLOM Urban 1 0.2375 0.0158 (0.2065, 0.2685) 225.05 <.0001
AGGLOM Rural 0 0.0000 0.0000 (0, 0) 0.00 .
SEXE Female 1 0.0712 0.0171 (0.0377, 0.1048) 17.36 <.0001
SEXE Male 0 0.0000 0.0000 (0, 0) 0.00 .
KW Large Displacement 1 0.1297 0.0206 (0.0894, 0.1701) 39.68 <.0001
KW Small Displacement 1 -0.0852 0.0192 (-0.1228, -0.0475) 19.66 <.0001
KW Medium Displacement 0 0.0000 0.0000 (0, 0) 0.00 .
AGESGRP Beginner 1 0.8006 0.0621 (0.6788, 0.9224) 166.02 <.0001
AGESGRP Young 1 0.3906 0.0193 (0.3528, 0.4284) 410.41 <.0001
AGESGRP Senior 1 -0.2213 0.0191 (-0.2587, -0.1839) 134.30 <.0001
AGESGRP Experienced 0 0.0000 0.0000 (0, 0) 0.00 .
Dispersion 1 0.5431 0.0358 (0.4773, 0.6178) NA .
Table 10.17: Likelihood ratio statistics for Type 3 analysis, negative binomial model
Source DF Chi-Square Pr > ChiSq
CARB 1 127.51 <.0001
FRAC 1 329.10 <.0001
GARACCESS 1 75.58 <.0001
AGGLOM 1 219.49 <.0001
SEXE 1 17.22 <.0001
KW 2 NA .
AGESGRP 3 813.99 <.0001

10.11.9 Analysis of Claim Costs

10.11.9.1 Challenges

Before proceeding, let’s explain why the analysis of claim costs is significantly more complicated than that of claim counts.

While all policies in the portfolio can be used to estimate the annual claim frequency, it is clear that only policies with claims can be used to study the distribution of claim amounts. Therefore, the actuary has a limited number of observations to work with when fitting a model to claim amounts. Furthermore, claim amounts are much more challenging to model than claim counts due to the increased complexity of the phenomenon.

Often, claims of some severity require relatively long periods to be closed. Consider, for example, an automobile liability claim with bodily injury, where it is necessary to wait for the victim’s condition to stabilize before determining the amount of compensation. For this reason, the company often has only cost projections in its records, making the analysis uncertain.

According to some authors, considering claim amounts in automobile liability insurance pricing is questionable. Indeed, the amounts paid by the insurer are intended to compensate third parties for their losses. Thus, an insured who injures a pedestrian would expose their insurer to very different expenses depending on whether the pedestrian is an elderly, sick person with no family or a young, dynamic executive with two young children. It is difficult to see how the characteristics of the insured could explain this phenomenon, even though it is understandable that the insured’s profile could explain the probability of causing an accident with bodily injury. In the same vein, an insured who damages another vehicle would force their insurer to pay very different amounts depending on whether the other vehicle is a limousine or a low-end car. However, there is no doubt that claim amounts should be considered in related coverages such as “property damage” coverage, for example.

10.11.9.2 Severe Claims

Often, less than 20% of claims account for more than 80% of the company’s expenses. This requires special treatment of these “severe claims,” which are usually not segmented or only slightly segmented.

Different representations of claim costs by policy were examined in Section 6.2.2. Here, we will adopt the formalism of Example 6.2.1. Specifically, the total claim cost incurred by a policy is written as follows: \[ S = \sum_{k=1}^N C_k + IL \] where

  • \(N\) is the number of standard claims, assumed to follow a Poisson distribution.
  • \(C_k\) is the cost of the \(k\)th standard claim.
  • \(I\) indicates whether the policy generated at least one severe claim.
  • \(L\) is the cumulative cost of these severe claims, if any.

The insurer will segment their premium based on \(\mathbb{E}[N]\) and \(\mathbb{E}[C_k]\), possibly on \(\mathbb{E}[I]\), but generally not on \(\mathbb{E}[L]\). Severe claims are too few to allow for customization of amounts.

Remark. Since severe claims are the most expensive (they represent more than 80% of the amounts paid by the insurer) and do not lend themselves well to segmentation, the highly segmented part of the premium should represent only 20% of the amount paid by the insured. Therefore, since insureds all seem equal in the face of severe claims (with few exceptions), the rate competition observed in commercial premiums sometimes seems questionable.

In our portfolio, 17,785 policyholders reported at least one claim in the year 1997. The empirical pure premium amounts to 203.38 Euros. Some descriptive statistics can be found in Table 10.18. The level at which a claim is classified as “severe” is obtained using extreme value theory (presented in Chapter 15).

Table 10.18: Descriptive statistics of pure premium amount
Statistic Value Percentile Percentile Value
Mean 1638.51000 100% 1.98957E+06
Median 522.58400 99% 1.75898E+04
Mode 1426.37900 95% 3.59446E+03
Std Deviation 17132.44560 90% 2.58231E+03
Skewness 91.79986 75% Q3 1.42638E+03
Coeff Variation 1045.61099 50% Median 5.22584E+02
Interquartile Range 1284.00000 25% Q1 1.42192E+02

10.11.9.3 Logistic Regression for Analyzing the Occurrence of Severe Claims

Logistic regression is used to explain the occurrence of severe claims. The backward method for selecting explanatory variables leads to the successive exclusion of all rating variables except CARB and AGESGROUP. After grouping some categories of AGESGROUP, we finally obtain the results shown in Table 10.19.

Table 10.19: Probabilities of not causing severe claims based on the insured’s characteristics
CARB AGESGROUP Probability of Not Causing a Severe Claim
Gasoline Beginner-Young 0.9533
Gasoline Senior-Experienced 0.9620
Diesel Beginner-Young 0.9435
Diesel Senior-Experienced 0.9540
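A logistic model of this kind can be fitted with the same GLM machinery as the frequency models; the sketch below uses simulated data (the factors `diesel` and `young` and the coefficients are invented) and then evaluates, as in Table 10.19, the probability of not causing a severe claim for each combination of factors.

```python
import numpy as np
import statsmodels.api as sm

# Simulated, purely illustrative data (the real analysis uses the portfolio above).
rng = np.random.default_rng(0)
n_obs = 5000
diesel = rng.integers(0, 2, n_obs)
young = rng.integers(0, 2, n_obs)
p_severe = 1.0 / (1.0 + np.exp(-(-3.0 + 0.2 * diesel + 0.4 * young)))
severe = rng.binomial(1, p_severe)            # 1 if at least one severe claim

X = sm.add_constant(np.column_stack([diesel, young]))
logit_res = sm.GLM(severe, X, family=sm.families.Binomial()).fit()

# Probability of NOT causing a severe claim for each combination of factors.
grid = sm.add_constant(np.array([[0, 0], [0, 1], [1, 0], [1, 1]]))
print(1.0 - logit_res.predict(grid))
```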

10.11.9.4 Analysis of Standard Claim Costs

Gamma Regression Model

We will assume that the claim costs \(C_{i1},C_{i2},\ldots\) incurred by insured \(i\) are independent and identically distributed as Gamma variables with mean \[ \mu_i = \mathbb{E}[C_{ik}|\mathbf{x}_i] = \exp(\boldsymbol{\beta}^\top\mathbf{x}_i) \] and variance \[ \text{Var}[C_{ik}|\mathbf{x}_i] = \frac{\{\exp(\boldsymbol{\beta}^\top\mathbf{x}_i)\}^2}{\nu}. \]

Remark. Often, only the total cost \(C_{i\bullet} = \sum_{k=1}^{n_i}C_{ik}\) is available, not the individual components \(C_{ik}\) of the sum. In that case, we work with the average cost \(\overline{C}_{i\bullet} = C_{i\bullet}/n_i\) and let the parameter \(\nu\) vary from insured to insured by specifying \(\nu_i = \nu\omega_i\), where the weight \(\omega_i\) is the number of claims \(n_i\).
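This weighted Gamma regression on average costs can be fitted as a GLM with a log link and prior weights; a minimal `statsmodels` sketch on invented data, where `avg_cost` plays the role of \(\overline{C}_{i\bullet}\) and `n_claims` that of the weight \(\omega_i=n_i\):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: average claim cost per insured with at least one claim.
df = pd.DataFrame({
    "avg_cost": [820.0, 1450.0, 530.0, 2300.0, 610.0],
    "n_claims": [1, 2, 1, 1, 3],
    "rc_only":  [1, 0, 1, 0, 1],       # GARACCESS: liability-only indicator
})

X = sm.add_constant(df[["rc_only"]])
gamma_res = sm.GLM(df["avg_cost"], X,
                   family=sm.families.Gamma(link=sm.families.links.Log()),
                   var_weights=df["n_claims"]).fit()

# exp(beta) gives multiplicative effects on the expected claim cost.
print(np.exp(gamma_res.params))
```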

Neglecting severe claims, the pure premium for insured \(i\) is given by \[\begin{eqnarray*} \mathbb{E}\left[\sum_{k=1}^{N_i}C_{ik}\right] &=& \mathbb{E}[N_i]\mathbb{E}[C_{i1}] = \exp\left((\boldsymbol{\beta}_{\text{freq}} + \boldsymbol{\beta}_{\text{cost}})^\top\mathbf{x}_i\right), \end{eqnarray*}\] which provides a multiplicative rate when all explanatory variables are binary. The pure premium then becomes \[ \exp\left((\boldsymbol{\beta}_{\text{freq}} + \boldsymbol{\beta}_{\text{cost}})^\top\mathbf{x}_i\right) + q_i\mathbb{E}[L], \] where \(q_i = \mathbb{E}[I_i]\) is the probability that insured \(i\) causes at least one severe claim.

Model Fitting

The backward selection procedure led to the exclusion of several explanatory variables. Table 10.20 describes the final model.

Table 10.20: Parameter estimation of the gamma regression model for claim costs.
Parameter Estimate Wald 95% Confidence Limits Pr > ChiSq
Intercept 6.6389 6.5908 - 6.6873 <.0001
FRAC -0.0402 -0.0764 - -0.0039 0.0296
FRAC 0.0000 0 - 0 .
GARACCESS 0.0702 0.0342 - 0.1062 0.0001
GARACCESS 0.0000 0 - 0 .
AGESGRP -0.0657 -0.1095 - -0.0222 0.0032
AGESGRP 0.0000 0 - 0 .
AGGLOM 0.0826 0.0314 - 0.1345 0.0017
AGGLOM 0.0000 0 - 0 .
Scale 0.7294 0.7155 - 0.7435 .

Remark. Gamma regression is not the only way to model claim costs. In the log-normal model, it is assumed that the natural logarithm of claim amounts follows a normal distribution with a mean given by a linear predictor \(\boldsymbol{\beta}^\top\mathbf{x}_{it}\) and a constant variance \(\sigma^2\), i.e., \[ \ln C_{ik} \sim \mathcal{N}or(\boldsymbol{\beta}^\top\mathbf{x}_i, \sigma^2). \] The likelihood equation for estimating the parameter \(\widehat{\boldsymbol{\beta}}\) is given by \[ \sum_{i\text{ s.t. }n_i>0}\sum_{k=1}^{n_i}\left(\ln c_{ik} - \boldsymbol{\beta}^\top\mathbf{x}_i\right)\mathbf{x}_i = \boldsymbol{0} \Leftrightarrow \sum_{i\text{ s.t. }n_i>0}lcres_i\mathbf{x}_i = \boldsymbol{0}, \] where \(lcres_i\) is the estimation residual defined for insureds who reported at least one claim as \[ lcres_i = \sum_{k=1}^{n_i}\left\{\ln c_{ik} - \boldsymbol{\beta}^\top\mathbf{x}_i\right\} = \sum_{k=1}^{n_i}\ln c_{ik} - n_i\boldsymbol{\beta}^\top\mathbf{x}_i. \] This equation also expresses an orthogonality relationship between explanatory variables and estimation residuals.

This approach is not without drawbacks: since the dependent variable is the natural logarithm of claim amounts, inferential conclusions will relate to the transformed claim amounts, and it is not always easy to revert to the original data.

Remark (Another Application of Gamma Regression: Individual Reserving Model). An interesting application of the analysis of claim costs is to determine individual reserving rules. As soon as a claim is reported, the company must set aside an amount corresponding to the probable cost of that event to provide shareholders, the market, and regulatory authorities with an accurate picture of its financial position.

An automated approach is to explain the amount to be reserved based on the characteristics of claims reported in the past (in addition to a priori variables, the insurer will also use information related to the circumstances of the claims, such as the time of day it occurred, the presence of bodily injury, etc.).

10.12 Panel Data Rating

This section is based on (Denuit, Walhin, and Pitrebois 2003) and only addresses the number of claims.

10.12.1 Rating Based on Panel Data

Often, actuaries use several years of observations to construct their rates (to increase the size of the database and to avoid giving too much importance to events of a particular year). As a consequence, some of the data are no longer independent. Indeed, observations made on the same insured over different periods are likely correlated (which is the rationale for a posteriori rating, discussed in the following chapters). We are thus dealing with panel data.

In the context of a priori rating, the dependence between observations related to the same policy is considered a nuisance: at this stage, the actuary wants to determine the impact of observable factors on the insured risk, and the correlations that exist between the data prevent the use of classical statistical techniques (most of which are based on the assumption of independence). Here, we will show how to take this dependence into account to improve the quality of estimates using techniques proposed by (Liang and Zeger 1986) and (Zeger, Liang, and Albert 1988).

The estimators of claim frequencies obtained under the assumption of independence of individual data over different periods are convergent (meaning they will tend in probability to population values as the sample size increases). Therefore, it is reasonable to expect that for large automobile portfolios, the impact of the simplifying assumption of independence on point estimates will be minimal. Indeed, this is what we will demonstrate in the empirical part of our study.

10.12.2 Notation

As explained above, insurance companies often use multiple observation periods to build their rates. Individual observations are therefore doubly indexed, by policy \(i\) and period \(t\). From now on, \(N_{it}\) represents the number of claims reported by insured \(i\) during period \(t\), \(i=1,2,\ldots,n\), \(t=1,2,\ldots,T_i\), where \(T_i\) denotes the number of observation periods for insured \(i\). We will denote \(d_{it}\) as the duration of the \(t\)-th observation period for individual \(i\). When there is a change in observable variables, a new interval begins, so \(d_{it}\) may be different from 1. We assume that we also have other variables \(\mathbf{x}_{it}\) known at the beginning of the period \(t\), which can serve as explanatory factors for the insured’s claims experience. In addition to explanatory variables, we can introduce calendar time as a regression component to account for specific events or possible trends in claims frequency, following the approach of (Besson and Patrat 1992).

Typically, we are dealing with panel data: the same variable is measured on a large number \(n\) of individuals over time, with a relatively small number \(\max_{1\leq i\leq n} T_i\) of repetitions. Asymptotics will be done here by letting \(n\) tend to infinity, rather than the number of observations made on the same individual (as is typically the case in time series analysis).

10.12.3 Presentation of the Dataset

We illustrate our points on a Belgian insurance portfolio consisting of 20,354 policies observed over a period of 3 years. Figure ?? provides an idea of the exposure duration for the policies in the portfolio. A little over 34% of insureds remained in the portfolio for all three years. For each policy and each year, the number of claims and certain characteristics of the insured are recorded: the gender of the driver (male-female), the age of the driver (three age classes: \(18-22\) years, \(23-30\) years, and \(>30\) years), the vehicle power (three power classes: \(<66\)kW, \(66-110\)kW, and \(>110\)kW), the size of the driver’s city of residence (large, medium, or small based on the number of inhabitants), and the color of the vehicle (red or other). Across the entire portfolio, the average annual frequency is 18.4% (which is well above the European average).

Figures ?? to @ref(fig:hist5) show histograms describing, for each explanatory variable, the distribution of the portfolio among the different levels of the variable and, for each of these levels, the average claim frequency (in \(\%\)).

[Figures: distribution of the portfolio and average claim frequency by driver gender, driver age, vehicle power, city size, and vehicle colour]

These histograms prompt the following comments. Figure ?? shows a slightly lower claim frequency for women (\(17.7\%\) against \(18.8\%\)), who represent \(36\%\) of the insureds in the portfolio. The markedly higher claim frequency of young drivers is evident from Figure ?? (although they are under-represented in the portfolio): claim frequency decreases with age, from \(30.8\%\) to \(20.8\%\) and finally to \(16.3\%\). Regarding vehicle power, Figure ?? shows a lower claim frequency for high-powered vehicles. Examination of Figure ?? reveals that claim frequency is highest in large urban areas and decreases as the size of the urban area decreases. Finally, Figure ?? shows that a red vehicle colour does not appear to be an aggravating factor.

10.12.4 Poisson Regression Assuming Temporal Independence

10.12.4.1 Model

As a first approximation, we will assume that the \(N_{it}\) are independent across values of \(i\) and \(t\). This is, of course, a strong simplifying assumption, whose impact we will assess by comparing the results it yields with those provided by various methods that account for the serial dependence existing between the \(N_{it}\) for fixed \(i\).

We assume that, conditionally on \(\mathbf{x}_{it}\), \(N_{it}\) follows a Poisson distribution with a log-linear mean, i.e., \[\begin{equation} N_{it}\sim\mathcal{P}ois\big(d_{it}\exp(\eta_{it})\big),\hspace{2mm} i=1,2,\ldots,n,\hspace{2mm}t=1,2,\ldots,T_i. \tag{10.36} \end{equation}\] The claim frequency of individual \(i\) during period \(t\) is \(\lambda_{it}=d_{it}\exp(\eta_{it})\).
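As a sketch only, model (10.36) can be fitted under the independence assumption as a Poisson GLM in which the exposure \(d_{it}\) enters through a logarithmic offset; the data below are simulated and the parameter values hypothetical.

```python
# Sketch: Poisson regression (10.36) with log-exposure offset, independence assumed.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_obs = 20_000
X = sm.add_constant(rng.integers(0, 2, size=(n_obs, 2)).astype(float))  # two binary covariates
beta_true = np.array([-1.8, 0.4, 0.2])        # hypothetical parameters
d = rng.uniform(0.25, 1.0, n_obs)             # exposures d_it
y = rng.poisson(d * np.exp(X @ beta_true))    # claim counts n_it

fit = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(d)).fit()
print(fit.params)      # estimates of beta
print(fit.conf_int())  # Wald 95% confidence intervals
```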

10.12.4.2 Parameter Estimation

Let \(n_{it}\) be the number of claims reported by insured \(i\) during period \(t\). The likelihood associated with these observations is then \[ \mathcal{L}(\boldsymbol{\beta}|\mathbf{n})=\prod_{i=1}^n\prod_{t=1}^{T_i} \exp\{-\lambda_{it}\}\frac{\{\lambda_{it}\}^{n_{it}}}{n_{it}!}; \] this is the probability of obtaining the observations made within the portfolio in the considered model (note that \(\mathcal{L}\) is a function of the parameters \(\boldsymbol{\beta}\), with the observations assumed to be known).

The maximum likelihood estimation of \(\boldsymbol{\beta}\) consists of determining \(\widehat{\boldsymbol{\beta}}\) by maximizing \(\mathcal{L}(\boldsymbol{\beta}|\mathbf{n})\): \(\widehat{\boldsymbol{\beta}}\) is therefore the value of the parameter making the observations collected by the actuary most probable. To facilitate the attainment of the maximum, we often switch to the log-likelihood, which is given by \[ L(\boldsymbol{\beta}|\mathbf{n})=\ln \mathcal{L}(\boldsymbol{\beta}|\mathbf{n})=\sum_{i=1}^n\sum_{t=1}^{T_i} \Big\{-\ln n_{it}!+n_{it}(\eta_{it}+\ln d_{it})-\lambda_{it}\Big\}. \] Therefore, \(\widehat{\boldsymbol{\beta}}\) is the solution of the system \[\begin{equation} \frac{\partial}{\partial \beta_0}L(\boldsymbol{\beta}|\mathbf{n})=0 \Leftrightarrow \sum_{i=1}^n\sum_{t=1}^{T_i}n_{it}=\sum_{i=1}^n\sum_{t=1}^{T_i}\lambda_{it} \tag{10.37} \end{equation}\] and for \(j=1,2,\ldots,p\), \[\begin{eqnarray} \frac{\partial}{\partial \beta_j}L(\boldsymbol{\beta}|\mathbf{n})=0 &\Leftrightarrow &\sum_{i=1}^n\sum_{t=1}^{T_i}x_{itj}\Big\{n_{it}-\lambda_{it}\Big\}=0 \tag{10.38} \end{eqnarray}\]

Unsurprisingly, we can interpret the likelihood equations (10.38) as a relation of orthogonality between the explanatory variables \(\mathbf{x}_{it}\) and the estimation residuals.

The variance-covariance matrix \(\boldsymbol{\Sigma}\) of the maximum likelihood estimator \(\widehat{\boldsymbol{\beta}}\) of the parameter \(\boldsymbol{\beta}\) is the inverse of the Fisher information matrix \(\mathcal{I}\). It can be estimated by \[ \widehat{\boldsymbol{\Sigma}}=\left\{\sum_{i=1}^n\sum_{t=1}^{T_i}\mathbf{x}_{it}\mathbf{x}_{it}^\top \widehat{\lambda}_{it}\right\}^{-1}. \]

Under the asymptotic theory of the maximum likelihood method, \(\widehat{\boldsymbol{\beta}}\) is approximately normally distributed with a mean equal to the true parameter value and a variance-covariance matrix \(\widehat{\boldsymbol{\Sigma}}\). This allows for obtaining confidence intervals and regions for the parameters.
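A numpy-only sketch (simulated data, hypothetical parameter values) of these computations is given below: it solves the likelihood equations (10.37)-(10.38) by Newton-Raphson, checks the orthogonality property mentioned above, and computes \(\widehat{\boldsymbol{\Sigma}}\) together with the resulting standard errors.

```python
# Sketch: solving the Poisson likelihood equations by Newton-Raphson and
# estimating the covariance matrix of beta-hat. Simulated data.
import numpy as np

rng = np.random.default_rng(2)
n_obs = 20_000
X = np.column_stack([np.ones(n_obs), rng.integers(0, 2, size=(n_obs, 2))]).astype(float)
d = rng.uniform(0.25, 1.0, n_obs)                      # exposures d_it
y = rng.poisson(d * np.exp(X @ np.array([-1.8, 0.4, 0.2])))

beta = np.zeros(X.shape[1])
for _ in range(25):                                    # Newton-Raphson iterations
    lam = d * np.exp(X @ beta)                         # lambda_it
    score = X.T @ (y - lam)                            # left-hand side of (10.37)-(10.38)
    info = X.T @ (X * lam[:, None])                    # Fisher information: sum x x' lambda
    beta += np.linalg.solve(info, score)

lam = d * np.exp(X @ beta)
print("max |score| at the optimum:", np.abs(X.T @ (y - lam)).max())  # ~0: orthogonality
Sigma_hat = np.linalg.inv(X.T @ (X * lam[:, None]))                  # estimate of Sigma
print("standard errors:", np.sqrt(np.diag(Sigma_hat)))
```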

10.12.4.3 Numerical Illustration

Poisson regression of the number of claims on the five explanatory variables presented above can be performed in SAS. The explanatory variable “vehicle color” is not significant. After removing it from the model, we group the power levels “\(66-110\)kW” and “\(>110\)kW” into a single class. This leads to the selected model, described in Table 10.21.

Table 10.21: Parameter estimation of the gamma regression model for claim costs.

| Parameter | Estimate | Wald 95% Confidence Limits | Pr > ChiSq |
|---|---|---|---|
| Intercept | 6.6389 | (6.5908, 6.6873) | <.0001 |
| FRAC | -0.0402 | (-0.0764, -0.0039) | 0.0296 |
| FRAC (reference level) | 0.0000 | (0, 0) | . |
| GARACCESS | 0.0702 | (0.0342, 0.1062) | 0.0001 |
| GARACCESS (reference level) | 0.0000 | (0, 0) | . |
| AGESGRP | -0.0657 | (-0.1095, -0.0222) | 0.0032 |
| AGESGRP (reference level) | 0.0000 | (0, 0) | . |
| AGGLOM | 0.0826 | (0.0314, 0.1345) | 0.0017 |
| AGGLOM (reference level) | 0.0000 | (0, 0) | . |
| Scale | 0.7294 | (0.7155, 0.7435) | . |

The log-likelihood of the selected model is -19,283.2, and the Type 3 analysis provides the results shown in Table @ref(tab:type3_final). Except for the power variable, all variables are statistically significant, and omitting any of them significantly worsens the model (at the 5% level). We nevertheless decide to keep the power variable, both because of its importance in motor insurance pricing and because its p-value only slightly exceeds the threshold (by 0.93 percentage points). The log-likelihood of the selected model is only slightly worse than that of the unconstrained model (-19,282.6).

(#tab:type3_final) Results of the Type 3 analysis for the final model assuming serial independence

| Source | DF | Chi-Square | Pr > ChiSq |
|---|---|---|---|
| Sex | 1 | 4.74 | 0.0294 |
| Age | 2 | 176.07 | <.0001 |
| Power | 1 | 3.56 | 0.0593 |
| City | 2 | 73.82 | <.0001 |

10.12.4.4 Residual Analysis

Figure @ref(fig:res_pred) shows the individual deviance residuals. A banded structure is visible, reflecting the small number of distinct values taken by \(N_{it}\); the quality of the model therefore cannot be judged from Figure @ref(fig:res_pred) alone. If we recompute the residuals by risk class, we obtain Figure @ref(fig:res_classes). No particular structure emerges, but some residuals take relatively high values, which calls the adequacy of the model into question.

10.12.4.5 Overdispersion

To detect possible overdispersion, we compute, for each risk class, the mean and the empirical variance of the number of claims, \(\widehat{m}_k\) and \(\widehat{\sigma}_k^2\) respectively, and plot the points \(\{(\widehat{m}_k, \widehat{\sigma}_k^2),\hspace{2mm}k=1,2,\ldots\}\) in a graph. The result is shown in Figure ??. Strong overdispersion can be observed for all categories of insureds: the points \((\widehat{m}_k, \widehat{\sigma}_k^2)\) lie well above the line \(y=x\). This also leads us to conclude that the Poisson model with temporal independence is not suitable.

Remark. It is possible to account for the observed overdispersion without acknowledging the potential serial dependence. To do this, one uses either a Poisson mixture model or a quasi-likelihood approach, specifying \[\begin{equation} \mathbb{V}[N_{it}|\boldsymbol{x}_{it}]=\phi\mathbb{E}[N_{it}|\boldsymbol{x}_{it}]=\phi\lambda_{it}. \tag{10.39} \end{equation}\] To visually test the validity of this relationship, we fit the point cloud of Figure ?? with a line through the origin (i.e., with equation \(y=\phi x\)). This yields an estimate of the dispersion parameter \(\phi\) of 1.9122 and a coefficient of determination \(R^2=86.17\%\) (the line explains over 86% of the variability of the point cloud). For comparison, fitting a second-degree curve (such as \(y=x+\gamma x^2\), typical of the mean-variance relationship of a Poisson mixture) would have given \(y=x+2.9545 x^2\) with \(R^2=90.90\%\). A Poisson mixture (such as the negative binomial distribution) could therefore also have been considered. In this section, however, we favour the quasi-likelihood approach. It consists of determining \(\widehat{\boldsymbol{\beta}}\) by solving the system (10.37)-(10.38); \(\widehat{\phi}\) is then obtained by dividing either the deviance or the Pearson statistic by the corresponding number of degrees of freedom. The value of \(\phi\) estimated on our data is 1.35, which reflects the overdispersion of the data.
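A sketch of this mean-variance diagnostic, on hypothetical class means and variances, is given below; it fits both the quasi-Poisson line \(y=\phi x\) and the mixture-type curve \(y=x+\gamma x^2\) by least squares.

```python
# Sketch: fitting y = phi*x and y = x + gamma*x^2 through the cloud of class
# means/variances (m_k, s2_k). The values of m and s2 are hypothetical.
import numpy as np

m  = np.array([0.08, 0.12, 0.15, 0.19, 0.25, 0.31])   # class means m_k
s2 = np.array([0.15, 0.22, 0.30, 0.37, 0.49, 0.63])   # class variances s2_k

phi_hat = np.sum(m * s2) / np.sum(m**2)                # least squares for y = phi x
gamma_hat = np.sum(m**2 * (s2 - m)) / np.sum(m**4)     # least squares for y = x + gamma x^2

def r2(fitted):
    return 1 - np.sum((s2 - fitted)**2) / np.sum((s2 - s2.mean())**2)

print("phi:", phi_hat, "R2:", r2(phi_hat * m))
print("gamma:", gamma_hat, "R2:", r2(m + gamma_hat * m**2))
```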

The introduction of the overdispersion parameter \(\phi\) inflates the variances and covariances of the \(\widehat{\beta}_j\) (which are multiplied by \(\widehat{\phi}\)). This reduces the values of the test statistics used to test whether a given \(\beta_j\) is zero or whether certain variables should be included in the model. Accounting for overdispersion can therefore lead to excluding tariff variables that would have been retained in the pure Poisson model. We observe such an effect on our data: the p-value of the power variable in the Type 3 analysis rises to 10.44%.

10.12.5 Accounting for Temporal Dependence

10.12.5.1 Detection of Serial Dependence

To get a first idea of the type of dependence between \(N_{it}\), one can consider the observations \(N_{it}\), \(t=2,\ldots,T_i\), \(i=1,\ldots,n\), and perform a regression of these on the corresponding explanatory variables \(\boldsymbol{x}_{it}\) as well as the number \(N_{i,t-1}\) of claims observed during the previous coverage period. This will also allow us to see the effect of including past values of the variable of interest in the explanatory variables.

To highlight this dependence, we work with the observations of the last two years available to us. Therefore, we consider the observations \(N_{it}\), \(t=2,3\), \(i=1,\ldots,n\), and perform a regression of these on the explanatory variables \(\boldsymbol{x}_{it}\), to which we add the variable \(N_{i,t-1}\), i.e., the number of claims observed during the previous period. We start with a model containing the 5 explanatory variables already presented and refine it step by step, using Type 3 analysis. We begin by removing the “vehicle color” variable, which has a p-value of \(27.37\%\), and in a second step, we remove the “driver’s gender” variable, whose p-value has become \(21.10\%\). We then obtain the model whose results are presented in Tables @ref(tab:gen_sinprec) and @ref(tab:type3_sinprec). The regression coefficient obtained for the past number of claims is highly significant, indicating serial dependence.

(#tab:gen_sinprec) Results of the regression for the model accounting for past claims

| Variable | Level | Coeff. \(\beta\) | Std. Error | Wald 95% Conf. Limits | Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|
| Intercept |  | -2.0405 | 0.0370 | (-2.1131, -1.9680) | 3041.80 | <.0001 |
| Age | 17-22 | 0.5841 | 0.0983 | (0.3914, 0.7767) | 35.31 | <.0001 |
| Age | 23-30 | 0.1822 | 0.0348 | (0.1140, 0.2503) | 27.41 | <.0001 |
| Age | >30 | 0 | 0 | (0, 0) | . | . |
| Power | >110 kW | -0.0745 | 0.1035 | (-0.2773, 0.1283) | 0.52 | 0.4716 |
| Power | 66-110 kW | 0.0933 | 0.0357 | (0.0233, 0.1633) | 6.83 | 0.0090 |
| Power | <66 kW | 0 | 0 | (0, 0) | . | . |
| City | Large | 0.2201 | 0.0412 | (0.1394, 0.3009) | 28.54 | <.0001 |
| City | Medium | 0.1050 | 0.0413 | (0.0242, 0.1859) | 6.48 | 0.0109 |
| City | Small | 0 | 0 | (0, 0) | . | . |
| \(N_{t-1}\) |  | 0.3113 | 0.0371 | (0.2387, 0.3839) | 70.59 | <.0001 |
(#tab:type3_sinprec) Results of the Type 3 analysis for the model accounting for past claims

| Source | DF | Chi-Square | Pr > ChiSq |
|---|---|---|---|
| Age | 2 | 50.58 | <.0001 |
| Power | 2 | 7.94 | 0.0188 |
| City | 2 | 28.68 | <.0001 |
| \(N_{t-1}\) | 1 | 63.38 | <.0001 |

In a second approach, we start from the frequency obtained under the assumption of independence, without adding the number of claims from the previous year as an explanatory variable. This frequency is then corrected by a multiplicative factor obtained from a Poisson regression on the single variable “number of claims from the previous year” (with the frequency obtained under the independence assumption as an offset). The results of this regression are given in Tables @ref(tab:gen_sinprecmult) and @ref(tab:type3_sinprecmult).

(#tab:gen_sinprecmult) Results of the regression for the model accounting for past claims while fixing the influence of explanatory variables

| Variable | Coeff. \(\beta\) | Std. Error | Wald 95% Conf. Limits | Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | -0.1147 | 0.0180 | (-0.1500, -0.0793) | 40.42 | <.0001 |
| \(N_{t-1}\) | 0.3040 | 0.0370 | (0.2316, 0.3765) | 67.65 | <.0001 |

(#tab:type3_sinprecmult) Results of the Type 3 analysis for the model accounting for past claims while fixing the influence of explanatory variables

| Source | DF | Chi-Square | Pr > ChiSq |
|---|---|---|---|
| \(N_{t-1}\) | 1 | 60.84 | <.0001 |

Remark. It is interesting to note that this approach immediately provides “French-style” bonus-malus coefficients. Indeed, Table @ref(tab:gen_sinprecmult) tells us that policyholders who have not reported any claim during the year will see their premium multiplied by \(\exp(-0.1147)=0.8916\), while those who have reported \(k\) claims will see their premium multiplied by \(\exp(-0.1147+k\times 0.3040)=0.8916\times(1.3553)^k\). It is always interesting to compare these coefficients with those produced by a more orthodox model formulated in terms of latent variables.
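As a quick check, the multipliers of the remark can be recomputed directly from the point estimates of Table @ref(tab:gen_sinprecmult):

```python
# Premium multiplier for a policyholder who reported k claims last year,
# using the point estimates of Table (tab:gen_sinprecmult).
import numpy as np

beta0, beta1 = -0.1147, 0.3040
for k in range(4):
    print(k, "claim(s):", round(float(np.exp(beta0 + k * beta1)), 4))
# k = 0 gives 0.8916; each additional claim multiplies the premium by exp(0.3040) = 1.3553
```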

The data therefore suggest serial dependence, which invalidates the results of the previous section, based on the assumption that the \(N_{it}\) are independent across \(i\) and \(t\). Theoretically, however, it can be shown that the maximum likelihood estimator \(\widehat{\boldsymbol{\beta}}\) computed under the assumption of serial independence (i.e., under a specification error) remains consistent. If the portfolio is sufficiently large, we therefore expect little impact on the point estimates of the various \(\beta_j\). However, the variance of \(\widehat{\boldsymbol{\beta}}\) can no longer be computed as described above, since it is affected by the serial dependence.

10.12.5.2 Parameter Estimation Using GEE Technique

In the presence of serial dependence, one might consider retaining the maximum likelihood estimator of the Poisson model with temporal independence (i.e., the solution of (10.37)-(10.38)), which is justified by its consistency. As shown by (Liang and Zeger 1986), it is possible to improve on this approach, i.e., to obtain estimators whose asymptotic variance is lower than that of the estimators just described. This is the Generalized Estimating Equations (GEE) method proposed by (Liang and Zeger 1986). The estimators provided by this method are consistent; the estimates obtained in this way can therefore be expected to be of good quality, given the large number of observations generally available to actuaries.

The idea is simple: retaining the maximum likelihood estimator \(\widehat{\boldsymbol{\beta}}\), a solution of (10.37)-(10.38), to estimate \(\boldsymbol{\beta}\) in the model with random effects is certainly not optimal since it does not take into account the correlation structure of \(N_{it}\). Let us rewrite the system (10.37)-(10.38) in vector form: \[\begin{equation} \sum_{i=1}^n\boldsymbol{X}_i^\top(\boldsymbol{n}_i-\mathbb{E}[\boldsymbol{N}_i])=\boldsymbol{0}\text{ where }\boldsymbol{X}_i=(\boldsymbol{x}_{i1},\ldots,\boldsymbol{x}_{iT_i})^\top. \tag{10.40} \end{equation}\] The covariance matrix of \(N_{it}\) in the Poisson model with serial independence is \[ \boldsymbol{A}_i=\left( \begin{array}{cccc} \lambda_{i1} & 0 & \cdots & 0\\ 0 & \lambda_{i2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \lambda_{iT_i} \end{array} \right). \] This matrix does not account for overdispersion or serial dependence present in the data. If we explicitly introduce the matrix \(\boldsymbol{A}_i\) into (10.40), we obtain \[\begin{equation} \sum_{i=1}^n\left(\frac{\partial}{\partial\boldsymbol{\beta}} \mathbb{E}[\boldsymbol{N}_i]\right)^\top\boldsymbol{A}_i^{-1}(\boldsymbol{n}_i-\mathbb{E}[\boldsymbol{N}_i]) =\boldsymbol{0} \tag{10.41} \end{equation}\] since \[ \frac{\partial}{\partial\boldsymbol{\beta}}\mathbb{E}[\boldsymbol{N}_i]=\boldsymbol{A}_i\boldsymbol{X}_i. \]

The principle of GEE is to replace \(\boldsymbol{A}_i\) in (10.41) with a more reasonable candidate for the variance-covariance matrix of \(\boldsymbol{N}_i\), where “reasonable” means accounting for overdispersion and temporal correlation. Let us specify a plausible form for the covariance matrix of \(\boldsymbol{N}_i\): we could consider \[ \boldsymbol{V}_i=\phi\boldsymbol{A}_i^{1/2}\boldsymbol{R}_i(\boldsymbol{\alpha})\boldsymbol{A}_i^{1/2} \] where the correlation matrix \(\boldsymbol{R}_i(\boldsymbol{\alpha})\) accounts for the serial dependence between the components of \(\boldsymbol{N}_i\) and depends on a set of parameters \(\boldsymbol{\alpha}\). The matrix \(\boldsymbol{R}_i\) is a square sub-matrix of dimension \(T_i\times T_i\) of a matrix \(\boldsymbol{R}\) of dimension \(T_{\max}\times T_{\max}\) whose elements do not depend on the characteristics \(\boldsymbol{x}_{it}\) of individual \(i\). Overdispersion is accounted for since \(\mathbb{V}[N_{it}]=\phi\lambda_{it}\). Note that the matrix \(\boldsymbol{V}_i\) defined in this way is the covariance matrix of \(\boldsymbol{N}_i\) only if \(\boldsymbol{R}_i(\boldsymbol{\alpha})\) is the correlation matrix of \(\boldsymbol{N}_i\), which is not necessarily the case.

As mentioned above, the idea is to substitute the matrix \(\boldsymbol{V}_i\) for \(\boldsymbol{A}_i\) in (10.41), and to retain as the estimate of \(\boldsymbol{\beta}\) the solution of \[\begin{equation} \sum_{i=1}^n\left(\frac{\partial}{\partial\boldsymbol{\beta}}\mathbb{E}[\boldsymbol{N}_i]\right)^\top \boldsymbol{V}_i^{-1}(\boldsymbol{n}_i-\mathbb{E}[\boldsymbol{N}_i]) =\boldsymbol{0}. \tag{10.42} \end{equation}\] This last equation again expresses orthogonality between the regression residuals and the explanatory variables. The estimators obtained in this way are consistent regardless of the choice of the matrix \(\boldsymbol{R}_i(\boldsymbol{\alpha})\); they are all the more precise the closer \(\boldsymbol{R}_i(\boldsymbol{\alpha})\) is to the true correlation matrix of \(\boldsymbol{N}_i\).

10.12.6 Modeling Dependence Using the “Working Correlation Matrix”

As we understood from the above, it is the correlation matrix \(\boldsymbol{R}_i\) that takes into account the dependence between observations related to the same insured individual. This matrix of dimension \(T_i\times T_i\) is called the “working correlation matrix.” It is a correlation matrix of a specified form depending on a number of parameters contained in the vector \(\boldsymbol{\alpha}\).

If \(\boldsymbol{R}_i(\boldsymbol{\alpha})\) is the identity matrix, (10.42) reduces exactly to the likelihood equations (10.40) obtained under the assumption of independence.

In general, in the context of a priori ratemaking, one specifies a matrix \(\boldsymbol{R}_i(\boldsymbol{\alpha})\) reflecting an autoregressive-type structure: the diagonal elements of \(\boldsymbol{R}_i\) are 1 and, off the diagonal, the element \((j,k)\) is \(\alpha_{|j-k|}\) for \(|j-k|\leq m\) and 0 for \(|j-k|>m\). We take \(m=T_{\max}-1\). The components of the vector \(\boldsymbol{\alpha}\) parameterizing the matrix \(\boldsymbol{R}_i(\boldsymbol{\alpha})\), which describes the type of dependence between the data, are estimated from the observations.
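For instance, with \(T_i=T_{\max}=3\) and \(m=2\), the working correlation matrix takes the form \[ \boldsymbol{R}_i(\boldsymbol{\alpha})=\left( \begin{array}{ccc} 1 & \alpha_1 & \alpha_2\\ \alpha_1 & 1 & \alpha_1\\ \alpha_2 & \alpha_1 & 1 \end{array} \right), \] which is precisely the form of the matrix estimated in the numerical illustration below.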

10.12.7 Obtaining Estimates

Equation (10.42) is generally solved using a modified Fisher score method for \(\boldsymbol{\beta}\) and a moment estimation for \(\boldsymbol{\alpha}\) (we refer the reader to (Liang and Zeger 1986) for a complete description of the method). Specifically, starting with an initial value \(\widehat{\boldsymbol{\beta}}^{(0)}\) that solves the system (10.37)-(10.38), we calculate \[\begin{eqnarray*} \widehat{\boldsymbol{\beta}}^{(j+1)}&=&\widehat{\boldsymbol{\beta}}^{(j)}+\left\{\sum_{i=1}^n\boldsymbol{D}_i^\top(\widehat{\boldsymbol{\beta}}^{(j)}) \boldsymbol{V}_i^{-1}(\widehat{\boldsymbol{\beta}}^{(j)},\boldsymbol{\alpha}(\widehat{\boldsymbol{\beta}}^{(j)}))\boldsymbol{D}_i(\widehat{\boldsymbol{\beta}}^{(j)})\right\}^{-1}\\ &&\hspace{20mm}\left\{\sum_{i=1}^n\boldsymbol{D}_i^\top(\widehat{\boldsymbol{\beta}}^{(j)}) \boldsymbol{V}_i^{-1}(\widehat{\boldsymbol{\beta}}^{(j)},\boldsymbol{\alpha}(\widehat{\boldsymbol{\beta}}^{(j)}))\boldsymbol{S}_i(\widehat{\boldsymbol{\beta}}^{(j)})\right\} \end{eqnarray*}\] where \(\boldsymbol{D}_i(\boldsymbol{\beta})=\frac{\partial}{\partial\boldsymbol{\beta}}\mathbb{E}[\boldsymbol{N}_i]\) and \(\boldsymbol{S}_i(\boldsymbol{\beta})=\boldsymbol{N}_i-\mathbb{E}[\boldsymbol{N}_i]\). At each step, \(\phi\) and \(\boldsymbol{\alpha}\) are re-estimated from the Pearson residuals \(r_{it}^P=\frac{n_{it}-\lambda_{it}}{\sqrt{\lambda_{it}}}\) using the formulas \[ \widehat{\phi}=\frac{1}{\sum_{i=1}^nT_i-p}\sum_{i=1}^n\sum_{t=1}^{T_i}\{r_{it}^P\}^2 \] and \[ \widehat{\alpha}_\tau=\frac{1}{\widehat{\phi}\left(\sum_{i|T_i>\tau}(T_i-\tau)-p\right)} \sum_{i|T_i>\tau}\sum_{t=1}^{T_i-\tau}r_{it}^Pr_{it+\tau}^P. \]
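Statistical software implements this iteration. As a sketch only (using the Python statsmodels package on simulated data, not the implementation behind the results reported here, and with an exchangeable working correlation for simplicity instead of the autoregressive-type structure described above), GEE estimation can be carried out as follows; all data and variable names are hypothetical.

```python
# Sketch: GEE Poisson regression on simulated panel data with a shared frailty
# per policy (inducing serial dependence). Exchangeable working correlation.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_pol, T = 5_000, 3
df = pd.DataFrame({
    "policy": np.repeat(np.arange(n_pol), T),
    "young":  np.repeat(rng.integers(0, 2, n_pol), T),   # covariate, constant per policy
    "d":      rng.uniform(0.5, 1.0, n_pol * T),          # exposures d_it
})
theta = rng.gamma(2.0, 0.5, n_pol)                        # frailty with mean 1
lam = df["d"] * np.exp(-1.7 + 0.5 * df["young"]) * np.repeat(theta, T)
df["nclaims"] = rng.poisson(lam)

X = sm.add_constant(df[["young"]])
gee = sm.GEE(df["nclaims"], X, groups=df["policy"],
             family=sm.families.Poisson(),
             cov_struct=sm.cov_struct.Exchangeable(),
             offset=np.log(df["d"]))
res = gee.fit()
print(res.summary())
print(gee.cov_struct.summary())   # estimated working correlation
```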

10.12.8 Numerical Illustration

The serial dependence of the \(N_{it}\) for fixed \(i\) has clearly been demonstrated above, and it is important to assess the impact of the independence assumption on frequency estimation. The GEE approach can be implemented in SAS. A variable selection, based as before on Type 3 analysis, leads us to retain the same variables as in the model assuming independence. The results are shown in Tables @ref(tab:gee_unstr) and @ref(tab:type3_gee_unstr). The estimated “working correlation matrix”, with an autoregressive-type structure of order \(T_{\max}-1=2\), is \[ \left( \begin{array}{ccc} 1 & 0.0493 & 0.0462\\ 0.0493 & 1 & 0.0493\\ 0.0462 & 0.0493 & 1\\ \end{array} \right) \] and \(\widehat{\phi}=1.3437\).

[Tables @ref(tab:gee_unstr) and @ref(tab:type3_gee_unstr): GEE parameter estimates and Type 3 analysis]

10.13 Technical Justifications for Segmentation

10.13.1 Technical Rate and Commercial Rate

One of the fundamental tasks of an actuary is to evaluate, on the one hand, the pure premium, which is the contribution of each policyholder that enables the insurer to compensate for claims, and, on the other hand, to assess the security loading that ensures the stability of the company’s results. The concept of pure premium has been discussed in Chapter 3, while the determination of security loadings was studied in Chapter 4. In this regard, the actuary will conduct as detailed a technical analysis as possible of the risk to be covered and will construct the technical rate. The technical rate provides the cost price of the coverage granted by the insurer, based on the risk profile of the insured.

The technical rate is for purely internal use. The company will apply the commercial rate to policyholders, which can significantly differ from the technical rate. While the commercial rate may be based on the technical rate, regulatory considerations or factors related to the company’s market positioning, as well as a desire to simplify the rate grid, can lead to commercial premiums that are sometimes very different from the technical rate. Two rate grids coexist: one purely technical and for internal use only, resulting from a detailed analysis carried out by the actuary, and the other commercial, describing the amounts that will actually be charged to policyholders.

10.13.2 Segmentation of Technical and Commercial Rates

At the technical level, the actuary can either finely segment the portfolio or use mixture models to account for the heterogeneity resulting from the absence of segmentation. Let us now turn our attention to the commercial rate, where certain imperfections in the market may require the actuary to segment.

Over the last decade, commercial rates that were once nearly uniform have gradually been differentiated according to the profile of policyholders. This phenomenon is often referred to as segmentation (see Section 3.8). Segmentation is not limited to rate differentiation: it also includes the risk selection that the insurer performs when concluding the contract (acceptance) or during the contract (termination).

The purpose of this section is neither to justify nor to criticize the principle of segmentation at the commercial level. In this matter, contractual freedom should prevail, and it is up to the state to regulate the market if necessary. We only want to show that most rate choices made by companies result from commercial or competitive considerations and are not imposed by actuarial science. Only adverse selection by policyholders can compel an insurer to differentiate its rates. This criterion then appears both in the technical rate and in the commercial rate (though perhaps not in the same way).

The reader should keep in mind that rate differentiation has no impact on claims experience. Therefore, regardless of the company’s pricing policy, the pure premium for the portfolio must remain the same (since it corresponds to the expected cost of claims). Hence, any reduction in premium for one category of policyholders necessarily leads to a corresponding increase in premiums paid by other categories of policyholders.

Excessive rate differentiation can have disastrous effects on an insurance market. Indeed, such differentiation significantly limits the insurability of risks: by decreasing (often only moderately) the premium for certain categories of policyholders, it increases the premium for others, sometimes to the point of excluding them from the insurance market. The example of automobile liability insurance is instructive in this regard.

The increasing level of premium differentiation observed in the market is mainly explained by the spiral of segmentation: it is competition that drives companies to differentiate premiums more and more. Indeed, an insurer can hardly maintain a uniform rate in a market where competitors differentiate risks. Theoretically, as soon as a market participant introduces a new criterion for rate differentiation, its competitors are obliged to implicitly or explicitly recognize it in their rates, as we explained in detail in Volume I. This is why we have witnessed increasingly fine rate differentiation in recent years, particularly in automobile liability insurance. Therefore, it is the spiral of segmentation induced by fierce competition among market participants that leads to excessive segmentation, and not any actuarial prescription.

10.13.3 Adverse Selection and Segmentation

One characteristic of private insurance lies in the freedom to contract for both parties involved: the policyholder and the insurer. Therefore, the prospective policyholder will choose the insurance contract that they consider most attractive. Furthermore, the prospective policyholder has a certain advantage in terms of information compared to an insurer that does little segmentation. The prospective policyholder knows their own situation very precisely, whereas the insurer has no knowledge of certain risk-aggravating factors that may be present. This asymmetric information leads to adverse selection from policyholders: policyholders whose profile is riskier tend to subscribe to the policies offered by the company in large numbers, thereby deteriorating the insurer’s claims experience.

Adverse selection from policyholders has the effect of limiting access to insurance. To understand this, consider, for example, coverage against a dreaded disease, M. If the population includes individuals predisposed to this disease and others who are not, without any of them knowing their predisposition, the insurer may cover them all based on an average premium. The policy taken out can then be seen as a multi-guarantee contract, which first covers the risk of being predisposed to disease M and then covers the cost of treating it when it occurs.

Suppose, for example, that 10% of the population is predisposed to disease M. Specifically, in the case of predisposition, the probability of developing disease M is 2%, while it is only 0.1% in the absence of predisposition. This predisposition can be detected through a genetic test, but its use is prohibited for the insurer and too costly for the insured. The cost of treating disease M is €10,000. The pure premium for coverage against this disease is:

\[ 10\%\times 2\% \times €10,000 + 90\% \times 0.1\% \times €10,000 = €20 + €9 = €29. \]

Upon closer examination of this example, it can be seen that uniform pricing actually covers two distinct risks: first, the risk of being predisposed to disease M, and second, the risk of developing the disease. Formally, each policyholder pays a premium of €10 as if they were not predisposed, and adds to that a premium of €19 that covers the risk of being predisposed to disease M.
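The decomposition can be checked directly: an insured known not to be predisposed would pay \(0.1\%\times €10,000=€10\), while the remaining amount is exactly the expected extra cost attached to a possible predisposition, \[ 10\%\times(2\%-0.1\%)\times €10,000=€19. \]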

It is also evident that this approach allows for risk coverage at financially acceptable conditions for all individuals, whether predisposed or not. This perspective, equating the policies offered by an insurer that does not segment with multi-guarantee contracts, was developed by insurance economists, including (Chiappori 1997). It justifies the higher premium for “good” risks who are unaware of their status by offering broader coverage.

Consider a portfolio of 10,000 policyholders whose composition is similar to that of the population. Individuals are unaware of their potential predisposition. This portfolio therefore includes, on average, 1,000 policyholders predisposed to disease M and 9,000 who are not. Among the former, an average of 20 will develop disease M, resulting in a cost of €200,000 for the insurer. Among the latter, an average of 9 will develop disease M, resulting in a cost of €90,000 for the insurer. The total cost of €290,000 on average will be offset by the collection of 10,000 premiums of €29 each.

10.13.4 Inequity of Prior Pricing

An insurer who wishes to use a segmentation criterion must be able to demonstrate, with statistical evidence, the causal link between this criterion and the variations in claims experience that it is supposed to induce. Apart from the fact that establishing a causal link resembles the quest for the Holy Grail for a statistician (except in the case of a designed experiment with controlled factors), this requirement discredits many segmentation criteria currently in use. Who could believe that cohabitation or having several children improves the quality of driving? It is more likely that marital life or family responsibilities encourage caution and a refusal to take unnecessary risks, resulting in less recklessness behind the wheel, which in turn reduces the risk in automobile liability insurance. There is no causal relationship between these pricing variables and automobile claims. The truly relevant risk factors are hidden (aggressiveness at the wheel, adherence to traffic rules, alcohol consumption, annual mileage, etc.) and therefore cannot be incorporated into the commercial rate.

However, there is a viable solution in automobile insurance: the PAYD (Pay As You Drive) system developed on both sides of the Atlantic by Norwich Union in the UK and Progressive in the USA. This is a decidedly innovative approach to automobile insurance, made possible by the latest technologies. A “black box” is installed in the insured vehicle, recording speed, accelerations and decelerations, kilometers traveled, and the type of road taken. Privacy is perfectly protected because the data are filtered before reaching the insurer, making it impossible for the insurer to reconstruct the journeys from the summary it receives. This summary nevertheless allows the insurer to assess driving quality very accurately: comparing recorded speeds with the maximum speeds allowed on each type of road shows how well policyholders respect the speed limits imposed by traffic regulations, accelerations and decelerations indicate how smooth or aggressive the driving is, and so on. The insurer thus has the relevant information to price the contract accurately.

This approach also promotes road safety and would undoubtedly help reduce the heavy social cost of road accidents. In addition to calculating the premium, the system offers additional benefits, such as the possible location of the vehicle by the police in case of theft, alerting emergency services in case of an accident, and more.

The cost of the system is reasonable (around €500 for the first year, covering the acquisition of the equipment and its installation, and limited maintenance costs in subsequent years). The system is currently being tested by 5,000 drivers in the UK and another 5,000 in the USA. Other experiments are ongoing, including in Italy. Note that in Singapore, a similar system is used to bill monthly tolls based on the hours during which motorists used the roads and the types of roads taken.

10.14 Bibliographical Notes

Regarding scores, the reader can refer to (Gourieroux 1992) or (Gouriéroux 1999) and (Bardos 2001). Standard approaches to data (multivariate) analysis are presented in (Lebart, Morineau, and Piron 1995). Methods of classification are presented in a clear and pedagogical manner in (Nakache and Confais 2004). The theory of generalized linear models dates back to (Nelder and Wedderburn 1972) and has been thoroughly covered by (McCullagh and Nelder 1989). An excellent introduction is provided by (Dobson 2001).

We also recommend readers to consult (Bailey 1963) and (Bailey and Simon 1960) regarding the origins of risk classification, as well as (Anderson et al. 2004) regarding the practice of generalized linear models in pricing. Notable pioneers of regression models in pricing include (Ter Berg 1980), (Berg 1980), (Albrecht 1983), and especially (Renshaw 1994).

The treatment of severe claims can be done using extreme value theory, which is discussed later in this book. You can read (Cebrián, Denuit, and Lambert 2003) for a practical case.

We have not discussed the problem of geographical zone pricing here. An accessible introduction, along with a practical case study, can be found in (Brouhns, Denuit, and Masuy 2002). (Denuit and Lang 2004) review the various approaches used in a priori pricing and propose an integrated pricing model based on generalized additive models in the Bayesian paradigm, in which all rating factors, whether categorical, continuous, spatial, or temporal, are treated in a unified manner.

Empirical studies are relatively scarce in the literature. In addition to those mentioned earlier, you can refer to (Ramlau-Hansen 1988) and (Beirlant et al. 1992). For other regression models not covered in this chapter, see (Beirlant et al. 1998), (Cummins et al. 1990), (Ter Berg 1996), and (Keiding, Andersen, and Fledelius 1998).

Finally, note that the models described in this chapter are also very useful in life insurance for constructing mortality tables. See (Delwarde and Denuit 2005) for more details.

10.15 Exercises

Exercise 10.1 Given independent and identically distributed observations \(X_1, X_2, \ldots, X_n\), let \(X_{(1)} \leq X_{(2)} \leq \ldots \leq X_{(n)}\) be the observations arranged in ascending order.

  1. Define the event \[ A(i_1, i_2) = \left[X_{(i_1)} < q_p < X_{(i_2)}\right], \] which means that at least \(i_1\) observations are less than the \(p\)th quantile \(q_p\), and at least \(n - i_2 + 1\) observations are greater than it. Show that \[\begin{eqnarray*} \Pr[A(i_1, i_2)] &=& \sum_{j=i_1}^{i_2-1}\binom{n}{j}p^j(1-p)^{n-j}. \end{eqnarray*}\]
  2. Show that a confidence interval at the \(1-\alpha\) level for \(q_p\) is \(]x_{(i_1)}, x_{(i_2)}[\), where the integers \(i_1\) and \(i_2\) satisfy the relation \[ 1 - \alpha = \Pr[i_1 \leq \text{Binomial}(n, p) < i_2]. \]
  3. When \(n\) is large, show that \(i_1\) and \(i_2\) are approximately given by \[ i_1 = \lfloor -z_{\alpha/2}\sqrt{np(1-p)} + np \rfloor \] and \[ i_2 = \lceil z_{\alpha/2}\sqrt{np(1-p)} + np \rceil, \] where \(z_{\alpha/2}\) is such that \(\Phi(z_{\alpha/2}) = 1 - \alpha/2\), and \(\lfloor x\rfloor\) (resp. \(\lceil x\rceil\)) denotes the floor (resp. the ceiling) of the real number \(x\).

Exercise 10.2 Show that for the Poisson distribution with mean \(\lambda\), \[\begin{eqnarray*} \mathcal{I}(\widetilde{\lambda}|\lambda) &=& \widetilde{\lambda}\left(\ln\frac{\widetilde{\lambda}}{\lambda} - 1\right) + \lambda. \end{eqnarray*}\]

Exercise 10.3 Show that for the gamma distribution, \[\begin{eqnarray*} \mathcal{I}(\widetilde{\lambda}|\lambda) &=& \frac{3(\widetilde{\lambda}^{1/3} - \lambda^{1/3})}{\lambda^{1/3}}. \end{eqnarray*}\]

Exercise 10.4 Consider the simple linear model \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\), denoted as \((1)\), where \(\varepsilon_i\) are independent and identically distributed with \(\mathcal{N}or(0,\sigma^2)\).

  1. Let \(Z_i = Y_i - X_i\), \(\delta_0 = \beta_0\), and \(\delta_1 = \beta_1 - 1\), so \(Z_i = \delta_0 + \delta_1 X_i + \varepsilon_i\), denoted as \((2)\), is equivalent to model \((1)\). Compare the \(R^2\) obtained in both models.
  2. Conclude that by regressing \(Y - X\) on \(X\) instead of \(Y\) on \(X\), it is sometimes possible to artificially increase the value of \(R^2\).

Exercise 10.5

  1. Company A has 8,000 experienced drivers and 2,000 novice drivers in its portfolio. They have reported 400 and 200 claims, respectively. Estimate (using maximum likelihood in a Poisson regression model) the scores associated with these two categories of insureds.
  2. Two companies share the market: Company A and a competitor, Company B, which does not segment its tariff a priori. Explain what should happen in the market assuming rational insureds.

Exercise 10.6 In order to explain the annual claims frequency of insured individuals based on their observable characteristics, the actuary in charge of pricing Motor Third Party Liability insurance at a major insurance company performed a Poisson regression. The number \(N_i\) of claims caused by insured individual \(i\) is assumed to follow a Poisson distribution with a mean of \(d_i\exp(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3})\), where \(d_i\) is the coverage duration and \[ x_{i1}=\left\{ \begin{array}{l} 1,\text{ if insured individual $i$ is female},\\ 0,\text{ otherwise}, \end{array} \right. \] \[ x_{i2}=\left\{ \begin{array}{l} 1,\text{ if insured individual $i$ pays their premium in installments},\\ 0,\text{ if they pay it annually}, \end{array} \right. \] and \[ x_{i3}=\left\{ \begin{array}{l} 1,\text{ if insured individual $i$ drives a diesel vehicle},\\ 0,\text{ if they drive a gasoline vehicle}. \end{array} \right. \]

The maximum likelihood estimates obtained are described in the following table:

In this table, you can see the point estimates of the regression coefficients and their corresponding standard errors.

  1. What equations are \(\widehat{\beta}_0\), \(\widehat{\beta}_1\), \(\widehat{\beta}_2\), and \(\widehat{\beta}_3\) solutions to? What do they guarantee in terms of the distribution of claims between different categories of insureds?
  2. Do all risk factors significantly influence claims frequency?
  3. What is the annual claims frequency of the reference insured individual? Provide a point estimate and a 95% confidence interval.
  4. What is the estimated annual claims frequency of a female policyholder paying her premium annually and driving a gasoline vehicle? Provide a 95% confidence interval if the coefficient of linear correlation between \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) is estimated to be 0.1.
  5. What is the estimated annual claims frequency of a female policyholder who pays her premium in installments and drives a gasoline vehicle?
  6. To detect possible overdispersion, a graph is plotted showing the mean and empirical variance of the number of claims observed within each class. What curve should the scatterplot be fitted to in order to support the mixed Poisson distribution hypothesis? Explain and justify.

Exercise 10.7 In order to explain the annual claims frequency of insured individuals based on their observable characteristics, the actuary in charge of Motor Third Party Liability pricing at a major insurance company performed a Poisson regression on a dataset of 100,000 policies observed over one year, using driving experience (novice/experienced) and vehicle power (small/medium/large engine) as factors. The number \(N_i\) of claims caused by insured individual \(i\) is assumed to follow a Poisson distribution with a mean of \(d_i\exp(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3})\), where \(d_i\) is the coverage duration and \[ x_{i1}=\left\{ \begin{array}{l} 1,\text{ if insured individual $i$ is a novice driver},\\ 0,\text{ otherwise}, \end{array} \right. \] \[ x_{i2}=\left\{ \begin{array}{l} 1,\text{ if insured individual $i$ drives a vehicle with a small engine},\\ 0,\text{ otherwise}, \end{array} \right. \] and \[ x_{i3}=\left\{ \begin{array}{l} 1,\text{ if insured individual $i$ drives a vehicle with a large engine},\\ 0,\text{ otherwise}. \end{array} \right. \]

The maximum likelihood estimates obtained are as follows: \[ \widehat{\beta}_0=-2.10, \quad \widehat{\beta}_1=0.06, \quad \widehat{\beta}_2=-0.03, \quad \text{and} \quad \widehat{\beta}_3=0.05. \] The standard errors are also provided: \[ \widehat{\sigma}\big(\widehat{\beta}_0\big)=0.03, \quad \widehat{\sigma}\big(\widehat{\beta}_1\big)=0.02, \quad \widehat{\sigma}\big(\widehat{\beta}_2\big)=0.01, \quad \text{and} \quad \widehat{\sigma}\big(\widehat{\beta}_3\big)=0.01. \]

  1. What equations are the point estimates provided above solutions to? Provide an actuarial interpretation.
  2. Test the hypotheses \(\beta_j=0\) for \(j=0,1,2,3\), and interpret the results.
  3. Estimate the annual claims frequency of the reference insured individual and provide a confidence interval.
  4. Do the same for the annual claims frequency of a novice driver with a medium-engine vehicle if \(\widehat{\mathbb{C}[\widehat{\beta}_0,\widehat{\beta}_1]}=0.0001\).

Exercise 10.8 The cost of claims caused by insured individual \(i\) can be expressed as \[ S_i=\sum_{k=1}^{N_i}C_{ik}, \] where \(N_i\) follows a Poisson distribution with mean \(\lambda_i\), and \(C_{ik}\) follows a Gamma distribution with mean \(\mu_i\) and variance \(\mu_i^2/\nu\). All random variables involved are independent.

The actuary in charge of pricing uses gender (male/female) and the insured’s location (rural/urban), coded using two binary variables, with “male” and “rural” as the reference levels. The actuary incorporates this information as follows: \[ \lambda_i=d_i\exp(\beta_0^{\text{freq}}+\beta_1^{\text{freq}}x_{i1}+\beta_2^{\text{freq}}x_{i2}), \] with \(d_i\) as the exposure duration, and \[ \mu_i=\exp(\beta_0^{\text{cost}}+\beta_1^{\text{cost}}x_{i1}+\beta_2^{\text{cost}}x_{i2}). \] The point estimates of these parameters obtained by maximum likelihood estimation based on observations from a large portfolio are as follows. For frequencies: \[ \widehat{\beta}_0^{\text{freq}}=-1.95, \quad \widehat{\beta}_1^{\text{freq}}=0.03, \quad \text{and} \quad \widehat{\beta}_2^{\text{freq}}=0.05, \] and for costs: \[ \widehat{\beta}_0^{\text{cost}}=6.64, \quad \widehat{\beta}_1^{\text{cost}}=-0.05, \quad \text{and} \quad \widehat{\beta}_2^{\text{cost}}=-0.07, \] with \(\widehat{\nu}=0.75\). Standard errors are also provided. For frequencies: \[ \widehat{\sigma}\big(\widehat{\beta}_0^{\text{freq}}\big)=0.01, \quad \widehat{\sigma}\big(\widehat{\beta}_1^{\text{freq}}\big)=0.01, \quad \text{and} \quad \widehat{\sigma}\big(\widehat{\beta}_2^{\text{freq}}\big)=0.02, \] and for costs: \[ \widehat{\sigma}\big(\widehat{\beta}_0^{\text{cost}}\big)=0.02, \quad \widehat{\sigma}\big(\widehat{\beta}_1^{\text{cost}}\big)=0.01, \quad \text{and} \quad \widehat{\sigma}\big(\widehat{\beta}_2^{\text{cost}}\big)=0.02. \]

  1. Describe the iterative algorithm that was used to obtain the point estimates \(\widehat{\beta}_0^{\text{freq}}\), \(\widehat{\beta}_1^{\text{freq}}\), and \(\widehat{\beta}_2^{\text{freq}}\) provided above.
  2. What equations are the point estimates \(\widehat{\beta}_0^{\text{cost}}\), \(\widehat{\beta}_1^{\text{cost}}\), and \(\widehat{\beta}_2^{\text{cost}}\) solutions to? Provide an actuarial interpretation.
  3. Estimate the pure premium \(\mathbb{E}[S_i]\) and the variance \(\mathbb{V}[S_i]\) for the different rate classes.

Postface

Albrecht, Peter. 1983. “Parametric Multiple Regression Risk Models: Connections with Tariffication, Especially in Motor Insurance.” Insurance: Mathematics and Economics 2 (2): 113–17.
Anderson, Duncan, Sholom Feldblum, Claudine Modlin, Doris Schirmacher, Ernesto Schirmacher, and Neeza Thandi. 2004. “A Practitioner’s Guide to Generalized Linear Models.” Casualty Actuarial Society Discussion Paper Program 11 (3): 1–116.
Bailey, Robert A. 1963. “Insurance Rates with Minimum Bias.” In Proceedings of the Casualty Actuarial Society, 50:4–14. 93.
Bailey, Robert A, and LeRoy J Simon. 1960. “Two Studies in Automobile Insurance Ratemaking.” ASTIN Bulletin: The Journal of the IAA 1 (4): 192–217.
Bardos, Mireille. 2001. Analyse Discriminante: Application Au Risque Et Scoring Financier. Dunod.
Beirlant, Jan, V Derveaux, Anna Maria De Meyer, MJ Goovaerts, E Labie, and B Maenhoudt. 1992. “Statistical Risk Evaluation Applied to (Belgian) Car Insurance.” Insurance: Mathematics and Economics 10 (4): 289–302.
Beirlant, Jan, Yuri Goegebeur, Robert Verlaak, and Petra Vynckier. 1998. “Burr Regression and Portfolio Segmentation.” Insurance: Mathematics and Economics 23 (3): 231–50.
Berg, Peter ter. 1980. “Two Pragmatic Approaches to Loglinear Claim Cost Analysis.” ASTIN Bulletin: The Journal of the IAA 11 (2): 77–90.
Besson, Jean-Luc, and C. Patrat. 1992. “Trend Et Systèmes de Bonus-Malus.” ASTIN Bulletin: The Journal of the IAA 22 (1): 11–31.
Brouhns, Natacha, Michel Denuit, and Bernard Masuy. 2002. “Ratemaking by Geographical Area: A Case Study Using the Boskov and Verrall Model.” Publications of the Institut de Statistique, Louvain-La-Neuve, 1–26.
Cebrián, Ana C, Michel Denuit, and Philippe Lambert. 2003. “Generalized Pareto Fit to the Society of Actuaries’ Large Claims Database.” North American Actuarial Journal 7 (3): 18–36.
Celeux, Gilles, and Jean-Pierre Nakache. 1994. Analyse Discriminante Sur Variables Qualitatives. Polytechnica.
Chiappori, Pierre-André. 1997. Risque Et Assurance. Flammarion.
Cleveland, William S. 1979. “Robust Locally Weighted Regression and Smoothing Scatterplots.” Journal of the American Statistical Association 74 (368): 829–36.
Cleveland, William S, and Susan J Devlin. 1988. “Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting.” Journal of the American Statistical Association 83 (403): 596–610.
Cleveland, William S, and Eric Grosse. 1991. “Computational Methods for Local Regression.” Statistics and Computing 1: 47–62.
Cornillon, Pierre-André, and Eric Matzner-Løber. 2007. “La régression Linéaire Simple.” Régression: Théorie Et Applications, 1–32.
Cummins, J David, Georges Dionne, James B McDonald, and B Michael Pritchett. 1990. “Applications of the GB2 Family of Distributions in Modeling Insurance Loss Processes.” Insurance: Mathematics and Economics 9 (4): 257–72.
Delwarde, Antoine, and Michel Denuit. 2005. Construction de Tables de Mortalité périodiques Et Prospectives. Economica.
Denuit, Michel, and Stefan Lang. 2004. “Non-Life Rate-Making with Bayesian GAMs.” Insurance: Mathematics and Economics 35 (3): 627–47.
Denuit, Michel, Jean-François Walhin, and Sandra Pitrebois. 2003. “Tarification Automobile Sur Données de Panel.” Bulletin of the Swiss Association of Actuaries, 51.
Dobson, Annette J. 2001. An Introduction to Generalized Linear Models. CRC press.
Fahrmeir, Ludwig, Gerhard Tutz, Wolfgang Hennevogl, and Eliane Salem. 1994. Multivariate Statistical Modelling Based on Generalized Linear Models. Vol. 425. Springer.
Gourieroux, Christian. 1992. “Courbes de Performance, de Sélection Et de Discrimination.” Annales d’économie Et de Statistique, 107–23.
Gourieroux, Christian, Alain Monfort, and Alain Trognon. 1984. “Pseudo Maximum Likelihood Methods: Theory.” Econometrica, 681–700.
Gouriéroux, Christian. 1999. “Statistique de l’assurance.” Economica, Paris.
Hastie, Trevor, and Robert Tibshirani. 1990. Generalized Additive Models. Chapman; Hall.
Hurvich, Clifford M, Jeffrey S Simonoff, and Chih-Ling Tsai. 1998. “Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion.” Journal of the Royal Statistical Society Series B: Statistical Methodology 60 (2): 271–93.
Ingenbleek, Jean-Francois, and Jean Lemaire. 1988. “What Is a Sports Car?” ASTIN Bulletin: The Journal of the IAA 18 (2): 175–87.
Jørgensen, Bent, and Marta C Paes De Souza. 1994. “Fitting Tweedie’s Compound Poisson Model to Insurance Claims Data.” Scandinavian Actuarial Journal 1994 (1): 69–93.
Keiding, Niels, Christian Andersen, and Peter Fledelius. 1998. “The Cox Regression Model for Claims Data in Non-Life Insurance.” ASTIN Bulletin: The Journal of the IAA 28 (1): 95–118.
Lebart, Ludovic, Alain Morineau, and Marie Piron. 1995. Statistique Exploratoire Multidimensionnelle. Dunod Paris.
Liang, Kung-Yee, and Scott L Zeger. 1986. “Longitudinal Data Analysis Using Generalized Linear Models.” Biometrika 73 (1): 13–22.
McCullagh, Peter, and J. A. Nelder. 1989. Generalized Linear Models. Chapman & Hall.
Monfort, Alain. 1982. Cours de Statistique Mathématique. Economica.
Nakache, Jean-Pierre, and Josiane Confais. 2004. Approche Pragmatique de La Classification: Arbres Hiérarchiques, Partitionnements. Editions Technip.
Nelder, John Ashworth, and Robert WM Wedderburn. 1972. “Generalized Linear Models.” Journal of the Royal Statistical Society Series A: Statistics in Society 135 (3): 370–84.
Ramlau-Hansen, Henrik. 1988. “A Solvency Study in Non-Life Insurance: Part 1. Analyses of Fire, Windstorm, and Glass Claims.” Scandinavian Actuarial Journal 1988 (1-3): 3–34.
Renshaw, Arthur E. 1994. “Modelling the Claims Process in the Presence of Covariates.” ASTIN Bulletin: The Journal of the IAA 24 (2): 265–85.
Ter Berg, Peter. 1980. “On the Loglinear Poisson and Gamma Model.” ASTIN Bulletin: The Journal of the IAA 11 (1): 35–40.
———. 1996. “A Loglinear Lagrangian Poisson Model.” ASTIN Bulletin: The Journal of the IAA 26 (1): 123–29.
Zeger, Scott L, Kung-Yee Liang, and Paul S Albert. 1988. “Models for Longitudinal Data: A Generalized Estimating Equation Approach.” Biometrics, 1049–60.