STATISTICAL SCORING MODEL OF LITHUANIAN COMPANIES
Laima Dzidzevičiūtė

In the banking sector of Lithuania, the need to apply statistical scoring models has increased markedly since the transposition of the New Capital Adequacy Directive into the national legal acts. According to these acts, banks are allowed to apply their own statistical models to calculate capital adequacy. However, banks' internal data are not always sufficient for developing internal statistical models. The need to apply statistical scoring models is growing not only for banks, but also for other institutions that grant credits. Until now, only a few authors in Lithuania have proposed their own statistical scoring models for corporates; however, these models were developed using very small data samples and are suitable only for the specific types of companies for which they were developed. The model proposed in this article addresses these problems because it is appropriate for the assessment of all companies, is not industry-specific and has been developed using a large data sample. The objective of this study was to develop a logistic regression scoring model for the assessment of corporates, using data of the external register JSC Creditinfo Lietuva. The proposed model contains 19 variables characterizing all the features of a company: size, locality, age, economic sector, financial condition, past due payments, negative facts and claims from external debt collection institutions.


Introduction
In order to decide whether or not to grant a credit, banks must have a credit risk assessment model in place. During the last decades, statistical scoring models have become increasingly significant among credit risk assessment models. They may be applied not only in the decision-making process, but also in other spheres of bank activities, such as pricing (adding a higher risk premium for riskier credits), calculating specific provisions and capital adequacy, forming a bank's strategy, allocating capital, managing past due payments, identifying clients that could be potential clients for other products, analysing the risk-adjusted profitability of a bank, management reporting systems, etc. In Lithuania, the need to apply statistical scoring models increased especially after the transposition of the New Capital Adequacy Directive (prepared according to the New Basel Capital Accord) into the national legal acts. According to these acts, banks are allowed to apply their own statistical models for calculating capital adequacy. However, the internal historical data stored at banks are not always sufficient for developing internal statistical scoring models. As a survey of Lithuanian banks showed, only four banks apply statistical scoring models; the others point to a too short historical observation period and insufficient internal data. The need to apply statistical scoring models is growing not only for banks, but also for other companies granting credits, i. e. consumer credit, quick credit and leasing companies, which may use them to assess the risk of applicants' employers.
Until now, only a few authors in Lithuania have proposed their own statistical scoring models for corporates. For instance, Grigaravičius (2003) proposed a logistic regression model to forecast the bankruptcy of companies whose shares are traded on the stock exchange, and Stoškus, Beržinskienė, Virbickaitė (2007) proposed a discriminant analysis model. However, these models were developed using very small data samples and are suitable for specific types of companies only. The model proposed in this article addresses these problems because it is appropriate for the assessment of all companies, is not industry-specific and has been developed using a large data sample.
The purpose of this study was to develop a logistic regression scoring model for the assessment of corporates using data of the external register JSC Creditinfo Lietuva. Calculations were made with the SPSS program. The final result of the proposed logistic regression model is an individual probability of default (hereinafter PD), i. e. the probability that a particular company will default within one year from the scoring date. The proposed model may be applied not only by banks, but also by other companies; e. g., consumer credit, quick credit and leasing companies may apply it for assessing the credit risk of clients' employers.
In the first part of the article, data used for modeling are described, and in the second part a detailed description of the modeling process comprising all the stages is given: the definition of Bads and the result period, segmentation of population, sampling, analysis of input variables, choosing the model form, calculation of coefficients and ex-ante validation.

Data
Data of the Lithuanian companies from all economic sectors for 2005-2008 were obtained from the external loan register JSC Creditinfo Lietuva which collects and stores companies' information about their age, locality, legal status and legal form, economic sector, annual turnover, the number of employees, managers, members of the board, subsidiaries and branches, claims, arrests and legal processes, bankruptcies, debts, changes of companies' name and address, public rating, inquiries, shares and other information from banks, leasing and telecommunication, public utility companies, public registers, etc. (http://www.creditinfo.lt/?PageID=721).
Each company is attributed to one of two possible groups: Goods or Bads. The default criterion is used to define the status of Bads. Default is defined as the status of a company when its payments to at least one credit institution are more than 90 days past due or a bankruptcy procedure has been initiated for the company. A company is attributed to Bads if it defaulted within one year from the end of a respective year, i. e. from the reference date $T_0$ (see Fig. 1).
Three reference dates are used: 31 December 2005, 31 December 2006 and 31 December 2007. The variables that characterize the creditworthiness of companies are taken at a concrete reference date $T_0$; however, they may be calculated for the end of a year (e. g., financial ratios) or for the period x from $T_{-x}$ to $T_0$ (e. g., information about past due payments during two years before the reference date).
FIG. 1. Scheme of companies' data gathering. Variables of each included company that could be used as independent input variables of the logistic regression model are determined at a concrete reference date $T_0$ (i. e. on 31 December 2005, 31 December 2006 or 31 December 2007). It should be determined whether within one year from date $T_0$ a company defaulted at least once for at least one credit institution. If yes, then the company is attributed to Bads and, when developing the logistic regression model, the dependent variable 1 is assigned; if no, it is attributed to Goods and 0 is assigned.

For example, variables of the company ABC are taken on 31 December 2007, i. e. the reference date is 31 December 2007. Then it is assessed whether within one year from 31 December 2007 until 31 December 2008 ABC defaulted at least once for at least one credit institution. If yes, then while forming the data array it would be attributed to Bads and the dependent variable 1 would be assigned. However, if ABC did not default during this one-year period, then this company would be attributed to Goods, and the dependent variable 0 would be assigned (see Fig. 2).
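The labelling rule for the ABC example can be sketched in a few lines of Python. This is only an illustration: the function name `label_company` and the default dates are hypothetical, and a default is assumed to be recorded as a single calendar date.

```python
from datetime import date

def label_company(reference_date, default_dates):
    """Return 1 (Bad) if any default falls within one year after the
    reference date T0, otherwise 0 (Good)."""
    # The reference dates in the text are all 31 December, so replacing
    # the year is safe here (it would fail for a 29 February date).
    end = reference_date.replace(year=reference_date.year + 1)
    return int(any(reference_date < d <= end for d in default_dates))

t0 = date(2007, 12, 31)
# Hypothetical default dates for the company ABC.
abc_bad = label_company(t0, [date(2008, 5, 14)])   # default inside the window
abc_good = label_company(t0, [date(2009, 2, 1)])   # default after the window
print(abc_bad, abc_good)
```

The value returned is the dependent variable that would be assigned to the company for that reference date.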
Data of each separate year were joined into one common data array and a "company-year" was used for the further analysis; e. g., if data on a concrete company are given for all three years, then the data of such a company are "tripled" and used as data of three separate companies. In total, a data array of 19193 rows ("company-years") was obtained; 376 of them (1.96%) were attributed to Bads and the remaining 18817 (98.04%) to Goods.
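A minimal sketch of the pooling into "company-years" is given below. The snapshot data and the function name `pool_company_years` are hypothetical; only the totals 376 and 19193 are taken from the text.

```python
def pool_company_years(data_by_year):
    """Stack yearly snapshots into one list of company-year rows."""
    rows = []
    for year in sorted(data_by_year):
        for company, is_bad in data_by_year[year].items():
            rows.append({"company": company, "year": year, "bad": is_bad})
    return rows

# Hypothetical snapshots: company -> dependent variable (1 = Bad, 0 = Good).
snapshots = {
    2005: {"ABC": 0, "DEF": 0},
    2006: {"ABC": 0, "DEF": 1},
    2007: {"ABC": 1},
}
rows = pool_company_years(snapshots)
abc_rows = [r for r in rows if r["company"] == "ABC"]
print(len(rows), len(abc_rows))  # ABC appears once per year it was observed

# Bads rate reported in the text: 376 of 19193 company-years.
reported_rate = 376 / 19193
print(f"{reported_rate:.2%}")
```

A company observed in all three years contributes three rows, each with its own dependent variable.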

Stages of model development
The development of the logistic regression model consists of eight stages which are described in detail below (see Fig. 3).

Definition of "Bad" obligors and the result period
As mentioned in the first part of the paper, "Bad" is defined as a company complying with at least one of these two criteria: 1) payments of the company to at least one credit institution are more than 90 days past due; 2) a bankruptcy procedure has been initiated for the company. The result period is equal to one year, i. e. it is assessed whether the company became "Bad" within a year starting from the end of a respective year. This duration was chosen in order to comply with the requirements of the Bank of Lithuania, prepared according to the New Capital Adequacy Directive and the New Basel Capital Accord (Bank of Lithuania, 2006; EU, 2006; BCBS, 2006).

Segmentation of population
The proposed companies' scoring model is generic (external) because data from an external loan register comprising information of many banks were used. As companies from all economic sectors were included, the model is recommended for assessing the risk of various companies and is not industry-specific. One should also note that the model is behavioural (portfolio), i. e. it is recommended that banks apply it for regular reassessments of already existing credit clients. JSC Creditinfo Lietuva does not gather information about the credit granting date at a concrete credit institution, so it is not possible to develop an application scoring model. The result period used in developing the proposed model starts from the end of a respective year and not from the date of the loan granting. However, even though the model is behavioural (and not application), it may also be applied in the decision-making process when deciding whether or not a credit should be granted.
The model was developed on a company (and not on a credit) level, i. e. it is intended for the assessment of companies and not of credits. Besides, the model may be applied for the assessment of all credit types (investment loans, working capital financing, etc.).

Sampling
Upon joining the data of three years into one common data array, 19193 rows ("company-years") were obtained, of which 376 were assigned to Bads and 18817 to Goods. To adjust the initial sample, two approaches were applied: 1) a needed sample size was calculated and compared with the initial sample size; 2) the structure of Goods and Bads was analyzed and the optimal structure was derived.
The following formula was applied to calculate the needed sample size (SAS, 2009; Dzidzevičiūtė, 2010a):

$$n = \frac{Z_{\alpha/2}^{2} \cdot PD_{MAX} \cdot (1 - PD_{MAX})}{\Delta PD^{2}} \qquad (1)$$

where $n$ is the needed sample size; $PD_{MAX}$ is the maximum PD that can be determined by experts analyzing the historical experience of the companies; $\alpha$ is the significance level, i. e. 100% minus the confidence level chosen by a bank; $Z_{\alpha/2}$ is the value of the inverse standard normal distribution function (it is possible to calculate it, e. g., applying the MS Excel function NORMSINV()); $\Delta PD$ is the PD error; e. g., if the bank chooses the 95% confidence level and the 0.20% PD error, it wants to be 95% confident that the average of individual PDs calculated by the model will be no more than 20 bp off $PD_{MAX}$.
As the Bads rate in the initial sample is 1.96%, in order to be conservative, a slightly higher maximum PD should be used to calculate the needed sample size (e. g., 2.4%). Suppose we want to be 95% confident that the average of individual PDs calculated by the model will be no more than 20 bp off this $PD_{MAX}$. Then the needed sample size calculated according to formula (1) is equal to 22496. One could notice that the calculated needed sample size exceeds the initial sample, i. e. there are only 19193 rows ("company-years") and 22496 rows are needed.
Besides, the initial proportions of Goods and Bads are 98.04% and 1.96%. Meanwhile, for logistic regression it is recommended to use 80% of Goods and 20% of Bads. To achieve such proportions, a mixture of undersampling and oversampling techniques was used, i. e. the number of Goods was reduced (every 26th row was deleted) and the number of Bads was increased (every row was repeated 13 times) to reach 20% in the total structure. After adjustment, the number of Goods was 18093 (79.36%) and the number of Bads 4706 (20.64%), in total 22799 rows.
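The figure of 22496 can be checked with a short calculation. The sketch below assumes that formula (1) is the usual normal-approximation sample size for a proportion, $n = Z_{\alpha/2}^{2} \cdot PD_{MAX}(1 - PD_{MAX}) / \Delta PD^{2}$, rounded up; the function name `needed_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist

def needed_sample_size(pd_max, confidence, pd_error):
    """Sample size so that the average PD is within pd_error of pd_max
    at the given confidence level (normal approximation, rounded up)."""
    alpha = 1.0 - confidence
    # Inverse standard normal, the counterpart of Excel's NORMSINV().
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil(z ** 2 * pd_max * (1.0 - pd_max) / pd_error ** 2)

n = needed_sample_size(pd_max=0.024, confidence=0.95, pd_error=0.002)
print(n)  # 22496
```

Under this assumption the inputs given in the text (2.4%, 95%, 20 bp) reproduce the reported needed sample size.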
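The mixture of undersampling and oversampling can be illustrated on a toy data set. This is a sketch only: the rows are hypothetical, the function name `rebalance` is not from the source, and the exact row selection used by the author may differ, so the counts produced here are not meant to reproduce those reported above.

```python
def rebalance(goods, bads, drop_every, repeat):
    """Drop every `drop_every`-th Good row and repeat every Bad row
    `repeat` times."""
    kept_goods = [row for i, row in enumerate(goods, start=1)
                  if i % drop_every != 0]
    repeated_bads = [row for row in bads for _ in range(repeat)]
    return kept_goods, repeated_bads

# Toy sample with roughly the same Goods/Bads imbalance as in the text.
goods = [("G", i) for i in range(1, 105)]   # 104 Goods
bads = [("B", i) for i in range(1, 3)]      # 2 Bads
kept, repeated = rebalance(goods, bads, drop_every=26, repeat=13)
bad_share = len(repeated) / (len(kept) + len(repeated))
print(len(kept), len(repeated), f"{bad_share:.2%}")
```

Repeating Bads rows does not add new information; it only changes the weight of Bads in the estimation sample.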

Analysis of input variables, choosing statistical model form and calculation of coefficients
The variables used in the final model were chosen in three cycles: 1) in the first cycle, based on expert judgment, 57 variables presented in Appendix, Table A.1 were determined; 2) in the second cycle, 48 variables (of the 57) were chosen taking into account several criteria (economic logic, monotony, individual discriminatory power of a variable); 3) in the third cycle, the 48 variables were inputted into the SPSS program, and the final 19 variables were chosen applying the forward stepwise procedure.

First cycle
Initially, 57 variables characterizing all the features of a company were determined (see Appendix, Table A.1): the financial ratios, external past due payments, age, legal form, county and economic sector of a company, information about the company's management, change of its address and name, negative facts about the company, claims from external debt collection companies, etc.
The values of all quantitative variables were joined into 10 groups by percentiles (in some cases negative values were used as a separate group, e. g., for Total assets / Equity because the negative values of this ratio indicate a very risky situation of a company, and small positive values, on the contrary, indicate a non-risky situation, so they cannot be mapped into the same group).For the variables Company's group by annual turnover at the end of a year, Age of a company, Number of employees, groups were determined based on expert judgment and not by percentiles.As all values of quantitative variables were grouped, the analysis of outliers was not made.
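The percentile grouping with a separate group for negative values can be sketched as follows. The data are hypothetical, the function name `decile_groups` is not from the source, and the decile boundaries rely on the default method of Python's `statistics.quantiles`, which need not coincide with the percentile definition used by the author.

```python
from bisect import bisect_right
from statistics import quantiles

def decile_groups(values, negatives_separate=False):
    """Map each value to a group: 0 for negative values (if they are
    kept apart), 1..10 for the deciles of the remaining values."""
    base = [v for v in values if v >= 0] if negatives_separate else list(values)
    cuts = quantiles(base, n=10)  # nine cut points between ten groups
    groups = []
    for v in values:
        if negatives_separate and v < 0:
            groups.append(0)
        else:
            groups.append(bisect_right(cuts, v) + 1)
    return groups

# Hypothetical Total assets / Equity ratios: two negative, 100 positive.
ratios = [-3.0, -1.5] + [1.0 + 0.1 * i for i in range(100)]
groups = decile_groups(ratios, negatives_separate=True)
print(sorted(set(groups)))
```

Keeping negative ratios in group 0 prevents them from being mapped into the same group as small positive ratios.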
To code the values, the weight of evidence (hereinafter WoE) approach was applied, because with this approach the assigned dummies accurately reflect the riskiness of a concrete group i (Dzidzevičiūtė, 2010a):

$$WOE_i = \ln\left(\frac{G_i}{B_i}\right) \qquad (2)$$

where $WOE_i$ is the WoE of the i-th group; $G_i$ is the proportion of Goods in the i-th group, % from all Goods; $B_i$ is the proportion of Bads in the i-th group, % from all Bads.
Table 1 provides the calculation of dummies for County of a company. The higher the WoE, the lower the risk of a concrete group. When the percentage proportion of Goods in a respective group exceeds the percentage proportion of Bads in that group, the WoE will be greater than 0, and vice versa. As one could notice, the riskiest county is Panevėžys, as its WoE is the lowest compared with the other counties.
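How the WoE of a group follows from the counts of Goods and Bads can be sketched as below. The sketch assumes the usual definition, the natural logarithm of the ratio of a group's share of all Goods to its share of all Bads; the group names and counts are hypothetical and are not taken from Table 1.

```python
import math

def weight_of_evidence(goods_by_group, bads_by_group):
    """WoE per group: ln(share of all Goods / share of all Bads).
    Assumes every group contains at least one Good and one Bad."""
    total_goods = sum(goods_by_group.values())
    total_bads = sum(bads_by_group.values())
    woe = {}
    for group in goods_by_group:
        g_share = goods_by_group[group] / total_goods
        b_share = bads_by_group[group] / total_bads
        woe[group] = math.log(g_share / b_share)
    return woe

# Hypothetical counts for three groups of one variable.
goods = {"A": 600, "B": 300, "C": 100}
bads = {"A": 20, "B": 40, "C": 40}
woe = weight_of_evidence(goods, bads)
for group, value in woe.items():
    print(group, round(value, 4))
```

Group A holds 60% of all Goods but only 20% of all Bads, so its WoE is positive; group C is the riskiest and has the lowest WoE.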
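The initial groups were adjusted taking into account:
• the economic logic, i. e. the risk of groups should reflect the expectations of an expert before modeling; for example, the negative values of Total assets / Equity should get a low WoE because they indicate a risky situation of a company, etc.;
• monotony, i. e. the Bads rate should monotonically decrease or increase when the value of a quantitative variable increases (at least, to a certain level; for example, the distribution can be U-shaped);
• micronumerosity, i. e. if the number of values in a concrete group is very small, it is better to assign them to one of the other groups based on the similarity of the Bads rate. For example, missing values were put into a separate group or, in the case of micronumerosity, assigned to one of the groups based on the similarity of the Bads rate;
• the discriminatory power of a variable, i. e. the information values of various grouping alternatives were compared and the alternative with the highest value was chosen; the unpredictive variables were totally excluded from the further analysis (see Appendix, Table A.1).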
Table 2 provides the adjustment of the initial grouping.
From Table 2 it is clear that some initial groups were joined (e. g., percentiles from 0.2 to 0.4) to reach a monotonically decreasing Bads rate, i. e. the higher the ratio, the lower the Bads rate. The information value for this grouping alternative was the highest.

Second cycle
From the initial 57 variables, based on their individual discriminatory power, economic logic and monotony, 48 variables were chosen and further used in the modeling. The information value was calculated using the following formula (e. g., 0.1 in Table 1 for the variable County of a company) (SAS, 2009):

$$IV = \sum_{i=1}^{n} (G_i - B_i) \cdot WOE_i \qquad (3)$$

where $IV$ is the information value of a variable; $G_i$ is the proportion of Goods in the i-th group, % from all Goods; $B_i$ is the proportion of Bads in the i-th group, % from all Bads; $WOE_i$ is the WoE of the i-th group; $n$ is the number of groups.
The information values were interpreted as follows: below 0.02, an unpredictive variable; 0.02 to 0.1, weak predictiveness; 0.1 to 0.3, medium predictiveness; above 0.3, strong predictiveness. As Table 1 shows, the predictiveness of the variable County of a company is medium, whereas the predictiveness of the variable Net profit (loss) / Total assets (Table 2) is strong. Table A.1 in Appendix provides the information values for all analyzed variables.
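The information value and its interpretation bands can be sketched as follows. The sketch assumes that the information value is the sum over groups of (share of Goods minus share of Bads) times the WoE, with the WoE taken as the natural logarithm of the share ratio; the counts are hypothetical, and assigning the boundary value 0.1 to the medium band follows the text's description of County of a company.

```python
import math

def information_value(goods_by_group, bads_by_group):
    """IV = sum over groups of (G_i - B_i) * ln(G_i / B_i), where G_i and
    B_i are the group's shares of all Goods and of all Bads."""
    total_goods = sum(goods_by_group.values())
    total_bads = sum(bads_by_group.values())
    iv = 0.0
    for group in goods_by_group:
        g = goods_by_group[group] / total_goods
        b = bads_by_group[group] / total_bads
        iv += (g - b) * math.log(g / b)
    return iv

def predictiveness(iv):
    """Interpretation bands quoted in the text."""
    if iv < 0.02:
        return "unpredictive"
    if iv < 0.1:
        return "weak"
    if iv <= 0.3:
        return "medium"
    return "strong"

# Hypothetical counts for three groups of one variable.
goods = {"A": 600, "B": 300, "C": 100}
bads = {"A": 20, "B": 40, "C": 40}
iv = information_value(goods, bads)
print(round(iv, 4), predictiveness(iv))
```

Every term of the sum is non-negative, because the share difference and the logarithm always have the same sign.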

Third cycle
The 48 variables chosen in the second cycle were further analyzed using the forward stepwise (Wald) procedure. The WoE values were inputted into the SPSS program. Applying the forward stepwise procedure, step by step, variables having a strong relationship with the dependent variable were included into the model, and then it was checked which variables should be excluded from the regression equation. In total, 21 steps were made; the final model is presented in Step 21. After the procedure, 19 variables were left in the equation (Appendix, Table A.2). The PD of a company is determined applying the formulas below (Dzidzevičiūtė, 2010a):

$$PD_i = \frac{1}{1 + e^{-(b_0 + b_1 X_{1i} + \dots + b_n X_{ni})}} \qquad (4)$$

$$Z_i = \ln\left(\frac{PD_i}{1 - PD_i}\right) = b_0 + b_1 X_{1i} + \dots + b_n X_{ni} \qquad (5)$$

where $PD_i$ is the probability that a company i will default; $X_{1i}, \dots, X_{ni}$ are dummies of independent input variables, i. e. the WoE of a concrete group indicated in Table 3; $b_0, b_1, \dots, b_n$ are the coefficients shown in Appendix, Table A.2, column B; $PD_i/(1 - PD_i)$ is the odds of default (the value may vary from 0 to ∞); $Z_i$ is the natural logarithm of the odds, also called the logit.
Table 3 provides the groups of variables and their dummies (WoE) and shows the step at which a concrete variable was included into the equation. One could notice that the variables left after the final cycle characterize all the features of a company: age, size (group of annual turnover, number of employees and, to some extent, natural logarithms of net profit and non-current amounts payable and liabilities, as bigger companies generate relatively bigger absolute amounts of net profit and take relatively bigger credits), financial condition (as many as eight financial ratios were included), locality (companies were grouped by counties), economic sector (companies were grouped according to the NACE 2 classification), external past due payments (total number of all past due payments to credit institutions, leasing, telecommunication, public utility companies and other companies and the average duration of all these past due payments during the last year before the scoring date), negative facts about a company and claims from external debt collection companies.
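The step from WoE-coded inputs to a PD can be sketched as below, assuming the standard logistic form in which the logit is a linear combination of the WoE values. The intercept, the two coefficients and the WoE values are hypothetical; the actual coefficients are those in Appendix, Table A.2, which are not reproduced here.

```python
import math

def probability_of_default(intercept, coefficients, woe_values):
    """PD = 1 / (1 + exp(-Z)), where Z = b0 + sum of b_k * X_k (the logit)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, woe_values))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model with two WoE-coded variables and negative coefficients.
b0 = -1.4
b = [-0.8, -0.5]
safe = probability_of_default(b0, b, [1.0, 0.5])     # high WoE, low risk
risky = probability_of_default(b0, b, [-1.0, -0.5])  # low WoE, high risk
print(round(safe, 4), round(risky, 4))

# Taking the log-odds of the PD recovers the logit Z.
logit_safe = math.log(safe / (1.0 - safe))
print(round(logit_safe, 2))
```

With negative coefficients, a company falling into groups with a higher WoE receives a lower PD.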
For ex-ante validation, the following analyses were made:
• analysis of the economic logic of the coefficients' mathematical signs: the mathematical sign of a coefficient must comply with the economic logic used when developing the model. The coefficients of the logistic regression equation must have a plus sign when the increasing value of a variable (or a dummy) indicates ceteris paribus an increasing risk of a company, and, on the contrary, the coefficients must have a minus sign when the increasing value of a variable (or a dummy) indicates ceteris paribus a decreasing risk of a company. In this study, groups of variables were coded with the WoE; the increasing WoE indicates ceteris paribus a decreasing risk of a company. Therefore, the sign of all coefficients in formulas (4) and (5) must be a minus. As one could notice in Appendix, Table A.2, all coefficients in column B have a minus sign, as expected;
• analysis of the significance of the coefficients' inequality to 0 applying the Chi-square goodness-of-fit test: the p values (Sig.) when applying the Chi-square goodness-of-fit test are lower than the significance level of 0.05, so the H0 hypothesis is rejected (i. e. at least one coefficient is significantly unequal to 0) (see Appendix, Table A.3);
• analysis of the significance of the coefficients' inequality to 0 applying the Wald tests: the p values (Sig.) when applying the Wald tests prove the significance of the coefficients' inequality to 0. As one could notice in the last, 21st step of the forward stepwise procedure, all Sig. values are below the significance level of 0.05, so the H0 hypothesis is rejected (i. e. all coefficients are significantly unequal to 0) (see Appendix, Table A.2).
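The sign check and the Wald test for a single coefficient can be sketched as follows. The sketch assumes the common form of the Wald statistic, the squared ratio of a coefficient to its standard error, compared with a chi-square distribution with one degree of freedom; the coefficients and standard errors are hypothetical and are not those of Table A.2.

```python
import math

def wald_test(coefficient, standard_error):
    """Return the Wald statistic (b / SE)^2 and its p value under a
    chi-square distribution with one degree of freedom."""
    wald = (coefficient / standard_error) ** 2
    # For one degree of freedom, P(chi2 > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

# Hypothetical estimates: variable -> (coefficient, standard error).
estimates = {"X1": (-0.9, 0.10), "X2": (-0.4, 0.15)}
checks = {}
for name, (b, se) in estimates.items():
    wald, p = wald_test(b, se)
    checks[name] = {"sign_ok": b < 0, "significant": p < 0.05}
    print(name, round(wald, 2), round(p, 4), checks[name])
```

A variable passes both checks when its coefficient is negative (WoE coding) and its p value is below 0.05.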
This means that the mathematical signs of the coefficients comply with the economic logic and all coefficients are significantly unequal to 0. Besides, the overall percentage of the classification table is 83.2%. However, a concrete institution (bank, consumer and quick credit company, leasing company), before applying the proposed model, should check its discriminatory power, the accuracy of calibration, stability, etc. using its own data; also, a regular ex-post validation should be performed upon implementing the model.
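The overall percentage of a classification table is the share of correctly classified cases. The sketch below assumes a cut-off of 0.5 on the predicted PD, which the text does not state; the PDs and labels are hypothetical.

```python
def overall_percentage(pds, labels, cutoff=0.5):
    """Share of cases (in %) whose predicted class, Bad if PD >= cutoff,
    matches the observed label (1 = Bad, 0 = Good)."""
    correct = sum((pd >= cutoff) == bool(y) for pd, y in zip(pds, labels))
    return 100.0 * correct / len(labels)

# Hypothetical predicted PDs and observed outcomes for five company-years.
pds = [0.05, 0.10, 0.70, 0.30, 0.60]
labels = [0, 0, 1, 1, 0]
accuracy = overall_percentage(pds, labels)
print(accuracy)  # 60.0
```

Because this figure depends on the cut-off and on the share of Bads in the sample, it complements rather than replaces the checks of discriminatory power and calibration mentioned above.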

Conclusions
When developing the logistic regression model, the final variables were chosen in three cycles. In the first cycle, 57 variables were chosen that characterize all the features of a company: financial condition, external past due payments, age, legal form, county and economic sector, information about the company's management, change of its address and name, negative facts about the company, claims from external debt collection companies, etc. The WoE approach was applied for coding with dummies, i. e. a concrete WoE was assigned to each group of a variable's values. In the second cycle, based on the economic logic, monotony and individual discriminatory power, 48 variables were chosen for the further analysis. Then, in the third cycle, applying the forward stepwise (Wald) procedure, 19 final variables were determined. The proposed model consists of 19 variables that comprehensively characterize a company's risk. It may be applied to assess companies from all economic sectors and for all credit types (investment loans, working capital financing, etc.). The proposed model may be applied not only by banks, but also by other institutions that grant credits (consumer credit, quick credit, leasing companies), e. g., to assess the applicants' employers. However, before applying the proposed model, companies should validate its discriminatory power, the accuracy of calibration, stability, etc. using their own data to decide whether the model is suitable for them.
In addition to the model itself, the analysis presented in the article could be helpful for banks while developing their own models; for example, banks could choose the same or similar variables, use the results of individual discriminatory power analysis, intervals of quantitative variables, apply the proposed WoE and information value approaches, etc.

FIG. 2. Example for the ABC company

Table 1. WoE and information value for the variable County of a company
Source: calculations of the author.

Table 2. WoE and information value for the variable Net profit (loss) / Total assets
Source: calculations of the author.

X6: There are / there are no records from debt collection companies about claims to the company during that year*** (Step 6)
* WoE is multiplied by the coefficient for that variable shown in Appendix, Table A.2, column B; the lower the WoE, the riskier the group.
** All negative facts about a company that are registered at JSC Creditinfo Lietuva, e. g., negative media information.
*** Only the records registered at JSC Creditinfo Lietuva are used.
**** Past due payments to credit institutions, leasing, telecommunication, public utility companies and other companies registered at JSC Creditinfo Lietuva.