In order to study social inequalities, indices can be used to summarize the multiple dimensions of socioeconomic status. As part of the Equit’Area Project, a public health program focused on social and environmental health inequalities, a statistical procedure to create (neighborhood) socioeconomic indices was developed. This procedure uses successive principal component analyses to select variables and create the index. To simplify the application of the procedure for non-specialists, the R package SesIndexCreatoR was created. It allows the creation of the index with all the possible options of the procedure, the classification of the resulting index into categories using several classical methods, the visualization of the results, and the generation of automatic reports.
When studying social inequalities, it is generally interesting to take into account the socioeconomic status (SES) of an individual, a neighborhood or a region rather than consider only one socioeconomic variable such as educational level or income. However, socioeconomic status is a complex and multidimensional concept which encompasses many aspects such as employment, income, education, housing and social bonds. All of these aspects can themselves be represented by various variables. To synthesize and consider these different aspects, one solution is to create a SES index.
There are already many existing SES indices, especially at the neighborhood level [
Compared to other existing approaches to computing indices, our procedure is slightly more complex to understand and apply, especially for non-statisticians; for this reason we implemented the procedure in an R package [
In this paper we present and illustrate the use of the SesIndexCreatoR package for the Lille agglomeration (a large French metropolitan area). For further examples we recommend reading the works by Padilla et al. mentioned above.
The example data provided in the SesIndexCreatoR package concern one large city in France, Lille (Nord Pas de Calais region, northern France), and some adjacent municipalities. The statistical unit is the sub-municipal French census block group (called IRIS) defined by the National Institute of Statistics and Economic Studies (INSEE). These units have an average of 2000 inhabitants and are constructed to be as homogeneous as possible in terms of socio-demographic characteristics and land use. Census block groups (BGs) are divided into three distinct categories: housing, economic activity and miscellaneous. Housing BGs are the most common, economic activity BGs include at least 1000 employees and at least twice as many employees as residents, and miscellaneous BGs are large, sparsely populated areas (leisure parks, port areas, forests, etc.). As activity and miscellaneous BGs have particular profiles due to the way they are defined, they are treated in the example as illustrative units (meaning that they do not take part in the procedure but still receive an index value). For confidentiality and distribution reasons, the real BG identifiers are replaced in the example data set with a simple number from 1 to 234 (the number of BGs in the area).
Socioeconomic data are taken from the 1999 national census (source: INSEE) and provide counts of population, households and residences at the BG scale covering all the social, economic and demographic aspects. Median income at the BG scale is taken from a second database: the “Revenus fiscaux des ménages” database (source: INSEE-DGI). Using these raw data, 37 variables are defined at the BG scale based on the INSEE definitions. These variables are chosen to be representative of the theoretical concept of SES and in line with the variables most often used in the literature, or because they could be considered as linked with the SES concept.
All variables are related to family structure, household type, immigration status, employment, income, education and housing (more details are available in
Domain | Variable name | Description |
---|---|---|
BG type | Type | Census block group type (H: housing; A: activity; D: miscellaneous; Z: one BG municipality)ᶜ |
Family and Household | UnderAge25 | People under the age of 25 in the total population |
Family and Household | OverAge65 | People over the age of 65 in the total population |
Family and Household | SingleParentFamilies | Single-parent families in the total population |
Family and Household | HouseHolderAlone | Householders living alone in the total population |
Immigration | ForeignForce | People in the labor force in the total population |
Employment and Income | LabourForce | People in the labor force in the total populationᵃ |
Employment and Income | MenLabourForce | Men in the labor force in the male populationᵃ |
Employment and Income | WomenLabourForce | Women in the labor force in the female populationᵃ |
Employment and Income | UnemploymentTotal | Unemployed people in the labor forceᵇ |
Employment and Income | UnemploymentForeigners | Unemployed foreigners in the labor forceᵇ |
Employment and Income | UnemploymentAge1524 | Unemployed people in the 15-24 years old labor forceᵇ |
Employment and Income | UnemploymentOverAge50 | Unemployed people over 50 years old in the labor forceᵇ |
Employment and Income | UnemploymentMen | Unemployed people in the male labor forceᵇ |
Employment and Income | UnemploymentWomen | Unemployed people in the female labor forceᵇ |
Employment and Income | UnemploymentMore year | People unemployed for more than 1 year in the labor forceᵇ |
Employment and Income | SelfEmployed | Self-employed (independent workers, employers, etc.) in the labor force |
Employment and Income | InsecureJobs | People with unstable jobs in the labor force |
Employment and Income | SteadyJobs | People with steady jobs in the labor force |
Employment and Income | MedianIncome | Median income per consumption unit (in euros per year)ᶜ |
Education | AttendingSchool | People 6-15 years old attending school in the 6-15 years old population |
Education | NoDiplomas | People with no diploma (and not studying) in the 15 years old and more population |
Education | BasicGeneralQualifications | People with basic or intermediate general or vocational qualifications (and not studying) in the 15 years old and more population |
Education | GeneralCertificates | People with general or vocational maturity certificates (and not studying) in the 15 years old and more population |
Education | LowerTertiaryEducation | People with at least a lower tertiary education (and not studying) in the 15 years old and more population |
Education | HigherEducationalDegree | People with a higher educational degree (and not studying) in the 15 years old and more population |
Education | Students | Students in the 15 years old and more population |
Housing | IndividualHouse | Individual houses in the main residences |
Housing | MultipleDweelingUnit | Multiple dwelling units in the main residences |
Housing | NonOwner | Non-owner-occupied in the main residences |
Housing | SubsidizedHousing | Subsidized housing in the main residences |
Housing | BuildBefore1968 | Main residences built before 1968 |
Housing | BuildAfter1990 | Main residences built after 1990 |
Housing | Less40m2 | Main residences smaller than 40 m² |
Housing | Larger150m2 | Main residences larger than 150 m² |
Housing | ParkingSpace | Main residences with a parking space (garage or other) |
Housing | WithoutCar | Households without a car |
Housing | TwoOrMoreCars | Households with 2 or more cars |
implemented in the proposed package). In our example, there are two such groups: 7 unemployment variables and 3 labor force variables. We also note that there is an unexpectedly high number of missing values for median income; as we wished to keep this variable in the analysis, we filled missing values with the average value of the adjacent BGs.
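The neighbor-average imputation mentioned above can be illustrated with a few lines of base R. This is only a sketch under stated assumptions: the income values are made up, and the adjacency list `neighbours` and the helper `impute_from_neighbours` are hypothetical names, not part of the package or of the example data.

```r
# Toy data: 5 block groups, two with missing median income (values are made up)
income <- c(bg1 = 21000, bg2 = NA, bg3 = 18000, bg4 = NA, bg5 = 25000)

# Hypothetical adjacency list: names of the BGs adjacent to each BG
neighbours <- list(
  bg1 = c("bg2", "bg3"),
  bg2 = c("bg1", "bg3", "bg5"),
  bg3 = c("bg1", "bg2", "bg4"),
  bg4 = c("bg3", "bg5"),
  bg5 = c("bg2", "bg4")
)

impute_from_neighbours <- function(x, nb) {
  filled <- x
  for (id in names(x)[is.na(x)]) {
    # Average the observed values of the adjacent units only
    # (original values are used, never previously imputed ones)
    filled[id] <- mean(x[nb[[id]]], na.rm = TRUE)
  }
  filled
}

income_filled <- impute_from_neighbours(income, neighbours)
print(income_filled)
```

In this sketch a unit whose neighbors are all missing would receive `NaN`, so such cases would need separate handling.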
The SES index creation procedure is detailed in Lalloué et al. [
(1) Study of the redundant variables. As already mentioned, several variables represent the same notion and we want to determine which one best represents this notion. Therefore, one variable is selected for each group by applying principal component analysis (PCA) to each of the groups of redundant variables. The selected variable for each group is the one with the largest correlation with the first component of the PCA on the group.
(2) Selection of the variables. A PCA or a multiple factor analysis (MFA) on the remaining variables (i.e., non-redundant variables and variables selected in step (1)) is used to select the variables with a contribution to the first component larger than the average one, i.e., the variables that are best correlated with the first component. The choice of PCA or MFA depends on the willingness to give the same weight in the analysis to each domain (MFA) or not (PCA).
(3) Construction of the index. A final PCA is carried out including the variables selected in step (2). Provided that the first component of this PCA can be interpreted as a “SES component”, it is used to calculate the socioeconomic index as the reduced first component.
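The three steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: it relies on `prcomp` and simulated data, covers only the PCA variant of step 2, takes a variable's contribution to the first component as its squared loading, and reads "reduced" as "scaled to unit variance". The names `dat`, `groups` and `pick_representative` are hypothetical.

```r
set.seed(1)
n <- 200
latent <- rnorm(n)                        # simulated underlying "SES" factor
dat <- data.frame(
  UnempA = latent + rnorm(n, sd = 0.3),   # redundant group of three variables
  UnempB = latent + rnorm(n, sd = 0.6),
  UnempC = latent + rnorm(n, sd = 0.9),
  Income = -latent + rnorm(n, sd = 0.4),
  NoCar  = latent + rnorm(n, sd = 0.5),
  Noise1 = rnorm(n),                      # variables unrelated to the factor
  Noise2 = rnorm(n)
)
groups <- list(c("UnempA", "UnempB", "UnempC"))

# Step 1: in each redundant group, keep the variable most correlated
# with the first component of a PCA run on that group alone
pick_representative <- function(vars, data) {
  pc1 <- prcomp(data[, vars], scale. = TRUE)$x[, 1]
  vars[which.max(abs(cor(data[, vars], pc1)))]
}
kept <- vapply(groups, pick_representative, character(1), data = dat)
step1_vars <- c(setdiff(names(dat), unlist(groups)), kept)

# Step 2: keep the variables whose contribution to the first component
# exceeds the average contribution (squared loadings sum to 1)
pca2 <- prcomp(dat[, step1_vars], scale. = TRUE)
contrib <- pca2$rotation[, 1]^2
step2_vars <- step1_vars[contrib > mean(contrib)]

# Step 3: final PCA; the index is the standardized first component.
# The sign of a principal component is arbitrary and may need flipping.
pca3 <- prcomp(dat[, step2_vars], scale. = TRUE)
ses_index <- as.numeric(scale(pca3$x[, 1]))

print(kept)
print(step2_vars)
```

Because the sign of a component is arbitrary, the first component still has to be interpreted (for example through the sign of the income loading) before the index is read as deprivation or affluence.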
The SesIndexCreatoR package depends on the FactoMineR [
Because the package is also intended for novice R users, the example data are not included as an R dataset but as a text file, in order to show in the package’s manual how to import a file.
SesIndexCreatoR is composed of three main functions and several visualizing and internal functions (see
The SesIndex function creates a socioeconomic index as defined in the Equit’Area project. It is possible to choose the starting set of variables, the potential groups of redundant variables, the potential supplementary units, the method of selection (PCA or MFA) and the steps of the procedure to perform. Results include the final index and all the results of the intermediate steps.
The SesClassif function creates socioeconomic categories, based on a socioeconomic index created by the SesIndex function, with different techniques such as hierarchical clustering, quantiles or equal subdivisions. Results include both a table containing the original data set with the class of each unit and the results of the classification technique (cut points, class particularities, etc.).
The SesReport function creates a .html file with a report summarizing the results of the different steps of the creation of a socioeconomic index with the SesIndex function and, if any, the classification of the index using the SesClassif function. This function can also create a .csv file containing the original data set, the index and, if any, the classification.
First, the socioeconomic data from the text file are imported in a data frame:
R> library("SesIndexCreatoR")
R> SesData <- read.table(
+ system.file("extdata", "SesData.txt", package = "SesIndexCreatoR"),
+ header = TRUE, sep = "\t", row.names = 1)
The SesData.txt file contains 37 socioeconomic variables and 1 type variable (giving the type of BG) for each BG of the Lille municipality and adjacent municipalities, as described in Section 2.1 Data. The SesData data frame thus has 234 rows representing the BGs and 38 columns representing the variables.

Function | Description |
---|---|
ClassifHC | Internal function: Classification with Hierarchical clustering (HC) |
ClassifInt | Internal function: Classification by intervals |
ClassifQuant | Internal function: Classification by quantiles |
plot.SesClassif | Plot the results of the classification of a socioeconomic index |
plot.SesIndex | Plot the results of the construction of a socioeconomic index |
print.SesClassif | Print the classification of a socioeconomic index results |
print.SesIndex | Print the creation of a socioeconomic index results |
SelectVar | Internal function: selection of variables |
SesClassif | Create categories from a socioeconomic index |
SesIndex | Creation of a Socio-Economic Index |
SesReport | Creation of a report for SesIndex and SesClassif functions |
SesStep1 | Internal function: performs the first step of the creation of the socioeconomic index |
As the SesIndex function needs vectors or lists of variable names as arguments, we then extract the different vectors and lists needed to call the function (with redundant groups). The first line of the following code chunk extracts the names of the variables to analyse as a vector. The remaining lines extract the names of the variables in the two groups of redundant variables (see
R> varnames <- colnames(SesData)[2:ncol(SesData)]
R> group1 <- grep("Unemployed", colnames(SesData), value = TRUE)
R> group2 <- grep("LabourForce", colnames(SesData), value = TRUE)
R> groupvarnames <- list(group1, group2)
In order to consider activity and miscellaneous BGs as illustrative units, we extract the names of the corresponding rows (in our example, A is for “Activity” and D for “Miscellaneous” types of BGs):
R> illus <- rownames(SesData[SesData[, "Type"] %in% c("A", "D"), ])
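To make the notion of illustrative units concrete, the following base R sketch fits a PCA on the active rows only and then projects the illustrative rows onto it, so that they receive a score without influencing the components. It is an assumption-laden illustration using `prcomp` and `predict` on simulated data; the objects `toy`, `illus_toy` and `active` are hypothetical and this is not how the package is necessarily implemented.

```r
set.seed(42)
toy <- data.frame(
  Type = c(rep("H", 8), "A", "D"),        # 8 housing BGs, 1 activity, 1 miscellaneous
  x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10),
  row.names = paste0("bg", 1:10)
)
num_vars <- c("x1", "x2", "x3")

# Row names of the illustrative (supplementary) units
illus_toy <- rownames(toy)[toy$Type %in% c("A", "D")]

# The PCA is estimated on the active units only
active <- toy[!(rownames(toy) %in% illus_toy), num_vars]
pca <- prcomp(active, scale. = TRUE)

# Illustrative units are projected using the centring, scaling and
# loadings estimated on the active units
sup_scores <- predict(pca, newdata = toy[illus_toy, num_vars])

# First-component score for every unit, in the original row order
scores <- rbind(pca$x, sup_scores)[rownames(toy), 1]
print(round(scores, 2))
```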
It is now possible to create the socioeconomic index described in Materials and methods using SesIndex. Here, we create a socioeconomic index using all 3 steps. Two groups of redundant variables are defined in groupvarnames and several BGs are set as illustrative. By default, all 3 steps are performed and step 2 uses a PCA.
R> index <- SesIndex(SesData, varnames=varnames, groupvarnames=groupvarnames,
+ sup=illus)
R> plot(index, choice = "ind", label = "none")
Once the index is created, we want to explore the results of the procedure. For instance, among the groups of redundant variables listed in
R> index$step1$selection
[
Or, among the list of variables in
R> index$step2$selection
[
[
[
[
[
[
[
R> plot(index, choice = "var", step = 2)
It is also possible to obtain detailed results of the data mining techniques, such as the correlation coefficients of the variables with the first two components of the second step analysis:
R> index$step2$analysis$var$coord[,c(1,2)]
Dim.1 Dim.2
UnderAge25 0.63630110 0.21862821
OverAge65 −0.43163887 −0.18186596
ForeignPop 0.77678724 −0.15910609
LabourForce −0.21536615 0.11435159
UnemployedTotal 0.87073008 −0.29624249
SelfEmployed −0.54334383 0.47689134
InsecureJobs 0.89130598 0.13786739
SteadyJobs −0.87777733 0.03703831
SingleParentFamilies 0.78556166 −0.19555464
NoDiplomas 0.64635948 −0.67461860
HouseholderAlone 0.45573846 0.72622501
AttendingSchool −0.37025379 −0.02709806
BasicGeneralQualifications −0.15542464 −0.87073704
GeneralCertificates −0.62380039 0.12642215
LowerTertiaryEducation −0.59037219 0.55626000
HigherEducationalDegree −0.36942886 0.79823028
Students 0.35951016 0.68000017
IndividualHouse −0.73553777 −0.46219873
MultipleDwellingUnits 0.74861629 0.43400686
BuiltBefore1968 0.09240263 −0.19191144
Builtafter1990 −0.05250108 0.60279988
ParkingSpace −0.77646906 0.09453152
NonOwner 0.83998699 0.24679532
Less40 m2 0.51827452 0.69853014
Larger150 m2 −0.43976155 0.19532641
WithoutCar 0.90096827 0.17029037
TwoOrMoreCars −0.87800029 −0.10183450
SubsidizedHousing 0.71268195 −0.27645939
MedianIncome −0.82471346 0.17342006
Or the proportion of variance explained by the first four components of the final step:
R> index$step3$analysis$eig[1:4,]
eigenvalue percentage of variance
comp 1 9.4233115 67.309368
comp 2 1.6390996 11.707854
comp 3 1.0678887 7.627776
comp 4 0.5014364 3.581689
cumulative percentage of variance
comp 1 67.30937
comp 2 79.01722
comp 3 86.64500
comp 4 90.22669
The above outputs are especially interesting to understand the procedure of variable selection. We can see in these results that the variables of total unemployment and total labor force were respectively selected from the groups of redundant unemployment variables and labor force variables. Then, for these two groups only these two variables were kept in the next steps.
We can see in the step 2 selection that only the variables with the highest correlations with the first component were selected. Here, 14 variables out of 29 were kept for the final step and the construction of the SES index. Finally, the first component of the final-step PCA performed on these 14 variables explained more than 67% of the total variance.
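The percentages reported for the final step can be checked by hand. Assuming the final PCA is run on standardized variables, the total variance equals the number of variables (14 here), so each percentage is the eigenvalue divided by 14. The short check below uses only the eigenvalues printed above; the object names are ours.

```r
# Eigenvalues of the first four components, copied from the output above
eig <- c(9.4233115, 1.6390996, 1.0678887, 0.5014364)
p <- 14                      # number of variables kept after step 2

pct <- 100 * eig / p         # percentage of variance per component
cum_pct <- cumsum(pct)       # cumulative percentage of variance

print(round(pct, 4))
print(round(cum_pct, 4))
```

The values obtained this way agree with the printed table to the displayed precision, which is consistent with the standardization assumption.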
R> plot(index, choice = "var", step = 3)
Some graphical outputs can be seen in Figures 1-3.
Finally,
We now want to create categories from the socioeconomic index. We use a hierarchical clustering followed by a k-nearest neighbor (k-nn) algorithm. We decide to have an automatic number of classes (i.e., to cut the hierarchical clustering tree where the relative loss of inertia is the highest):
R> categories <- SesClassif(index)
Other possibilities currently available in the SesClassif function are to create classes with hierarchical clustering without k-nn consolidation, with quantiles or with equal ranges of values.
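The three families of classification mentioned here can be illustrated on a simulated index with base R functions. This sketch is not the SesClassif implementation: the number of classes is fixed at 3 instead of being chosen automatically, Ward linkage is our assumption, and no k-nn consolidation is performed.

```r
set.seed(7)
# Simulated index with three well-separated groups of 40 units each
idx <- c(rnorm(40, -2, 0.4), rnorm(40, 0, 0.4), rnorm(40, 2, 0.4))
k <- 3

# Quantiles: classes containing (approximately) the same number of units
cls_quant <- cut(idx,
                 breaks = quantile(idx, probs = seq(0, 1, length.out = k + 1)),
                 include.lowest = TRUE, labels = FALSE)

# Equal ranges: classes of equal width over the observed range of the index
cls_range <- cut(idx, breaks = k, labels = FALSE)

# Hierarchical clustering (Ward linkage assumed), tree cut into k classes
tree <- hclust(dist(idx), method = "ward.D2")
cls_hc <- cutree(tree, k = k)

print(table(cls_quant))
print(table(cls_range))
print(table(cls_hc))
```

Quantiles guarantee balanced class sizes, whereas equal ranges and clustering follow the shape of the distribution, so the three methods can give quite different categories on a skewed index.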
We can summarize some characteristics of the different categories using simple functions. For instance, it is possible to compare the average value of each variable in each category with the overall mean:
R> for (i in 1:3) {
+ print(paste("Category", i))
+ print(round(categories$analysis$desc.var$quanti[[i]][, c(2, 3, 6)], 2))
+ }
[1] "Category 1"
Mean in category Overall mean p.value
TwoOrMoreCars 0.33 0.21 0
IndividualHouse 0.71 0.45 0
SteadyJobs 0.73 0.65 0
ParkingSpace 0.60 0.43 0
MedianIncome 27529.06 21986.21 0
NoDiplomas 0.11 0.16 0
ForeignPop 0.02 0.05 0
SubsidizedHousing 0.08 0.26 0
UnemployedTotal 0.10 0.16 0
SingleParentFamilies 0.11 0.17 0
MultipleDwellingUnits 0.25 0.52 0
WithoutCar 0.16 0.29 0
InsecureJobs 0.09 0.13 0
NonOwner 0.35 0.58 0
[1] "Category 2"
Mean in category Overall mean p.value
MultipleDwellingUnits 0.66 0.52 0.00
NonOwner 0.69 0.58 0.00
WithoutCar 0.35 0.29 0.00
InsecureJobs 0.14 0.13 0.00
SingleParentFamilies 0.18 0.17 0.02
SteadyJobs 0.63 0.65 0.01
MedianIncome 19693.52 21986.21 0.00
ParkingSpace 0.35 0.43 0.00
IndividualHouse 0.30 0.45 0.00
TwoOrMoreCars 0.14 0.21 0.00
[1] "Category 3"
Mean in category Overall mean p.value
UnemployedTotal 0.33 0.16 0
ForeignPop 0.14 0.05 0
SingleParentFamilies 0.28 0.17 0
SubsidizedHousing 0.74 0.26 0
NoDiplomas 0.30 0.16 0
InsecureJobs 0.18 0.13 0
WithoutCar 0.47 0.29 0
NonOwner 0.90 0.58 0
MultipleDwellingUnits 0.85 0.52 0
IndividualHouse 0.13 0.45 0
TwoOrMoreCars 0.08 0.21 0
ParkingSpace 0.19 0.43 0
MedianIncome 12624.39 21986.21 0
SteadyJobs 0.46 0.65 0
NULL
R> plot(categories$analysis, choice = "map", label = "none", draw.tree = FALSE)
We can see that the optimal number of categories (according to the inertia criterion) was 3. The description of these categories showed that they are organised by decreasing socioeconomic status. Indeed, category 1 has higher average values of variables like median income or proportion of steady jobs, and lower average values of proportion of unemployed people or proportion of subsidized housing; whereas category 3 has lower values of median income and higher values of unemployment.
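The comparison between category means and the overall mean can also be reproduced with base R alone. The sketch below uses a small made-up data set (the values and the `toy` object are hypothetical) and computes only the means and their differences, not the p-values shown in the output above.

```r
# Made-up example: 6 units, 3 categories, 2 variables
toy <- data.frame(
  category        = c(1, 1, 2, 2, 3, 3),
  MedianIncome    = c(28000, 27000, 20000, 19000, 13000, 12000),
  UnemployedTotal = c(0.09, 0.11, 0.15, 0.17, 0.32, 0.34)
)
vars <- c("MedianIncome", "UnemployedTotal")

# Mean of each variable within each category
cat_means <- aggregate(toy[vars], by = list(category = toy$category), FUN = mean)

# Overall mean of each variable
overall <- colMeans(toy[vars])

# Category mean minus overall mean, one row per category
diffs <- sweep(as.matrix(cat_means[vars]), 2, overall)
rownames(diffs) <- cat_means$category

print(cat_means)
print(round(diffs, 2))
```

In this toy example income decreases and unemployment increases from category 1 to category 3, which is the same kind of ordering described for the Lille categories.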
Finally, we want to export the detailed results of all three steps of the creation of the SES index and of the classification. We also want to export a data file containing the index and the categories. To do so, the SesReport function is used to create an .html report (see Appendix). By default, files are created in the current working directory with the basename “SesReport” (which can be changed through the arguments of the SesReport function).
R>SesReport(categories)
In this article we presented the SesIndexCreatoR package, designed to easily create socioeconomic indices with a reproducible statistical procedure. One originality of this procedure compared to other existing indices lies in selecting the final variables for the index using data mining techniques rather than only information gleaned from a literature review, which removes part of the subjectivity that may influence the choice of the variables. This data-driven approach lets the data “speak for themselves”.
The SesIndexCreatoR package allows this procedure to be applied in a versatile way, by specifying which steps of the procedure should be run (for instance only step 2 if the aim is to compare the selection of variables between metropolitan areas without creating indices, or only step 3 if one wants all the introduced variables to be in the index), adding illustrative units or selecting the method used. Once the index is created, several tools are available to visualize, synthesize, explore and export the results in a convenient way for further use.
We plan to extend the package in the future. Among other improvements, we foresee implementing other methods of classification, adding more tools to help the interpretation of the results, and allowing other ways of visualization (such as mapping). However, these improvements will be made according to user feedback and needs.