

Hrishikesh Chakraborty
Corresponding author: Hrishikesh Chakraborty rishic@mailbox.sc.edu
Author Affiliations
Department of Epidemiology and Biostatistics, Arnold School of Public Health, The University of South Carolina, Columbia, USA.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Studies frequently have missing values for both dependent and independent variables. Multiple imputation (MI) techniques replace missing values so that complete-data analysis methods can be employed. Mixed models are often used to analyze data with missing independent variable values, but a mixed model excludes every data vector that has even one missing independent variable value; when many independent variables have missing values, a substantial number of data vectors are dropped from the analysis, making it difficult to draw meaningful conclusions. Our goal was to use multiple imputation techniques to impute missing values of independent variables in a longitudinal study setting and then to use the completed data matrices to fit the same mixed model independently to each imputed dataset. In this paper, we present a method to combine the multiple estimates and inferential statistics generated from multiply imputed datasets analyzed with the same mixed model, which has not previously been done. Additionally, we incorporated two sets of variance-covariance matrices for each imputed dataset and adjusted the degrees of freedom. In the examples, we used mixed models to compare estimates from the original datasets with the combined estimates from the multiply imputed datasets. We conclude that in some situations it is desirable to use multiple imputation techniques and mixed models together to draw conclusions.
Keywords: Multiple imputations, mixed models, missing values, vectors
Multiple imputation (MI) techniques are commonly accepted methods for replacing missing values with imputed values based on an underlying model. Data are imputed prior to analysis, and standard statistical methods are then applied to the imputed datasets. Over the years there has been remarkable improvement in statistical methods that address the different types of missing values, and statistical software is readily available for data analysis with missing observations in many modeling contexts. Some previous studies have examined missing data in combination with mixed models, but none of them extended the approach to multiple imputation of missing data as we do in this study. One study [1] showed that, in studies with a high percentage of missing values, the mixed model approach without any ad hoc imputation is more powerful, and a simulation study [2] found that MI of missing repeated outcome measurements did not increase the precision of the estimated rate of change in the endpoint relative to linear mixed-effects models. A recent study [3] compared the mixed-effects model repeated measures approach with last observation carried forward using 25 NDA datasets. However, the application of inferential statistics within mixed models using imputed data is not fully developed, nor is software readily available to support such an approach. We therefore became interested in using multiple imputation to impute missing values of independent variables so that mixed model analyses can make use of the maximum available information.
In mixed models, independent variables are treated as known values, but in practice we encounter missing values for both dependent and independent variables. When conducting data analysis using mixed models, an entire row of the data matrix is excluded if even a single independent variable value is missing. This results in a fitted model based on fewer subjects or time points, that is, a reduced data matrix, and therefore excludes a great deal of information from the model. Because so much information can be excluded, it can be difficult to draw meaningful conclusions.
In this paper, we present a way to use the entire dataset by combining appropriate partial multiple imputation (imputing only scientifically meaningful values for the independent variables) with mixed models. First, we imputed the missing independent variable values five times using an appropriate multiple imputation technique, thereby creating five imputed datasets. We then used a mixed model to analyze each imputed dataset. Finally, we combined the estimates (one set from each imputed dataset) into a single interpretable set by accounting for within- and between-imputation variability in the mixed model setting, while incorporating several variance-covariance components and adjusting the degrees of freedom for the significance tests.
We use two types of data to illustrate the technique: (1) cross-sectional data with clustering and (2) longitudinal repeated-measures data. We imputed independent variables only when it made sense scientifically and there was enough information to impute the missing values. We present the model estimates from the different multiply imputed datasets and compare the original estimates (from the datasets with missing values) with the combined estimates (from five models based on the five different partially imputed datasets). We also calculated the relative increase in variance for each variable.
Inference from complete data is typically based on a point estimate for a parameter θ, a variance-covariance matrix U, and a normal reference distribution. In the presence of missing values, multiple imputation is a state-of-the-art method to impute the missing values [4]. Different imputation methods are available to impute the missing values. Employing an appropriate imputation method will yield m (> 1) sets of imputed data.
To get m sets of estimates and variance-covariance matrix estimates, we fit an appropriate mixed model (based on the research question) to each of the imputed datasets. The mixed model equation for the j-th subject can be written as Yj = Xj β + Zj dj + Vj with the assumptions dj ~ NID(0, Δ) and Vj ~ NID(0, σ²I), so that the covariance matrix implied by the model equation is V(Yj) = Σj = Zj Δ Zj' + σ²I. The mean structure is specified using the "usual" linear model methods, E(Yj) = Xj β, where Yj is the vector of measurements from the j-th subject across all periods, Xj is the fixed-effects design matrix for the j-th subject, β is the fixed-effects parameter vector common to all subjects, Zj is the random-effects design matrix for the j-th subject, dj is the random-effects coefficient vector for the j-th subject (dj contains subject-specific increments to the population intercept and slope), and Vj is the vector of random "measurement errors" for the j-th subject.
The design matrix Xj and the fixed-effect parameters, the elements of β, are similar to the design matrix X and the regression parameters in a typical multiple regression, ANOVA, or ANCOVA model, in that E(Yj) = Xj β. Thus, an element of β may represent the "slope" of a regression surface with respect to a covariate, a treatment effect, or a similar quantity. The random-effects design matrix Zj and the subject-specific random-effects coefficient dj represent random deviations about E(Yj) = Xj β that are associated with data from the j-th subject. The vector Vj is the vector of random deviations, or "measurement errors," about the expected value of the data from the j-th subject; it is very similar to the vector of random deviations, usually denoted εj or ej, in a multiple regression, ANOVA, or ANCOVA model.
The combined estimate of θ from the m mixed models fitted to the imputed datasets is the average of the m estimates,

$$\bar{\theta} = \frac{1}{m}\sum_{k=1}^{m}\hat{\theta}_k, \qquad k = 1, 2, 3, \ldots, m.$$

The variance-covariance matrix estimate associated with $\bar{\theta}$ has within- and between-imputation components, and the total variance-covariance matrix estimate is

$$T = \bar{W} + \left(1 + \frac{1}{m}\right)B,$$

where the within-imputation component is

$$\bar{W} = \frac{1}{m}\sum_{k=1}^{m}\hat{U}_k$$

and the between-imputation component is

$$B = \frac{1}{m-1}\sum_{k=1}^{m}\left(\hat{\theta}_k - \bar{\theta}\right)\left(\hat{\theta}_k - \bar{\theta}\right)'$$

[15]. From each mixed model run we estimated the variance-covariance matrix of the observations, $V = ZGZ' + R$, where Z is the known random-effects design matrix, G is the random-effects covariance matrix, and R is the error covariance matrix.
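These combining rules translate directly into a few lines of code. The following Python/numpy sketch illustrates the formulas above in the paper's notation; it is an illustration only, not the Proc IML program used for the analysis.

```python
# Rubin's combining rules for m imputed-data analyses:
#   theta_bar = average of the m estimates
#   W_bar     = average within-imputation covariance matrix
#   B         = between-imputation covariance matrix
#   T         = W_bar + (1 + 1/m) * B   (total covariance)
import numpy as np

def combine_mi(estimates, covariances):
    """estimates: list of m parameter vectors; covariances: list of m (p x p) matrices."""
    Q = np.asarray(estimates)       # shape (m, p)
    U = np.asarray(covariances)     # shape (m, p, p)
    m = Q.shape[0]
    theta_bar = Q.mean(axis=0)
    W_bar = U.mean(axis=0)
    dev = Q - theta_bar
    B = dev.T @ dev / (m - 1)
    T = W_bar + (1.0 + 1.0 / m) * B
    return theta_bar, W_bar, B, T
```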
To test linear hypotheses $H_0: L\theta = c$ about the parameters (where L is a matrix of coefficients defining the linear hypotheses and c is a vector of constants), we used two different test statistics depending on whether univariate or multivariate inference was required. For univariate inference (when the rank of L is equal to 1), we used the t-statistic

$$t = \frac{L\bar{\theta} - c}{\sqrt{L T L'}}$$

with degrees of freedom

$$\nu = (m - 1)\left(1 + \frac{1}{r}\right)^2, \qquad r = \frac{\left(1 + \frac{1}{m}\right) L B L'}{L \bar{W} L'}.$$

The quantity r is the relative increase in variance due to missing values, and the quantity

$$\hat{\lambda} = \frac{r + 2/(\nu + 3)}{r + 1}$$

is called the fraction of missing information about θ [15]. For multivariate inference (when the rank of L is greater than 1), we used the F-statistic

$$F = \frac{\left(L\bar{\theta} - c\right)'\left(L\bar{W}L'\right)^{-1}\left(L\bar{\theta} - c\right)}{p\left(1 + \bar{r}\right)},$$

where rank(L) = p and

$$\bar{r} = \left(1 + \frac{1}{m}\right)\frac{\operatorname{tr}\!\left(LBL'\left(L\bar{W}L'\right)^{-1}\right)}{p}$$

is an average relative increase in variance due to nonresponse [4,5]. In some situations, specifically for small m, the between-imputation covariance matrix B is unstable and does not have full rank [5]. The suggested remedy for an unstable between-imputation covariance matrix is to assume that the population between- and within-imputation covariance matrices are proportional to one another. A more stable estimate of the total variance can then be calculated, which leads to a different test statistic and a change in the degrees of freedom. One suggestion is to take the denominator degrees of freedom ν as [6]

$$\nu = 4 + (t - 4)\left[1 + \left(1 - \frac{2}{t}\right)\frac{1}{\bar{r}}\right]^2 \quad \text{for } t = p(m - 1) > 4,$$

and

$$\nu = \frac{t\left(1 + \frac{1}{p}\right)\left(1 + \frac{1}{\bar{r}}\right)^2}{2} \quad \text{for } t = p(m - 1) \le 4.$$
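The univariate and multivariate inference described above can be sketched as follows. This Python/scipy illustration implements the standard MI formulas given above (relative increase in variance, fraction of missing information, and the adjusted degrees of freedom); it is an illustration, not the Proc IML code used in this paper.

```python
# MI inference for H0: L*theta = c, using theta_bar, W_bar, and B
# from the combining step.
import numpy as np
from scipy import stats

def mi_t_test(L, c, theta_bar, W_bar, B, m):
    """Univariate inference (rank(L) = 1); L is a 1 x p row vector."""
    L = np.atleast_2d(L)
    est = (L @ theta_bar - c).item()
    w = (L @ W_bar @ L.T).item()                 # within-imputation variance
    b = (L @ B @ L.T).item()                     # between-imputation variance
    r = (1.0 + 1.0 / m) * b / w                  # relative increase in variance
    total = w + (1.0 + 1.0 / m) * b              # total variance L T L'
    nu = (m - 1) * (1.0 + 1.0 / r) ** 2          # adjusted degrees of freedom
    lam = (r + 2.0 / (nu + 3.0)) / (r + 1.0)     # fraction of missing information
    t_stat = est / np.sqrt(total)
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=nu)
    return t_stat, nu, p_value, r, lam

def mi_f_test(L, c, theta_bar, W_bar, B, m):
    """Multivariate inference (rank(L) = p > 1), assuming B proportional to W_bar."""
    L = np.atleast_2d(L)
    p = L.shape[0]
    d = L @ theta_bar - c
    Wl = L @ W_bar @ L.T
    Bl = L @ B @ L.T
    r_bar = (1.0 + 1.0 / m) * np.trace(Bl @ np.linalg.inv(Wl)) / p
    F = float(d @ np.linalg.inv(Wl) @ d) / (p * (1.0 + r_bar))
    t = p * (m - 1)
    if t > 4:
        nu = 4.0 + (t - 4.0) * (1.0 + (1.0 - 2.0 / t) / r_bar) ** 2
    else:
        nu = t * (1.0 + 1.0 / p) * (1.0 + 1.0 / r_bar) ** 2 / 2.0
    p_value = stats.f.sf(F, p, nu)
    return F, p, nu, p_value, r_bar
```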
A number of statistical software packages are able to impute missing values. In this paper, we used SAS [7] Proc MI to generate the multiply imputed datasets; it uses the EM algorithm for the imputation and assumes a multivariate normal distribution and data that are missing at random (MAR). Next, we used Proc MIXED with a random intercept and slope to generate the estimates and variance-covariance matrices for each imputed dataset. There is no SAS procedure available to accommodate two different covariance matrices (G and R) in the analysis and to make inferences; SAS Proc MIANALYZE can generate combined inferences only in simple situations. We therefore used Proc IML to program the procedure that obtains the average estimates, the combined within- and between-imputation standard errors, and the approximate degrees of freedom, and that generates the inferential results.
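For readers without access to SAS, the pipeline just described (impute m times, fit the same mixed model to each completed dataset, then combine) can be approximated in Python as sketched below. The dataset, the column names, and the use of scikit-learn's IterativeImputer with sample_posterior=True as a stand-in for Proc MI are assumptions made for illustration only.

```python
# Rough analogue of the Proc MI -> Proc MIXED -> combining-step pipeline:
# impute the independent variables m times, fit the same mixed model to each
# completed dataset, and collect the per-imputation estimates for pooling.
import numpy as np
import statsmodels.formula.api as smf
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def fit_on_imputed(df, m=5):
    impute_cols = ["x1", "x2"]      # hypothetical independent variables with missing values
    estimates, covariances = [], []
    for k in range(m):
        # Different seeds with sample_posterior=True yield m distinct imputations
        imputer = IterativeImputer(sample_posterior=True, random_state=k)
        completed = df.copy()
        completed[impute_cols] = imputer.fit_transform(df[impute_cols])
        completed = completed.dropna(subset=["y"])  # missing outcomes are not imputed
        fit = smf.mixedlm("y ~ time + x1 + x2", data=completed,
                          groups=completed["subject"], re_formula="~time").fit()
        estimates.append(fit.fe_params.to_numpy())
        # Diagonal within-imputation covariance from the fixed-effect standard
        # errors keeps the sketch simple; the full covariance could be used instead.
        covariances.append(np.diag(fit.bse_fe.to_numpy() ** 2))
    return estimates, covariances   # pool these with the combining rules sketched above
```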
Cross-sectional data with clustering
For this analysis, we merged databases from the National Center for Education Statistics (NCES) [8] and the National Institute of Child Health and Human Development (NICHD) Study of Early Child Care and Youth Development (SECCYD). The SECCYD was conducted by the NICHD Early Child Care Research Network, supported by NICHD through a cooperative agreement that calls for scientific collaboration between the grantees and NICHD staff. The NCES database contains descriptive information on all U.S. public schools and their students, teachers, curricula, and factors related to the children's classroom environment [9]. Details of the original sampling procedure and characteristics of the SECCYD database have been described in several publications [10-14]. The SECCYD database contains longitudinal information collected from an initial sample of 1,364 children born at 10 study sites [6]. Data from the 54-month and first-grade time points were selected from the SECCYD database, and data corresponding to the children's year in first grade were selected from the NCES database. These variables were used to model the children's experience in first grade based on economic resources at the family level and the school level [13-17].
The NCES database contained information only on public schools, whereas the SECCYD database included children in both public and private schools. Of the 1,364 children in the SECCYD database, 709 were successfully mapped to their respective public schools, after 57 children were excluded from the analysis so that only one child was selected from each classroom. Therefore, each child in the combined dataset represents a unique classroom, resulting in clusters of unique classrooms within each school district.
The dependent variable used in this example reflects classroom instructional quality in first grade. Independent variables used in these models are related to the children's socio-demographic, academic, classroom, and school-related attributes at 54 months and in first grade. We did not impute all the missing values for all independent variables. Instead we selectively imputed the missing values based on their scientific validity. For example, we chose not to impute independent variables related to instructional spending or percentage of nonwhite students for each school in a school district because there was no way to scientifically estimate such values from the NCES database. Additionally, categorical variables were not imputed. Given the structure of these data, clustering was taken into account at the school-district level in the mixed model analysis.
The dependent variable for the model, with only one missing value, was classroom instructional quality at first grade. Independent variables were spending for instruction per child, percentage of nonwhite students, total number of children in the class at first grade, teacher's experience, teacher's education, income-to-needs ratio at first grade, mother's education, class persistence of students at first grade, teacher's perception of barriers in the first-grade class, WJ-R mean cognitive score at 54 months, WJ-R mean achievement score at 54 months, and percentage of time in daycare from 6 to 54 months.
Longitudinal data
Data from the SECCYD study were used again in the second example. The data were structured longitudinally and were collected at 24, 36, and 54 months and during first grade. A total of 1,364 children were included in the analysis, each potentially having four cognitive and language data points. Of the 5,456 observations in the analysis dataset, 998 observations from 659 children were excluded due to missing dependent variable values, and 6 additional observations were deleted because their variable values were not imputable, as in the previous example.
Independent variables at the four time points included household income, race, age, maternal education, maternal depression, maternal sensitivity, presence of a husband/partner in the household, and number of hours in childcare. Household income was based on a calculation of the median income-to-needs ratio over two time periods up to first grade. Income-to-needs was dichotomized, after imputation, as low or high in each of the two age periods, resulting in four categories of income: chronically poor, early poor, late poor, and never poor. Mother's education was measured in years. Maternal depression was measured with the Center for Epidemiological Studies Depression scale (CES-D). Maternal sensitivity represents constructs of measurements related to sensitivity to nondistress, positive regard, and intrusiveness at 24 months, and supportive presence, respect for autonomy, and hostility at the 36-month, 54-month, and first-grade time points. The dependent variable was based on the children's cognitive and language development at 24, 36, and 54 months and first grade [18]. A mixed model with an individual random intercept and slope (unstructured covariance matrix for the intercept and slope) was used to model the child's cognitive outcome. (Table 3) contains the percentages of missing values for the analysis dataset, (Table 4) contains the parameter estimates and significance levels for the mixed models with imputed values, and (Table 5) compares the combined estimates based on the imputed datasets to the original model.
Table 3 : Percentage of Missing Values at Four Time Points.
Table 4 : Estimates from Mixed Models using Different Imputed Datasets.
Table 5 : Comparison of Combined Estimates to Original Model.
(Table 1) displays the percentage of missing values for independent variables in the database and the model estimates for the five imputed datasets of the cross-sectional model. The NCES variables have a maximum of nearly 20% missing values, whereas the SECCYD data have a maximum of nearly 8% missing. We assumed that the missing values are missing at random. The five imputed datasets correspond to the five sets of model estimates, illustrating the between-imputation variability among the parameter estimates. The estimates show slightly different values and significance levels; for example, the estimate of the income-to-needs ratio for Model 2 is lower than in the other models and possesses a p-value greater than 0.05.
Table 1 : Estimates from Mixed Models Using Different Imputed Datasets.
(Table 2) compares the parameter estimates and significance levels of the original dataset with the combined results from the five imputed datasets. "Original model" means that the same mixed model was used to generate the estimates from the dataset before imputation. The model based on the original dataset uses 585 data vectors, whereas the combined model based on the imputed datasets uses 699, an increase of 114 data vectors with the MI technique. Noteworthy are the lack of significance of the percentage of nonwhite students and the income-to-needs ratio, and the significance of class persistence of students, for the combined estimates. The income-to-needs ratio has the greatest relative increase in variance, 0.22.
Table 2 : Comparison of Combined Estimates to Original Model.
(Table 3) shows the percentage of missing values for four categories of longitudinal measures: maternal sensitivity, maternal depression, partner/husband in the household, and hours per week in child care. Missing values are common for these variables, ranging from 9% to 26% at each measurement point. Because the missing values may not coincide in the same data vectors, there is a high potential that an appreciable percentage of observations will be excluded from the model.
(Table 4) contains the estimates from the five imputed datasets. As in the cross-sectional example, there are differences in the parameter estimates among the five models, but there do not appear to be differences in significance at the p < 0.05 level. However, differences in estimates and significance are evident when the original model is compared with the combined imputed models (Table 5). Using the multiple imputation technique, the effective number of data vectors used in the model increased from 2,833 (original model) to 4,452. The estimate of maternal sensitivity decreased from 1.33 in the original model to 0.88 in the combined model. The estimate of maternal depression changed from -0.03 in the original model to -0.05 in the combined model and became significant at the p < 0.05 level. Maternal sensitivity, which had 14% to 26% missing data, had the greatest relative increase in variance, 0.34.
The use of multiple imputation in combination with mixed models is ideal when a significant number of data vectors would otherwise be excluded from the final model. It is very important to examine the nature of the missing values and the type of variable and to judge the appropriateness of the imputation before imputing any missing values; missing values should be filled in by appropriate imputation methods only when it is scientifically logical to do so. Using appropriate analysis methods, correctly specifying the variance-covariance components, and properly adjusting the degrees of freedom in a mixed model analysis of partially or completely multiply imputed datasets improves the acceptability of the results to the scientific community. We also found that the changes in estimates and inference in the combined model relative to the original model were not imputation dependent.
Whether fitting a cross-sectional or a longitudinal model, a multiple imputation technique can affect the model estimates and their corresponding significance levels. The examples demonstrated that even though there is a relative increase in variance associated with a combined estimate, the inclusion of more data vectors can offset this increase and achieve significance where there was none in the original model. The original model with missing values may not be representative of the underlying population model because it does not include all available information; the MI-based model, in turn, can be difficult to interpret because imputed values are based on an assumed distribution and model. For these reasons it is desirable to compare the model estimates from the original and the multiply imputed datasets on a scientific basis in order to arrive at defensible estimates and draw inferences.
The author declares that he has no competing interests.
The author would like to thank the National Institute of Child Health and Human Development (NICHD) Study of Early Child Care and Youth Development (SECCYD) for permission to use the data.
Editor: Guy Nathaniel Brock, University of Louisville, USA.
EIC: Max K. Bulsara, University of Notre Dame, Australia.
Received: 01-Jul-2013 Revised: 25-Sep-2013
Accepted: 01-Oct-2013 Published: 31-Oct-2013
Chakraborty H. Mixed model variance adjustments when missing values are multiply imputed. J Med Stat Inform. 2013; 1:2. http://dx.doi.org/10.7243/2053-7662-1-2
Copyright © 2015 Herbert Publications Limited. All rights reserved.