Monday, October 21, 2013

A great lecturer, and the contextuality of nonresponse

I love watching videos of Richard Feynman on YouTube. Apart from being entertaining, in the video below Feynman explains quite subtly what constitutes a good scientific theory and what doesn't. He is right that good theories are precise theories.

Richard Feynman: fragment from a class on the philosophy of science (source: YouTube)

The video also makes me jealous of natural scientists. In the social sciences, unlike in the natural sciences, almost all processes and causal relationships are contextual. In survey methods, for example, nonresponse is one of those contextual phenomena. Nonresponse always occurs, but the predictors of nonresponse differ across countries, survey topics, time, survey modes, and subpopulations. That is what makes building a theory of nonresponse so difficult.

Thursday, October 17, 2013

My prayers for peer-reviewed datasets are instantly answered

Three weeks ago, I wrote that I think it would be great if we could have a journal of peer-reviewed datasets (with the data themselves being accessible).

It seems I am not alone in thinking this. Jelte Wicherts, a psychologist/statistician at Tilburg University, has just started the Journal of Open Psychology Data.

Jelte writes:

"The Journal of Open Psychology Data (JOPD) features peer reviewed data papers describing psychology datasets with high reuse potential. Data papers may describe data from unpublished work, including replication research, or from papers published previously in a traditional journal. We are working with a number of specialist and institutional data repositories to ensure that the associated data are professionally archived, preserved, and openly available. Equally importantly, the data and the papers are citable, and reuse is tracked.

Monday, October 14, 2013

Imagine we have great covariates for correcting for unit nonresponse...

I am continuing on the article and commentaries on weighting to correct for unit nonresponse by Michael Brick, published in the most recent issue of the Journal of Official Statistics (here).

The article is by no means only about whether one should impute or weight; I am just picking out one issue that got me thinking. Michael Brick rightly says that in order to correct successfully for unit nonresponse using covariates, we want the covariates to do two things:

1. They should explain missingness.
2. They should be highly correlated with our variable of interest.

In other words, these are the two conditions for a Missing At Random (MAR) process of missing data.
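To make this concrete, here is a minimal toy simulation in Python (my own hypothetical sketch, not taken from Brick's article). When a covariate X drives both response and the variable of interest, weighting respondents by their inverse response propensity recovers the true mean; for brevity I use the true propensities rather than estimating them.

```python
# Hypothetical sketch: nonresponse depends on an observed covariate x,
# which also correlates with y. Inverse-propensity weighting removes the bias.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

x = rng.normal(size=n)                    # covariate known for the full sample
y = 2.0 + 1.5 * x + rng.normal(size=n)    # variable of interest, correlated with x

# Response propensity depends on x only -> Missing At Random given x
p_respond = 1 / (1 + np.exp(-(0.5 + 1.0 * x)))
respond = rng.random(n) < p_respond

print("true mean of y:           ", y.mean().round(3))
print("respondent mean (biased): ", y[respond].mean().round(3))

# Inverse-propensity weights for respondents (in practice these would be
# estimated, e.g. with a logistic regression of response on x)
w = 1 / p_respond[respond]
print("weighted respondent mean: ", np.average(y[respond], weights=w).round(3))
```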

The variables (covariates) we currently use for nonresponse adjustments do neither. Gender, age, ethnicity, region, and (if we're lucky) education, household composition, and housing characteristics explain neither missingness nor our variable of interest. Would it ever be conceivable to obtain covariates that do? What are the candidates?

1. Covariates (X) that explain missingness (R):
Paradata are currently our best bet. These may be interviewer observations or call records collected during fieldwork (note the absence of sample-level paradata for self-administered surveys - here lies a task for us). Paradata don't explain missingness very well at the moment, but I think everyone in survey research agrees we can try to collect more.
Another set of candidates are variables obtained by enriching sampling frames: we can use marketing data, social networks, or census data to get more information on our sampling units.

2. Covariates (X) that explain our variable of interest (Y):
Even if we find covariates that explain missingness, we also want those covariates to be highly correlated with our variable of interest. It is very unlikely that a fixed set of, for example, paradata variables can ever achieve that. Enriched frame data may be more promising, but it is unlikely that this will work in general. I think it is a huge problem that our nonresponse adjustment variables (X) are not related to Y, and one that is not likely ever to be resolved for cross-sectional surveys.

But in longitudinal surveys, this is an entirely different matter. Because we usually ask the same variables over time, we can use variables from earlier occasions to predict values that are missing at later waves. So there we do have great covariates that explain our variable of interest. We can use those as long as MAR holds; if change in the dependent variable is associated with attrition, MAR does not hold. Strangely, I know of very few studies that examine whether attrition is related to change in the dependent variable. Usually, attrition studies focus on covariates measured before attrition to explain attrition; they do not focus on change in the dependent variable.
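A toy illustration of that failure mode (again my own hypothetical sketch, not from any of the studies mentioned): wave-2 values are imputed from wave-1 values of the same variable, but dropout depends on the change between waves, so the imputed mean stays biased.

```python
# Hypothetical two-wave panel: impute missing wave-2 values from wave 1.
# This works when attrition is MAR given wave 1, but fails when dropout
# depends on the *change* in y between the waves.
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

y1 = rng.normal(5, 1, n)                  # wave 1, observed for everyone
change = rng.normal(0.5, 1, n)            # individual change between waves
y2 = y1 + change                          # wave 2, partly missing

# Attrition depends on the change itself -> not MAR given y1
p_stay = 1 / (1 + np.exp(-(1.0 - 1.5 * change)))
stay = rng.random(n) < p_stay

# Regression imputation of y2 from y1, fitted on stayers only
b1, b0 = np.polyfit(y1[stay], y2[stay], 1)
y2_imp = np.where(stay, y2, b0 + b1 * y1)

print("true wave-2 mean:  ", y2.mean().round(3))
print("stayers-only mean: ", y2[stay].mean().round(3))
print("imputed-data mean: ", y2_imp.mean().round(3))   # still biased
```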

Covariate adjustment for nonresponse in cross-sectional and longitudinal surveys

(Follow-up, 28 October 2013): When adjustment variables are strongly linked to the dependent variables but not to nonresponse, variances tend to be increased (see Little and Vartivarian). So, in longitudinal surveys, the typically weak link between X and R should really be at least of medium strength as well, if adjustment is to be successful.

I once thought that, because we have so much more information in longitudinal surveys, we could use the lessons learned from attrition analyses to improve nonresponse adjustments in cross-sectional surveys. In a forthcoming book chapter, however, I found that the correlates of attrition are very different from the correlates of nonresponse in wave 1. So, in my view, the best we can do in cross-sectional surveys is to focus on explaining missingness, and then hope for the best for the prediction of our variables of interest.

Sunday, October 6, 2013

To weight or to impute for unit nonresponse?

This week, I have been reading the most recent issue of the Journal of Official Statistics, a journal that has been open access since the 1980s. This issue contains a critical review article on weighting procedures authored by Michael Brick, with commentaries by Olena Kaminska (here), Phillip Kott (here), Roderick Little (here), and Geert Loosveldt (here), and a rejoinder (here).

I found the article a great read, full of ideas related to unit nonresponse. It reviews approaches to weighting: either to the sample or to the population, by poststratification, and with different statistical techniques. But it discusses much more, and I recommend reading it.

One of the issues discussed in the article, and much more extensively in the commentary by Roderick Little, is whether we should use weighting or imputation to adjust for unit nonresponse in surveys. Over the years, I have switched allegiance between weighting and imputation in particular missing-data situations many times, and I am still not always certain what is best to do. Weighting is generally favoured for cross-sectional surveys, because we understand how it works. Imputation is generally favoured when we have strong correlates of both missingness and our variable(s) of interest, such as in longitudinal surveys. Here are some plusses and minuses of both weighting and imputation.

Weighting is design-based. Based on information that is available for the population or for the whole sample (including nonrespondents), respondent data are weighted in such a way that the survey data reflect the sample or population again.

+ The statistical properties of all design-based weighting procedures are well-known.
+ Weighting works with complex sampling designs (at least theoretically).
+ We need relatively little information on nonrespondents to be able to use weighting procedures. There is, however, a big BUT...
- Weighting models mainly use socio-demographic data, because that is the kind of information we can add to our sampling frame. These variables are never highly correlated with our variable of interest, nor with missingness due to nonresponse, so weighting is not very effective. That is, weighting works nicely in theory, but in practice it does not do much to ameliorate the missing-data problem caused by unit nonresponse.
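For readers who have never built weights themselves, here is a minimal sketch of cell (poststratification) weighting in Python/pandas. The cells, counts, and variables are made up for illustration; the only idea is that each respondent gets the weight population count / respondent count of their cell.

```python
# Hypothetical poststratification example: weights make the respondent
# distribution over cells (gender x age group) match known population counts.
import pandas as pd

# Known population counts per cell (e.g., from a census or sampling frame)
population = pd.DataFrame({
    "gender":    ["f", "f", "m", "m"],
    "age_group": ["<40", "40+", "<40", "40+"],
    "pop_n":     [2600, 2400, 2500, 2500],
})

# Respondent counts per cell (younger respondents underrepresented)
respondents = pd.DataFrame({
    "gender":    ["f", "f", "m", "m"],
    "age_group": ["<40", "40+", "<40", "40+"],
    "resp_n":    [180, 320, 150, 350],
})

cells = population.merge(respondents, on=["gender", "age_group"])
cells["weight"] = cells["pop_n"] / cells["resp_n"]   # poststratification weight
print(cells)
```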

Imputation is model-based. Based on available information for respondents and nonrespondents, a prediction model is built for each variable that has missing information. The model can take an infinite number of shapes, depending on whether imputation is stochastic, how variables are related within the model, and which variables are used. Based on this model, one or multiple values are imputed for every missing value on every variable for every case. The crucial difference is that weighting uses the same variables for correcting the entire dataset, whereas imputation models differ for every variable that is to be imputed.

+ Imputation models are flexible. This means the imputation model can be optimized so that it strongly predicts both the variable to be imputed and the missingness process.

- In the case of unit nonresponse, we often have limited data on nonrespondents. So, although a model-based approach may have advantages over design-based approaches in terms of its ability to predict our variable(s) of interest, this depends on the quality of the covariates we use.
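To show what such a model-based correction could look like, here is a simple hypothetical sketch of stochastic regression imputation with repeated draws (a crude stand-in for multiple imputation, not the specific methods discussed in the commentaries). It only works here because the covariate x is known for everyone and predicts both response and y.

```python
# Hypothetical sketch: fit a model for y on a covariate x known for all units,
# then impute each missing y with a prediction plus a random residual.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

x = rng.normal(size=n)                            # covariate available for all units
y = 1.0 + 2.0 * x + rng.normal(size=n)            # variable of interest
respond = rng.random(n) < 1 / (1 + np.exp(-x))    # response depends on x

# Fit the imputation model on respondents only
b1, b0 = np.polyfit(x[respond], y[respond], 1)
resid_sd = np.std(y[respond] - (b0 + b1 * x[respond]))

# Draw m imputed datasets and pool the means
m = 20
means = []
for _ in range(m):
    y_imp = np.where(respond, y,
                     b0 + b1 * x + rng.normal(scale=resid_sd, size=n))
    means.append(y_imp.mean())

print("true mean:          ", y.mean().round(3))
print("respondent mean:    ", y[respond].mean().round(3))
print("pooled imputed mean:", np.mean(means).round(3))
```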

This then brings me, and the authors of the various papers in JOS, back to the basic problem: we do not understand the process of nonresponse in surveys. Next time, more on imputation and weighting for longitudinal surveys, and more on design-based versus model-based approaches in survey research.

P.S. This all assumes simple random sampling. If complex sampling designs are used, weighting is, I think, still the best way to start dealing with nonresponse. I am unaware of imputation methods that can deal with complex sampling designs (other than straightforward multilevel structures).