Machine Learning in Social Research: Predicting Item Nonresponse Error Using Naive Bayes Classifier




Keywords: item nonresponse, refusal to answer, no answer, “Don't know” option, naive Bayes classifier, text mining, European Social Survey, ESS, machine learning, measurement quality


Missing data in social research can arise for various reasons. This article focuses on item nonresponse that occurs when respondents do not know the answer, are unwilling to answer, or find it difficult to come up with an answer to a specific questionnaire item. Predicting item nonresponse is of particular interest because it would help reduce missing data. Using data from the European Social Survey (UK respondents), the article shows how text mining and machine learning can be used to predict item nonresponse. The study employs the Naive Bayes classifier, a popular method for predicting the class of a dependent variable from textual data, and draws on the scientific literature to show how the method performs. The author provides a database that combines the full wordings of questions, answers, and instructions with the ESS survey results for the UK. Separate models for predicting the occurrence of item nonresponse were trained with the Naive Bayes technique, one based on word frequencies and one based on TF-IDF weights (the calculation of both is also described). Each model was evaluated by its error rate. As a result, lists of words associated with the presence or absence of item nonresponse were obtained. The results show that respondents are less likely to answer sensitive questions; certain words related to the procedure of arriving at an answer can also lead to high levels of item nonresponse.
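
The modelling step described above can be illustrated with a minimal sketch. It is not the author's implementation: the question wordings and labels below are invented for illustration and are not taken from the ESS, and the function names, the smoothing constant `alpha`, and the particular TF-IDF variant (term count times `log(N / df) + 1`) are assumptions. The sketch trains one multinomial Naive Bayes model on word counts and one on TF-IDF weights, then compares per-word log-odds between the two classes.

```python
import math
from collections import Counter, defaultdict

# Invented toy question wordings (NOT taken from the ESS questionnaire).
# Label 1 = item with high nonresponse, 0 = low; the labels are hypothetical.
QUESTIONS = [
    ("what is your household total net income from all sources", 1),
    ("which party did you vote for in the last national election", 1),
    ("how many hours of television do you watch on an average weekday", 0),
    ("how often do you meet socially with friends or relatives", 0),
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def count_features(docs):
    """One {word: raw count} dict per tokenized document."""
    return [dict(Counter(doc)) for doc in docs]

def tfidf_features(docs):
    """One {word: tf * idf} dict per document, with idf = log(N / df) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}
    return [{word: tf * idf[word] for word, tf in Counter(doc).items()}
            for doc in docs]

def train_naive_bayes(features, labels, alpha=1.0):
    """Multinomial Naive Bayes with additive smoothing over word weights."""
    vocab = sorted({word for feats in features for word in feats})
    weight_sums = {label: defaultdict(float) for label in set(labels)}
    for feats, label in zip(features, labels):
        for word, weight in feats.items():
            weight_sums[label][word] += weight
    class_counts = Counter(labels)
    log_prior, log_likelihood = {}, {}
    for label, sums in weight_sums.items():
        denom = sum(sums.values()) + alpha * len(vocab)
        log_prior[label] = math.log(class_counts[label] / len(labels))
        log_likelihood[label] = {
            word: math.log((sums.get(word, 0.0) + alpha) / denom)
            for word in vocab
        }
    return log_prior, log_likelihood

def predict(model, text):
    """Return the class with the highest log posterior; unseen words are skipped."""
    log_prior, log_likelihood = model
    scores = {}
    for label in log_prior:
        word_scores = log_likelihood[label]
        scores[label] = log_prior[label] + sum(
            word_scores[w] for w in tokenize(text) if w in word_scores
        )
    return max(scores, key=scores.get)

def log_odds(model, word):
    """Positive values mean the word points toward class 1 (nonresponse)."""
    _, log_likelihood = model
    return log_likelihood[1][word] - log_likelihood[0][word]

docs = [tokenize(text) for text, _ in QUESTIONS]
labels = [label for _, label in QUESTIONS]
count_model = train_naive_bayes(count_features(docs), labels)
tfidf_model = train_naive_bayes(tfidf_features(docs), labels)
```

Sorting the vocabulary by `log_odds` yields word lists of the kind mentioned above: words at the top of the ranking are more typical of items with nonresponse, and words at the bottom are more typical of items without it. With real data, each model's error rate would be estimated on held-out questions rather than on the training set.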

Author Biography

Marina Yu. Aleksandrova, National Research University Higher School of Economics

  • National Research University Higher School of Economics, Moscow, Russia
    • Lecturer, Doctoral Student at the Department of Collection and Analysis of Sociological Information