Automatic Coding of Occupations
Project duration: 01.01.2014 to 31.12.2016
Abstract
In recent years, several large-scale German panel studies (e.g. NEPS, SOEP and PASS) have demonstrated the demand for coding open-ended survey questions on respondents' occupations. So far, occupational coding in Germany has mostly been done semi-automatically: a dictionary approach codes the answers it can match, and the remaining cases are coded manually.
Since manual coding of occupations is considerably more expensive than automatic coding, it is highly desirable from a survey cost perspective to increase the proportion of cases that can be coded automatically. At the same time, the quality of the coding is of paramount importance and calls for close scrutiny. The quality of the automatic coding must at least match that of the manual coding if survey cost is not to be traded for survey error. From a total survey error perspective, a higher share of automatic coding at equal quality would free resources formerly spent on reducing processing error, which could then be used to reduce other sources of error.
In contrast to the dictionary approaches that are mainly used for automatic occupational coding in German surveys, we will employ different machine learning algorithms (e.g. naive Bayes or k-nearest-neighbours) for the task. Since we have a substantial amount of manually coded occupations from recent studies at our disposal, we will use these as training data for the automatic classification. This enables us to evaluate the performance and the quality, and hence the feasibility, of machine learning algorithms for the automatic coding of open-ended survey questions on occupations.
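To illustrate the general idea, the following is a minimal sketch of a naive Bayes text classifier trained on manually coded answers and checked against a small held-out set. It is not the project's implementation: the class name `NaiveBayesCoder`, the whitespace tokenizer, the Laplace smoothing parameter `alpha`, the toy answers and the category labels (`TEACHING`, `NURSING`, `DRIVING`) are all assumptions made for illustration, and a real application would use codes from an official occupational classification and far more training data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case an answer and split it on whitespace (simplifying assumption)."""
    return text.lower().split()


class NaiveBayesCoder:
    """Multinomial naive Bayes with Laplace smoothing (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant (assumed value)
        self.class_counts = Counter()            # number of training answers per code
        self.token_counts = defaultdict(Counter) # token frequencies per code
        self.vocab = set()                       # all tokens seen in training

    def fit(self, answers, codes):
        for answer, code in zip(answers, codes):
            self.class_counts[code] += 1
            for tok in tokenize(answer):
                self.token_counts[code][tok] += 1
                self.vocab.add(tok)
        return self

    def log_scores(self, answer):
        """Return the unnormalised log posterior of every code for one answer."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        tokens = [t for t in tokenize(answer) if t in self.vocab]  # skip unseen words
        scores = {}
        for code, n in self.class_counts.items():
            denom = sum(self.token_counts[code].values()) + self.alpha * vocab_size
            score = math.log(n / total)  # log prior
            for tok in tokens:
                score += math.log((self.token_counts[code][tok] + self.alpha) / denom)
            scores[code] = score
        return scores

    def predict(self, answer):
        scores = self.log_scores(answer)
        return max(scores, key=scores.get)


def accuracy(coder, answers, codes):
    """Share of held-out answers whose predicted code equals the manual code."""
    hits = sum(coder.predict(a) == c for a, c in zip(answers, codes))
    return hits / len(answers)


# Hypothetical, hand-made training data standing in for manually coded answers.
train = [
    ("primary school teacher", "TEACHING"),
    ("high school teacher", "TEACHING"),
    ("teacher of mathematics", "TEACHING"),
    ("registered nurse", "NURSING"),
    ("hospital nurse", "NURSING"),
    ("nurse in a clinic", "NURSING"),
    ("truck driver", "DRIVING"),
    ("bus driver", "DRIVING"),
]
# Hypothetical held-out answers with their manual codes.
held_out = [
    ("school teacher", "TEACHING"),
    ("night nurse", "NURSING"),
    ("bus driver", "DRIVING"),
]

coder = NaiveBayesCoder().fit(*zip(*train))
held_out_accuracy = accuracy(coder, *zip(*held_out))
```

Comparing predictions against manual codes on held-out answers, as `accuracy` does here, is one simple way to check whether automatic coding matches the manual coding; on real survey data a larger evaluation set and per-category error rates would be needed.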