What is a codebook?

A codebook provides information on the structure, contents, and layout of a data file. Users are strongly encouraged to review the codebook of a study before downloading the data file(s).

Although codebooks vary widely in quality and amount of information given, a typical codebook includes:

  1. Column locations and widths for each variable
  2. Definitions of different record types
  3. Response codes for each variable
  4. Codes used to indicate nonresponse and missing data
  5. Exact questions and skip patterns used in a survey
  6. Other indications of the content and characteristics of each variable

Codebooks may also contain:

  1. Frequencies of response
  2. Survey objectives
  3. Concept definitions
  4. A description of the survey design and methodology
  5. A copy of the survey questionnaire
  6. Information on data collection, data processing, and data quality

The body of a codebook describes the content of the data file. The following elements are generally included for each variable in the data file:

  1. Variable Name: Indicates the variable number or name assigned to each variable in the data collection.
  2. Variable Column Location: Indicates the starting location and width of a variable. If the variable is a multiple-response type, then the width referenced is that of a single response.
  3. Variable Label: Indicates an abbreviated variable description (maximum of 40 characters) that can be used to identify the variable. In some cases, an expanded version of the Variable Name can be found in a Variable Description List.
  4. Missing Data Code: Indicates the values and labels of missing data. If "9" is a missing value, then the codebook could note "9 = Missing Data." Other examples of missing data labels include "Refused," "Don't Know," "Blank (No Answer)," and "Legitimate Skip." Some analysis software requires that certain types of data (i.e., inappropriate, not ascertained, not ascertainable, or ambiguous data categories) be designated as "Missing Data" and excluded from analysis. Users can apply these "Missing Data" codes as needed.
  5. Code Value: Indicates the code values occurring in the data for a variable.
  6. Value Label: Indicates the textual definitions of the codes. Abbreviations commonly used in the code definitions are "DK" ("Don't Know"), "NA" ("Not Ascertained"), and "INAP" ("Inapplicable").
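
As an illustration of how these elements work together, the following is a minimal Python sketch of reading one fixed-width record with the help of codebook entries. The variable names (V1 to V3), column positions, codes, and labels are hypothetical and do not come from any particular study. The sketch also assumes that column locations are 1-based, which should be confirmed against the codebook being used.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """One codebook entry (all example values below are hypothetical)."""
    name: str                  # Variable Name
    start: int                 # Variable Column Location: starting column (assumed 1-based)
    width: int                 # Variable Column Location: width in columns
    label: str                 # Variable Label
    value_labels: dict = field(default_factory=dict)   # Code Value -> Value Label
    missing_codes: frozenset = frozenset()             # Missing Data Codes


def parse_record(line, variables):
    """Slice one fixed-width record into labeled values using the codebook."""
    row = {}
    for var in variables:
        # Convert the 1-based starting column to a 0-based slice.
        begin = var.start - 1
        raw = line[begin:begin + var.width].strip()
        if raw == "" or raw in var.missing_codes:
            # Blank fields and declared missing codes are treated as missing.
            row[var.name] = None
        else:
            # Use the value label when one is defined, else keep the raw code.
            row[var.name] = var.value_labels.get(raw, raw)
    return row


# Hypothetical codebook describing a 7-column record.
CODEBOOK = [
    Variable("V1", 1, 4, "Respondent ID"),
    Variable("V2", 5, 1, "Sex", {"1": "Male", "2": "Female"}, frozenset({"9"})),
    Variable("V3", 6, 2, "Age in years", missing_codes=frozenset({"98", "99"})),
]

if __name__ == "__main__":
    print(parse_record("0001234", CODEBOOK))
    print(parse_record("0002999", CODEBOOK))
```

In this sketch every missing data code is collapsed into a single missing value. An analysis that needs to distinguish, say, "Refused" from "Legitimate Skip" would keep the original codes instead.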