What this primer is intended for:
A quick and superficial, but definitely fun intro to standard linguistic data analysis in R. R is a free software package for statistical data analysis and data visualization that also includes cutting-edge implementations of the hottest models currently used in linguistic data analysis. The class does *not* provide an introduction to all the concepts we will cover. Rather, we will see how to open and read in your data in R, prepare the data for analysis, visualize the data, analyze it, and test some of the assumptions of your analysis. The class can be followed with minimal or even no statistical background, although I encourage people to take the other statistics classes in the pre-session (which will be taught in SPSS). Although we will touch on fairly advanced topics here and there, using these techniques is often relatively straightforward once the basics of R are understood. The class lays the groundwork for some of the classes using R taught during the main session.
What you need to prepare before class:
The sessions are intended to be tutorials. That is, I encourage you to bring your own laptop and follow along. To do so, follow the instructions on this page to install R and all the required libraries on your system. This will come in handy for several of the other classes during the main session of the institute, too (most, if not all, of the libraries you will need for those classes are covered on this page). *Please also download the data sets we will use during class*.
Overview:
The first session (repeated twice) deals with continuous dependent variables (also called continuous outcomes), such as the classic reaction time measure, word durations or frequencies in phonetic experiments, or acceptability judgments from magnitude estimation. The second session deals with categorical dependent variables, such as yes/no answers, button presses, any kind of multiple choice, the choice of a variant in linguistic variation, and so on. We will focus on why the most common analysis for such data (ANOVAs over proportions of the categorical outcomes) is highly problematic, and what alternatives modern statistics and R provide us with (logistic regression, mixed logit models).
Session 1:
Standard Continuous Data Analysis
We will start by loading some data into R and familiarizing ourselves with it (data frames, variables, classes, summary and structural overviews of data, basic plotting). The outcomes we will investigate in this session are continuous variables. We will start with data from a typical psycholinguistic experiment with a balanced 2 x 2 design, categorical independent variables, and a continuous dependent variable. We will use analysis of variance (ANOVA) to analyze it (F1, F2). Once we understand the output of an ANOVA in R, we will test some of the assumptions that ANOVA makes (normality, linearity). Next, we'll look into linear regression (ANOVAs can be understood as a special case of linear regression), which allows us to include covariates (i.e. continuous independent variables) in addition to the categorical independent variables. We can also use multiple linear regression to analyze unbalanced data sets (which are common in sociolinguistics, corpus work, and whenever we lose a lot of data in an experiment). We will discuss what we need to pay attention to when we do that (most of all: collinearity). R provides a variety of tools that allow us to test how much we can trust the models we use to investigate our data.
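To give a flavor of the Session 1 workflow, here is a minimal sketch in base R. It is not the class material: the data are simulated, and the variable names (subject, item, A, B, freq, rt) and all effect sizes are made up for illustration. For simplicity the simulated design is fully crossed (every subject sees every item in every condition), which a real experiment usually is not.

```r
# Hypothetical simulated data: 2 x 2 design, reading times (rt) in ms.
set.seed(1)
d <- expand.grid(subject = factor(1:16), item = factor(1:8),
                 A = factor(c("a1", "a2")), B = factor(c("b1", "b2")))
subj_eff <- rnorm(16, sd = 40)          # by-subject variability (assumed)
d$freq <- rnorm(nrow(d))                # hypothetical continuous covariate
d$rt <- 400 + 30 * (d$A == "a2") + 20 * (d$B == "b2") -
  10 * d$freq + subj_eff[as.integer(d$subject)] + rnorm(nrow(d), sd = 50)

# Get familiar with the data frame.
str(d)
summary(d)

# F1 (by-subject) ANOVA: average over items, then treat subject as the error stratum.
by_subj <- aggregate(rt ~ subject + A + B, data = d, FUN = mean)
f1 <- aov(rt ~ A * B + Error(subject / (A * B)), data = by_subj)
print(summary(f1))

# F2 (by-item) ANOVA: average over subjects, then treat item as the error stratum.
by_item <- aggregate(rt ~ item + A + B, data = d, FUN = mean)
f2 <- aov(rt ~ A * B + Error(item / (A * B)), data = by_item)
print(summary(f2))

# Linear regression on the unaggregated data, adding the continuous covariate.
m <- lm(rt ~ A * B + freq, data = d)
print(summary(m))

# One simple check of the normality assumption on the residuals.
print(shapiro.test(residuals(m)))
```

Note that the plain lm() call ignores the repeated-measures structure of the data; it is shown here only to illustrate how a covariate enters the model formula.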
Session 2:
Categorical Data Analysis (avoiding a common fallacy)
While the first session deals with data where the dependent variable is continuous (e.g. reading times), the second session introduces the analysis of categorical outcomes (e.g. whether a speaker produces the word "that" in "I think (that) we should stop using ANOVAs for the analysis of categorical outcomes"). We will start by reviewing why the technique that is arguably still the most common for categorical data analysis (ANOVAs over average proportions) is problematic, and why it is prone to producing both spurious significant results and spurious null results. In the second half of the session, we will introduce logistic regression in R. We will close by discussing mixed logit models, a modern statistical method that allows us to combine the advantages of logistic regression with the need to model random subject and item effects. A more in-depth introduction to mixed models will be provided by Harald Baayen later during the main session (see LSA schedule), but this lecture focuses on mixed *logit* models (for categorical outcomes) rather than mixed linear models (for continuous outcomes).
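As a rough illustration of the two models named above, the sketch below fits an ordinary logistic regression with glm() and, if the lme4 package happens to be installed, a mixed logit model with glmer(). The data are simulated; the variable names (subject, item, cond, that) and the effect sizes are assumptions made for this example, not the data sets used in class.

```r
# Hypothetical simulated data: does the speaker produce "that" (1) or not (0)?
set.seed(2)
n_subj <- 20
n_item <- 12
d <- expand.grid(subject = factor(1:n_subj), item = factor(1:n_item))
# Assign each subject-item pair to one of two made-up conditions.
d$cond <- factor(ifelse((as.integer(d$subject) + as.integer(d$item)) %% 2 == 0,
                        "a", "b"))
subj_eff <- rnorm(n_subj, sd = 0.8)     # random subject intercepts (assumed)
item_eff <- rnorm(n_item, sd = 0.5)     # random item intercepts (assumed)
eta <- -0.5 + 1.0 * (d$cond == "b") +
  subj_eff[as.integer(d$subject)] + item_eff[as.integer(d$item)]
d$that <- rbinom(nrow(d), size = 1, prob = plogis(eta))

# Ordinary logistic regression: models the log-odds of the binary outcome
# directly, instead of running an ANOVA over per-subject proportions.
m_glm <- glm(that ~ cond, data = d, family = binomial)
print(summary(m_glm))

# Mixed logit model with random intercepts for subjects and items.
# This part runs only if lme4 is available on your system.
if (requireNamespace("lme4", quietly = TRUE)) {
  m_mixed <- lme4::glmer(that ~ cond + (1 | subject) + (1 | item),
                         data = d, family = binomial)
  print(summary(m_mixed))
}
```

The coefficients of both models are on the log-odds scale; plogis() converts a log-odds value back to a probability.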
If you have any questions, please contact the instructor, Florian Jaeger, or his Laboratory Technician, Andrew Watts.
