Thursday, October 11, 2012

How to run a Conditional ANOVA

Today is a wee bit heavier on the stats side again. If you are interested in Differential Item Functioning (DIF) and how to test for it with an easy-to-use tool, this is for you...

Aim: Identify differential item functioning in numerical scores across groups in order to decide whether the items are unbiased and can be used for cross-cultural comparisons.

General approach: Van de Vijver and Leung (1997) describe a conditional technique that can be used with Likert-type scales. It uses traditional ANOVA techniques. The independent variables are (1) the groups to be compared and (2) score levels on the total score (across all items), which serves as an indicator of the true or ‘latent’ trait (please note that technically it is an observed score, not a latent variable). The dependent variable is the score on each individual item. Since we are using the total score (divided into score levels) as an IV, the analysis is called ‘conditional’.

Advantages of Conditional ANOVA: It can be easily run in standard programmes such as SPSS. It is simple. It highlights some key issues and principles of differential item functioning. One particular advantage is that working through these procedures, you can easily find out whether score distributions are similar or different (e.g., is an item bias analysis warranted and even possible?).

Disadvantages of Conditional ANOVA: There are many arbitrary choices in splitting variables and score groups (see below) that can make big differences. It is not very elegant. Better approaches that circumvent some of these problems and that can be implemented in SPSS and other standard programmes include Logistic Regression. Check out Bruno Zumbo’s website and manual. I will also try and put up some notes on this soon.

What do we look for? There are three effects that we look for.
First, a significant main effect of score level would indicate that individuals with a low overall score also show a lower score on the respective item. This would be expected and therefore is generally not of theoretical interest (think of it as equivalent to a significant factor loading of the item on the ‘latent’ factor).
Second, a significant main effect of country or sample would indicate that scores on this item for at least one group are significantly higher or lower, independent of the true variable score. This indicates ‘uniform DIF’. (Note: this type of item bias can NOT be detected in Exploratory Factor Analysis with Procrustean Rotation).
Third, a significant interaction between country and score level on the item mean indicates that the item discriminates differently across groups. This indicates ‘non-uniform DIF’. The item is differently related to the true ‘latent’ variable across groups. For example, think of an item of extroversion. In one group (let’s say New Yorkers), ‘being the centre of attention at a cocktail party’ is a good indicator of extroversion, whereas for a group of Muslim youth from Mogadishu in Somalia it is not a relevant item of extroversion (since they are not allowed to drink alcohol and probably have never been at a cocktail party, for obvious reasons).
Note: Such biases MAY be detected through Procrustean Rotation, if examining differentially loading items.

Important: What is our criterion for deciding whether an item shows DIF or not?

Statistical Procedure:

The procedure requires in most cases at least four steps.
Step 1: Calculate the sum score of your variable. For example, if you have an extraversion scale with ten items measured on a scale from 1 to 5, you should create the total sum. This can obviously vary between 10 and 50 for any individual. Use the syntax used in class.

For example:

Compute extroversion=sum(extraversion1, extraversion2,…,extraversion10).

Step 2: You need to create score levels. You would like to group equal numbers of individuals into groups according to their overall extroversion score.
Van de Vijver and Leung (1997) recommend having at least 50 individuals per score group and sample. For example, if you have 100 individuals in each group, you can form at most 2 score levels. If you have 5,000 individuals in each of your cultural samples, you could theoretically form up to 100 score levels (well, actually not, because you would have only 40 meaningful groups in this example, since the difference between the maximum and minimum possible score is 40). Within these limits, it is up to you how many score levels you create. Having more levels will obviously allow more fine-grained analyses (you can make finer distinctions between extroversion levels in both groups) and will probably be more powerful (you are more likely to detect DIF). However, because you have fewer people in each score level, the results might also be less stable. Hence, there is a clear trade-off, but don’t despair. If an item is strongly biased, it should show up in your analysis regardless of whether you use fewer or more score levels. If the bias is less severe, results might change across different options.
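
If you want to check the arithmetic for your own sample sizes, here is a minimal sketch in Python (outside SPSS). The function name and the optional `cap` argument are my own labels, not part of any package; the 50-per-cell minimum is the recommendation quoted above. Feed it the size of your smallest sample.

```python
def max_score_levels(n_per_sample, min_per_cell=50, cap=None):
    """Largest number of equal-sized score levels that still leaves at least
    min_per_cell respondents in every level of a sample of n_per_sample."""
    levels = n_per_sample // min_per_cell
    if cap is not None:
        # cap = number of meaningful score levels your scale allows
        levels = min(levels, cap)
    return levels


print(max_score_levels(100))           # 2
print(max_score_levels(5000))          # 100
print(max_score_levels(5000, cap=40))  # 40
```

This only gives you the upper limit; the trade-off between power and stability still decides how many levels you actually use.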

One issue is that if you have fewer than 50 people in each score group and cultural sample, the results might become quite unstable and you may find interactions that are hard to interpret. In any case, it is important to consider both statistical significance and effect sizes when interpreting item bias.

A simple way of getting the desired number of equal groups is to use the rank cases option. You find this under ‘Transform’ -> ‘Rank cases’. Transfer your sum score into the variables box. Click on ‘Rank types’. First, untick ‘Rank’ (it will rank your sample, but this is something that you do not need). Second, click on ‘Ntiles’ and specify the number of groups you want to create. For example, if you have 200 individuals, you could create 4 groups. If you have larger samples, the discussion from above applies (you have to decide on the number of levels, facing the aforementioned trade-off between power and stability).
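
To see what such a grouping does to your sum scores, here is a small Python sketch of one common ntile convention. Please treat it as an illustration under my own assumptions (tied scores share their mean rank, and the level is taken from rank * k / (n + 1)); it is not guaranteed to reproduce the SPSS output case by case. The function name is hypothetical.

```python
def score_levels(scores, k):
    """Assign each sum score to one of k levels (1 = lowest) of roughly
    equal size. Tied scores share a mean rank, so people with the same
    sum score always end up in the same level."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        # extend the run while the next sorted score is tied with this one
        while end + 1 < n and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return [int(r * k / (n + 1)) + 1 for r in ranks]


print(score_levels([12, 12, 30, 30], 2))  # [1, 1, 2, 2]
```

Because ties stay together, the groups will not always be exactly equal in size, which is worth checking before you move on to the ANOVA.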

As discussed above, it is strongly advisable to interpret effect sizes (how big is the effect) in addition to statistical significance levels. This is particularly important if you have large sample sizes, in which even minute differences can become significant. SPSS gives you partial eta-squared values routinely (if you click on ‘effect sizes’ under the ‘options’). Cohen (1988) differentiated between small (0.01), medium (0.06), and large (0.14) effect sizes for eta-squared. Please note that SPSS gives you partial eta-squared values (the variance due to the effect once the variance explained by the other effects has been set aside), whereas classical eta-squared expresses the effect as a share of the total variance, including the variance explained by the other effects. Partial eta-squared values are often larger than the traditional eta-squared values (overestimating the effect), but at the same time there is much to be recommended for using partial instead of traditional eta-squared values (see Pierce, Block & Aguinis, 2004, in Educational and Psychological Measurement).
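
For reference, the two effect sizes can be written out as follows. These are the standard definitions, with SS standing for the sums of squares from the ANOVA table (the subscript labels are mine):

```latex
\eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{total}}},
\qquad
\eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}
```

Since the total sum of squares is at least as large as the effect and error sums of squares together, the partial value can never be smaller than the classical one for the same effect.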

Step 3: Run your ANOVA for each item separately. The IVs are country/sample and score level (the variable created using the ranking procedure). Transfer your IVs into the ‘Fixed Factor’ boxes. As described above, the important stuff to look out for is a significant main effect of country/sample (indicating uniform DIF) and/or a significant interaction between country/sample and score level (indicating non-uniform DIF). You can use plots produced by SPSS to identify the nature and direction of the bias (under plots, transfer your score level to the ‘horizontal axis’ and the country/sample to ‘separate lines’, click ‘add’ and then ‘continue’). Van de Vijver and Leung (box 4.3) describe a different way of plotting the results. However, the results are the same; it is only a different way of visualising the main effect and/or interaction.
This little figure for example shows evidence of both uniform and nonuniform bias. The item is overall easier for the East German sample and it does not discriminate equally well across all score levels. Among higher score levels, it does not differentiate well for the UK sample. 
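
If you are curious what the ANOVA is doing under the hood, here is a rough Python sketch for one item. It rests on a strong assumption that I want to flag loudly: the design is balanced (the same number of respondents in every sample x score level cell). With unequal cells the sums of squares reported by SPSS will differ from this simple version. It returns partial eta-squared for the three effects and leaves out the p-values, which need the F distribution. The function name and data layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def conditional_anova(rows):
    """Partial eta-squared for group, score level and their interaction.

    rows: list of (group, level, item_score) tuples.
    ASSUMPTION: balanced design, i.e. equal n in every group x level cell.
    """
    cells = defaultdict(list)
    for group, level, score in rows:
        cells[(group, level)].append(score)
    groups = sorted({g for g, _ in cells})
    levels = sorted({l for _, l in cells})
    n = len(next(iter(cells.values())))  # respondents per cell
    if len(cells) != len(groups) * len(levels) or any(
        len(v) != n for v in cells.values()
    ):
        raise ValueError("design must be balanced with no empty cells")

    grand = mean(s for v in cells.values() for s in v)
    g_mean = {g: mean(s for l in levels for s in cells[(g, l)]) for g in groups}
    l_mean = {l: mean(s for g in groups for s in cells[(g, l)]) for l in levels}
    c_mean = {key: mean(v) for key, v in cells.items()}

    # main effect of group (uniform DIF) and of score level
    ss_group = n * len(levels) * sum((g_mean[g] - grand) ** 2 for g in groups)
    ss_level = n * len(groups) * sum((l_mean[l] - grand) ** 2 for l in levels)
    # group x level interaction (non-uniform DIF)
    ss_inter = n * sum(
        (c_mean[(g, l)] - g_mean[g] - l_mean[l] + grand) ** 2
        for g in groups
        for l in levels
    )
    # within-cell (error) variation
    ss_error = sum((s - c_mean[key]) ** 2 for key, v in cells.items() for s in v)

    def partial_eta_sq(ss):
        return ss / (ss + ss_error) if ss + ss_error > 0 else 0.0

    return {
        "group": partial_eta_sq(ss_group),
        "level": partial_eta_sq(ss_level),
        "group x level": partial_eta_sq(ss_inter),
    }


# Made-up toy data: group B scores one point higher at every score level.
toy = [("A", 1, 1), ("A", 1, 2), ("A", 2, 3), ("A", 2, 4),
       ("B", 1, 2), ("B", 1, 3), ("B", 2, 4), ("B", 2, 5)]
print(conditional_anova(toy))
```

In the toy data the group effect is sizeable (0.5) while the interaction is zero, which is the pattern you would read as uniform but not non-uniform DIF.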

Step 4: Ideally, you would not like to have DIF. However, it is likely that you will encounter some biased items. I would run all analyses first and identify the most biased items. If all items are biased, you are in trouble (well, unless you are a cultural psychologist, in which case you rejoice and party). In this case, there is probably little you can do at this point except trying to understand the mechanisms underlying the processes (how do people understand these questions, what does this say about the culture of both groups, etc.).
If you have only a few biased items, remove them (you can either remove only the item with the strongest partial eta-squared or all of the DIF items in one fell swoop; I would recommend the former procedure though) and recompute the sum score (step 1). Go through steps 2 and 3 again to see whether your scale is working better now. You may need to repeat this analysis several times, since different items may show up as biased at each iteration of your analysis.
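
The one-item-at-a-time procedure can be sketched as a simple loop. This is only an illustration in Python: `effect_size` is a hypothetical function that you would have to supply yourself (it should redo steps 1 to 3 for one item against the sum score of the remaining items and return a DIF effect size, for example a partial eta-squared), and `threshold` is a cut-off you choose yourself.

```python
def purify_scale(items, effect_size, threshold):
    """Remove the most biased item, recompute, and repeat.

    items: list of item names.
    effect_size: hypothetical callable (item, remaining_items) -> effect size.
    threshold: analyst-chosen cut-off below which an item counts as unbiased.
    Returns (kept_items, removed_items).
    """
    kept, removed = list(items), []
    while len(kept) > 1:
        # redo the analysis for every remaining item
        sizes = {item: effect_size(item, kept) for item in kept}
        worst = max(sizes, key=sizes.get)
        if sizes[worst] < threshold:
            break  # nothing left above the cut-off
        kept.remove(worst)
        removed.append(worst)
    return kept, removed


# Made-up effect sizes that do not change between rounds, just to show the flow.
fixed = {"item1": 0.01, "item2": 0.20, "item3": 0.08, "item4": 0.02}
print(purify_scale(list(fixed), lambda item, remaining: fixed[item], 0.06))
# (['item1', 'item4'], ['item2', 'item3'])
```

In real data the effect sizes will change after each removal, which is exactly why the analysis has to be rerun at every round.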


My factor analysis showed that one factor is not working in at least one sample: In this case, there is no point in running the conditional ANOVA with that sample included. You are interested in identifying those items that are problematic in measuring the latent score. You therefore assume that the factor is working in all groups included in the analysis.

My overall latent scores do not overlap: This will lead to situations where the latent scores are so dramatically different that you cannot find score levels with at least 50 participants in each sample. In this case, your attempt to identify Differential ITEM functioning is problematic, since something else is happening. One option is to use broader score levels (fewer levels with more people in each; obviously this involves a loss of sensitivity and power to detect effects). Sometimes, even this might not be possible.
At a theoretical level, it could be that you have a situation where you have generalized uniform item bias in at least one sample (for example because one group gives acquiescent answers that are consistently higher or lower). It also might indicate method bias (for example, translation problems that make all items significantly easier in one group compared to the others) or construct bias (for example, you might have tapped into some religious or cultural practices that are more common in one group than in another; in this case your items might load on the intended factor, but conceptually the factor is measuring something different across cultural groups). Of course, it can also indicate true differences. Any number of explanations (construct or method bias or substantive effects that lead to different cultural scores) could be possible.
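
A quick way to see whether the score distributions overlap enough is to count the people in every sample x score level cell before running any ANOVA. Here is a minimal Python sketch; the function name and data layout are my own, and the 50-person minimum is the recommendation discussed under step 2.

```python
from collections import Counter


def sparse_cells(rows, min_per_cell=50):
    """rows: iterable of (sample, score_level) pairs, one per respondent.
    Returns the sample x level cells with fewer than min_per_cell people,
    including cells that are completely empty."""
    rows = list(rows)
    counts = Counter(rows)
    samples = sorted({s for s, _ in rows})
    levels = sorted({l for _, l in rows})
    return {
        (s, l): counts[(s, l)]
        for s in samples
        for l in levels
        if counts[(s, l)] < min_per_cell
    }


# Made-up example: the two samples barely overlap on the score levels.
rows = [("UK", 1)] * 60 + [("UK", 2)] * 10 + [("DE", 2)] * 70
print(sparse_cells(rows))  # {('DE', 1): 0, ('UK', 2): 10}
```

An empty result means every cell is large enough; a long list of sparse cells is the situation described above.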

What happens if most items are biased and only a few unbiased items remain? In this situation you run into the paradox that you cannot actually determine whether your biased items are actually unbiased or your unbiased items are biased. This type of analysis only functions properly if you have a small number of biased items, up to probably half the number of items in your latent variable. Once you move beyond this, it means that there is a problem with your construct. If you mainly find uniform bias, but no interactions, you can still compare correlations or patterns of scores (since your instrument most likely satisfies metric equivalence). If you have interactions, you do not satisfy metric equivalence and you may need to investigate the structure and function of your theoretical and/or operationalized construct (functional and structural equivalence).

Any questions? Email me ;)