Today is a wee bit heavier on the stats side again. If you are interested in Differential Item Functioning and how to test for it with an easy-to-use tool, this is for you...
Aim: Identify differential item functioning in numerical scores across
groups in order to decide whether the items are unbiased and can be used for
cross-cultural comparisons.
General approach: Van de Vijver and Leung (1997) describe a
conditional technique that can be used with Likert-type scales. It uses
traditional ANOVA techniques. The independent variables are (1) the groups to
be compared and (2) score levels on the total score (across all items) as an
indicator of the true or ‘latent’ trait (please note that, technically, an
observed total score is not a latent variable). The dependent variable is the
score for each individual item. Because we condition on the total score
(divided into score levels) as an IV, the analysis is called ‘conditional’.
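In other words, for each item the ANOVA takes this form:
item score = overall mean + score level + country/sample + (score level × country/sample) + error.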
Advantages of Conditional ANOVA: It can be easily run in standard
programmes such as SPSS. It is simple. It highlights some key issues and
principles of differential item functioning. One particular advantage is that,
in working through these procedures, you quickly find out whether the score
distributions are similar or different across groups (e.g., whether an item
bias analysis is warranted and even possible).
Disadvantages of Conditional ANOVA: There are many arbitrary choices involved
in splitting variables into score groups (see below), and these choices can
make a big difference. It is not very elegant. Better approaches that
circumvent some of these problems and that can be implemented in SPSS and
other standard programmes include Logistic Regression. Check out Bruno Zumbo’s
website and manual. I will also try to put up some notes on this soon.
What do we look for? There are three effects of interest.
First, a
significant main effect of score level would indicate that individuals with a
lower overall score also show a lower score on the respective item. This is
expected and therefore generally not of theoretical interest (think of it as
equivalent to a significant factor loading of the item on the ‘latent’ factor).
Second, a
significant main effect of country or sample would indicate that scores on this
item for at least one group are significantly higher or lower, independent of
the true variable score. This indicates ‘uniform DIF’. (Note: this type of item
bias can NOT be detected in Exploratory Factor Analysis with Procrustean
Rotation).
Third, a
significant interaction between country and score level on the item mean
indicates that the item discriminates differently across groups. This indicates
‘non-uniform DIF’: the item is differently related to the true ‘latent’
variable across groups. For example, think of an extroversion item. In one group
(let’s say New Yorkers), ‘being the centre of attention at a cocktail party’ is
a good indicator of extroversion, whereas for a group of Muslim youth from
Mogadishu in Somalia it is not a relevant indicator of extroversion (since they
are not allowed to drink alcohol and have probably never been to a cocktail
party, for obvious reasons).
Note: Such
biases MAY be detected through Procrustean Rotation, if one examines
differentially loading items.
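To make the two patterns concrete, here is a purely made-up example of item means across three score levels (low / medium / high) in two samples:
Uniform DIF (parallel lines, constant offset): Sample A: 2.0 / 3.0 / 4.0; Sample B: 2.5 / 3.5 / 4.5.
Non-uniform DIF (different slopes): Sample A: 2.0 / 3.0 / 4.0; Sample B: 2.8 / 3.0 / 3.2.
In the first pattern, Sample B scores higher at every score level; in the second, the item barely discriminates between score levels in Sample B.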
Important: What is our criterion for deciding
whether an item shows DIF or not? As we will see below, the answer involves
both statistical significance and effect sizes.
Statistical Procedure:
In most cases, the procedure requires at least four steps.
Step 1: Calculate the sum score of your variable. For example, if you have an
extraversion scale with ten items measured on a scale from 1 to 5, you should create
the total sum. This can obviously vary between 10 and 50 for any individual.
Use the syntax used in class.
For example:
COMPUTE extroversion = SUM(extraversion1 TO extraversion10).
EXECUTE.
(The TO shorthand assumes the ten items sit next to each other in your data file; otherwise, list them separated by commas.)
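One caveat on missing data: in SPSS, SUM() returns the sum of whatever valid values it finds, so a respondent with missing items gets a sum based on fewer items. If you want to require all ten items to be valid, you can write SUM.10(extraversion1 TO extraversion10) instead.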
Step 2: You need to create score levels. You would like to group equal numbers
of individuals into groups according to their overall extroversion score.
Van de Vijver and
Leung (1997) recommend having at least
50 individuals per score group and sample. For example, if you have 100
individuals in each group, you can form at most 2 score levels. If you have 5,000
individuals in each of your cultural samples, you could theoretically form up
to 100 score levels (well, actually not, because there are only 41 distinct
possible scores in this example, ranging from 10 to 50). It is therefore up to
you how many score levels you create. Having more levels will obviously allow
more fine-grained analyses (you can make finer distinctions between
extroversion levels in both groups) and will probably be more powerful (you
are more likely to detect DIF). However, because you then have fewer people
per score level, the analysis might also be less stable. Hence, there is a
clear trade-off, but don’t despair. If an item is strongly biased, it should
show up in your analysis independent of whether you have fewer or more score
levels. If the bias is less severe, results might change across different
options.
One issue is that if
you have fewer than 50 people in each score group and cultural sample, the
results might become quite unstable and you may find interactions that are hard
to interpret. In any case, it is important to consider both statistical
significance and effect sizes when interpreting item bias.
A simple way of
getting the desired number of equal-sized groups is to use the rank cases
option. You find this under ‘Transform’ -> ‘Rank cases’. Transfer your sum
score into the variables box. Click on ‘Rank types’. First, untick ‘Rank’
(this would create a full ranking of your sample, which you do not need).
Second, tick ‘Ntiles’ and specify the number of groups you want to create. For
example, if you have 200 individuals, you could create 4 groups. If you have
larger samples, the discussion from above applies (you have to decide on the
number of levels, facing the aforementioned trade-off between power and
stability).
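If you prefer syntax over menus, the equivalent command looks something like this (a sketch, assuming your sum score is called extroversion, you want 4 score levels, and you are happy with my variable name scorelevel):
RANK VARIABLES=extroversion
  /NTILES(4) INTO scorelevel
  /PRINT=NO.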
As discussed above, it
is strongly advisable to interpret effect sizes (how big is the effect?) in
addition to statistical significance levels. This is particularly important if
you have large sample sizes, in which even minute differences can become
significant. SPSS gives you partial eta-squared values routinely (if you tick
the effect size estimates under ‘Options’). Cohen (1988) differentiated between
small (0.01), medium (0.06), and large (0.14) effect sizes for eta-squared.
Please note that SPSS gives you partial eta-squared values (the variance due
to an effect, partialling out the other effects in the model), whereas
traditional eta-squared expresses an effect relative to the total variance,
which includes the variance due to the other effects. Partial eta-squared
values are therefore often larger than traditional eta-squared values
(overestimating the effect), but at the same time there is much to recommend
using partial instead of traditional eta-squared values (see Pierce, Block &
Aguinis, 2004, in Educational and Psychological Measurement).
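For reference, the standard definitions behind this distinction are:
eta-squared = SS(effect) / SS(total)
partial eta-squared = SS(effect) / (SS(effect) + SS(error))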
Step 3: Run your ANOVA for each item
separately. The IVs are country/sample and score level (the variable created
using the ranking procedure). Transfer your IVs into the ‘Fixed Factor(s)’ box.
As described above, the important things to look out for are a significant main
effect of country/sample (indicating uniform DIF) and/or a significant
country/sample x score level interaction (indicating non-uniform DIF).
You can use plots produced by SPSS to identify the nature and direction of the
bias (under ‘Plots’, transfer your score level to the ‘horizontal axis’ and the
country/sample to ‘separate lines’, click ‘add’ and then ‘continue’). Van de
Vijver and Leung (Box 4.3) describe a different way of plotting the results.
However, the results are the same; it is only a different way of visualising
the main effect and/or interaction.
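In syntax, the analysis for a single item might look like this (a sketch, assuming the item is called extraversion1 and your factors are called country and scorelevel):
UNIANOVA extraversion1 BY country scorelevel
  /PRINT=ETASQ
  /PLOT=PROFILE(scorelevel*country)
  /DESIGN=country scorelevel country*scorelevel.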
This little figure, for example, shows evidence of both uniform and non-uniform bias: the item is overall easier for the East German sample, and it does not discriminate equally well across all score levels. Among higher score levels, it does not differentiate well for the UK sample.
Step 4: Ideally, you would like to have no DIF. However, it is likely that you
will encounter some biased items. I would run all analyses first and identify
the most biased items. If all items are biased, you are in trouble (well,
unless you are a cultural psychologist, in which case you rejoice and party).
In this case, there is probably little you can do at this point except try
to understand the mechanisms underlying the processes (how do people understand
these questions, what does this say about the culture of both groups, etc.).
If you have only a few
biased items, remove them (you can either remove the item with the largest
partial eta-squared or all of the DIF items in one sweep – I would
recommend the former procedure, though) and recompute the sum score (Step 1). Go
through Steps 2 and 3 again to see whether your scale is working better now. You
may need to repeat this analysis several times, since different items may show
up as biased at each iteration of your analysis.
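For example, if (hypothetically) extraversion5 turned out to be the most biased item, the recomputed sum score would simply skip it:
COMPUTE extroversion2 = SUM(extraversion1 TO extraversion4, extraversion6 TO extraversion10).
Then repeat the ranking and the item-level ANOVAs with the new sum score.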
Problems:
My factor analysis showed that one factor is
not working in at least one sample: In this case, there is no point in running the conditional ANOVA with
that sample included. You are interested in identifying those items that are
problematic in measuring the latent score. You therefore assume that the factor
is working in all groups included in the analysis.
My overall latent scores do not overlap: This will lead to situations where the latent
scores are so dramatically different that you cannot find score levels with at
least 50 participants in each sample. In this case, your attempt to identify Differential
ITEM functioning is problematic, since something else is happening. One option
is to use broader score levels (make the groups larger – obviously, this involves a
loss of sensitivity and power to detect effects). Sometimes even this might
not be possible.
At a theoretical level,
you could have a situation of generalized uniform item bias in at least one
sample (for example, because one group gives acquiescent answers that are
consistently higher or lower). It might also indicate method bias (for example,
translation problems that make all items significantly easier in one group
compared to the others) or construct bias (for example, you might have tapped
into some religious or cultural practices that are more common in one group
than in another – in this case, your items might load on the intended factor,
but conceptually the factor is measuring something different across cultural
groups). Of course, it can also indicate a true difference. Any number of
explanations (construct or method bias, or substantive effects that lead to
different cultural scores) could be possible.
What happens if most items are biased and only
a few unbiased items remain?
In this situation, you run into a paradox: because the total score you
condition on is itself based mainly on biased items, you can no longer
determine which items are biased and which are not. This type of analysis only
functions properly if you have a small number of biased items, probably up to
half the number of items in your scale. Once you move beyond this, it means
that there is a problem with your construct. If you mainly find uniform bias,
but no interactions, you can still compare correlations or patterns of scores
(since your instrument most likely satisfies metric equivalence). If you have
interactions, you do not satisfy metric equivalence, and you may need to
investigate the structure and function of your theoretical and/or
operationalized construct (functional and structural equivalence).
Any questions? Email me ;)