Showing posts with label functional equivalence. Show all posts

Monday, March 2, 2015

A gentle intro to cross-cultural equivalence - or how can we measure across cultures?

Psychology is the study of human behaviour and mental processes through scientific methods. Psychology often claims to be universal, that is, applicable to all of humanity. Using scientific methods, we psychologists rely on a systematic and objective process of proposing and testing hypotheses and making predictions about the state of human nature. Ever since psychology began as an academic discipline, one of its ultimate goals has been to quantify natural occurrences in order to better understand and predict them. Of course, this often requires extensive qualitative research, but ultimately the hope was and is that we can understand a behaviour or mental process so precisely that we can quantitatively measure it and also change it.



The application of such quantitative methods is now often taken for granted, even though the levels of quantification may vary. For example, we may want to select the most able person for a particular job, refer a child with learning problems to a specialist, or help a person with mental health problems to fully function in society again. Even though all these problems can be phrased in qualitative terms (a good person for the job, a child that has problems learning, a person who is not well), these are essentially quantitative problems because they always have some reference to implicit or explicit standards. A person might be BETTER qualified than another to take up a job, or a child may have GREATER problems understanding concepts or material than 75% of the children of her age. Therefore, in many day-to-day situations we make implicit and intuitive quantitative statements.

If we want to make quantitative statements about a scientific concept, we run into one of the central problems in psychology: WHAT exactly do we want to make a comparison about? Or in other words, how do we define a psychological construct so that we can measure it? A geographer, chemist or physicist is unlikely to face the problems that psychologists have… after all, we can easily measure distances (e.g., how far is Auckland from Wellington), we have ways of dating a piece of rock, and we can measure the energy of particles when we collide them at close to the speed of light. Psychologists, on the other hand, are dealing with intangible concepts that are difficult to specify. Most of you are familiar with concepts such as intelligence, attitudes, personality traits, depression or identity. However, if I were to ask you to pinpoint any of these concepts in the real world, you would be unable to do so. Our psychological terminology refers to unobserved mental constructs that we create in our community of fellow psychologists to indicate a particular set of problems or to describe a particular set of behaviours or mental representations. I would argue that underlying many of these psychological terms are assumptions about relative coherence, stability, generalizability and potentially even some general biological foundations that lead to the emergence of such a syndrome. Therefore, we don’t just invent these terms on a whim; we think that there is something meaningful to them that is important enough to look into and tell other people about.

Therefore, the first issue in any psychological study, even though it may not seem obvious anymore, is to clearly and unambiguously define and specify what we want to study. What is our construct or process of interest? It is at this point that culture will throw the first curve ball at any psychologist attempting to address this question. How can we make sure that our definition or mental construct of our psychological term or process is actually valid or has some meaning in another cultural context? How does our upbringing in a highly developed Western society influence how we think about psychological constructs? Can we assume that identity is a concept that is meaningful in a village in the lowland Amazon basin? Is our definition of depression applicable to refugees coming from Syria or Iraq? Is conscientiousness a useful term to screen out applicants for jobs in an international organization? So the first problem in any psychological study is to unambiguously define and describe the psychological process for all the populations that we are interested in. We could think of this as a mental bubble that we draw around some problem or process. Does this bubble ‘exist’ in all the different cultures that we want to include in our study? How can we find out whether this bubble is meaningful and has some value or relevance for all the local populations? We will discuss this as the question of functional equivalence.

If we are confident that there is some value to this mental bubble of ours (let’s say, depression, personality or identity) and that the terms are meaningful in two or more cultures, then we need to find good indicators for it. In psychological terms, this is called operationalization. How can we empirically say that one person has more of the latent quality that we just captured with our mental bubble than another person? What would be a good indicator to tell us that one person is better for a job than another, or that one person is a better learner than another, who in turn may need some help? Here again, culture will throw lots of beautiful little challenges at us. We need to find indicators that are meaningful and relevant in each cultural context, but obviously we still need to be able to compare the results across contexts. In other words, indicators that are relevant and meaningful within each context but cannot be compared across cultures are of little use to us. We want to aim for some level of comparability. For example, is staying late at your desk a good indicator of being conscientious? Or could it be seen as being disorganized and incompetent? What if people are unfamiliar with office jobs? Is the number of times that you circled the temple this morning before going to work a better indicator of your conscientiousness? Is the ability to track animals over long distances and varied terrain a good indicator of concentration? Or should we give people lots of d’s and b’s and p’s and q’s and then ask them to count how many p’s and q’s were together in each line? Should we measure intelligence by asking people to name as many types of medicinal plants for diarrhoea as they can? Or give them complex questions about history and philosophy? This problem of identifying good measurement indicators will be called structural equivalence. Obviously, how we define and how we operationalize a construct are very much dependent on each other.
For this reason, some researchers lump the two terms together as construct equivalence. For reasons that we will discuss later, I prefer to keep them separate.

So, we now have a mental bubble and we have a number of indicators that give us some clue about the latent bubble. However, we don’t actually know how good each of these indicators is in representing that latent bubble. We need to find a way to show us how well each indicator works in each of our cultures. In other words, is the same indicator better at capturing a key aspect of our construct in one culture compared to another? For example, is going to parties and having lots of friends a good indicator of extraversion? Is having many wives a good indicator of social status in all cultures? Is staying late at work to finish a task a good indicator of high conscientiousness in all cultures? This problem is called metric equivalence. It is the question about the relative strength of the indicator-latent variable relationship. In technical terms, we are concerned with the equivalence of factor loadings or item slopes in classical test theory, or of item discrimination in item response theory.
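To make the idea of unequal loadings a bit more concrete, here is a minimal simulation sketch in Python (assuming NumPy is installed). Everything in it is made up for illustration: the two "cultures", the loadings, the sample size and the helper names `simulate_group` and `item_slopes`. Because the data are simulated, the latent score is known and we can simply regress each item on it. With real data the latent variable is unobserved, and you would typically fit a multi-group CFA instead.

```python
import numpy as np

def simulate_group(n, loadings, intercepts, rng):
    """One-factor model: item = intercept + loading * latent + noise."""
    latent = rng.normal(0.0, 1.0, size=n)
    noise = rng.normal(0.0, 1.0, size=(n, len(loadings)))
    items = np.asarray(intercepts) + np.outer(latent, loadings) + noise
    return latent, items

def item_slopes(latent, items):
    """Least-squares slope of each item on the (here known) latent score."""
    centered = latent - latent.mean()
    return centered @ (items - items.mean(axis=0)) / (centered @ centered)

rng = np.random.default_rng(0)

# Hypothetical culture A: all three items relate clearly to the construct.
latent_a, items_a = simulate_group(5000, [0.8, 0.7, 0.6], [0.0, 0.0, 0.0], rng)
# Hypothetical culture B: the third item barely relates to the construct.
latent_b, items_b = simulate_group(5000, [0.8, 0.7, 0.1], [0.0, 0.0, 0.0], rng)

slopes_a = item_slopes(latent_a, items_a)
slopes_b = item_slopes(latent_b, items_b)

print("Slopes in A:", np.round(slopes_a, 2))
print("Slopes in B:", np.round(slopes_b, 2))
```

In this toy setup the first two items have roughly the same slope in both groups, while the third item is a much weaker indicator in culture B. That third item is the kind of item that would violate metric equivalence.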

Finally, we may be convinced that our indicators work equally well in all contexts. Each questionnaire or test item is really giving us a good and reliable insight into the construct. But there may still be problems. Some items, even though they have the same relationship with the latent construct in all cultures, may still be a bit more difficult or easier in one context compared to another. If I asked you to name the capital of Benin, most of you would probably struggle to find the correct answer. Benin is a country that is quite far from our thoughts, and most of us will never set foot in it or may not have heard about it in the media. However, if I asked you about the capital city of one of your neighbouring countries, you would probably be able to name it quite easily. Therefore, a question about the capital of Benin would be easier for somebody living in Togo or Nigeria than for somebody living in NZ or Denmark. This is the issue of full score or scalar equivalence. Technically, we would look at the invariance of item intercepts (in a multi-group CFA) or at differences in item difficulty (in IRT).
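Here is a small simulation sketch in Python (assuming NumPy) of why unequal intercepts matter. All numbers and names are invented for illustration. The two hypothetical groups have the same latent mean and the same loadings, but one item is "easier" in group B by one scale point.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
loadings = np.array([0.8, 0.7, 0.6])  # identical in both groups

def simulate_items(intercepts):
    """One-factor model with a latent mean of zero in every group."""
    latent = rng.normal(0.0, 1.0, size=n)
    noise = rng.normal(0.0, 1.0, size=(n, len(loadings)))
    return latent, intercepts + np.outer(latent, loadings) + noise

# Group A: all intercepts are zero.
latent_a, items_a = simulate_items(np.array([0.0, 0.0, 0.0]))
# Group B: the third item is easier (its intercept is shifted up by 1).
latent_b, items_b = simulate_items(np.array([0.0, 0.0, 1.0]))

# Difference in the true latent means (should be close to zero).
latent_gap = latent_b.mean() - latent_a.mean()
# Difference in the observed sum scores (picks up the intercept shift).
score_gap = items_b.sum(axis=1).mean() - items_a.sum(axis=1).mean()

print("Latent mean difference:", round(latent_gap, 2))
print("Sum score difference:", round(score_gap, 2))
```

The sum scores differ by roughly the size of the intercept shift even though the groups do not differ on the latent construct. A naive comparison of raw scores would read this as a real group difference, which is why scalar equivalence is needed before comparing means.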


In summary, measuring psychological attributes or processes across cultural contexts is quite difficult. I gave some relatively superficial and easy examples to make this a relatively non-technical and easy intro to the problem. We need to define our construct – draw our mental bubble around what we want to study. The first step in any cultural study then is to make sure that this construct or mental bubble is meaningful and functional in all cultures that we want to study. Once we think this is the case, we need to find good indicators that are observable and give us some insight into the position or state of an individual in relation to our mental bubble. We then need to discuss whether the indicators are equally good in all contexts or whether some are better in telling us something about a person or process in one cultural context compared to another. Finally, we need to find out whether all indicators are equally easy or difficult. Only once we have fulfilled this last criterion can we actually make any comparisons between individuals or groups across cultures. This is a tough task and unfortunately, most studies that you will see in the literature do fall well short of it. But this is the challenge that we really need to meet in order to develop a meaningful and universal psychological science. 

Monday, February 3, 2014

Philosophy of measurement, functional equivalence, DSM V... or how did I get here?

Here is a very raw and unfinished "trying to wrap my head around some rather confusing issues" post. I have been thinking about levels of equivalence or invariance in cross-cultural measurement. I have been a wee bit unhappy with a couple of conceptual problems in the framework, but particularly the most general or abstract level of 'functional equivalence' has intrigued me for a while. Traditionally, it is more of a philosophical or theoretical statement of the similarity of functions of a psychological construct in different cultural groups. In other words, a particular behaviour serves the same functions in two or more cultural contexts.

I have been following some of the discussions on IOnet and the posts by Paul Barrett as well as the more biologically oriented personality literature. Following a few of these leads, I recently started reading some more conceptual and philosophical papers on the philosophy of measurement in psychology. More specifically, I just finished reading Joel Michell's "Quantitative science and the definition of measurement in psychology" and Michael D. Maraun's "Measurement as a Normative Practice". These papers are superbly well written (as far as you can say that about these kinds of papers) and express quite a few of my growing concerns about psychological research in very clear terms. I started off wondering about functional equivalence, but now have much bigger issues to chew on.

Michell's main logical argument is as follows (from his very concise reply to a number of commentaries, p. 401):

Premise 1. All measurement is of quantitative attributes.
Premise 2. Quantitative attributes are distinguished from non-quantitative attributes by the possession of additive structure.
Premise 3. The issue of whether or not any attribute possesses additive structure is an empirical one.
Conclusion 1. The issue of whether or not any attribute is measurable is an empirical one.
Premise 4. With respect to any empirical hypothesis, the scientific task is to test it relative to the evidence.
Premise 5. Quantitative psychologists have hypothesized that some psychological attributes are measurable.
Final thesis. The scientific task for quantitative psychologists is to test the hypothesis that their hypothesized attributes are measurable (i.e. that they possess additive structure).

The major task for psychology is to actually show that anything that we study has a quantitative structure. Much of his review takes to task the legacy of Fechner and especially Stevens (for those of you who ever suffered through some advanced methods classes... these names should be painfully familiar). It was an eye opener to see the larger context and the re-interpretation of stuff that I just took for granted as a student and never really questioned later on in my professional life. Fechner's legacy, leading to a so-called quantitative imperative (e.g., Spearman, Cattell, Thorndike), was challenged in the early to middle part of the last century (the so-called Ferguson Committee), but Stevens became the most successful defender of this empiricist tradition. He argued in a representational theory of measurement that measurement is the numerical representation of empirical relations. There is a 'kind of isomorphism between (1) empirical relations among objects and events and (2) the properties of...' numerical systems (Stevens, 1951, p. 1). From this starting point he developed his theory of the four possible types of measurement scales (nominal, ordinal, interval and ratio)' (Michell, p. 370). This is the foundation of any scale development in psychology. In a second argument beautifully laid out by Michell, it then becomes clear that these numerical representations, due to their assumed isomorphic relations, both define the relations represented and represent them. Given this operationism, 'any rule for assigning numerals to objects or events could be taken as providing a numerical representation of at least the equivalence relation operationally defined by the rule itself.' (Michell, p. 371).

And this loop is where we are stuck. We take a few items or questions, administer them to a bunch of people, factor analyze them to get a simple structure and voila... we have measured depression, anxiety, dominance, identity... you name it. Or take implicit measures... you take a number of stimuli with no inherent coherent meaning and present them to individuals to measure their accuracy or reaction speed or whatever you want. Take the score and you have some measure of implicit bias, cognitive interference, etc. There is no longer any relation between the empirical reality and its numerical representation as scores. The question of whether the phenomenon of interest can be quantified has disappeared.

How does the DSM V fit in here? Well, it could be seen as just the latest installment of the same confusion. We don't know what exactly we are measuring (see for example this article on grief as a case in point).

The issue is that we need to test whether psychological constructs can actually be quantified. As simple or complex as that. As much as I agree, I can't stop scratching my head and wondering how the heck we are going to do that. How would you be able to examine whether any psychological construct (which is basically just an idea in our beautiful minds that we try to use and build some kind of professional convention around) is actually quantifiable or not? The responses by a number of eminent psychometricians to this challenge suggest that nobody was able to come up with an example showing that this has worked in a wider context within mainstream psychology.

Enter the second paper. Approaching the problem using Wittgenstein's philosophy of measurement as normative practice (comparing it to the logical structure of language), Maraun argues that measurement needs to be rule-based or normative. You need to start with a definition that then leads to a specific set of rules or norms for how to measure the particular phenomenon just defined. The definition and the set of rules are the most basic form of expression. There is nothing simpler or more basic than this. Once these norms are established, any other person should be able to arrive at a similar result which, even if based on a different metric, should still be convertible (e.g., from meters to feet). In psychology, in contrast, we have no rules. We have a test or an experiment that is being conducted, and the results are examined against another set of empirical observations to claim that the results are valid. According to the practice of measurement in physics, empirically based arguments are not relevant for claiming that something has been measured. Measuring a number of items that factor together and then correlating them with some other similarly derived instrument does not mean that anything meaningful has been measured. Observing some kind of empirical pattern in an experiment does not constitute measurement if it is then validated against or compared to a different set of empirical observations. The issue is that the concept is not defined precisely enough to lead to a set of rules that govern its measurement.

There are a number of other points in that paper around validity, nomological networks, covariance structure and the like. Again, I keep scratching my head. These guys have got a point... but how do we get out of it? Maraun is very pessimistic. He argues:
Simply put, measurement requires a formalization which does not seem well suited to what Wittgenstein calls the 'messy' grammars of psychological concepts, grammars that evolved in an organic fashion through the 'grafting of language onto natural ("animal") behaviour' (Baker & Hacker, 1982). One aspect of this mismatch arises from the flexibility in the grounds of instantiation of many psychological concepts, the property that Baker and Hacker (1982) call an open-circumstance relativity (see also Gergen, Hepburn, & Comer Fisher, 1986, for a similar point). Take, for example, the concept dominance. Given the appropriate background conditions, practically any 'raw' behaviour could instantiate the concept. Hence, Joe's standing with his back to Sue could, in certain instances, be correctly conceptualized as a dominant action. On the other hand, Bob's ordering of someone to get off the phone is not a dominant action if closer scrutiny reveals the motivation for his behaviour to be a medical emergency which necessitated an immediate call for an ambulance. The possibility for the broadening of background conditions to defeat the application of a psychological concept is known as the defeasibility of criteria (Baker & Hacker, 1982). Together, open-circumstance relativity and the defeasibility of criteria suggest that psychological concepts are simply not organized around finite sets of behaviours which jointly provide necessary and sufficient conditions for their instantiation (Baker & Hacker, 1982). Yet, this is precisely the kind of formalization required if a concept is to play a role in measurement. (p. 457-458).
Maybe what we are studying is just the social construction of meanings of psychological concepts as expressed in the heads of individuals? Is this a feasible reconciliation? From a researcher perspective it might be a worthwhile endeavor (think of discourse analysts embracing factor analysis... the thought is actually quite amusing). However, this approach leaves our search for a) latent variables and b) measurement invariance completely meaningless.

The reading continues. Some random thoughts at 1am while I am writing these notes:
a) The search for quantitative latent constructs in psychology probably should (?) or could (?) start from basic biological principles. In essence, we assume that there is something 'latent' out there if we use EFA or CFA or any of the typical covariance structure tests. If there are biological mechanisms that lead to certain psychological phenomena, we can study the biological principles and their interaction with the social environment that lead to psychological realities. Then we could get around the quantification problem. Problem... what biological principles and at what level of specificity?
b) The use of covariance analyses provides simple structures of language concerning folk concepts. This may be useful and meaningful for understanding how people in a specific context interpret items or questions. It is probably more of a sociological analysis of meaning conventions than a psychological analysis. This could be useful or interesting for research purposes, but it is not quite how we commonly understand or interpret the results when we are using these kinds of techniques.

Or am I missing something? How can this measurement paradox be tackled?