Sunday, July 20, 2014

A crisis in cultural psychology? Lack of replications, bias & publication pressures


Social psychology is facing an existential crisis. Ype Poortinga and I took the opportunity to examine how cross-cultural psychology fares in comparison. What is the background? A collective drive to present novel, sexy and sensational findings has propelled social psychology into a minefield of public mistrust and claims of being a pseudoscience. The list of sins in the eyes of the public is long: central methods at the core of the discipline, such as priming, have been challenged; the drive to find significant differences has led to a neglect of the meaningfulness of psychological findings; publication pressures have opened the door to unscientific data massaging; and, most notoriously, glamorous stars of the discipline have been found to have fabricated their data. There has rarely been a month since the now infamous Stapel affair in which the field was not in the spotlight of public and internal scrutiny. This series of events has led to some agonizing soul-searching among psychologists.

Addressing methodological vulnerabilities in research on behavior and culture


Ype Poortinga and I used the opportunity of the 22nd International Conference of Cross-Cultural Psychology (organized by IACCP) to critically examine how cross-cultural psychology, as a sister discipline of social psychology, is faring. We assembled an A-list of leading cross-cultural psychologists and former editors of the flagship journal for research on culture and psychology (Journal of Cross-Cultural Psychology). Our instructions were simple: we asked them to critically evaluate the methods of our field and comment on how our field may move forward. Ype and I also provided a summary of our own concerns about the state of the field. The session was exceptionally well attended and the panel managed to create a lively debate and exchange of views with each other and the audience. This was particularly remarkable given the technical challenges, the double booking of the room and the incredible heat, lack of seats and lack of oxygen in the late afternoon (it felt like a two-hour sauna session). I have received quite a few requests for our slides, so I am summarizing some key points from our introductory presentation, the talks by Peter Smith, Johnny Fontaine and David Matsumoto, as well as the discussion that followed the presentations. I will also outline some ideas for the next steps that we are considering taking.



Poortinga and Fischer: Why questionable null-hypotheses and convergent search for evidence erode research on behavior and culture


Null hypothesis significance testing is the modus operandi for conducting research in psychology overall. At the same time, it has come under increasing pressure and scrutiny. A few quotes from recent papers illustrate the various problems with the state of psychology:


Ioannidis (2005): “[A] research finding is less likely to be true when … when effect sizes are smaller; when there is … lesser preselection of tested relationships; … greater flexibility in designs, definitions, outcomes, and analytical modes; … and when more teams are involved in a scientific field in chase of statistical significance”
Vul et al. (2009) report on “voodoo correlations” in fMRI: “We show how … nonindependent analysis [of voxels] inflates correlations while yielding reassuring-looking scattergrams”
Simmons et al. (2011) on “false-positive psychology”: “… flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not”.



The application of the experimental research paradigm with an emphasis on null-hypothesis significance testing is particularly problematic in cross-cultural psychology, because some of the basic assumptions of experimental design are violated by default:

a) There is no random assignment of respondents to conditions and

b) The experimenter has little control over conditions and ambient events.




This figure shows these problems in a nice way and clearly highlights that cross-cultural studies do not even meet the conditions for good quasi-experimental designs and have significant shortcomings.

Further challenging current experimental practices, Simmons, Nelson and Simonsohn (2011) eloquently exposed the problem of researcher degrees of freedom and the impact of seemingly innocent research practices on significance levels. They demonstrated how a logically impossible hypothesis (listening to songs about age will decrease the age of listeners) can be empirically supported. In effect, they discovered the psychological equivalent of the proverbial fountain of youth by using questionable research practices. The following figure shows the outcomes of their simulation study and the impact of four researcher degrees of freedom on significance levels. We highlighted the relevance of these conditions for cross-cultural research.



First, assuming by definition that culture is a shared meaning system, any two cultural variables will be correlated to a significant degree. The very nature of the phenomena under investigation makes finding significant differences more likely. This non-independence and its negative impact on significance testing are well recognized in methods circles but not well understood in general cross-cultural research circles.

Second, a researcher may add 10 more observations or cases to the study if a first examination did not reveal any significant differences. This is probably a more common practice in cultural priming studies, but may be less of an issue in comparative survey studies.

The third questionable practice is controlling for third variables, especially if their impact is not theoretically grounded. In their case study, Simmons et al. used gender as an example, but in cross-cultural psychology it is often GDP at the country level or some demographic variables at the individual level that are entered as covariates. This is a double bind for cross-cultural psychology: on the one hand, we need to control for other variables that may explain any differences between samples; on the other hand, these simulations demonstrate that such practices have a sizable impact on significance levels.

The last questionable practice is to drop (or not) one of the conditions. The equivalent in cross-cultural psychology is to omit samples that may not fit the expected pattern (outlier removal). From conversations with other researchers, this seems to be common practice.

Taken individually, each of these practices increases the likelihood of finding significant results only in a relatively minor way, but their combination leads to substantially inflated rates of significant results: in the Simmons et al. simulations, combining all four questionable practices pushed the false-positive rate at the magical .05 significance level to about 60%! Based on conversations with colleagues and observations of publication trends, these practices are common in cross-cultural psychology. This means that we probably need to question a good number of published empirical findings!
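The way such practices compound can be sketched with a small simulation. This is not the Simmons et al. code: it models only two of the four practices (two correlated dependent variables, and adding 10 observations per group after a non-significant first look), it uses a z-test with known unit variance instead of a t-test, and the function names, the cell size of 20 and the correlation of .5 are assumptions chosen for illustration. The null hypothesis is true throughout, so every "significant" result is a false positive.

```python
import random
from statistics import NormalDist, fmean

# Two-sided .05 threshold for a z-test (about 1.96).
Z_CRIT = NormalDist().inv_cdf(0.975)

def significant(a, b):
    """Two-sample z-test, assuming unit variance in both groups."""
    z = (fmean(a) - fmean(b)) / (1 / len(a) + 1 / len(b)) ** 0.5
    return abs(z) > Z_CRIT

def draw(n, rng, r=0.5):
    """n observations of two DVs correlated at r, with no true group effect."""
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = r * x + (1 - r * r) ** 0.5 * rng.gauss(0, 1)
        rows.append((x, y))
    return rows

def one_study(rng, two_dvs, optional_stop, n=20, extra=10):
    """Return True if the (hypothetical) researcher can report p < .05."""
    g1, g2 = draw(n, rng), draw(n, rng)
    dvs = (0, 1) if two_dvs else (0,)

    def any_significant():
        # Report whichever DV "works".
        return any(
            significant([row[d] for row in g1], [row[d] for row in g2])
            for d in dvs
        )

    if any_significant():
        return True
    if optional_stop:
        # Not significant yet: collect more data and test again.
        g1.extend(draw(extra, rng))
        g2.extend(draw(extra, rng))
        return any_significant()
    return False

def false_positive_rate(two_dvs, optional_stop, n_sims=4000, seed=1):
    rng = random.Random(seed)
    hits = sum(one_study(rng, two_dvs, optional_stop) for _ in range(n_sims))
    return hits / n_sims

for two_dvs in (False, True):
    for optional_stop in (False, True):
        rate = false_positive_rate(two_dvs, optional_stop)
        print(f"two DVs={two_dvs!s:5} optional stopping={optional_stop!s:5} "
              f"false-positive rate={rate:.3f}")
```

With neither practice, the rate should sit near the nominal 5%. Each practice alone raises it somewhat, and the two together raise it further, which is the qualitative pattern described above.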

A further issue is that the null hypothesis of no difference is likely to be rejected if there is a difference on any third variable that is related to the dependent variable. In such instances, there is a high rate of Type 1 errors (false positive results). One pressing issue is method bias. In questionnaire studies, response biases such as acquiescence or yes-saying are particularly salient.

The next figure shows the probability of finding a significant result as a function of sample size and the size of the bias. The lines show different levels of bias, expressed in standard deviations. If the bias effect is small (e.g., 1/16th of the standard deviation on the DV), increasing the sample size does not increase the probability of finding a significant effect by much. However, when the bias approaches .25 of a standard deviation, the probability of finding a significant effect in a sample of 100 participants approaches 60%. You may argue that ¼ of a standard deviation is large. However, it is not an unrealistic scenario given the prevalence and extent of response styles in questionnaire research; see, for example, our earlier research showing that response styles produce bigger effect sizes than 1/3 of theoretically important research studies.
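The logic behind such a figure can be approximated analytically. The sketch below is not the computation behind the figure, whose exact design is not described here: it assumes a two-sided, two-sample z-test with equal group sizes and unit variance, and the function name `prob_significant` is made up for illustration. The only "effect" is a method bias that shifts one group by a fraction of a standard deviation, so the exact percentages will differ from those in the figure.

```python
from statistics import NormalDist

def prob_significant(bias_sd, n_per_group, alpha=0.05):
    """Chance that a two-sided two-sample z-test rejects the null when the
    only group difference is a method bias of `bias_sd` standard deviations."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected value of the z-statistic under a shift of bias_sd.
    shift = bias_sd * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

for bias in (1 / 16, 1 / 8, 1 / 4):
    row = [round(prob_significant(bias, n), 2) for n in (50, 100, 200, 400)]
    print(f"bias = {bias:.4f} SD, n per group 50/100/200/400: {row}")
```

Under these assumptions the rejection probability climbs slowly with sample size for a small bias and much more steeply for a larger one, which is the pattern the paragraph above describes.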



These two simulation studies suggest that cross-cultural differences might be spurious and driven by method effects. In addition, our field seems to be driven by differences and appears to place undue emphasis on them without questioning their validity. The next figure shows the emphasis on differences and the lack of studies hypothesizing and finding similarities. This graph is adapted from a review by Brouwers and others, published in JCCP in 2004. As can be seen, the majority of studies expected differences only (N=55) and only 25 studies expected both differences and similarities. At the same time, 57 studies found both. Most importantly, given the laws of probability, we should also have studies that expect and report only similarities. Brouwers and colleagues did not find a single study that either hypothesized or reported similarities only. Where are these studies?




The points raised so far should not be understood as challenging the experimental methods underlying comparative research. Rather, we would urge our colleagues to critically question some of our designs and analytic procedures. In the larger experimental literature, a number of strategies have been proposed, including:

- stricter designs (larger n, Button et al., 2013) for more power
- stricter analysis (p < .005, Johnson, 2013)
- prevention of experimenter bias (O. Klein et al., 2012)
- more transparency (e.g., pre-registration of hypotheses)
- replication across multiple researchers and labs (R. A. Klein et al., 2014)

We see it as a good sign that replication studies have achieved new status. For example, an earlier attempt by our lab to replicate the culture-level value structure by Schwartz using data from the Rokeach Value Survey faced a real uphill battle in getting published. What seemed to save it was the appearance of a new value type that was not evident in the earlier Schwartz circle (a replication of this new value type is still outstanding). The new emphasis on replication is, in my opinion, a major achievement. The first findings of this new wave of replications are coming in. For example, the following graph shows the replication success of a number of studies in the ‘many labs’ replication report. Some of the older studies hold up well to scrutiny, but many of the newer findings, in particular priming studies, did not replicate.

What is also noteworthy is that in the original dissemination of these findings, the lack of cross-cultural differences in the patterns was emphasized. Some commentators were quick to jump on that and suggested that careful and experimentally strict replications will do away with cross-cultural differences. We may want to challenge such an assumption, but these comments clearly demonstrate that we as cross-cultural and cultural psychologists need to engage with the replication debate. We cannot sit back and pretend that the replication crisis does not affect us!



Replications vary along an underlying dimension, with exact replications at one end and conceptual replications at the other. The conventional experimental wisdom is to prioritize exact replications, or to stick as closely as possible to the original designs (close replications), with large sample sizes to have high power to detect effects. Of course, we know that exact replication in a cross-cultural context is problematic due to the different cultural conditions of participants.

However, an even more important point for us is that the presence of bias (e.g., response styles, speed-accuracy trade-offs) challenges the validity of exact or close replications. A replication of a biased study is a replication of a biased study.

In addition, if we have two samples and we define one sample as belonging to culture X and the other as belonging to culture Y (this could be anything: collectivistic vs individualistic; independent vs interdependent self-construals; honour vs dignity; holistic vs analytic thinking), then any difference on whatever variable will be statistically related to the presumed X-Y difference. Therefore, replications in cross-cultural psychology need to be positioned towards the conceptual replication end and require additional methodological safeguards.

We suggest that cross-cultural replications need to:

- ensure the validity of procedures in the local context

- include empirical checks on the postulated antecedent (what theoretical process is likely to drive the expected patterns, tested empirically)

- include manipulation checks (including a “no-difference” condition: on what variable or set of variables do we NOT expect a difference?)

- control for likely alternative explanations (e.g., response styles).



To summarize the points so far, cross-cultural psychology suffers from many of the same shortcomings that have created the crisis in social psychology. A somewhat humorous account borrowing from Dante’s version of hell is provided by this cartoon (by Neuroskeptic, published in Perspectives on Psychological Science). Our research culture, which emphasizes differences instead of similarities, leads to a state of limbo, overselling and post-hoc story-telling. Our narrow orientation towards ghost-in-the-machine variables (such as collectivism, self-construals and values) leads to overselling (everything needs to be explicable by single dimensions, typically of personal relations or self-construals), post-hoc story-telling and p-value fishing. From personal experience publishing cross-cultural research, nearly any difference can, with a bit of theoretical creativity, be related back to individualism-collectivism, self-construals or any of the other currently fashionable constructs. These biases in orientation, together with researcher practices and researcher degrees of freedom, then lead to p-value fishing and creative outlier utilization. Of course, the absence of no-difference studies suggests a significant file drawer problem.

Our suggestions are therefore:

Better designs (including efforts to reduce bias, testing of alternative theoretical processes, etc.)

Planned replications

Depositing hypotheses and methods in a public archive prior to data collection



Peter Smith: To understand cultural variation let’s sample cultural variations


Peter Smith and colleagues suggested a rather straightforward approach for addressing some of the concerns. Their recommendation was to go beyond two-culture comparisons and to sample cultural variation more broadly, e.g., by studying multiple Asian and non-Asian samples that are typically lumped together as collectivist, interdependent, holistic, etc. In addition, Peter and colleagues included more diverse instruments capturing conceptually similar constructs to examine variability in the intended constructs across a broader range of instruments. Peter presented some preliminary data that supported the usefulness of this approach. However, he also acknowledged that the current study has some important limitations, including studying students, not yet having enough samples to properly examine effects (e.g., through multi-level modeling) and a high demand on participants (e.g., completing long sets of questionnaires).


Johnny Fontaine: A plea for domain representation


Johnny presented a more technical account of domain representation that examined the meaning of constructs across a larger number of languages and cultural contexts. Using examples from the emotion domain, he showed that we can avoid confusion and biases in meaning through the use of sophisticated non-metric statistical methods in combination with elaborate designs that allow separating situational and personal characteristics. His approach demands a theoretical analysis of possibly important variables that need to be incorporated into the research design. Johnny really got the methods guns blazing in his presentation and I have to admit that the heat of the room by that time had fried my brain. As a consequence, I was not able to follow all the intricate steps in the procedure and not having a seat did not allow me to take good notes (but the graphs looked very convincing). He is working on a manuscript detailing the procedures and I am certainly looking forward to reading it when it is ready.

David Matsumoto: Random thoughts about methodological vulnerabilities in research on behavior and culture.


David broadened the symposium by focusing on the wider research climate in culture and psychology. Most people in this overheating room will have appreciated his first demand: before he started talking he asked everyone to stand up from their seats. Beyond bringing some oxygen into our brains, this then became a beautiful point of reference for his short and sharp presentation. Here are his three main arguments (my paraphrasing):

Point 1: Study behavior

Point 2: Respect the literature

Point 3: The current pressures on young academics make following recommendations 1 and 2 challenging


The first point is obvious: our discipline confuses self-, other- and peer-reports of behavior with behavior itself. I do not have hard stats here, but from memory I cannot recall a single cross-cultural social psychology study in JCCP in the last year or two that studied actual behavior. He pointed out that everyone had stood up when he asked at the beginning of his presentation, a success rate of 100%. In contrast, had he asked people whether they would stand up in a seminar room when asked by the presenter (e.g., on a scale from 1 to 7), there would have been significant variability and the mean would definitely have fallen short of the equivalent of 100%. Drawing upon his own research on emotion display, he argued that triangulation of research methods is necessary.

The second point highlights the importance of getting to know previous research. In the current research environment, researchers need to present novel results and theory. There is no incentive for reading (or penalty for not reading) older research that may have been conducted 10 or 40 years ago. Journal editors are keen to get citations to recent papers to increase the journal’s Impact Factor. Yet, this leads to impoverished and non-cumulative research.

The last point highlights the constraints that young pre-tenure researchers are facing: more publications in less time. Studies of behavior are time-consuming and therefore less appealing. Reading relevant literature in one’s field or neighboring disciplines also takes time away from writing articles and funding applications. David emphasized that IACCP has a richer intellectual tradition than many mainstream researchers who have discovered culture and now publish in high-impact journals.


Some of the discussion points


The discussion turned repeatedly on a number of points. I will try and summarize some of the key ones that stood out for me.

Representativeness of samples: One key concern that came up repeatedly was that studying students is not appropriate for making claims about cultural processes. Students are not good representatives of the larger population.

Studying nations: One early comment that drew spontaneous applause from the audience was that psychology has failed in studying culture. Instead, psychologists are studying nations. Yet, nations are highly diverse and consist potentially of many subcultures. Various other commentators picked up similar themes throughout the discussion. One issue that is related here was the relative emphasis on between-country/culture differences and the lack of attention to within-country/culture differences. Both Geert Hofstede and Shalom Schwartz were in the audience, but they remained silent – it would have been nice to hear their responses to some of these comments (and both have done some interesting work that would have been informative in this debate).

Lack of strong theory: Peter Richerson argued that psychologists lack strong theory and recommended looking to neighboring disciplines such as biology for inspiration. David Matsumoto defended psychology in his response, suggesting that psychology has some good theories. But he also added that we need truly exploratory work that can understand phenomena on their own terms. My thought on this is that we do not have enough strong theory (from a philosophy-of-science perspective) and that exploratory research with attention to various alternative explanations may bring us closer to developing stronger theories of culture (e.g., by including the possibility of no differences and by attending to alternative processes beyond the usual suspects in current psychological thinking on culture).

Validity of findings: One point that appeared in various guises in a number of comments was the importance of the validity of findings in the local context. Amina Abubakar was the first to get this point across in the debate: To what extent can cultural psychology and cross-cultural research as a method of choice yield insights into the minds and behaviours of people in a specific context? How applicable and relevant is cross-cultural research for people around the world? This is a major question and needs some serious contemplation as we face a rapidly changing world and need to collectively respond to multiple pressing challenges (e.g., increasing intergroup conflict, climate change, decreasing natural resources).

Next steps


An immediate opportunity arose the day after the round-table discussion. Ype challenged the assembly that methods issues need more attention, and in response Walt Lonner, as the founding editor of JCCP, suggested a methods-oriented special issue for JCCP. We had a discussion during the coffee break and he invited us to write a proposal for a special issue. Any thoughts on topics and contributors for such a special issue addressing the methods challenges are very welcome (please flick me an email or respond below; I would love to hear from you).

Looking at some other associations (APS comes to mind here), we could adopt some of their criteria for publication – there have been some interesting suggestions and changes in policies recently. Even JPSP now publishes replications (hooray!!!!!!)!

Overall, I think that the change in research climate is promising. There has never been a more positive time to discuss how we collectively do research; there is much promise of change in the air, and I strongly believe that collectively we can make a positive change. Without this conviction, we would not have had the symposium and such a large crowd keen to brave tropical temperatures and horrible conditions in the late afternoon to debate a topic so passionately. I felt humbled by the enthusiasm of the audience and the positive comments that we received over the next couple of days. I look forward to continuing this debate and hearing your opinions and suggestions!

3 comments:

  1. 1) Some journals (e.g., J. Business & Psych) are offering a nice option: researchers can submit their (sort of) research proposals, including a detailed section on planned methods and analyses, to journals, and if reviewers give it a go, the results are published regardless of whether tests were significant or not. I think that can be an option for JCCP too.
    2) Representativeness is an issue way beyond using student samples vs. community samples. Re-introducing some probability sampling might be another nice topic for the special issue you're talking about.

    Replies
    1. I agree that some form of prepublishing hypotheses, methods and proposed analyses would be highly desirable. This could be published on the journal website and the final article with results would be published in the traditional journal format. I think this is a good and sensible approach, and I will try to work towards it with the JCCP team.
      The representativeness of samples is also an important issue - I completely agree with you. One important question that needs to be addressed is the pervasiveness of cultural processes.

  2. Thanks for the write up! Sounds like a very productive bunch of presentations and broadly speaking I don't have any issue with any of the points that the presenters/the general academic climate are making in regards the need for replication, more diverse samples etc.

    I would however suggest a note of caution from the discipline of anthropology. Critical self-assessment of a field is great and it is important to recognise limitations, but there does come a point when such a preoccupation, especially when comparing differences in culture, can tip over into promoting relativism. I also wonder how possible it is to expect young academics to draw their samples from multinational sites. Collaborations can certainly help address this, but in creating these you are also adding additional researcher degrees of freedom and potentially extending data collection time significantly.

    These aren't impossible hurdles to overcome and, to a large extent, I think young researchers are in favour of making serious efforts to create improvements to the field. I would just be wary of retreading the path to cultural relativism that anthropologists have already laid bare. I say this incidentally as a person pursuing an anthropology DPhil.
