Are moral foundation scores heritable? A new paper by Kevin Smith et al. has tested moral foundations theory in a sample of Australian twins and found that moral foundations are neither heritable nor stable over time. The authors summarized MFT fairly, and they used a research design that is appropriate for the questions they asked, but with one major flaw: They improvised an ultra-short measure of moral foundations (around 2008) that turned out to be so poor that they essentially have no measure of moral foundations at time 1. Then at time 2 (around 2010), they used a better measure when the MFQ20 became available. As I show below, little can be concluded about heritability and nothing can be concluded about stability from their data.
Citation: Smith, K. B., Alford, J. R., Hibbing, J. R., Martin, N. G., & Hatemi, P. K. (2016). Intuitive Ethics and Political Orientations: Testing Moral Foundations as a Theory of Political Ideology. American Journal of Political Science. doi: 10.1111/ajps.12255
Abstract: Originally developed to explain cultural variation in moral judgments, moral foundations theory (MFT) has become widely adopted as a theory of political ideology. MFT posits that political attitudes are rooted in instinctual evaluations generated by innate psychological modules evolved to solve social dilemmas. If this is correct, moral foundations must be relatively stable dispositional traits, changes in moral foundations should systematically predict consequent changes in political orientations, and, at least in part, moral foundations must be heritable. We test these hypotheses and find substantial variability in individual-level moral foundations across time, and little evidence that these changes account for changes in political attitudes. We also find little evidence that moral foundations are heritable. These findings raise questions about the future of MFT as a theory of ideology.
Below is the signed (non-anonymous) review I submitted in August of 2014, when I was asked to evaluate a previous version of the manuscript for a different journal. The authors seem to have added some new datasets that use the MFQ 30 (on non-twins) to examine the factor structure of the MFQ, but there is nothing they can do to fix the measurement problems from wave 1 of their twin study. I have put in bold the main problems with the article, so you can get the point by just reading those sections.
I did not know who the authors were in 2014. Now that I know, I can add that I am a fan of their work and their general approach to integrating biology, psychology, and political science. We all agree that individuals are biologically and genetically “predisposed” to find certain political ideas more agreeable, but that these predispositions are not predeterminations: life experiences and learning shape political identities as they develop beyond those predispositions. So we all agree that something about political identity is heritable. (To see our main statement on Moral Foundations Theory, evolution, and personality, click here.)
This paper has the potential to be very important. It is always a big deal when a psychological construct gets its first heritability data, particularly when that construct is related to highly visible recent developments on the heritability of ideology. It is also the case that this paper is well argued and well developed theoretically. The authors have read my work closely and summarized it fairly with a few small exceptions.
The entire question of publication, however, turns on whether the authors have adequately measured moral foundations. If the measures they use to quantify the moral foundations in each participant are nearly as good as the measures they use to quantify ideology, then the inferences they make about causal directions and heritability are on solid ground. But if the two-item measures of moral foundations used here are much less reliable than the 23 item measure of ideology, then of course it will appear that ideology is stable and heritable (if indeed it is heritable), while moral foundations will appear unstable and unheritable (even if they are heritable).
I believe that the 2-item measures used in this paper to assign scores on moral foundations are extremely poor measures, based on my own data, on nationally representative American data, and on the authors’ reported data.
Here’s the main problem: My colleagues and I have examined scales of varying lengths, and concluded that 20 items is the shortest MFQ that we are willing to endorse. We base this conclusion on what we think was an innovative analysis. In Graham, Nosek et al. 2011, the paper that presented and validated the MFQ-30, we conducted a variety of tests of reliability and validity. We began with our original 40 item scale, consisting of 20 “relevance” items – 4 for each of the 5 foundations, and 20 “judgment” items, consisting of 4 statements for each of the 5 foundations. We then asked how much we would lose, in terms of the scale’s ability to predict other scales (as markers of external validity) if we shortened the scale. We found that nothing is lost when we move down from 40 to 30 items, dropping the worst 2 items for each foundation (one relevance, and one judgment item). However, dropping from 30 to 20 items produces a moderate loss, and dropping from 20 to 10 produces a big loss in predictive validity. We therefore endorsed the MFQ-30 as our main measure of the moral foundations, and we also offered the MFQ-20 for cases in which researchers need the absolute shortest form they can get away with. However, on our website, MoralFoundations.org, we specifically say this:
“If you must have a shorter version, you can use the MFQ20 (a 20 item version of the MFQ, with alphas that are slightly lower and with less breadth of conceptual coverage)… Please use the MFQ30 if at all possible. It’s hard to get good measurement with just 4 items per foundation!”
Unfortunately, the current authors used just 10 items – a mere 2 items per foundation. To make matters worse, they chose the format that is more abstract, further away from moral intuition, and, in our experience, harder for less educated participants to process. As we say in our 2011 paper validating the MFQ:
“we wanted to supplement the abstract relevance assessments—which, as self-theories about how one makes moral judgments, may be inaccurate with regard to actual moral judgments (Nisbett & Wilson, 1977)—with contextualized items that could more directly gauge actual moral judgments…. the Relevance subscale may better assess explicit theories about what is morally relevant, and the Judgments subscale may better assess actual use of moral foundations in judgment.”
To make matters even worse still, three of the ten items they chose happen to be among the worst of our 40 original items – they are items we don’t even include in the MFQ30, because they reduced (or did not contribute to) reliability. The items are “harmed,” “affected your group,” and “fulfilled duties of his or her role.” So three of the five subscales in Wave 1 are essentially using a single valid item to assess foundation scores.
To examine the matter empirically, I found an old database from 2008 that had data from our original 40 item scale, including all 10 of the items used by the present authors. The dataset had 28,596 participants who completed the MFQ40, mostly from the USA, followed by Canada, UK, and Australia. I then examined how well each of the 5 foundation subscales does in terms of reliability and also correlation with self-described political ideology (1 = very liberal, 7 = very conservative). I computed these values for the full MFQ (with 6 items per scale), as this is our baseline, or best measure of the constructs. Those numbers appear in the first data column, below. You can see that reliability is generally good (average alpha = .74), and the correlations with ideology are robust (average absolute value = .45).
I then examined how much these numbers fall off when the scale is shortened to the MFQ-20, using the items that we endorsed. As you can see, the dropoff is slight, which validates our choice of items for the MFQ20. Finally, I examined how these numbers fall off when the scale is shortened to the exact 10 items used in Wave 1 in the current study. The dropoff is large. Reliability plummets to .532, and correlations with ideology plummet to r=.286.
2008 Megafile with n = 28,596 who took the MFQ40
| |6-item (MFQ-30)|4-item (MFQ-20)|2-item scale used in twin study wave 1|
|CORRELATIONS W. IDEOLOGY| | | |
|Avg absolute value|.445|.401|.286|
I believe this very poor measurement constitutes an insurmountable defect in the data, and therefore in the argument made in the paper on the basis of the data.
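As a rough illustration of why scale length matters so much, the Spearman-Brown prophecy formula predicts how reliability changes when a scale is shortened. This is a minimal sketch, not part of the analysis reported above: it assumes all items are parallel (equally good), which the real MFQ items are not, and the function name `spearman_brown` is chosen here for illustration. The input of .74 is the average alpha reported above for the 6-item subscales.

```python
def spearman_brown(reliability, old_len, new_len):
    """Predicted reliability after changing scale length.

    Assumes parallel items (equal variances and inter-item correlations),
    which is an idealization and not a property of the actual MFQ items.
    """
    k = new_len / old_len  # length ratio, e.g. 2/6 when cutting 6 items to 2
    return k * reliability / (1 + (k - 1) * reliability)


# Average alpha of .74 for the 6-item subscales is taken from the text above.
for n_items in (6, 4, 2):
    predicted = spearman_brown(0.74, old_len=6, new_len=n_items)
    print(f"{n_items} items per foundation: predicted reliability {predicted:.3f}")
```

Under these assumptions, length alone pushes a 2-item subscale to a reliability of roughly one half, which is in the same neighborhood as the observed average of .532.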
To use an analogy, suppose I wanted to assess the stability of a mountain range over time. I ask my 8 year old son to draw mountain #1, and I take a photograph of Mountain #2. Two years later, I ask my son to draw mountain #1 again, and I take another photograph of Mountain #2. If I compare the images across time, I would conclude that mountain #1 was unstable and highly changeable, whereas Mountain #2 was very stable over time.
I think the same thing is happening here: The present study claims that moral foundations and the MFQ are largely useless, but in fact it’s just a measurement problem. It is not appropriate to demolish a theory when one has not actually measured the central constructs of the theory.
I will now run through the manuscript in order, noting additional points.
–p. 1-6: excellent summary of the theory and its entailments, except as noted below.
–p.2: The claim is made that “any within-individual change in these modules should cause a predictable within-individual change in ideology.” This is false. If people become a bit more compassionate after traveling in a poor country, or a bit more prone to endorse authority after having children, there is no reason to think that one would pick up a change in their self-described ideology or political party. Ideology is a very pronounced social identity; it is socially sticky. It is not just a direct readout of one’s scores on the MFQ.
–p.5, the environmental component of moral development is not just “social reinforcement”. We are not behaviorists. It’s learning of all sorts, some of which is self-constructed, some of which is related to narrative… moral development is complex. People’s moral and political identities can change with no change in their underlying foundations, just as a house can be remodeled with no change to its foundation. It’s just that some kinds of houses/ideology are more likely than chance to be built on some kinds of foundations.
–p.7: it’s not appropriate to speak of fractions, such as “a fifth of the moral domain”. Care and Fairness support most of the moral domain for just about everybody in the Western world, liberal or conservative. We do not claim that all foundations are equally important or prominent; it’s very difficult to judge percentages anyway.
–p. 7, the affirmative action example is a poor one, because it’s about differing conceptions of fairness, mostly, on both sides.
–p. 8, the social reinforcement point again. Also, “ideology is a stable dispositional trait IN PART because the underlying moral foundations are stable dispositional traits.” People’s self-narrative, local community, public identity, etc contribute to the stability of ideology.
–p.9, with the above modifications, the 3 assumptions of MFT seem valid to me. I would not expect small changes in moral foundations to be detectable in later ideology, but big changes might be. And given that just about everything is heritable to some degree, I would expect moral foundations to be so too, even if they are Level 2 constructs in McAdams’ terms. So I really like the overall design of this study.
–P. 12: The authors devote considerable effort to showing that their 2-item MFQ subscales have high reliability and predictability, but as I’ve shown above, I think this is unlikely. One of the problems with the “relevance” items is that they use an unusual item format, which we have found is difficult for less educated participants. Participants are not asked to tell us what they believe, as in the “judgments” format. They are asked to rate “When you decide whether something is right or wrong, to what extent are the following considerations relevant to your thinking?” We have concluded that this format introduces a large method effect – some people interpret the request in such a way that everything gets a high rating; others interpret it such that they use the middle of the scale, or a broader range of the scale. As evidence of this, we can look at the alpha of the entire scale for each format – all 15 items in the relevance section, compared to all 15 items in the judgment section. There is no reason why the entire relevance section should have a very high alpha – there are 5 subscales, and we don’t want to see that they all correlate with each other. Yet in fact, they do.
Using that same dataset as before, with the whole MFQ40, we find that alpha for the Full set of 20 relevance items is very high, .853, whereas for the 20 judgment items, it is lower, .728. This indicates that there is more of a method artifact in the Relevance section. Some people give really high scores, on all questions, some give lower, on all questions.
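The logic of a method artifact can be sketched with simulated data. The example below is a toy model under stated assumptions, not a re-analysis of any MFQ dataset: it generates 20 items loading on five mutually independent factors (4 items each), optionally adds a person-level "response style" term shared by every item, and computes Cronbach's alpha over all 20 items. The names `cronbach_alpha` and `simulate_items`, and all variances, are hypothetical choices for illustration.

```python
import random
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_vars = [pvariance([row[j] for row in rows]) for j in range(k)]
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def simulate_items(n_people, style_sd, seed=0):
    """20 items on 5 independent factors, plus an optional shared response style."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_people):
        style = rng.gauss(0, style_sd) if style_sd > 0 else 0.0  # same shift on every item
        factors = [rng.gauss(0, 1) for _ in range(5)]  # independent "foundations"
        rows.append([factors[j // 4] + style + rng.gauss(0, 1) for j in range(20)])
    return rows


alpha_without_style = cronbach_alpha(simulate_items(2000, style_sd=0.0))
alpha_with_style = cronbach_alpha(simulate_items(2000, style_sd=1.0))
print(f"alpha, no response style:     {alpha_without_style:.3f}")
print(f"alpha, shared response style: {alpha_with_style:.3f}")
```

In this toy model the whole-section alpha rises sharply once a shared response style is added, even though the five underlying factors remain uncorrelated. This is the pattern one would expect if the relevance format carries a larger method effect than the judgment format.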
When we first started getting nationally representative datasets, in which the education level is lower than at YourMorals.org, we began to see that this problem is even more acute. We have one nationally representative American sample, collected as part of a paid module added on to the ANES, which included the full MFQ20. We can assume that this dataset is more comparable to the Australian dataset than is data collected at YourMorals.org. When I run the alphas for the two parts there, the relevance section goes up, to .89, whereas the alpha for the judgment section goes down slightly, to .699. If the present study also has an overall alpha for the 10 relevance items that approaches .90, then of course any two items will correlate well, and will produce alphas in the .6 or .7 range.
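The arithmetic behind that last claim can be checked with the standardized form of Cronbach's alpha, which depends only on the number of items and their average inter-item correlation. This is a sketch under the simplifying assumption that every pair of relevance items correlates at the section average; real pairs will vary around it. The function names are illustrative.

```python
def mean_r_from_alpha(alpha, k):
    """Average inter-item correlation implied by a standardized alpha over k items."""
    return alpha / (k - (k - 1) * alpha)


def standardized_alpha(mean_r, k):
    """Standardized alpha for k items with average inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# .89 is the relevance-section alpha reported above for the 10 relevance items.
mean_r = mean_r_from_alpha(0.89, k=10)
two_item_alpha = standardized_alpha(mean_r, k=2)
print(f"implied average inter-item r: {mean_r:.3f}")
print(f"alpha for a typical 2-item pair: {two_item_alpha:.3f}")
```

Under this assumption, a 10-item alpha of .89 implies an average inter-item correlation in the mid .40s, and a typical pair of items then yields an alpha of about .6, whether or not the two items belong to the same foundation.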
Again, my point is that the relevance section is just less powerful and reliable as a measure of moral foundations than is the judgments section. This is why we are phasing it out in the new MFQ that we are currently developing. It is not surprising that the authors obtained moderately high alphas on their two-item MFQs, which use just one format that has a big method effect. Their alphas are indeed comparable to the alphas we report in Graham et al, but those alphas reflect subscales with two different item formats (relevance and judgments). Using two item formats was a choice we made to reduce the influence of any one method effect, but it lowers our alphas substantially from the standard personality measure, which uses just a single item format.
–P. 12 Given how poor the measurement of moral foundations is in both waves, particularly wave 1, it is neither surprising nor informative that the authors could not replicate our factor findings.
–P.13. The authors say: “ Overall, then, the MFQ items perform reasonably well psychometrically and replicate the key empirical finding of MFT in two different, though overlapping samples. This gives us confidence that we have a robust platform to test the stability of moral foundations, their impact on political attitudes across time, and their heritability.” I strongly disagree. The moral foundations were simply not measured well enough to conclude anything about stability or causality or heritability. This would explain why the test-retest numbers are so dismal on p. 14.
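The link between poor measurement and low test-retest numbers follows from the classical attenuation formula: the observed correlation between two measures equals the true correlation multiplied by the square root of the product of their reliabilities. The sketch below is an illustration under classical test theory assumptions (uncorrelated errors), not a re-analysis of the twin data. The wave-1 reliability of .532 is the average alpha from my 2008 dataset for the same 10 items; the wave-2 value is purely hypothetical.

```python
import math


def attenuated_r(true_r, reliability_1, reliability_2):
    """Expected observed correlation given the true correlation and two reliabilities.

    Classical test theory: r_observed = r_true * sqrt(rel_1 * rel_2).
    """
    return true_r * math.sqrt(reliability_1 * reliability_2)


wave1_reliability = 0.532  # average alpha of the 2-item scales in the 2008 dataset
wave2_reliability = 0.65   # HYPOTHETICAL value, chosen only for illustration

# Even a perfectly stable trait (true_r = 1.0) cannot show a high observed stability.
ceiling = attenuated_r(1.0, wave1_reliability, wave2_reliability)
print(f"ceiling on observed test-retest r: {ceiling:.3f}")
```

With reliabilities this low, a perfectly stable trait would still show an observed test-retest correlation well below 1, so low stability coefficients cannot by themselves distinguish an unstable trait from a badly measured one.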
–P. 15: The authors attempt to validate their 2-item MFQs by analyzing a larger dataset that used MFQ-20. They show that for each foundation, their 2 items correlate with the four items for that foundation in the MFQ-20 quite well, ranging from r=.73 to r=.88. But these correlations seem to reflect part-whole correlations. Of course a score composed of two numbers correlates with a score composed of 4 numbers including those two numbers. To test just how much the part-whole confound could account for their high correlations, I used a random number generator to create 4 columns of 1000 digits each, ranging from 1 to 6, as MFQ items do. I then calculated one score that was the average of the first two columns, and a second score that was the average of all four columns. I then checked the Pearson correlation of the two scores. The answer: r = .744. Four columns of random digits give us nearly the same correlations that the authors report as a validation that their 2-item MFQ scores are good proxies for the 4 item MFQ20 scores. They are not good proxies. The authors are therefore incorrect when they say that “These results give us considerable confidence that our MFQ instruments are capturing the lion’s share of the variance that would be picked up by the more contemporary MFQ 20, and that the psychometric properties of our instruments are reliable and consistent.”
The bottom line is that if the authors had used the actual MFQ20 in both waves of their study, they would have obtained good enough measurement to begin drawing conclusions. But because their proxies are demonstrably worse, much worse, than the MFQ20 and MFQ30, their measures of moral foundations simply cannot be compared to their measure of ideology, which we presume is much more accurately measured with a 23-item scale. So even though I think their logic is generally sound while drawing these inferences on pages 13-21, they simply don’t have data that could justify ANY inferences about the moral foundations, or about moral foundations theory.
–P.24: The authors raise some valid concerns about the MFQ: we agree that many of the items are measuring contextualized judgments and attitudes, which we believe are constructed on top of the foundations – they resonate with some people more than others because of underlying psychological differences. But such items do not measure the foundations themselves. We are trying, in our ongoing revision efforts, to create assessment methods that activate more rapid intuitions, and less reasoning.
To conclude, let us look at the authors’ summary of the implications of their work, on p. 23: “Our findings run contrary to assumptions underpinning MFT as it gains increasing traction in the political, psychological and behavioral sciences as theory of ideology. The obvious inference to take from our analyses is that individuals are not born with innate moral value systems, or at a minimum that MFQ instruments do not comprehensively tap into these innate systems.”
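The random-number check described in the P. 15 note is easy to re-create. The sketch below is an illustrative re-creation rather than the original computation: the seed, function names, and use of Python's `random` module are assumptions. For four independent columns with equal variance, the expected correlation between the 2-column average and the 4-column average is sqrt(2/4), or about .71, with sampling noise around that value at n = 1000.

```python
import math
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def part_whole_r(n_rows=1000, seed=0):
    """Correlate the mean of 2 random columns with the mean of all 4 columns."""
    rng = random.Random(seed)
    # Four columns of random integers from 1 to 6, mimicking MFQ response options.
    cols = [[rng.randint(1, 6) for _ in range(n_rows)] for _ in range(4)]
    two_item = [(cols[0][i] + cols[1][i]) / 2 for i in range(n_rows)]
    four_item = [sum(col[i] for col in cols) / 4 for i in range(n_rows)]
    return pearson(two_item, four_item)


print(f"part-whole r from pure noise: {part_whole_r():.3f}")
print(f"theoretical expectation:      {math.sqrt(2 / 4):.3f}")
```

Because the columns are pure noise, any correlation in this range comes entirely from the two shared columns, not from the items measuring a common construct.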
I think this inference simply cannot be drawn when, as I have shown, the authors did not use a valid version of the MFQ and do not seem to have accurately measured the key constructs of MFT. Furthermore, all data aside, how likely is their conclusion to be true? What should our prior expectation be? Given that almost every aspect of personality is heritable to some degree, including emotional predispositions such as compassion (which is at the heart of the care foundation), how likely is it that the authors have discovered one of the very few aspects of personality that is not heritable? Is it more likely that moral value predispositions don’t exist, or that they were not measured well in this particular study?