Lies, Trans Lies and Statistics
How gender ideology uses logical fallacies and bad data to fool people
I have been following the gender debates (if you can call them that) for nearly three years, and it has been an eye-opening experience. In many years of involvement in law and politics I have never encountered as much intellectual dishonesty as I have seen from the people defending males in female sports or the destruction of healthy body systems in gender-confused children.
I emphasize that this is intellectual dishonesty. Direct lying of the kind that could result in a perjury charge is rare. What is everywhere is fallacious reasoning, equivocation and misleading use of statistics. Most of the popular media outlets that have historically challenged this type of misinformation have been captured by the new orthodoxy.
I don’t have a science degree, but I did take introductory statistics, and as a lawyer, spotting and challenging weak arguments and evidence was part of my job. I have learned a great deal in the past three years from the dedicated scientists and medical professionals who have given their time to challenge gender nonsense on Twitter and elsewhere. This article is a summary of some of the key concepts I have had to learn or re-learn.
All Studies are Not Created Equal
In Twitter debates on gender, being able to cite an article from a peer reviewed journal is considered the gold standard of evidence. However, it is important to understand just what peer review involves.
A peer reviewed journal is one in which all of the articles are reviewed by other experts in the same field before publication. The reviewer will look, among other things, at whether the authors have used valid research methods, whether the article makes an original scientific contribution and whether the results of the data are described correctly. Peer review does not guarantee that the conclusions stated in an article are correct. It just means that the article was good enough to be published so that its conclusions can be examined by other scientists.
A peer reviewed study simply provides some evidence for its conclusions. How strong that evidence is depends on many factors. Some of the questions you need to ask when evaluating medical studies are:
How large was the study sample?
How was the sample chosen?
How reliable was the data collection method?
How many participants dropped out of the study before it was completed?
Were there confounding factors (things other than what the study is trying to measure) that could have affected the result?
Was there a control group?
Weak studies are not, by themselves, bad science. Sometimes they are the only practical way of studying an issue. The problems start when weak studies are used to support treatment decisions and policy decisions without acknowledging their limitations.
Systematic Review and Guidelines
There may be hundreds or even thousands of peer reviewed studies in a given area, and even specialists have difficulty keeping up with them all. A systematic review is a process for examining and evaluating studies in a consistent way. A team of researchers will search medical literature databases to identify relevant studies and make a critical appraisal of each one to assess how relevant and trustworthy it is.
Systematic review is normally the starting point for the development of clinical guidelines. A high quality clinical guideline also needs to be developed through a process which is transparent and free from conflict of interest. The 7th edition of the Standards of Care from the World Professional Association for Transgender Health (WPATH SOC7) did not meet the requirements for a reliable medical guideline. It was not based on a systematic literature review, and there was substantial conflict of interest in the development process.
WPATH says that its forthcoming SOC 8 is being developed through a process that includes a systematic review. It remains to be seen whether the final product will reflect the results of this review or political pressure from activist members.
Statistical Significance – Mind Your p’s
Results in medical studies are often referred to as “statistically significant”. This is a technical term that can be misunderstood. You might think of a result as significant if it is large enough to be meaningful. A 1 percent change in something is probably not significant but a 10 percent change would be.
The problem researchers encounter is that data always contains some random variation. Even large differences can be the result of nothing more than pure chance. Statisticians use mathematical techniques to identify results that are unlikely to have arisen by chance alone; roughly speaking, the less likely a result is to be a fluke, the more statistically significant it is.
When designing a medical study, researchers start with what is called a null hypothesis. This is something like, “This treatment will not produce any changes greater than what can be explained by pure chance.” They then analyze the data to see whether it disproves the null hypothesis. A statistician (or, more likely, a statistical program) calculates a p-value: the probability of obtaining results at least as extreme as those observed if the null hypothesis were true.
A p-value ranges from 1 down to 0, with a lower number representing a higher degree of significance. If a study finds a 12 percent reduction in depression scores after treatment with a p-value of 0.7, this means that results at least that large would show up 70 percent of the time even if the treatment did nothing. If the p-value is 0.3, such results would appear by chance 30 percent of the time. Most medical studies use a p-value of 0.05 as the threshold for statistical significance: a significant result is one extreme enough that it would occur by chance alone less than 5 percent of the time. (Note that this is not quite the same as a 95 percent probability that the result is real, although it is often loosely described that way.)
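To make this concrete, here is a minimal sketch of how a p-value is actually computed for a two-group comparison. The numbers are invented for illustration and do not come from any study discussed here:

```python
# Minimal sketch: comparing symptom scores in two invented groups.
from scipy import stats

treated = [12, 15, 11, 14, 13, 16, 12, 15]   # post-treatment depression scores
control = [14, 16, 15, 17, 13, 18, 16, 15]   # comparison group scores

# Welch's t-test; does not assume the two groups have equal variance.
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)

# The p-value is the probability of seeing a difference at least this large
# if the null hypothesis (no real difference between groups) were true.
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Statistically significant at the conventional 0.05 threshold")
else:
    print("Not significant; chance variation cannot be ruled out")
```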
Correlation, Causation and Confounding
Finding a statistically significant result is only part of the research process. The fact that a change in one variable (treatment) is correlated with a change in another variable (improved symptoms) in a statistically significant way does not prove that the treatment led to the improvement.
In order to prove causation you must demonstrate a plausible mechanism of causation and eliminate confounding factors. A confounding factor is something other than the factor you are trying to measure which could affect your result.
Studies on the benefits of puberty suppression and cross sex hormones usually have confounding factors. If a patient is receiving other treatment, such as psychotherapy or an anti-depressant, in addition to hormone therapy, then the additional treatments are confounding factors which might have produced the same improvement in symptoms even without the hormones.
This is an issue that arises in connection with discussions of suicide risk in transgender people. One example of this problem is an article by Dr. Jack Turban and others which claimed to find an association between recalled exposure to “conversion therapy” and suicide attempts among transgender adults.
The article had multiple major flaws (some of which will be discussed later), but one in particular was that it did not eliminate poor mental health as a confounding factor. Poor mental health is a major contributor to suicide risk and may also be a reason why therapists are reluctant to approve a person for gender transition.
Turban’s article was dubious science but the way in which it has been used in the media is grossly irresponsible. Suicide is complex and there are always many contributing factors. Suicide prevention organizations such as Samaritans have media guidelines for responsible reporting on suicide which warn against attributing suicidality to a single cause. These guidelines are flagrantly ignored in reporting on gender medicine when activists and even professionals proclaim that puberty blockers save lives.
Control Groups
One of the methods of eliminating confounding factors in studies is to use a control group. This is a group of patients with similar symptoms who receive a different treatment, or no treatment at all, so that their outcomes can be compared with those of the group receiving the treatment being studied.
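A toy simulation (with entirely made-up numbers) shows why a control group matters. If patients tend to improve over time whether or not they are treated, a simple before-and-after comparison can look impressive even when the treatment does nothing:

```python
# Toy simulation: improvement over time masquerading as a treatment effect.
# All numbers are invented for illustration.
import random

random.seed(42)

def simulate_patient():
    """Symptom score improves over a year whether or not treatment is given."""
    before = random.gauss(60, 10)           # enrolled while symptoms are bad
    after = before - random.gauss(15, 10)   # natural improvement over time
    return before, after

treated = [simulate_patient() for _ in range(200)]
control = [simulate_patient() for _ in range(200)]  # same process, no treatment

def mean_improvement(group):
    return sum(before - after for before, after in group) / len(group)

# Without a control group, the treated group's improvement looks dramatic...
print(f"Treated group improved by {mean_improvement(treated):.1f} points")
# ...but the untreated control group improved just as much.
print(f"Control group improved by {mean_improvement(control):.1f} points")
```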
The importance of a control group is shown by the fate of a study published in The American Journal of Psychiatry which attempted to show a reduction in mental health treatment among transgender people following gender affirming surgery. The study tracked mental health service usage by people who had received a diagnosis of gender dysphoria and subsequently had gender affirming surgery. It looked at data on mental health related doctor visits, prescriptions for psychiatric medication and hospitalizations following suicide attempts.
On the surface the study looked like high quality research. While previous studies suffered from small samples, self reported data and high loss to follow up, this one used records from the Swedish national health system over a period of 10 years. The study found that there was a significant reduction in the utilization of mental health services after surgery. This conclusion was widely reported as proof of the mental health benefits of gender affirming care.
However, the study lacked a control group, and this was pointed out by a series of letters to the editor from clinicians associated with the Society for Evidence Based Gender Medicine. The editors of the journal requested a re-analysis of the data using patients who had experienced gender dysphoria but not undergone surgery as a control group. This analysis found no significant differences in utilization of mental health services between the surgery and no surgery groups. The journal published a correction in which the authors conceded that their original conclusions were too strong. This correction has not received anything like the media attention that was given to the original article.
Sampling Bias
Statistical analysis enables researchers to draw conclusions about a larger population by examining a small sample of the population. These conclusions will be accurate within a predictable margin of error. For example, a report on an election poll will say that it is accurate within a margin of 3 percent 19 times out of 20. The accuracy of this analysis depends on the sample being randomly selected from the whole population. If the sample is not truly random, this can lead to sampling bias and make the result unreliable.
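For a simple random sample, that margin of error comes from a standard formula. Here is a quick sketch (assuming a proportion near 50 percent, where the margin is widest):

```python
# Margin of error for a proportion from a simple random sample.
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the confidence interval for a polled proportion.

    p=0.5 gives the widest, most conservative margin;
    z=1.96 corresponds to 95% confidence ("19 times out of 20").
    """
    return z * math.sqrt(p * (1 - p) / n)

print(f"n = 1,000:   ±{margin_of_error(1_000):.1%}")    # the familiar ~3 points
print(f"n = 100,000: ±{margin_of_error(100_000):.1%}")  # tiny, but only if random
# No sample size rescues a biased sample: the formula assumes random selection.
```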
One of the most famous examples of sampling bias is the Literary Digest poll of the 1936 U.S. presidential election. The magazine polled millions of people drawn from telephone directories and automobile registration lists and confidently predicted a Republican victory. Although the selection from those lists was random, the sample had a built-in bias: in 1936, well-to-do Republicans were far more likely to own a telephone or a car than poorer Democrats, so the lists were not representative of the whole electorate. The Democrat won by a large margin. (The Chicago Tribune’s 1948 “Dewey Defeats Truman” headline, printed on the strength of flawed polling before the results were in, is the other classic cautionary tale.)
Since then researchers have learned to be more careful in selecting samples that are representative of the populations they want to study. However, this principle is frequently ignored in transgender research and one repeat offender has been Dr. Jack Turban, who has published a series of studies based on the 2015 US Transgender Survey. Participants in the survey were recruited by transgender advocacy organizations across the United States. This resulted in a very large sample but one that may not have been representative of the transgender population as a whole. Any transgender people who were not involved in transgender activism would have been underrepresented.
Nevertheless, Dr. Turban and his colleagues have data mined this survey a number of times for publications which claim to support the gender affirming model of care. While the original publications may have noted the limitations of the data source, these limitations were ignored when the studies were discussed in the mass media.
The study on the association between conversion therapy and suicide attempts mentioned earlier was based on this survey. One of the many criticisms of the study is that it excluded anyone who did not identify as transgender at the time the survey was done. It would therefore not include anyone who experienced gender dysphoria but desisted and reconciled mind and body as a result of therapy. It would also exclude detransitioners who regretted their decision to transition and wished that their therapists had done more to address their real mental health issues.
Another widely quoted study by Dr. Turban, on the alleged benefits of puberty suppression, relies on the same 2015 survey. In this case, in addition to the usual problems with the quality of the data set, the sub-sample included people who could never have had access to puberty blockers. Puberty blockers were first used to treat gender dysphoria by a clinic in Amsterdam in the 1990s and were not widely used in North America until at least 2009. However, the sub-sample included people who were under 18 as far back as 1998. That meant that many of the people in the sub-sample would have been too old to receive puberty suppression by 2009. It is likely that some of them confused puberty suppression with cross sex hormones.
Use and Misuse of Convenience Samples
True random sampling can be difficult or impossible when dealing with a small group like the transgender population. Often the only way to conduct research is through a convenience sample collected by voluntary recruitment through internet sites or organization membership lists.
Convenience samples are not bad research in themselves. It all depends on how they are used. Lisa Littman's study of parent reports of rapid onset gender dysphoria was a convenience sample survey with participants recruited mainly through the website 4thwavenow.com. The study was strongly criticized both for its method of recruitment and the fact that it relied on parents rather than the children themselves. However, she pointed out that her methodology was consistent with the methodology used in many of the studies that were being relied on to support affirmative care.
Convenience samples can be used as preliminary research to develop a hypothesis which can then be tested using different methods. Sometimes the conclusions of a convenience sample study can be confirmed by independent studies of different samples. A trained researcher can read a study based on a convenience sample, recognize its limitations and use it appropriately.
The problems start when study results escape the lab and spread through the media. Political activists will latch onto any study which supports their position without paying attention to its limitations. (Sometimes those activists are the very researchers who did the study.) Journalists report the conclusions without asking hard questions, and bad policy follows.
The difference between a convenience sample study like Dr. Littman’s and some of the studies relied on by trans activists is not that one is better science than the other but how they have been used. The Littman study was intended as a basis for further research; in the meantime, its tentative conclusions were used to urge caution before committing teen girls to irreversible medical treatments. By contrast, Dr. Turban and his supporters have used his convenience sample research on conversion therapy and puberty blockers to press for legislative and policy changes on the basis of conclusions his research methods cannot support.
Loss to Follow Up
Any study that depends on following patients over a period of time is going to have some loss to follow up. This can happen for many reasons. People may die, move away, recover to the point where they no longer need a doctor or become disillusioned with the researchers.
In medical research generally, a loss to follow up rate of less than 5 percent will usually not affect a study, while a loss of greater than 20 percent threatens the validity of the results. One way of testing the effect of loss to follow up is to assume the worst case for the missing participants and see how it affects the results.
Studies of regret and detransition have reported loss to follow up rates of between 15 percent and 75 percent. For example, this study of post-operative regret for gender surgery found a regret rate of only 6 percent. However, only 37 percent of eligible patients responded, so the loss to follow up was 63 percent. If you apply a worst case test to these results, the true regret rate could be anywhere from 6 percent to 69 percent.
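Here is a back-of-the-envelope sketch of that worst case arithmetic (a stricter calculation would scale the reported rate by the response rate, but the conclusion is the same):

```python
# Worst case test for a regret study with heavy loss to follow up.
reported_regret = 0.06    # regret rate among the 37% who responded
lost_to_follow_up = 0.63  # eligible patients who never responded

# If none of the missing patients regretted, the rate stays around 6 percent;
# if all of them did, it could approach 6% + 63% = 69%.
best_case = reported_regret
worst_case = reported_regret + lost_to_follow_up

print(f"True regret rate: somewhere between {best_case:.0%} and {worst_case:.0%}")
```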
Survivorship Bias
Loss to follow up is often associated with survivorship bias. This is a tendency to focus on successes and ignore failures.
A classic case of survivorship bias occurred in World War II when the U.S. Navy was looking for ways to better protect its planes. When a plane returned to base, the maintenance crews recorded the locations of bullet holes. The Navy compiled this information and found that most bullet holes were found on the wings and fuselage. They sent this information to statistician Abraham Wald and asked him to calculate the most efficient way of distributing additional armour on planes.
Wald looked at the data and saw a problem. There were many bullet holes on the outer wings and tail but almost none on the engines, cockpit or fuel system. The explanation was that the maintenance crews could only examine planes that returned to base, not the planes that crashed. Wald adjusted his calculations to account for this bias and concluded that fewer bullet holes were observed in the engines and fuel systems because planes hit in those areas were much less likely to return to base.
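A small simulation (with invented survival odds) makes Wald’s point. Hits land uniformly across the aircraft, but because only survivors are inspected, the recorded data under-counts hits to the most vulnerable sections:

```python
# Toy simulation of survivorship bias; all probabilities are invented.
import random
from collections import Counter

random.seed(0)

SECTIONS = ["wings", "fuselage", "tail", "engine", "cockpit"]
SURVIVAL_GIVEN_HIT = {"wings": 0.90, "fuselage": 0.85, "tail": 0.90,
                      "engine": 0.30, "cockpit": 0.40}

recorded = Counter()
for _ in range(10_000):
    hit = random.choice(SECTIONS)                  # hits are uniform at random
    if random.random() < SURVIVAL_GIVEN_HIT[hit]:  # only survivors get inspected
        recorded[hit] += 1

# Engines and cockpits are hit just as often as anything else, but they
# barely appear in the data because planes hit there rarely make it home.
for section in SECTIONS:
    print(f"{section:>8}: {recorded[section]:4d} hits recorded")
```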
Survivorship bias is recognized as a factor in medical research and it is a major factor in gender medicine. There are many anecdotal reports of detransitioners who find it very painful to return to the gender clinic. This convenience sampled survey of 100 detransitioners by Lisa Littman found that only 24 percent had informed the doctor or clinic that assisted their transition that they detransitioned.
In gender medicine, the patients who are satisfied with their transition are the planes that return to base. They come back to the gender clinic for ongoing care and are more likely to take part in studies sponsored by a gender clinic or a transgender support group. Detransitioners are like the planes that crash. They do not return to the clinic so they cannot be counted or studied.
Follow the Science
The catchphrase “follow the science” shows up on both sides of the debate. Sometimes neither side realizes that following the science simply leads to more science. Scientific research gives a range of possible outcomes within a range of probability. Even these tentative conclusions can be upended with further research. For scientists, this is part of an ongoing search for knowledge.
But clinicians and policy makers need to make decisions now. Furthermore, there are questions that science cannot answer. Research studies may tell you that between 65 and 95 percent of gender dysphoric children will desist by puberty but they will not help decide what percentage of children it is acceptable to sterilize by mistake.
Scientific research does not dictate practical conclusions. It can be used to inform clinical or policy decisions but it needs to be applied within an ethical framework. When someone claims to follow the science, there are two questions you need to consider. First, have they stated the underlying scientific research fairly and accurately? Second, have they been transparent about the ethical and ideological basis on which they are applying it?
Cognitive Bias
In any discussion of gender (or anything else for that matter) you will eventually run across the term cognitive bias. It refers to a set of psychological concepts which help explain why people cling to a belief even in the face of overwhelming evidence to the contrary.
The problem with talking about cognitive bias is that you first need to produce the overwhelming evidence. Explaining why people may believe something is not the same as proving they are wrong. Discussions of cognitive bias can quickly descend into ad hominem or argument by personal attack.
Where you should keep cognitive bias in mind is when reading things that you agree with. The transgender activist community does not have a monopoly on intellectual dishonesty or bad science. A recent article published by the Heritage Foundation claims that increasing access to puberty blockers is associated with increased suicide rates. The study is filled with flawed methodology and statistical errors. It has been ridiculed on Twitter by informed commentators on all sides of this debate. However a lot of people who ought to know better have retweeted it.
Equivocation
In formal logic, equivocation means using a term with multiple meanings in different senses in different parts of an argument. The use of the term gender is a common example. A book or article may start out by drawing a clear distinction between sex as a biological category and gender as a social category, but then slide into using gender as a synonym for sex.
The motte and bailey argument is an example of the use of equivocation to deflect criticism. The term comes from a type of medieval castle. The bailey was a large area with stables, workshops and storehouses surrounded by a low wall; it was connected to the motte, a stronghold built on a raised mound. When marauders came, the garrison could evacuate the bailey and take refuge in the motte until the danger passed.
You can see the motte and bailey defence in the issue of informed consent in gender clinics. The motte is the academic writing on informed consent, which emphasizes that informed consent does not mean hormones on demand and that there is a continuing role for professional judgment. In the bailey of the actual clinics, however, you can get a referral letter or even a prescription after a 20 minute appointment.
Outright Lying
Deliberate and unequivocal falsehoods are actually quite rare in the gender debate, but they do occur. One of the most persistent is the claim that minors don’t get gender surgery. Keep a link to this article on “top surgery” for 13-year-old “trans boys” handy for when it comes up.
This particular issue would come up less often if critics of pediatric gender transition toned down their rhetoric a bit. Surgery on minors does happen, but it is still rare. The more important argument is that puberty suppression, cross sex hormones and even social transition are as irreversible as surgery and create the conditions which lead many young people to seek surgery when they have just turned 18 and are still not fully mature.
Flap-doodle: Send in the Clownfish
Flap-doodle is not a recognized term in formal logic but was coined by science commentator @lecanardnoir to define a style of argument used to support dubious scientific claims:
Introduce a complex tangential subject to your argument, such as quantum physics, neuroscience or genetics, that you are sure your opponent cannot spend the time unpicking, so that you can make wild, arbitrary claims about your own beliefs.
This line of argument crops up quite often in debates on the binary nature of sex. In fact, without flap-doodle arguments, there would be no debate on the binary nature of sex. The biological definition of sex is provided in this scientific statement by the Endocrine Society:
The classical biological definition of the 2 sexes is that females have ovaries and make larger female gametes (eggs), whereas males have testes and make smaller male gametes (sperm); the 2 gametes fertilize to form the zygote, which has the potential to become a new individual. The advantage of this simple definition is first that it can be applied universally to any species of sexually reproducing organism. Second, it is a bedrock concept of evolution, because selection of traits may differ in the 2 sexes. Thirdly, the definition can be extended to the ovaries and testes, and in this way the categories—female and male—can be applied also to individuals who have gonads but do not make gametes.
Articles which try to defend the idea of sex as a spectrum, like these in Scientific American and Psychology Today, carefully avoid this basic definition. Instead they leap from discussions of variations in hormone levels to the complexities of sexual determination. So called intersex conditions, which affect at most 0.02 percent of the population, receive undue attention. The clownfish, which can change from male to female, usually appears.
This line of argument resembles the Gish Gallop, named for Duane Gish, a prominent advocate of young earth creationism. In debates on evolution, Gish would reel off a stream of loosely connected talking points drawn from geology, biology, nuclear physics, genetics or anything else that came to hand, leaving his opponents overwhelmed. Young earth creationism was a belief confined to the fringes of fundamentalist churches, but sex denial has taken over mainstream publications and begun to infect university science departments.
Flap-doodle arguments are difficult for a non-expert to rebut because they rely on streams of obscure facts to create confusion around the central argument. Fortunately, there are dedicated scientists who will take the time to slap down flap-doodle whenever it rears its head. Biologist Colin Wright has published a detailed response to the main sex as a spectrum arguments. You can find a series of explainer videos (with links to peer reviewed sources) at the Paradox Institute.
The Appeal to Authority
The people who write gender flap-doodle are not stupid. You need to learn a lot of facts in order to twist them, and to get your stuff published you need fairly impressive credentials. Any particularly pointed challenge to a piece of gender woo woo from a lay person is likely to be met with an indignant response along the lines of, “How dare you challenge someone who holds a doctorate from an Ivy League university and publishes in peer reviewed journals?”
This line of argument is a fallacy with a fancy Latin name: the argumentum ad verecundiam, or the appeal to authority.
People with doctoral degrees from prestigious universities are generally very bright and were hardworking at least at some point in their careers. But they can still make mistakes, and if someone points out a specific mistake, the specific point needs to be answered.
A weak or fallacious argument from an expert does not get any better if it is endorsed by lots of other experts. This is the fallacy of argumentum ad numerum, or the appeal to popularity. When someone points out that gender affirming care has been endorsed by the American Psychological Association, the American Academy of Pediatrics, the Endocrine Society and many other medical organizations, it is still legitimate to ask whether any of these groups based their endorsement on a systematic evidence review.
The appeal to authority persists because it serves a social purpose. Questioning everything is good advice in the classroom but in day to day life there simply is not time. We need to be able to rely on expert advice without scrutinizing every detail.
Liberal society has developed a matrix of safeguards to ensure that expert advice is reliable, most of the time. Professionals are licensed and regulated by governing bodies. Scientific papers are subject to peer review. Academic tenure protects researchers from undue government and corporate pressure. The press watches for cases of abuse.
The rise of gender ideology has seen all of these safeguards fail simultaneously. It will take time to rebuild them and the loss of public trust will take even longer to undo.
Acknowledgments
Keeping up with the flood of misinformation on gender issues is challenging. These are some of the sources that I have found consistently useful: