**Citation:** Päll T, Luidalepp H, Tenson T, Maiväli Ü (2023) A field-wide evaluation of differential expression profiling by high-throughput sequencing reveals widespread bias. PLoS Biol 21(3): e3002007. https://doi.org/10.1371/journal.pbio.3002007

**Academic Editor:** Marcus Munafò, University of Bristol, UNITED KINGDOM

**Received:** July 29, 2022; **Accepted:** January 20, 2023; **Published:** March 2, 2023

**Copyright:** © 2023 Päll et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** The authors confirm that all data underlying the findings are fully available without restriction. The code to produce the raw dataset is available at the rstats-tartu/geo-htseq GitHub repo (https://github.com/rstats-tartu/geo-htseq). The raw dataset produced by the workflow is deposited in Zenodo (https://zenodo.org) with doi: 10.5281/zenodo.7529832 (https://doi.org/10.5281/zenodo.7529832). The code to produce the article’s figures and models is deposited at the rstats-tartu/geo-htseq-paper GitHub repo (https://github.com/rstats-tartu/geo-htseq-paper). Individual model objects are deposited in G-Node with doi: 10.12751/g-node.p34qyd (https://doi.org/10.12751/g-node.p34qyd). Code and workflow used to run and analyze RNA-seq simulations are deposited in Zenodo with doi: 10.5281/zenodo.4463804 (https://doi.org/10.5281/zenodo.4463804). Processed data, raw data, and the workflow of the RNA-seq simulations, with the input fasta file, are deposited in Zenodo with doi: 10.5281/zenodo.4463803 (https://doi.org/10.5281/zenodo.4463803).

**Funding:** This work was supported by the European Regional Development Fund through the Centre of Excellence in Molecular Cell Engineering (2014-2020.4.01.15-0013 to ÜM and TT) and by grants from the Estonian Research Council (PRG335 to ÜM and TT; PUT1580 to TP). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests:** The authors have declared that no competing interests exist.

**Abbreviations:** CPM, counts per million; DE, differential expression; FDR, false discovery rate; HT-seq, high-throughput sequencing

## Introduction

Over the past decade, a sense that there is a crisis in experimental science has increasingly permeated the thinking of methodologists, captains of industry, working scientists, and even the lay public [1–6]. This manifests in poor statistical power to find true effects [7,8], in poor reproducibility (defined as getting identical results when reanalyzing the original data by the original analytic workflow), and in poor replicability (defined as getting similar results after repeating the entire experiment) of the results [9]. The proposed causes behind the crisis include sloppy experimentation, selective publishing, perverse incentives, difficult-to-run experimental systems, insufficient sample sizes, overreliance on null hypothesis testing, and much-too-flexible analytic designs combined with hypothesis-free study of massively parallel measurements [10–15]. Although there have been attempts at assessing experimental quality by replication of experiments, prohibitive costs and theoretical shortcomings in analyzing concordance of experimental results have encumbered this approach in biomedicine [16–19]. However, results from a recent replication of 50 experiments from 23 high-profile preclinical cancer studies indicate that ca. 90% of effect sizes were overestimated and that over half of the published effects were either in the wrong direction or were wrongly assigned to the non-null class [20].

Another way to assess the large-scale quality of a science is to use surrogate measures for quality that can be obtained more easily than a full replication of a study. The most often used measure is technical reproducibility, which involves checking for code availability and running the original analysis code on the original data. Although the evidence base for reproducibility is still sketchy, it seems to be well below 50% in several fields of biomedicine [18]. However, as there are many reasons why a successful reproduction might not indicate good quality of the original study, or why an unsuccessful reproduction may not indicate bad quality of the original study, the reproducibility criterion is insufficient.

Another proxy for quality can be found in published *p*-values, specifically in the distribution of *p*-values [21]. In a pioneering work, Jager and Leek extracted ca. 5,000 statistically significant *p*-values from abstracts of major medical journals and pooled them to formally estimate, from the shape of the resulting *p*-value distribution, the science-wide false discovery rate (SWFDR) at 14% [22]. However, as this estimate rather implausibly presupposes that the original *p*-values were calculated uniformly correctly and that unbiased sets of significant *p*-values were obtained from the abstracts, they subsequently revised their estimate of the SWFDR upwards, as "likely not >50%" [23]. For observational medical studies, by a different methodology, a plausible estimate for the field-wide false discovery rate (FDR) was found to be somewhere between 55% and 85%, depending on the study type [24].

While our work uses published *p*-values as evidence for field-wide quality and presupposes access to unbiased full sets of unadjusted *p*-values, it does not pool the *p*-values across studies, nor does it assume that they were correctly calculated. In fact, we assume the opposite and conduct a study-by-study analysis of the quality of calculating *p*-values. This makes the quality of the *p*-value a proxy for the quality of the experiment and the scientific inferences based on these *p*-values. We do not see our estimate of the fraction of poorly calculated *p*-values as a formal quality metric but merely hope that by this measure we can shed some light on the overall quality of a field.

We consider the field of differential expression profiling studies using high-throughput sequencing (HT-seq DE, mostly RNA-seq) for 2 reasons. First, HT-seq has become the gold standard for whole-transcriptome gene expression quantification in research and clinical applications [25]. Secondly, due to the massively parallel testing of tens of thousands of features per experiment in individual studies, we can access study-wide lists of *p*-values. From the shapes of histograms of *p*-values, we can identify the experiments where *p*-values were calculated apparently correctly, and from these studies, we can estimate the study-wise relative frequencies of true null hypotheses (the *π*_{0}-s). Also, we believe that the very nature of the HT-seq DE field, where each experiment compares the expression levels of about 20,000 different features (e.g., RNA-s) on average, predicates that the quality of data analysis, and especially of statistical inference based on *p*-values, must play a decisive part in scientific inference. Simply put, one cannot analyze an HT-seq DE experiment intuitively, without resorting to formal statistical inference. Therefore, quality problems of statistical analysis would very likely directly and substantially impact the quality of the science. Thus, we use the quality of statistical analysis as a proxy for the quality of science, with the understanding that this proxy might work better for modern data-intensive fields, where a scientist's intuition has a rather more minor role to play.

## Results

### Data mining

We queried the NCBI GEO database for "expression profiling by high-throughput sequencing" (for the exact query string, see Methods), retrieving 43,610 datasets (GEO series) from 2006, when the first HT-seq dataset was submitted to GEO, to December 31, 2020. The yearly new HT-seq submissions increased from 1 in 2006 to 11,604 by 2020, making up 26.6% of all GEO submissions in 2020.

First, we filtered the GEO series containing supplementary processed data files. NCBI GEO database submissions follow MINSEQE guidelines [26]. Processed data are a required part of GEO submissions, defined as the data on which the conclusions in the related manuscript are based. The format of processed data files submitted to GEO is not standardized, but in the case of expression profiling, such data include, but are not limited to, quantitative data for features of interest, e.g., mRNA, in tabular format. Sequence read alignment files and coordinates (SAM, BAM, and BED) are not considered processed data by GEO.

According to our analysis, the 43,610 GEO series contained 84,036 supplementary files, including RAW.tar archives. After unpacking the RAW.tar files, we programmatically tried to import 647,092 files, resulting in 336,602 (52%) successfully imported files, while 252,685 (39%) files were not imported because they were either SAM, BAM, BED, or in other formats most likely not containing *p*-values. We did not import 57,805 (8.9%) files for various reasons, mainly because of text encoding issues and failure to identify column delimiters.

According to GEO submission requirements, the processed data files may contain raw counts of sequencing reads and/or normalized abundance measurements. Therefore, a valid processed data submission may not contain lists of *p*-values. Nonetheless, we identified *p*-values from 4,616 GEO series, from which we extracted 14,813 unique unadjusted *p*-value sets. While the mean number of *p*-value sets, each set corresponding to a separate experiment, per 4,616 GEO submissions was 3.21 (max 276), 46% of submissions contained a single *p*-value set, and 76% contained 1–3 *p*-value sets. For further analysis, we randomly selected 1 *p*-value set per GEO series.

### *P*-value histograms

We algorithmically classified the *p*-value histograms into 5 classes (see Methods for details and Fig 1A for representative examples) [27]. The "Uniform" class contains flat *p*-value histograms indicating no detectable true effects. The "Anti-Conservative" class contains otherwise flat histograms with a spike near zero. The "Conservative" class contains histograms with a distinct spike close to one. The "Bimodal" histograms have 2 peaks, one at either end. Finally, the class "Other" contains a panoply of malformed histogram shapes (humps in the middle, gradual increases towards one, spiky histograms, etc.). The "Uniform" and "Anti-Conservative" histograms are the theoretically expected shapes of *p*-value histograms.
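The idea behind such a classification can be sketched as a simple test of each histogram bin against its uniform expectation: bins that clearly exceed the count expected under uniformity are "peaks," and the class follows from where the peaks sit. The bin count, the z = 4 peak cutoff, and the decision order below are illustrative simplifications, not the authors' published algorithm.

```python
import numpy as np

def classify_pvalue_histogram(pvals, bins=20):
    """Classify a p-value set by the shape of its histogram.

    Bins whose counts clearly exceed the uniform expectation are
    treated as peaks; the class follows from where the peaks sit.
    The z = 4 cutoff is an illustrative choice.
    """
    counts, _ = np.histogram(np.asarray(pvals), bins=bins, range=(0.0, 1.0))
    expected = len(pvals) / bins
    thresh = expected + 4 * np.sqrt(expected)  # crude peak cutoff
    peaks = counts > thresh
    first, last, middle = peaks[0], peaks[-1], peaks[1:-1].any()
    if first and last:
        return "bimodal"
    if middle:
        return "other"
    if first:
        return "anti-conservative"
    if last:
        return "conservative"
    return "uniform"

rng = np.random.default_rng(7)
nulls = rng.uniform(size=9000)               # true nulls: uniform p-values
effects = rng.uniform(0.0, 0.01, size=1000)  # true effects: near-zero p-values
print(classify_pvalue_histogram(np.concatenate([nulls, effects])))
```

On a pure uniform sample the function returns "uniform"; adding a block of near-zero *p*-values produces the anti-conservative spike-at-zero shape.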

We found that overall, 25% of the histograms fall into the anti-conservative class, 9.5% were conservative, 26% bimodal, and 39% fell into the class "other" (Fig 1B). Only 17 of the 4,616 histograms were classified as "uniform." The median number of features in our sample was 20,954. Interestingly, there are fewer features in anti-conservative *p*-value histograms, as compared to histograms with all other shapes, suggesting different data preprocessing for datasets resulting in anti-conservative histograms (S1 Fig). Logistic regression reveals a clear trend towards an increasing proportion of anti-conservative histograms, starting from <10% in 2010 and surpassing 30% in 2020 (S2 Fig). Multinomial hierarchical logistic regression indicates that most differential expression (DE) analysis tools exhibit temporal increases in anti-conservative *p*-value histograms, except for cuffdiff, which shows the opposite trend, and clc genomics and deseq, where a clear trend could not be identified (Figs 2A and S3). The increase in the fraction of anti-conservative histograms is achieved by decreases mainly in the classes "other" and "bimodal," depending on the DE analysis tool.

This positive temporal trend in anti-conservative *p*-value histograms suggests improving quality of the HT-seq DE field. Somewhat surprisingly, Fig 2A also indicates that different DE analysis tools are associated with very different proportions of *p*-value histogram classes, suggesting that the quality of *p*-value calculation, and consequently the quality of scientific inferences based on the *p*-values, depends on the DE analysis tool. We further examined this conjecture in a simplified model, restricting our analysis to 2018 to 2020, the final years in our dataset (Fig 2B). As no single DE analysis tool dominates the field (the top 5 are DESeq2 28%, cuffdiff 27%, edgeR 14%, DESeq 8%, limma 2%; see S4 Fig for temporal trends), a situation where proportions of different *p*-value histogram classes do not substantially differ between analysis tools would indicate a lack of tool-generated bias. However, we found that all *p*-value histogram classes, except "uniform," which is essentially unpopulated, depend strongly on the DE analysis tool (Figs 2B, S3, and S5). This effect is quite extreme, extending from a nearly zero fraction in cuffdiff to about 0.75 in Sleuth. Using the whole dataset of 14,813 *p*-value histograms (as a check for robustness of results) or adjusting the analysis for GEO publication year, taxon (human, mouse, and pooled other), RNA source, or sequencing platform (as a check for possible confounding) does not change this conclusion (S5B–S5F Fig). The lack of confounding in our results allows a causal interpretation, indicating that DE analysis tools bias the analysis of HT-seq experiments, while the large DE analysis platform-dependent differences suggest a very substantial bias [28].

### The proportion of true null hypotheses

To further enquire into DE analysis tool-driven bias, we estimated from user-submitted *p*-values the fraction of true null effects (the *π*_{0}) for each HT-seq experiment. The *π*_{0} is a statistic calculated solely from the *p*-values, and it is routinely used as an intermediate quantity needed to fix the FDR at the desired level [29]. The quality of FDR control depends on the quality of estimation of the *π*_{0}, which in turn depends on the quality of the underlying *p*-values. Thus, the *π*_{0} can be seen as an estimate of the true proportion of true nulls or as a single-number summary of the *p*-value distribution, and in the latter capacity, it can be used as a quality check of the *p*-values used for FDR control of the experiment.
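To illustrate how a *π*_{0} estimate is derived from a *p*-value set alone, a single-lambda, Storey-type estimator can be sketched as follows. The `lam = 0.5` tuning value is a conventional default for this family of estimators, not the setting used in the paper's pipeline.

```python
import numpy as np

def estimate_pi0(pvals, lam=0.5):
    """Single-lambda Storey-type estimate of the true-null fraction.

    Above `lam`, the p-value density is dominated by true nulls
    (which are uniform), so counting p-values there and rescaling
    by the width of the interval estimates pi0.
    """
    pvals = np.asarray(pvals)
    return min(np.mean(pvals > lam) / (1.0 - lam), 1.0)

rng = np.random.default_rng(1)
# 70% true nulls, 30% true effects with near-zero p-values
pvals = np.concatenate([rng.uniform(size=7000), rng.uniform(0.0, 0.01, size=3000)])
print(estimate_pi0(pvals))  # close to the true value of 0.7
```

Because the estimator only counts *p*-values above `lam`, a malformed *p*-value distribution (conservative, bimodal, "other") feeds it biased input, which is why *π*_{0} was computed here only for anti-conservative and uniform sets.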

As non-anti-conservative sets of *p*-values (excepting the "uniform") already indicate low quality of both the *π*_{0} estimate and the subsequent FDR control [30], we only calculated the *π*_{0} for datasets with anti-conservative and uniform *p*-value distributions (*N* = 1,188). Nevertheless, the *π*_{0}-s show an extremely wide distribution, ranging from 0.999 to 0.02. Remarkably, 37% of the *π*_{0} values are smaller than 0.5, meaning that, according to the calculated *p*-values, in these experiments, over half of the features are estimated to change their expression levels upon experimental treatment (Fig 3A). Conversely, only 23% of *π*_{0}-s exceed 0.8, and 9.9% exceed 0.9. The peak of the *π*_{0} distribution is not near 1, as would be expected from experimental design considerations; instead, there is a broad peak between 0.5 and 0.8 (median and mean *π*_{0}-s are both at 0.59). Depending on the DE analysis tool, the mean *π*_{0}-s range over 20 percentage points, from about 0.45 to 0.65 (Fig 3B; see S6A Fig for all DE analysis tools). Using the whole dataset confirms the robustness of this analysis (S6B Fig).

Fig 3. Association of the proportion of true null effects (*π*_{0}) with the DE analysis tool.

(**A**) Histogram of *π*_{0} values estimated from anti-conservative and uniform *p*-value sets. *N* = 1,188. The data file is in S3 Data. (**B**) Robust linear model [pi0 ~ de_tool, beta likelihood] indicates an association of *π*_{0} with the DE analysis tool. Points denote best estimates for the mean *π*_{0}, and thick and thin lines denote 66% and 95% credible intervals, respectively. *N* = 1,188. The data file is in S4 Data. (**C**) Histogram of *π*_{0} values in GEO cancer studies compared to non-cancer studies. The data file is in S5 Data. (**D**) Histogram of *π*_{0} values in GEO transcription factor studies compared to non-TF studies. The data file is in S6 Data. The model object related to panel B can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/pi0_detool_sample.rds.

In terms of experimental design, to get a low *π*_{0}, an experiment needs to change the expression of most genes under study substantially. A likely source of such experiments would be comparisons of different cancer cell lines/tissues, where *π*_{0} = 0.4 could be considered a reasonable outcome [31]. We therefore compared the *π*_{0}-s coming from GEO HT-seq submissions related to the search terms "neoplasms" or "cancer" (for the exact query string, please see Methods) with all other non-cancer submissions. There is very little difference in the means and standard deviations of *π*_{0}-s for cancer and non-cancer experiments (0.58 (0.22) and 0.59 (0.24), respectively) (Fig 3C). Filtering by studies mentioning transcription factors led to very similar results (Fig 3D), suggesting that the wide dispersion of *π*_{0}-s is not caused by intentional experimental designs. Also, studies involving cancer and TFs resulted in anti-conservative or uniform *p*-value distributions with similar probabilities to non-cancer/non-TF studies (risk ratios with 95% CI are 0.95 (0.83; 1.07) and 0.99 (0.74; 1.28), respectively).

In addition, controlling for time, taxon, or sequencing platform did not substantially change the association of DE analysis tools with the *π*_{0}-s (S6C–S6F Fig). Recalculating the *π*_{0}-s with a global FDR algorithm [32] did not lead to a substantially different *π*_{0} distribution, although there appears to be a slight directional shift: the mean *π*_{0} moves from 0.58 with the local FDR method to 0.54 with the global FDR method (S7 Fig). As there is a strong association of both the *π*_{0} and the proportion of anti-conservative *p*-value histograms with the DE analysis tool (Figs 2 and 3), we further checked for, and did not see, similar associations with variables from raw sequence metadata, such as the sequencing platform, library preparation strategies, library sequencing strategies, library selection strategies, and library layout (single or paired) (S8–S15 Fig). These results support the conjecture of the specificity of the associations with DE analysis tools.

### The sample size distribution of DE HT-seq experiments indicates persistently low power

We assigned sample sizes to 2,393 GEO submissions (see Methods for details). Of these, 91% had sample sizes of 4 or less, 25% had an N of just 2, and 12% used a single sample from which to calculate *p*-values, thereby completely ignoring biological variation (Fig 4A). Only 1% of experiments had sample sizes over 10.

Are the observed sample sizes sufficient to bring reasonable power to DE RNA-seq experiments? Simulations on idealized data with natural variation and effect sizes give *N* = 2 experiments 30% to 60% power and *N* = 3 experiments 50% to 70% power for human/mouse experiments, where higher *π*_{0}-s are associated with lower power (Fig 4B). The power of real-life experiments is likely to be even lower, due to the imperfect fit between the data and the statistical model used to analyze it [34]. Thus, assuming that realistic levels of biological variation are captured in the samples, most DE RNA-seq experiments conducted in eukaryotic systems must be underpowered. This conclusion is consistent with the methodological literature (see Discussion).

Fig 4. The sample sizes of HT-seq DE experiments indicate low power.

(**A**) Histogram of 2,393 sample sizes. The data file is in S7 Data. (**B**) Statistical power simulations using different *π*_{0} settings (shown as shades of blue coloring) and 2 different biological variation settings ("Gilad" corresponds to human liver samples and "Bottomly" to inbred mice; [33]). The data file is in S8 Data.
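The logic of such a power simulation can be sketched with a toy Monte Carlo calculation on a single "gene." The effect size, between-replicate standard deviation, and hard-coded critical values below are illustrative assumptions, not the paper's simulation settings (which used realistic RNA-seq variation from the Gilad and Bottomly datasets).

```python
import math
import numpy as np

# Two-sided 5% critical values of Student's t for df = 2n - 2
TCRIT = {2: 4.303, 4: 2.776, 8: 2.306}

def simulated_power(n, effect=1.0, sd=0.7, n_sim=2000, seed=0):
    """Monte Carlo power of a pooled two-sample t-test.

    Each simulated gene is normal on the log scale with
    between-replicate standard deviation `sd` and a true log
    fold change `effect`; both values are illustrative.
    """
    rng = np.random.default_rng(seed)
    df = 2 * n - 2
    hits = 0
    for _ in range(n_sim):
        a = rng.normal(0.0, sd, size=n)
        b = rng.normal(effect, sd, size=n)
        sp = math.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)  # pooled sd
        t = (b.mean() - a.mean()) / (sp * math.sqrt(2 / n))
        hits += abs(t) > TCRIT[df]
    return hits / n_sim

for n in (2, 3, 5):
    print(n, simulated_power(n))  # power rises with sample size
```

Even in this idealized setting, power at *n* = 2 is severely limited by the *t* critical value at 2 degrees of freedom, matching the qualitative conclusion that very small HT-seq DE experiments are underpowered.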

### A low sample size lowers the probability of obtaining anti-conservative *p*-value distributions

There’s a clear development, proven in Fig 5A, whereby very small samples of 5 or fewer result in progressively smaller fractions of anti-conservative *p*-value distributions. Whereas the chance of acquiring an anti-conservative *p*-value distribution is just round 0.1, if the pattern measurement is one, it will increase to 0.3 for pattern measurement 3 and additional to about 0.5 for pattern measurement 5. We might detect no additional enhance on this chance as pattern measurement grew additional (however word that the overwhelming majority of experiments have pattern sizes of 4 or much less). As our simulations point out that fairly powered (energy is round 0.7 to 0.8) experiments begin round pattern sizes 4 to five for multicellular programs, this end result means that the prevailing DE evaluation algorithms may match poorly on information obtained from underpowered experiments.

Experiments that did not result in anti-conservative *p*-value distributions were less likely to have samples larger than 3 and more likely to have sample sizes of 2 and 1 than the experiments that led to anti-conservative *p*-value distributions (Fig 5B). This effect was especially pronounced for experiments with conservative *p*-value distributions, of which 52% had sample sizes of 1 or 2, as compared with 26% for anti-conservative *p*-value distributions. Overall, the experiments resulting in anti-conservative *p*-value distributions have a slightly larger mean sample size (3.5 versus 2.7; *p* = 10^{−9}).

These results indicate a clear causal link between sample size and the technical quality of most current workflows of DE analysis. Thus, low power not only leads to missed discoveries and overestimated DE-s, as predicted by theory [35], but also seems to be damaging to the algorithms that calculate these DE-s.

### Lack of association of sample size with the proportion of true nulls

According to statistical theory, the power of an experiment depends linearly on the square root of the sample size. An underpowered experiment results in a smaller number of near-zero *p*-values, leading to overestimation of the *π*_{0} at small N-s, especially at *N* = 2, as confirmed by simulation (Fig 6A). Thus, reducing the power will eventually convert an anti-conservative *p*-value distribution into a uniform one. However, we not only have very few uniform *p*-value distributions in our empirical data but also few near-one estimated *π*_{0}-s that could be construed as manifestations of low power (Fig 3A). We looked into this further by plotting the *π*_{0} values against sample size (Fig 6B). Instead of the expected decrease of the mean *π*_{0} upon increasing the sample size, we see a slight increase of *π*_{0}-s at sample sizes >5, while the overall correlation is negligible (r = 0.06; 95% CI [0.007, 0.12]). Thus, there does not appear to be a sample size-dependent bias in *π*_{0}-s, as estimated from the GEO-submitted *p*-value distributions. As the true power of most small-sample DE RNA-seq experiments is expected to be low, this result leads us to question further the statistical adequacy of the underlying *p*-values, of which the *π*_{0}-s are but single-number summaries. These results show that the statistical methods used in HT-seq DE analysis tend to produce highly suspect output even when the *p*-value distribution is anti-conservative.
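The expected *π*_{0}-inflating effect of small samples can be reproduced in a toy simulation: with the true null fraction fixed, an underpowered *n* = 2 design pushes many true-effect *p*-values above the estimation threshold and inflates the estimate relative to *n* = 8. The per-gene z-test model, the null fraction of 0.8, and all other settings below are illustrative, not the parameters behind Fig 6A.

```python
import math
import numpy as np

def norm_sf(z):
    """Standard normal survival function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def simulated_pi0_hat(n, n_genes=10000, pi0_true=0.8, effect=1.0, sd=0.7,
                      seed=0):
    """Storey-type pi0 estimate from a simulated n-vs-n experiment.

    Toy model: per-gene z-test with known sd; a pi0_true fraction of
    genes are true nulls, the rest shift by `effect`.
    """
    rng = np.random.default_rng(seed)
    n_null = int(pi0_true * n_genes)
    se = sd * math.sqrt(2 / n)
    pvals = np.empty(n_genes)
    for g in range(n_genes):
        mu = 0.0 if g < n_null else effect
        diff = rng.normal(mu, sd, n).mean() - rng.normal(0.0, sd, n).mean()
        pvals[g] = 2 * norm_sf(abs(diff) / se)
    return np.mean(pvals > 0.5) / 0.5  # single-lambda estimate, lam = 0.5

# With the true null fraction at 0.8, the underpowered n = 2 design
# inflates the estimate relative to n = 8
print(simulated_pi0_hat(2), simulated_pi0_hat(8))
```

The empirical GEO data show no such sample-size dependence of the *π*_{0}-s, which is what makes the observed distribution statistically suspect.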

Fig 6. *π*_{0}-s calculated from anti-conservative *p*-value sets do not behave according to statistical theory.

(**A**) Calculated *π*_{0}-s (on the y-axis) from simulated data vs. given "true" proportions of DE features (on the x-axes). Sample sizes are indicated in color code. The dotted line shows the perfect correspondence between the given *π*_{0}-s and the estimated *π*_{0}. The data file is in S11 Data. (**B**) Dependence of the mean *π*_{0} on the binned sample sizes with 95% CI. Robust linear model [pi0 ~ N], Student's likelihood. Points denote best estimates for the mean, and thick and thin lines denote 66% and 95% credible intervals, respectively. The data file is in S12 Data. The model object related to panel B can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/pi0%20~%20N.rds.

### Curing of *p*-value histograms by removing low-count features

We observed a slight reduction in the mean number of *p*-values per GEO experiment for anti-conservative and uniform histograms compared to the other *p*-value histogram classes, suggesting that *p*-value sets with anti-conservative and uniform shapes are more likely to have been pre-filtered, or to have been more extensively pre-filtered (S1 Fig). Accordingly, we speculated that by further filtering out features with low counts, we could convert some of the untoward *p*-value histograms into anti-conservative or uniform types. Our goal was not to provide optimal interventions for individual datasets, which would require tailoring the filtering algorithm to the requirements of a particular experiment. We merely aim to provide proof of principle that with a simple filtering approach, we can increase the proportion of anti-conservative *p*-value sets and/or reduce the dependence of results on the analysis platform. Therefore, we applied filtering to 3,426 *p*-value sets where we could identify gene expression values (see Methods for details on selecting filtering thresholds).
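A minimal sketch of such a low-count filter, using a conventional counts-per-million (CPM) rule with illustrative cutoffs rather than the thresholds described in Methods:

```python
import numpy as np

def filter_low_counts(counts, pvals, min_cpm=1.0, min_samples=2):
    """Keep features reaching `min_cpm` counts per million in at
    least `min_samples` samples; drop the rest and their p-values.
    Both cutoffs are conventional defaults, not the paper's thresholds.
    """
    counts = np.asarray(counts, dtype=float)   # features x samples
    cpm = counts / counts.sum(axis=0) * 1e6    # per-sample library scaling
    keep = (cpm >= min_cpm).sum(axis=1) >= min_samples
    return np.asarray(pvals)[keep], keep

# Hypothetical 4-feature, 3-sample count matrix; the second feature
# is expressed at (nearly) zero counts
counts = np.array([[350_000, 360_000, 340_000],
                   [      0,       1,       0],
                   [500_000, 480_000, 510_000],
                   [150_000, 160_000, 150_000]])
pvals = np.array([0.01, 0.70, 0.03, 0.40])
kept_p, keep = filter_low_counts(counts, pvals)
print(keep)  # only the low-count second feature is dropped
```

Removing such features excises the noisy, often near-one *p*-values that low counts generate, which is what reshapes "bimodal" and "other" histograms towards the anti-conservative form.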

We found that overall we could increase the number of anti-conservative *p*-value histograms 2.4-fold, from 844 (24.6%) to 2,022 (59.0%), and the number of uniform histograms 2.5-fold, from 8 (0.23%) to 20 (0.6%) (Fig 7A). For all analysis platforms, most rescued *p*-value distributions came from the classes "bimodal" and "other," while almost no rescue was detected from conservative histograms (Fig 7B–7F). After removing low-count features, the proportion of anti-conservative *p*-value histograms increased for all analysis platforms, with the largest effects observed for cuffdiff and deseq, which presented the lowest pre-rescue fractions of anti-conservative *p*-value histograms (Fig 7G and 7H). Nevertheless, substantial differences between analysis platforms remain, indicating that the removal of low-count features, while generally beneficial, was insufficient to entirely remove the sources of bias originating from the analysis platform. Also, the *π*_{0}-s calculated from the rescued anti-conservative *p*-value sets have very similar distributions compared to *π*_{0}-s from the pre-rescue anti-conservative *p*-value sets, and concomitantly a very similar dependence on the analysis platform (Fig 7I and 7J).

Fig 7. Removal of low-count features results in an increasing proportion of anti-conservative *p*-value histograms.

**(A–F)** Sankey charts of the transformation of *p*-value histogram shape. Ribbon size is linearly proportional to the number of *p*-value sets that change their distributional class. Only the 3,426 experiments that could be subjected to this treatment are depicted. (**A**) Full data, *N* = 3,426. (**B**) The subset where the *p*-values were calculated with cuffdiff, *N* = 1,116. (**C**) The subset where the *p*-values were calculated with DESeq, *N* = 252. (**D**) The subset where the *p*-values were calculated with DESeq2, *N* = 1,114. (**E**) The subset where the *p*-values were calculated with edgeR, *N* = 515. (**F**) The subset where the *p*-values were calculated with limma, *N* = 73. (**G**) Posterior summaries of anti-conservative *p*-value histogram proportions in raw and filtered *p*-value sets. Filtered *p*-value data are from a Bernoulli model [anticons ~ de_tool], *N* = 3,426. The data files are in S13 Data and in S14 Data (for raw data). (**H**) Effect sizes, in percentage points, of low-count feature filtering on the proportion of anti-conservative *p*-value histograms. The data files are in S13 Data and in S14 Data (for raw data). (**I**) Posterior summaries of *π*_{0} values of *p*-value histograms in raw and filtered *p*-value sets. Filtered *p*-value data are from the beta model [pi0 ~ de_tool], *N* = 2,042. The data files are in S15 Data and in S16 Data (for raw data). (**J**) Effect sizes, in *π*_{0} units (percentage points), of low-count feature filtering on *π*_{0}. The data files are in S15 Data and in S16 Data (for raw data).
The model object related to filtered *p*-value sets in panel G can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/anticons_detool_filtered.rds. The model object related to filtered *p*-value sets in panel I can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/pi0_detool_full_data_filtered.rds. See S16 Fig for all platforms.

## Discussion

In this work, we assess the quality of differential expression analysis by HT-seq based on a large, unbiased NCBI GEO dataset. Our goal was to study real-world statistical inferences made by working scientists. Thus, we study how the experimental design choices and analytic choices of scientists affect the quality of their statistical inferences, with the understanding that in a field where each experiment encompasses ca. 20,000 parallel measurements of DE on average, the quality of statistical inference is highly relevant to the quality of the scientific inference. To the best of our knowledge, this is the first large-scale study to offer quantitative insight into the general quality of experimentation and data analysis of a large field of biomedical science.

We show that:

- Overall, three-quarters of HT-seq DE experiments result in *p*-value distributions indicating that the assumptions behind the DE tests have not been met. However, for many experiments, a simple exclusion of low-count features rescues the *p*-value distribution.
- Very few experiments result in uniform *p*-value distributions, which would indicate <100 DE features.
- The sample sizes of the vast majority of HT-seq DE experiments are inconsistent with reasonably high statistical power to detect true effects.
- Nevertheless, the distribution of *π*_{0}-s (the fraction of true null hypotheses in an experiment, estimated solely from the *p*-value distributions of experiments presenting anti-conservative *p*-value distributions) peaks at around 0.5, as if in many experiments most genes were DE. Furthermore, the estimated *π*_{0}-s do not correlate with the sample sizes of the experiments, indicating that the underlying *p*-value sets are problematic for controlling FDR.
- The proportion of anti-conservative *p*-value distributions and the values of *π*_{0}-s are strongly associated with the DE analysis platform.

These results show not only that most *p*-value sets are statistically suspect (as shown by the preponderance of untoward *p*-value distributions), but also that even the well-behaved anti-conservative *p*-value sets tend to lead to statistical inferences that are incompatible with the observed low sample sizes and, therefore, low power. Furthermore, the observed association of both the *p*-value distributional classes and the *π*_{0}-s of anti-conservative *p*-value sets with data analysis platforms strongly indicates that DE analysis results in the literature are substantially biased by the analysis platform used.

From our meta-science study, we can conclude that the actual use of statistical tools in the HT-seq DE field is inconsistent with a high quality of statistical inferences. While our results clearly show the pervasiveness of the problem, they by themselves cannot answer the question: how much does it matter downstream for the quality of scientific inferences? However, recent methodological work on individual workflows allows us to offer an answer: it matters a lot. Specifically, from simulation studies of current workflows, it is now becoming clear that all popular *p*-value-based methods that can be used with small samples often fail in the one thing they were really created for, FDR control [30,34,36]. Our study sees the results of this widespread failure to control FDR in the high prevalence of low *π*_{0} estimates, whose lack of dependence on sample size defies the logic of statistical inference, thereby confirming that the poor FDR control observed in simulations is massively carried over into the results of actual HT-seq DE data analysis. Together, our study and the aforementioned simulation studies indicate that the field is largely built on analyzing low-power experiments, which are unlikely to identify actual effects, but still tend to overestimate the effect sizes of both true and false DEs among any statistically significant results [35], while presenting an unknowable number of false discoveries as statistically significant. It is hard to imagine a worse state of affairs, or that this would not substantially affect the quality of the associated scientific conclusions.

In the following, we will further discuss specific aspects of our work.

### The meaning of the *p*-value distribution

The first thing to notice about the *p*-value distributional classes is that there are extremely few uniform *p*-value distributions, suggesting relatively few actual effects. This unexpected result is made even more surprising by the low statistical power (<40%) of most real-world RNA-seq DE experiments, which should increase the likelihood of encountering uniform *p*-value distributions [37]. As a technical comment, it should be noted that the class we assign to a *p*-value histogram depends on an arbitrarily set bin size. Our use of 40 bins leads to histograms where an experiment with up to around 100 true effects out of 20,000 features could well yield a uniform histogram because of swamping of the lowermost bin (*p* < 0.025) with *p*-values emanating from true null effects (S17 Fig). If the *p*-values were calculated correctly, then the lack of uniform *p*-value distributions would indicate that (i) there are almost no experiments submitted to GEO where the experimental treatment led to only a few DE genes and (ii) the actual power of GEO-submitted experiments is not low at all. We find both possibilities hard to believe.

While there is a positive temporal trend for an increasing fraction of anti-conservative *p*-value sets, overall, a substantial majority of them fall into shapes indicating that the assumptions behind the statistical tests that produced these *p*-values have not been met. Expressly, *p*-value distributions that are not anti-conservative/uniform relinquish statistical control of the experiment over type I errors, on the presence of which later *p*-value adjustments are predicated [21]. Consequently, such *p*-value sets have effectively relinquished their statistical meaning and are thus problematic for scientific interpretation and further analysis, including FDR/q-value calculations. Specifically, conservative *p*-value histograms are thought to be caused by small samples, indicative of reduced power and overly conservative control of FDR [30]. The conjecture of low power is supported by our data in that experiments with conservative *p*-value distributions are the most likely to have sample sizes of 1 or 2 (Fig 5B).

In contrast, anti-conservative *p*-value distributions, which have more small *p*-values than expected and thus led to reduced *π*_{0} estimates in our analysis, tend to result in loss of FDR control [30]. In general, even a small deviation from uniformity in the *p*-value distribution of true nulls is expected to throw off the overall FDR control, which, alas, is the main reason for using these methods in the statistical analysis of DE [30]. In fact, without FDR control, any statistical method that converts continuous effects into statistical significance by binary decisions can do more harm than good [11].

Currently, most DE analysis platforms use parametric tests accompanied by GLM regression. However, it has recently been shown for the often-used parametric tests that their FDR control can be exceedingly poor, while the nonparametric Wilcoxon rank-sum test achieves better results [34]. Furthermore, it is becoming clear that data normalization can introduce bias into HT-seq DE analysis, which cannot be corrected by many of the currently widely used methods [38,39]. Indeed, widely used DE analysis tools rely on RNA-seq data preprocessing strategies that are vulnerable to situations where a large fraction of features change expression and different samples exhibit differing total RNA concentrations [25,40].

In general, the analysis tools used for differential expression testing allow for a plethora of choices, including distributional models, data transformations, and basic analytic strategies, which can lead to different outcomes through different trade-offs [41–43]. Our finding of a high proportion of unruly *p*-value distributions suggests that the wide variety of individual data preprocessing/DE testing workflows, in their actual use by working scientists, results in an overall poor outcome. Although this situation has been steadily improving over the past decade, at least for limma, edgeR, and DESeq2 users, there remains substantial room for further improvement.

### Possible causes for analysis platform-driven bias

What could cause the systematic differences between analysis platforms that we see in *p*-value distributions and *π*_{0} values? The many points of divergence in the analytic pathway include raw sequence preprocessing, aligning sequences to a reference sequence, count normalization, and differential expression testing [42,44]. Our inability to abolish analysis platform-driven bias by removal of low-count features suggests that the pre-filtering of data is not a major source of this bias. Although in principle any tool can be used in many workflows, different tools tend to offer their users different suggested workflows and different amounts of automation for data preprocessing and analysis. For example, Cuffdiff almost entirely automates the analysis, including the removal of biases [45]; the DESeq2 workflow suggests some preprocessing steps to speed up computations but uses automation to remove biases from input data [46]; whereas edgeR [47] and limma-voom require more interactive execution of separate preprocessing and analysis steps [48,49]. We speculate that the popularity of Cuffdiff and DESeq2 partly lies in their automation, as the user is largely relieved from decision-making. However, we found that cuffdiff is associated with the smallest proportion of anti-conservative *p*-value histograms, whereas limma and edgeR, with their more hands-on approach, are associated with the highest proportions of anti-conservative histograms. Interestingly, limma and edgeR use very different distributional models for DE testing, supporting the notion that it may be data preprocessing, rather than the statistical test, that has the most impact on the success of DE analysis [25]. However, limma and edgeR were not the top performers on the *π*_{0} metric, where the best performance is associated with Cuffdiff, whose ability to produce anti-conservative *p*-value distributions was the least impressive (note that the *π*_{0}-s were calculated only from the experiments with well-behaved *p*-values).

### Power considerations

The prevalent sample sizes of 2 to 3 in published HT-seq DE experiments are incompatible with our simulations and with the literature in terms of providing enough power to the experiment to reliably find most DE genes. For example, with the currently favored sample size of 3, for most genes in isogenic yeast, effect sizes of at least 4-fold seem to be required for successful analysis, and the overall minimal acceptable sample size may be from 4 to 6 [37,50]. In animals, a reasonable sample size is expected to be considerably larger, over 10 for most genes that are not highly expressed, and for cancer samples it seems to be well over 20 [37,43,51–53]. This raises the possibility that almost all experiments that should have resulted in uniform distributions were somehow shifted into other distributional classes. The low power to detect DE of most genes also suggests that excluding most low-count genes (which means most genes) should be encouraged, because it would increase the power to detect the highly expressed genes, while DE of lower-expressed genes would have been missed anyway. Of course, when the scientific goal is to study the DE of genes that are not highly expressed, there is no real alternative to substantially increasing the sample size.

### Estimated *π*_{0}-s suggest experimental design problems

The DE analysis tool-specific means of *π*_{0} values range from 0.5 (tool "unknown") to 0.7 (tool "cuffdiff"), showing that, by this criterion, in an average RNA-seq experiment about half of the cellular RNAs are expected to change their expression levels (Fig 3B). In principle, a low *π*_{0} could reflect true differential expression as a causal response to experimental treatment, or it could be an artefact of suboptimal data analysis. To further investigate this, we separately analyzed 2 cases of studies related either to cancer or to transcription factors, assuming that the available *p*-value sets reflect situations where DE is profiled in cancer cells/tissues or after transcription factor interrogation, respectively. Because of large-scale rearrangements in cancer genomes and the ability of many TFs to change the expression of many genes, these studies are expected to lead to lower *π*_{0}-s, perhaps in the 0.4 range [31]. However, in both cases, we observed essentially unchanged *π*_{0} distributions, suggesting that the low *π*_{0}-s in our dataset are indicative of problematic analytic workflows. This conclusion is further strengthened by the observed lack of negative correlation between *π*_{0} and sample size, which is a proxy of statistical power.

It should be added that, as most data normalization workflows assume that most genes are not DE (and that total RNA concentrations are stable across experimental conditions), any study that results in a low *π*_{0} should try to explicitly address this fact in the experimental setup, data analysis, and interpretation of results. It has been argued that most genome-wide DE studies ever conducted, including by HT-seq, have used experimental designs that would make it impossible to untangle such global effects, at least in a quantitatively accurate way [54,55]. The problem lies in the use of internal standards in normalizing the counts for different RNAs, which leads to wrong interpretations if most genes change expression in one direction. To overcome this problem, one could use spike-in RNA standards and compositional normalization [40]. However, spike-in normalization requires great care to work correctly in such extreme cases [39,56], and it is rarely used outside single-cell RNA-seq [40]. Thus, it seems likely that many or most of the low-*π*_{0} experiments represent technical failures, most likely during data normalization.

### Considerations for improvement of best practices

Currently, at least 29 different DE RNA-seq analysis platforms exist, which, among other differences, use 12 different distributional models for DE testing [50], while it is becoming apparent that all popular distributional models have trouble achieving promised FDR levels [34]. At a minimum, a winnowing of recommended analytic choices is needed. Nevertheless, the design and analysis of a particular DE RNA-seq experiment must be mindful of several factors, including (i) the study organism and its genetic background, pertaining to biological variation and thus sample size; (ii) the expected *π*_{0}, pertaining to experimental design, like the use of spike-in probes, and data analysis; (iii) the number of genes of interest, pertaining to sample size through the multiple testing problem; (iv) the expected expression levels and biological variation of genes of interest, and whether DE of gene isoforms is of interest, pertaining to sample size and to sequencing depth; and (v) the structure of the experiment, pertaining to analysis through GLM (is it multi-group, multi-center, multi-experimenter?). As no analytic workflow has been shown to systematically outperform the others [25,41–43], foolproof best practices are still out of reach.

Another class of classical nonparametric *p*-value calculation methods, of which the Wilcoxon rank-sum test is the most popular, is free of distributional assumptions and has good control of FDR, but these methods need sample sizes of at least 10 to achieve good power [34]. We note that such sample sizes are scarce in the HT-seq DE field. Recently, *p*-value-free approaches to FDR control that do not depend on distributional models have been proposed that look promising even in small samples [36,57]. These methods would, of course, transcend any problems caused by *p*-values (as well as the diagnostic value of the *p*-value distribution), except for the more general disease of effect size overestimation at low power, which comes along with any method that dichotomizes continuous data into "discoveries" and the rest [11].

While our results leave little doubt about the pervasiveness of problems in the HT-seq DE field, being meta-scientific by nature, they can offer relatively little in terms of suggestions for specific improvements to the various workflows used in the field. Although analyzing *p*-value histograms and the *π*_{0}-s derived from them, and more stringently excluding low-count features, should make their way into every HT-seq DE analysis, we have no reason to think that these steps alone will cure the field of its ills. Also, while removing low-expressed genes before model fitting in DE analysis can have a substantial positive effect on the sensitivity of detection of differentially expressed genes [58,59], this comes at the cost of excluding about half the genes from analysis as lowly expressed [60]. While there is no single best way to exclude features from analysis, an adaptive filtering procedure has recently been proposed [61]. In deciding how many low-count features to exclude in a given study, both scientific and statistical considerations should play a part. In particular, quantification of differential expression of low-expressed genes appears to be unreliable with current DE analysis tools [62], and the sample size needed for accurate measurement of DE has a strong inverse relationship with the expression level of the gene [43].

Increasing power by increasing the sample sizes, while designing the experiments to capture the full extent of relevant biological variation in those samples, would be another obvious, if expensive, recommendation. In terms of DE analysis, a major danger seems to be overly reducing the within-data variation during data preprocessing, so that biological variation that may be present in the raw data is lost during the analysis.

In conclusion, we do not think there is an obvious single fix to the problems afflicting a large field with quite heterogeneous scientific questions and corresponding experimental designs, and it may be that the methodology that works for HT-seq DE still lies in the future. We do not think it impossible that fixing the RNA-seq field requires, as a first step, the development of cheaper ways of conducting experiments, which would make well-powered experiments practically feasible, which in turn could lead to the development of analytic workflows that work.

## Methods

### NCBI GEO database query and supplementary files

NCBI GEO database queries were performed using the Bio.Entrez Python package, sending requests to the NCBI Entrez public API. The exact query string to retrieve GEO HT-seq datasets was "expression profiling by high throughput sequencing[DataSet Type] AND ("2000-01-01"[PDAT] : "2020-12-31"[PDAT])." Accession numbers of cancer-related datasets were identified by amending the original query string with "AND ("neoplasms"[MeSH Terms] OR cancer[All Fields])." FTP links from GEO dataset document summaries were used to download supplementary file names. Supplementary file names were filtered for downloading based on file extensions, keeping file names with "tab," "xlsx," "diff," "tsv," "xls," "csv," "txt," "rtf," and "tar" extensions. We dropped the file names where we did not expect to find *p*-values, using the regular expression "filelist.txt|raw.tar$|readme|csfasta|(big)?wig|bed(graph)?|(broad_)?lincs."
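The query construction and file-name filtering described above can be sketched as follows. This is a minimal sketch: the function and constant names are ours, not part of the published workflow (which lives in the rstats-tartu/geo-htseq repo and sends the query via Bio.Entrez):

```python
import re

# Allow-list of supplementary-file extensions kept for download.
KEEP_EXTENSIONS = {"tab", "xlsx", "diff", "tsv", "xls", "csv", "txt", "rtf", "tar"}

# File names not expected to contain p-values (regex from the Methods text).
DROP_PATTERN = re.compile(
    r"filelist.txt|raw.tar$|readme|csfasta|(big)?wig|bed(graph)?|(broad_)?lincs"
)

def build_query(cancer_only: bool = False) -> str:
    """Assemble the Entrez query string for GEO HT-seq datasets."""
    query = ('expression profiling by high throughput sequencing[DataSet Type] '
             'AND ("2000-01-01"[PDAT] : "2020-12-31"[PDAT])')
    if cancer_only:
        query += ' AND ("neoplasms"[MeSH Terms] OR cancer[All Fields])'
    return query

def keep_file(name: str) -> bool:
    """Keep a supplementary file name if its extension is on the allow-list
    and it does not match the drop pattern."""
    lower = name.lower()
    ext = lower.rsplit(".", 1)[-1]
    return ext in KEEP_EXTENSIONS and not DROP_PATTERN.search(lower)
```

For example, `keep_file("GSE1_counts.tsv")` keeps a counts table, while `GSE1_RAW.tar` archives and `filelist.txt` indexes are dropped.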

### NCBI supplementary file processing

Downloaded files were imported using the Python pandas package and searched for unadjusted *p*-value sets. Unadjusted *p*-value sets and summarized expression levels of associated genomic features were identified using column names. *P*-value columns in imported tables were identified by the regular expression "p[^a-zA-Z]{0,4}val"; among these, adjusted *p*-value sets were identified using the regular expression "adj|fdr|corr|thresh" and omitted from further analysis. We algorithmically tested the quality of the identified *p*-value sets and removed from further analysis apparently truncated or right-skewed sets, *p*-value sets that were not in the 0 to 1 range, and *p*-value sets that consisted only of NaN values. Columns with expression levels of genomic features were identified using the following regular expressions: "basemean," "value," "fpkm," "logcpm," "rpkm," "aveexpr." Where expression level data were present, raw *p*-values were further filtered to remove low-expression features using the following thresholds: basemean = 10, logcpm = 1, rpkm = 1, fpkm = 1, aveexpr = 3.32. Basemean is the mean of library-size-normalized counts over all samples, logcpm is the mean log2 counts per million, rpkm/fpkm is reads/fragments per kilobase of transcript length per million reads, and aveexpr is the average expression across all samples in log2 CPM units, where CPM is counts per million. Row means were calculated when there were multiple expression level columns (e.g., for each contrast or sample) in the table. Filtered *p*-value sets were stored and analyzed separately.
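The column-name heuristics above can be illustrated with the regular expressions quoted in the text (a sketch; the helper name is ours):

```python
import re

# Columns that look like p-values: "p", up to 4 non-letter characters, "val".
PVAL_RE = re.compile(r"p[^a-zA-Z]{0,4}val", re.IGNORECASE)

# Among those, columns that look like *adjusted* p-values are dropped.
ADJUSTED_RE = re.compile(r"adj|fdr|corr|thresh", re.IGNORECASE)

def unadjusted_pvalue_columns(columns):
    """Return column names that look like raw (unadjusted) p-value sets."""
    pval_cols = [c for c in columns if PVAL_RE.search(c)]
    return [c for c in pval_cols if not ADJUSTED_RE.search(c)]
```

So in a table with columns `gene_id`, `baseMean`, `pvalue`, `padj`, and `FDR p-val`, only `pvalue` would be retained for analysis.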

### Classification of *p*-value histograms

Raw *p*-value sets were classified based on their histogram shape. The histogram shape was determined based on the presence and location of peaks. *P*-value histogram peaks (bins) were detected using the quality control threshold described in [27]: a Bonferroni-corrected alpha-level quantile of the cumulative function of the binomial distribution with size m and probability p. Histograms where none of the bins was over the QC threshold were classified as "uniform." Histograms where the bins over the QC threshold started either from the left or the right boundary and did not exceed 1/3 of the 0 to 1 range were classified as "anti-conservative" or "conservative," respectively. Histograms with peaks or bumps in the middle, or with non-continuous left- or right-side peaks, were classified as "other." Finally, histograms with both left- and right-side peaks were classified as "bimodal."
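The bin-count QC threshold can be sketched as below. This is a simplified sketch: 40 bins follow the main text, but alpha = 0.05 and the function names are our assumptions; ref [27] gives the exact procedure:

```python
def binom_ppf(q, n, p):
    """Smallest k with P(X <= k) >= q for X ~ Binomial(n, p), computed via
    the pmf recurrence to stay within floating-point range for large n."""
    pmf = (1.0 - p) ** n  # P(X = 0)
    cdf = pmf
    k = 0
    while cdf < q and k < n:
        pmf *= (n - k) / (k + 1) * p / (1.0 - p)  # P(X = k+1) from P(X = k)
        k += 1
        cdf += pmf
    return k

def qc_threshold(m, bins=40, alpha=0.05):
    """Bonferroni-corrected quantile of the per-bin count distribution for a
    p-value set of size m binned into `bins` equal-width bins; counts above
    this value qualify as a histogram peak."""
    return binom_ppf(1.0 - alpha / bins, m, 1.0 / bins)

def is_uniform(bin_counts, m, alpha=0.05):
    """'uniform' class: no bin exceeds the QC threshold."""
    thr = qc_threshold(m, bins=len(bin_counts), alpha=alpha)
    return all(c <= thr for c in bin_counts)
```

For m = 4,000 *p*-values in 40 bins (expected count 100 per bin under uniformity), the threshold lands near 130, so a flat histogram classifies as "uniform" while one spiked bin of 400 does not.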

### Calculation of the *π*_{0} statistic

Raw *p*-value sets with an anti-conservative shape were used to calculate the *π*_{0} statistic. The *π*_{0} statistic was calculated using the local FDR method implemented in limma::propTrueNull [48] and, independently, Storey's global FDR smoother method [32] as implemented in the gdsctools [63] Python package. Differential expression analysis tool data were collected from several sources: extracted from full-text articles via the NCBI PubMed Central API, extracted from GEO summaries, extracted from supplementary file names, extracted from text appended to *p*-value set names, and finally, as auxiliary information, inferred from column name patterns, using the following heuristics: cuffdiff (column names "fpkm" and "p_value") [45], DESeq/DESeq2 (column names "basemean" and "pval" or "pvalue," respectively) [46], edgeR (column name "logcpm") [47], and limma (column names "aveexpr" and "p.value," and PDAT > 2014-01-01) [48]; all sets that remained unidentified were binned as "unknown." We used the following regular expression to extract DE analysis tool names from lower-case-converted text: "deseq2?|de(g|x)seq|rockhopper|cuff(diff|links)|edger|clc(bio)?? genomics|igeak|bayseq|samseq|noiseq|ebseq|limma|voom|sleuth|partek|(nrsa|nascent rna seq)|median ratio norm|rmats|ballgown|biojupie|seurat|exdega".
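The intuition behind *π*_{0} can be illustrated with a single-λ Storey-type estimator. This is a cartoon only: the study itself used limma's local FDR method and Storey's smoother over a grid of λ values, and λ = 0.5 and the function name here are our choices:

```python
def pi0_at_lambda(pvalues, lam=0.5):
    """Estimate the fraction of true null hypotheses, assuming p-values
    above lam come almost entirely from true nulls (uniform on [0, 1])."""
    m = len(pvalues)
    tail = sum(1 for p in pvalues if p > lam)
    # Under uniformity, a fraction (1 - lam) of null p-values lands above lam.
    return min(1.0, tail / ((1.0 - lam) * m))
```

A purely null (uniform) *p*-value set gives an estimate near 1.0; mixing in an equal number of tiny "signal" *p*-values pulls the estimate down toward 0.5.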

### Sample size determination

Sample sizes were algorithmically determined for tables containing a single column of *p*-values by dividing the number of samples by the number of *p*-value sets + 1. Only balanced samples were retained, and all sample sizes of 4 or more were manually verified.
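In code, this heuristic amounts to the following sketch (the function name is ours):

```python
def samples_per_group(n_samples, n_pvalue_sets):
    """Per-group sample size for a balanced design: each p-value set is one
    contrast, so the samples split into (n_pvalue_sets + 1) equal groups."""
    groups = n_pvalue_sets + 1
    if n_samples % groups != 0:
        return None  # unbalanced design: discarded
    return n_samples // groups
```

For example, 6 samples with a single *p*-value column implies 2 groups of 3, whereas 7 samples cannot form a balanced design and would be discarded.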

### Modeling

Bayesian modeling was done using the R libraries rstan vers. 2.21.3 [64] and brms vers. 2.16.3 [65]. Models were specified using the extended R lme4 [66] formula syntax implemented in the R brms package. We used weak priors to fit the models. We ran a minimum of 2,000 iterations and 3 chains to fit the models. When suggested by brms, the Stan NUTS control parameter adapt_delta was increased to 0.95–0.99 and max_treedepth to 12–15.

### RNA-seq simulation

RNA-seq experiment simulation was done with the polyester R package [67], and differential expression was assessed using the DESeq2 R package [46] with default settings. Code and workflow used to run and analyze the RNA-seq simulations are deposited in Zenodo with doi: 10.5281/zenodo.4463804 (https://doi.org/10.5281/zenodo.4463804). Processed data, raw data, and the workflow of the RNA-seq simulations, with the input fasta file, are deposited in Zenodo with doi: 10.5281/zenodo.4463803 (https://doi.org/10.5281/zenodo.4463803). RNA-seq simulations to determine the relationship between sample size, *π*_{0}, and power were done using the PROPER package [33].

## Supporting information

### S5 Fig. DE analysis tool conditional effects from binomial logistic models for the proportion of anti-conservative p-value histograms.

(A) Simple model [anticons ~ de_tool], *N* = 4,616. The data file is in S23 Data. (B) Simple model [anticons ~ de_tool] fitted on full data, *N* = 14,813. The data file is in S24 Data. (C) Model conditioned on year of GEO submission [anticons ~ year + de_tool], *N* = 4,616. The data file is in S25 Data. (D) Model conditioned on studied organism (human/mouse/other) [anticons ~ organism + de_tool], *N* = 3,886. The data file is in S26 Data. (E) Varying intercept model [anticons ~ de_tool + (1 | model)] where "model" stands for sequencing instrument model, *N* = 3,778. The data file is in S27 Data. (F) Varying intercept and slope model [anticons ~ de_tool + (de_tool | model)], *N* = 3,778. The data file is in S27 Data. Points denote the best fit of the linear model. Thick and thin lines denote 66% and 95% credible intervals, respectively. The model object related to panel A can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/anticons_detool.rds. The model object related to panel B can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/anticons_detool_all.rds. The model object related to panel C can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.2/models/anticons_year_detool.rds. The model object related to panel D can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.2/models/anticons_organism_detool.rds. The model object related to panel E can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/anticons_detool__1_model.rds. The model object related to panel F can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/anticons_detool__detool_model.rds.

https://doi.org/10.1371/journal.pbio.3002007.s005

(TIFF)

### S6 Fig. DE analysis tool-conditional effects from beta regression modeling of π_{0}.

(A) Simple model [pi0 ~ de_tool] fitted on a sample, *N* = 1,188. The data file is in S4 Data. (B) Simple model [pi0 ~ de_tool] fitted on full data, *N* = 3,898. The data file is in S28 Data. (C) Model conditioned on year of GEO submission [pi0 ~ year + de_tool], *N* = 1,188. The data file is in S29 Data. (D) Model conditioned on studied organism (human/mouse/other) [pi0 ~ organism + de_tool], *N* = 993. The data file is in S30 Data. (E) Varying intercept model [pi0 ~ de_tool + (1 | model)] where "model" stands for sequencing instrument model, *N* = 959. The data file is in S31 Data. (F) Varying intercept/slope model [pi0 ~ de_tool + (de_tool | model)], *N* = 959. The data file is in S31 Data. Points denote the best fit of the linear model. Thick and thin lines denote 66% and 95% credible intervals, respectively. The model object related to panel A can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/pi0_detool_sample.rds. The model object related to panel B can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/pi0_detool_full_data.rds. The model object related to panel C can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.2/models/pi0_year_detool.rds. The model object related to panel D can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.2/models/pi0_organism_detool.rds. The model object related to panel E can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/pi0_detool__1_model.rds. The model object related to panel F can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/pi0_detool__detool_model.rds.

https://doi.org/10.1371/journal.pbio.3002007.s006

(TIFF)

### S7 Fig. Comparison of π0 values computed by 2 different methods.

The local FDR method is from the limma R package function propTrueNull. The smoother method is from the qvalue R package. A, density histogram. B, scatter plot. The dashed line has intercept = 0 and slope = 1. The data file is in S32 Data.

https://doi.org/10.1371/journal.pbio.3002007.s007

(TIFF)

### S16 Fig. Removal of low-count features results in an increased proportion of anti-conservative p-value histograms.

(A) Anti-conservative *p*-value histogram proportions in raw and filtered *p*-value sets for DE analysis programs. The raw *p*-value data are the same as in S5A Fig. Filtered *p*-value data are from a simple Bernoulli model [anticons ~ de_tool], *N* = 3,426. The data files are in S13 Data and in S14 Data (for raw data). (B) Effect size of low-count feature filtering on the proportion of anti-conservative *p*-values. The data files are in S13 Data and in S14 Data (for raw data). (C) π_{0} estimates for raw and filtered *p*-value sets. The raw *p*-value data are the same as in S6A Fig, and the filtered *p*-value data are from the beta model [pi0 ~ de_tool], *N* = 2,042. The data files are in S15 Data and in S16 Data (for raw data). (D) Effect size of low-count feature filtering on π_{0}. The data files are in S15 Data and in S16 Data (for raw data). Points denote the model best fit. Thick and thin lines denote 66% and 95% CIs, respectively.

https://doi.org/10.1371/journal.pbio.3002007.s016

(TIFF)

### S17 Fig. Simulated RNA-seq data show that histograms from p-value sets with around 100 true effects out of 20,000 features can be classified as "uniform".

RNA-seq data were simulated with the polyester R package on 20,000 transcripts from the human transcriptome, using a grid of 3, 6, and 10 replicates and 100, 200, 400, and 800 effects for 2 groups. Fold changes were set to 0.5 and 2. Differential expression was assessed using the DESeq2 R package with default settings and the group 1 versus group 2 contrast. "Effects" in facet labels denotes the number of true effects, and N denotes the number of replicates. The red line denotes the QC threshold used for dividing *p*-value histograms into discrete classes. Code and workflow used to run these simulations are available on GitHub: https://github.com/rstats-tartu/simulate-rnaseq. Raw data for the figure are available on Zenodo https://zenodo.org with doi: 10.5281/zenodo.4463803.

https://doi.org/10.1371/journal.pbio.3002007.s017

(TIFF)