Rarefaction was performed using GUniFrac (version 1.1) [72]. In this case, the ratio between the chosen reference taxon (denominator) and each taxon in that sample is compared across different sample groupings. Accordingly, we cannot definitively conclude that DA tools that require input data to be rarefied are less reliable in general. Abbreviations: prev., previous; TMM, trimmed mean of M-values; TMMwsp, trimmed mean of M-values with singleton pairing; rare, rarefied; CLR, centered log-ratio. For each replicate we tallied the number of times the genus was sampled across datasets. Here, we compare the performance of 14 differential abundance testing methods on 38 16S rRNA gene datasets with two sample groups. Using the main function ANCOM, all additive log-ratios for each taxon were then tested for significance using Wilcoxon rank-sum tests, and p-values were FDR-corrected using the BH method. In particular, these tools are frequently used interchangeably in the microbiome literature. For example, authors may want to present identified taxonomic markers in categories based on the tool characteristics presented within this paper or the number of tools that agree upon their identification. For example, in the human-IBD dataset several tools found mean AUROCs of the ASVs they identified ranging from 0.8 to 0.9 using either CLR or relative abundances as input, while both ALDEx2 and ANCOM-II failed to identify any significant ASVs. These datasets corresponded to a range of environments, including the human gut, plastisphere, freshwater, marine, soil, wastewater, and built environments (Supplementary Data 1).
In addition, for the unfiltered analyses, we also computed Spearman correlations with the percent of ASVs below 10% prevalence in each dataset (i.e., the percent of ASVs that would be removed to produce the filtered datasets). Furthermore, our analysis shows various characteristics of DA tools that authors can use to evaluate published literature within the field. Volume 13, Article number: 342 (2022). Moderate exercise has limited but distinguishable effects on the mouse microbiome. Examining the data at a higher AUC threshold of 0.9 showed that all tools had relatively high recall scores, apart from some tools such as ANCOM-II, corncob, and t-test (rare) on CLR data (medians: 0.5, 0.5, and 0.20). For the unfiltered data, the main outliers are the limma voom methods, followed by Wilcoxon (CLR; Fig. [37] have proposed a new CoDa approach for microbiome analysis that is aimed at the identification of microbial signatures, groups of microbial taxa that are predictive of . To overcome this scenario, several normalizations and transformation methods have been developed to . Baxter, N. T., Ruffin, M. T., Rogers, M. A. M. & Schloss, P. D. Microbiota-based model improves the sensitivity of fecal immunochemical test for detecting colonic lesions. Using the phyloseq_to_edgeR function (https://joey711.github.io/phyloseq-extensions/edgeR.html), we added a pseudocount of 1 to the non-rarefied feature table and used the function calcNormFactors from the edgeR R package (version 3.28.1) [9] to compute relative log expression normalization factors. This was replicated 100 times for each dataset and tool combination aside from ALDEx2, ANCOM-II, and corncob.
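The percent of ASVs below 10% prevalence referred to above can be computed directly from a count table. The following is a minimal sketch, not the authors' script: the table layout (a dict mapping each ASV to its per-sample counts), the function name, and the reading of "prevalence" as the fraction of samples with a nonzero count are all assumptions made for illustration.

```python
def percent_low_prevalence(table, threshold=0.10):
    """Percent of ASVs found in fewer than `threshold` of the samples.

    `table` is a hypothetical layout: {asv_id: [count per sample]}.
    Prevalence is assumed to mean the fraction of samples with count > 0,
    and "below 10%" is assumed to be a strict inequality.
    """
    if not table:
        return 0.0
    low = 0
    for counts in table.values():
        prevalence = sum(1 for c in counts if c > 0) / len(counts)
        if prevalence < threshold:
            low += 1
    return 100.0 * low / len(table)


# Toy table with 20 samples per ASV (made-up numbers).
toy_table = {
    "asv_a": [5] + [0] * 19,      # present in 1/20 samples (5%)
    "asv_b": [3, 2] + [0] * 18,   # present in 2/20 samples (exactly 10%)
    "asv_c": [7] * 20,            # present in every sample
    "asv_d": [0] * 20,            # never observed
}
print(percent_low_prevalence(toy_table))
```

Whether an ASV at exactly 10% prevalence is kept or removed depends on the filtering script actually used, so the strict inequality here should be checked against that script before reuse.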
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. Finally, using the results function, we obtained the resulting BH FDR-corrected p-values. This is largely because there are no gold standards against which to compare DA tool results. The processed data for these datasets were acquired from the MicrobiomeDB [48] and microbiomeHD [23] databases, respectively. We acquired five datasets for this analysis representing the microbiome of individuals with diarrhea compared with individuals without diarrhea (see Methods). We then used the exactTest for negative binomial data [9] to identify features that differ between the specified groups. Interestingly, in a few datasets, such as the Human-ASD and Human-OB (2) datasets, edgeR found a higher proportion of significant ASVs than any other tool. Specifically, some tools identified the most features in one dataset while identifying only an intermediate number in other datasets. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. By doing so we have highlighted the issues of using these tools interchangeably within the literature. Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies.
By default we assume that users already have the data in a suitable format for CLR, and minimal default processing is added. This question arises in many high-throughput datasets, where the burden of correcting for many tests can greatly reduce statistical power. An alpha-value of 0.05 was chosen as our significance cutoff and FDR-adjusted p-values (using Benjamini-Hochberg adjustment) were used for methods that output p-values (with the exception of LEfSe, which does not output all p-values by default) [73]. 16S gut community of the Cameron County Hispanic Cohort. 3b). We specifically focused on diarrhea as a phenotype, which has been shown to exhibit a strong effect on the microbiome and to be relatively reproducible across studies [23]. Mejía-León, M. E., Petrosino, J. F., Ajami, N. J., Domínguez-Bello, M. G. & de la Barca, A. M. C. Fecal microbiota imbalance in Mexican children with type 1 diabetes. Even if the researchers are confident in their approach, these discrepancies should be made clear when the results are summarized. This step failed in some highly sparse abundance tables; in these cases, we instead chose the sample with the largest sum of square-root transformed feature abundances to be the reference sample. Establishing microbial composition measurement standards with reference frames. To investigate this possibility, we identified the overlap between the 20 top-ranked ASVs per dataset (Supplementary Fig. We determined where the observed mean values lay on each corresponding distribution to calculate statistical significance. Certain observations have been reproducible, such as the higher FDR of edgeR and metagenomeSeq.
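The Benjamini-Hochberg adjustment and the 0.05 cutoff mentioned above can be illustrated with a short sketch. This is a plain-Python version of the standard step-up procedure written for illustration; the function name and the example p-values are hypothetical, and the tools discussed in the text rely on their own R implementations rather than this code.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for four features.
raw = [0.01, 0.04, 0.03, 0.20]
adj = bh_adjust(raw)
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, significant)
```

In this toy case three raw p-values fall below 0.05 but only one survives adjustment, which is the loss of power from multiple-test correction that the text refers to.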
The random expectation distribution is based on replicates of randomly selecting genera as significant and then computing the consistency across studies. On average only a small number of ASVs were amongst the top 20 ranked of all tools in both the filtered (mean: 0.21; SD: 0.62) and unfiltered (mean: 0.11; SD: 0.31) datasets (Supplementary Fig. However, ALDEx2 and ANCOM-II once again produced significant ASVs that largely overlapped with most other tools. Based on our results, we do not recommend these tools as the sole methods used for data analysis, and instead would suggest that researchers use more conservative methods such as ALDEx2 and ANCOM-II. As these log-ratio-based transformations encourage normality . Although this cross-data consistency analysis was informative, not all environments and datasets are appropriate for this comparison. We first passed the non-rarefied feature tables to the DESeq function (version 1.26.0) [8] with default settings, except that instead of the default relative log expression (also known as the median-of-ratios method) the estimation of size factors was set to use poscounts, which calculates a modified relative log expression that helps account for features missing in at least one sample. The exceptions were the two limma voom methods, which had high FDRs with unfiltered data, and edgeR and LEfSe, which had high FDRs on the filtered data. The same processing workflow was used for the supplementary obesity dataset comparison as well.
We then used fitFeatureModel to fit normalized feature counts with zero-inflated log-normal models (with pseudo-counts of 1 added prior to log2 transformation) and perform empirical Bayes moderated t-tests, and MRfulltable to obtain BH FDR-corrected p-values. Specifically, ALDEx2 (mean: 1.4%; SD: 3.4%) and ANCOM-II (mean: 0.8%; SD: 1.8%) identified the fewest significant ASVs. We used the default options when running these algorithms.
It's all relative: analyzing microbiome data as compositions. Accordingly, identifying only a few significant ASVs under this approach is not necessarily proof that a tool has a low FDR in practice. While it might be argued that differences in tool outputs are expected given that they test different hypotheses, we believe this perspective ignores how these tools are used in practice. The precision score of all tools at this threshold was low on both relative abundance (range: 0 to 0.01) and CLR data (range: 0 to 0.2). These plots are based on the mean inter-tool Jaccard distance across the 38 main datasets that we analyzed, computed by averaging over the inter-tool distance matrices for all individual datasets to weight each dataset equally. A pseudocount of 1 was then applied across the dataset to allow for log transformation. The processed datasets and metadata files are available at https://figshare.com/articles/dataset/16S_rRNA_Microbiome_Datasets/14531724. Next, we used cumNormStat and cumNorm to apply cumulative sum-scaling normalization, which attempts to normalize sequence counts based on the lower-quartile abundance of features. ALDEx2's conservative nature is most likely due to its Monte Carlo Dirichlet sampling approach, which down-weights low-abundance ASVs. Primers were removed using cutadapt [67] and reads were stitched together using the QIIME 2 VSEARCH [68] join-pairs plugin. Two additional problematic tools based on this analysis were edgeR and LEfSe. For each tool and study combination, we determined which genera were significantly different at an alpha of 0.05 (where relevant). In each dataset, only the most frequent sample group was chosen for analysis to help ensure similar composition among samples tested. These analyses provide insight into how similar the results of different tools are expected to be, which could be due to methodological similarities between them.
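The mean inter-tool Jaccard distance described above, averaged per dataset so that each dataset carries equal weight, can be sketched as follows. The data layout (one dict per dataset mapping a tool name to its set of significant ASVs), the function names, and the convention that two empty sets have distance 0 are assumptions for illustration only.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets; two empty sets are treated as identical (assumption)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_intertool_distance(datasets):
    """Average the pairwise tool distances over datasets, weighting each dataset equally.

    `datasets` is a hypothetical layout: a list of {tool_name: set_of_significant_asvs},
    with the same tools present in every dataset.
    """
    tools = sorted(datasets[0])
    result = {}
    for i, first in enumerate(tools):
        for second in tools[i + 1:]:
            distances = [jaccard_distance(d[first], d[second]) for d in datasets]
            result[(first, second)] = sum(distances) / len(distances)
    return result


# Two toy datasets and two hypothetical tools.
toy = [
    {"tool_a": {"asv1", "asv2"}, "tool_b": {"asv2", "asv3"}},
    {"tool_a": {"asv9"}, "tool_b": {"asv9"}},
]
distances = mean_intertool_distance(toy)
print(distances)
```

The resulting pairwise means form the distance matrix that would then be passed to an ordination method such as principal coordinate analysis.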
We have also highlighted that these tools can significantly differ in the number of ASVs that they identify as being significantly different and that some tools are more consistent across datasets than others.
This is because it has been highlighted that in many scenarios simulations can lead to circular arguments, where tools that are designed around specific parameters perform favorably on simulations using those parameters [17]. We chose to run this tool with two different normalization functions as we found the standard TMM normalization technique to struggle with highly sparse datasets despite it previously being shown to perform favorably in DA testing. We compared the number of significant ASVs each tool identified in 38 different datasets. Nonetheless, despite the high variation across DA tool results, we were able to characterize several consistent patterns produced by various tools that researchers should keep in mind when assessing both their own results and results from published work. Nature Communications. However, both MaAsLin2 and Wilcoxon (rare) found no significant features in the majority of tested datasets (6/8 and 7/8 respectively). Unfortunately, the variation across tools implies that biological interpretations based on these questions will often drastically differ depending on which DA tool is considered. 8).
The centered log-ratio (CLR) transformation is a CoDa approach that uses the geometric mean of the read counts of all taxa within a sample as the reference/denominator for that sample. To get another view of the data, principal coordinate analysis plots were constructed using the mean inter-tool Jaccard distance across the 38 main datasets. However, many investigators are either unaware of this or assume specific properties of the compositional data. When the prevalence filter option was set, the script also generated new filtered rarefied tables based on an input rarefaction depth. Source data are provided as a Source Data file. Accordingly, in principle all of the tools could be identifying the same top ASVs and simply taking varying degrees of risk when identifying less clearly differential ASVs. DOI: https://doi.org/10.1038/s41467-022-28034-z. We ran the non-rarefied feature table through the R ANCOM-II [16,74] (https://github.com/FrederickHuangLin/ANCOM) (version 2.1) function feature_table_pre_process, which first examined the abundance table to identify outlier zeros and structural zeros [74]. The percentage of ASVs in the unfiltered datasets that were lower than 10% prevalence was also significantly associated with the output of several tools. Without addressing the variation in depth across samples by some approach, the richness can drastically differ between samples due to read depth alone. Furthermore, the TMMwsp method is highlighted within the edgeR package as an alternative for highly sparse data.
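The CLR definition given above, with the geometric mean of a sample's counts as the denominator, can be written out in a few lines. This is a minimal sketch for a single sample; the function name is hypothetical and the pseudocount of 1 is an assumption that mirrors the pseudocount used elsewhere in the text for log transformation.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio of one sample: log(count) minus the mean log count.

    Subtracting the mean of the logs is the same as dividing each count by
    the sample's geometric mean before taking the log.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# One toy sample with three taxa (made-up counts).
values = clr([0, 9, 99])
print(values)
```

Taxa below the sample's geometric mean receive negative CLR values and the values of a sample sum to zero, so negative numbers after the transformation are expected rather than a sign of an error.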
For example, the centered log-ratio (CLR) transformation [2, 24] and phylogenetic isometric log-ratio (PhILR) transformation have been proposed to address the compositional nature of microbiome data, where PhILR further incorporates phylogenetic information into the transformed data. & Holmes, S. Waste not, want not: why rarefying microbiome data is inadmissible. We compared the consistency between different tools within all datasets by pooling all ASVs identified as being significant by at least one tool in the 38 different datasets. Diarrhea in young children from low-income countries leads to large-scale alterations in intestinal microbiota composition. The features in these datasets corresponded to both ASVs and clustered operational taxonomic units, but we refer to them all as ASVs below for simplicity. 1b). All other unfiltered datasets were run with 10 replicates due to computational limitations. Nearing, J. T., Comeau, A. M. & Langille, M. G. I. Identifying biases and their potential solutions in human microbiome studies. Sing, T., Sander, O., Beerenwinkel, N. & Lengauer, T. ROCR: visualizing classifier performance in R. Bioinformatics 21, 3940-3941 (2005). Rarefied tables were also produced for each dataset, where the rarefied read depth was taken to be the lowest read depth of any sample in the dataset over 2000 reads (with samples below this threshold discarded). It is also clear from our analysis that some tools designed for RNA-seq, such as the limma voom methods, cannot deal with the much higher sparsity of microbiome data without including a data filtration step. MH designed Table 1 and wrote parts of the Methods section. Herein we have compared the performance of commonly used DA tools on 16S rRNA gene datasets. These analyses provided insight into how similar the interpretations would be depending on which DA method was applied.
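The rule for choosing the rarefied read depth described above (the lowest depth among samples over 2000 reads, discarding the rest) can be sketched as below, together with a simple subsampling step. This is an illustrative stand-in and not the GUniFrac routine used for the actual analysis; the function names, the strict "greater than 2000" reading of the threshold, and the fixed random seed are assumptions.

```python
import random


def choose_rarefaction_depth(sample_depths, min_depth=2000):
    """Return (depth, kept_samples): the lowest depth among samples above `min_depth`."""
    kept = {name: depth for name, depth in sample_depths.items() if depth > min_depth}
    if not kept:
        raise ValueError("no sample exceeds the minimum read depth")
    return min(kept.values()), sorted(kept)


def rarefy(counts, depth, seed=0):
    """Subsample one sample's reads without replacement down to `depth` reads."""
    # Expand counts into one entry per read, labelled by feature index.
    pool = [index for index, count in enumerate(counts) for _ in range(count)]
    if depth > len(pool):
        raise ValueError("sample has fewer reads than the requested depth")
    drawn = random.Random(seed).sample(pool, depth)
    rarefied = [0] * len(counts)
    for index in drawn:
        rarefied[index] += 1
    return rarefied


# Toy read depths for three samples (made-up numbers).
depth, kept = choose_rarefaction_depth({"s1": 1500, "s2": 2500, "s3": 4000})
subsample = rarefy([3000, 1000, 0], depth)
print(depth, kept, sum(subsample))
```

Expanding counts into a per-read pool keeps the sketch short but is memory-hungry for deep samples; a production implementation would more likely draw from a multivariate hypergeometric distribution.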
To evaluate the false positive rates of each DA method, eight datasets were selected for analysis based on having the largest sample sizes, while also being from diverse environment types. Similarly, based on simulated datasets with spiked taxa it has been shown that these methods can drastically vary in statistical power [17,18]. Two clear outliers in the filtered data analyses were edgeR (mean: 0.69 to 27.9%) and LEfSe (mean: 3.4 to 5.1%), which consistently identified more significant hits compared with other tools (Fig. 4a). Expansion of intestinal Prevotella copri correlates with enhanced susceptibility to arthritis. Using raw feature abundances in the rarefied case, and CLR-transformed abundances (after applying a pseudocount of 1) in the non-rarefied case, we performed Wilcoxon rank-sum tests for each feature to compare the specified sample groupings. Impact of water chemistry, pipe material and stagnation on the building plumbing microbiome. This approach has been heavily criticized because excluding data could reduce statistical power and introduce biases. Note that to simplify this analysis we ignored the directionality of the significance (e.g., whether it was higher in case or control samples). Nevertheless, since there are more than 3,700 reef fish species in the Western and Central Pacific, and more than 600 coral species, a projected increase of microbial richness of even as low as 0. . Overall, we recommend that authors use the same tools when comparing results between specific studies and otherwise use a consensus approach based on several DA tools to help ensure results are robust to DA choice. Raw abundance values were used as input and multiple optimal cut-off points were selected to produce ROCs comparing sensitivity to specificity.
As the above analysis has the potential to penalize tools that call a higher number of ASVs that are of lower discriminatory value, we also investigated the ability of the tested DA methods to identify ASVs above specific AUC thresholds (Supplementary Fig. Microplastic in surface waters of urban rivers: concentration, sources, and associated bacterial assemblages. Preprint at https://doi.org/10.1101/074252 (2016).
Kernel-based genetic association analysis for microbiome phenotypes. Because corncob models each of these simultaneously and performs both differential abundance and differential variability testing [10], we set the null overdispersion model to be the same as the non-null model so that only taxa having differential abundances were identified. healthy humans) we would expect tools to not identify any ASVs as being differentially abundant. It is possible that the question of whether to rarefy data has received disproportionate attention in the microbiome field: there are numerous other factors affecting an analysis pipeline that likely affect results more. For instance, two different workflows for running MaAsLin2 are included, which produced similar outputs. This was particularly evident for both limma voom approaches and the Wilcoxon (CLR) approach.
We can clearly recommend that users avoid using edgeR (a tool primarily intended for RNA-seq data) as well as LEfSe (without p-value correction) for conducting DA testing with 16S rRNA gene data. 2a) and 0.31 to 0.56 for filtered data (Fig. Spearman correlations between the percent of significant ASVs identified by a tool and the following dataset characteristics were computed using the cor.test function in R: sample size, Aitchison's distance effect size as computed using a PERMANOVA test (adonis; vegan, version 2.5.6) [78], sparsity, mean sample ASV richness, median sample read depth, read depth range between samples, and the coefficient of variation for read depth within a dataset. We found similar, although not as extreme, trends with LEfSe, where in some datasets, such as the Human-T1D (1) dataset, the tool found a much higher percentage of significant hits (3.5%) compared with all other tools (0 to 0.4%). However, these significant ASVs tended to be more tool-specific in the unfiltered data and there was much more variation in the percentage of significant ASVs across tools. This is crucial for providing accurate insight into how robust specific findings are expected to be across independent studies, which often use different DA approaches. Intestinal microbial communities associated with acute enteric infections and disease recovery.