Page 5 - i1052-5173-31-5

databases such as The Paleobiology Database (PBDB; Peters and McClennen, 2016), Macrostrat (Peters et al., 2018), EarthChem (Walker et al., 2005), Georoc (Sarbas, 2008), and the Sedimentary Geochemistry and Paleoenvironments Project (SGP, this study).

Of course, large amounts of data are not new to the Earth sciences, and, with respect to volume, many Earth history and geochemistry compilations are small in comparison to the datasets used in other subdisciplines, including seismology (e.g., Nolet, 2012), climate science (e.g., Faghmous and Kumar, 2014), and hydrology (e.g., Chen and Wang, 2018). As a result, many Earth history compilations likely do not meet the criteria to be called "big data," a term that describes very large amounts of information that accumulate rapidly and which are heterogeneous and unstructured in form (Gandomi and Haider, 2015; or "if it fits in memory, it is small data"). That said, the tens of thousands to millions of entries present in such datasets do represent a new frontier for those interested in our planet's past. For many Earth historians, however, and especially for geochemists (where most of the field's efforts traditionally have focused on analytical measurements rather than data analysis; see Sperling et al., 2019), this frontier requires new outlooks and toolkits.

When using compilations to extract global trends through time, it is important to recognize that large datasets can have several inherent issues. Observations may be unevenly distributed temporally and/or spatially, with large stretches of time (e.g., parts of the Archean Eon) or space (e.g., much of Africa; Fig. S1 [see footnote 1]) lacking data. There may also be errors with entries—mislabeled values, transposition issues, and missing metadata can occur in even the most carefully curated compilations. Even if data are pristine, they may span decades of acquisition with evolving techniques, such that both analytical precision and measurement uncertainty are non-uniform across the dataset (Fig. S2 [see footnote 1]). Careful examination may demonstrate that contemporaneous and co-located observations do not agree. Additionally, data often are not targeted, such that not every entry may be necessary for (or even useful to) answering a particular question.

Luckily, these (and other) issues can be addressed through careful processing and analysis, using well-established statistical and computational techniques. Although such techniques have complications of their own (e.g., a high degree of comfort with programming often is required to run code efficiently), they do provide a way to extract meaningful trends from large datasets. No one lab can generate enough data to cover Earth's history densely enough (i.e., in time and space), but by leveraging compilations of accumulated knowledge, and using a well-developed computational pipeline, researchers can begin to ascertain a clearer picture of Earth's past.

A PROPOSED WORKFLOW
The process of transforming entries in a dataset into meaningful trends requires a series of steps, many with some degree of user decision making. Our proposed workflow is designed with the express intent of removing unfit data while appropriately propagating uncertainties. First, a compiled dataset is made or sourced (Fig. S3, i. [see footnote 1]). Next, a researcher chooses between in-database analysis and extracting data into another format, such as a text file (Fig. S3, ii.). This choice does nothing to the underlying data—its sole function is to recast information into a digital format that the researcher is most comfortable with. Then, a decision must be made about whether to remove entries that are not pertinent to the question at hand (Fig. S3, iii.). Using one or more metadata parameters (e.g., in the case of rocks, lithological descriptions), researchers can turn large compilations into targeted datasets, which then can be used to answer specific questions without the influence of irrelevant data. Following this gross filtering, researchers must decide between removing outliers or keeping them in the dataset (Fig. S3, iv.). Outliers have the potential to drastically skew results in misleading ways. Ascertaining which values are outliers is a non-trivial task, and all choices about outlier exclusion must be clearly described when presenting results. Finally, samples are drawn from the filtered dataset (i.e., "resampling") using a weighting scheme that seeks to address the spatial and temporal heterogeneities—as well as analytical uncertainties—of the data (Fig. S3, vi.). To calculate statistics from the data, multiple iterations of resampling are required.

CASE STUDY: THE SEDIMENTARY GEOCHEMISTRY AND PALEOENVIRONMENTS PROJECT
The SGP project seeks to compile sedimentary geochemical data, made up of various analytes (i.e., components that have been analyzed), from throughout geologic time. We applied our workflow to the SGP database [see footnote 2] to extract coherent temporal trends in Al2O3 and U from siliciclastic mudstones. Al2O3 is relatively immobile and thus useful for constraining both the provenance and chemical weathering history of ancient sedimentary deposits (Young and Nesbitt, 1998). Conversely, U is highly sensitive to redox processes. In marine mudstones, U serves as both a local proxy for reducing conditions in the overlying water column (i.e., authigenic U enrichments only occur under low-oxygen or anoxic conditions and/or very low sedimentation rates; see Algeo and Li, 2020) and a global proxy for the areal extent of reducing conditions (i.e., the magnitude of authigenic enrichments scales in part with the global redox landscape; see Partin et al., 2013).

SGP data are stored in a PostgreSQL relational database that currently comprises a total of 82,579 samples (Fig. 1). The SGP database was created by merging sample data and geological context information from three separate sources, each with different foci and methods for obtaining the "best guess" age of a sample (i.e., the interpreted age as well as potential maximum and minimum ages). The first source is direct entry by SGP team members, which focuses primarily on Neoproterozoic–Paleozoic shale samples and has global coverage. Due to the direct involvement of researchers intimately familiar with their sample sets, these data have the most precise (Fig. 1A)—and likely also most accurate—age constraints. Second, the SGP database has incorporated sedimentary geochemical data from the United States Geological Survey (USGS) National Geochemical Database (NGDB), comprising samples from projects completed between the 1960s and 1990s. These samples, which
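
The gross-filtering and outlier steps of the workflow (Fig. S3, iii.–iv.) can be sketched in a few lines of Python. The field names, the lithology filter, and the interquartile-range outlier rule below are illustrative assumptions, not the SGP implementation; the key point is that outliers are flagged explicitly so the exclusion choice can be reported alongside results.

```python
import statistics

def filter_by_metadata(entries, allowed_lithologies):
    """Gross filtering: keep only entries whose lithology metadata
    matches the target set (Fig. S3, iii.)."""
    return [e for e in entries if e.get("lithology") in allowed_lithologies]

def flag_outliers(entries, key="value", k=1.5):
    """Flag, rather than silently drop, values beyond k * IQR fences
    (Fig. S3, iv.); the caller decides whether to exclude them."""
    values = sorted(e[key] for e in entries)
    q = statistics.quantiles(values, n=4)
    iqr = q[2] - q[0]
    lo, hi = q[0] - k * iqr, q[2] + k * iqr
    for e in entries:
        e["outlier"] = not (lo <= e[key] <= hi)
    return entries
```

Flagging keeps the full dataset intact, so sensitivity tests with and without outliers can be run from the same filtered table.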
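
The final resampling step (Fig. S3, vi.) amounts to a weighted bootstrap. Below is a minimal sketch, assuming a simple temporal-bin weighting and per-sample 1-sigma analytical uncertainties; both are hypothetical stand-ins for the actual SGP weighting scheme, which also accounts for spatial clustering.

```python
import random
import statistics

def temporal_weights(samples, bin_myr=50):
    """Down-weight samples from densely populated time bins so that
    well-sampled intervals do not dominate the resampled statistics."""
    counts = {}
    for s in samples:
        b = s["age"] // bin_myr
        counts[b] = counts.get(b, 0) + 1
    return [1.0 / counts[s["age"] // bin_myr] for s in samples]

def resample_medians(samples, iterations=1000, seed=0):
    """Weighted bootstrap of the median: each iteration draws samples
    with replacement according to their weights, then perturbs each
    drawn value by its analytical uncertainty to propagate it."""
    rng = random.Random(seed)
    weights = temporal_weights(samples)
    medians = []
    for _ in range(iterations):
        draw = rng.choices(samples, weights=weights, k=len(samples))
        perturbed = [rng.gauss(s["value"], s["sigma"]) for s in draw]
        medians.append(statistics.median(perturbed))
    return medians
```

Running many iterations yields a distribution of medians per time interval, from which confidence bounds on a temporal trend can be read off directly.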


         1 Supplemental Material: table of valid lithologies; map depicting sample locations; crossplot illustrating analytical uncertainty; flowchart of the proposed workflow;
         histograms showing the effects of progressive filtering, the distribution of spatial and age scales, and proximity and probability values; and results of sensitivity tests.
         Go to https://doi.org/10.1130/GSAT.S.14179976 to access the supplemental material; contact editing@geosociety.org with any questions.

         2 All code used in this study is located at https://github.com/akshaymehra/dataCompilationWorkflow.
                                                                                        www.geosociety.org/gsatoday  5