databases such as The Paleobiology Database (PBDB; Peters and McClennen, 2016), Macrostrat (Peters et al., 2018), EarthChem (Walker et al., 2005), Georoc (Sarbas, 2008), and the Sedimentary Geochemistry and Paleoenvironments Project (SGP, this study).

Of course, large amounts of data are not new to the Earth sciences, and, with respect to volume, many Earth history and geochemistry compilations are small in comparison to the datasets used in other subdisciplines, including seismology (e.g., Nolet, 2012), climate science (e.g., Faghmous and Kumar, 2014), and hydrology (e.g., Chen and Wang, 2018). As a result, many Earth history compilations likely do not meet the criteria to be called “big data,” which is a term that describes very large amounts of information that accumulate rapidly and which are heterogeneous and unstructured in form (Gandomi and Haider, 2015; or “if it fits in memory, it is small data”). That said, the tens of thousands to millions of entries present in such datasets do represent a new frontier for those interested in our planet’s past. For many Earth historians, however, and especially for geochemists (where most of the field’s efforts traditionally have focused on analytical measurements rather than data analysis; see Sperling et al., 2019), this frontier requires new outlooks and toolkits.

When using compilations to extract global trends through time, it is important to recognize that large datasets can have several inherent issues. Observations may be unevenly distributed temporally and/or spatially, with large stretches of time (e.g., parts of the Archean Eon) or space (e.g., much of Africa; Fig. S1¹) lacking data. There may also be errors with entries—mislabeled values, transposition issues, and missing metadata can occur in even the most carefully curated compilations. Even if data are pristine, they may span decades of acquisition with evolving techniques, such that both analytical precision and measurement uncertainty are non-uniform across the dataset (Fig. S2 [see footnote 1]). Careful examination may demonstrate that contemporaneous and co-located observations do not agree. Additionally, data often are not targeted, such that not every entry may be necessary for (or even useful to) answering a particular question.

Luckily, these (and other) issues can be addressed through careful processing and analysis, using well-established statistical and computational techniques. Although such techniques have complications of their own (e.g., a high degree of comfort with programming often is required to run code efficiently), they do provide a way to extract meaningful trends from large datasets. No one lab can generate enough data to cover Earth’s history densely enough (i.e., in time and space), but by leveraging compilations of accumulated knowledge, and using a well-developed computational pipeline, researchers can begin to ascertain a clearer picture of Earth’s past.

A PROPOSED WORKFLOW
The process of transforming entries in a dataset into meaningful trends requires a series of steps, many with some degree of user decision making. Our proposed workflow is designed with the express intent of removing unfit data while appropriately propagating uncertainties. First, a compiled dataset is made or sourced (Fig. S3, i. [see footnote 1]). Next, a researcher chooses between in-database analysis and extracting data into another format, such as a text file (Fig. S3, ii.). This choice does nothing to the underlying data—its sole function is to recast information into a digital format that the researcher is most comfortable with. Then, a decision must be made about whether to remove entries that are not pertinent to the question at hand (Fig. S3, iii.). Using one or more metadata parameters (e.g., in the case of rocks, lithological descriptions), researchers can turn large compilations into targeted datasets, which then can be used to answer specific questions without the influence of irrelevant data.
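As a concrete illustration of the extraction and gross-filtering steps (Fig. S3, ii.–iii.), the sketch below pulls a lithologically targeted subset out of a relational compilation and writes it to a text file. It is a minimal sketch under assumed conditions: the connection string, the samples table, and the column names (sample_id, age_ma, lat, lon, lithology, al2o3_wt_pct, u_ppm, u_1sigma) are hypothetical placeholders rather than the actual SGP schema.

# Minimal sketch (Python): extract a targeted subset from a relational
# database and save it as a text file (workflow steps ii.-iii.).
# Connection string, table, and column names are hypothetical placeholders,
# not the real SGP schema.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://reader@localhost/geochem")

query = """
    SELECT sample_id, age_ma, lat, lon, lithology, al2o3_wt_pct, u_ppm, u_1sigma
    FROM samples
    WHERE lithology ILIKE ANY (ARRAY['%shale%', '%mudstone%', '%siltstone%'])
      AND age_ma IS NOT NULL;
"""
targeted = pd.read_sql(query, engine)

# Further gross filtering can also happen after extraction, e.g., by
# dropping entries that lack the analytes of interest.
targeted = targeted.dropna(subset=["al2o3_wt_pct", "u_ppm"])
targeted.to_csv("targeted_mudstones.csv", index=False)

Working from an exported file like this leaves the underlying database untouched, exactly as described above; an equivalent query could just as easily support in-database analysis.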
Following this gross filtering, researchers must decide between removing outliers or keeping them in the dataset (Fig. S3, iv.). Outliers have the potential to drastically skew results in misleading ways. Ascertaining which values are outliers is a non-trivial task, and all choices about outlier exclusion must be clearly described when presenting results. Finally, samples are drawn from the filtered dataset (i.e., “resampling”) using a weighting scheme that seeks to address the spatial and temporal heterogeneities—as well as analytical uncertainties—of the data (Fig. S3, vi.). To calculate statistics from the data, multiple iterations of resampling are required.
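The outlier, resampling, and statistics steps (Fig. S3, iv.–vi.) can be sketched in a similar spirit. The version below weights each sample inversely by how crowded its temporal and spatial neighborhood is, perturbs each drawn value by its reported analytical uncertainty, and repeats the draw many times to build binned means and standard deviations. The particular weighting function, the 100 m.y. bins, the iteration count, and the column names are illustrative assumptions, not the published SGP implementation.

# Minimal sketch (Python): optional outlier screen, proximity-weighted
# bootstrap resampling with uncertainty propagation, and binned statistics
# (workflow steps iv.-vi.). All specific choices here are illustrative.
import numpy as np

def trim_outliers(df, column, lower=0.01, upper=0.99):
    # Optional screen: drop values outside chosen percentile bounds.
    lo, hi = df[column].quantile([lower, upper])
    return df[df[column].between(lo, hi)].copy()

def proximity_weights(df, age_scale_myr=25.0, angle_scale_deg=10.0):
    # Weight each sample inversely by how many other samples lie close to it
    # in time and space, so dense clusters are not over-represented. The
    # pairwise matrix is fine for modest n; very large compilations would
    # need chunking or an approximate neighbor search.
    age = df["age_ma"].to_numpy(dtype=float)
    lat = np.radians(df["lat"].to_numpy(dtype=float))
    lon = np.radians(df["lon"].to_numpy(dtype=float))
    cos_sep = (np.sin(lat)[:, None] * np.sin(lat)[None, :]
               + np.cos(lat)[:, None] * np.cos(lat)[None, :]
               * np.cos(lon[:, None] - lon[None, :]))
    angle = np.degrees(np.arccos(np.clip(cos_sep, -1.0, 1.0)))
    closeness = (1.0 / (1.0 + (np.abs(age[:, None] - age[None, :]) / age_scale_myr) ** 2)
                 + 1.0 / (1.0 + (angle / angle_scale_deg) ** 2))
    weights = 1.0 / closeness.sum(axis=1)
    return weights / weights.sum()

def resampled_trend(df, column, sigma_column, n_iterations=1000,
                    bin_edges=np.arange(0.0, 3801.0, 100.0), seed=0):
    # Draw weighted bootstrap replicates, perturb each drawn value by its
    # analytical uncertainty, and collect binned means across iterations.
    rng = np.random.default_rng(seed)
    p = proximity_weights(df)
    values = df[column].to_numpy(dtype=float)
    sigmas = df[sigma_column].to_numpy(dtype=float)
    ages = df["age_ma"].to_numpy(dtype=float)
    n_bins = len(bin_edges) - 1
    binned = np.full((n_iterations, n_bins), np.nan)
    for i in range(n_iterations):
        idx = rng.choice(len(df), size=len(df), replace=True, p=p)
        drawn = values[idx] + rng.normal(0.0, sigmas[idx])
        bins = np.digitize(ages[idx], bin_edges) - 1
        for b in range(n_bins):
            in_bin = drawn[bins == b]
            if in_bin.size:
                binned[i, b] = in_bin.mean()
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    return centers, np.nanmean(binned, axis=0), np.nanstd(binned, axis=0)

Run against the extracted subset, a call such as resampled_trend(trim_outliers(targeted, "u_ppm"), "u_ppm", "u_1sigma") would yield a binned U trend with bootstrap uncertainties; whether to apply the outlier screen, and with what thresholds, remains a user decision that should be reported alongside the results.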

CASE STUDY: THE SEDIMENTARY GEOCHEMISTRY AND PALEOENVIRONMENTS PROJECT
The SGP project seeks to compile sedimentary geochemical data, made up of various analytes (i.e., components that have been analyzed), from throughout geologic time. We applied our workflow² to the SGP database to extract coherent temporal trends in Al₂O₃ and U from siliciclastic mudstones. Al₂O₃ is relatively immobile and thus useful for constraining both the provenance and chemical weathering history of ancient sedimentary deposits (Young and Nesbitt, 1998). Conversely, U is highly sensitive to redox processes. In marine mudstones, U serves as both a local proxy for reducing conditions in the overlying water column (i.e., authigenic U enrichments only occur under low-oxygen or anoxic conditions and/or very low sedimentation rates; see Algeo and Li, 2020) and a global proxy for the areal extent of reducing conditions (i.e., the magnitude of authigenic enrichments scales in part with the global redox landscape; see Partin et al., 2013).
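For readers less familiar with redox proxies, authigenic enrichment is commonly quantified by normalizing U to Al and comparing that ratio against a chosen reference composition, i.e., EF_U = (U/Al)_sample / (U/Al)_reference, with values well above 1 read as authigenic U addition under reducing conditions. The sketch below encodes that convention only as general context; neither the metric nor any particular reference composition is prescribed by this study.

# Minimal sketch (Python): uranium enrichment factor relative to a chosen
# reference composition. The normalization convention and the reference
# values are user choices; nothing here is specific to the SGP workflow.
def uranium_enrichment_factor(u_ppm, al_wt_pct, ref_u_ppm, ref_al_wt_pct):
    # EF_U = (U/Al) of the sample divided by (U/Al) of the reference.
    # Sample and reference must use the same Al convention (Al or Al2O3)
    # and the same units for each element.
    return (u_ppm / al_wt_pct) / (ref_u_ppm / ref_al_wt_pct)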
SGP data are stored in a PostgreSQL relational database that currently comprises a total of 82,579 samples (Fig. 1). The SGP database was created by merging sample data and geological context information from three separate sources, each with different foci and methods for obtaining the “best guess” age of a sample (i.e., the interpreted age as well as potential maximum and minimum ages). The first source is direct entry by SGP team members, which focuses primarily on Neoproterozoic–Paleozoic shale samples and has global coverage. Due to the direct involvement of researchers intimately familiar with their sample sets, these data have the most precise (Fig. 1A)—and likely also most accurate—age constraints. Second, the SGP database has incorporated sedimentary geochemical data from the United States Geological Survey (USGS) National Geochemical Database (NGDB), comprising samples from projects completed between the 1960s and 1990s. These samples, which
1 Supplemental Material: table of valid lithologies; map depicting sample locations; crossplot illustrating analytical uncertainty; flowchart of the proposed workflow;
histograms showing the effects of progressive filtering, the distribution of spatial and age scales, and proximity and probability values; and results of sensitivity tests.
Go to https://doi.org/10.1130/GSAT.S.14179976 to access the supplemental material; contact editing@geosociety.org with any questions.
2 All code used in this study is located at https://github.com/akshaymehra/dataCompilationWorkflow.