Page 7 - i1052-5173-31-5
improved machine learning algorithms, designed to classify unknown samples based on their elemental composition, may provide a more sophisticated means by which to generate the largest possible dataset of lithology-appropriate samples.

We then completed a preliminary screening of the lithology-filtered samples by checking if extant analyte values were outside of physically possible bounds (e.g., individual oxides with wt% less than 0 or greater than 100), and, if so, setting them to NaN. Next, to reduce the number of mudstone samples with detrital or authigenic carbonate and phosphatic mineral phases, we excluded samples with greater than 10 wt% Ca and/or more than 1 wt% P₂O₅ (removing ~66.9% of the remaining data; Fig. S4 [see footnote 1]). Additionally, in order to ensure that our mudstone samples were not subject to secondary enrichment processes, such as ore mineralization, we queried the USGS NGDB to extract the recorded characteristics of every sample with an associated USGS NGDB identifier. We examined these characteristics for the presence of selected strings (i.e., “mineralized,” “mineralization present,” “unknown mineralization,” and “radioactive”) and excluded any sample exhibiting one or more strings. Finally, as there were still several apparent outliers in the dataset, we manually examined the log histograms of each element and oxide of interest. On each histogram, we demarcated the 0.5th and 99.5th percentile bounds of the data, then visually studied those histograms to exclude “outlier populations,” or samples located both well outside those percentile bounds and not part of a continuum of values (removing ~5.7% of the remaining data; Fig. S4). Following these filtering steps, we saved the data in a .csv text file.

Data Resampling

We implemented resampling based on inverse distance weighting (after Keller and Schoene, 2012), in which samples closer together (that is, with respect to a metric such as age or spatial distance) are considered to be more alike than samples that are further apart. The inverse weighting of an individual point, x, is based on the basic form:

$$y(x) = \frac{1}{d(x, x_i)^{p}},$$

where d is a distance function, x_i is a second sample, and p, which is greater than 0, is a power parameter. In the case of the SGP data, we used two distance functions, spatial (s) and temporal (t):

$$s = \frac{\mathrm{arcdistance}(x, x_i)}{\mathrm{scale}_{\mathrm{spatial}}},$$

$$t = \frac{\mathrm{age}(x) - \mathrm{age}(x_i)}{\mathrm{scale}_{\mathrm{age}}},$$

where arcdistance refers to the distance between two points on a sphere, scale_spatial refers to a preselected arc distance value (in degrees; Fig. S5, inset [see footnote 1]), and scale_age is a preselected age value (in million years, Ma). In this case study, we chose a scale_spatial of 0.5 degrees and a scale_age of 10 Ma (see below for a discussion about parameter values).

For n samples, the proximity value w assigned to each sample x is:

$$w(x) = \sum_{i=1}^{n} \left( \frac{1}{s^{2} + 1} + \frac{1}{t^{2} + 1} \right).$$

Essentially, the proximity value is a summation of the reciprocals of the distance measures made for each pair of the sample and a single other datum from the dataset. Accordingly, samples that are closer to other data in both time and space will have larger w values than those that are farther away. Note that the additive term of 1 in the denominator establishes a maximum value of 1 for each reciprocal distance measure.

We normalized the generated proximity values (Fig. S6 [see footnote 1]) to produce a probability value P. This normalization was done such that the median proximity value corresponded to a P of ~0.20 (i.e., a 1 in 5 chance of being chosen):

$$P(x) = \frac{1}{\dfrac{w(x)}{\mathrm{median}(w) \times 0.20} + 1}.$$

This normalization results in an “inverse proximity weighting,” such that samples that are closer to other data (which have large w values) end up with a smaller P value than those that are far away from other samples. Next, we assigned both analytical and temporal uncertainties to each analyte to be resampled. Then, we culled the dataset into an m by n matrix, where each row corresponded to a sample and each column to an analyte. We resampled this culled dataset 10,000 times using a three-step process: (1) we drew samples, using calculated P values, with replacement (i.e., each draw considered all available samples, regardless of whether a sample had already been drawn); (2) we multiplied the assigned uncertainties discussed above by a random draw from a normal distribution (µ = 0; σ = 1) to produce an error value; and (3) we added these newly calculated errors to the drawn temporal and analytical values. Finally, we binned and plotted the resampled data.

Naturally, the reader may ask how we chose the values for scale_spatial and scale_age and what, if any, impact those choices had on the final results. Nominally, the values of scale_spatial and scale_age are controlled by the size and age, respectively, of the features that are being sampled. So, in the case of sedimentary rocks, those values should reflect the length scale and duration of a typical sedimentary basin, such that many samples from the same “spatiotemporal” basin have lower P values than few samples from distinct basins. Of course, it is debatable what “typical” means in the context of sedimentary basins, as both size and age can vary over orders of magnitude (Woodcock, 2004). Given this uncertainty, we subjected the SGP data to a series of sensitivity tests, where we varied both scale_spatial and scale_age, using logarithmically spaced values of each (Fig. S5 [see footnote 1]). While the uncertainty associated with results varied based on the choice of the two parameters, the overall mean values were not appreciably different (Fig. S7 [see footnote 1]).

RESULTS

To study the impact of our methodology, we present results for two geochemical components, U and Al₂O₃ (Fig. 2). Contents-wise, the U and Al₂O₃ data in the SGP database contain extreme outliers. Many of these outliers were removed using the lithology and Ca or P₂O₅ screening (Figs. 2A and 2C); the final outlier filtering strategy discussed above handled any remaining values of concern. In the case of U, our multi-step filtering reduced the range of concentrations by three orders of magnitude, from 0–500,000 ppm to 0–500 ppm.

DISCUSSION

The illustrative examples we have presented have implications for understanding Earth’s history. Al₂O₃ contents of ancient mudstones appear relatively stable over the
www.geosociety.org/gsatoday 7
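The proximity weighting and three-step resampling described under Data Resampling can be illustrated with a minimal Python sketch. It is not the code used in the study. All function names are hypothetical, and the sketch rests on several assumptions: the arc distance is computed with the haversine formula, the sum for w includes the sample itself, the normalization is taken as P = 1 / (w / (0.20 × median(w)) + 1), and step (1) is implemented as a weighted draw with probabilities proportional to P.

```python
import numpy as np

def arc_distance_deg(lat1, lon1, lat2, lon2):
    """Great-circle (arc) distance in degrees; inputs are in degrees."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dlon = np.radians(lon2 - lon1)
    # Haversine formula (an assumption; the text only says "arcdistance").
    a = np.sin((p2 - p1) / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlon / 2) ** 2
    return np.degrees(2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0))))

def proximity(lat, lon, age, scale_spatial=0.5, scale_age=10.0):
    """Proximity value w for every sample: sum over all samples of
    1/(s^2 + 1) + 1/(t^2 + 1). The sample itself is included in the sum
    (assumption), contributing a constant 2 to each w."""
    s = arc_distance_deg(lat[:, None], lon[:, None], lat[None, :], lon[None, :]) / scale_spatial
    t = (age[:, None] - age[None, :]) / scale_age
    return (1.0 / (s ** 2 + 1.0) + 1.0 / (t ** 2 + 1.0)).sum(axis=1)

def selection_probability(w, target=0.20):
    """Inverse proximity weighting. Assumed form of the normalization:
    P = 1 / (w / (target * median(w)) + 1)."""
    return 1.0 / (w / (target * np.median(w)) + 1.0)

def resample(values, sigma, p, n_draws, rng):
    """One resampling pass over an (n_samples, n_analytes) matrix.
    (1) draw rows with replacement, weighted by p (assumed proportional draw);
    (2) scale the assigned uncertainties by standard normal draws;
    (3) add the resulting errors to the drawn values."""
    idx = rng.choice(len(values), size=n_draws, replace=True, p=p / p.sum())
    errors = sigma[idx] * rng.standard_normal(size=(n_draws, values.shape[1]))
    return values[idx] + errors

# Illustrative (made-up) inputs: two neighboring samples and one isolated sample.
lat = np.array([10.0, 10.1, -40.0])
lon = np.array([20.0, 20.1, 120.0])
age = np.array([500.0, 502.0, 1500.0])           # Ma
values = np.array([[3.0], [3.5], [12.0]])        # one analyte column
sigma = np.array([[0.1], [0.1], [0.5]])          # assigned 1-sigma uncertainties

w = proximity(lat, lon, age)
p = selection_probability(w)
draws = resample(values, sigma, p, n_draws=1000, rng=np.random.default_rng(0))
```

In this toy input, the two neighboring samples receive larger w values and therefore smaller P values than the isolated sample, which is the behavior the text describes for many samples from the same “spatiotemporal” basin.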