Page 7 - i1052-5173-31-5

improved machine learning algorithms, designed to classify unknown samples based on their elemental composition, may provide a more sophisticated means by which to generate the largest possible dataset of lithology-appropriate samples.

We then completed a preliminary screening of the lithology-filtered samples by checking if extant analyte values were outside of physically possible bounds (e.g., individual oxides with wt% less than 0 or greater than 100), and, if so, setting them to NaN. Next, to reduce the number of mudstone samples with detrital or authigenic carbonate and phosphatic mineral phases, we excluded samples with greater than 10 wt% Ca and/or more than 1 wt% P₂O₅ (removing ~66.9% of the remaining data; Fig. S4 [see footnote 1]). Additionally, in order to ensure that our mudstone samples were not subject to secondary enrichment processes, such as ore mineralization, we queried the USGS NGDB to extract the recorded characteristics of every sample with an associated USGS NGDB identifier. We examined these characteristics for the presence of selected strings (i.e., “mineralized,” “mineralization present,” “unknown mineralization,” and “radioactive”) and excluded any sample exhibiting one or more strings. Finally, as there were still several apparent outliers in the dataset, we manually examined the log histograms of each element and oxide of interest. On each histogram, we demarcated the 0.5th and 99.5th percentile bounds of the data, then visually studied those histograms to exclude “outlier populations,” or samples located both well outside those percentile bounds and not part of a continuum of values (removing ~5.7% of the remaining data; Fig. S4). Following these filtering steps, we saved the data in a .csv text file.

Data Resampling

We implemented resampling based on inverse distance weighting (after Keller and Schoene, 2012), in which samples closer together (that is, with respect to a metric such as age or spatial distance) are considered to be more alike than samples that are farther apart. The inverse weighting of an individual point, x, is based on the basic form:

$$y(x) = \frac{1}{d(x, x_i)^{p}},$$

where d is a distance function, x_i is a second sample, and p, which is greater than 0, is a power parameter. In the case of the SGP data, we used two distance functions, spatial (s) and temporal (t):

$$s = \frac{\mathrm{arcdistance}(x, x_i)}{\mathrm{scale}_{\mathrm{spatial}}}, \qquad t = \frac{\mathrm{age}(x) - \mathrm{age}(x_i)}{\mathrm{scale}_{\mathrm{age}}},$$

where arcdistance refers to the distance between two points on a sphere, $\mathrm{scale}_{\mathrm{spatial}}$ refers to a preselected arc distance value (in degrees; Fig. S5, inset [see footnote 1]), and $\mathrm{scale}_{\mathrm{age}}$ is a preselected age value (in million years, Ma). In this case study, we chose a $\mathrm{scale}_{\mathrm{spatial}}$ of 0.5 degrees and a $\mathrm{scale}_{\mathrm{age}}$ of 10 Ma (see below for a discussion about parameter values).

For n samples, the proximity value w assigned to each sample x is:

$$w(x) = \sum_{i=1}^{n} \left( \frac{1}{s^{2} + 1} + \frac{1}{t^{2} + 1} \right).$$

Essentially, the proximity value is a summation of the reciprocals of the distance measures made for each pair of the sample and a single other datum from the dataset. Accordingly, samples that are closer to other data in both time and space will have larger w values than those that are farther away. Note that the additive term of 1 in each denominator establishes a maximum value of 1 for each reciprocal distance measure.

We normalized the generated proximity values (Fig. S6 [see footnote 1]) to produce a probability value P. This normalization was done such that the median proximity value corresponded to a P of ~0.20 (i.e., a 1 in 5 chance of being chosen):

$$P(x) = \frac{1}{\dfrac{w(x)}{0.20 \cdot \mathrm{median}(w)} + 1}.$$

This normalization results in an “inverse proximity weighting,” such that samples that are closer to other data (which have large w values) end up with a smaller P value than those that are far away from other samples. Next, we assigned both analytical and temporal uncertainties to each analyte to be resampled. Then, we culled the dataset into an m by n matrix, where each row corresponded to a sample and each column to an analyte. We resampled this culled dataset 10,000 times using a three-step process: (1) we drew samples, using calculated P values, with replacement (i.e., each draw considered all available samples, regardless of whether a sample had already been drawn); (2) we multiplied the assigned uncertainties discussed above by a random draw from a normal distribution (µ = 0; σ = 1) to produce an error value; and (3) we added these newly calculated errors to the drawn temporal and analytical values. Finally, we binned and plotted the resampled data.

Naturally, the reader may ask how we chose the values for $\mathrm{scale}_{\mathrm{spatial}}$ and $\mathrm{scale}_{\mathrm{age}}$ and what, if any, impact those choices had on the final results. Nominally, the values of $\mathrm{scale}_{\mathrm{spatial}}$ and $\mathrm{scale}_{\mathrm{age}}$ are controlled by the size and age, respectively, of the features that are being sampled. So, in the case of sedimentary rocks, those values should reflect the length scale and duration of a typical sedimentary basin, such that many samples from the same “spatiotemporal” basin have lower P values than few samples from distinct basins. Of course, it is debatable what “typical” means in the context of sedimentary basins, as both size and age can vary over orders of magnitude (Woodcock, 2004). Given this uncertainty, we subjected the SGP data to a series of sensitivity tests, where we varied both $\mathrm{scale}_{\mathrm{spatial}}$ and $\mathrm{scale}_{\mathrm{age}}$, using logarithmically spaced values of each (Fig. S5 [see footnote 1]). While the uncertainty associated with results varied based on the choice of the two parameters, the overall mean values were not appreciably different (Fig. S7 [see footnote 1]).

RESULTS

To study the impact of our methodology, we present results for two geochemical components, U and Al₂O₃ (Fig. 2). The U and Al₂O₃ data in the SGP database contain extreme outliers. Many of these outliers were removed using the lithology and Ca or P₂O₅ screening (Figs. 2A and 2C); the final outlier filtering strategy discussed above handled any remaining values of concern. In the case of U, our multi-step filtering reduced the range of concentrations by three orders of magnitude, from 0–500,000 ppm to 0–500 ppm.

DISCUSSION

The illustrative examples we have presented have implications for understanding Earth’s history. Al₂O₃ contents of ancient mudstones appear relatively stable over the
                                                                                        www.geosociety.org/gsatoday  7
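The proximity weighting and three-step resampling described under Data Resampling can be sketched in Python with NumPy. This is a minimal illustration, not the authors' code. The function names, the toy coordinates, and the haversine formula for arc distance are hypothetical choices. The normalization assumes the form 1 / (w / (0.20 · median(w)) + 1), which gives P = 1/6 (about 0.17) at the median proximity. Drawing with `Generator.choice` on P rescaled to sum to 1 is also an assumption about how "drew samples using calculated P values" might be realized.

```python
import numpy as np


def arc_distance_deg(lat1, lon1, lat2, lon2):
    """Great-circle distance in degrees (haversine; an assumed choice)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return np.degrees(2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0))))


def proximity(lat, lon, age, scale_spatial=0.5, scale_age=10.0):
    """w(x) = sum_i [1/(s^2 + 1) + 1/(t^2 + 1)], summed over all n samples.

    The sum runs over i = 1..n, so each sample's pairing with itself
    (s = t = 0) contributes 2 to its own w.
    """
    lat, lon, age = (np.asarray(v, dtype=float) for v in (lat, lon, age))
    s = arc_distance_deg(lat[:, None], lon[:, None],
                         lat[None, :], lon[None, :]) / scale_spatial
    t = (age[:, None] - age[None, :]) / scale_age
    return (1.0 / (s ** 2 + 1.0) + 1.0 / (t ** 2 + 1.0)).sum(axis=1)


def selection_probability(w, target=0.20):
    """Assumed normalization: larger proximity w gives smaller P."""
    w = np.asarray(w, dtype=float)
    return 1.0 / (w / (target * np.median(w)) + 1.0)


def resample(values, sigma, p, n_draws, rng):
    """Draw rows with replacement, then add sigma times standard-normal noise."""
    values = np.asarray(values, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    p = np.asarray(p, dtype=float)
    # Step 1: weighted draw with replacement (P rescaled to sum to 1).
    idx = rng.choice(len(values), size=n_draws, replace=True, p=p / p.sum())
    # Step 2: error value = assigned uncertainty * N(0, 1) draw.
    errors = sigma[idx] * rng.standard_normal(size=(n_draws, values.shape[1]))
    # Step 3: add the errors to the drawn values.
    return values[idx] + errors


# Toy data (hypothetical): three clustered samples and one isolated sample.
lat = np.array([10.0, 10.1, 10.2, -40.0])
lon = np.array([20.0, 20.1, 20.2, 100.0])
age = np.array([500.0, 501.0, 502.0, 1200.0])   # Ma
values = np.column_stack([age, [3.0, 3.2, 2.9, 8.0]])          # age, U (ppm)
sigma = np.column_stack([np.full(4, 5.0), np.full(4, 0.1)])    # uncertainties

w = proximity(lat, lon, age)
p = selection_probability(w)
rng = np.random.default_rng(0)
draws = resample(values, sigma, p, n_draws=10_000, rng=rng)
```

In this toy case the isolated fourth sample receives the smallest w and therefore the largest P, so it is drawn more often than any single member of the cluster, which is the intended effect of the inverse proximity weighting.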