Page 5 - i1052-5173-32-11

analyzes the relationship between citations and their textual context (i.e., whether the citation is used positively or negatively). SCITLDR is used to create a short summary of the given paper (without truly understanding what the underlying content means). Our work is complementary to these directions, because we aim for deeper language understanding. That is, the purpose of the proposed approach is to spatially and temporally contextualize a given geoscience research question and to identify whether the content of the papers analyzed supports or negates it.

For this purpose, we developed an application to geosciences to demonstrate the potential of our proposed approach, to experiment with the limitations of this type of literature, and to explore how they can be overcome. The application investigates the research question of whether there is a causal relationship between volcanism and climate change in the geologic record as seen through the lens of published literature. Specifically, we ask whether volcanism influenced climate change in the deep time geologic archive. We selected this question because several geological studies seem to support this link (e.g., Lee and Dee, 2019). Our results indicate more variability on whether or not available studies on the subject actually support this research question.

SYSTEMATIC MACHINE REVIEW OF GEOSCIENCE DATA

Since there was no pre-built corpus for this geosciences task, we extracted 1164 papers from the Web of Science website via the University of Arizona’s library. These papers were selected because they contained keywords relevant to the research question at hand, such as volcanism or magmatism, and climate change. This was implemented as the Boolean query: (volcanism OR magmatism) AND “climate change,” where OR and AND are the disjunctive and conjunctive Boolean operators, and quotes indicate that the entire phrase must be present. This query returned the 1164 papers. We then randomly chose 200 papers and extracted the abstract, introduction, and conclusion sections from each paper to be manually annotated for whether they support or do not support the research question. Note that for this work we assume that the authors’ data, interpretations, and conclusions are correct. The annotation task was conducted on FindingFive (https://www.findingfive.com), an online annotation platform. The papers were placed into one of four classes: SUPPORT, NEGATE, NEGATE&SUPPORT, and UNRELATED (see Table 1). The annotations for these four classes were collected by two of the co-authors of this effort, who are domain experts (i.e., geoscientists). The two annotators worked independently.

Next, we implemented a natural language processing (NLP) component for geosciences that extracts two types of information. First, we contextualized individual publications by extracting and normalizing the geospatial and temporal contexts addressed in these papers (e.g., Pliocene, 4 million years ago, and Bering Sea). For example, Tucson and Saguaro National Park can be considered as the same geographic location (for the purposes of this analysis), even though they are described differently in text. To facilitate the consolidation of findings, we normalized the geospatial contexts to absolute latitude/longitude coordinates (see the next section for details). Similarly, temporal expressions such as 4 million years ago were converted to geological eras or epochs (e.g., Paleoproterozoic) to have a better overall understanding of the relationship between volcanism and climate change on the geological time scale.

Second, we built a document classifier that is trained to determine whether any given paper supports the observation that “volcanism affected climate change,” so that we could make a prediction on new papers. The results of these two components were aggregated into a publication knowledge base, which contains the publication itself, the prediction of our classifier (SUPPORT, NEGATE, NEGATE&SUPPORT, or UNRELATED; see Table 1 for details), the occurrence of geological eras and epochs (e.g., the frequency of Pliocene in a given paper), and the occurrence of geological locations (e.g., the frequency of Africa in a given paper). We used this knowledge base to visualize the evidence for the research question investigated on the world map to identify global temporal and geospatial patterns.

THE HYBRID MACHINE-HUMAN APPROACH

Below, we detail the three key components of our hybrid machine-human approach in this experiment.

Contextualizing Findings: Time and Site Identification

To analyze the relationship between volcanism and climate change at different times in the geological past and at different locations, we built a custom Named Entity Recognizer to extract spatial and temporal information from the analyzed text. Named entity recognition (NER) is a common NLP task that aims to identify named entities within the given text and classify or categorize those entities under various predefined classes. Our focus in this work is on the identification of locations and geological eras and epochs, which are necessary to contextualize the findings discussed in the papers.

Existing NER tools such as Stanford’s CoreNLP (Manning et al., 2014) or spaCy (Honnibal and Montani, 2017) focus on generic locations, times, and dates rather than geoscience-specific ones. For example, when we fed the sample sentence “Clay mineral assemblages and crystallinities in sediments from IODP Site 1340 in the Bering Sea were analyzed in order to trace sediment sources and reconstruct the paleoclimatic history of the Bering Sea since Pliocene (the last 4.3 Ma)” into the Stanford CoreNLP NER, the result was:

Clay mineral assemblages and crystallinities in sediments from IODP Site [1340]DATE in the [Bering Sea]LOCATION were analyzed in order to trace sediment sources and reconstruct the [paleoclimatic]MISC history of the [Bering Sea]LOCATION since Pliocene (the last [4.3]NUMBER Ma).

Even though the Stanford CoreNLP NER correctly identified Bering Sea as a LOCATION, it did not recognize geosciences-specific expressions, and, further, it classified expressions into the incorrect

TABLE 1. NAMES AND DESCRIPTIONS OF THE LABELS USED DURING THE MACHINE CLASSIFICATION PROCESS*

Classification label    Definition
Support                 The given text supports the relationship between volcanism and climate change.
Negate                  The given text negates the relationship between volcanism and climate change.
Negate&Support          The same overall text both supports and negates the relationship between volcanism and climate change, with different paragraphs discussing each relationship.
Unrelated               The given text is unrelated to the topic at hand, i.e., the relationship between volcanism and climate change.

*See text footnote 1.
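
The temporal normalization described in the text (converting an expression such as 4 million years ago into a geological epoch) can be illustrated with a minimal sketch. This is not the authors' implementation: the names CENOZOIC_EPOCHS, AGE_PATTERN, epoch_for_age, and normalize_ages are hypothetical, the sketch covers only the Cenozoic, and the boundary ages are approximate values taken from the International Chronostratigraphic Chart.

```python
import re

# Approximate Cenozoic epoch boundaries in Ma (older bound, younger bound).
# Assumption: values follow the ICS chart; only the Cenozoic is covered here.
CENOZOIC_EPOCHS = [
    ("Holocene", 0.0117, 0.0),
    ("Pleistocene", 2.58, 0.0117),
    ("Pliocene", 5.333, 2.58),
    ("Miocene", 23.03, 5.333),
    ("Oligocene", 33.9, 23.03),
    ("Eocene", 56.0, 33.9),
    ("Paleocene", 66.0, 56.0),
]

# Matches numeric ages such as "4.3 Ma" or "4 million years ago".
AGE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(?:Ma|million years ago)\b")


def epoch_for_age(age_ma):
    """Return the Cenozoic epoch containing age_ma, or None if it is older."""
    for name, older, younger in CENOZOIC_EPOCHS:
        if younger <= age_ma < older:
            return name
    return None


def normalize_ages(text):
    """Return (matched expression, epoch) pairs for every age found in text."""
    return [
        (m.group(0), epoch_for_age(float(m.group(1))))
        for m in AGE_PATTERN.finditer(text)
    ]
```

Under these assumptions, normalize_ages applied to the phrase "since Pliocene (the last 4.3 Ma)" pairs the expression 4.3 Ma with the Pliocene. Ages older than 66 Ma return None, so a fuller table of eras and periods would be needed for deep-time literature.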

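
To make the NER discussion concrete, the following rule-based sketch tags the geoscience-specific entities in the sample sentence that a generic tagger misses. It is a stand-in for illustration only, not the custom Named Entity Recognizer built by the authors: the names EPOCHS, GAZETTEER, SITE_PATTERN, and tag_entities are hypothetical, the epoch list is deliberately short, and the Bering Sea coordinates are a rough central point used as a placeholder.

```python
import re

# Assumption: a tiny illustrative vocabulary; a real system would need full lists.
EPOCHS = ["Pleistocene", "Pliocene", "Miocene"]

# Assumption: rough (latitude, longitude) placeholder, not a surveyed position.
GAZETTEER = {"Bering Sea": (58.0, -178.0)}

# Drilling-site identifiers such as "IODP Site 1340" or "IODP Site U1340".
SITE_PATTERN = re.compile(r"\b(?:IODP|ODP|DSDP) Site U?\d+\b")


def tag_entities(text):
    """Return (entity text, label) pairs in order of appearance in text."""
    found = []
    for m in SITE_PATTERN.finditer(text):
        found.append((m.start(), m.group(0), "SITE"))
    for name in GAZETTEER:
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), name, "LOCATION"))
    for epoch in EPOCHS:
        for m in re.finditer(r"\b" + epoch + r"\b", text):
            found.append((m.start(), epoch, "EPOCH"))
    # Sorting by character offset restores reading order.
    return [(entity, label) for _, entity, label in sorted(found)]


SAMPLE = (
    "Clay mineral assemblages and crystallinities in sediments from IODP Site "
    "1340 in the Bering Sea were analyzed in order to trace sediment sources "
    "and reconstruct the paleoclimatic history of the Bering Sea since "
    "Pliocene (the last 4.3 Ma)"
)
```

With this vocabulary, tag_entities(SAMPLE) labels IODP Site 1340 as a SITE rather than splitting off 1340 as a date, and labels Pliocene as an EPOCH. Each LOCATION can then be normalized to coordinates through a GAZETTEER lookup.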
                                                                                          www.geosociety.org/gsatoday  5