Input text data

The bibcat/data/ directory includes a script that prepares the input dataset for the LLM-based method. This dataset combines the papertext (full-text) and papertrack (classification) data into a single input file.

To construct the combined input dataset file (combined_data*.json), two JSON sources are required: the full-text data and the corresponding classification label data. We refer to these as:

  • papertext – the full-text dataset

  • papertrack – the classification labels from MAST PaperTrack

Minimum required keys and structure for ``papertext``:

[
  {
    "bibcode": "...",
    "abstract": "...",
    "pubdate": "...",
    "title": ["..."],
    "body": "..."
  }
]
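
As a quick sanity check before building the combined file, the minimum keys listed above can be verified per record. This is only a minimal sketch and not part of the bibcat script; the names `PAPERTEXT_KEYS`, `missing_papertext_keys`, and `check_papertext` are hypothetical.

```python
import json

# Minimum keys for each papertext record, as listed above.
PAPERTEXT_KEYS = {"bibcode", "abstract", "pubdate", "title", "body"}


def missing_papertext_keys(record):
    """Return the sorted list of required keys absent from one record."""
    return sorted(PAPERTEXT_KEYS - set(record))


def check_papertext(records):
    """Map record index -> missing keys, for incomplete records only."""
    problems = {}
    for index, record in enumerate(records):
        missing = missing_papertext_keys(record)
        if missing:
            problems[index] = missing
    return problems


# Two illustrative records: the first is complete, the second is not.
sample = json.loads(
    '[{"bibcode": "...", "abstract": "...", "pubdate": "...",'
    ' "title": ["..."], "body": "..."},'
    ' {"bibcode": "..."}]'
)
print(check_papertext(sample))
```

An empty result means every record carries the minimum keys; extra keys such as author or keyword are left untouched.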

Required keys and structure for ``papertrack``:

[
  {
    "bibcode": "...",
    "searches": [
      {
        "search_key": "...",
        "ignored": false
      }
    ],
    "class_missions": [
      {
        "bibcode": "...",
        "papertype": "..."
      }
    ]
  }
]

Here:

  • search_key is the search key (mast or library),

  • ignored is a boolean (true or false) that indicates whether the paper is unrelated to the mission,

  • papertype is the classification label (e.g., SCIENCE, DATA-INFLUENCED, MENTION).
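
To illustrate how these three fields can be read from one papertrack record, here is a small sketch. It assumes the structure shown above; the function name `summarize_papertrack` and the output keys `active_searches` and `papertypes` are made up for this example and are not part of bibcat.

```python
def summarize_papertrack(record):
    """Collect the non-ignored search keys and the papertype labels of a record."""
    # Search keys whose "ignored" flag is false, i.e. the paper is considered related.
    active = [s["search_key"] for s in record["searches"] if not s["ignored"]]
    # One classification label per class_missions entry.
    papertypes = [c["papertype"] for c in record["class_missions"]]
    return {
        "bibcode": record["bibcode"],
        "active_searches": active,
        "papertypes": papertypes,
    }


# Illustrative record following the papertrack structure above.
record = {
    "bibcode": "3023Natur.111..123y",
    "searches": [
        {"search_key": "mast", "ignored": True},
        {"search_key": "library", "ignored": False},
    ],
    "class_missions": [
        {"bibcode": "3023Natur.111..123y", "papertype": "SCIENCE"},
    ],
}
print(summarize_papertrack(record))
```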

Minimum keys required in the combined input dataset for BibCat:

[
  "bibcode",
  "abstract",
  "pubdate",
  "title",
  "body",
  "class_missions",
  "papertype"
]

Keys in the final combined input dataset:

[
  "bibcode",
  "abstract",
  "author",
  "keyword",
  "keyword_norm",
  "pubdate",
  "title",
  "body",
  "class_missions",
  "papertype",
  "is_ignored_<mission>"
]

Here:

  • class_missions maps each paper to its classification(s) by mission.

  • is_ignored_<mission> is a boolean flag indicating whether the paper is unrelated to the specified mission.

  • <mission> is either mast or library.
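
The combination step can be pictured as a join on bibcode. The sketch below is not the script in bibcat/data/; it is a rough illustration under stated assumptions: papers without a papertrack record are skipped, class_missions is copied over unchanged (how the real script keys it by mission is not described here), and each search entry becomes one is_ignored_<search_key> flag. The function name `combine` is hypothetical.

```python
def combine(papertext, papertrack):
    """Join papertext and papertrack records on bibcode (illustrative only)."""
    track_by_bibcode = {rec["bibcode"]: rec for rec in papertrack}
    combined = []
    for text in papertext:
        track = track_by_bibcode.get(text["bibcode"])
        if track is None:
            continue  # assumption: papers without labels are left out
        entry = dict(text)  # keep every full-text key as-is
        entry["class_missions"] = track["class_missions"]  # copied unchanged
        for search in track["searches"]:
            # e.g. search_key "mast" -> "is_ignored_mast"
            entry[f"is_ignored_{search['search_key']}"] = search["ignored"]
        combined.append(entry)
    return combined


papertext = [
    {"bibcode": "A", "abstract": "...", "pubdate": "...", "title": ["..."], "body": "..."},
    {"bibcode": "B", "abstract": "...", "pubdate": "...", "title": ["..."], "body": "..."},
]
papertrack = [
    {
        "bibcode": "A",
        "searches": [
            {"search_key": "mast", "ignored": True},
            {"search_key": "library", "ignored": False},
        ],
        "class_missions": [{"bibcode": "A", "papertype": "SCIENCE"}],
    }
]
result = combine(papertext, papertrack)
print(sorted(result[0]))
```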

An example entry of the combined dataset in JSON format is as follows.

{
  "bibcode": "3023Natur.111..123y",
  "abstract": "We report a newly discovered Type Ia supernova, SN 3023X.",
  "author": [
    "Lastname1, Firstname1",
    "Lastname2, Firstname2"
  ],
  "keyword": [
    "Astrophysics - High Energy Astrophysical Phenomena"
  ],
  "keyword_norm": [
    "-"
  ],
  "pubdate": "3023-12-00",
  "title": [
    "A discovery of a new peculiar Type Ia supernova"
  ],
  "body": "We report the HST observation of SN 3023X, a Type Ia supernova exhibiting unusually slow decline rates and strong carbon absorption features pre-maximum—traits inconsistent with canonical models. Located in a passive elliptical galaxy at z = 0.034, its peak luminosity was 0.7 mag fainter than normal SNe Ia. Spectroscopic evolution suggests incomplete detonation or a hybrid progenitor. The anomaly challenges the standard candle assumption and may represent a new subclass. Continued photometric and spectroscopic monitoring is underway. SN 3023X offers a rare window into the diversity of thermonuclear explosions.",
  "class_missions": {
    "HST": {
      "bibcode": "3023Natur.111..123y",
      "papertype": "SCIENCE"
    }
  },
  "is_ignored_library": false,
  "is_ignored_mast": true
}
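
Once entries have this shape, a label tally per mission takes only a few lines. The following is a hedged sketch and not a bibcat API: it assumes class_missions is a dictionary keyed by mission name, as in the example above, and the function name `papertype_counts` and its `ignore_flag` parameter are hypothetical.

```python
from collections import Counter


def papertype_counts(entries, ignore_flag="is_ignored_library"):
    """Count (mission, papertype) pairs, skipping entries flagged as ignored."""
    counts = Counter()
    for entry in entries:
        if entry.get(ignore_flag, False):
            continue  # flagged as unrelated for this search
        for mission, info in entry["class_missions"].items():
            counts[(mission, info["papertype"])] += 1
    return counts


# Trimmed version of the example entry above.
entries = [
    {
        "bibcode": "3023Natur.111..123y",
        "class_missions": {
            "HST": {"bibcode": "3023Natur.111..123y", "papertype": "SCIENCE"}
        },
        "is_ignored_library": False,
        "is_ignored_mast": True,
    }
]
print(papertype_counts(entries))
print(papertype_counts(entries, ignore_flag="is_ignored_mast"))
```

With the example entry, filtering on is_ignored_library keeps the HST record, while filtering on is_ignored_mast drops it.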