Input text data
The bibcat/data/ directory includes a script that prepares the input dataset for the LLM-based method. This dataset combines the papertext (text) and papertrack (classification) data into a single input file.
To construct the combined input dataset file (combined_data*.json), two JSON sources are required: full-text data and corresponding classified label data. We refer to these as:
papertext – the full-text dataset
papertrack – the classification labels from MAST PaperTrack
Minimum required keys and structure for ``papertext``:
[
{
"bibcode": "...",
"abstract": "...",
"pubdate": "...",
"title": ["..."],
"body": "..."
}
]
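As a quick sanity check before combining, the minimum keys listed above can be verified per record. The following is a minimal sketch, not part of BibCat itself: the function name `missing_papertext_keys` and the sample records are hypothetical and used only for illustration.

```python
# Minimum keys listed above for a papertext record.
REQUIRED_PAPERTEXT_KEYS = ("bibcode", "abstract", "pubdate", "title", "body")


def missing_papertext_keys(record: dict) -> list:
    """Return the required keys that are absent from one papertext record."""
    return [key for key in REQUIRED_PAPERTEXT_KEYS if key not in record]


# Hypothetical records, for illustration only.
records = [
    {"bibcode": "...", "abstract": "...", "pubdate": "...", "title": ["..."], "body": "..."},
    {"bibcode": "...", "title": ["..."]},
]

# Map record index -> missing keys, keeping only records with problems.
problems = {}
for index, record in enumerate(records):
    missing = missing_papertext_keys(record)
    if missing:
        problems[index] = missing

print(problems)
```

Running a check like this before the combine step makes it easier to trace an incomplete combined file back to a specific papertext record.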
Required keys and structure for ``papertrack``:
[
{
"bibcode": "...",
"searches": [
{
"search_key": "...",
"ignored": false
}
],
"class_missions": [
{
"bibcode": "...",
"papertype": "..."
}
]
}
]
Here:
search_key – the search key (mast or library)
ignored – a boolean (true or false) indicating whether the paper is unrelated to the mission
papertype – the classification label (e.g., SCIENCE, DATA-INFLUENCED, MENTION)
Keys in the combined input dataset that BibCat requires:
[
"bibcode",
"abstract",
"pubdate",
"title",
"body",
"class_missions",
"papertype"
]
Keys in the final combined input dataset:
[
"bibcode",
"abstract",
"author",
"keyword",
"keyword_norm",
"pubdate",
"title",
"body",
"class_missions",
"papertype",
"is_ignored_<mission>"
]
class_missions – maps each paper to its classification(s) by mission
is_ignored_<mission> – a boolean flag indicating whether the paper is unrelated to the specified mission, where <mission> is either mast or library
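One plausible way the is_ignored_<mission> flags relate to the papertrack input is sketched below. It assumes that each entry in `searches` yields one flag named after its `search_key`, with the value of `ignored`; this mapping is an assumption drawn from the key names, not a description of the actual script. The function name `ignored_flags` is hypothetical.

```python
def ignored_flags(papertrack_record: dict) -> dict:
    """Build is_ignored_<mission> flags from the `searches` list of one record.

    Assumption: one flag per search entry, named after its `search_key`.
    """
    flags = {}
    for search in papertrack_record.get("searches", []):
        flags[f"is_ignored_{search['search_key']}"] = bool(search["ignored"])
    return flags


# Hypothetical papertrack record, for illustration only.
record = {
    "bibcode": "...",
    "searches": [
        {"search_key": "mast", "ignored": True},
        {"search_key": "library", "ignored": False},
    ],
    "class_missions": [],
}

print(ignored_flags(record))
```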
An example record in the combined dataset JSON format:
{
"bibcode": "3023Natur.111..123y",
"abstract": "We report a newly discovered Type Ia supernova, SN 3023X.",
"author": [
"Lastname1, Firstname1",
"Lastname2, Firstname2"
],
"keyword": [
"Astrophysics - High Energy Astrophysical Phenomena"
],
"keyword_norm": [
"-"
],
"pubdate": "3023-12-00",
"title": [
"A discovery of a new peculiar Type Ia supernovae"
],
"body": "We report the HST observation of SN 3023X, a Type Ia supernova exhibiting unusually slow decline rates and strong carbon absorption features pre-maximum—traits inconsistent with canonical models. Located in a passive elliptical galaxy at z = 0.034, its peak luminosity was 0.7 mag fainter than normal SNe Ia. Spectroscopic evolution suggests incomplete detonation or a hybrid progenitor. The anomaly challenges the standard candle assumption and may represent a new subclass. Continued photometric and spectroscopic monitoring is underway. SN 3023X offers a rare window into the diversity of thermonuclear explosions.",
"class_missions": {
"HST": {
"bibcode": "3023Natur.111..123y",
"papertype": "SCIENCE"
}
},
"is_ignored_library": false,
"is_ignored_mast": true
}
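To show how a record of this shape might be read, here is a small sketch that parses a trimmed copy of the example and extracts the label per mission. It assumes `class_missions` is an object keyed by mission name, as in the example above; the helper name `papertypes_by_mission` is hypothetical.

```python
import json

# Trimmed copy of the example record (only the fields used here).
example_json = """
{
  "bibcode": "3023Natur.111..123y",
  "class_missions": {
    "HST": {"bibcode": "3023Natur.111..123y", "papertype": "SCIENCE"}
  },
  "is_ignored_library": false,
  "is_ignored_mast": true
}
"""


def papertypes_by_mission(record: dict) -> dict:
    """Map each mission name in `class_missions` to its papertype label."""
    return {
        mission: entry["papertype"]
        for mission, entry in record["class_missions"].items()
    }


record = json.loads(example_json)
labels = papertypes_by_mission(record)

print(labels)
print(record["is_ignored_mast"], record["is_ignored_library"])
```

Note that JSON `true`/`false` become Python `True`/`False` after parsing, so the is_ignored_<mission> flags can be used directly in conditions.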