bibcat.data package
Submodules
bibcat.data.build_dataset module
- title:
build_dataset.py
This module will produce the input corpus data in JSON format by combining the MAST papertrack JSON file and the ADS full text JSON file.
Run example: bibcat train
- bibcat.data.build_dataset.build_dataset() → None
Build the source dataset
Builds the source dataset by combining the papertrack data with the ADS full papertext data. The result is used as input for transformer models or LLM models.
- bibcat.data.build_dataset.combine_datasets(trimmed_papertext_data: list[dict], papertrack_data: list[dict])
Combine the papertrack and papertext data
Combines the two datasets into a single source dataset, used for LLM models or for training transformer models.
- Parameters:
trimmed_papertext_data (list[dict]), papertrack_data (list[dict])
- Returns:
a tuple of four items: the combined data as a list of dictionaries, the list of papertrack bibcodes not in the papertext data, the list of papertext bibcodes not in the papertrack data, and the papertext entries not in the papertrack data as a list of dictionaries
- Return type:
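The four-part return value described above can be illustrated with a minimal sketch. It is not the bibcat implementation: the function name `combine_by_bibcode` is hypothetical, and it assumes every record carries its identifier under a `"bibcode"` key.

```python
def combine_by_bibcode(papertext, papertrack, key="bibcode"):
    """Merge two lists of records on a shared key (hypothetical sketch)."""
    # Assumption: each record is a dict holding its identifier under `key`.
    text_by_code = {rec[key]: rec for rec in papertext}
    track_by_code = {rec[key]: rec for rec in papertrack}

    # Records present in both inputs are merged into one dictionary.
    combined = [
        {**text_by_code[code], **track_by_code[code]}
        for code in text_by_code
        if code in track_by_code
    ]
    # Bibcodes present on only one side are reported separately.
    track_only = [code for code in track_by_code if code not in text_by_code]
    text_only = [code for code in text_by_code if code not in track_by_code]
    text_only_records = [text_by_code[code] for code in text_only]
    return combined, track_only, text_only, text_only_records
```

The return order follows the order in the Returns description: combined records, papertrack-only bibcodes, papertext-only bibcodes, papertext-only records.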
- bibcat.data.build_dataset.extract_papertext_info(dataset: list[dict]) → tuple[list[str], list[str]]
Extract the papertext bibcodes and publication dates
Extracts and returns the papertext bibcodes and pubdates.
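As a rough illustration of returning two parallel lists from a list of records, consider the sketch below. The helper name `extract_fields` and the field names `"bibcode"` and `"pubdate"` are assumptions made for the example; the actual keys in the papertext JSON may differ.

```python
def extract_fields(dataset, fields=("bibcode", "pubdate")):
    """Return one list per requested field (hypothetical sketch)."""
    # Assumption: each record is a dict; a missing field yields None.
    return tuple([rec.get(field) for rec in dataset] for field in fields)


records = [
    {"bibcode": "A", "pubdate": "2020-01"},
    {"bibcode": "B", "pubdate": "2021-06"},
]
bibcodes, pubdates = extract_fields(records)
```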
- bibcat.data.build_dataset.extract_papertrack_info(dataset: list[dict]) → tuple[list[None | dict], list[None | str], list[None | dict]]
Extract papertrack info
Extracts and returns three lists of papertrack values: the searches, the bibcodes, and the missions and papertypes.
- bibcat.data.build_dataset.file_exists(filelist: list) → bool
Check whether any file in the list exists
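An "any file exists" check of this kind can be sketched with `pathlib`. The name `any_file_exists` is hypothetical, and the sketch only assumes that the list holds path-like entries.

```python
from pathlib import Path


def any_file_exists(filelist):
    """Return True if at least one path in filelist exists (hypothetical sketch)."""
    # any() stops at the first existing path; an empty list gives False.
    return any(Path(name).exists() for name in filelist)
```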
- bibcat.data.build_dataset.load_datasets(path_papertext: Path, path_papertrack: Path) → tuple[list[dict], list[dict]]
Load the papertrack and papertext JSON datasets
Loads the papertrack and papertext datasets and returns them as a tuple of lists of dictionaries.
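Loading two JSON files into a pair of lists might look like the following sketch. The function name `load_json_pair` is hypothetical, and returning the data in argument order (papertext first) is an assumption, since the return order is not stated here.

```python
import json
from pathlib import Path


def load_json_pair(path_papertext: Path, path_papertrack: Path):
    """Load two JSON files and return their contents (hypothetical sketch)."""
    with path_papertext.open(encoding="utf-8") as handle:
        papertext = json.load(handle)
    with path_papertrack.open(encoding="utf-8") as handle:
        papertrack = json.load(handle)
    # Assumption: results are returned in the same order as the arguments.
    return papertext, papertrack
```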
- bibcat.data.build_dataset.load_source_dataset()
Load the original source dataset, which combines the papertrack classifications with the ADS full text. Returns the JSON content as a dictionary.
- bibcat.data.build_dataset.missing_bibcodes_in_papertext(bibcodes_papertext: list[str], bibcodes_papertrack: list[str]) → list[str] | None
Return the papertrack bibcodes that are not in the papertext
Returns the list of the papertrack bibcodes not in the papertext.
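One way to compute such a difference while keeping the original order is sketched below. The helper name `missing_from` is hypothetical. Unlike the documented signature, which also allows `None`, this sketch always returns a list.

```python
def missing_from(reference, candidates):
    """Return the candidates absent from reference, in order (hypothetical sketch)."""
    # A set gives constant-time membership tests on the reference bibcodes.
    reference_set = set(reference)
    return [code for code in candidates if code not in reference_set]


bibcodes_papertext = ["A", "B"]
bibcodes_papertrack = ["B", "C", "D"]
missing = missing_from(bibcodes_papertext, bibcodes_papertrack)
```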
- bibcat.data.build_dataset.save_text_files(missing_papertext_bibcodes: list, missing_papertrack_bibcodes: list) → None
Save the text files of the missing bibcodes
Saves the bibcodes missing from the papertext data and from the papertrack data as text files.
Parameters:
- missing_papertext_bibcodes: list
the list of the missing papertext bibcodes
- missing_papertrack_bibcodes: list
the list of the missing papertrack bibcodes
Returns: None
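Writing a list of bibcodes to a text file, one per line, can be sketched as follows. The helper name `write_bibcode_list` and the one-bibcode-per-line layout are assumptions; the output file names and locations used by bibcat are not stated here.

```python
from pathlib import Path


def write_bibcode_list(bibcodes, path):
    """Write one bibcode per line to a text file (hypothetical sketch)."""
    # An empty list produces an empty file.
    text = "".join(f"{code}\n" for code in bibcodes)
    Path(path).write_text(text, encoding="utf-8")
```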
Module contents
This module contains the script to create the input dataset.