bibcat.data package#

Submodules#

bibcat.data.build_dataset module#

title:

build_dataset.py

This module produces the input corpus data in JSON format by combining the MAST papertrack JSON file with the ADS full-text JSON file.

Run example: bibcat train

bibcat.data.build_dataset.build_dataset() → None[source]#

Build the source dataset

Builds the dataset used for transformer or LLM models by combining the papertrack data and the ADS full-text (papertext) data.

bibcat.data.build_dataset.combine_datasets(trimmed_papertext_data: list[dict], papertrack_data: list[dict])[source]#

Combine the papertrack and papertext data

Combines the two datasets into a single source dataset for LLM or transformer model training.

Parameters:
  • trimmed_papertext_data (list[dict]) – the trimmed papertext data with only necessary keys

  • papertrack_data (list[dict]) – the papertrack data

Returns:

a tuple of four items: the list of combined-data dictionaries, the list of papertrack bibcodes not in the papertext data, the list of papertext bibcodes not in the papertrack data, and the list of papertext dictionaries not in the papertrack data

Return type:

tuple
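The four-part return value described above can be illustrated with a minimal sketch. It assumes every record carries a `bibcode` key and that records are matched on that key. The function name `combine_by_bibcode` and the merge strategy are hypothetical and are not taken from the bibcat source.

```python
def combine_by_bibcode(papertext: list[dict], papertrack: list[dict]) -> tuple:
    """Hypothetical sketch: join two record lists on their 'bibcode' key."""
    # Index both inputs by bibcode (assumes bibcodes are unique per input).
    text_by_code = {rec["bibcode"]: rec for rec in papertext}
    track_by_code = {rec["bibcode"]: rec for rec in papertrack}

    # Records present in both inputs are merged into one dictionary.
    combined = [
        {**text_by_code[code], **track_by_code[code]}
        for code in text_by_code
        if code in track_by_code
    ]
    # Papertrack bibcodes with no matching papertext record.
    track_only = [code for code in track_by_code if code not in text_by_code]
    # Papertext bibcodes, and their records, with no papertrack match.
    text_only = [code for code in text_by_code if code not in track_by_code]
    text_only_records = [text_by_code[code] for code in text_only]
    return combined, track_only, text_only, text_only_records
```

Indexing by bibcode first keeps each membership test constant-time, which matters when both inputs hold many thousands of records.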

bibcat.data.build_dataset.extract_papertext_info(dataset: list[dict]) → tuple[list[str], list[str]][source]#

Extract the papertext bibcodes and publication dates

Extracts and returns the papertext bibcodes and pubdates.

Parameters:

dataset (list[dict]) – the papertext dataset

Returns:

a tuple of the list of bibcodes and the list of pubdates

Return type:

tuple[list[str], list[str]]
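As a rough illustration of this extraction, the sketch below assumes each papertext record has `bibcode` and `pubdate` keys, matching the key names listed for the trimmed papertext data. The function name `extract_bibcodes_and_pubdates` is hypothetical.

```python
def extract_bibcodes_and_pubdates(dataset: list[dict]) -> tuple[list[str], list[str]]:
    """Hypothetical sketch: collect two parallel lists from the records."""
    # Both lists follow the input order, so index i refers to the same paper.
    bibcodes = [rec["bibcode"] for rec in dataset]
    pubdates = [rec["pubdate"] for rec in dataset]
    return bibcodes, pubdates
```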

bibcat.data.build_dataset.extract_papertrack_info(dataset: list[dict]) → tuple[list[None | dict], list[None | str], list[None | dict]][source]#

Extract papertrack info

Extracts and returns the papertrack values: searches, bibcodes, and missions and papertypes.

Parameters:

dataset (list[dict]) – the papertrack dataset

Returns:

a tuple of three lists: the searches dictionaries, the bibcodes, and the missions_and_papertypes dictionaries

Return type:

tuple[list[None | dict], list[None | str], list[None | dict]]

Raises:

ValueError – when the number of unique bibcodes differs from the total number of bibcodes, i.e. when the bibcodes contain duplicates
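The `ValueError` condition reads as a duplicate check. A minimal sketch of such a check follows, assuming the comparison is between the size of the set of bibcodes and the length of the full list. The function name `check_unique_bibcodes` and the error message are hypothetical.

```python
def check_unique_bibcodes(bibcodes: list[str]) -> None:
    """Hypothetical sketch: raise ValueError if any bibcode is repeated."""
    # A set drops repeats, so a size mismatch means duplicates exist.
    if len(set(bibcodes)) != len(bibcodes):
        raise ValueError("duplicate bibcodes found in the papertrack data")
```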

bibcat.data.build_dataset.file_exists(filelist: list) → bool[source]#

Check if any file exists among the list of files
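A check of this kind can be sketched with `pathlib`. The function name `any_file_exists` is hypothetical, and the sketch assumes the list holds paths or path strings.

```python
from pathlib import Path


def any_file_exists(filelist: list) -> bool:
    """Hypothetical sketch: True if at least one listed path exists."""
    # any() stops at the first existing path; an empty list yields False.
    return any(Path(name).exists() for name in filelist)
```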

bibcat.data.build_dataset.load_datasets(path_papertext: Path, path_papertrack: Path) → tuple[list[dict], list[dict]][source]#

Load the papertrack and papertext JSON datasets

Loads the papertrack and papertext datasets and returns a tuple of the lists of dictionaries.

Parameters:
  • path_papertext (Path) – the path to the papertext data file

  • path_papertrack (Path) – the path to the papertrack data file

Returns:

the tuple of the lists of the papertext and papertrack datasets

Return type:

tuple[list[dict], list[dict]]
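Loading two JSON files into lists of dictionaries might look like the sketch below. It assumes each file contains a top-level JSON array of objects and is UTF-8 encoded. The function name `load_json_records` is hypothetical.

```python
import json
from pathlib import Path


def load_json_records(path_papertext: Path, path_papertrack: Path) -> tuple[list[dict], list[dict]]:
    """Hypothetical sketch: read two JSON files and return their contents."""
    # Assumes each file holds a JSON array of objects.
    with open(path_papertext, encoding="utf-8") as f:
        papertext = json.load(f)
    with open(path_papertrack, encoding="utf-8") as f:
        papertrack = json.load(f)
    return papertext, papertrack
```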

bibcat.data.build_dataset.load_source_dataset()[source]#

Load the original source dataset, which combines the papertrack classifications with the ADS full text. Returns a dictionary of the JSON content.

bibcat.data.build_dataset.missing_bibcodes_in_papertext(bibcodes_papertext: list[str], bibcodes_papertrack: list[str]) → list[str] | None[source]#

Return the papertrack bibcodes that are not in the papertext

Returns the list of the papertrack bibcodes not in the papertext.

Parameters:
  • bibcodes_papertext (list[str]) – the bibcodes in papertext

  • bibcodes_papertrack (list[str]) – the bibcodes in papertrack

Returns:

the list of the papertrack bibcodes that are not in the papertext

Return type:

list[str] | None
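The lookup described above amounts to a set difference. The sketch below keeps the papertrack order and always returns a list. When the documented function returns `None` is not stated here, so that case is left out. The function name `bibcodes_missing_from_papertext` is hypothetical.

```python
def bibcodes_missing_from_papertext(
    bibcodes_papertext: list[str], bibcodes_papertrack: list[str]
) -> list[str]:
    """Hypothetical sketch: papertrack bibcodes absent from the papertext."""
    # A set gives constant-time membership tests.
    known = set(bibcodes_papertext)
    # Iterating over the papertrack list preserves its original order.
    return [code for code in bibcodes_papertrack if code not in known]
```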

bibcat.data.build_dataset.save_text_file(path_filename: Path, bibcodes: list[str]) → None[source]#

bibcat.data.build_dataset.save_text_files(missing_papertext_bibcodes: list, missing_papertrack_bibcodes: list) → None[source]#

Save the text files of the missing bibcodes

Save the missing bibcodes in the papertext and papertrack files as text files.

Parameters:
  • missing_papertext_bibcodes (list) – the list of the missing papertext bibcodes

  • missing_papertrack_bibcodes (list) – the list of the missing papertrack bibcodes

Returns:

None
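Writing a list of bibcodes to a text file could be sketched as follows. The one-bibcode-per-line layout and the function name `write_bibcodes` are assumptions; the actual output format is not documented here.

```python
from pathlib import Path


def write_bibcodes(path_filename: Path, bibcodes: list[str]) -> None:
    """Hypothetical sketch: write one bibcode per line to a text file."""
    # Each bibcode gets its own newline; an empty list yields an empty file.
    text = "".join(f"{code}\n" for code in bibcodes)
    Path(path_filename).write_text(text, encoding="utf-8")
```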

bibcat.data.build_dataset.trim_dict(dataset: list[dict], keys: list) → list[dict][source]#

Trim the papertext data to only the required keys

Trims the papertext data so that the dataset only has the values of [abstract, author, bibcode, body, keyword, keyword_norm, pubdate, title].

Parameters:
  • dataset (list[dict]) – the papertext data

  • keys (list) – the list of the necessary keys

Returns:

the list of dictionaries containing only the keys required for the source dataset

Return type:

list[dict]
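Trimming each record to a fixed set of keys can be sketched with a dictionary comprehension. The function name `keep_keys` is hypothetical, and silently skipping keys that a record lacks is an assumption about the desired behavior.

```python
def keep_keys(dataset: list[dict], keys: list) -> list[dict]:
    """Hypothetical sketch: keep only the requested keys in each record."""
    # Keys missing from a record are skipped rather than raising KeyError.
    return [{key: rec[key] for key in keys if key in rec} for rec in dataset]
```

For example, calling `keep_keys(records, ["bibcode", "title"])` returns new dictionaries and leaves the input records unchanged.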

Module contents#

This module contains the script to create the input dataset.