bibcat.llm package#

Submodules#

bibcat.llm.chunker module#

bibcat.llm.evaluate module#

bibcat.llm.io module#

bibcat.llm.io.adjust_model(batch_file: Path, orig: str, model: str)[source]#

Adjust the model in the jsonl batch file.

This function replaces the original model with the new model in the specified batch file.

Parameters:
  • batch_file (Path) – The path to the batch file to modify.

  • orig (str) – The original model name to replace.

  • model (str) – The new model name to use.

Returns:

The path to the modified batch file.

Return type:

Path
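A minimal sketch of this kind of replacement, not bibcat's implementation. It assumes each line of the batch file is a JSON object whose request body carries a `model` field; the helper name `replace_model_in_jsonl` and the `body` key are assumptions made for illustration.

```python
import json
from pathlib import Path


def replace_model_in_jsonl(batch_file: Path, orig: str, model: str) -> Path:
    """Rewrite a JSONL file in place, swapping one model name for another."""
    updated = []
    for line in batch_file.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        body = record.get("body", {})  # assumed location of the model field
        if body.get("model") == orig:
            body["model"] = model
        updated.append(json.dumps(record))
    batch_file.write_text("\n".join(updated) + "\n")
    return batch_file
```

Parsing each line as JSON, rather than doing a plain string replace, avoids touching text that merely happens to contain the old model name.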

bibcat.llm.io.get_file(filepath: str | None = None, bibcode: str | None = None, index: int | None = None) str[source]#

Get a file path for paper data

Get a file path of a paper to upload to an LLM. If a file path is provided, e.g. a local pdf file, it is returned. If a bibcode or index is provided, retrieves the source dataset and writes it out to a temporary json file. The name of the temporary file is temp_****_[bibcode].json, prefixed with temp_ and suffixed with the bibcode of the paper.

Parameters:
  • filepath (str, optional) – a local filepath to a paper, by default None

  • bibcode (str, optional) – the bibcode of a source paper, by default None

  • index (int, optional) – the list index of a source paper, by default None

Returns:

the file path to the paper data

Return type:

str
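The temporary file naming described above can be sketched with the standard library. This is an illustration only: the helper name `write_temp_paper` is hypothetical, and the random middle part of the name is whatever `tempfile` generates.

```python
import json
import tempfile


def write_temp_paper(entry: dict, bibcode: str) -> str:
    """Write a paper entry to a temp JSON file named temp_<random>_<bibcode>.json."""
    with tempfile.NamedTemporaryFile(
        mode="w", prefix="temp_", suffix=f"_{bibcode}.json", delete=False
    ) as handle:
        json.dump(entry, handle)
        return handle.name  # file is flushed and closed when the block exits
```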

bibcat.llm.io.get_llm_prompt(prompt_type: str) str[source]#

Get an LLM prompt

Retrieve a user or agent prompt for an LLM from a file or the config. A user prompt is the text used as the input to the LLM, while the agent (or system) prompt is the text that defines the instructions or behavior the LLM agent should follow. The agent prompt is only used when creating a new agent for the first time.

You can define a custom user or agent prompt as a text file, located at $BIBCAT_DATA_DIR/llm_[prompt_type]_prompt.txt. For example, place your custom user prompt at $BIBCAT_DATA_DIR/llm_user_prompt.txt. This file takes precedence. If no custom prompt file is found, the default user prompt will come from the config file field: llms.user_prompt.

To set an agent prompt, create a file at $BIBCAT_DATA_DIR/llm_agent_prompt.txt, and add your instructions for the agent. If no custom agent prompt is found, a default agent prompt will be used. The default agent prompt will either come from the config file field: llms.agent_prompt or from the default file at etc/default_agent_prompt.txt.

Parameters:

prompt_type (str) – The type of prompt to retrieve, either ‘user’ or ‘agent’

Returns:

the text prompt

Return type:

str

Raises:

ValueError – when an invalid prompt type is provided
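The lookup order described above (custom file first, then the config) can be sketched as follows. This is a simplified illustration, not bibcat's code: the helper name `resolve_prompt`, the plain-dict config, and the `data_dir` argument are assumptions, and the final fallback to etc/default_agent_prompt.txt is omitted.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_prompt(prompt_type: str, config: dict, data_dir: Optional[str] = None) -> str:
    """Return the custom prompt file's text if present, else the config value."""
    if prompt_type not in ("user", "agent"):
        raise ValueError(f"Invalid prompt type: {prompt_type!r}")
    base = Path(data_dir or os.environ.get("BIBCAT_DATA_DIR", "."))
    custom = base / f"llm_{prompt_type}_prompt.txt"
    if custom.is_file():
        return custom.read_text()  # the custom file takes precedence
    # Assumed config layout: {"llms": {"user_prompt": ..., "agent_prompt": ...}}
    return config.get("llms", {}).get(f"{prompt_type}_prompt", "")
```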

bibcat.llm.io.get_source(bibcode: str | None = None, index: int | None = None, body_only: bool = False) dict | str[source]#

Get the source dataset for a given bibcode or index.

Retrieve the entry from the combined source dataset for a given bibcode or list index.

Parameters:
  • bibcode (str, optional) – the paper bibcode to retrieve, by default None

  • index (int, optional) – the list item index to retrieve, by default None

  • body_only (bool, optional) – Flag to only return the text body, by default False

Returns:

a row from the source dataset

Return type:

dict | str

bibcat.llm.io.read_output(bibcode: str | None = None, filename: str | Path | None = None) list[source]#

Read in the output for a given bibcode

Returns the content from the output JSON file for the given bibcode.

Parameters:
  • bibcode (str, optional) – The paper bibcode, by default None

  • filename (Path, optional) – The prompt output file path

Returns:

The output data from the LLM response

Return type:

list

bibcat.llm.io.write_output(paper_key: str, response: dict)[source]#

Write the output response to a file

Writes the output json response to a file, located at $BIBCAT_OUTPUT/output/llms/openai_[config.llms.openai.model]/[config.llms.prompt_output_file]

The output JSON file is organized by the filename or bibcode of the input file, with each prompt response appended in the relevant section.

Parameters:
  • paper_key (str) – the JSON key to append the response to, e.g. the bibcode or filename

  • response (dict) – the response from the llm agent
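A minimal sketch of appending responses under a per-paper key, assuming the output file holds a JSON object that maps each key to a list of responses. The helper name `append_response` and that file layout are assumptions; bibcat's actual layout may differ.

```python
import json
from pathlib import Path


def append_response(output_file: Path, paper_key: str, response: dict) -> None:
    """Append one response to the list stored under paper_key."""
    data = json.loads(output_file.read_text()) if output_file.exists() else {}
    data.setdefault(paper_key, []).append(response)
    output_file.write_text(json.dumps(data, indent=2))
```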

bibcat.llm.io.write_summary(output: dict, output_path: str | None = None)[source]#

Write the evaluation summary output to a file

Write the output summary statistics and info from evaluation into a JSON file.

Parameters:
  • output (dict) – the output summary data

  • output_path (str, optional) – optional output directory path

Return type:

None

bibcat.llm.metrics module#

bibcat.llm.metrics.append_human_labels_with_mapped_papertype(human_data: dict[str, str], mission: str, human_labels: list[str]) None[source]#

Append human papertype to the human_labels list after mapping it to the allowed papertype

Parameters:
  • human_data (dict[str]) – human classification data per bibcode in summary_output. e.g., “human”: {“GALEX”: “SCIENCE”, “HST”: “DATA-INFLUENCED”}

  • mission (str) – mission name, e.g., ROMAN

  • human_labels (list[str]) – list of human papertype labels for confusion matrix, e.g., [“SCIENCE”, “NONSCIENCE”, “SCIENCE”]

Return type:

None

bibcat.llm.metrics.append_llm_labels_with_mapped_papertype(llm_data: list[dict], mission: str, llm_labels: list[str]) None[source]#

Append llm papertype to the llm_labels list after mapping it to the allowed papertype

Parameters:
  • llm_data (list[dict]) – llm classification data per bibcode in summary_output. e.g., “llm”: [{“JWST”: “SCIENCE”}, {“ROMAN”: “SUPERMENTION”}, {“HST”: “SCIENCE”}]

  • mission (str) – mission name, e.g., ROMAN

  • llm_labels (list[str]) – list of llm papertype labels for confusion matrix, e.g., [“SCIENCE”, “NONSCIENCE”, “SCIENCE”]

Return type:

None

bibcat.llm.metrics.collect_confusion_matrix_cell_entries(human_labels_encoded: ndarray[tuple[int, ...], dtype[int64]], llm_labels_encoded: ndarray[tuple[int, ...], dtype[int64]], label_raws: list[dict], n_classes: int) dict[source]#

Collect bibcode + raw-label dicts for confusion matrix cells.

Parameters:
  • human_labels_encoded (NDArray[np.int64]) – encoded human labels

  • llm_labels_encoded (NDArray[np.int64]) – encoded llm labels

  • label_raws (list[dict]) – list of raw labels (before mapping): dict with keys bibcode, mission, human_raw and llm_raw

  • n_classes (int) – number of classes

Returns:

  • entries (dict) – a dict with keys ‘tn’, ‘fp’, ‘fn’, ‘tp’, each mapping to a list of entry dicts.

  • Each entry dict contains the following variables:

  • bibcode (str) – bibcode

  • human_raw (str) – raw human label before mapping

  • llm_raw (str) – raw llm label before mapping
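For the binary case, the cell assignment can be sketched as below. This is an illustration under assumptions: labels are encoded as 0/1 with 1 as the positive class, plain sequences stand in for the NumPy arrays, and the helper name `collect_cells` is hypothetical.

```python
def collect_cells(human_encoded, llm_encoded, label_raws):
    """Group raw-label entries by binary confusion matrix cell."""
    # (truth, prediction) -> cell name, with 1 assumed to be the positive class
    names = {(0, 0): "tn", (0, 1): "fp", (1, 0): "fn", (1, 1): "tp"}
    entries = {name: [] for name in names.values()}
    for truth, pred, raw in zip(human_encoded, llm_encoded, label_raws):
        entries[names[(int(truth), int(pred))]].append(
            {
                "bibcode": raw["bibcode"],
                "human_raw": raw["human_raw"],
                "llm_raw": raw["llm_raw"],
            }
        )
    return entries
```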

bibcat.llm.metrics.compute_and_save_metrics(metrics_data: dict[str], output_ascii_path: str | Path = 'metrics_summary.txt', output_json_path: str | Path = 'metrics_summary.json')[source]#

Compute llm performance metrics (accuracy, f1, precision, and recall scores) and other stats and save results to an ascii file

Parameters:
  • metrics_data (dict[str]) – contains various metrics

  • metrics_data contains the following variables:

  • threshold (float) – threshold

  • n_bibcodes (int) – The number of bibcodes (papers)

  • n_human_callouts (int) – The number of callouts by human classification in the whole dataset

  • n_llm_callouts (int) – The number of callouts by llm classification in the whole dataset

  • n_non_mast_callouts (int) – The number of non-MAST missions by llm in the whole dataset

  • n_missing_ouptput_bibcodes (int) – The number of bibcodes missing output in the whole dataset

  • non_mast_missions (list[str], sorted) – The non-MAST missions called out by llm in the whole dataset

  • n_human_llm_mission_callouts (int) – The number of mission callouts by both human and llm in the given missions

  • n_human_llm_hallucination (int) – The number of apparent hallucination by both human and llm in the given missions when “mission_in_text” = false

  • human_llm_missions (list[str]) – The missions called out by both human and llm in the given missions

  • human_labels (list[str]) – True labels, human classified labels like [“SCIENCE”, “MENTION”]

  • llm_labels (list[str]) – Predicted labels by llm

  • label_raws (list[dict]) – list of raw labels (before mapping): dict with keys ‘human_raw’ and ‘llm_raw’

  • output_ascii_path (str | Path) – output file path to save the metrics summary in .txt

  • output_json_path (str | Path) – output file path to save the metrics summary in .json

Return type:

None

bibcat.llm.metrics.extract_eval_data(data: dict, missions: list[str]) dict[str, Any][source]#

Extract the evaluation data for confusion matrix and stats related to mission call-outs, and save to files.

Extract the human/llm labels and other stats related to valid MAST mission and non MAST mission call-outs from the evaluation json file, config.llms.eval_output_file (summary_output.json). This function is called when plotting a confusion matrix plot in bibcat.llm.plots.py

Parameters:
  • data (dict) – the dict of the evaluation data of config.llms.eval_output_file (*summary_output.json)

  • missions (list[str]) – list of the mission names to extract the classification labels.

Returns:

  • metrics_data (dict[str]) – contains various metrics

  • metrics_data contains the following variables:

  • threshold (float) – threshold

  • n_bibcodes (int) – The number of bibcodes (papers)

  • n_human_callouts (int) – The number of callouts by human classification in the whole dataset

  • n_llm_callouts (int) – The number of callouts by llm classification in the whole dataset

  • n_non_mast_callouts (int) – The number of non-MAST missions by llm in the whole dataset

  • n_missing_ouptput_bibcodes (int) – The number of bibcodes missing output in the whole dataset

  • non_mast_missions (list[str], sorted) – The non-MAST missions called out by llm in the whole dataset

  • n_human_llm_mission_callouts (int) – The number of mission callouts by both human and llm in the given missions

  • n_human_llm_hallucination (int) – The number of apparent hallucination by both human and llm in the given missions when “mission_in_text” = false

  • human_llm_missions (list[str]) – The missions called out by both human and llm in the given missions

  • human_labels (list[str]) – True labels, human classified labels like [“SCIENCE”, “MENTION”] after mapping

  • llm_labels (list[str]) – Predicted labels by llm after mapping

  • label_raws (list[dict]) – list of raw labels (before mapping): dict with keys bibcode, mission, ‘human_raw’ and ‘llm_raw’

bibcat.llm.metrics.extract_labels(missions: list[str], human_labels: list[str], llm_labels: list[str], human_llm_mission_callouts: list[str], ignored_papertype: str, item: dict[str, dict[str, Any]], n_human_llm_hallucination: int) tuple[list[str], list[str], int][source]#

Extract human and llm papertype labels when the summary output of a bibcode has classification items other than “error”

This function extracts human and llm papertype labels from the summary_output for constructing a confusion matrix, then maps the papertypes to the allowed papertypes (for instance, MENTION maps to NONSCIENCE). Because the summary_output provides human and llm callouts only for the relevant missions, not all MAST missions, the relevant labels are extracted according to the following conditions:

  1. When both human and LLM call out a given mission and their papertypes, assign them to the relevant papertypes.

  2. When human calls out the mission but LLM ignores the paper, assign the human label to its relevant papertype but the LLM label to ignored_papertype.

  3. When human ignores the paper but LLM calls out with a papertype, assign the LLM label to its relevant papertype but the human label to ignored_papertype.

  4. When both human and LLM ignore the paper for the mission, assign both to ignored_papertype.
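The four conditions above reduce to one rule: use the called-out papertype when it exists, and fall back to ignored_papertype otherwise. A minimal sketch, assuming both sides are available as mission-to-papertype dicts; the helper name `labels_for_mission` is hypothetical, and the papertype mapping and hallucination counting are left out.

```python
def labels_for_mission(mission, human, llm, ignored_papertype):
    """Return the (human_label, llm_label) pair for one mission."""
    human_label = human.get(mission, ignored_papertype)
    llm_label = llm.get(mission, ignored_papertype)
    return human_label, llm_label


human = {"HST": "SCIENCE", "JWST": "SCIENCE"}
llm = {"HST": "SCIENCE", "TESS": "SCIENCE"}
pairs = {m: labels_for_mission(m, human, llm, "NONSCIENCE")
         for m in ["HST", "JWST", "TESS", "GALEX"]}
```

Here `pairs` covers conditions 1 to 4 in order: HST is called out by both, JWST only by the human, TESS only by the LLM, and GALEX by neither.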

Parameters:
  • missions (list[str]) – MAST missions of interest

  • human_labels (list[str]) – human papertypes before this bibcode

  • llm_labels (list[str]) – llm papertypes before this bibcode

  • human_llm_mission_callouts (list[str]) – Missions called out by both human and llm

  • ignored_papertype (str, uppercase) – config.llms.map_papertypes.ignore.upper(), for instance, NONSCIENCE

  • item (dict[str, dict[str, Any]]) – bibcode dictionary item

  • n_human_llm_hallucination (int) – the number of hallucinations before the current bibcode

Returns:

  • human_labels (list[str]) – human papertype labels updated after the current bibcode

  • llm_labels (list[str]) – llm papertype labels updated after the current bibcode

  • n_human_llm_hallucination (int) – the number of hallucinations updated after the current bibcode

bibcat.llm.metrics.extract_roc_data(data: dict[str, dict[str, Any]], missions: list[str])[source]#

Extract the human and llm classification labels and confidences

Extract the human classes and confidence values from the evaluation json file, config.llms.eval_output_file (summary_output.json). You can extract data from only a single mission or a list of missions. The human labels (ground truth) and llm confidence values will be used to create a ROC curve.

Parameters:
  • data (dict[str, dict[str, Any]]) – the dict of the evaluation data of config.llms.eval_output_file (summary_output.json)

  • missions (list[str]) – list of the mission names to extract the classification labels.

Returns:

  • tuple – A tuple of the list of human labels, llm labels, and the threshold value for verdict acceptance.

  • human_labels (list[str]) – True labels by human, a list of papertypes, e.g., “SCIENCE” or “MENTION” (or “NONSCIENCE”); see the allowed classifications in config.llms.papertypes. For example, [“SCIENCE”, “MENTION”]

  • llm_confidences (list[list[float]]) – A list of confidence score sets for all verdicts ([[p_science, p_mention],]) where p_science and p_mention represent confidence values of “SCIENCE” and “MENTION”(or “NONSCIENCE”) respectively. For example: [[0.9, 0.1], [0.4, 0.6]]

  • human_llm_missions (list[str], sorted) – A set of missions, each containing both human- and LLM-classified paper types, used for evaluation plots.

bibcat.llm.metrics.get_roc_metrics(llm_confidences: ndarray[tuple[int, ...], dtype[float64]], binarized_human_labels: ndarray[tuple[int, ...], dtype[int64]], n_papertype: int)[source]#

Compute ROC curve and ROC AUC (area under curve)

Parameters:
  • llm_confidences (array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case) – the numpy array of llm_confidences

  • binarized_human_labels (array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case) – the binarized human labels, e.g., [[0] [1] [1] [0] [0]] in the binary case

Returns:

  • tuple – a tuple of false positive rate(fpr), true positive rate(tpr), and roc_auc

  • fpr (float) – false positive rate

  • tpr (float) – true positive rate

  • roc_auc (float) – ROC area under curve

  • macro_roc_auc_ovr (float) – Macro-averaged One-vs-Rest ROC AUC score for the multiclass case (only when n_papertypes > 2)

  • micro_roc_auc_ovr (float) – Micro-averaged One-vs-Rest ROC AUC score for the multiclass case (only when n_papertypes > 2)
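To make the returned quantities concrete, here is a hand-rolled sketch of a binary ROC curve and its area, written with plain Python lists. It does not reproduce bibcat's implementation; the function names `roc_points` and `auc` are hypothetical, labels are assumed to be 0/1 with 1 as the positive class, and both classes must be present.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) lists for binary labels (1 = positive) and scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Rank samples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Trapezoidal area under the (fpr, tpr) curve."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )
```

In the multi-class case, the macro and micro One-vs-Rest scores aggregate this same binary quantity across classes.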

bibcat.llm.metrics.human_labels_when_no_llm_output(missions, human_data, human_labels, ignored_papertype)[source]#

Assign human labels when human classifications exist even with no llm output

Parameters:
  • missions (list[str]) – list of missions

  • human_data (dict[str, str]) – dictionary values of item[“human”], e.g., {“JWST”: “SCIENCE”}

  • human_labels (list[str]) – True labels by human, a list of papertypes before papertype mapping, For example, [“SCIENCE”, “MENTION”]

  • ignored_papertype (str, uppercase) – config.llms.map_papertypes.ignore.upper(), for instance, “NONSCIENCE”

Returns:

human_labels – updated human labels based on the presence of human classifications

Return type:

list[str]

bibcat.llm.metrics.map_papertype(papertype: str) str | None[source]#

Map a classified papertype to an allowed papertype. For instance, if papertype is “SUPERMENTION” or “IGNORE”, it returns “NONSCIENCE” or a custom papertype.

Parameters:

papertype (str, uppercase) – human or llm classified papertype, e.g., “SCIENCE”, “DATA_INFLUENCED”

Returns:

mapped_papertype – mapped papertype following config.llms.map_papertypes, e.g., “MENTION” if papertype is “SUPERMENTION”

Return type:

str, uppercase
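A minimal sketch of such a mapping. The dict below is a hypothetical stand-in for config.llms.map_papertypes, whose actual contents are not shown on this page, and the helper name `map_to_allowed` is also an assumption.

```python
from typing import Optional

# Hypothetical mapping; bibcat reads its mapping from config.llms.map_papertypes.
PAPERTYPE_MAP = {
    "SCIENCE": "SCIENCE",
    "MENTION": "NONSCIENCE",
    "SUPERMENTION": "NONSCIENCE",
    "IGNORE": "NONSCIENCE",
}


def map_to_allowed(papertype: str) -> Optional[str]:
    """Return the allowed papertype, or None when the input is not in the mapping."""
    return PAPERTYPE_MAP.get(papertype.upper())
```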

bibcat.llm.metrics.prepare_roc_inputs(human_labels: list[str], llm_confidences: list[list[float]])[source]#

Prepare input data for ROC and AUC (area under curve)

Parameters:
  • human_labels (list[str]) – True labels by human, a list of papertypes, e.g., “SCIENCE” or “MENTION”; see the allowed classifications in config.llms.papertypes

  • llm_confidences (list[list[float]]) – Predicted labels by llm, a list of confidence score pairs for all verdicts.

Returns:

  • tuple – A tuple of confidences, binarized_human_labels, and n_classes.

  • binarized_human_labels (NDArray[np.int64]) – Array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case. Binarized human labels as ROC input, e.g.,[[0][1]..],[[0 1 0 0][0 1 0 0]…]

  • llm_confidences (NDArray[np.float64]) – Array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case. A list of confidence score pairs for all verdicts. Each inner list contains two floats: the first for “SCIENCE” and the second for “MENTION”. For example: [[0.9 0.1] [0.4 0.6]]

  • n_papertype (int) – the number of available papertypes

  • n_verdicts (int) – the number of MAST mission papertype verdicts by LLM

bibcat.llm.openai module#

bibcat.llm.plots module#

bibcat.llm.plots.confusion_matrix_plot(summary_output_path: str | Path, missions: list[str]) None[source]#

Create a confusion matrix figure

Create confusion matrix plots (counts and normalized) given a threshold value.

Parameters:
  • summary_output_path (str | pathlib.Path) – the filepath of the evaluation *summary_output.json

  • missions (list[str]) – list of the mission names to extract the classification labels.

bibcat.llm.plots.roc_plot(summary_output_path: str | Path, missions: list[str]) None[source]#

Create a Receiver Operating Characteristic (ROC) curve plot

Parameters:
  • summary_output_path (str | pathlib.Path) – the filepath of the evaluation *summary_output.json

  • missions (list[str]) – list of the mission names to extract the classification labels.

bibcat.llm.stats module#

bibcat.llm.stats.analyze_missions(human_item: dict[str, str], llm_item: list[dict[str, Any]]) tuple[dict[str, str], int][source]#

Analyze and compare LLM classifications against human classifications.

Parameters:
  • human_item (dict[str, str]) – human classification of mission and papertype, e.g., {“HST”: “MENTION”, “JWST”: “SUPERMENTION”}

  • llm_item (list[dict[str, Any]]) – list of llm classifications

Returns:

  • failure (dict) – dictionary of failed cases

  • n_matched_classifications (int) – number of matched classifications

bibcat.llm.stats.audit_summary(audit_results: dict) dict[str, int][source]#

Create the summary of the inconsistent classifications

Parameters:

audit_results (dict) – the breakdown bibcode list of inconsistent llm classifications e.g., “bibcodes”: {“2018A&A…610A..11I”: {“failures”: “GALEX”: “false_positive”},}

Returns:

summary_counts – various count summary

Return type:

dict[str, int]

bibcat.llm.stats.group_by_agg(confidence_name: str, threshold_acceptance: float, threshold_inspection: float, df: DataFrame)[source]#

Group DataFrame by mission and papertype and aggregate other properties.

Parameters:
  • confidence_name (str) – The key name for LLM confidences, e.g., “llm_confidences” for paper_output.json or “mean_llm_confidences” for summary_output.json

  • threshold_acceptance (float) – Threshold value to accept LLM papertype

  • threshold_inspection (float) – Threshold value to filter papers required for human inspection

  • df (pd.DataFrame) – Dataframe

Return type:

pd.DataFrame

bibcat.llm.stats.inconsistent_classifications(input_path: str | Path, output_path: str | Path)[source]#

Save falsely classified bibcodes to a json file for investigation

This function checks whether the llm classification differs from the human classification, or whether the llm incorrectly ignored the mission, and saves the results to a JSON file.

Parameters:
  • input_path (str | pathlib.Path) – Input eval_output file name/path for statistics

  • output_path (str | pathlib.Path) – File name/path to save the JSON file

Return type:

None

bibcat.llm.stats.save_evaluation_stats(input_path: str | Path, output_path: str | Path, threshold_acceptance: float, threshold_inspection: float)[source]#

Generate acceptance and inspection statistics and identify classification inconsistencies between humans and the LLM for evaluation summary data

This function performs the following actions:
  • Creates a statistics file containing:
    • Accepted LLM Classifications: Number of papers with classifications accepted by the LLM based on a specified threshold value for each combination of mission and paper type.

    • Human Inspection Requirements: Number of papers requiring human inspection

    • Accepted Bibcodes: Bibcodes corresponding to the accepted classifications.

    • Inspection-Required Bibcodes: Bibcodes that need human inspection due to ambiguous confidence values.
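The grouping into accepted and inspection-required bibcodes can be sketched as follows. The exact comparison rule is not stated on this page, so this sketch assumes that a confidence at or above threshold_acceptance is accepted and that a confidence between threshold_inspection and threshold_acceptance needs inspection. The helper name `split_by_threshold` and the row keys are hypothetical.

```python
from collections import defaultdict


def split_by_threshold(rows, threshold_acceptance, threshold_inspection):
    """Group bibcodes per (mission, papertype) into accepted and inspection lists."""
    stats = defaultdict(lambda: {"accepted": [], "inspection": []})
    for row in rows:
        key = (row["mission"], row["papertype"])
        if row["confidence"] >= threshold_acceptance:
            stats[key]["accepted"].append(row["bibcode"])
        elif row["confidence"] >= threshold_inspection:
            # Assumed rule: confidences in this band are ambiguous.
            stats[key]["inspection"].append(row["bibcode"])
    return dict(stats)
```

The per-group counts reported in the statistics file would then be the lengths of these lists.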

Parameters:
  • input_path (str | pathlib.Path) – Input paper_output file name/path for statistics

  • output_path (str | pathlib.Path) – File name/path to save the JSON file

  • threshold_acceptance (float) – Threshold value to accept LLM papertype

  • threshold_inspection (float) – Threshold value to filter papers required for human inspection

Return type:

None

Raises:

Exception – For any other exceptions that occur during DataFrame creation or file operations.

bibcat.llm.stats.save_operation_stats(input_path: str | Path, output_path: str | Path, threshold_acceptance: float, threshold_inspection: float)[source]#

Generate acceptance and inspection statistics from operational classifications

This function performs the following actions:
  • Creates a statistics file containing:
    • Accepted LLM Classifications: Number of papers with classifications accepted by the LLM based on a specified threshold value for each combination of mission and paper type.

    • Human Inspection Requirements: Number of papers requiring human inspection

    • Accepted Bibcodes: Bibcodes corresponding to the accepted classifications.

    • Inspection-Required Bibcodes: Bibcodes that need human inspection due to ambiguous confidence values.

Parameters:
  • input_path (str | pathlib.Path) – Input paper_output filename/path for statistics

  • output_path (str | pathlib.Path) – File name/path to save the JSON file

  • threshold_acceptance (float) – Threshold value to accept LLM papertype

  • threshold_inspection (float) – Threshold value to filter papers required for human inspection

Return type:

None

Raises:

Exception – For any other exceptions that occur during DataFrame creation or file operations.

bibcat.llm.stats.write_stats(output_path, threshold_acceptance, threshold_inspection, grouped_df)[source]#

Write the statistics into a JSON file.

Parameters:
  • output_path (pathlib.Path) – Filename path to save the stats results.

  • threshold_acceptance (float) – Threshold value to accept LLM papertype.

  • threshold_inspection (float) – Threshold value to filter papers required for human inspection.

  • grouped_df (pd.DataFrame) – Grouped DataFrame

Return type:

None

Module contents#

This module contains the scripts for LLM classification.