bibcat.llm package

bibcat.llm package#

Submodules#

bibcat.llm.chunker module#

bibcat.llm.evaluate module#

bibcat.llm.io module#

bibcat.llm.io.adjust_model(batch_file: Path, orig: str, model: str)[source]#

Adjust the model in the jsonl batch file.

This function replaces the original model with the new model in the specified batch file.

Parameters:

batch_file (Path) – The path to the batch file to modify.
orig (str) – The original model name to replace.
model (str) – The new model name to use.

Returns:

The path to the modified batch file.

Return type:

Path
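The replacement described above could be sketched as follows. This is an illustration only, not bibcat's implementation: the helper name `swap_model` is hypothetical, and it assumes each line of the batch file is a JSON request that stores the model name at `body.model`.

```python
import json
import tempfile
from pathlib import Path


def swap_model(batch_file: Path, orig: str, model: str) -> Path:
    """Rewrite a JSONL batch file, replacing model name `orig` with `model`."""
    updated = []
    for line in batch_file.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        request = json.loads(line)
        body = request.get("body", {})
        # Assumed layout: the model name lives at request["body"]["model"].
        if body.get("model") == orig:
            body["model"] = model
        updated.append(json.dumps(request))
    batch_file.write_text("\n".join(updated) + "\n")
    return batch_file


# Demo on a throwaway file; the model names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "batch.jsonl"
    demo.write_text(json.dumps({"custom_id": "a", "body": {"model": "old-model"}}) + "\n")
    result = swap_model(demo, "old-model", "new-model")
    swapped = json.loads(result.read_text().splitlines()[0])["body"]["model"]
```

Requests whose model does not match `orig` are written back unchanged, so the same file can safely be passed through more than once.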

bibcat.llm.io.get_file(filepath: str | None = None, bibcode: str | None = None, index: int | None = None) → str[source]#

Get a file path for paper data

Get a file path of a paper to upload to an LLM. If a file path is provided, e.g. a local pdf file, it is returned. If a bibcode or index is provided, retrieves the source dataset and writes it out to a temporary json file. The name of the temporary file is temp_****_[bibcode].json, prefixed with temp_ and suffixed with the bibcode of the paper.

Parameters:

filepath (str, optional) – a local filepath to a paper, by default None
bibcode (str, optional) – the bibcode of a source paper, by default None
index (int, optional) – the list index of a source paper, by default None

Returns:

the file path to the paper data

Return type:

str

bibcat.llm.io.get_llm_prompt(prompt_type: str) → str[source]#

Get an LLM prompt

Retrieve a user or agent prompt for an LLM from a file or the config. A user prompt is the text used as input to the LLM, while the agent (or system) prompt is the text that defines the instructions or behavior the LLM agent should follow. The agent prompt is only used when creating a new agent for the first time.

You can define a custom user or agent prompt as a text file, located at $BIBCAT_DATA_DIR/llm_[prompt_type]_prompt.txt. For example, place your custom user prompt at $BIBCAT_DATA_DIR/llm_user_prompt.txt. This file takes precedence. If no custom prompt file is found, the default user prompt will come from the config file field: llms.user_prompt.

To set an agent prompt, create a file at $BIBCAT_DATA_DIR/llm_agent_prompt.txt, and add your instructions for the agent. If no custom agent prompt is found, a default agent prompt will be used. The default agent prompt will either come from the config file field: llms.agent_prompt or from the default file at etc/default_agent_prompt.txt.

Parameters:

prompt_type (str) – The type of prompt to retrieve, either ‘user’ or ‘agent’

Returns:

the text prompt

Return type:

str

Raises:

ValueError – when an invalid prompt type is provided
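The lookup order described above (custom file first, then the config value) might be sketched like this. The function `resolve_prompt` and the plain dict standing in for the config are assumptions made for illustration. The fallback to etc/default_agent_prompt.txt is left out.

```python
import tempfile
from pathlib import Path


def resolve_prompt(prompt_type: str, data_dir: Path, config_prompts: dict) -> str:
    """Return the custom prompt file's text if present, else the config value."""
    if prompt_type not in ("user", "agent"):
        raise ValueError(f"Invalid prompt type: {prompt_type!r}")
    custom = data_dir / f"llm_{prompt_type}_prompt.txt"
    if custom.exists():
        return custom.read_text()  # the custom file takes precedence
    # Hypothetical stand-in for config.llms.user_prompt / config.llms.agent_prompt.
    return config_prompts[f"{prompt_type}_prompt"]


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)  # stands in for $BIBCAT_DATA_DIR
    defaults = {"user_prompt": "default user", "agent_prompt": "default agent"}
    before = resolve_prompt("user", data_dir, defaults)
    (data_dir / "llm_user_prompt.txt").write_text("custom user")
    after = resolve_prompt("user", data_dir, defaults)
```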

bibcat.llm.io.get_source(bibcode: str | None = None, index: int | None = None, body_only: bool = False) → dict | str[source]#

Get the source dataset for a given bibcode or index.

Retrieve the entry from the combined source dataset for a given bibcode or list index.

Parameters:

bibcode (str, optional) – the paper bibcode to retrieve, by default None
index (int, optional) – the list item index to retrieve, by default None
body_only (bool, optional) – Flag to only return the text body, by default False

Returns:

a row from the source dataset

Return type:

dict | str

bibcat.llm.io.read_output(bibcode: str | None = None, filename: str | Path | None = None) → list[source]#

Read in the output for a given bibcode

Returns the content from the output JSON file for the given bibcode.

Parameters:

bibcode (str, optional) – The paper bibcode, by default None
filename (str | Path, optional) – The prompt output file path, by default None

Returns:

The output data from the LLM response

Return type:

list

bibcat.llm.io.write_output(paper_key: str, response: dict)[source]#

Write the output response to a file

Writes the output json response to a file, located at $BIBCAT_OUTPUT/output/llms/openai_[config.llms.openai.model]/[config.llms.prompt_output_file]

The output JSON file is organized by the filename or bibcode of the input file, with each prompt response appended in the relevant section.

Parameters:

paper_key (str) – the JSON key to append the response to, e.g. the bibcode or filename
response (dict) – the response from the llm agent
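As a rough illustration of the "organized by key, responses appended" layout described above, the sketch below keeps one list of responses per paper key. The helper `append_response`, the list-per-key structure, and the file name are assumptions, not the package's actual on-disk format.

```python
import json
import tempfile
from pathlib import Path


def append_response(output_file: Path, paper_key: str, response: dict) -> None:
    """Append `response` to the list stored under `paper_key` in a JSON file."""
    data = json.loads(output_file.read_text()) if output_file.exists() else {}
    # Assumed structure: each paper key maps to a list of responses.
    data.setdefault(paper_key, []).append(response)
    output_file.write_text(json.dumps(data, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "paper_output.json"  # placeholder file name
    append_response(out, "example-bibcode", {"run": 1})
    append_response(out, "example-bibcode", {"run": 2})
    stored = json.loads(out.read_text())
```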

bibcat.llm.io.write_summary(output: dict, output_path: str | None = None)[source]#

Write the evaluation summary output to a file

Write the output summary statistics and info from evaluation into a JSON file.

Parameters:

output (dict) – the output summary data
output_path (str, optional) – optional output directory path

Return type:

None

bibcat.llm.metrics module#

bibcat.llm.metrics.append_human_labels_with_mapped_papertype(human_data: dict[str, str], mission: str, human_labels: list[str]) → None[source]#

Append human papertype to the human_labels list after mapping it to the allowed papertype

Parameters:

human_data (dict[str, str]) – human classification data per bibcode in summary_output, e.g., “human”: {“GALEX”: “SCIENCE”, “HST”: “DATA-INFLUENCED”}
mission (str) – mission name, e.g., ROMAN
human_labels (list[str]) – list of human papertype labels for confusion matrix, e.g., [“SCIENCE”, “NONSCIENCE”, “SCIENCE”]

Return type:

None

bibcat.llm.metrics.append_llm_labels_with_mapped_papertype(llm_data: list[dict], mission: str, llm_labels: list[str]) → None[source]#

Append llm papertype to the llm_labels list after mapping it to the allowed papertype

Parameters:

llm_data (list[dict]) – llm classification data per bibcode in summary_output. e.g., “llm”: [{“JWST”: “SCIENCE”}, {“ROMAN”: “SUPERMENTION”}, {“HST”: “SCIENCE”}]
mission (str) – mission name, e.g., ROMAN
llm_labels (list[str]) – list of llm papertype labels for confusion matrix, e.g., [“SCIENCE”, “NONSCIENCE”, “SCIENCE”]

Return type:

None

bibcat.llm.metrics.collect_confusion_matrix_cell_entries(human_labels_encoded: ndarray[tuple[int, ...], dtype[int64]], llm_labels_encoded: ndarray[tuple[int, ...], dtype[int64]], label_raws: list[dict], n_classes: int) → dict[source]#

Collect bibcode + raw-label dicts for confusion matrix cells.

Parameters:

human_labels_encoded (NDArray[np.int64]) – encoded human labels
llm_labels_encoded (NDArray[np.int64]) – encoded llm labels
label_raws (list[dict]) – list of raw labels (before mapping): dict with keys bibcode, mission, human_raw and llm_raw
n_classes (int) – number of classes

Returns:

entries (dict) – a dict with keys ‘tn’, ‘fp’, ‘fn’, ‘tp’, each mapping to a list of entry dicts.
Each entry dict contains the following keys:
bibcode (str) – bibcode
human_raw (str) – raw human label before mapping
llm_raw (str) – raw llm label before mapping
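For the binary case, the grouping into cells could look like the sketch below. It uses plain lists instead of NumPy arrays and assumes that code 1 is the positive class and code 0 the negative class. The function name `collect_cells` and the sample bibcodes are hypothetical.

```python
def collect_cells(human_encoded, llm_encoded, label_raws):
    """Group raw-label entries by binary confusion matrix cell."""
    entries = {"tn": [], "fp": [], "fn": [], "tp": []}
    # (human, llm) code pairs mapped to cell names; 1 is assumed to be "positive".
    cell_names = {(0, 0): "tn", (0, 1): "fp", (1, 0): "fn", (1, 1): "tp"}
    for human, llm, raw in zip(human_encoded, llm_encoded, label_raws):
        entries[cell_names[(human, llm)]].append(
            {"bibcode": raw["bibcode"], "human_raw": raw["human_raw"], "llm_raw": raw["llm_raw"]}
        )
    return entries


label_raws = [
    {"bibcode": "bib-1", "mission": "HST", "human_raw": "SCIENCE", "llm_raw": "SCIENCE"},
    {"bibcode": "bib-2", "mission": "HST", "human_raw": "MENTION", "llm_raw": "SCIENCE"},
    {"bibcode": "bib-3", "mission": "HST", "human_raw": "SCIENCE", "llm_raw": "MENTION"},
    {"bibcode": "bib-4", "mission": "HST", "human_raw": "MENTION", "llm_raw": "MENTION"},
]
cells = collect_cells([1, 0, 1, 0], [1, 1, 0, 0], label_raws)
```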

bibcat.llm.metrics.compute_and_save_metrics(metrics_data: dict[str], output_ascii_path: str | Path = 'metrics_summary.txt', output_json_path: str | Path = 'metrics_summary.json')[source]#

Compute llm performance metrics (accuracy, f1, precision, and recall scores) and other stats, and save the results to an ASCII file and a JSON file

Parameters:

metrics_data (dict[str]) – contains various metrics
metrics_data contains the following variables
threshold (float) – threshold
n_bibcodes (int) – The number of bibcodes (papers)
n_human_callouts (int) – The number of callouts by human classification in the whole dataset
n_llm_callouts (int) – The number of callouts by llm classification in the whole dataset
n_non_mast_callouts (int) – The number of non-MAST missions by llm in the whole dataset
n_missing_ouptput_bibcodes (int) – The number of bibcodes missing output in the whole dataset
non_mast_missions (list[str], sorted) – The non-MAST missions called out by llm in the whole dataset
n_human_llm_mission_callouts (int) – The number of mission callouts by both human and llm in the given missions
n_human_llm_hallucination (int) – The number of apparent hallucinations by both human and llm in the given missions when “mission_in_text” = false
human_llm_missions (list[str]) – The missions called out by both human and llm in the given missions
human_labels (list[str]) – True labels, human classified labels like [“SCIENCE”, “MENTION”]
llm_labels (list[str]) – Predicted labels by llm
label_raws (list[dict]) – list of raw labels (before mapping): dict with keys ‘human_raw’ and ‘llm_raw’
output_ascii_path (str | Path) – output file path to save the metrics summary in .txt
output_json_path (str | Path) – output file path to save the metrics summary in .json

Return type:

None

bibcat.llm.metrics.extract_eval_data(data: dict, missions: list[str]) → dict[str, Any][source]#

Extract the evaluation data for confusion matrix and stats related to mission call-outs, and save to files.

Extract the human/llm labels and other stats related to valid MAST mission and non MAST mission call-outs from the evaluation json file, config.llms.eval_output_file (summary_output.json). This function is called when plotting a confusion matrix plot in bibcat.llm.plots.py

Parameters:

data (dict) – the dict of the evaluation data of config.llms.eval_output_file (*summary_output.json)
missions (list[str]) – list of the mission names to extract the classification labels.

Returns:

metrics_data (dict[str]) – contains various metrics
metrics_data contains the following variables
threshold (float) – threshold
n_bibcodes (int) – The number of bibcodes (papers)
n_human_callouts (int) – The number of callouts by human classification in the whole dataset
n_llm_callouts (int) – The number of callouts by llm classification in the whole dataset
n_non_mast_callouts (int) – The number of non-MAST missions by llm in the whole dataset
n_missing_ouptput_bibcodes (int) – The number of bibcodes missing output in the whole dataset
non_mast_missions (list[str], sorted) – The non-MAST missions called out by llm in the whole dataset
n_human_llm_mission_callouts (int) – The number of mission callouts by both human and llm in the given missions
n_human_llm_hallucination (int) – The number of apparent hallucinations by both human and llm in the given missions when “mission_in_text” = false
human_llm_missions (list[str]) – The missions called out by both human and llm in the given missions
human_labels (list[str]) – True labels, human classified labels like [“SCIENCE”, “MENTION”] after mapping
llm_labels (list[str]) – Predicted labels by llm after mapping
label_raws (list[dict]) – list of raw labels (before mapping): dict with keys bibcode, mission, ‘human_raw’ and ‘llm_raw’

bibcat.llm.metrics.extract_labels(missions: list[str], human_labels: list[str], llm_labels: list[str], human_llm_mission_callouts: list[str], ignored_papertype: str, item: dict[str, dict[str, Any]], n_human_llm_hallucination: int) → tuple[list[str], list[str], int][source]#

Extract human and llm papertype labels when the summary output of a bibcode has classification items other than “error”

This function extracts human and llm papertype labels from the summary_output for constructing a confusion matrix, then maps the papertypes to the allowed papertypes (for instance, MENTION maps to NONSCIENCE). Because the summary_output provides human and llm callouts only for the relevant missions, not for all MAST missions, the relevant labels are extracted according to the following conditions:

When both human and LLM call out a given mission and their papertypes, assign them to the relevant papertypes.
When human calls out the mission but LLM ignores the paper, assign the human label to its relevant papertype but the LLM label to ignored_papertype.
When human ignores the paper but LLM calls out with a papertype, assign the LLM label to its relevant papertype but the human label to ignored_papertype.
When both human and LLM ignore the paper for the mission, assign both to ignored_papertype.

Parameters:

missions (list[str]) – MAST missions of interest
human_labels (list[str]) – human papertypes before this bibcode
llm_labels (list[str]) – llm papertypes before this bibcode
human_llm_mission_callouts (list[str]) – Missions called out by both human and llm
ignored_papertype (str, uppercase) – config.llms.map_papertypes.ignore.upper(), for instance, NONSCIENCE
item (dict[str, dict[str, Any]]) – bibcode dictionary item
n_human_llm_hallucination (int) – the number of hallucinations before the current bibcode

Returns:

human_labels (list[str]) – human papertype labels updated after the current bibcode
llm_labels (list[str]) – llm papertype labels updated after the current bibcode
n_human_llm_hallucination (int) – the number of hallucinations updated after the current bibcode
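The four cases listed in the description reduce to a lookup with a default on each side. The sketch below assumes the human and llm callouts for one bibcode have already been flattened into {mission: papertype} dicts and skips the papertype mapping step. `pair_labels` is a hypothetical helper, not part of bibcat.

```python
def pair_labels(mission, human, llm, ignored_papertype):
    """Return (human_label, llm_label) for one mission of one paper."""
    # A mission missing from either side counts as ignored by that side.
    return human.get(mission, ignored_papertype), llm.get(mission, ignored_papertype)


human = {"HST": "SCIENCE", "GALEX": "SCIENCE"}
llm = {"HST": "SCIENCE", "JWST": "SCIENCE"}
ignored = "NONSCIENCE"

both = pair_labels("HST", human, llm, ignored)          # both call out the mission
human_only = pair_labels("GALEX", human, llm, ignored)  # llm ignores the paper
llm_only = pair_labels("JWST", human, llm, ignored)     # human ignores the paper
neither = pair_labels("TESS", human, llm, ignored)      # both ignore the paper
```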

bibcat.llm.metrics.extract_roc_data(data: dict[str, dict[str, Any]], missions: list[str])[source]#

Extract the human and llm classification labels and confidences

Extract the human classes and confidence values from the evaluation json file, config.llms.eval_output_file (summary_output.json). You can extract data from only a single mission or a list of missions. The human labels (ground truth) and llm confidence values will be used to create a ROC curve.

Parameters:

data (dict[str, dict[str, Any]]) – the dict of the evaluation data of config.llms.eval_output_file (summary_output.json)
missions (list[str]) – list of the mission names to extract the classification labels.

Returns:

tuple – A tuple of the list of human labels, llm labels, and the threshold value for verdict acceptance.
human_labels (list[str]) – True labels by human, a list of papertypes, e.g., “SCIENCE” or “MENTION” (or “NONSCIENCE”); see the allowed classifications in config.llms.papertypes. For example, [“SCIENCE”, “MENTION”]
llm_confidences (list[list[float]]) – A list of confidence score sets for all verdicts ([[p_science, p_mention],]) where p_science and p_mention represent confidence values of “SCIENCE” and “MENTION”(or “NONSCIENCE”) respectively. For example: [[0.9, 0.1], [0.4, 0.6]]
human_llm_missions (list[str], sorted) – A set of missions, each containing both human- and LLM-classified paper types, used for evaluation plots.

bibcat.llm.metrics.get_roc_metrics(llm_confidences: ndarray[tuple[int, ...], dtype[float64]], binarized_human_labels: ndarray[tuple[int, ...], dtype[int64]], n_papertype: int)[source]#

Compute ROC curve and ROC AUC (area under curve)

Parameters:

llm_confidences (array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case) – the numpy array of llm_confidences
binarized_human_labels (array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case) – binarized human labels, e.g., [[0] [1] [1] [0] [0]] in the binary case
n_papertype (int) – the number of available papertypes

Returns:

tuple – a tuple of false positive rate(fpr), true positive rate(tpr), and roc_auc
fpr (float) – false positive rate
tpr (float) – true positive rate
roc_auc (float) – ROC area under curve
macro_roc_auc_ovr (float) – Macro-averaged One-vs-Rest ROC AUC score for the multiclass case (only when n_papertype > 2)
micro_roc_auc_ovr (float) – Micro-averaged One-vs-Rest ROC AUC score for the multiclass case (only when n_papertype > 2)
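To make the returned quantities concrete, here is a dependency-free sketch of a binary ROC curve and its area. It is not the routine this function uses. The names `roc_points` and `auc` are hypothetical, labels are assumed to be 0/1 with both classes present, and scores are the confidence of the positive class.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists by sweeping a threshold over descending scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2 for i in range(1, len(fpr)))


fpr, tpr = roc_points([1, 0, 1, 0], [0.9, 0.8, 0.6, 0.3])
roc_auc = auc(fpr, tpr)  # 0.75: three of the four positive/negative pairs are ranked correctly
```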

bibcat.llm.metrics.human_labels_when_no_llm_output(missions, human_data, human_labels, ignored_papertype)[source]#

Assign human labels when human classifications exist even with no llm output

Parameters:

missions (list[str]) – list of missions
human_data (dict[str, str]) – dictionary values of item[“human”], e.g., {“JWST”: “SCIENCE”}
human_labels (list[str]) – True labels by human, a list of papertypes before papertype mapping, For example, [“SCIENCE”, “MENTION”]
ignored_papertype (str, uppercase) – config.llms.map_papertypes.ignore.upper(), for instance, “NONSCIENCE”

Returns:

human_labels – updated human labels based on the presence of human classifications

Return type:

list[str]

bibcat.llm.metrics.map_papertype(papertype: str) → str | None[source]#

Map a classified papertype to an allowed papertype. For instance, if papertype is “SUPERMENTION” or “IGNORE”, it returns “NONSCIENCE” or a custom papertype.

Parameters:

papertype (str, uppercase) – human or llm classified papertype, e.g., “SCIENCE”, “DATA_INFLUENCED”

Returns:

mapped_papertype – mapped papertype following config.llms.map_papertypes, e.g., “MENTION” if papertype is “SUPERMENTION”

Return type:

str, uppercase
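A mapping of this kind is essentially a dictionary lookup. In the sketch below the table `HYPOTHETICAL_MAP` is invented for illustration (the real mapping comes from config.llms.map_papertypes), and returning None for an unknown papertype is an assumption based on the `str | None` return annotation.

```python
from typing import Optional

# Invented mapping for illustration; the real one is defined in the config.
HYPOTHETICAL_MAP = {
    "SCIENCE": "SCIENCE",
    "MENTION": "NONSCIENCE",
    "SUPERMENTION": "NONSCIENCE",
    "IGNORE": "NONSCIENCE",
}


def map_label(papertype: str) -> Optional[str]:
    """Return the allowed papertype for `papertype`, or None if it is unknown."""
    return HYPOTHETICAL_MAP.get(papertype.upper())


mapped = [map_label(p) for p in ["SCIENCE", "supermention", "UNKNOWN"]]
```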

bibcat.llm.metrics.prepare_roc_inputs(human_labels: list[str], llm_confidences: list[list[float]])[source]#

Prepare input data for ROC and AUC (area under curve)

Parameters:

human_labels (list[str]) – True labels by human, a list of papertypes, e.g., “SCIENCE” or “MENTION”; see the allowed classifications in config.llms.papertypes
llm_confidences (list[list[float]]) – Predicted labels by llm, a list of confidence score pairs for all verdicts.

Returns:

tuple – A tuple of confidences, binarized_human_labels, and n_classes.
binarized_human_labels (NDArray[np.int64]) – Array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case. Binarized human labels as ROC input, e.g.,[[0][1]..],[[0 1 0 0][0 1 0 0]…]
llm_confidences (NDArray[np.float64]) – Array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case. A list of confidence score pairs for all verdicts. Each inner list contains two floats: the first for “SCIENCE” and the second for “MENTION”. For example: [[0.9 0.1] [0.4 0.6]]
n_papertype (int) – the number of available papertypes
n_verdicts (int) – the number of MAST mission papertype verdicts by LLM

bibcat.llm.openai module#

bibcat.llm.plots module#

bibcat.llm.plots.confusion_matrix_plot(summary_output_path: str | Path, missions: list[str]) → None[source]#

Create a confusion matrix figure

Create confusion matrix plots (counts and normalized) given a threshold value.

Parameters:

summary_output_path (str | pathlib.Path) – the filepath of the evaluation *summary_output.json
missions (list[str]) – list of the mission names to extract the classification labels.

bibcat.llm.plots.roc_plot(summary_output_path: str | Path, missions: list[str]) → None[source]#

Create a Receiver Operating Characteristic (ROC) curve plot

Parameters:

summary_output_path (str | pathlib.Path) – the filepath of the evaluation *summary_output.json
missions (list[str]) – list of the mission names to extract the classification labels.

bibcat.llm.stats module#

bibcat.llm.stats.analyze_missions(human_item: dict[str, str], llm_item: list[dict[str, Any]]) → tuple[dict[str, str], int][source]#

Analyze and compare LLM classifications against human classifications.

Parameters:

human_item (dict[str, str]) – human classification of mission and papertype, e.g., {“HST”: “MENTION”, “JWST”: “SUPERMENTION”}
llm_item (list[dict[str, Any]]) – list of llm classifications

Returns:

failure (dict) – dictionary of failed cases
n_matched_classifications (int) – number of matched classifications

bibcat.llm.stats.audit_summary(audit_results: dict) → dict[str, int][source]#

Create the summary of the inconsistent classifications

Parameters:

audit_results (dict) – the breakdown bibcode list of inconsistent llm classifications, e.g., “bibcodes”: {“2018A&A…610A..11I”: {“failures”: “GALEX”: “false_positive”},}

Returns:

summary_counts – various count summary

Return type:

dict[str, int]

bibcat.llm.stats.group_by_agg(confidence_name: str, threshold_acceptance: float, threshold_inspection: float, df: DataFrame)[source]#

Group DataFrame by mission and papertype and aggregate other properties.

Parameters:

confidence_name (str) – The key name for LLM confidences, e.g., “llm_confidences” for paper_output.json or “mean_llm_confidences” for summary_output.json
threshold_acceptance (float) – Threshold value to accept LLM papertype
threshold_inspection (float) – Threshold value to filter papers required for human inspection
df (pd.DataFrame) – Dataframe

Return type:

pd.DataFrame

bibcat.llm.stats.inconsistent_classifications(input_path: str | Path, output_path: str | Path)[source]#

Save falsely classified bibcodes to a json file for investigation

This function checks whether the llm classification differs from the human classification, or whether the llm incorrectly ignored the mission, and saves the results to a JSON file.

Parameters:

input_path (str | pathlib.Path) – Input eval_output file name/path for statistics
output_path (str | pathlib.Path) – File name/path to save the JSON file

Return type:

None

bibcat.llm.stats.save_evaluation_stats(input_path: str | Path, output_path: str | Path, threshold_acceptance: float, threshold_inspection: float)[source]#

Generate acceptance and inspection statistics and identify classification inconsistencies between humans and the LLM for evaluation summary data

This function performs the following actions:

Creates a statistics file containing:
- Accepted LLM Classifications: Number of papers with classifications accepted by the LLM based on a specified threshold value for each combination of mission and paper type.
- Human Inspection Requirements: Number of papers requiring human inspection
- Accepted Bibcodes: Bibcodes corresponding to the accepted classifications.
- Inspection-Required Bibcodes: Bibcodes that need human inspection due to ambiguous confidence values.

Parameters:

input_path (str | pathlib.Path) – Input paper_output file name/path for statistics
output_path (str | pathlib.Path) – File name/path to save the JSON file
threshold_acceptance (float) – Threshold value to accept LLM papertype
threshold_inspection (float) – Threshold value to filter papers required for human inspection

Return type:

None

Raises:

Exception – For any other exceptions that occur during DataFrame creation or file operations.

bibcat.llm.stats.save_operation_stats(input_path: str | Path, output_path: str | Path, threshold_acceptance: float, threshold_inspection: float)[source]#

Generate acceptance and inspection statistics from operational classifications

This function performs the following actions:

Creates a statistics file containing:
- Accepted LLM Classifications: Number of papers with classifications accepted by the LLM based on a specified threshold value for each combination of mission and paper type.
- Human Inspection Requirements: Number of papers requiring human inspection
- Accepted Bibcodes: Bibcodes corresponding to the accepted classifications.
- Inspection-Required Bibcodes: Bibcodes that need human inspection due to ambiguous confidence values.

Parameters:

input_path (str | pathlib.Path) – Input paper_output filename/path for statistics
output_path (str | pathlib.Path) – File name/path to save the JSON file
threshold_acceptance (float) – Threshold value to accept LLM papertype
threshold_inspection (float) – Threshold value to filter papers required for human inspection

Return type:

None

Raises:

Exception – For any other exceptions that occur during DataFrame creation or file operations.

bibcat.llm.stats.write_stats(output_path, threshold_acceptance, threshold_inspection, grouped_df)[source]#

Write the statistics into a JSON file.

Parameters:

output_path (pathlib.Path) – Filename path to save the stats results.
threshold_acceptance (float) – Threshold value to accept LLM papertype.
threshold_inspection (float) – Threshold value to filter papers required for human inspection.
grouped_df (pd.DataFrame) – Grouped DataFrame

Return type:

None

Module contents#

This module contains the scripts for LLM classification.
