bibcat.llm package#
Submodules#
bibcat.llm.chunker module#
bibcat.llm.evaluate module#
bibcat.llm.io module#
- bibcat.llm.io.adjust_model(batch_file: Path, orig: str, model: str)[source]#
Adjust the model in the jsonl batch file.
This function replaces the original model with the new model in the specified batch file.
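The replacement described above can be sketched as follows. This is a minimal illustration, not the package's implementation: it assumes each line of the batch file is a JSON object that stores the model name under `body.model`, and the helper name `replace_model_in_jsonl` and the model names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def replace_model_in_jsonl(batch_file: Path, orig: str, model: str) -> int:
    """Rewrite batch_file in place, swapping orig for model; return the number of lines changed."""
    changed = 0
    rewritten = []
    for line in batch_file.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        request = json.loads(line)
        body = request.get("body", {})
        # Assumed layout: the model name lives at request["body"]["model"].
        if body.get("model") == orig:
            body["model"] = model
            changed += 1
        rewritten.append(json.dumps(request))
    batch_file.write_text("\n".join(rewritten) + "\n")
    return changed


# Demo on a throwaway file with two hypothetical requests.
with tempfile.TemporaryDirectory() as tmp:
    batch = Path(tmp) / "batch.jsonl"
    batch.write_text(
        '{"custom_id": "a", "body": {"model": "model-old"}}\n'
        '{"custom_id": "b", "body": {"model": "model-other"}}\n'
    )
    n_changed = replace_model_in_jsonl(batch, "model-old", "model-new")
    models = [json.loads(line)["body"]["model"] for line in batch.read_text().splitlines()]
```

Only requests whose model equals `orig` are touched; other lines are written back unchanged.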
- bibcat.llm.io.get_file(filepath: str | None = None, bibcode: str | None = None, index: int | None = None) → str[source]#
Get a file path for paper data
Get a file path of a paper to upload to an LLM. If a file path is provided, e.g. a local pdf file, it is returned. If a bibcode or index is provided, retrieves the source dataset and writes it out to a temporary json file. The name of the temporary file is temp_****_[bibcode].json, prefixed with temp_ and suffixed with the bibcode of the paper.
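The temporary file naming described above can be reproduced with the standard library. The sketch below is an assumption about how such a name could be built, not the package's code; `write_temp_source` and the bibcode are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_temp_source(entry: dict, bibcode: str) -> Path:
    """Write entry to a temporary JSON file named temp_<random>_<bibcode>.json."""
    with tempfile.NamedTemporaryFile(
        mode="w", prefix="temp_", suffix=f"_{bibcode}.json", delete=False
    ) as handle:
        json.dump(entry, handle)
    return Path(handle.name)


bibcode = "2024ApJ...123..456X"  # hypothetical bibcode
temp_path = write_temp_source({"bibcode": bibcode, "title": "placeholder"}, bibcode)
temp_name = temp_path.name
round_trip = json.loads(temp_path.read_text())
temp_path.unlink()  # clean up the temporary file
```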
- bibcat.llm.io.get_llm_prompt(prompt_type: str) → str[source]#
Get an LLM prompt
Retrieve a user or agent prompt for an LLM from a file or the config. A user prompt is the text used as the input to the LLM, while the agent, or system, prompt is the text that defines the instructions or behavior the LLM agent should follow. The agent prompt is only used when creating a new agent for the first time.
You can define a custom user or agent prompt as a text file, located at $BIBCAT_DATA_DIR/llm_[prompt_type]_prompt.txt. For example, place your custom user prompt at $BIBCAT_DATA_DIR/llm_user_prompt.txt. This file takes precedence. If no custom prompt file is found, the default user prompt comes from the config file field llms.user_prompt.
To set an agent prompt, create a file at $BIBCAT_DATA_DIR/llm_agent_prompt.txt and add your instructions for the agent. If no custom agent prompt is found, a default agent prompt is used. The default agent prompt comes either from the config file field llms.agent_prompt or from the default file at etc/default_agent_prompt.txt.
- Parameters:
prompt_type (str) – The type of prompt to retrieve, either ‘user’ or ‘agent’
- Returns:
the text prompt
- Return type:
str
- Raises:
ValueError – when an invalid prompt type is provided
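The lookup order described for this function can be sketched as below. This is an illustration under stated assumptions, not the package's code: `resolve_prompt`, its parameters, and the flat `config_prompts` dict are hypothetical, and checking the config before the default agent file is an assumed order.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_prompt(
    prompt_type: str,
    data_dir: Path,
    config_prompts: dict,
    default_agent_file: Optional[Path] = None,
) -> str:
    """Return the prompt text, preferring a custom file over config defaults."""
    if prompt_type not in ("user", "agent"):
        raise ValueError(f"invalid prompt type: {prompt_type!r}")
    # 1. A custom file in the data directory takes precedence.
    custom = data_dir / f"llm_{prompt_type}_prompt.txt"
    if custom.is_file():
        return custom.read_text()
    # 2. Otherwise fall back to the config value (assumed key layout).
    key = f"{prompt_type}_prompt"
    if key in config_prompts:
        return config_prompts[key]
    # 3. For agent prompts only, fall back to a packaged default file.
    if prompt_type == "agent" and default_agent_file and default_agent_file.is_file():
        return default_agent_file.read_text()
    raise LookupError(f"no {prompt_type} prompt found")


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    config = {"user_prompt": "config user prompt"}
    from_config = resolve_prompt("user", data_dir, config)
    (data_dir / "llm_user_prompt.txt").write_text("custom user prompt")
    from_file = resolve_prompt("user", data_dir, config)
```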
- bibcat.llm.io.get_source(bibcode: str | None = None, index: int | None = None, body_only: bool = False) → dict | str[source]#
Get the source dataset for a given bibcode or index.
Retrieve the entry from the combined source dataset for a given bibcode or list index.
- Parameters:
- Returns:
a row from the source dataset
- Return type:
dict | str
- bibcat.llm.io.read_output(bibcode: str | None = None, filename: str | Path | None = None) → list[source]#
Read in the output for a given bibcode
Returns the content from the output JSON file for the given bibcode.
- bibcat.llm.io.write_output(paper_key: str, response: dict)[source]#
Write the output response to a file
Writes the output json response to a file, located at $BIBCAT_OUTPUT/output/llms/openai_[config.llms.openai.model]/[config.llms.prompt_output_file]
The output JSON file is organized by the filename or bibcode of the input file, with each prompt response appended in the relevant section.
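The append-per-key layout described above might look like the following sketch. It assumes the output file is a single JSON object whose keys are filenames or bibcodes and whose values are lists of responses; `append_response` and the sample data are hypothetical, not the package's own code.

```python
import json
import tempfile
from pathlib import Path


def append_response(output_file: Path, paper_key: str, response: dict) -> dict:
    """Append response under paper_key in a JSON file, creating the file if needed."""
    data = json.loads(output_file.read_text()) if output_file.exists() else {}
    # Each key holds a list so repeated prompts for one paper accumulate.
    data.setdefault(paper_key, []).append(response)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    output_file.write_text(json.dumps(data, indent=2))
    return data


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "llms" / "paper_output.json"
    append_response(out, "paper-1", {"HST": "SCIENCE"})
    append_response(out, "paper-1", {"HST": "MENTION"})
    append_response(out, "paper-2", {"JWST": "SCIENCE"})
    saved = json.loads(out.read_text())
```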
bibcat.llm.metrics module#
- bibcat.llm.metrics.append_human_labels_with_mapped_papertype(human_data: dict[str, str], mission: str, human_labels: list[str]) → None[source]#
Append human papertype to the human_labels list after mapping it to the allowed papertype
- Parameters:
human_data (dict[str, str]) – human classification data per bibcode in summary_output, e.g., “human”: {“GALEX”: “SCIENCE”, “HST”: “DATA-INFLUENCED”}
mission (str) – mission name, e.g., ROMAN
human_labels (list[str]) – list of human papertype labels for confusion matrix, e.g., [“SCIENCE”, “NONSCIENCE”, “SCIENCE”]
- Return type:
None
- bibcat.llm.metrics.append_llm_labels_with_mapped_papertype(llm_data: list[dict], mission: str, llm_labels: list[str]) → None[source]#
Append llm papertype to the llm_labels list after mapping it to the allowed papertype
- Parameters:
llm_data (list[dict]) – llm classification data per bibcode in summary_output. e.g., “llm”: [{“JWST”: “SCIENCE”}, {“ROMAN”: “SUPERMENTION”}, {“HST”: “SCIENCE”}]
mission (str) – mission name, e.g., ROMAN
llm_labels (list[str]) – list of llm papertype labels for confusion matrix, e.g., [“SCIENCE”, “NONSCIENCE”, “SCIENCE”]
- Return type:
None
- bibcat.llm.metrics.collect_confusion_matrix_cell_entries(human_labels_encoded: ndarray[tuple[int, ...], dtype[int64]], llm_labels_encoded: ndarray[tuple[int, ...], dtype[int64]], label_raws: list[dict], n_classes: int) → dict[source]#
Collect bibcode + raw-label dicts for confusion matrix cells.
- Parameters:
- Returns:
entries (dict) – a dict with keys ‘tn’, ‘fp’, ‘fn’, ‘tp’, each mapping to a list of entry dicts.
Each entry dict contains the following keys:
bibcode (str) – bibcode
human_raw (str) – raw human label before mapping
llm_raw (str) – raw llm label before mapping
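For the binary case, sorting entries into the four cells can be sketched as follows. This is an illustration only: it uses plain lists instead of NumPy arrays, assumes the encoded value 1 marks the positive class, and `collect_cells` and the sample records are hypothetical.

```python
def collect_cells(human_encoded, llm_encoded, label_raws):
    """Group raw-label records by confusion matrix cell (binary case, 1 = positive)."""
    cells = {"tn": [], "fp": [], "fn": [], "tp": []}
    for human, llm, raw in zip(human_encoded, llm_encoded, label_raws):
        if human == 1 and llm == 1:
            key = "tp"
        elif human == 0 and llm == 1:
            key = "fp"
        elif human == 1 and llm == 0:
            key = "fn"
        else:
            key = "tn"
        cells[key].append(
            {"bibcode": raw["bibcode"], "human_raw": raw["human_raw"], "llm_raw": raw["llm_raw"]}
        )
    return cells


raws = [
    {"bibcode": "bib-A", "human_raw": "SCIENCE", "llm_raw": "SCIENCE"},
    {"bibcode": "bib-B", "human_raw": "MENTION", "llm_raw": "SCIENCE"},
    {"bibcode": "bib-C", "human_raw": "SCIENCE", "llm_raw": "MENTION"},
    {"bibcode": "bib-D", "human_raw": "MENTION", "llm_raw": "SUPERMENTION"},
]
cells = collect_cells([1, 0, 1, 0], [1, 1, 0, 0], raws)
```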
- bibcat.llm.metrics.compute_and_save_metrics(metrics_data: dict[str], output_ascii_path: str | Path = 'metrics_summary.txt', output_json_path: str | Path = 'metrics_summary.json')[source]#
Compute llm performance metrics (accuracy, F1, precision, and recall scores) and other stats, and save the results to an ASCII file and a JSON file
- Parameters:
metrics_data (dict[str]) – contains the following variables:
threshold (float) – threshold
n_bibcodes (int) – The number of bibcodes (papers)
n_human_callouts (int) – The number of callouts by human classification in the whole dataset
n_llm_callouts (int) – The number of callouts by llm classification in the whole dataset
n_non_mast_callouts (int) – The number of non-MAST missions by llm in the whole dataset
n_missing_ouptput_bibcodes (int) – The number of bibcodes missing output in the whole dataset
non_mast_missions (list[str], sorted) – Non-MAST missions called out by llm in the whole dataset
n_human_llm_mission_callouts (int) – The number of mission callouts by both human and llm in the given missions
n_human_llm_hallucination (int) – The number of apparent hallucinations by both human and llm in the given missions when “mission_in_text” = false
human_llm_missions (list[str]) – The missions called out by both human and llm in the given missions
human_labels (list[str]) – True labels, human classified labels like [“SCIENCE”, “MENTION”]
label_raws (list[dict]) – list of raw labels (before mapping): dict with keys ‘human_raw’ and ‘llm_raw’
output_ascii_path (str | Path) – output file path to save the metrics summary in .txt
output_json_path (str | Path) – output file path to save the metrics summary in .json
- Return type:
None
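The four scores named above follow from the counts of true and false positives and negatives. The sketch below works them out in plain Python for a binary case with “SCIENCE” as the positive class; it is a worked illustration, not the function's implementation, and `binary_metrics` and the sample labels are hypothetical.

```python
def binary_metrics(human_labels, llm_labels, positive="SCIENCE"):
    """Accuracy, precision, recall, and F1, treating human labels as ground truth."""
    pairs = list(zip(human_labels, llm_labels))
    tp = sum(h == positive and l == positive for h, l in pairs)
    fp = sum(h != positive and l == positive for h, l in pairs)
    fn = sum(h == positive and l != positive for h, l in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


human = ["SCIENCE", "SCIENCE", "NONSCIENCE", "NONSCIENCE", "SCIENCE"]
llm = ["SCIENCE", "NONSCIENCE", "NONSCIENCE", "SCIENCE", "SCIENCE"]
scores = binary_metrics(human, llm)
```

Here there are 2 true positives, 1 false positive, 1 false negative, and 1 true negative, so accuracy is 3/5 and precision, recall, and F1 are each 2/3.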
- bibcat.llm.metrics.extract_eval_data(data: dict, missions: list[str]) → dict[str, Any][source]#
Extract the evaluation data for confusion matrix and stats related to mission call-outs, and save to files.
Extract the human/llm labels and other stats related to valid MAST mission and non-MAST mission call-outs from the evaluation json file, config.llms.eval_output_file (summary_output.json). This function is called when plotting a confusion matrix in bibcat.llm.plots.
- Parameters:
- Returns:
metrics_data (dict[str]) – contains various metrics
metrics_data contains the following variables:
threshold (float) – threshold
n_bibcodes (int) – The number of bibcodes (papers)
n_human_callouts (int) – The number of callouts by human classification in the whole dataset
n_llm_callouts (int) – The number of callouts by llm classification in the whole dataset
n_non_mast_callouts (int) – The number of non-MAST missions by llm in the whole dataset
n_missing_ouptput_bibcodes (int) – The number of bibcodes missing output in the whole dataset
non_mast_missions (list[str], sorted) – The non-MAST missions called out by llm in the whole dataset
n_human_llm_mission_callouts (int) – The number of mission callouts by both human and llm in the given missions
n_human_llm_hallucination (int) – The number of apparent hallucinations by both human and llm in the given missions when “mission_in_text” = false
human_llm_missions (list[str]) – The missions called out by both human and llm in the given missions
human_labels (list[str]) – True labels, human classified labels like [“SCIENCE”, “MENTION”] after mapping
llm_labels (list[str]) – Predicted labels by llm after mapping
label_raws (list[dict]) – list of raw labels (before mapping): dict with keys bibcode, mission, ‘human_raw’ and ‘llm_raw’
- bibcat.llm.metrics.extract_labels(missions: list[str], human_labels: list[str], llm_labels: list[str], human_llm_mission_callouts: list[str], ignored_papertype: str, item: dict[str, dict[str, Any]], n_human_llm_hallucination: int) → tuple[list[str], list[str], int][source]#
Extract human and llm papertype labels when the summary output of a bibcode has classification items other than “error”
This function extracts human and llm papertype labels from the summary_output for constructing the confusion matrix, then maps the papertypes to the allowed papertypes (for instance, MENTION maps to NONSCIENCE). Because the summary_output provides human and llm callouts only for the relevant missions, not all MAST missions, the relevant labels are extracted according to the following conditions:
When both human and LLM call out a given mission and their papertypes, assign them to the relevant papertypes.
When human calls out the mission but LLM ignores the paper, assign the human label to its relevant papertype but the LLM label to ignored_papertype.
When human ignores the paper but LLM calls out with a papertype, assign the LLM label to its relevant papertype but the human label to ignored_papertype.
When both human and LLM ignore the paper for the mission, assign both to ignored_papertype.
- Parameters:
human_labels (list[str]) – human papertypes before this bibcode
human_llm_mission_callouts (list[str]) – Missions called out by both human and llm
ignored_papertype (str, uppercase) – config.llms.map_papertypes.ignore.upper(), for instance, NONSCIENCE
n_human_llm_hallucination (int) – the number of hallucinations before the current bibcode
- Returns:
human_labels (list[str]) – human papertype labels updated after the current bibcode
llm_labels (list[str]) – llm papertype labels updated after the current bibcode
n_human_llm_hallucination (int) – the number of hallucinations updated after the current bibcode
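The four conditions listed for this function reduce to a lookup with a fallback. The sketch below shows only that step and leaves out papertype mapping and hallucination counting; it assumes the data shapes shown elsewhere on this page (a dict for human data, a list of single-mission dicts for llm data), and `label_pair` is a hypothetical helper.

```python
def label_pair(mission, human, llm, ignored_papertype):
    """Return (human_label, llm_label) for one mission, using ignored_papertype when absent."""
    # Flatten the list of single-mission verdicts into one lookup table.
    llm_by_mission = {}
    for verdict in llm:
        llm_by_mission.update(verdict)
    human_label = human.get(mission, ignored_papertype)
    llm_label = llm_by_mission.get(mission, ignored_papertype)
    return human_label, llm_label


human = {"HST": "SCIENCE", "GALEX": "MENTION"}
llm = [{"HST": "SCIENCE"}, {"JWST": "SCIENCE"}]
both = label_pair("HST", human, llm, "NONSCIENCE")
human_only = label_pair("GALEX", human, llm, "NONSCIENCE")
llm_only = label_pair("JWST", human, llm, "NONSCIENCE")
neither = label_pair("TESS", human, llm, "NONSCIENCE")
```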
- bibcat.llm.metrics.extract_roc_data(data: dict[str, dict[str, Any]], missions: list[str])[source]#
Extract the human and llm classification labels and confidences
Extract the human classes and confidence values from the evaluation json file, config.llms.eval_output_file (summary_output.json). You can extract data from only a single mission or a list of missions. The human labels (ground truth) and llm confidence values will be used to create a ROC curve.
- Parameters:
- Returns:
tuple – A tuple of the list of human labels, llm labels, and the threshold value for verdict acceptance.
human_labels (list[str]) – True labels by human, a list of papertypes, e.g., “SCIENCE” or “MENTION” (or “NONSCIENCE”); see the allowed classifications in config.llms.papertypes. For example, [“SCIENCE”, “MENTION”]
llm_confidences (list[list[float]]) – A list of confidence score sets for all verdicts ([[p_science, p_mention],]) where p_science and p_mention represent confidence values of “SCIENCE” and “MENTION”(or “NONSCIENCE”) respectively. For example: [[0.9, 0.1], [0.4, 0.6]]
human_llm_missions (list[str], sorted) – A set of missions, each containing both human- and LLM-classified paper types, used for evaluation plots.
- bibcat.llm.metrics.get_roc_metrics(llm_confidences: ndarray[tuple[int, ...], dtype[float64]], binarized_human_labels: ndarray[tuple[int, ...], dtype[int64]], n_papertype: int)[source]#
Compute ROC curve and ROC AUC (area under curve)
- Parameters:
llm_confidences (array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case) – the numpy array of llm_confidences
binarized_human_labels (array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case) – the binarized human labels, e.g., [[0] [1] [1] [0] [0]] in the binary case
- Returns:
tuple – a tuple of false positive rate (fpr), true positive rate (tpr), and roc_auc
fpr (float) – false positive rate
tpr (float) – true positive rate
roc_auc (float) – ROC area under curve
macro_roc_auc_ovr (float) – Macro-averaged One-vs-Rest ROC AUC score for the multiclass case (only when n_papertypes > 2)
micro_roc_auc_ovr (float) – Micro-averaged One-vs-Rest ROC AUC score for the multiclass case (only when n_papertypes > 2)
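To make the returned quantities concrete, the following plain-Python sketch builds a binary ROC curve and its area by the trapezoidal rule. It is a stand-in for illustration and does not reflect how this function computes its results; `roc_curve_points`, `trapezoid_auc`, and the sample scores are hypothetical.

```python
def roc_curve_points(scores, labels):
    """Return (fpr, tpr) lists by sweeping a threshold from high to low scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the curve via the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2 for k in range(1, len(fpr))
    )


p_science = [0.9, 0.8, 0.7, 0.6]  # confidence that each paper is SCIENCE
is_science = [1, 0, 1, 0]         # binarized human labels
fpr, tpr = roc_curve_points(p_science, is_science)
roc_auc = trapezoid_auc(fpr, tpr)
```

In this example three of the four positive/negative pairs are ranked correctly, which matches the resulting area of 0.75.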
- bibcat.llm.metrics.human_labels_when_no_llm_output(missions, human_data, human_labels, ignored_papertype)[source]#
Assign human labels when human classifications exist even with no llm output
- Parameters:
human_data (dict[str, str]) – dictionary values of item[“human”], e.g., {“JWST”: “SCIENCE”}
human_labels (list[str]) – True labels by human, a list of papertypes before papertype mapping, For example, [“SCIENCE”, “MENTION”]
ignored_papertype (str, uppercase) – config.llms.map_papertypes.ignore.upper(), for instance, “NONSCIENCE”
- Returns:
human_labels – updated human labels based on the presence of human classifications
- Return type:
- bibcat.llm.metrics.map_papertype(papertype: str) → str | None[source]#
Map a classified papertype to an allowed papertype. For instance, if papertype is “SUPERMENTION” or “IGNORE”, it returns “NONSCIENCE” or a custom papertype.
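A mapping of this kind can be pictured as a dictionary lookup. The table below is an assumption for illustration only (the real mapping comes from the package configuration); `ASSUMED_MAP` and `map_label` are hypothetical names, and returning None for an unknown papertype is an assumed reading of the `str | None` return type.

```python
# Hypothetical mapping table; the actual one is defined in the config.
ASSUMED_MAP = {
    "SCIENCE": "SCIENCE",
    "MENTION": "NONSCIENCE",
    "SUPERMENTION": "NONSCIENCE",
    "IGNORE": "NONSCIENCE",
}


def map_label(papertype, mapping=ASSUMED_MAP):
    """Return the allowed papertype, or None when the input is not in the table."""
    return mapping.get(papertype.upper())


mapped = [map_label(p) for p in ["science", "SUPERMENTION", "IGNORE", "UNKNOWN"]]
```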
- bibcat.llm.metrics.prepare_roc_inputs(human_labels: list[str], llm_confidences: list[list[float]])[source]#
Prepare input data for ROC and AUC (area under curve)
- Parameters:
- Returns:
tuple – A tuple of confidences, binarized_human_labels, and n_classes.
binarized_human_labels (NDArray[np.int64]) – Array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case. Binarized human labels as ROC input, e.g.,[[0][1]..],[[0 1 0 0][0 1 0 0]…]
llm_confidences (NDArray[np.float64]) – Array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case. A list of confidence score pairs for all verdicts. Each inner list contains two floats: the first for “SCIENCE” and the second for “MENTION”. For example: [[0.9 0.1] [0.4 0.6]]
n_papertype (int) – the number of available papertypes
n_verdicts (int) – the number of MAST mission papertype verdicts by LLM
bibcat.llm.openai module#
bibcat.llm.plots module#
- bibcat.llm.plots.confusion_matrix_plot(summary_output_path: str | Path, missions: list[str]) → None[source]#
Create a confusion matrix figure
Create confusion matrix plots (counts and normalized) given a threshold value.
- Parameters:
summary_output_path (str | pathlib.Path) – the filepath of the evaluation *summary_output.json
missions (list[str]) – list of the mission names to extract the classification labels.
- bibcat.llm.plots.roc_plot(summary_output_path: str | Path, missions: list[str]) → None[source]#
Create a Receiver Operating Characteristic (ROC) curve plot
- Parameters:
summary_output_path (str | pathlib.Path) – the filepath of the evaluation *summary_output.json
missions (list[str]) – list of the mission names to extract the classification labels.
bibcat.llm.stats module#
- bibcat.llm.stats.analyze_missions(human_item: dict[str, str], llm_item: list[dict[str, Any]]) → tuple[dict[str, str], int][source]#
Analyze and compare LLM classifications against human classifications.
- Parameters:
- Returns:
failure (dict) – dictionary of failed cases
n_matched_classifications (int) – number of matched classifications
- bibcat.llm.stats.audit_summary(audit_results: dict) → dict[str, int][source]#
Create the summary of the inconsistent classifications
- bibcat.llm.stats.group_by_agg(confidence_name: str, threshold_acceptance: float, threshold_inspection: float, df: DataFrame)[source]#
Group DataFrame by mission and papertype and aggregate other properties.
- Parameters:
confidence_name (str) – The key name for LLM confidences, e.g., “llm_confidences” for paper_output.json or “mean_llm_confidences” for summary_output.json
threshold_acceptance (float) – Threshold value to accept LLM papertype
threshold_inspection (float) – Threshold value to filter papers required for human inspection
df (pd.DataFrame) – Dataframe
- Return type:
pd.DataFrame
- bibcat.llm.stats.inconsistent_classifications(input_path: str | Path, output_path: str | Path)[source]#
Save falsely classified bibcodes to a json file for investigation
This function checks whether the llm classification differs from the human classification, or whether the llm incorrectly ignores the mission, and saves the results to a JSON file.
- Parameters:
input_path (str | pathlib.Path) – Input eval_output file name/path for statistics
output_path (str | pathlib.Path) – File name/path to save the JSON file
- Return type:
None
- bibcat.llm.stats.save_evaluation_stats(input_path: str | Path, output_path: str | Path, threshold_acceptance: float, threshold_inspection: float)[source]#
Generate acceptance and inspection statistics and identify classification inconsistencies between humans and the LLM for evaluation summary data
- This function performs the following actions:
- Creates a statistics file containing:
Accepted LLM Classifications: Number of papers with classifications accepted by the LLM based on a specified threshold value for each combination of mission and paper type.
Human Inspection Requirements: Number of papers requiring human inspection
Accepted Bibcodes: Bibcodes corresponding to the accepted classifications.
Inspection-Required Bibcodes: Bibcodes that need human inspection due to ambiguous confidence values.
- Parameters:
input_path (str | pathlib.Path) – Input paper_output file name/path for statistics
output_path (str | pathlib.Path) – File name/path to save the JSON file
threshold_acceptance (float) – Threshold value to accept LLM papertype
threshold_inspection (float) – Threshold value to filter papers required for human inspection
- Return type:
None
- Raises:
Exception – For any other exceptions that occur during DataFrame creation or file operations.
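The grouping behind these statistics can be sketched in plain Python. The threshold rule here is an assumption, since this page does not define it: a paper counts as accepted when its confidence is at least threshold_acceptance and as needing inspection when its confidence is at most threshold_inspection. The real rule may differ. `group_stats`, the row layout, and the sample values are hypothetical.

```python
from collections import defaultdict


def group_stats(rows, threshold_acceptance, threshold_inspection):
    """Collect accepted and inspection-required bibcodes per (mission, papertype)."""
    stats = defaultdict(lambda: {"accepted": [], "inspection": []})
    for row in rows:
        bucket = stats[(row["mission"], row["papertype"])]
        # Assumed rule: high confidence is accepted, low confidence is flagged.
        if row["confidence"] >= threshold_acceptance:
            bucket["accepted"].append(row["bibcode"])
        elif row["confidence"] <= threshold_inspection:
            bucket["inspection"].append(row["bibcode"])
    return dict(stats)


rows = [
    {"bibcode": "bib-A", "mission": "HST", "papertype": "SCIENCE", "confidence": 0.9},
    {"bibcode": "bib-B", "mission": "HST", "papertype": "SCIENCE", "confidence": 0.5},
    {"bibcode": "bib-C", "mission": "HST", "papertype": "SCIENCE", "confidence": 0.6},
    {"bibcode": "bib-D", "mission": "JWST", "papertype": "MENTION", "confidence": 0.8},
]
stats = group_stats(rows, threshold_acceptance=0.7, threshold_inspection=0.5)
n_accepted = {key: len(value["accepted"]) for key, value in stats.items()}
```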
- bibcat.llm.stats.save_operation_stats(input_path: str | Path, output_path: str | Path, threshold_acceptance: float, threshold_inspection: float)[source]#
Generate acceptance and inspection statistics from operational classifications
- This function performs the following actions:
- Creates a statistics file containing:
Accepted LLM Classifications: Number of papers with classifications accepted by the LLM based on a specified threshold value for each combination of mission and paper type.
Human Inspection Requirements: Number of papers requiring human inspection
Accepted Bibcodes: Bibcodes corresponding to the accepted classifications.
Inspection-Required Bibcodes: Bibcodes that need human inspection due to ambiguous confidence values.
- Parameters:
input_path (str | pathlib.Path) – Input paper_output filename/path for statistics
output_path (str | pathlib.Path) – File name/path to save the JSON file
threshold_acceptance (float) – Threshold value to accept LLM papertype
threshold_inspection (float) – Threshold value to filter papers required for human inspection
- Return type:
None
- Raises:
Exception – For any other exceptions that occur during DataFrame creation or file operations.
- bibcat.llm.stats.write_stats(output_path, threshold_acceptance, threshold_inspection, grouped_df)[source]#
Write the statistics into a JSON file.
- Parameters:
output_path (pathlib.Path) – Filename path to save the stats results.
threshold_acceptance (float) – Threshold value to accept LLM papertype.
threshold_inspection (float) – Threshold value to filter papers required for human inspection.
grouped_df (pd.DataFrame) – Grouped DataFrame
- Return type:
None
Module contents#
This module contains the scripts for LLM classification.