bibcat.llm package#

Submodules#

bibcat.llm.chunker module#

bibcat.llm.evaluate module#

bibcat.llm.io module#

bibcat.llm.metrics module#

bibcat.llm.metrics.append_human_labels_with_mapped_papertype(human_data: dict[str, str], mission: str, human_labels: list[str]) → None[source]#

Append the human papertype to the human_labels list after mapping it to an allowed papertype.

Parameters:

human_data (dict[str, str]) – human classification data per bibcode in summary_output, e.g., “human”: {“GALEX”: “SCIENCE”, “HST”: “DATA-INFLUENCED”}
mission (str) – mission name, e.g., ROMAN
human_labels (list[str]) – list of human papertype labels for the confusion matrix, e.g., [“SCIENCE”, “NONSCIENCE”, “SCIENCE”]

Return type:

None

bibcat.llm.metrics.append_llm_labels_with_mapped_papertype(llm_data: list[dict], mission: str, llm_labels: list[str]) → None[source]#

Append the llm papertype to the llm_labels list after mapping it to an allowed papertype.

Parameters:

llm_data (list[dict]) – llm classification data per bibcode in summary_output. e.g., “llm”: [{“JWST”: “SCIENCE”}, {“ROMAN”: “SUPERMENTION”}, {“HST”: “SCIENCE”}]
mission (str) – mission name, e.g., ROMAN
llm_labels (list[str]) – list of llm papertype labels for the confusion matrix, e.g., [“SCIENCE”, “NONSCIENCE”, “SCIENCE”]

Return type:

None
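The append-after-mapping pattern used by the two functions above can be sketched as follows. This is a minimal illustration, not bibcat's implementation: the helper name `append_mapped_label` and the `PAPERTYPE_MAP` table are hypothetical stand-ins for the mapping that the package reads from config.llms.map_papertypes, and the llm data is assumed to be a list of single-mission dicts as in the example above.

```python
# Hypothetical mapping table; the real one comes from config.llms.map_papertypes.
PAPERTYPE_MAP = {"SUPERMENTION": "NONSCIENCE", "MENTION": "NONSCIENCE"}


def append_mapped_label(llm_data: list[dict], mission: str, llm_labels: list[str]) -> None:
    """Append the mapped papertype for `mission` to `llm_labels` in place."""
    for entry in llm_data:
        if mission in entry:
            raw = entry[mission].upper()
            # Papertypes without a mapping entry pass through unchanged.
            llm_labels.append(PAPERTYPE_MAP.get(raw, raw))
            return  # stop at the first entry for this mission


labels: list[str] = []
data = [{"JWST": "SCIENCE"}, {"ROMAN": "SUPERMENTION"}, {"HST": "SCIENCE"}]
append_mapped_label(data, "ROMAN", labels)
append_mapped_label(data, "HST", labels)
print(labels)
```

The function returns None and mutates the list it is given, which mirrors the `→ None` signatures documented above.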

bibcat.llm.metrics.collect_confusion_matrix_cell_entries(human_labels_encoded: ndarray[tuple[int, ...], dtype[int64]], llm_labels_encoded: ndarray[tuple[int, ...], dtype[int64]], label_raws: list[dict], n_classes: int) → dict[source]#

Collect bibcode + raw-label dicts for confusion matrix cells.

Parameters:

human_labels_encoded (NDArray[np.int64]) – encoded human labels
llm_labels_encoded (NDArray[np.int64]) – encoded llm labels
label_raws (list[dict]) – list of raw labels (before mapping): dict with keys bibcode, mission, human_raw and llm_raw
n_classes (int) – number of classes

Returns:

entries (dict) – a dict with keys ‘tn’, ‘fp’, ‘fn’, ‘tp’, each mapping to a list of entry dicts.
Each entry dict contains the following variables:
bibcode (str) – bibcode
human_raw (str) – raw human label before mapping
llm_raw (str) – raw llm label before mapping
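A rough sketch of how such cell entries could be collected in the binary case is shown below. It assumes two classes with class 1 treated as the positive class, uses plain lists instead of NumPy arrays, and the name `collect_cells` and the bibcodes "A" and "B" are made-up placeholders; the package's own function may differ.

```python
def collect_cells(human_encoded: list[int], llm_encoded: list[int], label_raws: list[dict]) -> dict:
    """Group raw-label records by binary confusion matrix cell."""
    # (human, llm) -> cell name, treating class 1 as the positive class (assumption).
    cell_names = {(0, 0): "tn", (0, 1): "fp", (1, 0): "fn", (1, 1): "tp"}
    entries: dict[str, list[dict]] = {name: [] for name in cell_names.values()}
    for human, llm, raw in zip(human_encoded, llm_encoded, label_raws):
        entries[cell_names[(human, llm)]].append(
            {"bibcode": raw["bibcode"], "human_raw": raw["human_raw"], "llm_raw": raw["llm_raw"]}
        )
    return entries


# Placeholder records; real ones carry actual bibcodes.
raws = [
    {"bibcode": "A", "mission": "HST", "human_raw": "SCIENCE", "llm_raw": "SCIENCE"},
    {"bibcode": "B", "mission": "HST", "human_raw": "MENTION", "llm_raw": "SCIENCE"},
]
cells = collect_cells([1, 0], [1, 1], raws)
print(cells["tp"], cells["fp"])
```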

bibcat.llm.metrics.compute_and_save_metrics(metrics_data: dict[str], output_ascii_path: str | Path = 'metrics_summary.txt', output_json_path: str | Path = 'metrics_summary.json')[source]#

Compute llm performance metrics (accuracy, F1, precision, and recall scores) and other stats, and save the results to an ASCII file and a JSON file.

Parameters:

metrics_data (dict[str]) – contains various metrics
metrics_data contains the following variables:
threshold (float) – threshold
n_bibcodes (int) – The number of bibcodes (papers)
n_human_callouts (int) – The number of callouts by human classification in the whole dataset
n_llm_callouts (int) – The number of callouts by llm classification in the whole dataset
n_non_mast_callouts (int) – The number of non-MAST missions by llm in the whole dataset
n_missing_ouptput_bibcodes (int) – The number of bibcodes missing output in the whole dataset
non_mast_missions (list[str], sorted) – The non-MAST missions called out by llm in the whole dataset
n_human_llm_mission_callouts (int) – The number of mission callouts by both human and llm in the given missions
n_human_llm_hallucination (int) – The number of apparent hallucinations by both human and llm in the given missions when “mission_in_text” = false
human_llm_missions (list[str]) – The missions called out by both human and llm in the given missions
human_labels (list[str]) – True labels, human classified labels like [“SCIENCE”, “MENTION”]
llm_labels (list[str]) – Predicted labels by llm
label_raws (list[dict]) – list of raw labels (before mapping): dict with keys ‘human_raw’ and ‘llm_raw’
output_ascii_path (str | Path) – output file path to save the metrics summary in .txt
output_json_path (str | Path) – output file path to save the metrics summary in .json

Return type:

None
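For reference, the four scores named above can be computed from the two label lists as in the sketch below. It is dependency-free and covers only the binary case with “SCIENCE” assumed to be the positive class; the function name `binary_scores` is hypothetical, and the package may compute these scores through a library instead.

```python
def binary_scores(human_labels: list[str], llm_labels: list[str], positive: str = "SCIENCE") -> dict[str, float]:
    """Accuracy, precision, recall, and F1 with `positive` as the positive class."""
    pairs = list(zip(human_labels, llm_labels))
    tp = sum(h == positive and m == positive for h, m in pairs)
    fp = sum(h != positive and m == positive for h, m in pairs)
    fn = sum(h == positive and m != positive for h, m in pairs)
    correct = sum(h == m for h, m in pairs)
    # Guard every ratio against an empty denominator.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


scores = binary_scores(
    ["SCIENCE", "NONSCIENCE", "SCIENCE", "NONSCIENCE"],  # human (true) labels
    ["SCIENCE", "SCIENCE", "NONSCIENCE", "NONSCIENCE"],  # llm (predicted) labels
)
print(scores)
```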

bibcat.llm.metrics.extract_eval_data(data: dict, missions: list[str]) → dict[str, Any][source]#

Extract the evaluation data for confusion matrix and stats related to mission call-outs, and save to files.

Extract the human/llm labels and other stats related to valid MAST mission and non-MAST mission call-outs from the evaluation json file, config.llms.eval_output_file (summary_output.json). This function is called when plotting a confusion matrix in bibcat.llm.plots.

Parameters:

data (dict) – the dict of the evaluation data of config.llms.eval_output_file (*summary_output.json)
missions (list[str]) – list of the mission names to extract the classification labels.

Returns:

metrics_data (dict[str]) – contains various metrics
metrics_data contains the following variables:
threshold (float) – threshold
n_bibcodes (int) – The number of bibcodes (papers)
n_human_callouts (int) – The number of callouts by human classification in the whole dataset
n_llm_callouts (int) – The number of callouts by llm classification in the whole dataset
n_non_mast_callouts (int) – The number of non-MAST missions by llm in the whole dataset
n_missing_ouptput_bibcodes (int) – The number of bibcodes missing output in the whole dataset
non_mast_missions (list[str], sorted) – The non-MAST missions called out by llm in the whole dataset
n_human_llm_mission_callouts (int) – The number of mission callouts by both human and llm in the given missions
n_human_llm_hallucination (int) – The number of apparent hallucinations by both human and llm in the given missions when “mission_in_text” = false
human_llm_missions (list[str]) – The missions called out by both human and llm in the given missions
human_labels (list[str]) – True labels, human classified labels like [“SCIENCE”, “MENTION”] after mapping
llm_labels (list[str]) – Predicted labels by llm after mapping
label_raws (list[dict]) – list of raw labels (before mapping): dict with keys bibcode, mission, ‘human_raw’ and ‘llm_raw’

bibcat.llm.metrics.extract_labels(missions: list[str], human_labels: list[str], llm_labels: list[str], human_llm_mission_callouts: list[str], ignored_papertype: str, item: dict[str, dict[str, Any]], n_human_llm_hallucination: int) → tuple[list[str], list[str], int][source]#

Extract human and llm papertype labels when the summary output of a bibcode has classification items other than “error”

This function extracts human and llm papertype labels from the summary_output for constructing a confusion matrix, then maps the papertypes to the allowed papertypes (for instance, MENTION maps to NONSCIENCE). Because the summary_output provides human and llm callouts only for the relevant missions, not for all MAST missions, the relevant labels are extracted according to the following conditions:

When both human and LLM call out a given mission and their papertypes, assign them to the relevant papertypes.
When human calls out the mission but LLM ignores the paper, assign the human label to its relevant papertype but the LLM label to ignored_papertype.
When human ignores the paper but LLM calls out with a papertype, assign the LLM label to its relevant papertype but the human label to ignored_papertype.
When both human and LLM ignore the paper for the mission, assign both to ignored_papertype.

Parameters:

missions (list[str]) – MAST missions of interest
human_labels (list[str]) – human papertypes before this bibcode
llm_labels (list[str]) – llm papertypes before this bibcode
human_llm_mission_callouts (list[str]) – Missions called out by both human and llm
ignored_papertype (str, uppercase) – config.llms.map_papertypes.ignore.upper(), for instance, NONSCIENCE
item (dict[str, dict[str, Any]]) – bibcode dictionary item
n_human_llm_hallucination (int) – the number of hallucinations before the current bibcode

Returns:

human_labels (list[str]) – human papertype labels updated after the current bibcode
llm_labels (list[str]) – llm papertype labels updated after the current bibcode
n_human_llm_hallucination (int) – the number of hallucinations updated after the current bibcode
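The four call-out conditions listed in the description reduce to one fallback rule, sketched below for a single mission of a single bibcode. This is a simplified illustration under stated assumptions: both sides are given as flat mission-to-papertype dicts (the real llm data is a list of dicts), papertype mapping and hallucination counting are left out, and the name `labels_for_mission` is hypothetical.

```python
def labels_for_mission(
    human: dict[str, str], llm: dict[str, str], mission: str, ignored_papertype: str
) -> tuple[str, str]:
    """Return (human_label, llm_label) for one mission of one bibcode."""
    # A missing call-out on either side falls back to ignored_papertype,
    # which covers all four conditions with a single rule.
    return human.get(mission, ignored_papertype), llm.get(mission, ignored_papertype)


human = {"HST": "SCIENCE", "JWST": "SCIENCE"}
llm = {"HST": "SCIENCE", "TESS": "SCIENCE"}
for mission in ["HST", "JWST", "TESS", "GALEX"]:
    print(mission, labels_for_mission(human, llm, mission, "NONSCIENCE"))
```

Keeping a label pair for every mission of interest, even when neither side calls it out, is what keeps the human and llm label lists the same length for the confusion matrix.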

bibcat.llm.metrics.extract_roc_data(data: dict[str, dict[str, Any]], missions: list[str])[source]#

Extract the human and llm classification labels and confidences

Extract the human classes and confidence values from the evaluation json file, config.llms.eval_output_file (summary_output.json). You can extract data from only a single mission or a list of missions. The human labels (ground truth) and llm confidence values will be used to create a ROC curve.

Parameters:

data (dict[str, dict[str, Any]]) – the dict of the evaluation data of config.llms.eval_output_file (summary_output.json)
missions (list[str]) – list of the mission names to extract the classification labels.

Returns:

tuple – A tuple of the list of human labels, llm labels, and the threshold value for verdict acceptance.
human_labels (list[str]) – True labels by human, a list of papertypes, e.g., “SCIENCE” or “MENTION” (or “NONSCIENCE”); see the allowed classifications in config.llms.papertypes. For example, [“SCIENCE”, “MENTION”]
llm_confidences (list[list[float]]) – A list of confidence score sets for all verdicts ([[p_science, p_mention],]) where p_science and p_mention represent confidence values of “SCIENCE” and “MENTION”(or “NONSCIENCE”) respectively. For example: [[0.9, 0.1], [0.4, 0.6]]
human_llm_missions (list[str], sorted) – A set of missions, each containing both human- and LLM-classified paper types, used for evaluation plots.

bibcat.llm.metrics.get_roc_metrics(llm_confidences: ndarray[tuple[int, ...], dtype[float64]], binarized_human_labels: ndarray[tuple[int, ...], dtype[int64]], n_papertype: int)[source]#

Compute ROC curve and ROC AUC (area under curve)

Parameters:

llm_confidences (array-like of shape (n_samples,) in the binary case or (n_samples, n_classes) in the multi-class case) – the numpy array of llm confidences
binarized_human_labels (array-like of shape (n_samples,) in the binary case or (n_samples, n_classes) in the multi-class case) – binarized human labels, e.g., [[0] [1] [1] [0] [0]] in the binary case
n_papertype (int) – the number of papertypes

Returns:

tuple – a tuple of false positive rate (fpr), true positive rate (tpr), and roc_auc
fpr (float) – false positive rate
tpr (float) – true positive rate
roc_auc (float) – ROC area under curve
macro_roc_auc_ovr (float) – Macro-averaged One-vs-Rest ROC AUC score for the multiclass case (only when n_papertype > 2)
micro_roc_auc_ovr (float) – Micro-averaged One-vs-Rest ROC AUC score for the multiclass case (only when n_papertype > 2)
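To make the returned quantities concrete, the sketch below computes ROC points and the area under the curve for the binary case in plain Python. It does not reproduce the calls the package makes; the names `roc_points` and `trapezoid_auc` are hypothetical, and the input is assumed to contain at least one positive and one negative label.

```python
def roc_points(scores: list[float], labels: list[int]) -> tuple[list[float], list[float]]:
    """Return (fpr, tpr) lists obtained by sweeping the threshold over the scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    # Sort by descending score; lowering the threshold admits one sample at a time.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample of a tied score group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(fpr: list[float], tpr: list[float]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2 for i in range(1, len(fpr)))


# Confidence for the positive class, with 1 marking a positive human label.
fpr, tpr = roc_points([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
roc_auc = trapezoid_auc(fpr, tpr)
print(fpr, tpr, roc_auc)
```

In the multiclass case, the One-vs-Rest scores listed above repeat this computation per class before averaging.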

bibcat.llm.metrics.human_labels_when_no_llm_output(missions, human_data, human_labels, ignored_papertype)[source]#

Assign human labels when human classifications exist even with no llm output

Parameters:

missions (list[str]) – list of missions
human_data (dict[str, str]) – dictionary values of item[“human”], e.g., {“JWST”: “SCIENCE”}
human_labels (list[str]) – True labels by human, a list of papertypes before papertype mapping, for example, [“SCIENCE”, “MENTION”]
ignored_papertype (str, uppercase) – config.llms.map_papertypes.ignore.upper(), for instance, “NONSCIENCE”

Returns:

human_labels – updated human labels based on the presence of human classifications

Return type:

list[str]

bibcat.llm.metrics.map_papertype(papertype: str) → str | None[source]#

Map a classified papertype to an allowed papertype. For instance, if papertype is “SUPERMENTION” or “IGNORE”, it returns “NONSCIENCE” or a custom papertype.

Parameters:

papertype (str, uppercase) – human or llm classified papertype, e.g., “SCIENCE”, “DATA_INFLUENCED”

Returns:

mapped_papertype – mapped papertype following config.llms.map_papertypes, e.g., “MENTION” if papertype is “SUPERMENTION”

Return type:

str, uppercase

bibcat.llm.metrics.prepare_roc_inputs(human_labels: list[str], llm_confidences: list[list[float]])[source]#

Prepare input data for ROC and AUC (area under curve)

Parameters:

human_labels (list[str]) – True labels by human, a list of papertypes, e.g., “SCIENCE” or “MENTION”; see the allowed classifications in config.llms.papertypes
llm_confidences (list[list[float]]) – Predicted labels by llm, a list of confidence score pairs for all verdicts.

Returns:

tuple – A tuple of confidences, binarized_human_labels, and n_classes.
binarized_human_labels (NDArray[np.int64]) – Array-like of shape (n_samples,) in the binary case or (n_samples, n_classes) in the multi-class case. Binarized human labels as ROC input, e.g., [[0] [1] …] or [[0 1 0 0] [0 1 0 0] …]
llm_confidences (NDArray[np.float64]) – Array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case. A list of confidence score pairs for all verdicts. Each inner list contains two floats: the first for “SCIENCE” and the second for “MENTION”. For example: [[0.9 0.1] [0.4 0.6]]
n_papertype (int) – the number of available papertypes
n_verdicts (int) – the number of MAST mission papertype verdicts by LLM
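The binarization step can be illustrated with the dependency-free sketch below. It assumes a one-column encoding in the binary case with the second class as the positive one, and a one-hot row per label otherwise; the name `binarize` and the class ordering are assumptions, not the package's actual choices.

```python
def binarize(labels: list[str], classes: list[str]) -> list[list[int]]:
    """Binarize string labels against an ordered list of classes."""
    if len(classes) == 2:
        # Binary case: a single column, 1 for the second class.
        return [[int(label == classes[1])] for label in labels]
    # Multi-class case: one-hot row per label.
    return [[int(label == cls) for cls in classes] for label in labels]


binary = binarize(["SCIENCE", "MENTION", "SCIENCE"], ["MENTION", "SCIENCE"])
multi = binarize(["MENTION", "SCIENCE"], ["DATA-INFLUENCED", "MENTION", "SCIENCE"])
print(binary)
print(multi)
```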
