bibcat.llm package#

Submodules#

bibcat.llm.chunker module#

bibcat.llm.evaluate module#

bibcat.llm.io module#

bibcat.llm.metrics module#

bibcat.llm.metrics.append_human_labels_with_mapped_papertype(human_data: dict[str, str], mission: str, human_labels: list[str]) None[source]#

Append human papertype to the human_labels list after mapping it to the allowed papertype

Parameters:
  • human_data (dict[str, str]) – human classification data per bibcode in summary_output, e.g., “human”: {“GALEX”: “SCIENCE”, “HST”: “DATA-INFLUENCED”}

  • mission (str) – mission name, e.g., ROMAN

  • human_labels (list[str]) – list of human papertype labels for confusion matrix, e.g., [“SCIENCE”, “NONSCIENCE”, “SCIENCE”]

Return type:

None

bibcat.llm.metrics.append_llm_labels_with_mapped_papertype(llm_data: list[dict], mission: str, llm_labels: list[str]) None[source]#

Append llm papertype to the llm_labels list after mapping it to the allowed papertype

Parameters:
  • llm_data (list[dict]) – llm classification data per bibcode in summary_output. e.g., “llm”: [{“JWST”: “SCIENCE”}, {“ROMAN”: “SUPERMENTION”}, {“HST”: “SCIENCE”}]

  • mission (str) – mission name, e.g., ROMAN

  • llm_labels (list[str]) – list of llm papertype labels for confusion matrix, e.g., [“SCIENCE”, “NONSCIENCE”, “SCIENCE”]

Return type:

None
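
The lookup-then-map behavior described above can be sketched as follows. This is a minimal illustration, not the bibcat implementation: the helper `append_llm_label` and the table `PAPERTYPE_MAP` are hypothetical, and the real mapping is read from config.llms.map_papertypes.

```python
# Hypothetical mapping table; the real one comes from config.llms.map_papertypes.
PAPERTYPE_MAP = {
    "SCIENCE": "SCIENCE",
    "MENTION": "NONSCIENCE",
    "SUPERMENTION": "NONSCIENCE",
}


def append_llm_label(llm_data, mission, llm_labels):
    """Append the mapped papertype for `mission` to `llm_labels` in place."""
    for entry in llm_data:  # each entry is a one-item dict, e.g. {"ROMAN": "SUPERMENTION"}
        if mission in entry:
            raw = entry[mission].upper()
            # Assumption of this sketch: unmapped papertypes pass through unchanged.
            llm_labels.append(PAPERTYPE_MAP.get(raw, raw))
            return


llm_data = [{"JWST": "SCIENCE"}, {"ROMAN": "SUPERMENTION"}, {"HST": "SCIENCE"}]
llm_labels = ["SCIENCE"]
append_llm_label(llm_data, "ROMAN", llm_labels)
print(llm_labels)  # ['SCIENCE', 'NONSCIENCE']
```

The list is modified in place, which matches the None return type documented above.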

bibcat.llm.metrics.collect_confusion_matrix_cell_entries(human_labels_encoded: ndarray[tuple[int, ...], dtype[int64]], llm_labels_encoded: ndarray[tuple[int, ...], dtype[int64]], label_raws: list[dict], n_classes: int) dict[source]#

Collect bibcode + raw-label dicts for confusion matrix cells.

Parameters:
  • human_labels_encoded (NDArray[np.int64]) – encoded human labels

  • llm_labels_encoded (NDArray[np.int64]) – encoded llm labels

  • label_raws (list[dict]) – list of raw labels (before mapping): dict with keys bibcode, mission, human_raw and llm_raw

  • n_classes (int) – number of classes

Returns:

  • entries (dict) – a dict with keys ‘tn’, ‘fp’, ‘fn’, ‘tp’, each mapping to a list of entry dicts.

  • Each entry dict contains the following keys:

  • bibcode (str) – bibcode

  • human_raw (str) – raw human label before mapping

  • llm_raw (str) – raw llm label before mapping
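
For the binary case, grouping raw-label records by confusion matrix cell might look like the sketch below. The function name `collect_cell_entries`, the placeholder bibcodes, and the convention that 1 is the positive class are assumptions made for illustration; plain Python lists stand in for the NumPy arrays.

```python
def collect_cell_entries(human_encoded, llm_encoded, label_raws):
    """Group raw-label records by binary confusion matrix cell."""
    entries = {"tn": [], "fp": [], "fn": [], "tp": []}
    # Assumption: 1 encodes the positive class and 0 the negative class.
    cell_names = {(0, 0): "tn", (0, 1): "fp", (1, 0): "fn", (1, 1): "tp"}
    for truth, pred, raw in zip(human_encoded, llm_encoded, label_raws):
        entries[cell_names[(truth, pred)]].append(
            {
                "bibcode": raw["bibcode"],
                "human_raw": raw["human_raw"],
                "llm_raw": raw["llm_raw"],
            }
        )
    return entries


# Placeholder records, not real bibcodes.
label_raws = [
    {"bibcode": "bibcode-1", "mission": "HST", "human_raw": "SCIENCE", "llm_raw": "SCIENCE"},
    {"bibcode": "bibcode-2", "mission": "HST", "human_raw": "MENTION", "llm_raw": "SCIENCE"},
]
entries = collect_cell_entries([1, 0], [1, 1], label_raws)
print(sorted(name for name, cell in entries.items() if cell))  # ['fp', 'tp']
```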

bibcat.llm.metrics.compute_and_save_metrics(metrics_data: dict[str], output_ascii_path: str | Path = 'metrics_summary.txt', output_json_path: str | Path = 'metrics_summary.json')[source]#

Compute llm performance metrics (accuracy, F1, precision, and recall scores) and other stats, and save the results to an ASCII file and a JSON file

Parameters:
  • metrics_data (dict[str]) – contains various metrics

  • metrics_data contains the following variables:

  • threshold (float) – threshold

  • n_bibcodes (int) – The number of bibcodes (papers)

  • n_human_callouts (int) – The number of callouts by human classification in the whole dataset

  • n_llm_callouts (int) – The number of callouts by llm classification in the whole dataset

  • n_non_mast_callouts (int) – The number of non-MAST missions called out by llm in the whole dataset

  • n_missing_ouptput_bibcodes (int) – The number of bibcodes missing output in the whole dataset

  • non_mast_missions (list[str], sorted) – The non-MAST missions called out by llm in the whole dataset

  • n_human_llm_mission_callouts (int) – The number of mission callouts by both human and llm in the given missions

  • n_human_llm_hallucination (int) – The number of apparent hallucinations by both human and llm in the given missions when “mission_in_text” = false

  • human_llm_missions (list[str]) – The missions called out by both human and llm in the given missions

  • human_labels (list[str]) – True labels, human classified labels like [“SCIENCE”, “MENTION”]

  • llm_labels (list[str]) – Predicted labels by llm

  • label_raws (list[dict]) – list of raw labels (before mapping): dict with keys ‘human_raw’ and ‘llm_raw’

  • output_ascii_path (str | Path) – output file path to save the metrics summary in .txt

  • output_json_path (str | Path) – output file path to save the metrics summary in .json

Return type:

None
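
As a rough illustration of the four scores named above, the sketch below computes them in plain Python for a binary labeling. The function `binary_scores` and the choice of “SCIENCE” as the positive class are assumptions of this example; the library may compute the scores differently, and this sketch does not write any files.

```python
import json


def binary_scores(human_labels, llm_labels, positive="SCIENCE"):
    """Accuracy, precision, recall and F1 with `positive` as the positive class."""
    pairs = list(zip(human_labels, llm_labels))
    tp = sum(h == positive and p == positive for h, p in pairs)
    fp = sum(h != positive and p == positive for h, p in pairs)
    fn = sum(h == positive and p != positive for h, p in pairs)
    accuracy = sum(h == p for h, p in pairs) / len(pairs)
    # Guard against empty denominators by reporting 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


human_labels = ["SCIENCE", "NONSCIENCE", "SCIENCE", "NONSCIENCE"]
llm_labels = ["SCIENCE", "SCIENCE", "NONSCIENCE", "NONSCIENCE"]
scores = binary_scores(human_labels, llm_labels)
print(json.dumps(scores, indent=2))  # every score is 0.5 for this input
```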

bibcat.llm.metrics.extract_eval_data(data: dict, missions: list[str]) dict[str, Any][source]#

Extract the evaluation data for confusion matrix and stats related to mission call-outs, and save to files.

Extract the human/llm labels and other stats related to valid MAST mission and non-MAST mission call-outs from the evaluation JSON file, config.llms.eval_output_file (summary_output.json). This function is called when plotting a confusion matrix in bibcat.llm.plots.py

Parameters:
  • data (dict) – the dict of the evaluation data of config.llms.eval_output_file (summary_output.json)

  • missions (list[str]) – list of the mission names to extract the classification labels.

Returns:

  • metrics_data (dict[str]) – contains various metrics

  • metrics_data contains the following variables:

  • threshold (float) – threshold

  • n_bibcodes (int) – The number of bibcodes (papers)

  • n_human_callouts (int) – The number of callouts by human classification in the whole dataset

  • n_llm_callouts (int) – The number of callouts by llm classification in the whole dataset

  • n_non_mast_callouts (int) – The number of non-MAST missions called out by llm in the whole dataset

  • n_missing_ouptput_bibcodes (int) – The number of bibcodes missing output in the whole dataset

  • non_mast_missions (list[str], sorted) – The non-MAST missions called out by llm in the whole dataset

  • n_human_llm_mission_callouts (int) – The number of mission callouts by both human and llm in the given missions

  • n_human_llm_hallucination (int) – The number of apparent hallucinations by both human and llm in the given missions when “mission_in_text” = false

  • human_llm_missions (list[str]) – The missions called out by both human and llm in the given missions

  • human_labels (list[str]) – True labels, human classified labels like [“SCIENCE”, “MENTION”] after mapping

  • llm_labels (list[str]) – Predicted labels by llm after mapping

  • label_raws (list[dict]) – list of raw labels (before mapping): dict with keys bibcode, mission, ‘human_raw’ and ‘llm_raw’

bibcat.llm.metrics.extract_labels(missions: list[str], human_labels: list[str], llm_labels: list[str], human_llm_mission_callouts: list[str], ignored_papertype: str, item: dict[str, dict[str, Any]], n_human_llm_hallucination: int) tuple[list[str], list[str], int][source]#

Extract human and llm papertype labels when the summary output of a bibcode has classification items other than “error”

This function extracts human and llm papertype labels from the summary_output for constructing a confusion matrix, then maps the papertypes to the allowed papertypes (for instance, MENTION maps to NONSCIENCE). Because the summary_output provides human and llm callouts only for the relevant missions, not for all MAST missions, the labels are extracted according to the following conditions:

  1. When both human and LLM call out a given mission and their papertypes, assign them to the relevant papertypes.

  2. When human calls out the mission but LLM ignores the paper, assign the human label to its relevant papertype but the LLM label to ignored_papertype.

  3. When human ignores the paper but LLM calls out with a papertype, assign the LLM label to its relevant papertype but the human label to ignored_papertype.

  4. When both human and LLM ignore the paper for the mission, assign both to ignored_papertype.

Parameters:
  • missions (list[str]) – MAST missions of interest

  • human_labels (list[str]) – human papertypes before this bibcode

  • llm_labels (list[str]) – llm papertypes before this bibcode

  • human_llm_mission_callouts (list[str]) – Missions called out by both human and llm

  • ignored_papertype (str, uppercase) – config.llms.map_papertypes.ignore.upper(), for instance, NONSCIENCE

  • item (dict[str, dict[str, Any]]) – bibcode dictionary item

  • n_human_llm_hallucination (int) – the number of hallucinations before the current bibcode

Returns:

  • human_labels (list[str]) – human papertype labels updated after the current bibcode

  • llm_labels (list[str]) – llm papertype labels updated after the current bibcode

  • n_human_llm_hallucination (int) – the number of hallucinations updated after the current bibcode
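
The four conditions listed for extract_labels reduce to one fallback rule: whichever side did not call out the mission receives ignored_papertype. The sketch below shows only that rule, under stated assumptions: `labels_for_mission` is a hypothetical helper, the papertype mapping and hallucination counting are omitted, and the data shapes follow the “human” and “llm” examples given earlier in this module.

```python
def labels_for_mission(mission, human_data, llm_data, ignored_papertype):
    """Return (human_label, llm_label) for one mission of one bibcode."""
    human = human_data.get(mission)
    # llm_data is a list of one-item dicts such as {"JWST": "SCIENCE"}.
    llm = next((entry[mission] for entry in llm_data if mission in entry), None)
    # A missing callout on either side falls back to ignored_papertype.
    return (human or ignored_papertype, llm or ignored_papertype)


human_data = {"GALEX": "SCIENCE", "HST": "SCIENCE"}
llm_data = [{"GALEX": "SCIENCE"}, {"JWST": "SCIENCE"}]

pairs = {
    mission: labels_for_mission(mission, human_data, llm_data, "NONSCIENCE")
    for mission in ["GALEX", "HST", "JWST", "ROMAN"]
}
print(pairs["GALEX"])  # condition 1: ('SCIENCE', 'SCIENCE')
print(pairs["HST"])    # condition 2: ('SCIENCE', 'NONSCIENCE')
print(pairs["JWST"])   # condition 3: ('NONSCIENCE', 'SCIENCE')
print(pairs["ROMAN"])  # condition 4: ('NONSCIENCE', 'NONSCIENCE')
```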

bibcat.llm.metrics.extract_roc_data(data: dict[str, dict[str, Any]], missions: list[str])[source]#

Extract the human and llm classification labels and confidences

Extract the human classes and confidence values from the evaluation json file, config.llms.eval_output_file (summary_output.json). You can extract data from only a single mission or a list of missions. The human labels (ground truth) and llm confidence values will be used to create a ROC curve.

Parameters:
  • data (dict[str, dict[str, Any]]) – the dict of the evaluation data of config.llms.eval_output_file (summary_output.json)

  • missions (list[str]) – list of the mission names to extract the classification labels.

Returns:

  • tuple – A tuple of the list of human labels, llm labels, and the threshold value for verdict acceptance.

  • human_labels (list[str]) – True labels by human, a list of papertypes, e.g., “SCIENCE” or “MENTION” (or “NONSCIENCE”); see the allowed classifications in config.llms.papertypes. For example, [“SCIENCE”, “MENTION”]

  • llm_confidences (list[list[float]]) – A list of confidence score sets for all verdicts ([[p_science, p_mention],]) where p_science and p_mention represent confidence values of “SCIENCE” and “MENTION”(or “NONSCIENCE”) respectively. For example: [[0.9, 0.1], [0.4, 0.6]]

  • human_llm_missions (list[str], sorted) – A set of missions, each containing both human- and LLM-classified paper types, used for evaluation plots.

bibcat.llm.metrics.get_roc_metrics(llm_confidences: ndarray[tuple[int, ...], dtype[float64]], binarized_human_labels: ndarray[tuple[int, ...], dtype[int64]], n_papertype: int)[source]#

Compute ROC curve and ROC AUC (area under curve)

Parameters:
  • llm_confidences (array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case) – the numpy array of llm_confidences

  • binarized_human_labels (array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case) – binarized human labels, e.g., [[0] [1] [1] [0] [0]] in the binary case

Returns:

  • tuple – a tuple of false positive rate (fpr), true positive rate (tpr), and roc_auc

  • fpr (float) – false positive rate

  • tpr (float) – true positive rate

  • roc_auc (float) – ROC area under curve

  • macro_roc_auc_ovr (float) – Macro-averaged One-vs-Rest ROC AUC score for the multiclass case (only when n_papertype > 2)

  • micro_roc_auc_ovr (float) – Micro-averaged One-vs-Rest ROC AUC score for the multiclass case (only when n_papertype > 2)
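
To make the binary-case quantities concrete, the sketch below builds a ROC curve by sweeping thresholds over the positive-class confidences and integrates it with the trapezoidal rule. It is a plain-Python illustration with a hypothetical name (`roc_curve_sketch`), not the routine this module calls, and it assumes both classes are present in the truths.

```python
def roc_curve_sketch(scores, truths):
    """Return (fpr, tpr, auc) for binary truths (0/1) and positive-class scores."""
    n_pos = sum(truths)
    n_neg = len(truths) - n_pos  # assumes n_pos > 0 and n_neg > 0
    fpr, tpr = [0.0], [0.0]
    # Lower the threshold step by step; tied scores share one curve point.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and t == 1 for s, t in zip(scores, truths))
        fp = sum(s >= threshold and t == 0 for s, t in zip(scores, truths))
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    # Trapezoidal rule over consecutive (fpr, tpr) points.
    auc = sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2 for i in range(1, len(fpr))
    )
    return fpr, tpr, auc


# Positive-class confidences and binarized human labels (1 = positive).
scores = [0.1, 0.4, 0.35, 0.8]
truths = [0, 0, 1, 1]
fpr, tpr, auc = roc_curve_sketch(scores, truths)
print(fpr)  # [0.0, 0.0, 0.5, 0.5, 1.0]
print(tpr)  # [0.0, 0.5, 0.5, 1.0, 1.0]
print(auc)  # 0.75
```

With confidence pairs such as [[0.9, 0.1], [0.4, 0.6]], the scores passed here would be one column of the pairs, for example the “SCIENCE” column.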

bibcat.llm.metrics.human_labels_when_no_llm_output(missions, human_data, human_labels, ignored_papertype)[source]#

Assign human labels when human classifications exist even with no llm output

Parameters:
  • missions (list[str]) – list of missions

  • human_data (dict[str, str]) – dictionary values of item[“human”], e.g., {“JWST”: “SCIENCE”}

  • human_labels (list[str]) – True labels by human, a list of papertypes before papertype mapping. For example, [“SCIENCE”, “MENTION”]

  • ignored_papertype (str, uppercase) – config.llms.map_papertypes.ignore.upper(), for instance, “NONSCIENCE”

Returns:

human_labels – updated human labels based on the presence of human classifications

Return type:

list[str]

bibcat.llm.metrics.map_papertype(papertype: str) str | None[source]#

Map a classified papertype to an allowed papertype. For instance, if papertype is “SUPERMENTION” or “IGNORE”, it returns “NONSCIENCE” or a custom papertype.

Parameters:

papertype (str, uppercase) – human or llm classified papertype, e.g., “SCIENCE”, “DATA_INFLUENCED”

Returns:

mapped_papertype – mapped papertype following config.llms.map_papertypes, e.g., “MENTION” if papertype is “SUPERMENTION”

Return type:

str, uppercase
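
A config-driven mapping of this kind can be sketched with a dictionary lookup. The table `MAP_PAPERTYPES` below is a hypothetical stand-in for config.llms.map_papertypes, and returning None for an unmapped papertype is an assumption suggested by the `str | None` return annotation.

```python
# Hypothetical stand-in for config.llms.map_papertypes (raw papertype -> allowed papertype).
MAP_PAPERTYPES = {
    "science": "science",
    "mention": "nonscience",
    "supermention": "nonscience",
    "ignore": "nonscience",
}


def map_papertype_sketch(papertype):
    """Return the allowed papertype in uppercase, or None when nothing is configured."""
    mapped = MAP_PAPERTYPES.get(papertype.lower())
    return mapped.upper() if mapped is not None else None


print(map_papertype_sketch("SUPERMENTION"))  # NONSCIENCE
print(map_papertype_sketch("UNKNOWN"))       # None
```

Changing a value in the table (for example mapping "supermention" to "mention") changes the result without touching the function, which is how a custom papertype would be supported in this sketch.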

bibcat.llm.metrics.prepare_roc_inputs(human_labels: list[str], llm_confidences: list[list[float]])[source]#

Prepare input data for ROC and AUC (area under curve)

Parameters:
  • human_labels (list[str]) – True labels by human, a list of papertypes, e.g., “SCIENCE” or “MENTION”; see the allowed classifications in config.llms.papertypes

  • llm_confidences (list[list[float]]) – Predicted labels by llm, a list of confidence score pairs for all verdicts.

Returns:

  • tuple – A tuple of confidences, binarized_human_labels, and n_classes.

  • binarized_human_labels (NDArray[np.int64]) – Array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case. Binarized human labels as ROC input, e.g., [[0] [1] …] or [[0 1 0 0] [0 1 0 0] …]

  • llm_confidences (NDArray[np.float64]) – Array-like of shape (n_samples,) if the binary case or (n_samples, n_classes) if the multi-class case. A list of confidence score pairs for all verdicts. Each inner list contains two floats: the first for “SCIENCE” and the second for “MENTION”. For example: [[0.9 0.1] [0.4 0.6]]

  • n_papertype (int) – the number of available papertypes

  • n_verdicts (int) – the number of MAST mission papertype verdicts by LLM
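
The two shapes of binarized labels described above can be illustrated with a small plain-Python sketch. The helper `binarize_labels` and the class orderings are assumptions of this example; in the binary case the sketch marks the second listed class with 1.

```python
def binarize_labels(human_labels, classes):
    """One 0/1 column in the binary case, one-hot rows in the multi-class case."""
    if len(classes) == 2:
        # Convention of this sketch: 1 marks classes[1], 0 marks classes[0].
        return [[int(label == classes[1])] for label in human_labels]
    return [[int(label == c) for c in classes] for label in human_labels]


binary = binarize_labels(["SCIENCE", "MENTION"], ["MENTION", "SCIENCE"])
print(binary)  # [[1], [0]]

multi = binarize_labels(
    ["MENTION", "SCIENCE"], ["SCIENCE", "MENTION", "DATA_INFLUENCED"]
)
print(multi)  # [[0, 1, 0], [1, 0, 0]]
```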

bibcat.llm.openai module#

bibcat.llm.plots module#

bibcat.llm.stats module#

Module contents#

This module contains the scripts for LLM classification.