W&B supports NLP experiments in two main ways: scalar metric logging for corpus-level scores (BLEU, ROUGE, perplexity) and wandb.Table for per-example text comparisons.
Logging scalar NLP metrics
Log BLEU, ROUGE, perplexity, and any other scalar scores the same way you log loss:
import wandb
from sacrebleu.metrics import BLEU
from rouge_score import rouge_scorer

wandb.init(project="nmt-project")
bleu = BLEU()
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])

for epoch in range(num_epochs):
    train(model)
    hypotheses, references = evaluate(model, val_set)

    # corpus_score expects a list of reference streams, hence [references]
    bleu_score = bleu.corpus_score(hypotheses, [references])
    # scorer.score(target, prediction) returns a dict keyed by ROUGE type
    rouge_scores = [scorer.score(ref, hyp) for ref, hyp in zip(references, hypotheses)]

    wandb.log({
        "epoch": epoch,
        "val/bleu": bleu_score.score,
        "val/rouge1": sum(s["rouge1"].fmeasure for s in rouge_scores) / len(rouge_scores),
        "val/rougeL": sum(s["rougeL"].fmeasure for s in rouge_scores) / len(rouge_scores),
        "val/perplexity": compute_perplexity(model, val_loader),
    })

wandb.finish()
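The loop above calls compute_perplexity without defining it. As a minimal sketch, assuming perplexity is taken as the exponential of the mean per-token negative log-likelihood (natural log), the core calculation could look like this. The name perplexity_from_nll and its input format are illustrative assumptions, not part of the W&B API:

```python
import math

def perplexity_from_nll(token_nlls):
    """Return exp(mean NLL) for per-token negative log-likelihoods (natural log)."""
    nlls = list(token_nlls)
    if not nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(nlls) / len(nlls))

# Sanity check: a model that gives every token probability 0.25
# should have a perplexity of about 4.
uniform_nlls = [-math.log(0.25)] * 8
ppl = perplexity_from_nll(uniform_nlls)
```

In a real evaluation loop, the per-token values would typically come from the model's cross-entropy loss over the validation set, averaged over tokens rather than over batches.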
Logging text predictions as a table
Track model outputs alongside source and reference translations to spot qualitative changes across epochs:
text_table = wandb.Table(columns=["source", "reference", "hypothesis", "bleu"])

for src, ref, hyp in zip(sources[:50], references[:50], hypotheses[:50]):
    sent_bleu = bleu.sentence_score(hyp, [ref]).score
    text_table.add_data(src, ref, hyp, round(sent_bleu, 2))

wandb.log({"val/text_outputs": text_table})
In the UI, sort the table by bleu ascending to surface the worst-performing examples.
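If the validation set is large, one option is to log only the lowest-scoring examples instead of the first 50. The sketch below assumes rows are tuples in the same column order as the table above (source, reference, hypothesis, bleu); worst_examples is a hypothetical helper, and the sample rows are made-up placeholders:

```python
import heapq

def worst_examples(rows, k, score_index=3):
    """Return the k rows with the lowest score, lowest first."""
    return heapq.nsmallest(k, rows, key=lambda row: row[score_index])

# Placeholder rows in (source, reference, hypothesis, bleu) order.
rows = [
    ("src a", "ref a", "hyp a", 41.2),
    ("src b", "ref b", "hyp b", 7.9),
    ("src c", "ref c", "hyp c", 18.5),
]
worst = worst_examples(rows, k=2)
```

Each selected row can then be passed to text_table.add_data(*row) before logging the table.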
Logging token-level probabilities
For language model analysis, log per-token log-probabilities to a table, which you can then plot as a custom chart in the UI:
token_table = wandb.Table(columns=["token", "log_prob", "position"])

for pos, (tok, lp) in enumerate(zip(tokens, log_probs)):
    token_table.add_data(tok, float(lp), pos)

wandb.log({"token_probs": token_table})
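The snippet above assumes tokens and log_probs already exist. As a rough sketch of where the log-probabilities might come from, the code below applies a numerically stable log-softmax to each step's logits and picks out the entry for the token that was actually observed. The helpers log_softmax and token_rows, and the tiny two-word vocabulary, are illustrative assumptions; a real model would produce the logits with its own framework:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_rows(tokens, token_ids, step_logits):
    """Build (token, log_prob, position) rows, one per sequence position."""
    rows = []
    for pos, (tok, tid, logits) in enumerate(zip(tokens, token_ids, step_logits)):
        rows.append((tok, log_softmax(logits)[tid], pos))
    return rows

# Toy example with a vocabulary of size 2.
rows = token_rows(["the", "cat"], [0, 1], [[2.0, 0.0], [0.0, 0.0]])
```

The rows follow the same column order as token_table, so each one can be added with token_table.add_data(*row).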
Tracking vocabulary and data statistics
Log dataset characteristics as config values so they are searchable across runs:
wandb.init(project="lm-project", config={
    "vocab_size": tokenizer.vocab_size,
    "max_seq_len": 512,
    "train_tokens": total_train_tokens,
    "dataset": "c4-en",
})
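Values such as total_train_tokens have to be computed before the run starts. A minimal sketch, assuming the training set is available as a list of token-ID lists; corpus_stats and the extra keys it returns (train_examples, mean_seq_len, truncated_examples) are hypothetical additions, not fields W&B requires:

```python
def corpus_stats(tokenized_examples, max_seq_len):
    """Summarize a tokenized dataset as flat, config-friendly values."""
    lengths = [len(example) for example in tokenized_examples]
    return {
        "train_examples": len(lengths),
        "train_tokens": sum(lengths),
        "mean_seq_len": sum(lengths) / len(lengths) if lengths else 0.0,
        # Examples longer than max_seq_len would be truncated during training.
        "truncated_examples": sum(1 for n in lengths if n > max_seq_len),
    }

# Toy dataset: three examples of 3, 2, and 5 tokens.
stats = corpus_stats([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], max_seq_len=4)
```

The resulting dict can be merged into the config passed to wandb.init, for example config={**stats, "dataset": "c4-en"}.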
Using W&B with Hugging Face evaluate
The evaluate library from Hugging Face computes many NLP metrics and returns dicts that map cleanly to wandb.log():
import evaluate

bleu_metric = evaluate.load("sacrebleu")
rouge_metric = evaluate.load("rouge")

bleu_result = bleu_metric.compute(predictions=hypotheses, references=[[r] for r in references])
rouge_result = rouge_metric.compute(predictions=hypotheses, references=references)

wandb.log({
    "val/bleu": bleu_result["score"],
    "val/rouge1": rouge_result["rouge1"],
    "val/rougeL": rouge_result["rougeL"],
})
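Metric result dicts can also contain non-scalar entries such as lists of counts, which do not chart as line plots. One possible approach, sketched here under the assumption that only numeric scalars should be logged, is to filter and prefix the keys before calling wandb.log(). The helper to_wandb_metrics is hypothetical, and the example dict uses invented values that only mimic the shape of a metric result:

```python
from numbers import Real

def to_wandb_metrics(result, prefix):
    """Keep only numeric scalar entries and namespace their keys under prefix."""
    metrics = {}
    for key, value in result.items():
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, Real) and not isinstance(value, bool):
            metrics[f"{prefix}/{key}"] = float(value)
    return metrics

# Invented values; only the structure matters here.
example = {"score": 27.4, "counts": [10, 5], "bp": 1.0, "sys_len": 120}
metrics = to_wandb_metrics(example, "val/bleu")
```

The returned dict can be passed directly to wandb.log(metrics).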