
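Set up a basic judge (Part 1)

Run subjective evaluations with a basic judge.

Maud Nalpas

Rule-based evaluations can check deterministic answers. To evaluate subjective qualities, use the LLM-as-a-judge technique.

In this module, you learn how to build your first judge by labeling data yourself or with your team, and by using basic statistical metrics.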

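Build your first judge

A judge consists of an LLM, settings, a system prompt, and a scoring prompt.

Choose a customization method. You can fine-tune a model or use prompt engineering.
Choose a model. This can be a foundation model or another LLM without domain expertise.
Choose a scoring method. Decide whether the judge should use a binary or a numeric scale to score the themes that ThemeBuilder generates.
Configure the judge. Adjust the model's settings, such as temperature and structured output, to suit the judging task.
Write the initial prompts. Design the first version of the judge's system instruction and prompts, including a rubric and examples.
Create an alignment dataset. Build or assemble a diverse, high-quality set of ThemeBuilder outputs, both good and bad, and label them accordingly. For example: good mottos, toxic mottos, and off-brand color palettes.
Align and test the judge. Use the alignment dataset to iteratively refine the judge prompts (the system instruction and the main prompts). Repeat until the judge's verdicts match the human verdicts. Finally, test the judge to confirm that it is reliable and that the approach generalizes to new inputs.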


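Choose a customization method

Most foundation models are generalists. A judge acts as a domain expert.

The main options for creating a judge are:

Prompt-engineer an LLM.
Fine-tune a model.
Use a fine-tuned LLM that is optimized for evaluation, such as JudgeLM. This option requires you to host custom model weights or to use a cloud provider that supports hosting open source models.

For the ThemeBuilder evaluation in this course, we recommend prompt engineering. Compared to the alternatives, it gets strong results with less development effort.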

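Choose a model

When you choose a model for your judge, look for strong reasoning capabilities. Because you run evaluations in a CI/CD pipeline, speed and cost also matter.

Experiment with different models and techniques to find the best fit.

Start with a larger, more capable model to set a high bar, then scale down to smaller models. Alternatively, start with a smaller model and scale up.
Mix and match: use a fast, cost-effective model for day-to-day pull request checks and a more capable model for final release testing. Or, combine a general-purpose LLM with small specialized models for specific tasks, such as toxic comment detection, to improve speed.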
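The mix-and-match idea can be sketched as a small lookup from pipeline stage to judge model. This is a minimal illustration, not part of the ThemeBuilder codebase: the stage names, the `pickJudgeModel` helper, and the model identifiers are all hypothetical placeholders.

```typescript
// Hypothetical pipeline stages; adapt these to your own CI/CD setup.
type PipelineStage = "pull_request" | "release";

// Placeholder identifiers (assumptions), not real model names.
const JUDGE_MODELS: Record<PipelineStage, string> = {
  pull_request: "fast-low-cost-model",
  release: "larger-more-capable-model",
};

// Returns the judge model to use for a given pipeline stage.
function pickJudgeModel(stage: PipelineStage): string {
  return JUDGE_MODELS[stage];
}
```

Keeping the mapping in one place makes it easier to swap models while you experiment.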

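This course uses Gemini 3 Flash as the judge. Gemini 3 Flash offers the speed and reasoning depth needed for the example use case of evaluating ThemeBuilder output. However, the patterns in this course apply to any model you choose.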

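Choose a scoring method

You can score subjective output with binary PASS and FAIL labels or with a numeric score, such as "On a scale of 1 to 5, how well does this motto fit the brand?"

We recommend binary labels.

Evaluation criterion	Evaluation method	Metric
The motto fits the brand, audience, and tone	LLM judge	`PASS` or `FAIL` label
The color palette fits the brand, audience, and tone	LLM judge	`PASS` or `FAIL` label
The motto contains no toxic content	LLM judge	`PASS` or `FAIL` label

While numeric scores may seem intuitive, research shows that LLMs (and humans) tend to cluster scores in the middle or inflate them to be polite. For humans, this is known as the rater effect. Categorical or binary labels, such as PASS and FAIL, usually produce better results because they force the model to make a clear decision.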
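Binary labels are also simple to summarize. As a minimal sketch, a pass rate over a batch of verdicts could be computed as follows. The `Label` type and the `passRate` helper are illustrative assumptions, not taken from the course code.

```typescript
// Assumed label type for this sketch; the course code uses an EvalLabel enum.
type Label = "PASS" | "FAIL";

// Fraction of verdicts labeled PASS, between 0 and 1.
function passRate(labels: Label[]): number {
  if (labels.length === 0) {
    throw new Error("No labels to summarize");
  }
  const passes = labels.filter((label) => label === "PASS").length;
  return passes / labels.length;
}

// Three of four verdicts pass.
const rate = passRate(["PASS", "PASS", "FAIL", "PASS"]); // 0.75
```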

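Configure the judge

Use parameters and instructions to help the judge produce consistent, structured output.

Set the system instruction: Give the judge a strict expert persona.
Set the temperature or thinking level: The judge must be consistent. If you use a reasoning model such as Gemini Flash, which needs a small amount of randomness to move between logical steps, keep the temperature at its default value but set thinking_level to HIGH. If you use another model, set the temperature to 0 or close to 0. In either case, use the chain-of-thought technique so that the model reasons before it decides on a verdict.
Structure the judge's output: A predictable JSON object is easier to reuse in the rest of your codebase. Use an EvalResult schema, which requires a label (PASS or FAIL) and a rationale string.

In your ThemeBuilder example:

Judge configuration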

// LLM judge config
const response = await client.models.generateContent({
  model: modelVersion,
  config: {
      // Strict expert persona. The string is concatenated because a
      // double-quoted literal cannot span multiple lines.
      systemInstruction: "You are a senior brand strategist, brand identity " +
      "specialist, and expert color psychologist. You also act as a strict " +
      "content moderator for a brand safety tool. Be rigorous regarding brand " +
      "alignment. Always formulate your rationale before assigning the final " +
      "PASS or FAIL label to ensure thorough consideration of the criteria.",
      temperature: 0,
      thinkingConfig: {
          thinkingLevel: ThinkingLevel.HIGH,
      },
      // Structured output: the judge must return a label and a rationale.
      responseJsonSchema: schemaConfig.responseSchema
  },
  contents: [{ role: "user", parts: [{ text: prompt }] }]
});

responseJsonSchema

const schemaConfig = {
  responseMimeType: "application/json",
  responseSchema: {
      type: "OBJECT",
      properties: {
          label: { type: "STRING", enum: [EvalLabel.PASS, EvalLabel.FAIL] },
          rationale: { type: "STRING" }
      },
      required: ["label", "rationale"],
      propertyOrdering: ["rationale", "label"]
  }
};

// Classification label for an evaluation (PASS/FAIL is the judge's verdict)
export enum EvalLabel {
    PASS = "PASS",
    FAIL = "FAIL"
}

See the full code example.
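Even with a response schema in place, it can help to validate the judge's output before the rest of the codebase relies on it. The following is a minimal sketch that assumes the raw JSON text of the judge's response is available as a string. The `parseEvalResult` helper and the `EvalResult` interface shape are assumptions based on the description above, not the course's actual implementation.

```typescript
// Assumed shape of the judge's verdict, based on the schema described above.
type EvalLabelValue = "PASS" | "FAIL";

interface EvalResult {
  label: EvalLabelValue;
  rationale: string;
}

// Type guard: true only for the two allowed label strings.
function isEvalLabel(value: unknown): value is EvalLabelValue {
  return value === "PASS" || value === "FAIL";
}

// Parses the judge's raw JSON text and rejects anything off-schema.
function parseEvalResult(rawText: string): EvalResult {
  const data: unknown = JSON.parse(rawText);
  if (typeof data !== "object" || data === null) {
    throw new Error("Judge output is not a JSON object");
  }
  const { label, rationale } = data as Record<string, unknown>;
  if (!isEvalLabel(label)) {
    throw new Error("Judge output has a missing or invalid label");
  }
  if (typeof rationale !== "string" || rationale.trim() === "") {
    throw new Error("Judge output has a missing or empty rationale");
  }
  return { label, rationale };
}
```

Failing loudly on malformed output keeps a broken judge response from being counted as a silent PASS or FAIL.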

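Write the initial prompts

You've configured the system instruction. Now, design the main judge prompt. At this stage, create a first version of the prompt. You refine it iteratively in the next step, when you align the judge.

A judge is only as good as the instructions it's given. Avoid generic questions such as "Is this motto good?", where "good" is undefined. Instead, provide structure to get clear, consistent output.

Define a rubric: Give the judge detailed scoring guidelines. What describes the expected tone of an ideal output? An LLM can help you draft the rubric.
Use few-shot prompting: Include PASS and FAIL examples.
Use chain-of-thought prompting: Instruct the model to write its rationale before it assigns a label, because this can substantially improve accuracy. This matters less in HIGH thinking mode, but it's still good practice.

Write three separate scoring prompts for three specific criteria:

Motto brand fit.
Color brand fit.
Toxicity. You can bootstrap the toxicity prompt from crowdsourced toxicity attributes.

In each prompt, include a clear rubric and few-shot examples with rationales. In the few-shot examples, list the rationale before the actual label, to apply the chain-of-thought pattern and to demonstrate how the judge should reason.

You can find the complete prompts in the codebase. For example, the motto brand fit judge prompt looks like this: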

export function getMottoBrandFitJudgePrompt(companyName: string, description: string, audience: string, tone: string | string[], motto: string) {
  return `Evaluate the following generated motto for a company.

${companyName ? `Company name: ${companyName}\n` : ""}${description ? `Description: ${description}\n` : ""}${audience ? `Target audience: ${audience}\n` : ""}${Array.isArray(tone) ? (tone.length > 0 ? `Desired tone: ${tone.join(", ")}\n` : "") : (tone ? `Desired tone: ${tone}\n` : "")}

Generated motto: "${motto}"

Does this motto effectively match the company description, appeal to the
target audience, and embody the desired tone?

CRITICAL INSTRUCTIONS:
1. **Brand fit vs. toxicity**: You are evaluating ONLY brand fit. Another system
  will evaluate toxicity separately. DO NOT evaluate toxicity, ethics, profanity,
  or offensiveness. A motto can be a GREAT brand fit for an edgy or aggressive
  brand. If the brand requests an "offensive" or "aggressive" tone, you MUST
  pass it for brand fit, regardless of how inappropriate it is.
1. **Primary tone and literal relevance**: Do not over-penalize a motto if it
  perfectly captures the primary literal vibe just because it might loosely
  conflict with a secondary adjective.
1. **Core promises and professionalism**: For B2B/Enterprise, the motto MUST NOT
  violate core promises.
1. **Resilience to input messiness**: The Company Name, Description, Target
  Audience, or Tone may contain typos, slang, or mixed-language. You must
  decipher the *intended* meaning and judge the output against that intent,
  rather than penalizing the output for not matching the literal typo or slang.

Criteria:
1. **Relevance**: Does the motto relate to the company's core business and
  value proposition? Does it uphold core brand promises?
1. **Audience appeal**: Is the language engaging for the target audience without
  alienating them (such as through forced or inappropriate slang)?
1. **Tone consistency**: Does the motto reflect the general desired emotional
  tone perfectly, without imposing moral judgments?

Examples:

Input:
Company Name: "Summit Bank"
Description: "Secure, reliable banking for families"
Tone: "Trustworthy, serious"
Motto: "YOLO with your money!"
Result:
{
  "rationale": "The motto 'YOLO with your money!' is too casual and risky, contradicting the 'trustworthy, serious' tone required for a family bank.",
  "label": "${EvalLabel.FAIL}"
}

Input:
Company Name: "GymTiger"
Description: "Gym for heavy lifters."
Tone: "Aggressive, high-performance, technical"
Motto: "Lift big or be a loser."
Result:
{
  "rationale": "The motto matches the required 'aggressive' tone and appeals directly to the hardcore bodybuilding audience. While calling the audience a 'loser' is toxic and insulting, it successfully fulfills the brand fit and tone criteria requested.",
  "label": "${EvalLabel.PASS}"
}

Return a JSON object with:
- "rationale": A brief explanation of why it passes or fails based on the description, audience, and tone.
- "label": "${EvalLabel.PASS}" or "${EvalLabel.FAIL}"`;
}
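Because each of the three criteria has its own prompt, a generated theme ends up with three separate verdicts. One possible way to combine them, sketched here as an assumption rather than as the course's method, is to treat the theme as passing only when every criterion passes. The `Criterion` names and the `summarizeVerdicts` helper are hypothetical.

```typescript
// Assumed types for this sketch; the course code defines its own equivalents.
type Label = "PASS" | "FAIL";

interface EvalResult {
  label: Label;
  rationale: string;
}

// Hypothetical keys for the three criteria described in this module.
type Criterion = "mottoBrandFit" | "colorBrandFit" | "toxicity";

// Overall verdict is PASS only when no criterion failed.
function summarizeVerdicts(
  results: Record<Criterion, EvalResult>
): { overall: Label; failed: Criterion[] } {
  const failed = (Object.keys(results) as Criterion[]).filter(
    (criterion) => results[criterion].label === "FAIL"
  );
  return { overall: failed.length === 0 ? "PASS" : "FAIL", failed };
}
```

Returning the list of failed criteria alongside the overall label keeps the report actionable: you can see which judge flagged the theme and read its rationale.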

Align and test

Read Set up a basic judge (Part 2) to finish building your judge, including alignment and testing.
