API documentation for the vertexai.preview.evaluation package.
Classes
CustomMetric
The custom evaluation metric.
The evaluation function. It must take a dataset row/instance as input and return the per-instance metric result as a dictionary, with the metric score mapped to CustomMetric.name as the key.
EvalResult
Evaluation result.
EvalTask
A class representing an EvalTask.
An evaluation task is defined to measure the model's ability to perform a certain task in response to specific prompts or inputs. Evaluation tasks must contain an evaluation dataset and a list of metrics to evaluate. Evaluation tasks help developers compare prompt templates, track experiments, compare models and their settings, and assess the quality of the model's generated text.
Dataset details: Default dataset column names:
- content_column_name: "content"
- reference_column_name: "reference"
- response_column_name: "response"
Requirements for different use cases:
- Bring your own prediction: A `response` column is required. The response column name can be customized by providing the `response_column_name` parameter.
- Without prompt template: A column representing the input prompt to the model is required. If `content_column_name` is not specified, the evaluation dataset requires a `content` column by default. Any existing response column is ignored; new responses are generated from the model using the `content` column and used for evaluation.
- With prompt template: The dataset must contain columns corresponding to the placeholder names in the prompt template. For example, if the prompt template is "Instruction: {instruction}, context: {context}", the dataset must contain `instruction` and `context` columns.
Metrics details: The supported metrics, metric bundle descriptions, grading rubrics, and the required input fields are described in the Vertex AI public documentation.
Usage:

To perform bring-your-own-prediction evaluation, provide the model responses in the response column of the dataset. The response column name is "response" by default, or specify the `response_column_name` parameter to customize it.

```
eval_dataset = pd.DataFrame({
    "reference": [...],
    "response": [...],
})
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["bleu", "rouge_l_sum", "coherence", "fluency"],
    experiment="my-experiment",
)
eval_result = eval_task.evaluate(
    experiment_run_name="eval-experiment-run"
)
```

To perform evaluation with built-in Gemini model inference, specify the `model` parameter with a GenerativeModel instance. The default query column name passed to the model is `content`.

```
eval_dataset = pd.DataFrame({
    "reference": [...],
    "content": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "bleu", "rouge_1", "rouge_2", "rouge_l_sum"],
    experiment="my-experiment",
).evaluate(
    model=GenerativeModel("gemini-pro"),
    experiment_run_name="gemini-pro-eval-run",
)
```

If a `prompt_template` is specified, the `content` column is not required. Prompts can be assembled from the evaluation dataset, and all placeholder names must be present in the dataset columns.

```
eval_dataset = pd.DataFrame({
    "context": [...],
    "instruction": [...],
    "reference": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["summarization_quality"],
).evaluate(
    model=model,
    prompt_template="{instruction}. Article: {context}. Summary:",
)
```

To perform evaluation with custom model inference, specify the `model` parameter with a custom prediction function. The `content` column in the dataset is used to generate predictions with the custom model function for evaluation.

```
def custom_model_fn(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content

eval_dataset = pd.DataFrame({
    "content": [...],
    "reference": [...],
})
result = EvalTask(
    dataset=eval_dataset,
    metrics=["text_generation_similarity", "text_generation_quality"],
    experiment="my-experiment",
).evaluate(
    model=custom_model_fn,
    experiment_run_name="gpt-eval-run",
)
```
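Each evaluate() call above returns an EvalResult. As a minimal sketch (assuming the preview SDK's EvalResult exposes `summary_metrics` for aggregate scores and `metrics_table` for per-instance results; verify the attribute names against your installed SDK version), the result can be inspected like this:

```
# Minimal sketch, assuming `eval_result` is the EvalResult returned by
# EvalTask.evaluate() and that it exposes `summary_metrics` (dict of aggregate
# scores) and `metrics_table` (pandas DataFrame of per-instance results).
print(eval_result.summary_metrics)
print(eval_result.metrics_table.head())
```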
PromptTemplate
A prompt template for creating prompts with placeholders.
The PromptTemplate class allows users to define a template string with
placeholders represented in curly braces {placeholder}. The placeholder
names cannot contain spaces. These placeholders can be replaced with specific
values using the assemble method, providing flexibility in generating
dynamic prompts.
Example Usage:
```
template_str = "Hello, {name}! Today is {day}. How are you?"
prompt_template = PromptTemplate(template_str)
completed_prompt = prompt_template.assemble(name="John", day="Monday")
print(completed_prompt)
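# Prints the assembled prompt: Hello, John! Today is Monday. How are you?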
```
The class also exposes the set of placeholder names parsed from the template string as an attribute.
Functions
make_metric
```
make_metric(
    name: str,
    metric_function: typing.Callable[
        [typing.Dict[str, typing.Any]], typing.Dict[str, typing.Any]
    ],
) -> vertexai.preview.evaluation.metrics._base.CustomMetric
```

Makes a custom metric.
Parameters:

| Name | Description |
|---|---|
| `name` | The name of the metric. |
| `metric_function` | The evaluation function. It must take a dataset row/instance as input and return the per-instance metric result as a dictionary, with the metric score mapped to `CustomMetric.name` as the key. |
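For illustration, here is a minimal sketch of a custom metric built with make_metric. The metric name `exact_text_match`, the scoring logic, the sample data, and the assumption that EvalTask accepts CustomMetric instances alongside built-in metric names are all hypothetical; the row keys follow the default `response` and `reference` column names documented above.

```
import pandas as pd

from vertexai.preview.evaluation import EvalTask, make_metric


# Hypothetical per-instance scorer. The input is one dataset row as a dict,
# keyed by the dataset column names ("response" and "reference" are the
# documented defaults). The returned dict must key the score by the metric name.
def exact_text_match_fn(instance: dict) -> dict:
    score = 1.0 if instance.get("response") == instance.get("reference") else 0.0
    return {"exact_text_match": score}


exact_text_match = make_metric(
    name="exact_text_match",
    metric_function=exact_text_match_fn,
)

eval_dataset = pd.DataFrame({
    "reference": ["Paris", "4"],
    "response": ["Paris", "five"],
})

# Assumption: the CustomMetric can be mixed with built-in metric names.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", exact_text_match],
)
eval_result = eval_task.evaluate()
```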