Model Evaluation provides a wizard-based way to submit LLM evaluation tasks. Tencent Cloud TI-ONE Platform (TI-ONE) supports two evaluation methods: manual evaluation and automated evaluation.
Manual evaluation: after the model under evaluation runs inference, TI-ONE provides a manual scoring feature so that reviewers can rate the quality of the model's output by hand.
Automated evaluation: no manual participation is required. TI-ONE evaluates the model automatically against built-in open-source evaluation sets with automatic metrics (such as pass@1, ROUGE, and F1), or against evaluation sets and metrics that you upload yourself. Automated evaluation also lets you quickly test a model's quality during training and deploy checkpoint models as services; you can then interact with the LLM in a dialog box to assess its behavior.
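For intuition about what these automatic metrics measure, here is a minimal, self-contained sketch of token-level F1 and the standard unbiased pass@k estimator. This is illustrative only, not TI-ONE's implementation; the function names and example inputs are hypothetical.

```python
from collections import Counter
from math import comb

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples, c of them correct.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 1 correct answer out of 5 samples -> pass@1 = 0.2
print(pass_at_k(n=5, c=1, k=1))
print(token_f1("the cat sat", "the cat sat down"))  # ~0.857
```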
Note:
In practice, manual evaluation is usually combined with automated evaluation. For example, during model development you can first run automated evaluation on open-source datasets (or on standardized proprietary datasets accumulated within your enterprise) until the results look good. You can then use manual evaluation to verify model quality in the final stage before release, or to spot-check the model at any time after it goes live.
After manual or automated evaluation, TI-ONE supports visual comparison of model quality: multiple models and metrics can be compared side by side in a radar chart for an intuitive view of their relative strengths.
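As a rough illustration of the kind of radar chart such a comparison produces, the sketch below uses matplotlib. The model names, metric choices, and scores are made up for illustration and do not reflect the platform's charting code.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical scores for two models across five metrics (0-1 scale).
metrics = ["pass@1", "ROUGE", "F1", "BLEU", "Accuracy"]
model_a = [0.62, 0.71, 0.68, 0.55, 0.74]
model_b = [0.58, 0.77, 0.72, 0.60, 0.69]

# One angle per metric; repeat the first point to close each polygon.
angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, scores in [("Model A", model_a), ("Model B", model_b)]:
    values = scores + scores[:1]
    ax.plot(angles, values, label=name)
    ax.fill(angles, values, alpha=0.15)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)
ax.legend(loc="upper right")
plt.show()
```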