Over the past few weeks, I have dedicated my efforts to enhancing my Role-Playing (RP) evaluation suite, focusing on refining the main script. Leveraging my real-life teaching experience, I found myself in familiar territory as I crafted tests designed to assess AI capabilities. This endeavor has unexpectedly benefited from my background in education, highlighting the invaluable crossover skills between teaching humans and training AI models.
I've established a preliminary framework consisting of six distinct categories:
Classic Logic Problems:
These traditional logic puzzles, despite their lack of grounding in either reality or fictional worlds, have been included for their potential to gauge reasoning abilities. An example question might be, "If three killers are in a room and a person enters and kills one, how many killers are now in the room?"
Role-playing Logic Problems:
Created to assess whether the model understands what is feasible within the constraints of a real or fantasy environment, this category essentially tests for "Common Sense". It includes scenarios such as trying to run with restrained legs or beheading someone without a weapon.
Game Mastering Skills:
This category evaluates the model's proficiency in creating characters, non-player characters (NPCs), and consistent settings, and in setting appropriate challenges for rolls. Later on, it might also gauge the model's ability to moderate abusive player decisions.
Interpretation of Results:
Aimed at future applications where the model could serve as a backend Game Master, interpreting dice rolls according to tabletop RPG rules. This required its very own category due to the depth of criteria needed for adequate evaluation.
Narrative Skills:
This challenging category assesses how well the model constructs a plot, manages scene transitions, and avoids overstepping into the player's domain of action.
English:
Currently, this category is more an indicator of whether a model is functioning correctly than a measure of its linguistic proficiency. A score below 85% suggests potential instability, while scores above 90% indicate stability. However, this category is still a work in progress and requires further refinement.
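To make the framework more concrete, here is a minimal sketch of how these six categories might be laid out in code. The structure, key names, example prompts, and thresholds below are illustrative assumptions on my part, not the actual contents of AutoRP-Eval.py.

```python
# Hypothetical layout of the six evaluation categories.
# Prompts and thresholds are illustrative placeholders,
# not the real contents of AutoRP-Eval.py.

CATEGORIES = {
    "classic_logic": {
        "description": "Traditional logic puzzles, ungrounded in any setting.",
        "examples": [
            "If three killers are in a room and a person enters and kills one, "
            "how many killers are now in the room?",
        ],
    },
    "rp_logic": {
        "description": "Common-sense feasibility within a real or fantasy setting.",
        "examples": [
            "Can a character run while their legs are restrained?",
        ],
    },
    "game_mastering": {
        "description": "Character/NPC creation, consistent settings, roll difficulty.",
        "examples": [],
    },
    "interpretation_of_results": {
        "description": "Interpreting dice rolls according to tabletop RPG rules.",
        "examples": [],
    },
    "narrative": {
        "description": "Plot construction, scene transitions, respecting player agency.",
        "examples": [],
    },
    "english": {
        "description": "Sanity check that the model is functioning correctly.",
        # Below ~85% suggests instability; above ~90% suggests stability.
        "unstable_below": 0.85,
        "stable_above": 0.90,
    },
}
```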
Overall...
The evaluation tool has proven immensely useful, particularly as a sanity check on the evaluation process itself: merging models with different scores in a specific category and observing that the result lands on the expected average.
Example:
- Merging a model A with 50% in Classic Logic Problems
- with a model B with 30% in Classic Logic Problems
- results in a model C with exactly 40% (the exact average of the two parent models) in Classic Logic Problems, which supports the accuracy of the evaluation.
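As a quick way to automate that kind of observation, one could compare a merged model's category score against the mean of its parents' scores. The helper below is a hypothetical sketch of that check; the function name and tolerance are my own assumptions and are not part of AutoRP-Eval.py.

```python
def check_merge_average(parent_a: float, parent_b: float,
                        merged: float, tolerance: float = 0.02) -> bool:
    """Return True if the merged model's score sits close to the
    average of its two parents' scores (all scores in [0, 1])."""
    expected = (parent_a + parent_b) / 2
    return abs(merged - expected) <= tolerance

# The example from the post: 50% and 30% parents, merged model scores 40%.
assert check_merge_average(0.50, 0.30, 0.40)  # expected average is 0.40
```

The tolerance is there because, in general, a merged model's score will not land on the exact average every time; how tight that tolerance should be is an open question.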
In my next update, I will delve into how these metrics have facilitated my research into model merging.
To be done...
For the moment, AutoRP-Eval.py remains a command-line tool with its share of hardcoded elements. Despite its rudimentary state, it offers valuable insights into the performance of the evaluated models.
While it is tempting to keep improving this testing suite, the time investment required pushes me toward developing a new dataset instead.