Sunday, February 4, 2024

The State of AutoRP-Eval.py

Over the past few weeks, I have dedicated my efforts to enhancing my Role-Playing (RP) evaluation suite, focusing on refining the main script. Crafting tests to assess AI capabilities put me in familiar territory: my real-life teaching experience proved an unexpected asset, highlighting the valuable crossover between teaching humans and training AI models.

I've established a preliminary framework consisting of six distinct categories:

Classic Logic Problems: 

These traditional logic puzzles, despite their lack of grounding in either reality or fictional worlds, have been included for their potential to gauge reasoning abilities. An example question might be, "If three killers are in a room and a person enters and kills one, how many killers are now in the room?"

Role-playing Logic Problems: 

Created to assess the understanding of what is feasible within the constraints of real or fantasy environments, this category essentially tests for "Common Sense". It includes scenarios such as a character trying to run with restrained legs or attempting a beheading without a weapon.

Game Mastering Skills: 

This category evaluates the model's proficiency in creating characters, non-player characters (NPCs), consistent settings, and appropriately setting challenges for rolls. Later on, it might also aim to gauge the model's ability to moderate abusive player decisions.

Interpretation of Results: 

This category is aimed at future applications where the model could serve as a backend Game Master, interpreting dice rolls according to tabletop RPG rules. This skill required its very own category due to the depth of criteria needed for adequate evaluation.
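
To illustrate the kind of rule such a backend Game Master would have to apply, here is a toy d20 skill check in Python; the difficulty class, modifier, and outcome tiers are illustrative assumptions on my part, not the rules of any particular system:

    import random

    def skill_check(modifier: int, difficulty_class: int) -> str:
        """Roll a d20, add the modifier, and interpret the outcome."""
        roll = random.randint(1, 20)
        total = roll + modifier
        if roll == 20:
            return f"natural 20: critical success (total {total})"
        if roll == 1:
            return f"natural 1: critical failure (total {total})"
        if total >= difficulty_class:
            return f"{total} vs DC {difficulty_class}: success"
        return f"{total} vs DC {difficulty_class}: failure"

    # e.g. picking a lock with a +3 bonus against a DC of 15
    print(skill_check(modifier=3, difficulty_class=15))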

Narrative Skills:  

This challenging category assesses how well the model constructs a plot, manages scene transitions, and avoids overstepping into the player's domain of action.

English:

Currently, this category is more an indicator of whether a model is functioning correctly than a measure of its linguistic proficiency. A score below 85% suggests potential instability, while scores above 90% indicate stability. However, this category is still a work in progress and requires further refinement.

Overall...

The evaluation tool has already proven immensely useful. In particular, it let me check the accuracy of the evaluation process itself: when I merge models with different scores in a given category, the resulting model lands on the expected average.

Example:

- Merging a model A with 50% in Classic Logic Problems

- with a model B with 30% in Classic Logic Problems

- results in a model C with exactly 40% in Classic Logic Problems (the exact average of the two parent models), which supports the accuracy of the evaluation; a small sanity-check sketch follows below.
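
A quick way to automate this sanity check is to compare the measured scores of the child model against the parents' average; the scores and the 2% tolerance below are purely illustrative:

    # hypothetical category scores (in %) as measured by AutoRP-Eval.py
    parent_a = {"classic_logic": 50.0, "rp_logic": 62.0}
    parent_b = {"classic_logic": 30.0, "rp_logic": 48.0}
    child = {"classic_logic": 40.0, "rp_logic": 55.0}

    for category, measured in child.items():
        expected = (parent_a[category] + parent_b[category]) / 2
        status = "OK" if abs(measured - expected) <= 2.0 else "DRIFT"
        print(f"{category}: expected {expected:.1f}%, measured {measured:.1f}% -> {status}")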

In my next update, I will delve into how these metrics have facilitated my research into model merging.

To be done...

For the moment, AutoRP-Eval.py remains a command-line tool with its share of hardcoded elements. Despite its rudimentary state, it offers valuable insights into the performance of evaluated models.

While it is tempting to keep improving this testing suite, the time investment required encourages a focus on developing a new dataset instead.

Wednesday, January 17, 2024

The Path to Evaluation

Merging Fun!

Having explored various techniques for creating new Large Language Models (LLMs), I found great enjoyment in both common merging and the more adventurous "franken-merging." Merging involves combining two LLMs with the aim of retaining the strengths of both. However, this process feels somewhat like a gamble: success is not guaranteed.
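
To make the idea concrete, here is a minimal sketch of the simplest possible merge, a 50/50 weight average; it assumes both checkpoints share exactly the same architecture, the paths are placeholders, and real merge tooling (mergekit, for instance) offers far more sophisticated methods:

    # a naive 50/50 linear merge of two same-architecture models
    import torch
    from transformers import AutoModelForCausalLM

    model_a = AutoModelForCausalLM.from_pretrained("path/to/model-a", torch_dtype=torch.bfloat16)
    model_b = AutoModelForCausalLM.from_pretrained("path/to/model-b", torch_dtype=torch.bfloat16)

    merged_state = model_a.state_dict()
    for name, tensor_b in model_b.state_dict().items():
        # average each weight tensor from the two parents
        merged_state[name] = (merged_state[name] + tensor_b) / 2

    model_a.load_state_dict(merged_state)
    model_a.save_pretrained("path/to/merged-model")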

Hugging Face provides a platform where different authors take intuitive chances, merging various models to climb the leaderboard. Surprisingly, it occasionally pays off! Franken-merging goes a step further: it involves dissecting models and creatively reassembling their components.

In simpler terms, with a bit of luck and no deep expertise, an enthusiastic hobbyist can outperform seasoned data scientists!

Apart from the thrill, these methods offer cost-effective solutions. Building a potent LLM from scratch is a costly endeavor reserved for companies. In contrast, merging only requires affordable RAM, is fast, and can be accomplished within minutes. Additionally, there's no need to construct a dataset, a lengthy and complex task.

The Truth

Speaking of datasets, a post I read (apologies for forgetting the author) asserted that "all merges were memes!" This sentiment is understandable, considering the significant role luck plays in the process. The lack of control over the dataset means a lack of control over the model's exact capabilities and possibilities.

While building a genuine dataset remains crucial in LLM development (and is my ultimate goal), merging has established itself as a dark art in crafting AI, offering an easy performance boost.

Suddenly Facing the Obvious

Motivated by this, I embarked on building several models through merging and franken-merging. It seemed straightforward: edit a config file, run a script, and voila! However, as my "models" folder expanded, I realized that evaluating their performance in role-playing scenarios by hand was the wrong approach.

I needed a tool for evaluation. Although such tools already existed, I wasn't satisfied with them and opted to create my own, for two reasons:

  1. Contamination: Existing tests are known and sometimes used to train LLMs, compromising their integrity. "Decontaminating" LLMs has become necessary for some scientists.
  2. Specificity: Most tests are too general for my needs. I aim to create a narrator and a game-master, requiring specific skill assessments.

AutoRP-Eval.py

Creating AutoRP-Eval.py wasn't overly challenging, especially with AI assistance. While it needs some refinement (e.g., the model selection is still hard-coded), it effectively evaluates models. It includes a folder named "items" where I can place text files, each containing a question and an evaluation criterion. The script compels the evaluated AI to answer each question, and the responses are automatically assessed by another LLM.
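
The core loop is easy to sketch. The snippet below is my illustration of the idea rather than the actual script: the query_model helper, the "---" delimiter between question and criterion, and the PASS/FAIL convention for the judge are all assumptions:

    # sketch of the question -> answer -> LLM-judged loop
    from pathlib import Path

    def query_model(model_name: str, prompt: str) -> str:
        """Placeholder for whatever backend serves the model (API, local runtime...)."""
        raise NotImplementedError

    def run_eval(candidate: str, judge: str, items_dir: str = "items") -> float:
        passed, total = 0, 0
        for item_file in sorted(Path(items_dir).glob("*.txt")):
            # each item file holds a question and its evaluation criterion
            question, criterion = item_file.read_text().split("---", maxsplit=1)
            answer = query_model(candidate, question)
            verdict = query_model(
                judge,
                f"Question: {question}\nAnswer: {answer}\nCriterion: {criterion}\n"
                "Does the answer satisfy the criterion? Reply PASS or FAIL.",
            )
            passed += "PASS" in verdict.upper()
            total += 1
        return 100 * passed / total  # category score in %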

More Tools

Ensuring correct and consistent grading posed a challenge and required many rounds of prompt engineering. A dedicated tool, rubric-eval.py, built with Gradio, allows me to generate or write answers and verify whether they align with the rubric, using a specifically designed LLM named Proctora.
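
In spirit, such a tool takes only a handful of lines with Gradio. The sketch below is my own guess at the layout rather than the real rubric-eval.py, and grade_with_proctora is a stand-in for however Proctora is actually queried:

    import gradio as gr

    def grade_with_proctora(question: str, answer: str, rubric: str) -> str:
        """Stand-in: send the triple to the judge model and return its verdict."""
        prompt = f"Question: {question}\nAnswer: {answer}\nRubric: {rubric}\nVerdict:"
        return "(the judge model would be called here with)\n" + prompt

    demo = gr.Interface(
        fn=grade_with_proctora,
        inputs=[
            gr.Textbox(label="Question"),
            gr.Textbox(label="Answer", lines=5),
            gr.Textbox(label="Rubric / criterion", lines=3),
        ],
        outputs=gr.Textbox(label="Verdict"),
        title="rubric checker (sketch)",
    )

    if __name__ == "__main__":
        demo.launch()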

Almost Done

While the script needs some final touches, most efforts will now focus on generating test questions in various interesting areas for my purposes and completing the test suite. After this, the true craft of AI creation will commence.


Sunday, January 14, 2024

Unveiling the Realm: Journey into the Fusion of AI and RPG with Karkomagor

Hello, I am Karkomagor – a teacher by profession, an IT enthusiast by experience, and a passionate explorer of the realms of Role-Playing Games (RPGs) and Artificial Intelligence (AI). Today, I invite you on a fascinating journey into the crossroads of these two seemingly disparate worlds, where my passion for storytelling and technology converges into an exciting frontier.

The Beginnings

My journey has been as diverse as the characters that inhabit the worlds of RPGs. Having worked in IT in the past, I've always found myself drawn to the intricate tapestry of code and the endless possibilities it holds. Yet, my heart beats for a different kind of creation – the creation of worlds, stories, and characters that breathe life into the imagination.

A Teacher's Odyssey

As a teacher, I've always seen education as a gateway to new dimensions. I realized that the power of AI could be harnessed not only for solving complex real-world problems but also for crafting immersive and dynamic RPG experiences. Thus began my journey into the fusion of these two passions.

Mastering the Craft

Diving headfirst into this uncharted territory, I continue to dedicate my time to mastering the new skills emerging in the art of AI: from fine-tuning to the intricacies of merging and "frankenmerging"! Most recently, I learned quantization, refining the balance between computational efficiency and model performance.

The Love for RPGs

Why AI and RPG, you might wonder? For me, RPGs are not just games – they are art forms that allow us to step into alternate universes, to become storytellers and creators. It's a journey into the unknown, guided by the choices we make and the narratives we weave. Incorporating AI into this mix is not just a technical challenge; it's a pursuit of understanding the very essence of storytelling.

A Real Challenge

RPGs with AI present a real challenge: it's about crafting narratives that respond dynamically to the actions of players, creating a living and breathing world within the digital domain. If we can solve the challenge of merging AI seamlessly into RPG sessions, we're not just enhancing fictional realities – we're on the cusp of solving real-world challenges with unparalleled creativity.

The Potential Beyond

Imagine indeed a world where AI doesn't just power games but becomes a tool for problem-solving, creativity, and innovation. The bridge between fictional and real realities is thinner than we think, and mastering the complexities of RPGs with AI is a key to unlocking unprecedented potential.

So, join me on this odyssey into the unexplored territories of AI-enhanced RPGs. Together, let's unravel the mysteries, confront the challenges, and witness the birth of a new era where storytelling and technology merge seamlessly. The adventure has just begun, and the possibilities are as limitless as the imagination itself.

Are you ready to embark on this journey? The dice are cast, and the story awaits...
