Merging Fun!
Having explored various techniques for creating new Large Language Models (LLMs), I found great enjoyment in both standard merging and the more adventurous "franken-merging." Merging involves combining two LLMs with the aim of retaining the strengths of both. However, the process feels somewhat like a gamble; success is not guaranteed.
Hugging Face provides a platform where different authors take intuitive chances, merging various models to climb the leaderboard. Surprisingly, it occasionally pays off! Franken-merging takes it a step further, involving dissecting models and creatively reassembling their components.
In simpler terms, with a bit of luck and no deep expertise, an enthusiastic hobbyist can outperform seasoned data scientists!
Apart from the thrill, these methods offer cost-effective solutions. Building a potent LLM from scratch is a costly endeavor reserved for well-funded companies. In contrast, merging requires only an affordable amount of RAM and can be accomplished within minutes. Additionally, there's no need to construct a dataset, a lengthy and complex task.
The Truth
Speaking of datasets, a post I read (apologies for forgetting the author) asserted that "all merges were memes!" This sentiment is understandable, considering the significant role luck plays in the process. The lack of control over the dataset means a lack of control over the model's exact capabilities and possibilities.
While building a genuine dataset remains crucial in LLM development (and is my ultimate goal), merging has established itself as a dark art in crafting AI, offering an easy performance boost.
Suddenly Facing the Obvious
Motivated by this, I embarked on building several models through merging and franken-merging. It seemed straightforward: edit a config file, run a script, and voila! However, as my "models" folder expanded, I realized that evaluating their performance in role-playing scenarios myself was the wrong approach.
I needed a tool for evaluation. Although some tools already existed, I wasn't satisfied and opted to create my own, for two reasons:
- Contamination: Existing tests are known and sometimes used to train LLMs, compromising their integrity. "Decontaminating" LLMs has become necessary for some scientists.
- Specificity: Most tests are too general for my needs. I aim to create a narrator and a game-master, requiring specific skill assessments.
AutoRP-Eval.py
Creating AutoRP-Eval.py wasn't overly challenging, especially with AI assistance. While it needs some refinement (e.g., the model selection is still hard-coded), it effectively evaluates models. It reads from a folder named "items" where I place text files, each containing a question and an evaluation criterion. The script prompts the evaluated AI to answer each question, and the responses are automatically assessed by another LLM.
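To give an idea of the shape of such a loop, here is a minimal sketch (not my actual code): it assumes each file in "items" holds the question on its first line and the criterion on the rest, and that both the evaluated model and the judge are served through an OpenAI-compatible local endpoint. The file format, endpoint URL, and model names are illustrative assumptions.

```python
# Sketch of an items-folder evaluation loop (illustrative, not AutoRP-Eval.py itself).
# Assumes: question on the first line of each item file, criterion on the rest,
# and an OpenAI-compatible server running locally.
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="none")  # hypothetical local server

def ask(model: str, prompt: str) -> str:
    """Send a single-turn prompt to a model and return its reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def evaluate(candidate: str, judge: str) -> None:
    for item in sorted(Path("items").glob("*.txt")):
        question, _, criterion = item.read_text().partition("\n")
        answer = ask(candidate, question)
        verdict = ask(
            judge,
            f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
            f"Criterion:\n{criterion}\n\n"
            "Does the answer satisfy the criterion? Reply PASS or FAIL with a short justification.",
        )
        print(f"{item.name}: {verdict}")

if __name__ == "__main__":
    evaluate(candidate="my-merged-model", judge="judge-model")  # model selection still hard-coded
```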
More Tools
Ensuring correct and consistent grading posed a challenge, requiring many rounds of prompt engineering. A dedicated tool, rubric-eval.py, built in Gradio, allows me to generate or write answers and verify whether they align with the rubric, using a specifically designed LLM named Proctora.
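For readers unfamiliar with Gradio, a bare-bones version of such an interface could look like the sketch below. This is not the actual rubric-eval.py; the endpoint, the way Proctora is exposed as a model name, and the grading prompt are all assumptions.

```python
# Rough sketch of a rubric-checking UI in Gradio (illustrative, not rubric-eval.py itself).
import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="none")  # hypothetical local server

def grade(answer: str, rubric: str) -> str:
    """Ask the judge model whether the answer satisfies the rubric."""
    resp = client.chat.completions.create(
        model="Proctora",  # assumed name under which the judge model is served
        messages=[{
            "role": "user",
            "content": f"Rubric:\n{rubric}\n\nAnswer:\n{answer}\n\n"
                       "Does the answer meet the rubric? Explain briefly.",
        }],
    )
    return resp.choices[0].message.content

demo = gr.Interface(
    fn=grade,
    inputs=[gr.Textbox(label="Answer", lines=8), gr.Textbox(label="Rubric", lines=4)],
    outputs=gr.Textbox(label="Judgement"),
    title="rubric-eval (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```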
Almost Done
While the script needs some final touches, most efforts will now focus on generating test questions in various interesting areas for my purposes and completing the test suite. After this, the true craft of AI creation will commence.