kromem OP , 5 months ago (edited 5 months ago)

It provides an entirely new framework for analyzing skills in LLMs. Do you mean the article doesn't provide new insights, or that the research doesn't?

As for my own interest, in addition to this providing a more rigorous framework for analyzing what I'd already gotten a sense of with the world model research papers over the last year, I can see a number of important nuances.

First off, there's the obvious point of emergent capabilities being a hotly debated topic in research circles, which you likely know if you've followed it at all.

In particular, the approach here complements the paper out of Stanford disputing emergent capabilities on the grounds that other measurements of improvement are linear as size increases. Here, linear improvements in next-token prediction directly tie into emergent skills, so it's promising that the model fits neatly with one of the more notable counterpoints of the past year.

I also think this is an exciting approach if the same framework were remapped to the way Anthropic's research was looking at functional layers as opposed to individual network nodes. By mapping either side of the graph to functional layers it may allow for more successful introspection into larger models than we've had before.

A framework around a controversial research topic that generates testable predictions and then sees those predictions met is generally worth recognizing too.

Finally, I think that Skill-Mix may offer a useful framework for evaluating models, particularly around transmission of skills from larger models to smaller models using synthetic data, which has probably been the most significant research trend in the domain over the past year.

So it's noteworthy in a number of ways, and I could see it having a similar impact to the CoT paper within research circles (where it becomes a component of much of the work that follows and builds on top of it), even if not quite as broad an impact outside of them.

I've generally felt the field is doing a poor job at evaluating models, falling deeper and deeper into Goodhart's Law, and this is a promising breath of fresh air.

As they say opening their paper on it:

Sizeable differences exist among model capabilities that are not captured by their ranking on popular LLM leaderboards ("cramming for the leaderboard"). Furthermore, simple probability calculations indicate that GPT-4's reasonable performance on k=5 is suggestive of going beyond "stochastic parrot" behavior (Bender et al., 2021), i.e., it combines skills in ways that it had not seen during training.
We sketch how the methodology can lead to a Skill-Mix based eco-system of open evaluations for AI capabilities of future models.
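
For a rough sense of the kind of "simple probability calculations" the quote alludes to, here's a minimal counting sketch. The function name and the skill and topic counts below are placeholders I picked for illustration, not figures taken from the paper. The idea is just that the number of distinct (k-skill subset, topic) pairs grows combinatorially with k, so it gets increasingly unlikely that any particular combination appeared verbatim in training data.

```python
from math import comb


def num_skill_topic_combos(n_skills: int, k: int, n_topics: int) -> int:
    """Count distinct (k-skill subset, topic) pairs.

    Assumes skills are chosen without repetition and order does not matter.
    """
    return comb(n_skills, k) * n_topics


# Hypothetical sizes, chosen only for illustration (not from the paper).
N_SKILLS = 100
N_TOPICS = 100

for k in range(1, 6):
    total = num_skill_topic_combos(N_SKILLS, k, N_TOPICS)
    print(f"k={k}: {total:,} possible skill/topic combinations")
```

Whether a given count is "too many to have been seen in training" depends on assumptions about the training corpus that this sketch doesn't model; it only shows how fast the space grows as k increases.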

It's about time we move on to something better than the current evaluation metrics, which we're just trying to game with surface-level fine-tuning.
