ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)

1 month ago

20,751 views

Comments:

@MyCiaoatutti - 02.05.2024 12:10

"Specifically, 1 - p(y|x) in the denominators amplifies the gradients when the corresponding side of the likelihood p(y|x) is low". I think that (1 - p(y|x)) has two different meanings here: it is the term that happens to come out of the differentiation, and it is also the "corresponding side" of the likelihood, i.e., 1 - p(y|x). So, when it says the "corresponding side" of p(y|x) is low, it means that 1 - p(y|x) is low.

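A small numeric sketch of where the 1 - p(y|x) denominator comes from, not taken from the paper or the video. It treats the likelihood p as a plain number and compares the derivative of the log-odds log(p / (1 - p)) with the derivative of log p. The function names are illustrative assumptions.

```python
import math

def log_odds(p):
    # log(p / (1 - p)), written as a difference of logs
    return math.log(p) - math.log(1.0 - p)

def numeric_derivative(f, p, eps=1e-6):
    # Central finite difference, good enough for a sanity check
    return (f(p + eps) - f(p - eps)) / (2.0 * eps)

def amplification(p):
    # Ratio of d/dp log_odds(p) to d/dp log(p); analytically this is 1 / (1 - p)
    return numeric_derivative(log_odds, p) / numeric_derivative(math.log, p)

for p in (0.1, 0.5, 0.9, 0.99):
    print(f"p={p}: numeric={amplification(p):.4f}, analytic={1.0 / (1.0 - p):.4f}")
```

The factor 1 / (1 - p) grows as 1 - p shrinks, which is consistent with reading "the corresponding side is low" as "1 - p(y|x) is low".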

@gauranshsoni4011 - 02.05.2024 10:33

Keep them comin

@Basant5911 - 02.05.2024 05:59

I found this more endearing than netflix 🥹❤️.

@rectomgris - 02.05.2024 05:14

makes me think of PPO

@wwkk4964 - 02.05.2024 03:43

What's going on, is it Yannic bonanza time of the year? Loving these addictive videos

@justheuristic - 02.05.2024 00:37

The main loss function (7) looks like it can be meaningfully simplified with school-level math.

Recall that loss function (7) is
Lor = -log(sigm( log ( odds(y_w|x) / odds(y_l|x)))), where sigm(a) = 1/(1 + exp(-a)) = exp(a) / (1 + exp(a))
Let's assume that both odds(y_w|x) and odds(y_l|x) are positive (because softmax)

Then, plugging in sigmoid, you get
Lor = - log (exp(log(odds(y_w|x) / odds(y_l|x) )) / (1 + exp(log(odds(y_w|x) / odds(y_l|x)))) )
Note that exp(log(odds(y_w|x) / odds(y_l|x))) = odds(y_w|x) / odds(y_l|x). We use this to simplify:
Lor = - log( [odds(y_w|x) / odds(y_l|x)] / (1 + odds(y_w|x) / odds(y_l|x)) )
Finally, multiply both numerator and denominator by odds(y_l|x) to get

Lor = - log( odds(y_w|x) / (odds(y_w|x) + odds(y_l|x)) )

Intuitively, this is the negative log of the ratio (odds of good response) / (odds of good response + odds of bad response).
If you minimize the average loss over multiple texts, it's the same as maximizing the odds that the model chooses the winning response in every pair (of winning + losing responses).
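
A quick numeric check of the simplification above, as a sketch: it evaluates the original form of loss (7) and the simplified form for a few made-up probability pairs and confirms they agree. The function names and the example probabilities are assumptions chosen for illustration.

```python
import math

def odds(p):
    # odds(y|x) = p / (1 - p) for a probability p in (0, 1)
    return p / (1.0 - p)

def loss_original(p_w, p_l):
    # -log(sigm(log(odds_w / odds_l))) with sigm(a) = 1 / (1 + exp(-a))
    log_or = math.log(odds(p_w) / odds(p_l))
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))

def loss_simplified(p_w, p_l):
    # -log(odds_w / (odds_w + odds_l))
    return -math.log(odds(p_w) / (odds(p_w) + odds(p_l)))

# Hypothetical (p_w, p_l) pairs, not taken from any experiment
pairs = [(0.6, 0.3), (0.2, 0.7), (0.5, 0.5), (0.9, 0.05)]
for p_w, p_l in pairs:
    a, b = loss_original(p_w, p_l), loss_simplified(p_w, p_l)
    assert math.isclose(a, b, rel_tol=1e-9)
    print(f"p_w={p_w}, p_l={p_l}: original={a:.6f}, simplified={b:.6f}")
```

When both responses have the same odds, the ratio is 1/2 and both forms give log 2.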

@lone0017 - 01.05.2024 23:58

6 videos in 7 days, I'm having a holiday and this is such a perfect-timing treat.

@meselfobviouslyme6292 - 01.05.2024 21:23

Thank you Mr Kilcher for delving into the paper, ORPO: Monolithic Preference Optimization without Reference Model

@I-0-0-I - 01.05.2024 20:49

Thanks for explaining basic terms along with the more complex stuff, for dilettantes like myself. Cheers.

@thunder89 - 01.05.2024 20:15

The comparison at the end between OR and PR should also discuss the influence of the log-sigmoid, shouldn't it? And, more importantly, what the gradients for the winning and losing outputs would actually look like with these simulated pairs... It feels a bit hand-wavy why the log-sigmoid of the OR should be the target...
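
One way to probe this numerically, as a rough sketch rather than a reproduction of the paper's analysis: assume, purely for illustration, that the two likelihoods are drawn independently from Unif(0, 1), then compare the log probability ratio and the log odds ratio before and after the -log sigmoid. The sampling choice, sample size, seed, and function names are all assumptions.

```python
import math
import random
import statistics

def neg_log_sigmoid(z):
    # -log(sigmoid(z)) = log(1 + exp(-z))
    return math.log1p(math.exp(-z))

def sample_log_ratios(n=10_000, seed=0):
    # Assumption: p_w and p_l are independent uniform draws, clipped away from 0 and 1
    rng = random.Random(seed)
    log_pr, log_or = [], []
    for _ in range(n):
        p_w = rng.uniform(1e-6, 1 - 1e-6)
        p_l = rng.uniform(1e-6, 1 - 1e-6)
        pr = math.log(p_w) - math.log(p_l)
        # log OR = log PR + log(1 - p_l) - log(1 - p_w)
        log_pr.append(pr)
        log_or.append(pr + math.log1p(-p_l) - math.log1p(-p_w))
    return log_pr, log_or

log_pr, log_or = sample_log_ratios()
print("std of log PR:", statistics.pstdev(log_pr))
print("std of log OR:", statistics.pstdev(log_or))
print("mean -log sigmoid(log PR):", statistics.fmean(map(neg_log_sigmoid, log_pr)))
print("mean -log sigmoid(log OR):", statistics.fmean(map(neg_log_sigmoid, log_or)))
```

For any single pair, the log odds ratio has the same sign as the log probability ratio and is at least as large in magnitude, so under this sampling assumption the log OR values are more spread out. This says nothing on its own about the gradients, which would need a separate check.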

@Embassy_of_Jupiter - 01.05.2024 19:52

why hat, indeed

@axelmarora6743 - 01.05.2024 19:23

great! now apply ORPO to a reward model and round we go!

@amber9040 - 01.05.2024 19:23

I feel like AI models have gotten more stale and same-y ever since RLHF became the norm. Playing around with GPT-3 was wild times. Hopefully alignment moves in a direction with more diverse ranges of responses in the future, and less censorship in domains where it's not needed.
