TL;DR: Fanboying JvN, then a nuts-and-bolts description of the von Neumann-Morgenstern theorem. A connection to reward modeling, some simulations, and pretty figures!
I'm not sure I understand what the neural net stuff at the end is about. If you are doing noisy sorting, that doesn't require neural networks, and there are lots of known algorithms: https://gwern.net/Resorter#noisy-sorting (You are right that noisy comparison sorting turns out to be much the same, complexity-wise, as noise-free comparison sorting.) You can also make sorting and ranking directly differentiable (https://gwern.net/GPT-2-preference-learning#grover-et-al-2019), which would probably be much more efficient than trying to brute-force it by iterating a MLP (wouldn't that approach bloat any NN you tried to do that inside of as a component? You'd have to unroll it for a small fixed number of iterations, I guess, and weight-tie it).
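To make the second link concrete, here is a minimal sketch of the NeuralSort relaxation from Grover et al. 2019 referenced above; the function name and test scores are illustrative, not taken from either link:

```python
import torch

def neural_sort(s: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # Continuous relaxation of the sorting permutation (Grover et al. 2019).
    # s: (n,) scores. Returns an (n, n) row-stochastic matrix P_hat; as
    # tau -> 0 its rows approach one-hot vectors and P_hat @ s approaches
    # s sorted in decreasing order.
    n = s.shape[0]
    A = (s.unsqueeze(0) - s.unsqueeze(1)).abs()  # A[i, j] = |s_i - s_j|
    B = A.sum(dim=1)                             # (A @ 1)_j; A is symmetric
    scal = n + 1 - 2 * torch.arange(1, n + 1, dtype=s.dtype)  # n+1-2i, i=1..n
    M = scal.unsqueeze(1) * s.unsqueeze(0) - B.unsqueeze(0)
    return torch.softmax(M / tau, dim=-1)

s = torch.tensor([0.3, -1.2, 2.5, 0.9])
print(neural_sort(s, tau=0.1) @ s)  # ~ [2.5, 0.9, 0.3, -1.2]
```

Because every operation here is differentiable, the relaxed permutation matrix can sit inside a larger network and be trained end-to-end, rather than unrolling an iterated MLP.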
Hi there, sorry, I only found this comment now. Those references are useful, thanks for sharing!
I'm not surprised that there are more straightforward methods for noisy sorting. I was more curious about learning the kind of noisy sorting that reward modelling does (the set-up I arrive at is the same one used in RLHF). A priori, it wasn't clear to me whether a neural network would fail gracefully when confronted with non-transitive preferences (or what failing gracefully would even mean here).
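To spell out the set-up I mean, here is a minimal sketch of a reward model trained on pairwise comparisons with the Bradley-Terry / logistic loss used in RLHF reward modelling; the tiny MLP, one-hot item encoding, and training loop are illustrative choices, not the post's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Three items, one-hot encoded, with a deliberately non-transitive
# preference cycle: A > B, B > C, C > A.
items = torch.eye(3)
winners = items[[0, 1, 2]]
losers = items[[1, 2, 0]]

# A small MLP reward model, trained with the Bradley-Terry / logistic
# loss on pairwise comparisons, as in RLHF reward modelling.
reward_model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

for _ in range(500):
    r_w = reward_model(winners).squeeze(-1)
    r_l = reward_model(losers).squeeze(-1)
    # Bradley-Terry: P(winner preferred over loser) = sigmoid(r_w - r_l)
    loss = -F.logsigmoid(r_w - r_l).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# No reward assignment can satisfy a cycle; the loss is minimised when all
# three rewards are equal, so every pair is predicted at ~50%.
print(reward_model(items).squeeze(-1))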