I'm not sure I understand what the neural net stuff at the end is about. If you are doing noisy sorting, that doesn't require neural networks, and there are lots of known algorithms: https://gwern.net/Resorter#noisy-sorting (You are right that noisy comparison sorting turns out to be much the same, complexity wise, as noise-free comparison sorting.) You can also make sorting and ranking directly differentiable (https://gwern.net/GPT-2-preference-learning#grover-et-al-2019), which would probably be much more efficient than trying to brute force it by iterating a MLP (wouldn't that approach bloat any NN you tried to do that inside of as a component? You'd have to unroll it for a small fixed number of iterations, I guess, and weight-tie it).

Expand full comment

Hi there, sorry, only found this comment now. Those references are useful, thanks for sharing!

I'm not surprised that there are more straightforward methods for noisy sorting. I was more curious about learning the type of noisy sorting that reward modelling does (the set-up I arrive at is the same that is used in RLHF). A priori it wasn't clear to me whether a neural network would fail gracefully when confronted with non-transitive preferences (or what failing gracefully even means exactly).

Expand full comment