Use a pretrained Marian model instead of a randomly initialized BART so the tests produce non-trivial results. Shrink the bitod test datasets: now that we are also testing e2e_dialogue_evaluation, this keeps the overall testing time almost the same.
It's almost never been used.