- Using a differentiable BLEU loss instead of cross_entropy loss
- It helps decrease the train-test evaluation gap
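
The note above does not say how the BLEU loss is made differentiable, so the following is only a minimal sketch of one possible relaxation, not necessarily the one used here. It assumes the model emits a per-position probability distribution over the vocabulary and replaces hard n-gram match counts with expected counts, treating positions as independent. The names `soft_bleu` and `soft_bleu_loss`, the `max_n=2` default, and the `eps` smoothing term are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def soft_bleu(probs, reference, max_n=2, eps=1e-9):
    """Expected-count relaxation of sentence-level BLEU (illustrative only).

    probs:     list of T per-position distributions over the vocabulary,
               e.g. softmax outputs (each a list of floats summing to 1).
    reference: list of reference token ids.
    """
    T, R = len(probs), len(reference)
    if T == 0 or R == 0:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        if T < n or R < n:
            break
        # Reference n-gram counts, used to clip the expected matches.
        ref_counts = Counter(
            tuple(reference[j:j + n]) for j in range(R - n + 1)
        )
        matched = 0.0
        for gram, ref_count in ref_counts.items():
            # Expected number of occurrences of `gram` in the hypothesis,
            # assuming tokens at different positions are independent.
            expected = sum(
                math.prod(probs[i + k][gram[k]] for k in range(n))
                for i in range(T - n + 1)
            )
            matched += min(expected, ref_count)
        precision = matched / (T - n + 1)
        # eps keeps the logarithm finite when nothing matches (assumption).
        log_precisions.append(math.log(precision + eps))
    # Brevity penalty with the hypothesis length fixed to T.
    brevity = min(1.0, math.exp(1.0 - R / T))
    return brevity * math.exp(sum(log_precisions) / len(log_precisions))


def soft_bleu_loss(probs, reference, max_n=2):
    # Lower is better: approximately 0 when the prediction matches the reference.
    return 1.0 - soft_bleu(probs, reference, max_n)


# Toy usage with a 4-token vocabulary and the reference sequence [0, 1, 2].
reference = [0, 1, 2]
perfect = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
uniform = [[0.25] * 4 for _ in range(3)]
wrong = [[0.0, 0.0, 0.0, 1.0] for _ in range(3)]
loss_perfect = soft_bleu_loss(perfect, reference)
loss_uniform = soft_bleu_loss(uniform, reference)
loss_wrong = soft_bleu_loss(wrong, reference)
```

The sketch is written in plain Python for readability. Because it uses only sums, products, `min`, `log`, and `exp` over the probabilities, the same computation could be expressed in an autodiff framework to obtain gradients, whereas cross_entropy loss scores each token independently and does not see n-gram overlap at all.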