Replace chunked fla with recurrent fla for MoE model decoding

Currently, we use chunked fla for both prefilling and decoding. However, recurrent fla should perform better for decoding, where each step typically processes only one new token per sequence and chunking offers little benefit. We should use chunked fla only for prefilling and switch decoding to recurrent fla where possible.
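To illustrate the proposed split, here is a minimal pure-Python sketch of a toy, ungated linear attention with a chunked path and a recurrent path behind a single dispatch function. It is an assumption-laden illustration, not the actual kernels: the names `recurrent_step`, `chunked_forward`, `linear_attention`, and the `is_decode` flag are hypothetical, and the real fla kernels (gating, delta rule, fused GPU code) are more involved.

```python
# Toy, ungated linear attention. All names are hypothetical and are NOT the
# fla library's API. `state` is a d_k x d_v list of lists, updated in place.

def recurrent_step(state, q, k, v):
    """One-token update: S += k^T v, then return o = q S."""
    for i, k_i in enumerate(k):
        row = state[i]
        for j, v_j in enumerate(v):
            row[j] += k_i * v_j
    return [sum(q[i] * state[i][j] for i in range(len(q)))
            for j in range(len(v))]


def chunked_forward(state, qs, ks, vs, chunk_size=4):
    """Process a sequence chunk by chunk; returns one output per token."""
    outputs = []
    d_v = len(vs[0])
    for start in range(0, len(qs), chunk_size):
        q_c, k_c, v_c = (x[start:start + chunk_size] for x in (qs, ks, vs))
        for t, q in enumerate(q_c):
            # Inter-chunk part: contribution of the state carried in.
            out = [sum(q[i] * state[i][j] for i in range(len(q)))
                   for j in range(d_v)]
            # Intra-chunk part: causal attention over tokens 0..t of the chunk.
            for s in range(t + 1):
                score = sum(a * b for a, b in zip(q, k_c[s]))
                for j in range(d_v):
                    out[j] += score * v_c[s][j]
            outputs.append(out)
        # Fold the whole chunk into the state once.
        for k, v in zip(k_c, v_c):
            for i, k_i in enumerate(k):
                for j, v_j in enumerate(v):
                    state[i][j] += k_i * v_j
    return outputs


def linear_attention(state, qs, ks, vs, is_decode):
    """Dispatch: recurrent path for decoding, chunked path for prefilling."""
    if is_decode:
        return [recurrent_step(state, q, k, v) for q, k, v in zip(qs, ks, vs)]
    return chunked_forward(state, qs, ks, vs)
```

In this toy setting both paths compute the same outputs and leave the same state, so prefilling with the chunked path and then decoding with the recurrent path matches running the chunked path over the whole sequence. The sketch only shows that the switch is numerically safe; it says nothing about the actual speedup.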