According to TechWeb's report on September 19, the domestic authoritative evaluation system FlagEval (Libra) has announced its September leaderboard results for the latest large models. Based on the latest CLCC v2.0 subjective evaluation dataset, the FlagEval (Libra) September leaderboard focuses on 7 recently popular open-source dialogue models. Judging from the overall results, Baichuan2-13B-Chat, Qwen-7B-Chat, and Baichuan2-7B-Chat lead the field, with accuracy rates exceeding 65%.

On the base model leaderboard, the objective evaluation results of Baichuan2, Qwen, InternLM, and Aquila all surpassed the Llama and Llama2 models of the same parameter scale. On the SFT model leaderboard, Baichuan2-13B-Chat, YuLan-Chat-2-13B, and AquilaChat-7B took the top three spots. Baichuan2 performed strongly on both objective evaluation leaderboards, and in the base model tests it surpassed Llama2 in both the Chinese and English domains.

It is reported that FlagEval (Libra) is a large-model evaluation system and open platform launched by the Beijing Academy of Artificial Intelligence (BAAI). It aims to establish scientific, fair, and open evaluation benchmarks, methods, and toolsets to help researchers comprehensively assess the performance of foundation models and training algorithms. The FlagEval large language model evaluation system currently covers 6 major evaluation tasks, nearly 30 evaluation datasets, and more than 100,000 evaluation questions.