Model Leaderboard

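Performance comparison of models on the CPsyExam benchmark tasks. Scores are accuracy on single-choice (SCQ) and multiple-response (MAQ) questions in zero-shot (Zero) and five-shot (Five) settings.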

CPsyExam-KG (Knowledge)

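| Model | SCQ (Zero) | MAQ (Zero) | SCQ (Five) | MAQ (Five) | Avg (Zero) | Avg (Five) |
| --- | --- | --- | --- | --- | --- | --- |
| Open-sourced Models | | | | | | |
| ChatGLM2-6B | 49.89% | 9.86% | 53.81% | 14.85% | 39.81% | 44.00% |
| ChatGLM3-6B | 53.51% | 5.63% | 55.75% | 5.51% | 41.46% | 43.10% |
| YI-6B | 33.26% | 0.26% | 25.39% | 14.01% | 24.95% | 22.31% |
| QWEN-14B | 24.99% | 1.54% | 38.17% | 13.19% | 19.08% | 31.88% |
| YI-34B | 25.03% | 1.15% | 33.69% | 18.18% | 24.95% | 22.31% |
| Psychology-oriented Models | | | | | | |
| MeChat-6B | 50.24% | 4.10% | 51.79% | 11.91% | 38.62% | 41.75% |
| MindChat-7B | 49.25% | 6.27% | 56.92% | 5.51% | 38.43% | 43.97% |
| MindChat-8B | 26.50% | 0.00% | 26.50% | 0.13% | 19.83% | 19.86% |
| Ours-SFT-6B | 52.95% | 10.50% | 58.77% | 2.94% | 42.26% | 44.71% |
| API-based Models | | | | | | |
| ERNIE-Bot | 52.48% | 6.66% | 56.10% | 10.37% | 40.94% | 44.58% |
| ChatGPT | 57.43% | 11.14% | 61.53% | 24.71% | 45.78% | 52.26% |
| ChatGLM | 63.29% | 26.12% | 73.85% | 42.13% | 53.93% | 65.86% |
| GPT4 | 76.56% | 10.76% | 78.63% | 43.79% | 59.99% | 69.85% |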
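
The Avg columns are not the simple mean of the SCQ and MAQ columns. One plausible reading, which is an assumption and not stated in this leaderboard, is that Avg weights each question type by its share of the question set. The sketch below back-solves that share from a few CPsyExam-KG rows in the Zero columns. The helper names `implied_scq_share` and `weighted_avg` are hypothetical and exist only for illustration.

```python
def implied_scq_share(scq: float, maq: float, avg: float) -> float:
    """Solve avg = w * scq + (1 - w) * maq for the SCQ weight w.

    Assumes Avg is a weighted mean of the two columns; the leaderboard
    itself does not state how Avg is computed.
    """
    if scq == maq:
        raise ValueError("share is undetermined when SCQ and MAQ are equal")
    return (avg - maq) / (scq - maq)


def weighted_avg(scq: float, maq: float, share: float) -> float:
    """Weighted mean of SCQ and MAQ accuracy for a given SCQ share."""
    return share * scq + (1.0 - share) * maq


# (SCQ, MAQ, Avg) percentages copied from the Zero columns of the KG table.
rows = {
    "ChatGLM2-6B": (49.89, 9.86, 39.81),
    "ChatGLM3-6B": (53.51, 5.63, 41.46),
    "ChatGPT": (57.43, 11.14, 45.78),
}

shares = {name: implied_scq_share(*vals) for name, vals in rows.items()}

for name, share in shares.items():
    print(f"{name}: implied SCQ share = {share:.3f}")
```

If the implied share comes out roughly constant across rows, the weighted-mean reading is consistent with the table. A row whose implied share deviates sharply from the others may be worth checking against the original results.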

CPsyExam-CA (Case Analysis)

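| Model | SCQ (Zero) | MAQ (Zero) | SCQ (Five) | MAQ (Five) | Avg (Zero) | Avg (Five) |
| --- | --- | --- | --- | --- | --- | --- |
| Open-sourced Models | | | | | | |
| ChatGLM2-6B | 52.50% | 16.00% | 48.50% | 20.00% | 43.38% | 41.38% |
| ChatGLM3-6B | 47.00% | 17.00% | 47.33% | 13.50% | 39.50% | 38.88% |
| YI-6B | 38.83% | 0.00% | 20.00% | 13.25% | 29.12% | 18.63% |
| QWEN-14B | 20.33% | 2.00% | 30.00% | 14.00% | 15.75% | 26.00% |
| YI-34B | 20.50% | 0.50% | 22.33% | 8.00% | 15.50% | 19.39% |
| Psychology-oriented Models | | | | | | |
| MeChat-6B | 48.67% | 13.50% | 44.83% | 10.50% | 39.86% | 36.25% |
| MindChat-7B | 40.83% | 5.00% | 33.83% | 4.50% | 31.88% | 26.50% |
| MindChat-8B | 34.17% | 0.00% | 34.17% | 0.00% | 25.63% | 25.63% |
| Ours-SFT-6B | 46.50% | 5.50% | 48.67% | 13.00% | 34.00% | 41.00% |
| API-based Models | | | | | | |
| ERNIE-Bot | 42.50% | 8.50% | 50.67% | 12.00% | 34.00% | 41.00% |
| ChatGPT | 47.33% | 9.00% | 52.67% | 29.50% | 37.75% | 46.88% |
| ChatGLM | 69.00% | 20.50% | 65.33% | 42.50% | 56.88% | 59.63% |
| GPT4 | 60.33% | 13.00% | 64.17% | 39.50% | 48.50% | 58.00% |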