
编者按:人工智能如何重塑语言测评?HSK如何应对时代之问?作为全球应用最广的中文水平考试,HSK借助人工智能,正在构建更智能、更个性化的测评生态。本文立足测评理论与全球实践,系统阐释智能化时代HSK的演进路径、多语种考试创新及其面临的真实挑战。如果你关注语言测评的未来、人工智能与教育的融合,或正在思考国际化背景下中文测评的发展方向,这篇文章提供了一个理性而前沿的视角。


摘要:国际中文测试是国际中文教育服务于全球各国中文人才进阶的重要工具。作为全球规模最大、体系最成熟的中文能力测评标准,在合理使用的前提下,HSK 考试可利用人工智能为全球中文学习者提供新型态考试,并构建人机协同的多语种翻译能力评估新生态。
Abstract: The HSK test, as a crucial tool for international Chinese education to serve the advancement of Chinese language talents worldwide, stands as the largest and most mature Chinese language proficiency assessment globally. When used appropriately, HSK can leverage AI to offer a new paradigm of examination for Chinese language learners worldwide and establish a new ecosystem for human-AI collaborative, multilingual translation assessment.
关键词:智能化;语言测试;HSK;以考促学;来华留学
Key words: Artificial Intelligence; language testing; HSK; assessment for learning; studying in China
一、HSK考试
教育测评是依据特定教育目标,通过系统化的方法和技术,对教育活动、过程及结果进行测量与价值判断的过程。其核心在于科学诊断教育质量、优化教学决策并促进学生发展。国际中文测试是国际中文教育服务于全球各国中文人才进阶的重要工具,旨在通过设计、实施和解释测试来衡量学习者在特定语言环境下的中文听、说、读、写、译等技能水平。
作为最具代表性的国际中文测试,历经四十余年,HSK取得了长足发展。考试体系不断完善,考生规模持续增长,应用领域日渐拓展,已成为广受国际社会认可的国际性权威中文水平考试,在开展国际中文教育质量评价、吸引国际学生来华留学、服务中外经贸合作等方面发挥了重要作用。作为全球规模最大、体系最成熟的中文能力标准化考试,HSK不仅以学理清晰、成效显著成为来华留学的黄金标尺,更在人工智能等技术赋能下,逐步构建起人机协同的多语种翻译能力评估新生态。HSK的科学性源于其严谨的测评理论与标准化设计,主要体现在以下方面。
1. 能力导向的层级化架构。HSK3.0以《国际中文教育中文水平等级标准》为框架,将语言能力细分为听、说、读、写、译五维,并设置1-9级渐进式等级体系。每一级别均对应明确的“Can-Do”描述(张新玲,刘逸凡,2025),这种基于交际能力理论的设计,避免了传统考试对孤立语言知识的机械考核,更贴近真实语境需求。
2. 心理测量学的精准应用。HSK采用项目反应理论优化题库,通过难度参数、区分度参数等动态校准题目,确保不同版本试卷的等效性。例如,HSK3.0增加的翻译题型,通过认知诊断模型精准识别考生弱项,显著提升反馈价值。
3. 跨文化效度的全球验证。截至2025年11月,HSK已在168个国家设立1677个考点,累计考生达820万人次(郁云峰,2025)。这一庞大的数据充分证明了HSK在国际中文教育领域的核心地位和广泛吸引力。HSK不仅是测评工具,更是推动中文国际化的基础设施。其价值体现在两大场景。
(1)来华留学的语言水平门槛。政策层面,中国教育部明确建议,HSK4级为本科入学门槛,HSK5级为硕士申请标准,绝大多数高校把HSK成绩作为来华留学生获得奖学金或录取资格的重要条件。实践有效性层面,跟踪研究发现,来华留学生入学HSK成绩总体可有效预测其学业成就。例如,沈悉尼(2025)基于Perfetti和Stafura(2014)的语言技能框架理论和Anderson和Krathwohl(2001)的认知分类理论,结合《来华留学生高等教育质量规范(试行)》的人才培养目标,展开来华留学生汉语语言能力与学业成就关系研究。研究基于213名来华留学生的入学HSK成绩、入学后的学业记录与问卷数据,以及对20名不同专业和语言水平的学生和4名教师的深度访谈,得出如下结论:① HSK总分与研究对象的GPA在各级水平上均呈显著相关,这种相关性随HSK等级的提高而增强,在HSK6级水平上达到最高。② HSK对留学生综合素质发展的影响因活动而异。参加竞赛对语言依赖度最高,学生组织次之,社会实践呈现中等语言依赖,文体活动语言依赖度最低。③ HSK等级与对中国的理解程度呈显著正相关,这种相关性在历史、文学和哲学等语言依赖性强的领域表现得尤为突出。总体说明,HSK考试对来华留学生学业有非常强的预测效度。
(2)海外中文教育的风向标。HSK的正向教学反拨效应促使教师转向沉浸式教学。同时,在“一带一路”共建国家,HSK证书已成为中资企业招聘的优先条件。
I. HSK Test
Educational Assessment refers to the process of measuring and evaluating educational activities, processes, and outcomes through systematic methods and technologies based on specific educational objectives. Its core lies in scientifically diagnosing educational quality, optimizing teaching decisions, and promoting student development. International Chinese language proficiency assessment is crucial for international Chinese education to support the cultivation of Chinese language talents worldwide. It aims to measure learners’ Chinese skills, including listening, speaking, reading, writing, and translation, in specific linguistic contexts through test development, administration, and interpretation.
HSK (Hanyu Shuiping Kaoshi) has achieved remarkable development over more than four decades. With a continuously improved assessment system, a growing examinee population, and expanding fields of application, HSK has become a globally recognized, authoritative international Chinese language proficiency test. It plays a vital role in evaluating the quality of international Chinese education, attracting international students to study in China, and serving Sino-foreign economic and trade cooperation. As the world’s largest and most mature standardized test of Chinese language proficiency, HSK not only serves as a gold standard for studying in China due to its clear theoretical foundation and proven effectiveness but also has the potential to gradually build a new human-AI collaborative system for multilingual translation assessment. The scientific nature of HSK stems from its rigorous assessment theory and standardized design, mainly reflected in the following aspects.
1. Proficiency-oriented Hierarchical Framework
Based on the Chinese Proficiency Grading Standards for International Chinese Language Education (CPGSICLE), HSK subdivides language proficiency into five dimensions: listening, speaking, reading, writing, and translation and interpreting. It also adopts a progressive 1-9 level system, with each level corresponding to explicit “Can-Do” statements (Zhang & Liu, 2025). This design, rooted in communicative competence theory, avoids the mechanical testing of isolated linguistic knowledge typical of traditional language tests and aligns more closely with real-world communicative needs.
2. Appropriate Application of Psychometrics
HSK optimizes its item bank using Item Response Theory (IRT), dynamically calibrating items through parameters such as difficulty and discrimination to ensure the equivalence of different test versions. For instance, the new translation task in HSK3.0 accurately identifies examinees’ weaknesses via cognitive diagnostic models, significantly enhancing the value of feedback.
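The role of the difficulty and discrimination parameters mentioned above can be illustrated with the two-parameter logistic (2PL) IRT model. The sketch below is a generic textbook illustration, not HSK's actual calibration pipeline; the item parameter values are invented for demonstration.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that an examinee of ability `theta`
    answers correctly an item with discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Two hypothetical items: an easy, weakly discriminating item and a
# harder, highly discriminating item (parameters are invented).
easy_item = dict(a=0.8, b=-1.0)
hard_item = dict(a=2.0, b=1.0)

for theta in (-1.0, 0.0, 1.0, 2.0):
    p_easy = p_correct(theta, **easy_item)
    p_hard = p_correct(theta, **hard_item)
    print(f"theta={theta:+.1f}  P(easy)={p_easy:.2f}  P(hard)={p_hard:.2f}")
```

Note how an examinee whose ability equals an item's difficulty (theta = b) has exactly a 50% chance of success, and how a larger discrimination `a` makes that probability change more sharply around the difficulty point; calibrating these parameters on response data is what makes different test versions comparable.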
3. Global Validation of Cross-cultural Evidence
As of November 2025, there are 1,677 HSK test centers across 168 countries, with a cumulative number of examinees reaching 8.2 million (Yu, 2025). This large-scale data confirms HSK’s central position and broad appeal in international Chinese education. HSK also serves as infrastructure for promoting the internationalization of Chinese, with its value manifested in two key scenarios:
(1) Language Proficiency Threshold for Studying in China
At the policy level, the Ministry of Education of China explicitly recommends HSK4 as the threshold for undergraduate admission and HSK5 for master’s program applications, and most universities use HSK scores as a criterion for international student admission or scholarships. At the practical level, follow-up studies have found that international students’ admission HSK scores can generally predict their academic achievements effectively. For example, Shen (2025) conducted a study on the relationship between international students’ Chinese language proficiency and academic achievements, drawing on the framework of language skills (Perfetti & Stafura, 2014), the cognitive taxonomy (Anderson & Krathwohl, 2001), and the talent training objectives outlined in the Higher Education Quality Standards for International Students in China (Trial). The data comprised 213 international students’ admission HSK scores, academic records over two years, and questionnaire responses, supplemented by in-depth interviews with 20 students of varying majors and language levels and 4 teachers. Three key findings emerged: 1) The total HSK scores were significantly correlated with the participants’ GPA across all levels, with the correlation strengthening as HSK levels increased and peaking at HSK6. 2) The impact of HSK on international students’ comprehensive quality development varied by activity type: competitions had the highest language dependence, followed by student organizations, social practices (moderate dependence), and cultural/sports activities (lowest dependence). 3) HSK levels showed a significant positive correlation with students’ understanding of China, particularly in language-intensive fields such as history, literature, and philosophy. Overall, these results indicate that HSK has strong predictive validity for the academic achievements of international students in China.
(2) A Bellwether for Overseas Chinese Education
HSK’s positive washback effect on teaching has prompted educators to adopt immersive teaching methods. Meanwhile, in countries jointly building the Belt and Road Initiative, HSK certificates have become a preferred qualification in recruitment by Chinese-funded enterprises.
二、智能化中文测试
智能化中文测试是指利用人工智能技术,对中文语言能力进行自动化评估和测试的过程。随着人类语言生活历经言语化、语法化、信息化、数字化,进入人工智能化阶段,人工智能技术为重塑教育测评特别是语言测评的生态提供了全新机遇。
首先,提高教育测试的效率,以及准确性和个性化水平。以国际中文水平测试为例,在精准中文字符识别和生成、中文语法分析和应用、中文语义理解和产出技术上,通过语料库、大模型、知识图谱、智能体,实现国际中文测评的文本、图像、音频、语境、文化等多类型数据智能处理。例如,大语言模型(如 ChatGPT)为HSK等中文水平测试的命题与评分环节带来了革命性的变革。在命题方面,AI可以基于海量的、经过筛选和标注的中文语料库,自动生成符合不同等级难度要求、题型多样且内容新颖的测试题目。这不仅能大幅提高命题效率,减轻命题专家的工作负担,还能有效避免人为命题可能带来的主观性和局限性,确保试题的科学性和公平性。在评分方面,AI技术同样展现出巨大潜力。对于客观题,AI可以实现快速、准确的自动批改。对于主观题,如口语和写作,AI可以通过自然语言处理、语音识别和计算机视觉等技术,从发音、流利度、语法、词汇、内容、逻辑等多个维度对考生的作答进行精细化分析和评估。
其次,从结果评价到过程赋能。利用AI实现教、学、评一体化,强化动态评估、过程评价、个性化发展,以多模态创生、智能生成、沉浸式、行为测试为关键词的中文测评新范式,正在形成。传统的测评模式往往是终结性的,即在教学阶段结束后进行一次性的考试,难以全面、及时地反映学生的学习过程和进步情况。而智能化测评系统可以将评估无缝融入日常的教学活动,实现过程性评价与终结性评价的有机结合。教师可以利用AI工具随时发起小测验,系统即时生成分析报告,帮助教师了解班级整体和每个学生的知识掌握情况,从而及时调整教学策略。学生也可以通过系统获得即时反馈,了解自己的学习状况,并进行针对性的巩固练习。
第三,构建动态、互动、共荣的人机协同的测评生态体系。可将AI广泛应用于语言测试的任务设计、测试交付、自动化评分、成绩报告以及反馈提供等环节,打造本土化、定制化、个性化的教育评价工具,构建动态、互动、共荣的“人机协同”的语言测评生态体系。
第四,提供教育公平的新机遇,突破资源壁垒。人工智能赋能的教育测评通过云端资源共享、智能个性化支持和数据驱动的精准干预三大技术手段,有效缩小城乡、区域和校际间的教育差距,推动优质教育资源普惠化。
II. Intelligent Chinese Language Proficiency Assessment
Intelligent Chinese language proficiency assessment refers to the process of automated assessment of Chinese language proficiency by using AI technologies. As human linguistic activity has evolved through the stages of verbalization, grammaticalization, informatization, digitization, and now enters the AI era, AI technologies offer unprecedented opportunities to reshape the ecosystem of educational assessment, particularly for language assessment.
Firstly, enhancing the efficiency, accuracy, and personalization of educational assessment. Taking international Chinese language proficiency tests as an example, advances in precise Chinese character recognition and generation, Chinese grammar analysis and application, and Chinese semantic understanding and production, supported by corpora, large language models, knowledge graphs, and intelligent agents, enable the intelligent processing of diverse data types in international Chinese assessment, including text, images, audio, context, and culture. For instance, large language models (LLMs) such as ChatGPT have brought revolutionary change to the item development and rating stages of Chinese proficiency assessments such as HSK. AI can automatically generate test items of varying difficulty levels, diverse types, and novel content based on massive, screened, and annotated Chinese corpora. This not only significantly improves item development efficiency and reduces the workload of experts but also avoids the subjectivity and limitations of manual item writing, ensuring the scientific soundness and fairness of test items. In the rating stage, AI likewise demonstrates great potential: for objective tasks, it can conduct automatic rating efficiently and accurately, while for subjective tasks such as speaking and writing, AI leverages natural language processing, speech recognition, and computer vision to conduct fine-grained analysis and evaluation across multiple dimensions, including pronunciation, fluency, grammar, vocabulary, content, and logic.
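The multi-dimensional rating described above ultimately has to be combined into a reportable score. A minimal sketch of such an aggregation step is shown below; the dimension names follow the paragraph above, but the weights and sub-scores are invented assumptions, not any operational HSK rubric.

```python
# Hypothetical rubric weights for the six dimensions named above.
# The per-dimension sub-scores (0-100) would come from upstream AI
# modules (speech recognition, NLP analysis, etc.) in a real system.
WEIGHTS = {
    "pronunciation": 0.20,
    "fluency":       0.15,
    "grammar":       0.20,
    "vocabulary":    0.15,
    "content":       0.20,
    "logic":         0.10,
}

def aggregate_score(sub_scores: dict) -> float:
    """Combine per-dimension sub-scores into one weighted overall score."""
    missing = set(WEIGHTS) - set(sub_scores)
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(WEIGHTS[d] * sub_scores[d] for d in WEIGHTS)

# Invented sub-scores for one spoken response.
sample = {"pronunciation": 82, "fluency": 75, "grammar": 88,
          "vocabulary": 70, "content": 90, "logic": 80}
print(f"overall: {aggregate_score(sample):.1f}")
```

A transparent aggregation rule like this is also what makes machine scores auditable: a human rater can see exactly which dimension pulled a score down, which supports the human review loop discussed later in the article.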
Secondly, shifting from outcome evaluation to process empowerment. AI enables the integration of teaching, learning, and assessment, emphasizing dynamic assessment, process evaluation, and personalized development. A new paradigm of Chinese assessment, characterized by multimodal generation, intelligent creation, immersion, and behavioral testing, is emerging. Traditional assessment is often summative, conducted once at the end of a teaching phase, and thus struggles to reflect students’ learning processes and progress comprehensively and promptly. In contrast, intelligent assessment systems seamlessly integrate evaluation into daily teaching activities, organically combining formative and summative assessment. Teachers can use AI tools to launch quizzes at any time, and the system generates real-time analytical reports that help teachers understand the overall class performance and individual students’ mastery of knowledge, enabling timely adjustments to teaching strategies. Students can also receive immediate feedback through the system, gain insights into their learning status, and engage in targeted practice.
Thirdly, building a dynamic, interactive, and symbiotic human-AI collaborative assessment ecosystem. AI can be widely applied in various stages of language assessment, including task development, assessment delivery, automatic rating, score reporting, and feedback provision. This facilitates the development of localized, customized, and personalized educational evaluation tools, as well as the construction of a dynamic, interactive, and symbiotic human-AI collaborative language assessment ecosystem.
Fourthly, creating new opportunities for educational equity by breaking resource barriers. AI-powered educational assessment effectively narrows educational gaps between urban and rural areas, across regions, and among schools through three technological means: cloud-based resource sharing, intelligent personalized support, and data-driven precise intervention, thereby promoting equitable access to quality educational resources.
三、智能化测评的HSK考试多语种实践
HSK3.0所有级别可利用人工智能和大语言模型,结合国际中文教育特点,调整其词汇、语法、话题、任务大纲,增加汉字大纲,加大AI命题、AI监考、AI评分,以及个性化AI自适应学习测试产品开发、应用和推广力度,更好服务海内外中文学习者和考生。
HSK3.0 7-9级的考试采用“一卷三级”形式,其翻译题型重点考查学术研究、经济文化等复杂场景下的中英、中日、中韩等多语种专业翻译能力。以人工智能为代表的新技术可赋能多语种中外互译能力评估,打造人机协同模式考试新生态,为全球高阶中文学习者提供更具公平性、真实性、交互性的多语种翻译能力评价体系。
以命题环节为例,命题智能体通过自然语言处理与深度学习算法,根据《国际中文教育中文水平等级标准》和考试大纲要求,依托多语种语料库动态生成翻译试题,人工命题专家负责术语校准、文化适配、试题难度与分布均衡性、正向引导、风险规避等质量控制环节,确保试题的难度、内容覆盖面、代表性和文化适切性。
在评分环节,智能化评卷系统整合语音、文本及视觉信息,进行多模态语义对齐,对译文进行语义连贯性、文化适配性等多维度自动评分,人工评分员进行复核,提升主观题评分的效率、信度、效度。此外,数字人技术为口译测评提供了全新机会,除了可通过高仿真虚拟考官实现实时双语交传能力评价,以及人机协同对考生口译中的听力理解、信息转换、语言表达、跨文化交际、实时反应进行动态评估外,更能突破考官资源不足的瓶颈,实现口译测评应考尽考,具有效率提升、客观性增强、场景真实、反馈个性化、支持远程和大规模口译能力测试的强大优势。
III. Multilingual Practice of the Intelligent HSK Test
Leveraging AI and LLMs, HSK 3.0 can adjust its vocabulary, grammar, topic, and task syllabuses across all levels in line with the characteristics of international Chinese education, add a Chinese character syllabus, and intensify the development, application, and promotion of AI-driven item writing, AI proctoring, AI rating, and personalized AI adaptive learning and testing products to better serve Chinese learners and examinees worldwide.
HSK 3.0 Levels 7-9 adopt a “one paper, three levels” format, with the newly launched translation tasks assessing professional multilingual translation ability (e.g., Chinese-English, Chinese-Japanese, Chinese-Korean) in complex scenarios such as academic research, economy, and culture. Emerging technologies represented by AI can empower multilingual translation assessment, creating a new human-AI collaborative test ecosystem and providing a fairer, more authentic, and more interactive multilingual translation competence evaluation system for advanced Chinese learners worldwide.
For instance, in the item development stage, AI-based item generator agents can dynamically create translation tasks based on the CPGSICLE and the assessment syllabuses, using natural language processing and deep learning algorithms and drawing on multilingual corpora. Human experts remain responsible for quality control, including terminology calibration, cultural adaptation, balancing task difficulty and distribution, positive value guidance, and risk mitigation, thereby ensuring the tasks’ difficulty, content coverage, representativeness, and cultural appropriateness.
In the rating stage, intelligent rating systems integrate audio, text, and visual information for multimodal semantic alignment, conducting automated rating of translations across multiple dimensions such as semantic coherence and cultural adaptability; human raters then review the results, enhancing the efficiency, reliability, and validity of subjective rating. Additionally, digital human technology offers new potential for interpreting assessment: high-fidelity virtual examiners enable real-time bilingual consecutive interpreting assessment, and human-AI collaboration dynamically evaluates examinees’ listening comprehension, information conversion, linguistic expression, cross-cultural communication, and real-time responses during interpreting. Furthermore, it breaks the bottleneck of insufficient examiners, allowing all eligible examinees to take the test. This approach offers significant advantages, including improved efficiency, enhanced objectivity, authentic testing scenarios, personalized feedback, and support for remote and large-scale interpreting assessment.
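The "machine rates, human reviews" division of labor described above can be sketched as a routing rule: confident machine scores pass through, borderline ones go to a human rater. The similarity function below is a deliberately simple bag-of-words stand-in for the semantic-alignment models the article mentions, and the review band and example sentences are invented assumptions.

```python
import math
from collections import Counter

def cosine_sim(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity: a toy stand-in for the
    multimodal semantic-alignment models described above."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def machine_rate(candidate: str, reference: str, review_band=(0.4, 0.8)):
    """Auto-rate a candidate translation against a reference; borderline
    scores are routed to a human rater (human-AI collaboration)."""
    sim = cosine_sim(candidate, reference)
    return {"similarity": round(sim, 2),
            "needs_human_review": review_band[0] <= sim <= review_band[1]}

ref = "the company signed the contract last week"
print(machine_rate("the company signed the contract last week", ref))
print(machine_rate("the company signed an agreement last week", ref))
```

In a production system the similarity model, the review band, and the escalation policy would all need calibration against human ratings; the point of the sketch is only the collaboration pattern itself.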
四、智能化中文测试的挑战
AI 时代的教育测评具有无限可能性,但其面临的新挑战也无疑是时代之问。
第一,评价标准问题。Descartes预言的机器智能和人类智慧的关系问题,映射到教育评价上,就是究竟应以人类智能为中心,还是以人机共创智能为评价标准,对此始终众说纷纭。Descartes强调人类意识的自我反思性是机器无法跨越的鸿沟,即机器无法通过“我思故我在”的终极测试。人工智能缺乏真正的理解能力,其根据情境生成有意义的对话、适应未预设的新场景的能力受到质疑。南加州大学的一项研究显示,当受试者收到内容完全相同的情感支持回复时,若被告知是AI生成,其评价显著低于人类回复,认为AI缺乏真实共情。这印证了Descartes的意识怀疑:人类拒绝承认无意识的机器能理解情感,暴露了以人类共情为评价标准的现实。
第二,分数效度问题。效度指所考察心理特质和计划考察心理特质之间的匹配程度,匹配程度越高,效度越理想(Messick,1989)。AI参与的表面化的关键词匹配和伪创造力可能造成语言测评幻象,在智能化评分环节更是如此。一方面,我们难以获得可与人类产出语言相匹敌的AI数据库,训练数据不足使得AI 的题目资源大打折扣,若依靠这样的数据库命题,通过关键词匹配进行评分,语言考试获得分数的效度将受到威胁。另一方面,AI参与语言测评正从技术底层颠覆传统评分效度的理论基础,其威胁体现在三个层面:技术误判引发效度失准;算法同质化瓦解创造力评估;认知替代模糊能力边界。大语言模型生成的应试作文被多位阅卷名师评价为中规中矩、缺乏血肉,虽符合评分细则却丧失个性文采,在高考作文测评中暴露出模板化表达对创新思维的威胁。更值得警惕的是,AI参与正导致人类语言习惯算法化—高频词分布趋同与句式结构标准化使爆发性指标持续衰减。这种语言同质化将最终瓦解对人类独特思维模式的识别能力。目前,语言评价科学尚未构建起人机智能共生共存的新型评价标准,语言考试得分可能偏离考生的真实语言能力,考试分数效度可能不那么理想。
第三,以考促学面临挑战。从以考促学的角度看,智能化测评固然高效,但最根本的危机在于认知替代的不良导向。比如,如果AI深度介入写作测评全过程——从谋篇布局、逻辑推理到语言润色,测评分数反映的已是人机混合认知的产物。由于考试的强大反拨效应,这必然引发学习者对AI的部分依赖。研究表明,学习者和考生过度依赖AI可导致原创表达能力退化,呈现出辅写工具使用率与认知能力萎缩度正相关的悖论。
第四,伦理问题。尽管人工智能在语言测试中展现出巨大潜力,但其应用也引发了一系列伦理问题:①公平性。AI测试系统可能因算法偏见而影响测试的公平性,对不同文化背景的考生尤其可能。②数据隐私与安全风险。智能化语言测评往往依赖云端系统处理包含个人隐私和敏感信息在内的海量数据,云端系统存在安全漏洞,可能遭遇黑客攻击,导致机密信息泄露或译文被篡改,造成严重后果。③独特文化和技术功利主义博弈。仁者爱人的东方伦理和技术功利主义算法之间的张力将持续存在。AI难以准确处理习语、隐喻、幽默和文化典故的内容。部分文化将语言视为神圣载体,而AI测试中对这类表达的误判,被视为对文化根基的亵渎。④碳足迹争议触发文化抵制。在德国等激进环保国家或地区,学生团体抵制AI语言测试,认为其隐含碳足迹违背“可持续发展”的传统文化理念,要求恢复纸质考试。欧盟文件指出,AI教育应用需通过“绿色算法”认证,要求开发者披露模型全生命周期碳排放数据,否则禁止在公立学校使用(European Commission, 2020)。
第五,社会信任和接受度。虽然AI技术能够提升智能化测评的效率和个性化路径, 但社会对其是否信任仍依赖于其能否保持公正、可控和负责任。人类智能的不可替代性体现在意识、价值观与创造力上,建立健全的伦理规范与监管框架,是确保智能化教育评价健康发展的前提。
第六,职业与就业挑战。智能化测评在提高效率的同时,可能对人类语言评价的职业资格认定和就业机会带来冲击。AI辅助的新语言评价模式可能改变传统的岗位与行业格局,甚至引发职位流失。
IV. Challenges of Intelligent Chinese Language Proficiency Testing
While educational assessment holds boundless potential in the AI era, it also faces new challenges that demand attention.
Firstly, evaluation standards. The relationship between machine intelligence and human wisdom, as anticipated by Descartes, remains a subject of ongoing debate. In educational evaluation, there is no consensus on whether to center on human intelligence or to adopt intelligence co-created by humans and AI as the evaluation standard. Descartes emphasized that the self-reflective nature of human consciousness is an insurmountable gap for machines, meaning AI cannot pass the ultimate test of “I think, therefore I am.” AI lacks authentic understanding, and its ability to generate meaningful dialogue in context and adapt to novel, unanticipated scenarios is questionable. A study by researchers at the University of Southern California found that when participants received emotionally supportive responses with identical content, they rated those attributed to AI significantly lower than those attributed to humans, perceiving AI as lacking genuine empathy. This confirms Descartes’ skepticism about machine consciousness: humans refuse to acknowledge that unconscious machines can understand emotions, reflecting the reality of using human empathy as the evaluation standard.
Secondly, rating validity. Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment (Messick, 1989). Superficial keyword matching and pseudo-creativity involving AI may create illusions in language assessment, particularly in intelligent rating. On the one hand, it is difficult to obtain AI databases comparable to human-generated language. Insufficient training data undermines the quality of AI-generated test items; if items are developed from such databases and scored through keyword matching, the validity of test scores will be threatened. On the other hand, AI’s involvement in language assessment is fundamentally reshaping the theoretical foundation of traditional rating validity, with threats manifesting at three levels: validity distortion caused by technical misjudgments, erosion of creativity assessment due to algorithmic homogenization, and blurred competence boundaries resulting from cognitive substitution. Essay responses generated by LLMs have been judged by experienced raters as conventional and lacking substance: they meet the rating criteria yet lack individuality and literary merit, exposing the threat that formulaic expression poses to innovative thinking in National College Entrance Examination essay assessment. More alarmingly, AI involvement is leading to the algorithmization of human linguistic habits: the convergence of high-frequency word distributions and the standardization of sentence structures are causing a continuous decline in burstiness indicators. This linguistic homogenization will ultimately erode the ability to identify unique human thinking patterns.
Currently, the field of language assessment science has not established new evaluation standards for the coexistence of human and machine intelligence, leading to potential discrepancies between test scores and examinees’ actual linguistic competence, and compromising the rating validity.
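The keyword-matching threat discussed above is easy to demonstrate concretely: a toy scorer that merely counts rubric keywords assigns identical scores to a coherent answer and to word salad, so the score stops reflecting the construct it claims to measure. The keyword list and example answers below are invented for illustration.

```python
def keyword_score(answer: str, keywords: set) -> int:
    """Toy keyword-matching scorer: one point per rubric keyword found,
    regardless of the coherence or correctness of the surrounding text."""
    words = set(answer.lower().split())
    return sum(1 for k in keywords if k in words)

# Hypothetical rubric keywords for an essay prompt.
KEYWORDS = {"economy", "culture", "development", "cooperation"}

coherent = "cultural exchange supports economy development and cooperation culture"
word_salad = "culture culture economy development cooperation random nonsense"

print(keyword_score(coherent, KEYWORDS))    # same score for a coherent answer...
print(keyword_score(word_salad, KEYWORDS))  # ...and for incoherent word salad
```

Because both answers earn the same score while differing sharply in quality, inferences drawn from such scores are invalid in Messick's sense, which is exactly why the article argues that semantic and human-in-the-loop rating is needed.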
Thirdly, challenges to assessment for learning. From the perspective of assessment for learning, while intelligent assessment is efficient, its most fundamental crisis lies in the adverse orientation of cognitive substitution. For example, if AI is deeply involved in the entire writing assessment process, from structure planning and logical reasoning to linguistic refinement, the resulting score reflects the product of human-AI hybrid cognition. Due to the strong washback effect of assessment, this will inevitably lead to learners’ partial reliance on AI. Studies have shown that excessive dependence on AI can lead to the degradation of original expressive ability, presenting a paradox in which the frequency of using writing-assistance tools is positively correlated with the atrophy of cognitive ability.
Fourthly, ethical issues. Despite AI’s great potential in language testing, its application raises a series of ethical concerns: 1) Fairness: AI testing systems may suffer from algorithmic biases, disproportionately affecting examinees from different cultural backgrounds. 2) Data privacy and security risks: Intelligent language assessment often relies on cloud-based systems to process massive amounts of data containing personal and sensitive information. Security vulnerabilities in such systems may invite hacking, resulting in the leakage of confidential information or the tampering of translations, with serious consequences. 3) Tensions between unique cultural values and technological utilitarianism: The tension between the Eastern ethic of benevolence and care for others and utilitarian algorithms will persist. AI struggles to accurately process idioms, metaphors, humor, and cultural allusions. Some cultures regard language as a sacred carrier, and misjudgments of such expressions by AI testing systems are perceived as a desecration of cultural roots. 4) Cultural resistance triggered by carbon footprint controversies: In radical environmentalist countries or regions such as Germany, student groups have boycotted AI language testing, arguing that its implicit carbon footprint violates traditional values of “sustainable development” and demanding the restoration of paper-based tests. EU documents stipulate that AI applications in education must obtain “green algorithm” certification, requiring developers to disclose carbon emission data across models’ full life cycles; otherwise, their use in public schools is prohibited (European Commission, 2020).
Fifthly, social trust and acceptance. While AI technology can improve the efficiency and personalization of intelligent assessment, social trust in it depends on its ability to maintain fairness, controllability, and accountability. The irreplaceability of human intelligence lies in consciousness, values, and creativity. Therefore, establishing sound ethical norms and regulatory frameworks is a prerequisite for the healthy development of intelligent educational evaluation.
Sixthly, career and employment challenges. While intelligent assessment improves efficiency, it may impact professional qualification recognition and employment opportunities in human language evaluation. New AI-assisted language assessment models may reshape traditional job roles and industry structures, potentially leading to job losses.
五、结束语
语言考试为英才提供了进阶阶梯。HSK的全球化成功印证了科学测评与技术创新结合的无限可能。正如国际语言测试领域泰斗Lyle Bachman(1990)所言:“语言测试既服务于语言习得和语言教学研究,也从中获益。”我们主张,AI时代的教育评价,仍应以培养完整的人为终极标准,而非仅追求算法的最优解。中国文化具有强大生命力和独特性,AI中文测试系统可能因算法偏见而影响测试的公平性,因此开发适用于不同文化背景的AI中文测试系统至关重要。国际中文教育呈现低龄化、“中文+职业教育”需求强劲等趋势,利用AI开发适用于不同年龄和职业场景的中文个性化测试工具和平台恰逢其时。随着人工智能、大数据、虚拟现实等技术的深度融合,国际中文测评应在技术创新与伦理规范之间找到平衡,向智能化、个性化、场景化方向演进,为全球中文学习者打造更高效、公平、沉浸式的测评体验,实现公平、公正、以考促学的目标。
V. Conclusion
Language assessments provide a ladder for talents to advance. The global success of HSK confirms the boundless potential of combining scientific assessment with technological innovation. As Lyle Bachman (1990), a leading authority on international language testing, noted: “Language testing both serves and is served by research in language acquisition and language teaching.” We argue that educational evaluation in the AI era should still take cultivating well-rounded individuals as its ultimate goal, rather than merely pursuing optimal algorithmic solutions. Chinese culture possesses strong vitality and uniqueness, yet AI-powered assessment systems may suffer from algorithmic biases that undermine test fairness. Therefore, it is crucial to develop AI-powered Chinese assessment systems suitable for diverse cultural backgrounds. International Chinese education is witnessing trends such as younger learners and strong demand for “Chinese + vocational education,” making it an opportune time to use AI to develop personalized Chinese assessment tools and platforms tailored to different age groups and professional scenarios. With the deep integration of AI, big data, virtual reality, and other technologies, balancing technological innovation with ethical norms, advancing toward intelligence, personalization, and contextualization, and creating more efficient, fair, and immersive assessment experiences for global Chinese learners will enable international Chinese assessment to achieve its goals of fairness, impartiality, and assessment for learning.
参考文献:
[1] Anderson, L., & Krathwohl, D. (2001). A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives. New York: Longman.
[2] Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press.
[3] European Commission (2020). Digital Education Action Plan (2021-2027).
[4] Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13-103). New York: Macmillan.
[5] Perfetti, C., & Stafura, J. (2014). Word knowledge in a theory of reading comprehension. Scientific Studies of Reading, 18(1), 22-37.
[6] 沈悉尼. 来华留学生语言能力与学业成就的关系研究. 上海: 上海大学, 2025.
[7] 郁云峰. 以考促学 以评促教:与时俱进的中文水平考试. 光明日报, 2025-11-11(012).
[8] 张新玲, 刘逸凡. 《国际中文教育中文水平等级标准》和HSK5写作对接研究. 西安外国语大学学报, 2025, 33(01): 39-44.
本文选自2026年第1期《孔子学院》。


