Researchers at the Indian Institute of Technology-Guwahati (IIT-G) have developed a multilingual method to identify and correct Surface Name Errors (SNEs) in Wikipedia, improving the reliability of information for both human users and artificial intelligence systems. The method, unveiled at the India AI Impact Summit 2026, targets errors in surface names, the text that Wikipedia articles use when linking to entities. Such errors occur in 3-6% of mentions. They undermine Wikipedia's credibility and can degrade machine learning models that rely on Wikipedia data.
The method, created by Prof. Amit Awekar and M. Tech student Mr. Anuj Khare, works in three steps. First, it scans Wikipedia links to create quadruplets containing contextual information. Second, it treats a surface name as correct only if it appears at least 10 times and accounts for at least 5% of links. Finally, it classifies the remaining errors as either typing mistakes or entity span errors.
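The three steps can be illustrated with a minimal Python sketch. It is not the researchers' implementation, and several details are assumptions made for illustration: the quadruplet fields (`article`, `surface`, `entity`, `context`), the reading of the 5% threshold as a share of all links pointing to the same entity, and the classification heuristics. Here a rejected surface name counts as an entity span error when it adds or drops whole words relative to an accepted name, and as a typing mistake when its character similarity to an accepted name is at least a hypothetical cutoff of 0.85.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from typing import NamedTuple


class Link(NamedTuple):
    """One Wikipedia link as a quadruplet (field names are assumptions)."""
    article: str   # page on which the link appears
    surface: str   # visible link text (the surface name)
    entity: str    # page the link points to
    context: str   # text surrounding the link


MIN_COUNT = 10         # from the article: at least 10 occurrences
MIN_SHARE = 0.05       # from the article: 5% of links (denominator assumed)
TYPO_SIMILARITY = 0.85  # hypothetical cutoff, not from the article


def accepted_surfaces(links):
    """Step 2: map each entity to the surface names that pass both thresholds."""
    counts = defaultdict(Counter)
    for link in links:
        counts[link.entity][link.surface] += 1
    accepted = {}
    for entity, counter in counts.items():
        total = sum(counter.values())
        accepted[entity] = {
            surface for surface, n in counter.items()
            if n >= MIN_COUNT and n / total >= MIN_SHARE
        }
    return accepted


def _contains(longer, shorter):
    """True if the token list `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    if n == 0 or n >= len(longer):
        return False
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def classify(surface, accepted):
    """Step 3: label a surface name against an entity's accepted names."""
    if surface in accepted:
        return "correct"
    tokens = surface.lower().split()
    for good in accepted:
        good_tokens = good.lower().split()
        # Whole words added or missing: the link span is too wide or too narrow.
        if _contains(tokens, good_tokens) or _contains(good_tokens, tokens):
            return "entity_span"
    for good in accepted:
        # High character-level similarity suggests a typing mistake.
        if SequenceMatcher(None, surface.lower(), good.lower()).ratio() >= TYPO_SIMILARITY:
            return "typo"
    return "unclassified"
```

Under these assumptions, if 20 links to one entity use the text "Barack Obama" and one link each uses "Barack Obam" and "President Barack Obama", only "Barack Obama" passes the thresholds. The sketch then labels "Barack Obam" as a typing mistake and "President Barack Obama" as an entity span error.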
The method was tested in eight languages, and the researchers report accurate results. Prof. Awekar emphasized the importance of reliable data for both human users and AI training. Validation against Wikipedia snapshots from 2018 to 2022 showed that 30% of the predicted errors had been corrected, which supports the method’s effectiveness. Furthermore, over 99% of the manual corrections proposed by the researchers were adopted by the Wikipedia community. The approach combines scalable data processing with community validation, reinforcing the integrity of digital knowledge systems.
