[Paper] HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja

https://arxiv.org/abs/2501.11951

Hanja Processing Platform: https://hanja.dev

YouTube: HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja

Seyoung Song, Haneul Yoo, Jiho Jin, Kyunghyun Cho, Alice Oh

While Korean historical documents are invaluable cultural heritage, understanding those documents requires in-depth Hanja expertise. Hanja is an ancient language used in Korea before the 20th century, whose characters were borrowed from old Chinese but had evolved in Korea for centuries. Modern Koreans and Chinese cannot understand Korean historical documents without substantial additional help, and while previous efforts have produced some Korean and English translations, this requires in-depth expertise, and so most of the documents are not translated into any modern language. To address this gap, we present HERITAGE, the first open-source Hanja NLP toolkit to assist in understanding and translating the unexplored Korean historical documents written in Hanja. HERITAGE is a web-based platform providing model predictions of three critical tasks in historical document understanding via Hanja language models: punctuation restoration, named entity recognition, and machine translation (MT). HERITAGE also provides an interactive glossary, which provides the character-level reading of the Hanja characters in modern Korean, as well as character-level English definition. HERITAGE serves two purposes. First, anyone interested in these documents can get a general understanding from the model predictions and the interactive glossary, especially MT outputs in Korean and English. Second, since the model outputs are not perfect, Hanja experts can revise them to produce better annotations and translations. This would boost the translation efficiency and potentially lead to most of the historical documents being translated into modern languages, lowering the barrier on unexplored Korean historical documents.

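Key Summary

Problem: Most Korean historical documents are written in Hanja, which makes them hard for modern Koreans and non-Koreans alike to understand, and few translations exist.
Solution: The authors developed HERITAGE, an open-source Hanja natural language processing (NLP) toolkit.
Features:
- A web-based platform that offers punctuation restoration, named entity recognition, and machine translation (into Korean and English).
- An interactive glossary that gives each Hanja character's modern Korean reading and English definition.
Goals:
- Give general readers a basic understanding of Korean historical documents.
- Support Hanja experts in their translation work and raise its efficiency, so that more historical documents can be translated.
Conclusion: HERITAGE is expected to improve access to Korean historical documents and to contribute to the preservation of cultural heritage.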
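
To make the glossary feature concrete, here is a minimal Python sketch of a character-level lookup that returns a modern Korean reading and an English definition for each Hanja character. It is only an illustration under stated assumptions, not HERITAGE's implementation: the names `GlossaryEntry`, `GLOSSARY`, `annotate`, and `readings` are hypothetical, and the three-character table is a toy stand-in for a real dictionary.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class GlossaryEntry:
    """Hypothetical glossary record for one Hanja character."""
    reading: str     # modern Korean reading of the character
    definition: str  # short English gloss


# Toy table for illustration only; a real glossary would cover
# thousands of characters and come from a curated dictionary.
GLOSSARY = {
    "天": GlossaryEntry("천", "sky; heaven"),
    "地": GlossaryEntry("지", "earth; ground"),
    "人": GlossaryEntry("인", "person"),
}


def annotate(text: str) -> List[Tuple[str, Optional[GlossaryEntry]]]:
    """Pair each character with its glossary entry, or None if unknown."""
    return [(ch, GLOSSARY.get(ch)) for ch in text]


def readings(text: str) -> str:
    """Join the Korean readings, leaving unknown characters unchanged."""
    return "".join(
        entry.reading if entry is not None else ch
        for ch, entry in annotate(text)
    )


if __name__ == "__main__":
    for ch, entry in annotate("天地人"):
        if entry is None:
            print(ch, "(no entry)")
        else:
            print(ch, entry.reading, entry.definition)
```

Keeping the lookup per character mirrors the "character-level" wording of the abstract. A real system would also need to handle characters with several readings, which this sketch ignores.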

Source: “Output a full translation into Korean, and also output a key summary.” Gemini Advanced 2.0 Experimental Advanced. 2025.01.30.
