Chinese-Russian Corpus of Political Texts: A Comparative Analysis of Probabilistic Topic Models
https://doi.org/10.22162/2619-0990-2025-77-1-247-271
Abstract
Introduction. The article presents a comparative analysis of probabilistic topic models built for a Chinese-Russian corpus of parallel and comparable political texts. The corpus comprises three sub-corpora: Reports on the Work of the Government for 2012–2022 (original Chinese-language texts), their Russian translations, and Presidential Addresses to the Federal Assembly of Russia for 2011–2021 (a comparable Russian-language sub-corpus). Goals. The work aims to identify and describe topics common to the whole corpus, as well as topics specific to individual texts. Topics were interpreted linguistically with the topic labeling tools of the YandexGPT language model, and the resulting topic labels were then compared with expert-generated annotations and automatically extracted keyphrases. Probabilistic topic modelling was performed with the LDA algorithm in TMT (Topic Modeling Tool), as well as with the YAKE, mBERT, and TF-IDF algorithms from the Orange library for Python. These algorithms serve to identify keyphrases and to reveal similarities in topical words across sub-corpora and between the two languages under comparison. Results. A family of probabilistic topic models describing the semantic organization of the Chinese-Russian parallel and comparable corpus of political texts has been created. The topic modelling results are compared with the automatically extracted keyphrases and show intersections for each sub-corpus. The study also provides a part-of-speech (POS) analysis of topical words. The models are shown to reproduce key paradigmatic and syntagmatic relationships in the text corpus. The research is the first to present automatically constructed probabilistic topic models for a Chinese-Russian parallel and comparable corpus of political texts, thus filling some of the gaps in this field.
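For illustration only, the comparison of topical words with automatically extracted keyphrases described in the abstract can be sketched as follows. This is a minimal sketch in plain Python, not the authors' pipeline: it does not use TMT or Orange, the function names `tfidf_top_terms` and `jaccard` are hypothetical, and the tokens are invented toy data rather than material from the corpus.

```python
import math
from collections import Counter


def tfidf_top_terms(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    top = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, then alphabetically for a stable order.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        top.append(ranked[:k])
    return top


def jaccard(a, b):
    """Jaccard overlap of two word lists (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Toy tokenized "sub-corpora" (hypothetical, not taken from the study).
toy_docs = [
    ["economy", "reform", "economy", "people"],
    ["defence", "people", "budget", "defence"],
    ["people", "education", "education", "health"],
]
keyphrases = tfidf_top_terms(toy_docs, k=2)

# Hypothetical top words of one topic, compared with the first document's keyphrases.
topic_words = ["economy", "growth", "reform"]
overlap = jaccard(topic_words, keyphrases[0])
```

A term that occurs in every document (here "people") receives a zero TF-IDF weight, so it never ranks above document-specific terms; the Jaccard score then gives one simple way to quantify how far a topic's words intersect with a sub-corpus's keyphrases.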
About the Authors
Zhu Hui
China
Postgraduate Student
Olga A. Mitrofanova
Russian Federation
Cand. Sc. (Philology), Associate Professor
For citations:
Hui Zh., Mitrofanova O.A. Chinese-Russian Corpus of Political Texts: A Comparative Analysis of Probabilistic Topic Models. Oriental Studies. 2025;18(1):247-271. (In Russ.) https://doi.org/10.22162/2619-0990-2025-77-1-247-271