An NLP Pipeline for Bangla Text Understanding and Linguistic Data Processing: Framework and Implementation
Ahmad Galib *
Research & Development Department, Panjeree Publications Ltd., Bangladesh and Jahangirnagar University, Dhaka, Bangladesh.
*Author to whom correspondence should be addressed.
Abstract
Raw text is the most prevalent form of human language in digital and electronic formats. This research proposes a comprehensive Bangla language processing framework that transforms raw data into structured data and value-added information through clearly defined annotation guidelines. Unlike existing fragmented approaches, this work offers a unified treatment of corpus development and annotation, detailing each processing phase together with its input-output specifications. The pipeline focuses primarily on text-understanding components and integrates essential tasks such as Part-of-Speech (PoS) tagging, parsing, and Named Entity Recognition (NER). Moreover, to establish the semantic state of its linguistic inputs, the framework includes coreference resolution and word sense disambiguation. This end-to-end pipeline is designed for several uses, including high-precision sentiment analysis, automated content moderation, and the development of gold-standard datasets for advanced Bangla NLP research.
Keywords: Bangla text processing system, linguistic corpus, annotated text, NLP pipeline