Статья опубликована в рамках: Научного журнала «Студенческий» № 2(340)
Рубрика журнала: Информационные технологии
Скачать книгу(-и): скачать журнал часть 1, скачать журнал часть 2, скачать журнал часть 3, скачать журнал часть 4, скачать журнал часть 5, скачать журнал часть 6, скачать журнал часть 7
AUTOMATING SPATIAL DATA EXTRACTION FROM UNSTRUCTURED TEXT SOURCES: A COMPARATIVE STUDY OF NLP MODELS
АВТОМАТИЗАЦИЯ ИЗВЛЕЧЕНИЯ ПРОСТРАНСТВЕННЫХ ДАННЫХ ИЗ НЕСТРУКТУРИРОВАННЫХ ТЕКСТОВЫХ ИСТОЧНИКОВ: СРАВНИТЕЛЬНОЕ ИССЛЕДОВАНИЕ МОДЕЛЕЙ НЛП
Ндзана Айбе Кэтрин
студент, кафедра компьютерных наук, Адыгейский государственный университет,
РФ, г. Майкоп
ABSTRACT
The paper explores the challenges of extracting geospatial information from unstructured text data, such as social media and news articles. Given the critical role of spatial data in emergency response and urban planning, this study compares three distinct architectural approaches: a rule-enhanced spaCy NER model, a custom BiLSTM-CRF deep learning network, and a zero-shot Mistral-7B Large Language Model. The results indicate that while LLMs show promise, the rule-based spaCy pipeline remains the most robust production baseline with an F1-score of 0.812.
АННОТАЦИЯ
В статье рассматриваются проблемы извлечения геопространственной информации из неструктурированных текстовых данных, таких как социальные сети и новостные статьи. Учитывая решающую роль пространственных данных в реагировании на чрезвычайные ситуации и городском планировании, в этом исследовании сравниваются три различных архитектурных подхода: модель spaCy NER с улучшенными правилами, специальная сеть глубокого обучения BiLSTM-CRF и большая языковая модель Mistral-7B с нулевым выстрелом. Результаты показывают, что, хотя программы LLM показывают многообещающие результаты, основанный на правилах конвейер spaCy остается наиболее надежным базовым уровнем производства с показателем F1 0,812.
Keywords: Natural Language Processing, Geoparsing, Named Entity Recognition, spaCy, Mistral, FastAPI.
Ключевые слова: обработка естественного языка, геоанализ, распознавание именованных сущностей, spaCy, Mistral, FastAPI.
Introduction
In the digital age, a massive volume of unstructured text is generated daily. Hidden within this text is valuable geospatial information, but manual extraction is inefficient and unscalable. This research tackles the problem by developing an automated system to extract location entities, turning raw text into structured data for critical applications like emergency response, urban planning, and business analytics.
Methodology
The study conducted a comparative analysis of three models using an evaluation set of approximately 60,000 tokens, characterized by a significant class imbalance where location tokens represent only 1.3% of the total data.
- spaCy Pipeline: An industry-standard NLP library utilizing a pre-trained Named Entity Recognition (NER) model, enhanced with custom rules to boost performance on specific location formats.
- BiLSTM-CRF: A custom deep learning model combining a Bidirectional LSTM network to understand text context with a Conditional Random Field (CRF) layer to ensure logical and coherent predicted location tags.
- Mistral-7B: A powerful Large Language Model (Mistral-7B-Instruct-v0.2) utilized via prompt engineering in a zero-shot approach to identify and return locations in a structured format.
System Architecture The system is implemented as a multi-layered API using FastAPI. The request flow begins with the client sending text via a Uvicorn ASGI server to a FastAPI instance. The system utilizes global model caching to ensure efficient, single-instance loading of heavy models before processing the request through specific service functions to generate a structured JSON output.
Table 1.
Results and Analysis
|
Model |
Precision |
Recall |
F1-Score |
|
spaCy |
0.772 |
0.858 |
0.812 |
|
Mistral |
0.53 |
0.58 |
0.55 |
|
BiLSTM-CRF |
0.18 |
0.13 |
0.15 |
Analysis shows that spaCy is the clear winner. Its high Recall (0.858) is particularly valuable for finding most true locations and minimizing missed opportunities. The BiLSTM-CRF model proved highly sensitive to the data imbalance, while Mistral showed moderate success but requires significant fine-tuning or advanced prompting to compete with specialized NER pipelines.
Conclusion
This comparative study confirms that the spaCy pipeline is the most effective out-of-the-box solution for geoparsing tasks. Future work will focus on assembling a large, high-quality Russian dataset to fine-tune transformer architectures capable of handling the unique grammatical and structural complexities of Russian addresses.
References:
- Honnibal M., Montani I. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. – 2017.
- Jiang A. Q. et al. Mistral 7B // arXiv preprint arXiv:2310.06825. – 2023.
- Lample G. et al. Neural architectures for named entity recognition // Proceedings of the 2016 Conference on North American Chapter of the Association for Computational Linguistics. – 2016. – P. 260–270.
- Tiangolo S. FastAPI framework, high performance, easy to learn, fast to code, ready for production. URL: https://fastapi.tiangolo.com/ (date of access: 10.11.2025).


Оставить комментарий