Article published in: Scientific journal «Студенческий» No. 18(356)

Journal section: Information Technology

Bibliographic description:
Wu Ch. DETECTION OF EMERGING TOPICS AND TRENDS IN SOCIAL NETWORKS USING NLP AND MACHINE LEARNING // Студенческий: electronic scientific journal. 2026. No. 18(356). URL: https://sibac.info/journal/student/356/416662 (accessed: 24.05.2026).

DETECTION OF EMERGING TOPICS AND TRENDS IN SOCIAL NETWORKS USING NLP AND MACHINE LEARNING

Wu Changming

master's student, Department of Applied Mathematics and Computer Science, Tomsk State University,

Russia, Tomsk

Karev Svyatoslav Vasilievich

Scientific supervisor, assistant, Department of Applied Mathematics and Computer Science, Tomsk State University,

Russia, Tomsk

ABSTRACT

To address difficult data acquisition, noisy text, and weak model adaptability in social network topic detection, this study takes Bilibili as its research object and builds an end-to-end pipeline: compliant crawler → text preprocessing → optimized LDA → trend visualization. We collected 3,851 valid video entries, tuned LDA for small-sample short texts, and identified four core topics. The fitted model reaches a log perplexity of -7.1716, and topic popularity peaks are consistent with real events. The work offers a reusable technical path for detecting emerging topics in social networks.

 

Keywords: Social networks; Emerging topic detection; LDA; NLP; Bilibili; Compliant crawler.

 

1 Introduction

Social networks are major channels for information dissemination, and Bilibili carries a large volume of real-time trending content. Topic detection on such platforms faces three key problems: restrictive platform APIs and anti-crawling mechanisms, heavy noise in social text, and the poor stability of traditional LDA on small samples of short text. Detecting and tracking topics in social media streams has long been a challenging task in natural language processing [2], and traditional topic models such as Latent Dirichlet Allocation (LDA) perform poorly on short, noisy social media texts [1][3]. Meanwhile, emerging topic detection with NLP and machine learning has become a mainstream research direction in social network analysis [4]. This study aims to achieve compliant data collection, efficient text preprocessing, optimized topic modeling, and trend verification. It covers Bilibili's 15 categories and hot list from April 29 to December 8, 2025, and is limited to text data; multi-modal analysis is out of scope.

2 Research Methods

2.1 Data Acquisition

We build a compliant crawler on Bilibili's public APIs using Python 3.9, requests, and pandas. To avoid blocking, the crawler uses a pool of multiple User-Agent/Cookie groups, hierarchical delays, and proxy rotation. We collect only public, non-private data for non-commercial research and limit the access scale to ensure compliance. After deduplicating records, filtering out low-play videos, and standardizing formats, we obtain 3,851 valid samples, with a 98.2% like rate and 92% comment coverage.
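The cleaning step described above (deduplication, low-play filtering, format standardization) might look roughly like the following. This is a minimal standard-library sketch, not the authors' code: the field names `bvid`, `title`, `description`, and `play`, as well as the `MIN_PLAYS` cutoff, are assumptions made for illustration, since the paper does not state its record schema or play-count threshold.

```python
from typing import Iterable

# Hypothetical cutoff for "low-play" videos; the paper does not state one.
MIN_PLAYS = 1000


def clean_records(records: Iterable[dict], min_plays: int = MIN_PLAYS) -> list:
    """Deduplicate by video ID, drop low-play entries, and normalize fields."""
    seen = set()
    cleaned = []
    for rec in records:
        # "bvid" is an assumed name for the unique video identifier.
        video_id = str(rec.get("bvid", "")).strip()
        if not video_id or video_id in seen:
            continue  # skip records without an ID and repeated IDs
        plays = int(rec.get("play", 0) or 0)
        if plays < min_plays:
            continue  # skip low-play videos
        seen.add(video_id)
        cleaned.append({
            "bvid": video_id,
            "title": str(rec.get("title", "")).strip(),
            "description": str(rec.get("description", "")).strip(),
            "play": plays,
        })
    return cleaned
```

In a pandas-based pipeline the same logic would typically be expressed with `drop_duplicates` and a boolean mask; the plain-Python form is used here only to keep the sketch dependency-free.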

2.2 Text Preprocessing & LDA Modeling

We fuse video titles and descriptions; filter out URLs, digits, and non-Chinese characters; and then apply Jieba word segmentation and stopword removal (general, advertising, and modal words). The valid text retention rate is 85%. We extract features with TF-IDF and optimize the LDA model by filtering extreme words, setting iterations=1000 and alpha='auto', and correcting assignment errors. The model is evaluated by perplexity (log perplexity = -7.1716), and the results are visualized with word clouds and weekly trend charts.
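The preprocessing and evaluation steps above can be sketched as follows, using only the standard library. Several points are assumptions rather than details from the paper: segmentation is taken as already done (the paper uses Jieba; here the token lists are passed in), the thresholds `no_below` and `no_above` for "extreme words" are hypothetical, and the perplexity conversion assumes the reported value is a base-2 per-word likelihood bound, which is the convention gensim's `LdaModel.log_perplexity` follows. The paper does not name its LDA library, although `iterations` and `alpha='auto'` match gensim's parameter names.

```python
import re
from collections import Counter

# ASCII-only URL pattern, so Chinese text adjacent to a link is preserved.
URL = re.compile(r"https?://[\x21-\x7e]+")
# Anything outside the CJK Unified Ideographs block (digits, Latin, punctuation).
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def clean_text(title: str, description: str) -> str:
    """Fuse title and description, then strip URLs and non-Chinese characters."""
    fused = URL.sub(" ", f"{title} {description}")
    return NON_CHINESE.sub(" ", fused).strip()


def filter_tokens(docs, stopwords, no_below=2, no_above=0.5):
    """Drop stopwords, then tokens that are too rare or too common.

    A token is kept if it appears in at least `no_below` documents and in
    at most a fraction `no_above` of all documents (assumed thresholds).
    """
    docs = [[t for t in doc if t not in stopwords] for doc in docs]
    doc_freq = Counter(t for doc in docs for t in set(doc))
    n_docs = len(docs)
    keep = {
        t for t, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above
    }
    return [[t for t in doc if t in keep] for doc in docs]


def perplexity_from_log2_bound(per_word_bound: float) -> float:
    """Convert a base-2 per-word likelihood bound into perplexity."""
    return 2.0 ** (-per_word_bound)
```

Under the base-2 assumption, `perplexity_from_log2_bound(-7.1716)` gives a perplexity of about 144. If the reported figure used a different logarithm base, the conversion would differ.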

3 Experimental Results

The optimized LDA model extracts four stable core topics, listed in Table 1.

Table 1. Core Topics, Key Words and Semantic Focus of Bilibili Video Texts

| Topic | Core Keywords | Semantic Focus |
|---|---|---|
| 1 | Parade, Nation, Power, Documentary | Current affairs hotspots |
| 2 | Documentary, Taiwan, Prison, Ivy Chen | Taiwan-related comprehensive issues |
| 3 | Taiwan, Curator, Drama, Silicone | Taiwan cultural popular science |
| 4 | Documentary, Cultural Relics, Iran, Emerald | Cultural popular science & cross-regional topics |

Trend analysis shows that the peaks of Topic 1 (Parade) coincide with the 2025 National Day commemoration; Topic 3 (Taiwan) stays high in the early stage and rebounds slightly later; and Topics 2 and 4 (Documentary) remain stable. The main achievements are a reusable compliant crawler and improved LDA stability. The limitations are the lack of multi-modal data, a fixed number of topics, and an insufficient sample size.
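A weekly trend series of the kind described above can be derived by counting videos per topic and per calendar week, then locating each topic's peak. The sketch below is illustrative only: it assumes each video has already been assigned a publication date and a dominant topic ID, and the function names are hypothetical rather than taken from the paper.

```python
from collections import Counter, defaultdict
from datetime import date


def weekly_topic_counts(assignments):
    """Count videos per (ISO year, ISO week) for each topic.

    `assignments` is an iterable of (publication_date, topic_id) pairs.
    """
    counts = defaultdict(Counter)
    for day, topic in assignments:
        iso_year, iso_week, _ = day.isocalendar()
        counts[topic][(iso_year, iso_week)] += 1
    return counts


def peak_week(counts, topic):
    """Return the (ISO year, ISO week) with the most videos for a topic."""
    week, _ = max(counts[topic].items(), key=lambda item: item[1])
    return week


# Synthetic example data, not the paper's dataset.
sample = [
    (date(2025, 10, 1), 1),
    (date(2025, 10, 2), 1),
    (date(2025, 10, 3), 1),
    (date(2025, 6, 2), 1),
    (date(2025, 6, 2), 3),
]
trend = weekly_topic_counts(sample)
```

Comparing `peak_week(trend, topic)` against the dates of known events is one simple way to check whether a topic's popularity peak lines up with the event it is assumed to reflect.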

4 Conclusions and Prospects

This study constructs a complete framework for detecting emerging topics on Bilibili, covering compliant data collection, efficient text preprocessing, optimized LDA modeling, and verification of topic trends against real events. The scheme can be reused for topic analysis on other social networks. Future work will extend the data to multi-modal and long-tail content, introduce BERT-BiLSTM, build an event-popularity correlation model, and develop a real-time visualization system.

 

References:

  1. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
  2. Rosa, H., Correia, N., & Pardo, C. (2011). Topic detection and tracking in social media streams. Proceedings of the 20th International Conference on World Wide Web (WWW), 211–220.
  3. Qiang, J. P., Qian, Z. Y., Li, Y., Yuan, Y. H., & Wu, X. D. (2022). Short Text Topic Modeling Techniques, Applications, and Performance: A Survey. IEEE Transactions on Knowledge and Data Engineering, 34(3), 1427–1445.
  4. Sangeetha, S., & Michael, A. (2020). Emerging topic detection in social networks using NLP and machine learning techniques. Journal of King Saud University - Computer and Information Sciences, 32(7), 789–798.