A Neural Network Approach to Named Entity Recognition on Noisy User-Generated Texts
ACS Natural Language Processing (L90) Final Assignment
Named entity recognition (NER) is an important information extraction task in natural language processing (NLP) that involves automatically identifying entities of interest, such as people’s names, organisations and locations. Current state-of-the-art NER systems achieve F1-scores of up to 94.6% on English news texts (Wang et al., 2021a), where the named entities are fairly standard, well-formed and highly predictable. However, the diverse and noisy nature of user-generated texts, together with their novel, emerging and rare named entities, makes NER on social media much more challenging, and standard NER systems have been found to perform poorly on such texts. By comparison, current state-of-the-art NER systems reach F1-scores of only up to 60.45% on user-generated texts (Wang et al., 2021b). In this project, I investigated a bidirectional long short-term memory (BiLSTM) architecture for NER on social media texts and explored several data-processing techniques to improve the model’s performance, namely:
- Downweighting non-named entity labels
- Downsampling non-named entity tokens
- Merging named entity labels
- Adding part-of-speech (PoS) embeddings
I then evaluated my trained models on the W-NUT 2017 shared task on novel and emerging entity recognition. The model trained with the data-processing techniques applied performed significantly better than a model with the same architecture trained on the original dataset without any optimisation.
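To make the first three techniques listed above concrete, the following is a minimal Python sketch, not the implementation used in this project. It assumes BIO-style tags in which "O" marks non-entity tokens. The function names, the merged label names (B-ENT, I-ENT), the default weight and keep probability, and the example sentences are all hypothetical choices for illustration. In particular, downsampling is approximated here by dropping sentences that contain no entities, which may differ from the token-level procedure actually used.

```python
import random


def merge_labels(tags):
    """Collapse typed BIO tags (e.g. B-person, I-location) into B-ENT / I-ENT.

    Assumption: tags look like "O", "B-<type>" or "I-<type>".
    """
    return [t if t == "O" else t[0] + "-ENT" for t in tags]


def label_weights(tags, o_weight=0.5):
    """Build per-label loss weights: entity labels keep 1.0, "O" is downweighted.

    The value of o_weight is a hypothetical hyperparameter.
    """
    return {label: (o_weight if label == "O" else 1.0) for label in set(tags)}


def downsample_sentences(sentences, keep_prob=0.5, seed=0):
    """Keep every sentence with at least one entity token.

    Sentences tagged entirely "O" are kept with probability keep_prob.
    """
    rng = random.Random(seed)
    kept = []
    for tokens, tags in sentences:
        if any(t != "O" for t in tags) or rng.random() < keep_prob:
            kept.append((tokens, tags))
    return kept


# Hypothetical toy data: (tokens, tags) pairs.
data = [
    (["lol", "so", "tired"], ["O", "O", "O"]),
    (["seeing", "Kanye", "West", "tonight"], ["O", "B-person", "I-person", "O"]),
]

merged = merge_labels(data[1][1])
weights = label_weights(merged, o_weight=0.3)
entity_only = downsample_sentences(data, keep_prob=0.0)
```

The resulting weight dictionary could then be passed to a weighted loss function so that errors on "O" tokens contribute less to training than errors on entity tokens.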