Finding topic boundaries in literary text

Authors

  • Nuette Heyns North-West University
  • Menno van Zaanen South African Centre for Digital Language Resources, North-West University

DOI:

https://doi.org/10.55492/dhasa.v3i01.3863

Keywords:

topic modelling, LDA, boundary identification

Abstract

When performing a distant reading analysis of large amounts of literary text, we would like to be able to automatically identify the high-level structure or story lines of these texts. Story lines are not always linear, but contain transitions, such as flashbacks or changes of scenery. To identify these transitions, we propose a system that aims to identify the boundary that marks such a transition. First, we split the text into short snippets. Next, topics are assigned to each of the snippets using LDA, a topic modelling approach. Based on this sequence of LDA topics, potential transition boundaries between snippets are identified. A potential transition lies between two snippets where the intersection of the LDA topics occurring on either side is smallest. If multiple potential transitions are available, the system selects one at random. To evaluate this system, we apply it to the concatenation of two texts, so that the real boundary is known. We compare the results of this system against a random baseline and an oracle system that always selects the best transition when more than one potential transition is available. The system consistently outperforms the baseline. Future work will focus on extending this system to allow for the identification of multiple transitions.
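
The boundary selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that LDA has already been run and that each snippet has been reduced to a single dominant topic id, and it assumes that the "intersection" is the number of distinct topics occurring on both sides of a candidate boundary. The function names (candidate_boundaries, pick_boundary, oracle_boundary) are hypothetical and chosen only for illustration.

```python
import random


def candidate_boundaries(topics):
    """Return (smallest_overlap, positions) for a sequence of topic ids.

    Assumption: topics[k] is the single dominant LDA topic of snippet k.
    A position i denotes a boundary between topics[i - 1] and topics[i].
    """
    if len(topics) < 2:
        raise ValueError("need at least two snippets to place a boundary")
    best, positions = None, []
    for i in range(1, len(topics)):
        # Number of distinct topics that occur on both sides of position i.
        overlap = len(set(topics[:i]) & set(topics[i:]))
        if best is None or overlap < best:
            best, positions = overlap, [i]
        elif overlap == best:
            positions.append(i)
    return best, positions


def pick_boundary(topics, rng=random):
    """Pick one of the tied candidate boundaries at random."""
    _, positions = candidate_boundaries(topics)
    return rng.choice(positions)


def oracle_boundary(topics, true_boundary):
    """Pick the tied candidate that lies closest to the known boundary."""
    _, positions = candidate_boundaries(topics)
    return min(positions, key=lambda i: abs(i - true_boundary))
```

In the evaluation setting of the abstract, topics would hold the topic sequence of two concatenated texts, and true_boundary would be the index of the first snippet of the second text. The oracle variant only makes sense in that setting, because it needs the known boundary to break ties.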

Published

2022-02-25

How to Cite

Finding topic boundaries in literary text. (2022). Journal of the Digital Humanities Association of Southern Africa, 3(01). https://doi.org/10.55492/dhasa.v3i01.3863