Finding topic boundaries in literary text
DOI: https://doi.org/10.55492/dhasa.v3i01.3863
Keywords: topic modelling, LDA, boundary identification
Abstract
When performing a distant reading analysis of large collections of literary texts, we would like to identify the high-level structure, or story lines, of these texts automatically. Story lines are not always linear; they contain transitions, such as flashbacks or changes of scenery. We propose a system that aims to identify the boundary marking such a transition. First, we split the text into short snippets. Next, topics are assigned to each snippet using LDA, a topic modelling approach. Based on this sequence of LDA topics, potential transition boundaries between snippets are identified: a potential transition lies between two snippets where the intersection of the LDA topics occurring on either side of it is smallest. If there are multiple potential transitions, the system selects one at random. To evaluate the system, we apply it to the concatenation of two texts, so that the real boundary is known. We compare its results against a random baseline and against an oracle system that always selects the best transition when more than one potential transition is available. The system consistently outperforms the baseline. Future work will focus on extending the system to identify multiple transitions.
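The boundary selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the LDA step has already been run and that each snippet is represented by a set of topic ids, and it reads "topics on either side" as all topics seen before and after a candidate boundary. The function names `candidate_boundaries` and `pick_boundary` are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Set


def candidate_boundaries(topics: Sequence[Set[int]]) -> List[int]:
    """Return the boundary positions with the smallest topic overlap.

    `topics[i]` is the (assumed) set of LDA topic ids assigned to snippet i.
    A boundary b (1 <= b < len(topics)) separates snippets[:b] from snippets[b:].
    """
    if len(topics) < 2:
        return []  # no position between snippets exists
    overlaps = {}
    for b in range(1, len(topics)):
        left = set().union(*topics[:b])   # all topics before the boundary
        right = set().union(*topics[b:])  # all topics after the boundary
        overlaps[b] = len(left & right)
    smallest = min(overlaps.values())
    return [b for b, size in overlaps.items() if size == smallest]


def pick_boundary(
    topics: Sequence[Set[int]], rng: Optional[random.Random] = None
) -> Optional[int]:
    """Select one candidate boundary at random, as the abstract describes."""
    rng = rng or random.Random()
    candidates = candidate_boundaries(topics)
    return rng.choice(candidates) if candidates else None


if __name__ == "__main__":
    # Toy topic assignments: the first three snippets share topics 0 and 1,
    # the last three share topics 2 and 3.
    snippet_topics = [{0, 1}, {1}, {0}, {2, 3}, {3}, {2}]
    print(candidate_boundaries(snippet_topics))
    print(pick_boundary(snippet_topics, random.Random(0)))
```

In this toy input only one boundary has an empty topic intersection, so the random choice has a single candidate. An oracle in the sense of the abstract would instead pick, from the candidate list, the position closest to the known true boundary.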
License
Copyright (c) 2022 Nuette Heys, Menno van Zaanen
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.