Accurately identifying topics using text: Mapping PubMed

Type: Article in monograph or in proceedings
Title: Accurately identifying topics using text: Mapping PubMed
Author: Boyack K.W.; Klavans R.
Journal Title: STI 2018 Conference Proceedings
Start Page: 107
End Page: 115
Publisher: Centre for Science and Technology Studies (CWTS)
Issue Date: 2018-09-11
Keywords: Scientometrics
Abstract: Recently, citation links have been shown to produce accurate delineations of tens of millions of scientific documents into a large number (~100,000) of clusters (Sjögårde & Ahlgren, 2018). Such clusters, which we refer to as topics, can be used for research evaluation and planning (Klavans & Boyack, 2017a) as well as to identify hot and/or emerging topics (Small, Boyack, & Klavans, 2014). While direct citation links have been shown to produce more accurate topics using large citation databases than co-citation or bibliographic coupling links (Klavans & Boyack, 2017b), no such comparison has been done at a similar scale using topics based on textual relatedness due to the extreme computational requirements of calculating an enormous number of document-document similarities using text. Thus, we simply do not know if topics identified from a large database using textual characteristics are as accurate as those that are identified using direct citation. This paper aims to fill that gap. In this work we cluster over 23 million documents from the PubMed database (1975-2017) using a text-based similarity measure and compare the accuracy of the resulting topics to that of existing citation-based topics using three different measures.
Handle: http://hdl.handle.net/1887/65319
File: STI2018_paper_26.pdf (application/pdf, 1.110Mb)