research-article

Extracting key terms from noisy and multitheme documents

Authors:

Dmitry Lizorkin

WWW '09: Proceedings of the 18th international conference on World wide web

Pages 661 - 670

https://doi.org/10.1145/1526709.1526798

Published: 20 April 2009

Abstract

We present a novel method for key term extraction from text documents. In our method, a document is modeled as a graph of semantic relationships between the terms of that document. We exploit the following remarkable feature of the graph: terms related to the main topics of the document tend to bunch up into densely interconnected subgraphs, or communities, while unimportant terms fall into weakly interconnected communities or even become isolated vertices. We apply graph community detection techniques to partition the graph into thematically cohesive groups of terms. We introduce a criterion function to select the groups that contain key terms and discard the groups of unimportant terms. To weight terms and determine the semantic relatedness between them, we exploit information extracted from Wikipedia.
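
The pipeline described above can be illustrated with a minimal sketch. The abstract does not say which community detection algorithm or criterion function is used, so this example swaps in a naive greedy modularity merge and a hypothetical score (internal edge density times mean term weight). All relatedness values, term weights, the `THRESHOLD` constant, and the function names are invented for illustration; in the method itself these quantities are derived from Wikipedia.

```python
from itertools import combinations


def detect_communities(terms, edges):
    """Naive greedy modularity merging on a weighted, undirected term graph.

    terms: iterable of term strings (vertices, including isolated ones).
    edges: dict mapping frozenset({u, v}) -> positive relatedness weight.
    """
    terms = sorted(set(terms))
    communities = [{t} for t in terms]
    total = sum(edges.values())
    if total == 0:
        return communities  # no edges: every term stays isolated
    degree = {t: 0.0 for t in terms}
    for pair, weight in edges.items():
        for t in pair:
            degree[t] += weight
    while True:
        best_gain, best_pair = 0.0, None
        for i, j in combinations(range(len(communities)), 2):
            a, b = communities[i], communities[j]
            between = sum(edges.get(frozenset((u, v)), 0.0) for u in a for v in b)
            deg_a = sum(degree[t] for t in a)
            deg_b = sum(degree[t] for t in b)
            # Change in modularity if communities a and b were merged.
            gain = between / total - deg_a * deg_b / (2 * total * total)
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            return communities  # no merge improves modularity any further
        i, j = best_pair
        communities[i] |= communities[j]
        del communities[j]


def score(community, edges, term_weight):
    """Hypothetical criterion: internal edge density times mean term weight."""
    internal = sum(
        edges.get(frozenset(pair), 0.0)
        for pair in combinations(sorted(community), 2)
    )
    density = internal / len(community)
    mean_weight = sum(term_weight[t] for t in community) / len(community)
    return density * mean_weight


# Toy two-theme document with navigation noise; every number is made up.
raw_edges = [
    ("apple", "iphone", 0.9), ("apple", "macbook", 0.8), ("iphone", "macbook", 0.7),
    ("guitar", "bass", 0.9), ("guitar", "drums", 0.8), ("drums", "bass", 0.7),
    ("macbook", "guitar", 0.1),  # weak link between the two themes
    ("login", "home", 0.2),      # weakly related navigation terms
]
edges = {frozenset((u, v)): w for u, v, w in raw_edges}
term_weight = {
    "apple": 0.6, "iphone": 0.8, "macbook": 0.7,
    "guitar": 0.7, "drums": 0.6, "bass": 0.5,
    "login": 0.05, "home": 0.02, "sitemap": 0.03,  # "sitemap" has no edges
}
THRESHOLD = 0.1  # hypothetical cut-off for the criterion function

communities = detect_communities(term_weight, edges)
key_terms = sorted(
    t for c in communities if score(c, edges, term_weight) >= THRESHOLD for t in c
)
print(key_terms)
```

On this toy input the two densely connected themes are kept as separate groups, while the weakly connected pair and the isolated vertex score below the cut-off and are discarded. The quadratic pair search is only meant for small examples.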

This approach gives us two advantages. First, it allows effective processing of multi-theme documents. Second, it is good at filtering out noise in the document, such as navigation bars or headers in web pages.

Evaluations of the method show that it outperforms existing methods, producing key terms with higher precision and recall. Additional experiments on web pages show that our method is substantially more effective than existing methods on noisy and multi-theme documents.

Index Terms

Extracting key terms from noisy and multitheme documents
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published In

WWW '09: Proceedings of the 18th international conference on World wide web

April 2009

1280 pages

ISBN:9781605584874

DOI:10.1145/1526709

General Chairs: Juan Quemada (DIT-UPM), Gonzalo León (DIT-UPM)

Program Chairs: Yoelle Maarek (Google Inc., Israel), Wolfgang Nejdl (L3S and Hannover University)

Copyright © 2009 IW3C2 org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WWW '09

WWW '09: The 18th International World Wide Web Conference

April 20 - 24, 2009

Madrid, Spain

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
