DOI: 10.1145/1526709.1526798
Research article

Extracting key terms from noisy and multitheme documents

Published: 20 April 2009

Abstract

We present a novel method for extracting key terms from text documents. The method models a document as a graph of semantic relationships between its terms. We exploit the following feature of this graph: terms related to the main topics of the document tend to bunch up into densely interconnected subgraphs, or communities, while unimportant terms fall into weakly interconnected communities or even become isolated vertices. We apply graph community detection techniques to partition the graph into thematically cohesive groups of terms, and we introduce a criterion function that selects the groups containing key terms and discards the groups of unimportant terms. To weight terms and determine the semantic relatedness between them, we use information extracted from Wikipedia.
This approach has two advantages. First, it processes multi-theme documents effectively. Second, it is good at filtering out noise in a document, such as navigation bars or headers in web pages.
Evaluations show that the method outperforms existing methods, producing key terms with higher precision and recall. Additional experiments on web pages indicate that it is substantially more effective than existing methods on noisy and multi-theme documents.
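
The pipeline outlined in the abstract (term graph, community detection, group selection) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the relatedness scores are invented rather than derived from Wikipedia, connected components over thresholded edges stand in for a real community detection algorithm, and the selection criterion (total internal edge weight) is a hypothetical placeholder for the paper's criterion function, which also uses term weights.

```python
# Hypothetical toy data: pairwise relatedness scores in [0, 1].
# The paper derives relatedness from Wikipedia; these numbers are made up.
RELATEDNESS = {
    ("apple", "iphone"): 0.8,
    ("apple", "ipad"): 0.7,
    ("iphone", "ipad"): 0.9,
    ("obama", "election"): 0.8,
    ("obama", "senate"): 0.6,
    ("election", "senate"): 0.7,
    ("home", "login"): 0.2,      # navigation-bar noise, weakly related
    ("iphone", "login"): 0.05,
}
TERMS = ["apple", "iphone", "ipad", "obama", "election",
         "senate", "home", "login", "contact"]


def build_graph(terms, relatedness, threshold=0.3):
    """Return an adjacency dict that keeps only edges with weight >= threshold."""
    graph = {t: {} for t in terms}
    for (a, b), w in relatedness.items():
        if w >= threshold:
            graph[a][b] = w
            graph[b][a] = w
    return graph


def communities(graph):
    """Connected components, a simple stand-in for community detection."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(group))
    return groups


def score(group, graph):
    """Hypothetical criterion: total internal edge weight (0 for isolated terms)."""
    return sum(graph[a][b] for a in group for b in graph[a] if a < b)


def key_term_groups(terms, relatedness, threshold=0.3, min_score=1.0):
    """Keep only the groups whose score reaches min_score."""
    graph = build_graph(terms, relatedness, threshold)
    return [g for g in communities(graph) if score(g, graph) >= min_score]


if __name__ == "__main__":
    for group in key_term_groups(TERMS, RELATEDNESS):
        print(group)
```

In this toy input the two dense groups correspond to two themes of a multi-theme page, while the weakly connected navigation terms end up as isolated vertices and are discarded. A modularity-based algorithm would be a closer match to the community detection step than the connected components used here.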




      Published In

      WWW '09: Proceedings of the 18th international conference on World wide web
      April 2009
      1280 pages
      ISBN:9781605584874
      DOI:10.1145/1526709

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. community detection
      2. graph analysis
      3. keywords extraction
      4. semantic similarity
      5. wikipedia

      Qualifiers

      • Research-article

      Conference

      WWW '09
      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
