Clustering Abstracts of Scientific Texts Using the Transition Point Technique

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3878)

Abstract

Free access to scientific papers in major digital libraries and other web repositories is limited to their abstracts. Current keyword-based techniques fail on narrow domain-oriented libraries, e.g., those containing only documents on high energy physics, such as the hep-ex collection of CERN. We propose a simple procedure for clustering abstracts, which consists of applying the transition point technique during the term selection process. This technique indexes documents by their mid-frequency terms, because these terms have a high semantic content. In our experiments, the transition point approach was compared with well-known unsupervised term selection techniques. The transition point technique showed that it is possible to obtain better performance than traditional methods. Moreover, we propose an approach to analyse the stability of the transition point term selection method.
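
The abstract does not state how the transition point is computed, so the following is only a minimal sketch of mid-frequency term selection under stated assumptions. It assumes the commonly cited form TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms that occur exactly once, and it keeps the terms whose frequency lies within a neighbourhood of TP. The function names and the `width` parameter (here 0.4) are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Assumed formula: TP = (sqrt(8 * I1 + 1) - 1) / 2,
    where I1 is the number of terms with frequency 1."""
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_mid_frequency_terms(tokens, width=0.4):
    """Keep terms whose frequency falls in [TP * (1 - width), TP * (1 + width)].

    `width` is a hypothetical neighbourhood size chosen for illustration.
    """
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - width), tp * (1 + width)
    return {term for term, f in freqs.items() if low <= f <= high}


# Toy text: ten terms occurring once, one mid-frequency term,
# one rare term, and one very frequent term.
tokens = (
    [f"w{i}" for i in range(10)]
    + ["quark"] * 4
    + ["the"] * 20
    + ["beam"] * 2
)
selected = select_mid_frequency_terms(tokens)
```

In this toy input I1 is 10, so the assumed formula gives a transition point of 4 and only the term occurring four times is kept; both the very frequent term and the rare terms are discarded. The selected terms would then serve as the index vocabulary for whatever clustering method is applied afterwards.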

This work was partially supported by BUAP-VIEP 3/G/ING/05, R2D2 (CICYT TIC2003-07158-C04-03), ICT EU-India (ALA/95/23/2003/077-054), and Generalitat Valenciana Grant (CTESIN/2005/012).

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Pinto, D., Jiménez-Salazar, H., Rosso, P. (2006). Clustering Abstracts of Scientific Texts Using the Transition Point Technique. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2006. Lecture Notes in Computer Science, vol 3878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671299_55

  • DOI: https://doi.org/10.1007/11671299_55

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-32205-4

  • Online ISBN: 978-3-540-32206-1

  • eBook Packages: Computer Science, Computer Science (R0)
