Clustering Abstracts of Scientific Texts Using the Transition Point Technique

Pinto, David; Jiménez-Salazar, Héctor; Rosso, Paolo

doi:10.1007/11671299_55


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3878)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics


Abstract

Free access to scientific papers in major digital libraries and other web repositories is limited to their abstracts only. Current keyword-based techniques fail on narrow domain-oriented libraries, e.g., those containing only documents on high energy physics, such as the hep-ex collection of CERN. We propose a simple procedure for clustering abstracts, which consists of applying the transition point technique during the term selection process. This technique uses the mid-frequency terms to index the documents, because these terms carry high semantic content. In the experiments we carried out, the transition point approach was compared with well-known unsupervised term selection techniques. The transition point technique showed that it is possible to obtain better performance than with traditional methods. Moreover, we propose an approach to analyse the stability of the transition point term selection method.
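
The term selection step described above can be illustrated with a minimal sketch. The abstract does not give a formula, so the code assumes the formulation commonly cited in the transition point literature, TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms that occur exactly once. The function names and the `width` parameter (the relative size of the frequency neighbourhood around TP) are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Return the transition point for a term-frequency mapping.

    Assumption: TP = (sqrt(8 * I1 + 1) - 1) / 2, with I1 the number of
    terms whose frequency is exactly 1. The abstract itself does not
    state this formula.
    """
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_mid_frequency_terms(tokens, width=0.4):
    """Keep the terms whose frequency lies close to the transition point.

    `width` is a hypothetical neighbourhood size: a term is kept when its
    frequency falls within [TP * (1 - width), TP * (1 + width)].
    """
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - width), tp * (1 + width)
    return {term for term, f in freqs.items() if low <= f <= high}


if __name__ == "__main__":
    # Toy token stream: three terms occur once, one twice, one ten times.
    tokens = ["quark", "lepton", "boson"] + ["decay"] * 2 + ["the"] * 10
    print(transition_point(Counter(tokens)))
    print(select_mid_frequency_terms(tokens))
```

In this toy input, the very frequent term and the terms that occur only once fall outside the neighbourhood, so only the mid-frequency term is kept for indexing. The selected terms would then serve as the vocabulary for whatever document representation and clustering algorithm is used downstream.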

This work was partially supported by BUAP-VIEP 3/G/ING/05, R2D2 (CICYT TIC2003-07158-C04-03), ICT EU-India (ALA/95/23/2003/077-054), and Generalitat Valenciana Grant (CTESIN/2005/012).



Author information

Authors and Affiliations

Faculty of Computer Science, BUAP, Puebla, 72570, Ciudad Universitaria, Mexico
David Pinto & Héctor Jiménez-Salazar
Department of Information Systems and Computation, UPV, Valencia, 46022, Camino de Vera s/n, Spain
David Pinto & Paolo Rosso


Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh


About this paper

Cite this paper

Pinto, D., Jiménez-Salazar, H., Rosso, P. (2006). Clustering Abstracts of Scientific Texts Using the Transition Point Technique. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2006. Lecture Notes in Computer Science, vol 3878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671299_55

DOI: https://doi.org/10.1007/11671299_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32205-4
Online ISBN: 978-3-540-32206-1
eBook Packages: Computer Science, Computer Science (R0)
