Abstract
Free access to scientific papers in major digital libraries and other web repositories is limited to their abstracts. Current keyword-based techniques fail on narrow-domain libraries, e.g., those containing only documents on high energy physics, such as the hep-ex collection of CERN. We propose a simple procedure to cluster abstracts that applies the transition point technique during the term selection process. This technique uses mid-frequency terms to index the documents, because these terms have high semantic content. In our experiments, the transition point approach was compared with well-known unsupervised term selection techniques. The transition point technique showed that it is possible to obtain better performance than traditional methods. Moreover, we propose an approach to analyse the stability of the transition point term selection method.
This work was partially supported by BUAP-VIEP 3/G/ING/05, R2D2 (CICYT TIC2003-07158-C04-03), ICT EU-India (ALA/95/23/2003/077-054), and Generalitat Valenciana Grant (CTESIN/2005/012).
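The term selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give a formula, so the code below assumes the estimate commonly used in the transition point literature, TP = (sqrt(8 * I1 + 1) - 1) / 2, where I1 is the number of terms that occur exactly once. The function names and the neighbourhood width `u` are hypothetical, chosen only for illustration, and the clustering step itself is not shown.

```python
import math
from collections import Counter


def transition_point(freqs):
    """Estimate the transition point from a term-frequency table.

    Assumption: uses TP = (sqrt(8 * I1 + 1) - 1) / 2, with I1 the number
    of terms of frequency 1. This formula is not stated in the abstract.
    """
    i1 = sum(1 for f in freqs.values() if f == 1)
    return (math.sqrt(8 * i1 + 1) - 1) / 2


def select_mid_frequency_terms(tokens, u=0.4):
    """Keep terms whose frequency lies within a band around the transition point.

    The band [TP * (1 - u), TP * (1 + u)] and the default u are
    illustrative assumptions, not values taken from the paper.
    """
    freqs = Counter(tokens)
    tp = transition_point(freqs)
    low, high = tp * (1 - u), tp * (1 + u)
    return {term for term, f in freqs.items() if low <= f <= high}


# Toy example: three terms occur once, so I1 = 3 and TP = 2.0.
tokens = "quark quark lepton lepton the the the the the boson decay muon".split()
print(sorted(select_mid_frequency_terms(tokens)))  # ['lepton', 'quark']
```

In this toy input the very frequent term ("the") and the terms that occur once are both discarded, leaving only the mid-frequency terms as index terms.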
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pinto, D., Jiménez-Salazar, H., Rosso, P. (2006). Clustering Abstracts of Scientific Texts Using the Transition Point Technique. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2006. Lecture Notes in Computer Science, vol 3878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671299_55
DOI: https://doi.org/10.1007/11671299_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32205-4
Online ISBN: 978-3-540-32206-1
eBook Packages: Computer Science, Computer Science (R0)