Finding and classifying web units in websites

Published: 01 December 2005

Abstract

In web classification, most researchers assume that the objects to be classified are individual web pages from one or more websites. In practice, this assumption is too restrictive, since a single web page may not carry sufficient information to be treated as an instance of some semantic class or concept. In this paper, we relax this assumption and allow a subgraph of web pages to represent an instance of a semantic concept. Such a subgraph of web pages is known as a web unit. To construct and classify web units, we formulate the web unit mining problem and propose an iterative web unit mining (iWUM) method. The iWUM method first finds subgraphs of web pages using knowledge about website structure and connectivity among the web pages. From these web subgraphs, web units are constructed and classified into categories in an iterative manner. Our experiments using the WebKB dataset showed that iWUM was able to construct and classify web units with high accuracy for the more structured parts of a website.
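
The abstract does not spell out how subgraphs are found, so the following is only a minimal sketch of the general idea of grouping pages by website structure and connectivity. It assumes that "website structure" can be approximated by the URL directory of each page and that "connectivity" means hyperlinks between pages in the same directory. The function names (`directory_of`, `candidate_subgraphs`), the `example.edu` URLs, and the grouping rule are all illustrative assumptions and are not taken from the paper. The iterative construction and classification step is not shown.

```python
from collections import deque
from urllib.parse import urlparse
import posixpath


def directory_of(url: str) -> str:
    """Return the directory part of a URL path (hypothetical helper)."""
    path = urlparse(url).path
    if path.endswith("/"):
        return path
    return posixpath.dirname(path).rstrip("/") + "/"


def candidate_subgraphs(links: dict[str, set[str]]) -> list[set[str]]:
    """Group pages that share a URL directory and are connected by links.

    `links` maps each page URL of a single site to the URLs it links to.
    This grouping rule is an assumption made for illustration only.
    """
    # Undirected adjacency, keeping only links inside one directory.
    adjacency = {page: set() for page in links}
    for source, targets in links.items():
        for target in targets:
            if target in links and directory_of(source) == directory_of(target):
                adjacency[source].add(target)
                adjacency[target].add(source)

    # Connected components via breadth-first search.
    seen: set[str] = set()
    groups: list[set[str]] = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            page = queue.popleft()
            for neighbour in adjacency[page]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        groups.append(component)
    return groups


# Hypothetical four-page site: a staff member's pages and a course's pages.
example_links = {
    "http://example.edu/staff/lee/index.html": {
        "http://example.edu/staff/lee/cv.html",
        "http://example.edu/courses/db/index.html",
    },
    "http://example.edu/staff/lee/cv.html": set(),
    "http://example.edu/courses/db/index.html": {
        "http://example.edu/courses/db/syllabus.html",
    },
    "http://example.edu/courses/db/syllabus.html": set(),
}
groups = candidate_subgraphs(example_links)
```

In this toy site the cross-directory link from the staff page to the course page is ignored, so the staff pages and the course pages end up in separate candidate subgraphs, each of which could then be passed to a classifier as a single object.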

      Published In

      International Journal of Business Intelligence and Data Mining, Volume 1, Issue 2
      December 2005
      109 pages
      ISSN:1743-8195
      EISSN:1743-8187

      Publisher

      Inderscience Publishers

      Geneva 15, Switzerland

      Author Tags

      1. data mining
      2. internet
      3. web classification
      4. web pages
      5. web unit mining
      6. web units
      7. websites
      8. world wide web
