Finding and classifying web units in websites

Authors:

Ee-Peng Lim

International Journal of Business Intelligence and Data Mining, Volume 1, Issue 2

Pages 161 - 193

https://doi.org/10.1504/IJBIDM.2005.008361

Published: 01 December 2005

Abstract

In web classification, most researchers assume that the objects to be classified are individual web pages from one or more websites. In practice, this assumption is too restrictive, since a web page by itself may not carry sufficient information to be treated as an instance of some semantic class or concept. In this paper, we relax this assumption and allow a subgraph of web pages to represent an instance of the semantic concept. Such a subgraph of web pages is known as a web unit. To construct and classify web units, we formulate the web unit mining problem and propose an iterative web unit mining (iWUM) method. The iWUM method first finds subgraphs of web pages using knowledge about website structure and connectivity among the web pages. From these web subgraphs, web units are constructed and classified into categories in an iterative manner. Our experiments using the WebKB dataset showed that iWUM was able to construct and classify web units with high accuracy for the more structured parts of a website.
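
The first step named in the abstract, finding subgraphs of web pages from website structure and connectivity, can be illustrated with a minimal sketch. This is not the iWUM method itself, whose details are not given on this page. The grouping rule used here (pages that share a URL directory and are connected by hyperlinks form one candidate unit), the function name `candidate_web_units`, and the example URLs are all assumptions made for illustration.

```python
from collections import defaultdict, deque
from posixpath import dirname
from urllib.parse import urlparse


def candidate_web_units(links):
    """Group pages into candidate web units (hypothetical rule, not iWUM).

    `links` maps each page URL to the list of URLs it links to. Two pages
    land in the same unit when they share a host and URL directory and are
    connected by a chain of hyperlinks within that directory.
    """
    # Collect every page that appears as a link source or target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    def folder(url):
        parsed = urlparse(url)
        return (parsed.netloc, dirname(parsed.path))

    # Undirected adjacency, restricted to links inside one directory.
    adjacent = defaultdict(set)
    for src, targets in links.items():
        for dst in targets:
            if src != dst and folder(src) == folder(dst):
                adjacent[src].add(dst)
                adjacent[dst].add(src)

    # Connected components via breadth-first search.
    units, seen = [], set()
    for page in sorted(pages):
        if page in seen:
            continue
        component, queue = [], deque([page])
        seen.add(page)
        while queue:
            current = queue.popleft()
            component.append(current)
            for neighbour in sorted(adjacent[current]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        units.append(sorted(component))
    return units


if __name__ == "__main__":
    # Hypothetical department site with two personal home page folders.
    site = {
        "http://cs.example.edu/index.html": [
            "http://cs.example.edu/~alice/index.html",
            "http://cs.example.edu/~bob/index.html",
        ],
        "http://cs.example.edu/~alice/index.html": [
            "http://cs.example.edu/~alice/pubs.html",
            "http://cs.example.edu/~alice/cv.html",
        ],
        "http://cs.example.edu/~bob/index.html": [
            "http://cs.example.edu/~bob/research.html",
        ],
    }
    for unit in candidate_web_units(site):
        print(unit)
```

In a fuller pipeline, each candidate unit would then be passed to a classifier, and the unit boundaries and labels refined over several rounds, which is the iterative part the abstract describes.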

Index Terms

Finding and classifying web units in websites
1. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining

Index terms have been assigned to the content through auto-classification.

Published In

International Journal of Business Intelligence and Data Mining Volume 1, Issue 2

December 2005

109 pages

ISSN:1743-8195

EISSN:1743-8187

Publisher

Inderscience Publishers

Geneva 15, Switzerland
