基于网页的信息系统的一种预处理过程 (A Preprocessing Process for Web-Based Information Systems)

57 pages, 29,240 characters in total


Abstract
With the rapid development of the Web, information on the Web has become increasingly abundant. Because the Web is convenient to use and rich in information, people increasingly rely on it to find the information they need. To make better use of Web information, people continually pursue technologies and systems that can effectively organize and exploit it. However, Web information suffers from many problems: Web pages contain a great deal of noise content, the Web holds a large number of near-replica pages, and necessary metadata is missing. These problems seriously degrade the service quality of Web information systems.
To address the common requirements of Web information systems, this thesis proposes a preprocessing framework and the corresponding methods. The framework comprises three preprocessing tasks: Web page cleaning, near-replica removal, and Web page metadata extraction. During preprocessing, near-replica pages in the raw page collection are deleted, and the remaining pages are cleaned and transformed into a unified structured model (called the DocView model). The model provides the metadata and content data most widely needed across application domains, including elements such as page identifier, page type, content category, title, keywords, abstract, topic content, and relevant hyperlinks. One important advantage of the proposed preprocessing method is that it needs no information beyond the raw pages themselves, whereas such additional information is required by other methods in this field. Another advantage is that the common requirements of Web information systems are extracted in a single pass, which avoids repeatedly executing the same intermediate steps and thus improves information-extraction efficiency.
The preprocessing framework and methods proposed in this thesis have been applied to the "Tianwang" (天网) search engine and to an automatic Web page classification system. The quality improvements observed in these applications after preprocessing verify the effectiveness of the method. It is easy to see that such a preprocessing stage can build a well-organized, cleaned, and more usable information layer on top of any Web page collection, including the World Wide Web itself.

Abstract
With the rapid expansion of the Web, its content becomes richer and richer. People increasingly use the Web to find the information they want because of its convenience and abundance of information. To make better use of Web information, technologies that can automatically re-organize and manipulate Web pages are being pursued, such as Web information retrieval, Web page classification, and other Web mining work. However, the Web contains many kinds of noise, such as noise content within a Web page (local noise) and near-replica Web pages across the Web (global noise), which decrease the quality of the information on the Web and consequently degrade the quality of Web information systems seriously. In addition, metadata of Web pages are widely used in Web information systems, but they are not described explicitly. Some of these problems never arise in traditional work.
In this thesis, we propose a new preprocessing framework and the corresponding approach to meet the common requirements of several typical Web information systems. The framework includes three parts: Web page cleaning, replica removal, and metadata extraction. After the preprocessing stage, redundant Web pages are deleted; the remaining pages are then purified and transformed into a general model called DocView. The model consists of eight elements: identifier, type, content classification code, title, keywords, abstract, topic content, and relevant hyperlinks. Most of them are metadata, while the last two are content data. The main advantage of our approach is that it needs no information beyond the raw page, whereas additional information is usually necessary in previous related work.
The preprocessing framework and approach have been applied to our search engine [TW] and to a Web page classification system. The strong evidence of improvement in these applications shows the practicability of the framework and verifies the validity of the approach. It is not difficult to see that, after such a preprocessing stage, we can build a well-formed, purified, easily manipulated information layer on top of any Web page collection (including the WWW) for Web information systems.
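The eight DocView elements listed in the abstract can be pictured as a simple record type. The sketch below is an illustration only: the field names and types are assumptions inferred from the element list, not the thesis's actual definition.

```python
# A minimal sketch of the DocView model: a unified structured representation
# of a cleaned Web page. Field names are assumptions inferred from the eight
# elements named in the abstract.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocView:
    identifier: str        # page identifier (e.g. URL or document ID)
    page_type: str         # page type, e.g. "topic", "directory", "picture"
    category: str          # content classification code
    title: str
    keywords: List[str] = field(default_factory=list)
    abstract: str = ""
    topic_content: str = ""                                 # cleaned main text (content data)
    relevant_links: List[str] = field(default_factory=list)  # topic-relevant hyperlinks


# Example with hypothetical values:
page = DocView(
    identifier="http://example.com/news/1",
    page_type="topic",
    category="030",
    title="Example headline",
    keywords=["web", "preprocessing"],
)
```

The first six fields are metadata; `topic_content` and `relevant_links` carry the content data, matching the abstract's split between the two kinds of elements.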

Keywords: World Wide Web, data preprocessing, data cleaning, near-replica detection, metadata extraction


Contents

Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Content of This Thesis 2
1.3 Contributions of This Thesis 3
1.4 Organization of This Thesis 3
Chapter 2 Related Work 4
2.1 Search Engines 4
2.2 Automatic Web Page Classification 7
2.3 Information Extraction 9
2.4 Metadata Extraction 10
Chapter 3 Problems Facing Web Information Systems and Their Common Requirements 12
Chapter 4 Preprocessing Methods and Techniques 14
4.1 Preprocessing Framework and Result Description 14
4.1.1 Preprocessing Framework 14
4.1.2 Description of Preprocessing Results 14
4.2 Web Page Representation 15
4.2.1 Tag-Tree Representation of Web Pages 16
4.2.2 Quantitative Representation of Web Pages 19
4.3 Web Page Cleaning 24
4.3.1 Web Page Type Identification 24
4.3.2 Cleaning Topic Pages 25
4.3.3 Cleaning Directory Pages 25
4.3.4 Cleaning Picture Pages 26
4.3.5 Time and Space Efficiency of Web Page Cleaning 26
4.4 Near-Replica Web Page Detection 27
4.4.1 Near-Replica Detection Algorithm 27
4.4.2 Performance Analysis 29
4.5 Web Page Metadata Extraction 29
4.5.1 The Metadata Extraction Workflow 30
4.5.2 Topic Content Extraction 30
4.5.3 Keyword Extraction 30
4.5.4 Content Category Identification 31
4.5.5 Title Extraction 32
4.5.6 Abstract Extraction 32
4.5.7 Topic-Relevant Hyperlink Extraction 33
4.6 Chapter Summary 35
Chapter 5 Applications and Evaluation 36
5.1 Application and Evaluation of Web Page Cleaning in the Automatic Web Page Classification System 36
5.1.1 Application 36
5.1.2 Evaluation Criteria 37
5.1.3 Evaluation Results and Analysis 37
5.2 Application and Evaluation of Near-Replica Removal in the Search Engine 38
5.2.1 Experimental Design 38
5.2.2 Evaluation Criteria 39
5.2.3 Evaluation Results and Analysis 40
5.3 Application and Evaluation of Web Page Metadata in the Search Engine's Indexing Process 41
5.3.1 Retrieval Efficiency Evaluation 41
5.3.2 Retrieval Precision Evaluation 42
5.4 Chapter Summary 44
Chapter 6 Conclusion and Future Work 45
6.1 Conclusion 45
6.2 Future Work 45
References 47

References
[ACMP] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Searching the Web. ACM Transactions on Internet Technology, 2001
[APE] Allison Woodruff, Paul M. Aoki, Eric Brewer, Paul Gauthier, and Lawrence A. Rowe. An Investigation of Documents from the World Wide Web. In Proceedings of the 5th International World Wide Web Conference, pages 963--979, Paris, France, May 1996.

[Fabrizio] Fabrizio Sebastiani. A tutorial on automated text categorization.
[FSC] 冯是聪 (Feng Shicong). Research on Automatic Classification Techniques for Chinese Web Pages and Their Application in Search Engines. Ph.D. dissertation, Peking University.
[Google] Google Inc. http://www.google.com .
[HCB] D. Hawking, N. Craswell, P. Bailey, and K. Griffiths. Measuring search engine quality. Information Retrieval, 4(1):33-59, 2001.
[HD98] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521-538, 1998.
[HITS] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[HMC] J. Hammer, H. Garcia-Molina, J. Cho, A. Crespo, and R. Aranha. Extracting semistructured information from the web. In Proceedings of the Workshop on Management of Semistructured Data, pages 18-25, May 1997.
[JW] Jim Cowie and Wendy Lehnert. Information extraction. Communications of the ACM, 39(1):80-91, January 1996.
[LD] D. Lewis et al. Training algorithms for linear text classifiers. In Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 298-306, 1996.
[LG98] Steve Lawrence and C. Lee Giles. Searching the World Wide Web. Science, 280(5360):98-100, April 1998.
[LH02] S.-H. Lin and J.-M. Ho. Discovering informative content blocks from web documents. SIGKDD, 2002.
[LS] Xiaoli Li and Zhongzhi Shi. Innovating web page classification through reducing noise. Journal of Computer Science and Technology, 17(1), January 2002.
[Manber94] U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 1-10, San Francisco, CA, USA, 1994.
[PR] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.
[Ralph97] Ralph Grishman. Information extraction: techniques and challenges. Lecture Notes in Artificial Intelligence, Vol. 1299, pages 10-27, Springer-Verlag, Berlin Heidelberg, 1997. ISBN 3-540-63438-X.
[SB] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
[SCAM] N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, 1995.
[SM99] N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. In WEBDB: International Workshop on the World Wide Web and Databases, WebDB. LNCS, 1999.
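The near-replica detection work cited above ([Manber94], [SM99]) is commonly built on fingerprinting overlapping word sequences ("shingles") and comparing documents by the overlap of their fingerprint sets. The following is a generic sketch of that idea, not the detection algorithm of this thesis:

```python
# Generic near-replica detection via shingle fingerprints (in the spirit of
# [Manber94] and [SM99]); parameters and thresholds are illustrative only.
import hashlib


def shingles(text: str, k: int = 4) -> set:
    """Return the set of hashed k-word shingles of a text."""
    words = text.split()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(max(1, len(words) - k + 1))
    }


def resemblance(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of the two shingle sets; values near 1.0 suggest near replicas."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river shore"
sim = resemblance(doc1, doc2)   # 9 of 11 distinct shingles shared -> 9/11
```

In practice, a system would index the fingerprints so that candidate pairs are found without comparing every pair of pages; the sketch only shows the pairwise similarity measure itself.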