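Research on Topic-Customizable Automatic Acquisition of Bilingual Parallel Corpora from the Web. Master's thesis, 73 pages, 42,022 characters in total. Abstract: Large-scale bilingual parallel corpora are an essential resource for building high-quality statistical machine translation systems. In domain-specific statistical machine translation applications, using parallel data related to the domain topic as training data yields good translation quality. This thesis proposes a topic-customizable method for automatically acquiring bilingual parallel corpora from the web, with the aim of making full use of we...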
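Master's thesis: Research on Topic-Customizable Automatic Acquisition of Bilingual Parallel Corpora from the Web
Abstract
1. Automatic acquisition of bilingual parallel corpora from the web
2. Topic-customizable automatic acquisition of bilingual parallel corpora
3. Design and implementation of a topic-customizable system for automatic acquisition of bilingual parallel corpora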
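Contents
Abstract
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Overview
1.1.1 Research background and significance
1.1.2 Analysis of the state of research in China and abroad
    Construction of bilingual corpora
    Automatic acquisition of bilingual translation resources from the web
    Acquisition of bilingual resources customizable by domain topic
1.2 Main research objectives and content
1.2.1 Research objectives
1.2.2 Main research content
1.3 Organization of the thesis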
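Chapter 2 Automatic Acquisition of Bilingual Parallel Corpora from the Web
2.1 Introduction
2.2 Overview of parallel sentence pair acquisition based on URL naming similarity
2.3 Parallel sentence pair acquisition based on web page structural similarity
2.3.1 Overview of parallel sentence pair acquisition based on the DOM tree alignment model
2.3.2 An improved DOM tree alignment acquisition algorithm based on the longest common substring of tag sequences
2.4 A parallel sentence pair acquisition method combining URL similarity and web page structural similarity
2.5 Experiments and analysis
2.5.1 Analysis of the performance of the parallel website identification module
2.5.2 Statistics on web page similarity
2.5.3 Comparison of the two acquisition methods
2.6 Chapter summary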
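Chapter 3 Topic-Customizable Automatic Acquisition of Bilingual Parallel Corpora
3.1 Introduction
3.2 Topic description model
3.2.1 Description of user requirements
3.2.2 Analysis and understanding of the user's topic description
3.3 Method for acquiring topic-specific data
3.4 Experiments and discussion
3.4.1 Performance evaluation of the topic-customizable bilingual resource acquisition method
3.4.2 Application of topic-specific bilingual data in statistical machine translation
    Experiments on the NIST evaluation task
    Experiments on the travel conversation topic
3.5 Chapter summary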
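Chapter 4 Design and Implementation of a Topic-Customizable System for Automatic Acquisition of Bilingual Resources
4.1 Introduction
4.2 System design and main modules
4.3 Implementation of key functions
4.3.1 Website download function
4.3.2 User interaction
4.3.3 Data retrieval function
4.4 System application
4.5 Chapter summary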
Chapter 5 Conclusion
5.1 Summary of this work
5.2 Directions for future research
References
Acknowledgements