电力大数据 (Power Systems and Big Data), 2018, 21(12)
Research on Rapid Deduplication of Massive Web Pages Based on Counting Bloom Filter
Wu Jiaqi (吴家奇)
(State Grid Huainan Power Supply Company, Anhui Electric Power Co., Ltd.)
Received: 2018-06-16    Revised: 2018-06-24
Abstract: With the continued development of network technology and electric power information services, the volume of network information keeps expanding, leaving a massive number of redundant web pages on both the Internet and power information networks. This redundancy increases the complexity of data mining and fast retrieval and places heavy demands on the performance of network and storage equipment, so fast deduplication of massive web page collections is well worth studying. Web page deduplication is the process of detecting redundant pages in a given large data set and then removing them from that set. Research on URL-based deduplication of same-source pages has made considerable progress, but there is still no satisfactory solution for deduplicating massive web page collections. Building on a web page deduplication algorithm that uses an MD5 fingerprint library, and drawing on the properties of the Counting Bloom filter, this paper proposes a fast deduplication algorithm called IMP-CMFilter. The algorithm improves the efficiency of massive web page deduplication by reducing frequent I/O operations. Experiments demonstrate the effectiveness of the IMP-CMFilter algorithm.
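The general idea described in the abstract, placing a Counting Bloom filter in front of an MD5 fingerprint library so that most pages never trigger a lookup in the library, can be illustrated with a minimal sketch. This is not the paper's IMP-CMFilter algorithm, whose details the abstract does not give. The class and function names, the counter-array size, the number of hash functions, and the use of an in-memory set as a stand-in for an on-disk fingerprint library are all assumptions made for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with counters instead of bits, so entries can also be removed."""

    def __init__(self, num_counters=1 << 16, num_hashes=4):
        # Sizes are illustrative assumptions, not values from the paper.
        self.m = num_counters
        self.k = num_hashes
        self.counters = [0] * num_counters

    def _positions(self, fingerprint):
        # Double hashing: derive k positions from the two 64-bit halves
        # of a 128-bit MD5 digest.
        h1 = int.from_bytes(fingerprint[:8], "big")
        h2 = int.from_bytes(fingerprint[8:], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, fingerprint):
        for p in self._positions(fingerprint):
            self.counters[p] += 1

    def remove(self, fingerprint):
        # Only meaningful for fingerprints that were added earlier.
        for p in self._positions(fingerprint):
            if self.counters[p] > 0:
                self.counters[p] -= 1

    def might_contain(self, fingerprint):
        # False means "definitely absent"; True means "possibly present".
        return all(self.counters[p] > 0 for p in self._positions(fingerprint))


def md5_fingerprint(page_text):
    """Return the 16-byte MD5 digest of a page's text."""
    return hashlib.md5(page_text.encode("utf-8")).digest()


def deduplicate(pages):
    """Return (unique_pages, store_lookups) for an iterable of page texts."""
    cbf = CountingBloomFilter()
    fingerprint_store = set()  # hypothetical stand-in for an on-disk fingerprint library
    unique_pages = []
    store_lookups = 0
    for page in pages:
        fp = md5_fingerprint(page)
        # Consult the (expensive) store only when the filter reports a possible hit.
        if cbf.might_contain(fp):
            store_lookups += 1
            if fp in fingerprint_store:
                continue  # confirmed duplicate, skip it
        cbf.add(fp)
        fingerprint_store.add(fp)
        unique_pages.append(page)
    return unique_pages, store_lookups
```

In this sketch, a page whose fingerprint the filter reports as absent is accepted without touching the store, which is where the I/O saving would come from in a disk-backed setting. A possible hit is always confirmed against the store, so false positives from the filter cost an extra lookup but never cause a unique page to be dropped.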