Normalization of Duplicate Records from Multiple Sources

PhoenixZone Technologies

Data consolidation is a challenging issue in data integration. The usefulness of data increases when it is linked and fused
with other data from numerous (Web) sources. The promise of Big Data hinges upon addressing several big data integration
challenges, such as record linkage at scale, real-time data fusion, and integrating the Deep Web. Although much work has been conducted
on these problems, there is limited work on creating a uniform, standard record from a group of records corresponding to the same
real-world entity. We refer to this task as record normalization. Such a record representation, coined the normalized record, is important for
both front-end and back-end applications. In this paper, we formalize the record normalization problem and present an in-depth analysis of
normalization granularity levels (e.g., record, field, and value-component) and of normalization forms (e.g., typical versus complete).
We propose a comprehensive framework for computing the normalized record. The proposed framework includes a suite of record
normalization methods, from naive ones, which use only the information gathered from the records themselves, to complex strategies,
which globally mine a group of duplicate records before selecting a value for an attribute of the normalized record. We conducted
extensive empirical studies with all the proposed methods. We indicate the strengths and weaknesses of each and recommend
the ones to be used in practice.
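To make the field-level granularity concrete, here is a minimal sketch of one naive strategy of the kind the abstract alludes to: for each attribute of a duplicate cluster, pick the most frequent non-empty value, breaking ties in favor of the longer value. The function name, tie-breaking rule, and sample data are illustrative assumptions, not the specific algorithms evaluated in the paper.

```python
from collections import Counter

def normalize_cluster(duplicates):
    """Naive field-level normalization: for each attribute, select the
    most frequent non-empty value across the duplicate records,
    preferring the longer (more complete) value on ties."""
    normalized = {}
    attributes = {attr for record in duplicates for attr in record}
    for attr in attributes:
        values = [r[attr] for r in duplicates if r.get(attr)]
        if not values:
            continue
        counts = Counter(values)
        # Rank candidates by frequency first, then by string length.
        normalized[attr] = max(values, key=lambda v: (counts[v], len(v)))
    return normalized

# Hypothetical cluster of duplicate records describing the same book.
cluster = [
    {"title": "Data Integration", "publisher": "MK"},
    {"title": "Data Integration", "publisher": "Morgan Kaufmann"},
    {"title": "data integration", "publisher": "Morgan Kaufmann"},
]
print(normalize_cluster(cluster))
# {'title': 'Data Integration', 'publisher': 'Morgan Kaufmann'}
```

The more complex strategies described in the paper would replace the per-attribute frequency rule with evidence mined globally from the whole group of duplicates before a value is selected.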

Tags:

#Record_normalization #data_quality #data_fusion #web_data_integration #deep_web