Duplicate product record detection engine for e-commerce platforms

Authors:

Highlights:

• Extensive feature engineering is performed to detect duplicate records.

• Categories and brands follow their own naming conventions, which the detection method must adapt to.

• Text processing specific to the e-commerce domain is applied.

• Results show that duplicate records can be detected with high precision.
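
To make the idea of text-similarity features concrete, here is a minimal sketch in Python. It is not the paper's feature set, which is not described on this page. The function names (`normalize`, `jaccard`, `similarity_features`) and the two chosen features are illustrative assumptions only.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase a title and replace runs of non-alphanumerics with a space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the whitespace-separated token sets of a and b."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty titles are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similarity_features(title_a: str, title_b: str) -> dict:
    """Hypothetical pairwise features for a duplicate-record classifier."""
    a, b = normalize(title_a), normalize(title_b)
    return {
        "token_jaccard": jaccard(a, b),
        # Character-level similarity ratio in [0, 1] from the standard library.
        "char_ratio": SequenceMatcher(None, a, b).ratio(),
    }
```

In a setup like this, feature dictionaries for candidate record pairs would be passed to a classifier that labels each pair as duplicate or not. Which features and classifier the authors actually used is not stated here.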

Keywords: Duplicate record detection, Feature engineering, Text similarity, Classification

Article history: Received 3 January 2021, Revised 7 November 2021, Accepted 15 December 2021, Available online 7 January 2022, Version of Record 18 January 2022.

Official paper URL: https://doi.org/10.1016/j.eswa.2021.116420