A deep recurrent neural network approach to learn sequence similarities for user-identification

Authors:

Highlights:

• A novel, deep neural-network-based framework for quantifying similarities for sequential data is presented.

• The method combines a specific type of recurrent neural nets trained on pairwise samples of sequences with a triplet loss cost function.

• It yields an embedding space that, after training, serves as a similarity metric for complex sequential data.

• The framework is illustrated using clickstream data in user re-identification, subsequence clustering and user classification settings.

• The model yields significantly improved re-identification rates compared to alternative models, such as sequence alignment methods.

Abstract

The evolving digital economy entails multifaceted behavioral tracking data such as internet clickstreams, location trajectories, or taste preferences revealed by music or video streaming. Organizations are increasingly interested in using such data streams to profile customers based on their behavioral similarities for targeting purposes. However, measuring similarities in sequential data is a challenging task. We present a generic deep neural-network-based framework for quantifying the similarity of ordered sequences in observed event histories. This novel approach combines a specific type of recurrent neural nets with a triplet loss cost function used for network training. It yields an embedding space that serves as a similarity metric for complex sequential data, can handle multivariate sequential data, and can incorporate covariates. We empirically validate the derived similarity metric for user embeddings in the domain of re-identifying users in web browsing histories. We demonstrate its superior performance in discriminating users based on their behavioral browsing patterns by benchmarking against more conventional approaches to measuring sequence similarity. In addition, we show that the methodology can be used for clustering sub-sequences and re-classifying users based on their observed clickstream behavior. Finally, we critically reflect on the benefits and possible downsides of the proposed framework and discuss extensions and promising future applications. An open-source reference implementation can be obtained from github.com/vamosi/tl_rnn.
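
To make the role of the triplet loss concrete, the following is a minimal sketch of its standard formulation applied to embedding vectors that are already computed. It is not taken from the reference implementation: the Euclidean distance, the margin of 1.0, the toy two-dimensional embeddings, and the helper names `euclidean`, `triplet_loss`, and `reidentify` are all assumptions made for illustration. In the framework described above, the embeddings would come from the trained recurrent network rather than being written by hand.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the anchor is closer to the positive
    (same user) than to the negative (different user) by at least
    the margin. The margin value is an assumed hyperparameter.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def reidentify(query, gallery):
    """Return the gallery user whose embedding is nearest to the query."""
    return min(gallery, key=lambda user: euclidean(query, gallery[user]))

# Toy embeddings (hypothetical values, not real model output).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

# Well-separated triplet: 1 - 5 + 1 is negative, so the loss is clipped to 0.
loss_easy = triplet_loss(anchor, positive, negative)

# Swapped roles: 5 - 1 + 1 = 5, so the triplet is penalized.
loss_hard = triplet_loss(anchor, negative, positive)

# Re-identification as a nearest-neighbor lookup in the embedding space.
gallery = {"user_a": [0.0, 1.0], "user_b": [3.0, 4.0]}
match = reidentify([0.2, 0.9], gallery)
```

Minimizing this loss over many triplets pulls sequences of the same user together and pushes sequences of different users apart, which is what lets distances in the embedding space act as a similarity metric for re-identification and clustering.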

Keywords: Sequence similarity, Embeddings, Deep learning, User identification, Similarity matching, Sequence clustering

Review history: Received 23 January 2021, Revised 4 December 2021, Accepted 27 December 2021, Available online 10 January 2022, Version of Record 21 February 2022.

Paper URL: https://doi.org/10.1016/j.dss.2021.113718