Using Grammatical Inference to Build Privacy Preserving Data-sets of User Logs

Victor Connes, Colin De La Higuera and Hoel Le Capitaine

Link to the paper: here

Abstract: In many web applications, user logs are extracted to build a user model that can feed further development, recommendation systems or personalization. This is the case for education platforms like X5gon. To foster community collaboration, these logs should be shared, but this naturally raises privacy issues. In this work, we propose to build a user model from a data-set of logs: a timed and probabilistic k-testable automaton, which can then be used to generate a new data-set with statistically close characteristics, yet in which the original sequences have been chunked sufficiently that the original logs cannot be identified. Following ideas from differential privacy, we provide a second algorithm that eliminates any strings whose influence would be too great. Experiments validate the approach.
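
To make the idea of learning a probabilistic k-testable model and sampling from it more concrete, here is a minimal Python sketch. It is not the authors' algorithm: it ignores the timing information, does not include the second (influence-limiting) algorithm, and simply treats the model as counts of which symbol follows each window of k-1 symbols. The function names, the `<s>`/`</s>` markers and the toy logs are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

# Hypothetical boundary markers, chosen for this sketch only.
START, END = "<s>", "</s>"

def learn_k_testable(sequences, k):
    """Count, for each context of k-1 symbols, which symbol follows it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = [START] * (k - 1) + list(seq) + [END]
        for i in range(k - 1, len(padded)):
            context = tuple(padded[i - k + 1:i])
            counts[context][padded[i]] += 1
    return counts

def generate(counts, k, rng, max_len=50):
    """Sample one synthetic sequence by walking the learned contexts."""
    context = (START,) * (k - 1)
    out = []
    while len(out) < max_len:
        followers = counts[context]
        symbols = list(followers)
        weights = [followers[s] for s in symbols]
        symbol = rng.choices(symbols, weights=weights)[0]
        if symbol == END:
            break
        out.append(symbol)
        # Slide the window: drop the oldest symbol, append the new one.
        context = (context + (symbol,))[1:]
    return out

# Toy logs, invented for this example.
logs = [
    ["login", "view", "quiz", "logout"],
    ["login", "view", "view", "logout"],
    ["login", "quiz", "logout"],
]
model = learn_k_testable(logs, k=2)
sample = generate(model, k=2, rng=random.Random(0))
```

With k=2, every generated sequence is stitched together from pairs of consecutive actions seen in the logs, so it can recombine chunks of different users' sessions. This recombination alone is not a privacy guarantee, which is why the abstract mentions a second algorithm that removes overly influential strings.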

Questions and comments can be posted in the comment section at the bottom of this page.