Clara Lacroce, Prakash Panangaden and Guillaume Rabusseau

Video:

Link to the paper: here

Abstract: In this paper we study the approximate minimization problem for language modelling. We assume we are given some language model as a black box. The objective is to obtain a weighted finite automaton (WFA) that fits within a given size constraint and which mimics the behaviour of the original model while minimizing some notion of distance between the black box and the extracted WFA. We provide an algorithm for the approximate minimization of black boxes trained for language modelling of sequential data over a one-letter alphabet. By reformulating the problem in terms of Hankel matrices, we leverage classical results on the approximation of Hankel operators, namely the celebrated Adamyan-Arov-Krein (AAK) theory. This allows us to use the spectral norm to measure the distance between the black box and the WFA. We provide theoretical guarantees to study the potentially infinite-rank Hankel matrix of the black box, without accessing the training data, and we prove that our method returns an asymptotically-optimal approximation.
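To make the objects in the abstract concrete, here is a minimal Python sketch (not the paper's algorithm) that builds a finite truncation of the Hankel matrix of a black box over a one-letter alphabet and measures the spectral-norm distance to a rank-k truncated SVD. The function `black_box_prob` is a hypothetical stand-in for the black-box model. Note that the plain truncated SVD is only the unconstrained low-rank optimum and is generally not a Hankel matrix; the AAK-based construction in the paper instead yields an optimal approximation that is itself Hankel, and hence corresponds to a WFA.

```python
# Sketch only: finite Hankel truncation of a black box over a one-letter
# alphabet, and spectral-norm error of a rank-k truncated SVD (a baseline,
# NOT the AAK-based method of the paper).
import numpy as np

def black_box_prob(n: int) -> float:
    """Hypothetical black box: probability assigned to the string a^n."""
    return 0.5 ** (n + 1)  # placeholder geometric model

def hankel_truncation(f, size: int) -> np.ndarray:
    """Finite Hankel block H[i, j] = f(a^{i+j}) for a one-letter alphabet."""
    return np.array([[f(i + j) for j in range(size)] for i in range(size)])

def best_rank_k_svd(H: np.ndarray, k: int) -> np.ndarray:
    """Rank-k truncated SVD: optimal in spectral norm, but generally not Hankel."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

H = hankel_truncation(black_box_prob, size=50)
H_k = best_rank_k_svd(H, k=3)
print("spectral-norm error:", np.linalg.norm(H - H_k, ord=2))
```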

Any question or comment can be made using the comment section of the page below.

7 Comments

  1. Very interesting work. Could you please give us a toy example of a WFA extracted from a black box?

  2. Thank you! Unfortunately, we do not have any experiments yet, as we focused on the theoretical component of the result. We plan to include toy examples and experimental results in future work, likely after extending the algorithm to multi-letter alphabets. Experiments in that setting would be much more interesting and easier to interpret, as there are many more datasets that could be used.

  3. What do you think the biggest challenge is in extending the algorithm to multi-letter alphabets?

    1. There are two important challenges in extending the work. The main one will be to adapt results from harmonic analysis to the case of non-abelian structures. While some work has been done in that direction (Popescu 2003), it is still unclear how to transfer these results to the setting of multi-letter alphabets. A second major obstacle is that Popescu's generalization leads to a proof of AAK theory that is not constructive, making it difficult to derive an algorithm for optimal approximation from it.

  4. Thank you for the interesting talk! I am a beginner at spectral learning of WFAs and have a question about the problem setup. Some papers on spectral learning of WFAs seem to recover WFAs from Hankel matrices without considering whether the truncated SVD results have Hankel properties (e.g., Balle 2012). What is the difference between such results and this work? Does optimal spectral-norm approximate minimization require the Hankel properties and AAK theory?

    1. Thank you very much for your question! The idea is that the matrix associated with a WFA needs to have the Hankel property. When you truncate the SVD, the result might not be Hankel, so you need to find a way to preserve the property when approximating. This can be done in several ways. An alternative approach can be found in Balle 2019 (Singular Value Automaton and Approximate Minimization), where the authors truncate a canonical form of the WFA instead of the corresponding Hankel matrix. The tools we use, in particular AAK theory, are needed to guarantee that this property is preserved and that the matrix obtained is still Hankel; in fact, the AAK theorem “returns” a Hankel matrix. On the other hand, when you apply spectral methods to recover a WFA from a Hankel matrix, you start from a Hankel matrix and use the truncated SVD, together with its relation to rank factorizations, to compute the parameters of the WFA (a minimal sketch of this step is given after the comment thread). Note that in the case of spectral methods the truncated SVD is not itself the full Hankel matrix of the extracted WFA. I hope that I understood your question well and answered clearly; please don’t hesitate to ask again if I didn’t.
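For readers new to this step, here is a minimal sketch (assuming a one-letter alphabet and finite Hankel blocks) of the spectral-learning procedure mentioned in the answer above: the truncated SVD of the Hankel matrix provides a rank factorization from which the WFA parameters are read off. The function names are illustrative and not taken from the paper or from Balle 2012.

```python
# Sketch of spectral learning over a one-letter alphabet:
#   H[i, j]       = f(a^{i+j})       (Hankel block)
#   H_shift[i, j] = f(a^{i+j+1})     (Hankel block shifted by the letter a)
# A rank-k truncated SVD gives H ≈ P @ S, from which the WFA (alpha, A, beta)
# is recovered via pseudo-inverses.
import numpy as np

def spectral_wfa_from_hankel(H: np.ndarray, H_shift: np.ndarray, k: int):
    """Return WFA parameters (alpha, A, beta) of rank k."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    P = U[:, :k] * s[:k]          # forward factor of the rank-k factorization
    S = Vt[:k, :]                 # backward factor
    A = np.linalg.pinv(P) @ H_shift @ np.linalg.pinv(S)  # transition for letter a
    alpha = H[0, :] @ np.linalg.pinv(S)                   # initial weights (empty-prefix row)
    beta = np.linalg.pinv(P) @ H[:, 0]                    # final weights (empty-suffix column)
    return alpha, A, beta

def wfa_eval(alpha, A, beta, n: int) -> float:
    """Value the extracted WFA assigns to the string a^n."""
    return float(alpha @ np.linalg.matrix_power(A, n) @ beta)
```

The extracted WFA assigns the value alpha·Aⁿ·beta to the string aⁿ; its own (infinite) Hankel matrix is generally not equal to the rank-k truncated SVD used to build it, which is why the Hankel-preservation issue discussed in the answer matters for optimal spectral-norm approximate minimization.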

Comments are closed.