Improving sparse transformer models for efficient self-attention
Date: , Conference: spaCyIRL, Berlin, Germany
One disadvantage of using attention layers in a neural network architecture is that the memory and time complexity of the operation is quadratic in the length of the input. This talk addresses the following question: “Can we design attention layers with lower complexity that are still able to discover all dependencies in the input?” The answer seems to be yes, by using Information Flow Graphs to model the problem of introducing sparsity into the attention layer.
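
To make the idea concrete, here is a minimal sketch of how a stack of sparse attention steps could be checked for the ability to connect every input position to every output position. It is an illustration under stated assumptions, not the method presented in the talk: it reads "discover all dependencies" as reachability through the stacked attention masks, and the function names and the two example patterns (`local_mask`, `strided_mask`) are hypothetical choices made for this example.

```python
import numpy as np


def has_full_information_flow(masks):
    """Return True if every output position can depend on every input position.

    masks: list of (n, n) boolean arrays, one per attention step, where
    masks[k][i, j] is True when position i attends to position j at step k.
    (Assumption: dependencies are modeled as reachability through the masks.)
    """
    n = masks[0].shape[0]
    # Before any attention step, each position depends only on itself.
    reach = np.eye(n, dtype=bool)
    for mask in masks:
        # After this step, position i depends on everything that the
        # positions it attends to already depended on.
        reach = (mask.astype(int) @ reach.astype(int)) > 0
    return bool(reach.all())


def local_mask(n, window):
    """Hypothetical pattern: attend to positions at most `window` away."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window


def strided_mask(n, stride):
    """Hypothetical pattern: attend to positions a multiple of `stride` away."""
    idx = np.arange(n)
    return (idx[:, None] - idx[None, :]) % stride == 0


n = 16
local = local_mask(n, window=3)
strided = strided_mask(n, stride=4)

print("dense entries per step:", n * n)
print("local entries:", int(local.sum()), "strided entries:", int(strided.sum()))
print("local + strided is full:", has_full_information_flow([local, strided]))
print("local + local is full:", has_full_information_flow([local, local]))
```

In this toy setting each mask keeps far fewer than the n × n entries of dense attention, and whether the stack still connects every pair of positions depends on how the patterns are combined, which is the kind of question a graph view of the attention layer makes easy to ask.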