A Collection of Modified Attention Mechanisms

GitHub repository:

https://github.com/Separius/awesome-fast-attention

 

Efficient Attention

 

Each entry below gives the paper (with its citation count in parentheses), a reference implementation, and the main idea; per-method complexity and autoregressive support are tabulated in the GitHub repository linked above.

Generating Wikipedia by Summarizing Long Sequences[1] (208). Implementation: memory-compressed-attention[2]. Main idea: compresses key and value + blocked attention.
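
A minimal sketch of the compression idea in this entry, written in PyTorch. The stride of 4, the tensor shapes, and the use of a strided 1D convolution here are illustrative assumptions for the sketch, not values taken from the paper or the linked repo, and the blocked-attention part is omitted:

```python
# Keys and values are compressed along the sequence dimension before ordinary
# dot-product attention, so the score matrix is n x (n / stride) instead of n x n.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, stride = 64, 4
compress_k = nn.Conv1d(dim, dim, kernel_size=stride, stride=stride)  # shrink keys along the sequence
compress_v = nn.Conv1d(dim, dim, kernel_size=stride, stride=stride)  # shrink values along the sequence

def memory_compressed_attention(q, k, v):
    # q, k, v: (batch, seq_len, dim)
    k = compress_k(k.transpose(1, 2)).transpose(1, 2)   # (batch, seq_len // stride, dim)
    v = compress_v(v.transpose(1, 2)).transpose(1, 2)
    scores = q @ k.transpose(1, 2) / dim ** 0.5         # (batch, seq_len, seq_len // stride)
    return F.softmax(scores, dim=-1) @ v                # (batch, seq_len, dim)

x = torch.randn(2, 128, dim)
print(memory_compressed_attention(x, x, x).shape)       # torch.Size([2, 128, 64])
```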

CBAM: Convolutional Block Attention Module[3] (677). Implementation: attention-module[4]. Main idea: combines the SE attention with a per-pixel (local) weight.

CCNet: Criss-Cross Attention for Semantic Segmentation[5] (149). Implementation: CCNet[6]. Main idea: each pixel attends to its row and column simultaneously.

Efficient Attention: Attention with Linear Complexities[7] (2). Implementation: efficient-attention[8]. Main idea: Softmax(Q)*(Softmax(K^T)*V).
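
The Softmax(Q)*(Softmax(K^T)*V) factorization can be written in a few lines. This is only a sketch (PyTorch, single head, no learned projections, illustrative shapes), not the authors' implementation:

```python
# Softmax over the feature axis for Q and over the sequence axis for K lets us compute
# (K^T V), a small d x d matrix, first, so the n x n attention matrix is never materialized.
import torch

def efficient_attention(q, k, v):
    # q, k: (batch, n, d_k), v: (batch, n, d_v)
    q = q.softmax(dim=-1)              # Softmax(Q), normalized over features
    k = k.softmax(dim=1)               # Softmax(K), normalized over sequence positions
    context = k.transpose(1, 2) @ v    # (batch, d_k, d_v), cost linear in n
    return q @ context                 # (batch, n, d_v)

print(efficient_attention(torch.randn(2, 128, 64), torch.randn(2, 128, 64), torch.randn(2, 128, 64)).shape)
```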

Star-Transformer[9] (24). Implementation: fastNLP[10]. Main idea: uses a relay (global) node and attends to/from that node.

Generating Long Sequences with Sparse Transformers[11] (139). Implementation: torch-blocksparse[12]. Main idea: sparse block-based attention.

GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond[13] (96). Implementation: GCNet[14]. Main idea: squeeze and excitation with an attention pooling (instead of a GAP).

SCRAM: Spatially Coherent Randomized Attention Maps[15] (1). Implementation: none listed. Main idea: uses PatchMatch to find close keys.

Interlaced Sparse Self-Attention for Semantic Segmentation[16] (13). Implementation: IN_PAPER. Main idea: combination of short-range and then long-range (dilated) attention.

Permutohedral Attention Module for Efficient Non-Local Neural Networks[17] (2). Implementation: Permutohedral_attention_module[18]. Main idea: uses a permutohedral lattice approximation algorithm to approximate the attention output.

Large Memory Layers with Product Keys[19] (28). Implementation: XLM[20]. Main idea: searches for nearest-neighbor keys.

Expectation-Maximization Attention Networks for Semantic Segmentation[21] (38). Implementation: EMANet[22]. Main idea: applies expectation maximization to cluster keys into k clusters.

Compressive Transformers for Long-Range Sequence Modelling[23] (20). Implementation: compressive-transformer-pytorch[24]. Main idea: compresses distant tokens instead of just stop_grad()-ing them; a more efficient version of Transformer-XL.

BP-Transformer: Modelling Long-Range Context via Binary Partitioning[25] (8). Implementation: BPT[26]. Main idea: attends to distant tokens coarsely and attends to close tokens in a more fine-grained manner.

Axial Attention in Multidimensional Transformers[27] (5). Implementation: axial-attention[28]. Main idea: applies attention on each axis separately.
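
To make "applies attention on each axis separately" concrete, here is a minimal non-causal sketch on a 2D feature map (PyTorch; single head, no learned projections, illustrative shapes), not the reference implementation:

```python
# Plain self-attention is run along the height axis and then along the width axis,
# so each position mixes with its column and its row instead of all H*W positions.
import torch
import torch.nn.functional as F

def attend(x):
    # x: (..., length, dim) -- scaled dot-product self-attention over `length`
    scores = x @ x.transpose(-1, -2) / x.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ x

def axial_attention(x):
    # x: (batch, height, width, dim)
    b, h, w, d = x.shape
    x = attend(x.permute(0, 2, 1, 3).reshape(b * w, h, d))   # attend along the height axis
    x = x.reshape(b, w, h, d).permute(0, 2, 1, 3)            # back to (batch, height, width, dim)
    x = attend(x.reshape(b * h, w, d)).reshape(b, h, w, d)   # attend along the width axis
    return x

print(axial_attention(torch.randn(2, 8, 8, 16)).shape)       # torch.Size([2, 8, 8, 16])
```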

Reformer: The Efficient Transformer[29] (69). Implementation: trax[30]. Main idea: uses LSH to find close keys.

Transformer on a Diet[31] (2). Implementation: transformer-on-diet[32]. Main idea: dilated transformer, like WaveNet.

Sparse Sinkhorn Attention[33] (4). Implementation: sinkhorn-transformer[34]. Main idea: uses a cost matrix to limit attention between buckets.

SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection[35] (1). Implementation: none listed. Main idea: learns the q, k connections, i.e. dynamically creates a sparse attention matrix.

Efficient Content-Based Sparse Attention with Routing Transformers[36] (11). Implementation: routing-transformer[37]. Main idea: computes attention with same-cluster tokens (computed by online k-means).

Longformer: The Long-Document Transformer[38] (15). Implementation: longformer[39]. Main idea: global + blocked attention.

Neural Architecture Search for Lightweight Non-Local Networks[40] (2). Implementation: AutoNL[41]. Main idea: computes Q(KV) and also downsamples q, k, v in both the spatial and channel dimensions.

ETC: Encoding Long and Structured Data in Transformers[42] (2). Implementation: none listed. Main idea: combines global attention (Star-Transformer with multiple global tokens) with local attention.

Multi-scale Transformer Language Models[43] (1). Implementation: IN_PAPER. Main idea: UNet-like + retina attention; something close to BP-Transformer.

Synthesizer: Rethinking Self-Attention in Transformer Models[44] (5). Implementation: none listed. Main idea: does not compute pairwise interactions.

Jukebox: A Generative Model for Music[45] (9). Implementation: jukebox[46]. Main idea: better attention patterns from Sparse Transformer.

GMAT: Global Memory Augmentation for Transformers[47] (0). Implementation: gmat[48]. Main idea: adds global tokens.

Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers[49] (0). Implementation: google-research[50]. Main idea: calculates an unbiased stochastic approximation of the attention matrix.

Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer[51] (0). Implementation: none listed. Main idea: does not compute pairwise interactions and uses fixed mask patterns.

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention[52] (1). Implementation: fast-transformers[53]. Main idea: uses phi(q)(phi(k)v) and also improves the sequential sampling step.
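
A minimal non-causal sketch of the phi(q)(phi(k)v) formulation, using the elu(x) + 1 feature map from that line of work (PyTorch; shapes are illustrative, and the improved sequential sampling step mentioned above is not shown):

```python
# With a positive feature map phi, attention becomes phi(Q) (phi(K)^T V), normalized by
# phi(Q) phi(K)^T 1, so the cost is linear in sequence length. Non-causal case only.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, n, d_k), v: (batch, n, d_v)
    q = F.elu(q) + 1                                    # phi(q) > 0
    k = F.elu(k) + 1                                    # phi(k) > 0
    kv = k.transpose(1, 2) @ v                          # (batch, d_k, d_v)
    z = q @ k.sum(dim=1, keepdim=True).transpose(1, 2)  # (batch, n, 1) normalizer
    return (q @ kv) / (z + eps)                         # (batch, n, d_v)

print(linear_attention(torch.randn(2, 128, 64), torch.randn(2, 128, 64), torch.randn(2, 128, 64)).shape)
```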

Linformer: Self-Attention with Linear Complexity[54] (3). Implementation: linformer-pytorch[55]. Main idea: projects keys and values from n×d down to k×d.
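
A toy sketch of the projection idea (PyTorch; the sizes n=512, d=64, k=128 and the random projection matrices are illustrative stand-ins for the learned ones in the real model):

```python
# Keys and values are projected along the sequence dimension from n positions down to k,
# so the attention score matrix is n x k rather than n x n.
import torch

n, d, k_proj = 512, 64, 128
q = torch.randn(1, n, d)
key = torch.randn(1, n, d)
value = torch.randn(1, n, d)
E = torch.randn(k_proj, n) / n ** 0.5   # sequence-length projection for keys (learned in the real model)
F = torch.randn(k_proj, n) / n ** 0.5   # sequence-length projection for values (learned in the real model)

k_low = E @ key                          # (1, k_proj, d)
v_low = F @ value                        # (1, k_proj, d)
attn = torch.softmax(q @ k_low.transpose(1, 2) / d ** 0.5, dim=-1)   # (1, n, k_proj)
out = attn @ v_low                       # (1, n, d)
print(out.shape)                         # torch.Size([1, 512, 64])
```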

Real-time Semantic Segmentation with Fast Attention[56] (0). Implementation: none listed. Main idea: l2_norm(q)*(l2_norm(k)*v).
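
The l2_norm(q)*(l2_norm(k)*v) formula follows the same factorization pattern; here is a minimal sketch (PyTorch; the shapes and the 1/n scaling are illustrative choices for this sketch, not taken from the paper):

```python
# q and k are l2-normalized over the feature dimension instead of passed through a softmax,
# and (K^T V) is computed first so the cost stays linear in sequence length.
import torch
import torch.nn.functional as F

def fast_attention(q, k, v):
    # q, k: (batch, n, d_k), v: (batch, n, d_v)
    q = F.normalize(q, dim=-1)           # l2_norm(q)
    k = F.normalize(k, dim=-1)           # l2_norm(k)
    context = k.transpose(1, 2) @ v      # (batch, d_k, d_v)
    return (q @ context) / q.shape[1]    # (batch, n, d_v), with an illustrative 1/n scale

print(fast_attention(torch.randn(2, 128, 64), torch.randn(2, 128, 64), torch.randn(2, 128, 64)).shape)
```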

Fast Transformers with Clustered Attention[57] (0). Implementation: fast-transformers[58]. Main idea: groups queries together with LSH.

Big Bird: Transformers for Longer Sequences[59] (0). Implementation: none listed. Main idea: ETC with random connections.

 

 

Finally, a note on renting GPUs for experiments: we rented GPUs from 智星云 and the experience was very good. For details, see the 智星云 website: http://www.ai-galaxy.cn/, the Taobao store: https://shop36573300.taobao.com/, or the WeChat official account 智星AI.


References

[1] Generating Wikipedia by Summarizing Long Sequences: https://arxiv.org/abs/1801.10198v1

[2] memory-compressed-attention: https://github.com/lucidrains/memory-compressed-attention

[3] CBAM: Convolutional Block Attention Module: https://arxiv.org/abs/1807.06521v2

[4] attention-module: https://github.com/Jongchan/attention-module 

[5] CCNet: Criss-Cross Attention for Semantic Segmentation: https://arxiv.org/abs/1811.11721v2

[6] CCNet: https://github.com/speedinghzl/CCNet

[7] Efficient Attention: Attention with Linear Complexities: https://arxiv.org/abs/1812.01243v8

[8] efficient-attention: https://github.com/cmsflash/efficient-attention

[9] Star-Transformer: https://arxiv.org/abs/1902.09113v2

[10] fastNLP: https://github.com/fastnlp/fastNLP/blob/master/fastNLP/modules/encoder/star_transformer.py

[11] Generating Long Sequences with Sparse Transformers: https://arxiv.org/abs/1904.10509v1

[12] torch-blocksparse: https://github.com/ptillet/torch-blocksparse

[13] GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond: https://arxiv.org/abs/1904.11492v1

[14] GCNet: https://github.com/xvjiarui/GCNet

[15] SCRAM: Spatially Coherent Randomized Attention Maps: https://arxiv.org/abs/1905.10308v1

[16] Interlaced Sparse Self-Attention for Semantic Segmentation: https://arxiv.org/abs/1907.12273v2

[17] Permutohedral Attention Module for Efficient Non-Local Neural Networks: https://arxiv.org/abs/1907.00641v2 

[18] Permutohedral_attention_module: https://github.com/SamuelJoutard/Permutohedral_attention_module 

[19] Large Memory Layers with Product Keys: https://arxiv.org/abs/1907.05242v2 

[20] XLM: https://github.com/facebookresearch/XLM 

[21] Expectation-Maximization Attention Networks for Semantic Segmentation: https://arxiv.org/abs/1907.13426v2 

[22] EMANet: https://github.com/XiaLiPKU/EMANet 

[23] Compressive Transformers for Long-Range Sequence Modelling: https://arxiv.org/abs/1911.05507v1 

[24] compressive-transformer-pytorch: https://github.com/lucidrains/compressive-transformer-pytorch

[25] BP-Transformer: Modelling Long-Range Context via Binary Partitioning: https://arxiv.org/abs/1911.04070v1

[26] BPT: https://github.com/yzh119/BPT

[27] Axial Attention in Multidimensional Transformers: https://arxiv.org/abs/1912.12180v1

[28] axial-attention: https://github.com/lucidrains/axial-attention

[29] Reformer: The Efficient Transformer: https://arxiv.org/abs/2001.04451v2

[30] trax: https://github.com/google/trax/tree/master/trax/models/reformer 

[31] Transformer on a Diet: https://arxiv.org/abs/2002.06170v1

[32] transformer-on-diet: https://github.com/cgraywang/transformer-on-diet

[33] Sparse Sinkhorn Attention: https://arxiv.org/abs/2002.11296v1

[34] sinkhorn-transformer: https://github.com/lucidrains/sinkhorn-transformer

[35] SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection: https://arxiv.org/abs/2003.09833v2 

[36] Efficient Content-Based Sparse Attention with Routing Transformers: https://arxiv.org/abs/2003.05997v1

[37] routing-transformer: https://github.com/lucidrains/routing-transformer 

[38] Longformer: The Long-Document Transformer: https://arxiv.org/abs/2004.05150v1

[39] longformer: https://github.com/allenai/longformer

[40] Neural Architecture Search for Lightweight Non-Local Networks: https://arxiv.org/abs/2004.01961v1 

[41] AutoNL: https://github.com/LiYingwei/AutoNL

[42] ETC: Encoding Long and Structured Data in Transformers: https://arxiv.org/abs/2004.08483v2 

[43] Multi-scale Transformer Language Models: https://arxiv.org/abs/2005.00581v1

[44] Synthesizer: Rethinking Self-Attention in Transformer Models: https://arxiv.org/abs/2005.00743v1

[45] Jukebox: A Generative Model for Music: https://arxiv.org/abs/2005.00341v1 

[46] jukebox: https://github.com/openai/jukebox

[47] GMAT: Global Memory Augmentation for Transformers: https://arxiv.org/abs/2006.03274v1 

[48] gmat: https://github.com/ag1988/gmat 

[49] Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers: https://arxiv.org/abs/2006.03555v1

[50] google-research: https://github.com/google-research/google-research/tree/master/performer/fast_self_attention

[51] Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer: https://arxiv.org/abs/2006.05174v1 

[52] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention: https://arxiv.org/abs/2006.16236v2

[53] fast-transformers: https://github.com/idiap/fast-transformers

[54] Linformer: Self-Attention with Linear Complexity: https://arxiv.org/abs/2006.04768v3 

[55] linformer-pytorch: https://github.com/tatp22/linformer-pytorch

[56] Real-time Semantic Segmentation with Fast Attention: https://arxiv.org/abs/2007.03815v2

[57] Fast Transformers with Clustered Attention: https://arxiv.org/abs/2007.04825v1

[58] fast-transformers: https://github.com/idiap/fast-transformers

[59] Big Bird: Transformers for Longer Sequences: https://arxiv.org/abs/2007.14062v1 

[60] A Survey of Long-Term Context in Transformers: https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/