DeBERTa
A BERT variant using disentangled attention that separates content and position information.
Overview
DeBERTa (Decoding-enhanced BERT with disentangled attention), developed by Microsoft, improves the transformer architecture by using disentangled attention, where each token's content and relative position are represented as separate vectors. The attention score between two tokens is then computed as a sum of content-to-content, content-to-position, and position-to-content terms.
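The three-term score can be sketched numerically. The following is a minimal single-head illustration with random weights, not the library implementation; the sequence length, hidden size, and maximum relative distance are arbitrary choices, and the 1/sqrt(3d) scaling follows the DeBERTa paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 8   # toy sequence length and hidden size (assumptions)
k = 3               # maximum relative distance before clamping (assumption)

# Content embeddings H and relative-position embeddings P
# (one P row per clamped relative-distance bucket).
H = rng.normal(size=(seq_len, d))
P = rng.normal(size=(2 * k, d))

# Disentangled attention uses separate projections for content and position.
Wq_c, Wk_c = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wq_r, Wk_r = rng.normal(size=(d, d)), rng.normal(size=(d, d))

Qc, Kc = H @ Wq_c, H @ Wk_c   # content queries/keys
Qr, Kr = P @ Wq_r, P @ Wk_r   # relative-position queries/keys

def delta(i, j):
    """Clamped relative-distance bucket in [0, 2k)."""
    return int(np.clip(i - j, -k, k - 1) + k)

# Score A[i, j] = content-to-content + content-to-position + position-to-content,
# scaled by sqrt(3d) because three terms are summed.
A = np.empty((seq_len, seq_len))
for i in range(seq_len):
    for j in range(seq_len):
        c2c = Qc[i] @ Kc[j]
        c2p = Qc[i] @ Kr[delta(i, j)]
        p2c = Kc[j] @ Qr[delta(j, i)]
        A[i, j] = (c2c + c2p + p2c) / np.sqrt(3 * d)

# Row-wise softmax turns scores into attention weights.
attn = np.exp(A - A.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
```

Note that the position terms depend only on the clamped relative distance, which is what lets DeBERTa defer absolute position information to a later stage of the network.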
Key Details
DeBERTa also introduces an enhanced mask decoder that injects absolute position information just before the masked-token prediction layer. The original DeBERTa, scaled to 1.5 billion parameters, was the first model to surpass the human baseline on the SuperGLUE benchmark. DeBERTa V3, which replaces masked language modeling with ELECTRA-style replaced token detection pretraining, remains one of the strongest encoder models for natural language understanding tasks.