Introduction
A common occurrence when I’m using GPT-4 for tasks like code generation is that it’s often interrupted in the middle of a task by its token length limitations. I can ask it to continue, and this often works, but it raises the question of why this happens in the first place. The problem of context window length turns out to lie at the core of the transformer architecture itself. Recently, there have been great advances in academic research that seek to solve this inherent limitation, and commercial applications (probably!) use these to great success, with recent releases of GPT-4-32k and Claude-100k (the numeric designations referring to the context window sizes, in tokens, of these model variants). There are more approaches in the works, and expanding the context windows of models will unlock exciting new applications for consumers and the enterprise.
The context window constraint is fundamental to the transformer architecture. This exciting innovation allowed practitioners to scale model parameter counts easily, but its self-attention mechanism carries a cost that grows quadratically with context window size. Self-attention allows each token in the input to ‘pay attention’ to every other token: it computes a score against every other token in the sequence to generate its ‘attention’ vector. The key benefit of the transformer architecture is that the attention vector of a given token can be computed independently of the others, so the operation parallelizes easily on GPUs. However, the number of these computations scales quadratically with input length, since every token must compute a score against every other token. This is a considerable barrier to scaling models with this architecture.
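To see where the quadratic cost comes from, here’s a minimal numpy sketch of vanilla scaled dot-product attention (single head, no masking; the variable names are mine for illustration, not any production implementation). The (n, n) score matrix is the culprit: doubling the sequence length quadruples the compute and memory it requires.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard scaled dot-product attention.

    Q, K, V: (n, d) arrays. The scores matrix is (n, n):
    this is the quadratic cost in sequence length n.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d)

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = naive_attention(Q, K, V)
```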
Academic Advances
Improving the quadratic nature of the attention mechanism has been a strong focus of recent research, and the latest literature takes several broad approaches. The blog of Stanford professor Chris Ré’s Hazy Research lab contains excellent technical primers on the subject (the lab wrote some of the papers below as well), and I urge readers to check it out for great technical coverage of the topic. I also recommend OpenAI researcher Lilian Weng’s blog, Lil’Log, for deeper technical explanations. For a quick overview of the main thrusts of development, I’ve summarized the highlights of recent research directions below.
Sparse Attention Approaches: Sparse attention methods aim to reduce the complexity of attention by allowing each token to attend only to a subset of other tokens, instead of all tokens.
Longformer: This approach modifies the standard transformer self-attention mechanism by restricting most attention to local windows, allowing it to process longer documents than was previously possible. The model uses a sliding window approach where each token attends to a fixed number of tokens to its left and right, supplemented by global attention on a handful of task-specific tokens, giving it a sparse attention pattern.
BigBird: Google's BigBird, like Longformer, applies a sparse attention pattern. It includes a combination of random, local, and global attention that allows for both structured understanding of local context and capture of long-term dependencies.
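To make these sparse patterns concrete, here’s a hedged sketch of the kind of boolean mask a Longformer/BigBird-style layer implies: a sliding local window plus a few global tokens. The parameters are illustrative, and real implementations use custom kernels rather than materializing the full n × n mask.

```python
import numpy as np

def sparse_attention_mask(n, window=2, n_global=1):
    """Boolean (n, n) mask: True where attention is allowed.

    Combines a Longformer-style sliding window with a few
    BigBird-style global tokens. Illustrative only.
    """
    mask = np.zeros((n, n), dtype=bool)
    idx = np.arange(n)
    # Sliding window: each token sees `window` neighbors on each side.
    mask |= np.abs(idx[:, None] - idx[None, :]) <= window
    # Global tokens attend everywhere and are attended to by all tokens.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

m = sparse_attention_mask(8, window=1, n_global=1)
print(m.sum(), "allowed pairs out of", m.size)
```

Each token now attends to roughly window + global positions instead of all n, which is where the savings come from as sequences grow.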
Engineering Optimizations: The method below is complementary to the sparse attention approaches above. It takes advantage of the fact that, on modern GPUs, a common bottleneck in the computation of attention is reading and writing to GPU memory, and it exploits this reality with engineering optimizations that speed up the computation of attention.
Flash Attention: Unlike the approaches above, FlashAttention computes exact attention rather than an approximation; its contribution is to rearrange the attention computation to minimize reads and writes to slow GPU memory, effectively shifting the memory usage from quadratic to linear in sequence length. This reordering, combined with classical tiling techniques and recomputation, allows FlashAttention to significantly improve the efficiency of the attention calculation (the paper also includes a block-sparse variant that composes with sparse attention ideas). This innovation is widely used in the industry today, and represents a key enabling technical advancement.
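The heart of the trick is an ‘online softmax’: a softmax over a row of scores can be accumulated tile by tile using only running statistics, so the full n × n score matrix never has to exist at once. Below is a toy numpy sketch of that accumulation; it’s a simplification of my own (the real kernel also tiles the queries and keeps the working set in fast GPU SRAM), but the output matches exact attention.

```python
import numpy as np

def tiled_attention(Q, K, V, block=128):
    """Exact attention computed tile by tile over K/V.

    Keeps only running softmax statistics (row max m, normalizer l)
    and a running output, so score memory is (n, block), not (n, n).
    """
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)            # running row-wise max
    l = np.zeros(n)                    # running softmax normalizer
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j+block], V[j:j+block]
        S = Q @ Kj.T / np.sqrt(d)              # (n, block) score tile
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)              # rescale old statistics
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vj
        m = m_new
    return out / l[:, None]

n, d = 512, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = tiled_attention(Q, K, V)  # equals the exact attention output
```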
Approximation of Self-Attention: This approach aims to reduce the complexity of self-attention using approximation techniques in order to limit the number of computations performed.
Linformer: Introduced by Facebook AI, the Linformer model reduces the cost of self-attention from quadratic to linear by projecting the sequence-length dimension of the keys and values down to a smaller fixed dimension, effectively approximating the attention mechanism without a significant loss in performance. The key observation is that the attention matrix is approximately low rank: language has inherent properties like structure and consistent context within sentences that produce patterns and redundancies in the attention relationships, so the matrix can be faithfully represented by a much smaller one without losing much information.
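Here’s a minimal sketch of the idea, with random matrices standing in for Linformer’s learned projections (all names and dimensions are illustrative): keys and values are compressed along the sequence dimension from n down to a fixed k, so the score matrix is (n, k) rather than (n, n).

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Linformer-style attention with projected keys/values.

    E, F: (k, n) projection matrices (learned in the paper,
    random here). Cost is linear in n when k is a fixed constant.
    """
    d = Q.shape[-1]
    K_proj = E @ K                       # (k, d) compressed keys
    V_proj = F @ V                       # (k, d) compressed values
    scores = Q @ K_proj.T / np.sqrt(d)   # (n, k): linear in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_proj              # (n, d)

n, d, k = 4096, 64, 256
Q, K, V = (np.random.randn(n, d) for _ in range(3))
E = np.random.randn(k, n) / np.sqrt(n)
F = np.random.randn(k, n) / np.sqrt(n)
out = linformer_attention(Q, K, V, E, F)
```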
Structured State Space Models: Anyone who’s diagrammed sentences knows that language can be represented in structures other than the traditional flat sequence. The models below each replace or augment attention with an operator whose structure (a recurrence, a convolution, or a moving average) can be computed subquadratically while still capturing relationships across the sequence.
H3: The Hungry Hungry Hippos (H3) model replaces attention with state space models (SSMs), which process the sequence as a linear recurrence that can equivalently be computed as a long convolution via the FFT. By stacking SSMs with multiplicative interactions between their outputs, H3 can compare tokens and recall earlier ones, recovering much of attention’s expressiveness at subquadratic cost.
Meta Mega: Meta’s MEGA (Moving Average Equipped Gated Attention) augments single-head gated attention with an exponential moving average that bakes position-aware local dependencies directly into the token representations. Because the moving average carries local context across chunk boundaries, the sequence can be split into fixed-size chunks and attention computed within each chunk, reducing complexity to linear in sequence length.
Hyena: This approach works by first computing a set of linear projections of the input sequence. These projections are then combined using long convolutions and element-wise multiplication. The long convolutions allow Hyena to learn long-range dependencies in the input sequence, while the element-wise multiplication allows Hyena to select specific parts of the input sequence. The combination of the two allows the architecture to combine long range context with short-range structure and content, while reducing computational complexity to a subquadratic level.
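H3 and Hyena both lean on long convolutions, filters as long as the sequence itself, which would cost O(n²) computed directly but O(n log n) through the FFT. Here’s a single-channel numpy sketch of that trick; the decaying toy filter is my stand-in, since Hyena actually generates its filters implicitly with a small network.

```python
import numpy as np

def long_conv(u, kernel):
    """Causal long convolution via FFT.

    u: (n,) input signal; kernel: (n,) filter spanning the whole
    sequence. An FFT of length 2n makes circular convolution equal
    to linear convolution; we keep the first n (causal) outputs.
    """
    n = len(u)
    U = np.fft.rfft(u, 2 * n)
    H = np.fft.rfft(kernel, 2 * n)
    return np.fft.irfft(U * H, 2 * n)[:n]

n = 8192
u = np.random.randn(n)
kernel = np.exp(-np.arange(n) / 512.0)  # toy decaying filter
y = long_conv(u, kernel)                # O(n log n), not O(n^2)
```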
Conditional Computation on the Most Important Tokens: This approach involves dynamically deciding which parts of the model to execute for each token based on its importance.
CoLT5: This approach was pioneered by Google Research’s CoLT5 paper. The model first computes a learned importance score for each token; only the highest-scoring tokens are then routed through the heavier attention and feed-forward branches, greatly reducing computational complexity.
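Here’s a hedged sketch of the routing step; the dot-product scorer and all names are illustrative placeholders, not CoLT5’s actual parameterization. The point is simply that only the top-k tokens take the expensive path.

```python
import numpy as np

def route_heavy_tokens(X, w_importance, k):
    """Rank tokens by a learned importance score and split them.

    X: (n, d) token representations. The top-k tokens go through
    the heavy branch (e.g. full attention); the rest take a cheap path.
    """
    scores = X @ w_importance                  # (n,) importance scores
    heavy_idx = np.argsort(scores)[-k:]        # top-k token indices
    light_idx = np.setdiff1d(np.arange(len(X)), heavy_idx)
    return heavy_idx, light_idx

n, d, k = 1024, 64, 128
X = np.random.randn(n, d)
w = np.random.randn(d)
heavy, light = route_heavy_tokens(X, w, k)
# Heavy-branch attention over k tokens costs O(k^2) instead of O(n^2).
```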
So many advancements in such a short period of time reveal the strong interest academia has in pushing this area of research forward. There are likely more advancements to come, and it’s unclear which of these will have the biggest impact, but let’s assume that the steady drumbeat of progress in this field continues. What new applications become feasible when context windows scale from ~100k tokens to millions, if not greater?
Industry Applications
Spoiler Alert: The implications of scaling context windows are immense.
Some readers of my previous blog post on multimodal models may have rightly reacted with skepticism to some of the applications suggested as possible. Data is everywhere, but it can be incredibly large in size. A picture might be worth a thousand words, but an fMRI scan might be worth millions. Data-rich modalities depend on long context windows, to say nothing of combining several of them in a single multimodal modeling task. The expansion of context windows to millions of tokens and beyond will hopefully make more of the applications discussed in that post feasible.
Below are some other applications that could soon be feasible:
Enterprise Applications
Genetics
Genomic Data Analysis: Language models with long context windows could transform the field of genetic analysis. The haploid human genome consists of about 3 billion base pairs, well beyond the reach of current LLMs. With models capable of handling larger context windows, we could feasibly interpret the entire human genome as a 'text' for analysis. This could give us a much more comprehensive understanding of genetic relationships, and might enable novel insights in areas like genetic disease risk, trait inheritance, and evolutionary biology.
DNA Sequence Generation: If we can train LLMs on large-scale genetic data, they might learn to generate novel DNA sequences that serve specific functions, which would be a revolutionary advance for synthetic biology. By treating DNA as a 'language', we could potentially use LLMs to design new organisms or modify existing ones in highly specific ways. By pairing DNA training data with textual descriptions of phenotypes, one could imagine a world where scientists query models for genome modifications that shift a specific trait.
Medicine
High-resolution Medical Imaging Analysis: With the ability to handle larger context windows, LLMs could analyze high-resolution medical images, such as MRI and CT scans, in their entirety, leading to more accurate diagnoses and treatment plans. Companies already offer full-body MRIs on demand as an extension of routine medical checkups, and this technology could help drive down their costs by providing great leverage to the radiologists who analyze the resulting scans. The longer context could allow models to learn complex patterns in human scans that current approaches cannot discern.
Software Engineering
Deep Codebase Understanding: Longer context windows will revolutionize software engineering. Whereas current AI-in-IDE applications act as autocomplete for a single function, long context windows will enable models to understand entire codebases, drastically improving tools for developers. This full-codebase understanding will benefit developers new to a project, as well as those managing large projects with numerous contributors, since the model will be able to analyze how different parts of the codebase interact, enhancing efficiency and reducing bugs. Code reviews will be transformed: LLMs will offer detailed feedback that considers the overall impact of changes on the codebase, aiding documentation, improving code quality, and streamlining the review process. Intelligent autocomplete will move beyond local context to provide suggestions based on a wider understanding of the codebase, preemptively detecting bugs and significantly reducing debugging time.
Consumer Applications
Entertainment
Interactive Storytelling: Back in February, I played a game of Dungeons and Dragons with GPT-3 as a dungeon master. It was pretty novel, but there were some roadblocks. Many of these stemmed from the fact that the model required constant human supervision to generate content past the limits of the context window. With the ability to maintain much larger context windows, LLMs could deliver significantly richer interactive storytelling experiences in situations like this. Imagine games that don’t just respond to your most recent inputs, but tailor their entire narrative based on the full history of your choices throughout the game.
Education
Personalized Learning: LLMs that can maintain larger context windows could offer more effective personalized learning experiences. Such a model could track a student's learning progress across a wide range of topics and adapt its teaching strategy in real-time to match the student's changing needs.
Considerations
There’s a unifying thread between many of these potential applications of LLMs with larger context windows. If these models can take in and generate extremely large pieces of text, it might change the need for complex data storage and retrieval systems. From conversations with various practitioners and from my own work, the data engineering around LLM deployment is a continuation of the headache that has plagued commercial ML development over the last decade, and the real-time nature of most LLM applications makes the data problem even harder. With this advancement, instead of storing many small chunks of information and needing to search and retrieve them, applications could simply place the relevant information in the context window on the fly. However, this shift would likely only be feasible for data that these models have been effectively trained on; there would potentially still be a need to store and retrieve other types of data. In any case, these models will impact the way we think about how LLMs use data.
The push towards expanding context windows is not just a race for bigger and better, but rather a concerted effort to meaningfully revolutionize how we interact with data and how we generate insights from it. From synthetic biology to immersive gaming, personalized education to codebase understanding, the expansion of context windows has the potential to make these applications not only feasible but incredibly transformative. There are certainly exciting things to come.