Introduction
In our data-rich world, making sense of the torrent of information around us is difficult. We’re awash in signal that goes overlooked, not necessarily because it’s irrelevant, but because it has been cumbersome to process and act on such complex, multimodal data at scale. Prior machine learning approaches tended to stitch together various smaller models, each performing a specific task. Models trained to handle larger tasks using multimodal inputs can instead learn from the raw combinations of data present in the world, hopefully reducing development complexity and increasing overall effectiveness.
Recent strides in academia offer promising directions for this puzzle. The papers discussed below present innovative strategies for creating a unified representation space capable of interpreting and integrating diverse data types. Several go further, proposing strategies for acting on this multimodal input in complex ways, with models learning to choose the proper actuators (digital and physical) to perform the appropriate actions. Tracing these advances out to their potential end states uncovers novel applications that would have heretofore been difficult to develop.
Interesting Recent Academic Advances
A number of recent academic papers are propelling the field towards models that can begin to continually process and act upon the world around us. Here are a few noteworthy ones that caught my eye:
PaLM-E: An Embodied Multimodal Language Model - This paper proposes a continuous embedding space that can handle multimodal input and output. The authors demonstrate that real-time sensory information can be communicated to a pretrained Large Language Model, which can then process the information and take actions in a continuous fashion. The common representation space to which the inputs are reduced is thus language, and PaLM-E takes advantage of advances in this space to manage its understanding of the world. Interestingly, the paper demonstrates that PaLM-E exhibits positive transfer, wherein learning one task improves performance on another by leveraging rich, cross-modal semantic relationships to capture a deeper understanding of the underlying data.
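To make the core mechanism concrete, here is a minimal PyTorch-style sketch of the pattern PaLM-E describes: continuous sensor features are projected into the language model's token-embedding space and interleaved with ordinary text tokens. The class name, dimensions, and fusion step below are my own illustrative assumptions, not code from the paper.

```python
# Sketch: turning continuous sensor features into "tokens" a pretrained LLM can consume.
import torch
import torch.nn as nn

class SensorToTokenProjector(nn.Module):
    """Maps a continuous sensor embedding (e.g. from a vision encoder) into
    the LLM's token-embedding space so it can be consumed like words."""
    def __init__(self, sensor_dim: int, llm_dim: int, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(sensor_dim, llm_dim * num_tokens)

    def forward(self, sensor_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, sensor_dim) -> (batch, num_tokens, llm_dim)
        out = self.proj(sensor_embedding)
        return out.view(sensor_embedding.shape[0], self.num_tokens, -1)

# Hypothetical usage: interleave projected sensor tokens with embedded text tokens
# before feeding the combined sequence to a frozen, pretrained LLM.
batch, sensor_dim, llm_dim = 2, 512, 4096
projector = SensorToTokenProjector(sensor_dim, llm_dim)
sensor_tokens = projector(torch.randn(batch, sensor_dim))   # "visual words"
text_tokens = torch.randn(batch, 16, llm_dim)               # embedded prompt tokens
llm_input = torch.cat([sensor_tokens, text_tokens], dim=1)  # one multimodal sequence
```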
ImageBind: One Embedding Space To Bind Them All - Whereas the PaLM-E paper folds sensory information into the language space, this paper proposes a method for combining perception from multiple senses into a single shared embedding space. The authors show that this method can be used to learn joint representations of images, text, audio, depth, thermal data, and IMU data. By aligning these modalities’ embeddings into a common space, ImageBind enables retrieval across this complex embedding space, and even allows for cross-modal retrieval, meaning users can use combinations of sensory input to retrieve other relevant entities from the space. Furthermore, and perhaps most striking of all, the authors observe emergent alignment between modalities, such as audio, depth, or text, that aren’t observed together in training, hinting that the embedding model is learning deeper representations than those explicitly present in the training data.
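As a rough illustration of what retrieval over such a joint space looks like, here is a small sketch using cosine similarity between unit-normalized embeddings. The vectors below are random stand-ins; in practice each modality would be encoded into the shared space by its own encoder (this is not ImageBind's actual API).

```python
# Sketch: cross-modal retrieval by nearest neighbors in a shared embedding space.
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
image_embeddings = normalize(rng.normal(size=(1000, 768)))  # catalog of image embeddings
audio_query = normalize(rng.normal(size=(768,)))            # embedding of an audio clip

# Cosine similarity reduces to a dot product once vectors are unit-normalized.
scores = image_embeddings @ audio_query
top_k = np.argsort(-scores)[:5]  # the 5 images most similar to the sound
print(top_k, scores[top_k])
```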
Any-to-Any Generation via Composable Diffusion - The authors of this paper demonstrate Composable Diffusion (CoDi), a method that can take in any combination of input modalities and generate any combination of output modalities in response to queries or interactions. Crucially, because they align modalities in both the input and output space, this approach enables models to generate different combinations of output modalities based on the specific type of complex query initially sent to the model. This general method can be applied to any grouping of modalities, which unlocks the potential for many use cases.
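Here is a purely structural sketch of the "any-to-any" routing idea, assuming per-modality encoders into a shared latent space and decoders out of it. The encoder and decoder placeholders are stand-in functions of my own, meant to show the composition rather than CoDi's actual diffusion models.

```python
# Sketch: route any set of input modalities to any set of output modalities
# through a shared latent space.
from typing import Dict, List

ENCODERS = {
    "text":  lambda x: f"latent({x})",
    "image": lambda x: f"latent({x})",
    "audio": lambda x: f"latent({x})",
}
DECODERS = {
    "text":  lambda z: f"text from {z}",
    "image": lambda z: f"image from {z}",
    "video": lambda z: f"video from {z}",
}

def any_to_any(inputs: Dict[str, str], output_modalities: List[str]) -> Dict[str, str]:
    # Encode each input into the shared space and (naively) fuse by concatenation;
    # a real system would fuse aligned diffusion latents, not strings.
    fused = " + ".join(ENCODERS[m](x) for m, x in inputs.items())
    return {m: DECODERS[m](fused) for m in output_modalities}

print(any_to_any({"text": "a dog barking", "audio": "bark.wav"}, ["video", "text"]))
```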
In essence, these exciting developments showcase the impressive pace of advancement in multimodal interaction patterns, and the direction and speed with which this field is moving (all of these papers were released before June 2023, and there are obviously many more that I did not cover). Excitingly, several papers note positive transfer between combinations of data formats not present in the training data, so these advancements are likely greater than the sum of the data pairs the models were trained on. I contend that continuous models that combine complex sensory input and act dynamically will continue to improve in surprising ways.
Potential Industry Applications
Like previous technological advances, these improvements unlock novel applications as they expand the scope of what is feasible to build. Below are several examples that hopefully illustrate the transformative potential of these advances. These are really just raw ideas; their purpose is to help me understand the implications of these advances and to paint a vivid picture of what these technologies make possible. The intent is not to prescribe solutions for specific industries, which obviously requires far deeper context, but rather to share some cool applications to get creative juices flowing.1
Enterprise Applications:
Medical Data Fusion: The vision of an intelligent, digital doctor may seem far-fetched, but we are inching closer to it. A system capable of analyzing and interpreting the vast array of possible medical data, from lab results to medical scans, could provide healthcare professionals with invaluable assistance in patient management.
Advanced Manufacturing Monitoring: These models could enhance quality control in manufacturing by analyzing audio, video, and sensor data from the production line in real time. This could help to identify subtle anomalies that might indicate a problem with the manufacturing process, allowing for earlier intervention and reducing waste. There are likely many manufacturing processes that were previously too low stakes to monitor and improve, but one can imagine a future where such technology lowers the ‘activation energy’ for analysis, allowing for monitoring and improvements across more physical processes than is possible today.
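As a toy illustration of the monitoring idea, the sketch below keeps running statistics over fused multimodal embeddings from the production line and flags frames that drift too far from the running mean. The fusion step, dimensions, and threshold are illustrative assumptions, not a production design.

```python
# Sketch: streaming anomaly flagging over fused multimodal embeddings.
import numpy as np

class EmbeddingAnomalyDetector:
    def __init__(self, dim: int, threshold: float = 3.0):
        self.mean = np.zeros(dim)
        self.var = np.zeros(dim)
        self.count = 0
        self.threshold = threshold

    def update_and_score(self, embedding: np.ndarray) -> bool:
        # Online mean/variance update, then a z-score-style check against
        # the running statistics.
        self.count += 1
        delta = embedding - self.mean
        self.mean += delta / self.count
        self.var += (delta * (embedding - self.mean) - self.var) / self.count
        z = np.abs(embedding - self.mean) / np.sqrt(self.var + 1e-8)
        return bool(z.mean() > self.threshold)

detector = EmbeddingAnomalyDetector(dim=256)
for frame_embedding in np.random.default_rng(1).normal(size=(100, 256)):
    if detector.update_and_score(frame_embedding):
        print("possible anomaly on this frame")
```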
Agentic Security Systems: Enhanced security for physical spaces potentially becomes easier with these advances. A model with access to cameras and sensors could establish a deeper understanding of its environment. It could assist security professionals in efficiently monitoring large spaces like events or schools for threats in real time. One could even foresee a future where autonomous agents continuously surveil this diverse embedding space, flagging suspicious behavior automatically.
Consumer Applications:
Sports and Music Coaching: Picture a scenario where you're playing a sport or practicing an instrument under the watchful eye of a multimodal model. This model could provide real-time analysis of things like form and strategy, taking into account data like body pose, audio, and raw video to offer actionable suggestions for improvement.
Home Management: Consider the potential of smart homes where you're relieved from the drudgery of configuring dozens of individual devices. Instead, an intelligent home agent could help you handle this complexity. Drawing upon a plethora of sensory inputs (or even just logs) from different devices, it could execute complex commands through actuators (physical and digital).
Personalized Shopping: With such models, shopping assistants could understand a consumer's preferences based on visual and textual input, making for a more personalized and effective shopping experience. Not only could they analyze pictures from a customer's existing wardrobe, but they could also process fashion blogs and social media to better understand tastes and preferences, suggesting new items that match the user’s style. The user could even describe desired directions for improvement in a conversation with such an assistant.
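As a rough sketch of how such an assistant might score items, the snippet below blends wardrobe image embeddings with a text embedding of stated preferences into a single "style query" and ranks catalog items in the same joint space. The encoders are stand-ins (random vectors), and the blend weights are arbitrary assumptions.

```python
# Sketch: fuse visual history and stated preferences into one query vector.
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(7)
wardrobe = normalize(rng.normal(size=(25, 512)))      # embeddings of wardrobe photos
preference_text = normalize(rng.normal(size=(512,)))  # embedding of stated preferences
catalog = normalize(rng.normal(size=(5000, 512)))     # embeddings of catalog items

style_query = normalize(0.7 * wardrobe.mean(axis=0) + 0.3 * preference_text)
recommendations = np.argsort(-(catalog @ style_query))[:10]  # top 10 matches
print(recommendations)
```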
Considerations
The potential implications of these advances for both enterprise and consumer applications are profound. One common theme to note is that these developments highlight the importance of high-quality data. Comprehensive data collection, management, and retraining are crucial for developing and maintaining these potential applications. Successful businesses in this space might have a unique multimodal dataset and will likely build a self-improving flywheel in which user data feeds back into the underlying model. One novel constraint for data in this specific space is that it should be inherently multimodal: the same concepts should be represented in different modes in the training data so the model can learn these complex relationships. There’s a lot of research focused on aligning data across modes, and further advancement here is welcome, as it will unlock many more potential use cases.
A Quick Note
As a product manager, angel investor, and technical contributor, I'm excited about where all this could go and am always eager to meet folks who are brainstorming in these areas. The ideas I've laid out here are obviously just the tip of the iceberg, and I'm convinced there's a whole lot more to discover.
The AI space is always shifting, and it seems to be moving faster than it has in recent memory. I’ve been drinking from the arXiv firehose and spending far too much time on ML Twitter, so I’ve resolved to make positive use of that time: threading together various technical concepts I’ve encountered and exploring the applications that such developments unlock. There’s more to come, and I'd love for you to tag along.
There’s also an entire software stack that needs to be created to build these models and deploy them into production, but I’m going to skip over this for now. I’ll also be skipping mention of integrations into existing software that will enable incumbents with distribution channels to improve their offerings. I think discussing novel applications will better illustrate the potential of the underlying research, but if you’re building these solutions, I’d still love to learn more!