Great post! I am running a Discord community for practitioners tinkering with Multimodal AI research and applications. Would love to have you join: https://discord.com/invite/Sh6BRfakJa
Nice work! I too have been drinking from the firehose. I am working on a way to analyze baseball swings. Have you encountered any solutions to train a model on correct form in a movement? In this case, a baseball swing.
Very exciting -- I think the starting point is really a high-quality multimodal dataset that aligns “form”/pose with outcome data (launch angle, exit velocity, distance, etc.). Luckily, I believe MLB provides the latter for every game, and you could probably align it with video of the game. You could also use another LLM to “describe” the at-bat, giving you text as another aligned data source. Further, you could consider fine-tuning a base LLM on text data, say from a book on baseball form, to ensure you have the best starting point. I’d probably try collapsing the video and outcome data into the text domain before training embeddings directly on them (as they did in PaLM-E). That way, you could upload a video and chat with the model to discover improvements to your form. I’d love to hear how this goes!
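To make the “collapse everything to text” idea concrete, here’s a minimal sketch. The field names and data are hypothetical assumptions (not from the post): Statcast-style outcome numbers per swing, plus short pose notes already extracted from video by some upstream model or captioning LLM. Each swing gets flattened into one text sample you could fine-tune on or chat over.

```python
# Minimal sketch: collapse pose notes + outcome data for one swing into text.
# Assumptions (hypothetical): pose notes come from an upstream vision/LLM step,
# and outcome fields follow Statcast-style naming.

from dataclasses import dataclass
from typing import List


@dataclass
class SwingRecord:
    batter: str
    pose_notes: List[str]       # e.g. phrases produced by a pose model or captioning LLM
    launch_angle_deg: float     # outcome data (Statcast-style)
    exit_velocity_mph: float
    distance_ft: float


def swing_to_text(rec: SwingRecord) -> str:
    """Collapse one swing's pose and outcome data into a single text sample."""
    form = "; ".join(rec.pose_notes)
    return (
        f"Batter: {rec.batter}. Observed form: {form}. "
        f"Outcome: launch angle {rec.launch_angle_deg:.1f} deg, "
        f"exit velocity {rec.exit_velocity_mph:.1f} mph, "
        f"distance {rec.distance_ft:.0f} ft."
    )


if __name__ == "__main__":
    example = SwingRecord(
        batter="Player A",
        pose_notes=["slightly open stance", "early hip rotation", "level swing plane"],
        launch_angle_deg=24.0,
        exit_velocity_mph=101.3,
        distance_ft=405,
    )
    print(swing_to_text(example))
```

Each such line of text could then go into a fine-tuning set (or a retrieval index) so the model learns to connect descriptions of form with outcomes.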
Wow! Waiting to see more on how these impact financial services.
Oh, there’s a post on fine-tuning coming soon that’ll use a financial use case as the example!
Excellent post. Thanks