Another entrant in the backronym-of-the-year award, released in late 2023, is HaMeR: Hand Mesh Recovery, from UC Berkeley, the University of Michigan, and New York University.
Reconstructing hands in 3D space is an important component of our non-verbal communication. See above for the quintessential Italian experience. (This also forms the punchline to the groan-worthy Dad joke: how do you silence an Italian? Tie up their hands.)
There has been no shortage of AI models over the past 10-15 years that try to track hand motion fluidly and at high frame rates. Early attempts started by identifying the fingertips, progressed to basic splines (bones) spanning the joints, and have now arrived at full white-glove mesh approximations like HaMeR's.
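To make that progression concrete, here is a rough sketch of the three generations of hand representation, in Python. The class names are illustrative, but the mesh-stage numbers (48 axis-angle pose parameters and 10 shape coefficients driving a 778-vertex surface) come from the MANO hand model that mesh-recovery systems such as HaMeR regress.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FingertipDetections:
    """Generation 1: just the fingertips, as 2D image points."""
    tips: np.ndarray      # (5, 2) pixel coordinates, one per finger

@dataclass
class SkeletonEstimate:
    """Generation 2: a kinematic skeleton, with "bones" between the joints."""
    joints: np.ndarray    # (21, 3) wrist + 4 joints per finger, in camera space

@dataclass
class MeshRecovery:
    """Generation 3: a full parametric surface, MANO-style."""
    pose: np.ndarray      # (48,) axis-angle rotations: global + 15 hand joints
    shape: np.ndarray     # (10,) identity/shape coefficients
    vertices: np.ndarray  # (778, 3) the resulting "white glove" surface
```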
The problem with hand detection has always arisen when leaving safe orientations – the standard overhead poses we picture when looking at our own hands. Complex oblique and/or occluded views are typically fleeting, yet this is exactly where previous AI/ML models lose their detections and either blank out (stop detecting) or make spurious guesses – guesses that can produce anything from noise/jitter all the way up to distorted fingers or obviously physically impossible contortions.
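Downstream, those failure modes surface as low-confidence, high-variance keypoints. A common mitigation – a generic baseline, not anything HaMeR-specific – is to gate on detector confidence and exponentially smooth what survives. A minimal sketch, assuming per-joint confidence scores are available:

```python
import numpy as np

def smooth_joints(prev, current, confidence, min_conf=0.3, alpha=0.5):
    """Exponential smoothing with confidence gating.

    prev, current: (J, 3) joint positions from consecutive frames.
    confidence:    (J,) per-joint detector confidence in [0, 1].
    Joints below min_conf are held at their previous position rather
    than letting a spurious guess yank the hand into a contortion.
    """
    gated = np.where(confidence[:, None] >= min_conf, current, prev)
    return alpha * gated + (1.0 - alpha) * prev
```

Raising `alpha` favours responsiveness; lowering it favours stability – exactly the trade-off that a steadier raw estimate like HaMeR's relaxes.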
HaMeR offers the full gamut of open-source evidence that it handles these edge cases more effectively, showcased via their Project Page, Paper, Code Repository, Demo on Hugging Face, and a working Google Colab (a minimal usage sketch follows below). To top it all off, they will be releasing their training data soon, too, so you can extend the model for your particular use case.
- Be sure to go to the Project Page to watch their embedded video demonstrations.
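For local experimentation, running a research repo like this usually boils down to installing its requirements and pointing a demo script at a folder of images. Here is a minimal sketch of such an invocation; the script name and flags are assumptions modeled on common research-repo conventions, so check the HaMeR repository's README for the actual entry point and arguments:

```python
import subprocess

# Hypothetical invocation of the repository's demo script; verify the
# real script name and flags against the HaMeR README before running.
subprocess.run(
    [
        "python", "demo.py",
        "--img_folder", "my_hand_images",  # directory of input frames
        "--out_folder", "demo_out",        # where renders/meshes are written
        "--save_mesh",                     # assumed flag to export 3D meshes
    ],
    check=True,  # raise if the demo exits with an error
)
```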
First author Georgios Pavlakos et al. do list the model's known limitations, but overall, the reduced guesswork and its accompanying jitter make for more stable motion/flow than the model's predecessors managed.
This work has implications for inside-out processing in AR/XR applications, where you want your own hands and fingers accurately captured as input to your own commands – but equally for outside-in applications, where remote sensing of the full range of human actions and non-verbal communication is needed for a complete understanding of the world around us.
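As a toy illustration of the inside-out case: once a recovery model hands you 3D joint positions, turning them into a command can be as simple as a distance test. A hypothetical pinch detector, with joint indices following the common 21-keypoint hand layout and an illustrative threshold:

```python
import numpy as np

THUMB_TIP, INDEX_TIP = 4, 8   # indices in a standard 21-joint hand layout

def is_pinching(joints: np.ndarray, threshold_m: float = 0.02) -> bool:
    """Fire a 'select' command when thumb and index tips nearly touch.

    joints: (21, 3) hand joints in metres, e.g. recovered from a hand mesh.
    """
    gap = np.linalg.norm(joints[THUMB_TIP] - joints[INDEX_TIP])
    return gap < threshold_m
```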