NTT Docomo has developed the Hands-Free Videophone, which enables video calls without the user having to hold a camera. This is part of Docomo’s research into future glasses-type devices.
The Hands-Free Videophone captures the user’s face with three cameras on each side of the frames, left and right. The video sent to the other person is created by combining these pictures with a pre-rendered 3D model of the user’s face.
“Each camera has 720p resolution, and a fish-eye lens, with a 180-degree field of view. This is the High Definition picture currently being captured in real time. If you look at the face, you can see it’s really distorted, because the fish-eye lens is so close. The distortion is compensated, and the picture is combined with a 3D model of the person in the computer. Currently, priority is given to the part around the eyes. As you can see when the man closes his eyes, the eyelids and the corners of the eyes appear quite realistic. Such a level of realism is hard to achieve with models like CG-based avatars, where parts are overlaid on the face.”
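The distortion compensation mentioned in the quote can be illustrated with a minimal sketch. It assumes an ideal equidistant fish-eye model (image radius proportional to the angle from the optical axis) and a 180-degree image circle spanning the 720-pixel frame height. Both are assumptions made for illustration; the article does not describe Docomo's actual lens model or correction method, and the names `fisheye_source_point`, `F_FISH` and `f_rect` are hypothetical.

```python
import math

# Assumption: the 180-degree image circle fills the 720-pixel frame height,
# so a ray 90 degrees off-axis lands 360 pixels from the image centre.
IMAGE_CIRCLE_RADIUS_PX = 360.0
F_FISH = IMAGE_CIRCLE_RADIUS_PX / (math.pi / 2)  # pixels per radian


def fisheye_source_point(x_out, y_out, f_rect, f_fish=F_FISH):
    """Map a point in the corrected (rectilinear) image to the fish-eye image.

    Coordinates are in pixels, measured from the image centre.
    f_rect is the focal length chosen for the corrected view (hypothetical).
    """
    r_out = math.hypot(x_out, y_out)
    if r_out == 0.0:
        return (0.0, 0.0)  # the optical axis maps to itself
    theta = math.atan2(r_out, f_rect)  # angle of the ray from the optical axis
    r_src = f_fish * theta             # equidistant model: r = f * theta
    scale = r_src / r_out
    return (x_out * scale, y_out * scale)
```

To build a corrected picture, each output pixel would be filled by sampling the fish-eye frame at the point this function returns. A rectilinear view cannot cover the full 180 degrees, so in practice only a narrower window would be corrected this way.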
Currently, the resolution isn’t high enough to handle the mouth and upper body parts of the picture, so those are based on CG. The orientation of the face is based on six-axis sensor data, and the motion of the mouth is based on audio data from the microphone.
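One simple way to drive a CG mouth from microphone audio is to map the loudness of each short audio frame to a mouth-opening value. The sketch below is only an illustration of that idea: the article does not say how the prototype analyses the audio, and the function name and the `floor` and `ceiling` thresholds are hypothetical.

```python
import math


def mouth_openness(samples, floor=0.02, ceiling=0.3):
    """Return a mouth-opening value in [0, 1] for one frame of audio.

    samples: audio samples normalised to [-1, 1].
    floor, ceiling: hypothetical RMS levels treated as "closed" and
    "fully open"; real values would need tuning per microphone.
    """
    if not samples:
        return 0.0
    # Root-mean-square amplitude as a rough loudness measure.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    level = (rms - floor) / (ceiling - floor)
    return max(0.0, min(1.0, level))  # clamp to the valid range
```

Silence keeps the mouth closed, and louder speech opens it further up to the ceiling. A loudness-only mapping cannot reproduce specific mouth shapes for different sounds.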
There’s also a camera on the back of the device, so more realistic video calls can be achieved by combining the person’s image with the actual background.
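Combining the person's image with the background can be pictured as per-pixel alpha blending. This is a generic compositing sketch, not the prototype's documented method; it assumes a per-pixel mask value `alpha` is available (1.0 for the person, 0.0 for the background), and the function name is hypothetical.

```python
def composite_pixel(fg, bg, alpha):
    """Blend one foreground RGB pixel over one background RGB pixel.

    fg, bg: (r, g, b) tuples with 0-255 channel values.
    alpha: assumed mask value in [0, 1]; 1.0 keeps the person,
    0.0 keeps the background from the rear camera.
    """
    return tuple(round(alpha * f + (1.0 - alpha) * b) for f, b in zip(fg, bg))
```

Values of `alpha` between 0 and 1 soften the edge between the person and the background.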
“So, in the picture the other person receives, it looks as if there’s a virtual camera taking pictures from in front of this person’s face. In the current prototype, we haven’t yet built in an HMD. But we’d like to do that in the future, to make it feel as if the picture of the other person is floating before your eyes. We think it could be possible to make it seem as if you’re having a face-to-face dialog anywhere, even in an empty outdoor space.”
If the camera resolution is increased dramatically, to 4K or 8K, it’ll be possible to increase the amount of actual video used. Ultimately, the aim is for most of the face to be recreated from actual video.
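To put the resolution jump in numbers, the short sketch below compares pixel counts. It assumes "720p" means 1280×720 and takes "4K" and "8K" to be the UHD formats (3840×2160 and 7680×4320); the article does not specify which variants are meant.

```python
# Assumed frame sizes; the article names the formats but not exact dimensions.
RESOLUTIONS = {
    "720p": (1280, 720),
    "4K UHD": (3840, 2160),
    "8K UHD": (7680, 4320),
}


def pixel_count(name):
    """Total number of pixels in one frame of the named format."""
    width, height = RESOLUTIONS[name]
    return width * height


def pixel_ratio(name, baseline="720p"):
    """How many times more pixels the named format has than the baseline."""
    return pixel_count(name) / pixel_count(baseline)
```

Under these assumptions, a 4K frame has 9 times as many pixels as a 720p frame, and an 8K frame has 36 times as many.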