The Future of Digital Frames
A team from Intel asked Marco Triverio and me to look 10 years into the future and imagine how people might capture or display photos given expected advances in technology. Intel is in the ever-faster-processor business and uses these predictions to steer chip architecture, as well as to spur developers' imaginations and maintain demand for faster silicon.
Over the course of the project we realized that using photos for storytelling is such a fundamentally instinctive behavior that it is safe to predict it will last through several generations of technological innovation. Our concept builds on this insight. However, foreseeing 10 years into the future is tricky. With IBM’s Watson, we’re seeing a sneak peek of improvements in natural language and intention recognition. We think that one of the photo-related advances users will enjoy is computers that can actually help us tell stories and have richer conversations. We created the video below to illustrate the concept, but thought it would be fun to explain our reasoning and how we found sneaky ways to quickly prototype experiences that may not be possible for another decade.
10 Years in the Past
10 years is a fairly arbitrary amount of time in technology. Many things seem to perpetually be just a few years off. So, to give ourselves some perspective, we decided to look 10 years into the past. What we realized was that in 2001 many thought we would have flying cars and HAL 9000 by now. However, the major advancement of the time was actually the launch of Google’s search algorithm.
So we had a conversation early on about which technologies seem to have made legitimate advances in the last few years. Among others we found facial recognition, analytics applied to natural language, and voice recognition. You can see these in Facebook’s photo tagging, IBM’s Watson, and Siri (released as we were wrapping up the project), respectively. However, all of these are in their infancy. Facebook’s face recognition is not always accurate, Watson’s undisclosed cost is estimated to be between $100 million and $1.5 billion, and Siri often struggles to compose a text message.
User Research Focused on Behavior
While we believe it is impossible to accurately predict the technological breakthroughs that are going to take place in the coming 10 years, we are convinced that human behavior will change at a much less rapid pace. For this reason we decided to start the project by observing current user behaviors surrounding pictures and photography. As we conducted interviews, one factor that quickly became evident is the social nature of photos. People share photos online but also during face-to-face conversations. We found many users handling photos or smartphones in an effort to enhance a conversation. Pictures provide common references, visual cues, and rich details that support and enrich storytelling.
We re-framed the brief in order to address the core need, asking ourselves: what will storytelling look like 10 years from now? How will technology, which already heavily influences the way we tell stories, assist us in our narrations?
The devices that surround us have become increasingly aware of the context in which they exist. Yet the scale at which this works is still coarse, and devices are not aware of what immediately surrounds them. What if our devices knew more about the few feet around them? We started from the idea of a picture frame that simply listens and shows pictures pulled from keywords in the dialogue.
*These are some rough mockups we used to pitch the concept early on.*
At CIID we really like to prototype rapidly. Most future scenarios out there are just a lot of special effects and some ideas that seem promising on paper. We wanted to legitimately iterate through several versions of a technology that does not exist yet, so we had to be sneaky. At first we investigated the possibility of building a fully working prototype for very specific scenarios. After acquiring the best voice recognition solutions available and training the software on our state-of-the-art Macs, we quickly realized that they were still far too computationally intensive and slow. When two people spoke at once, our computers’ fans sounded like they were ready for lift-off, and the machines promptly locked up.
We quickly switched to a secret “Wizard of Oz” experiment in which voice recognition was carried out by a hidden human actor. We invited users to sit in a room and start a conversation with Chris. On the table was our first prototype: a laptop (in the role of a picture frame) showing full-screen images. What our guinea-pig subjects did not know was that Marco was sitting in another room, listening through the laptop microphone, and manually changing the pictures on the display.
For the geeks: this prototype was created by writing a Python script on the Wizard of Oz computer that would query Google Images and download the first result. This top result image was then pushed over the network on to the prototype laptop in the interview room. A Processing sketch on the prototype would load the most recent picture and display it full-screen.
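The push mechanism can be sketched in a few lines. This is a minimal illustration, not the original script: it assumes the wizard's machine and the prototype laptop share a network folder, and the function and file names are hypothetical. The wizard drops the chosen image into the folder, and the display side (the Processing sketch, in our setup) polls for the most recent file and draws it full-screen.

```python
import os
import tempfile

def push_image(shared_dir, image_bytes, name):
    """Wizard side: drop the chosen image into the shared folder."""
    path = os.path.join(shared_dir, name)
    with open(path, "wb") as f:
        f.write(image_bytes)
    return path

def latest_image(shared_dir):
    """Display side: return the most recently pushed image, or None.

    This is what the Processing sketch effectively did on each frame:
    look for the newest file and show it full-screen."""
    files = [os.path.join(shared_dir, f) for f in os.listdir(shared_dir)]
    return max(files, key=os.path.getmtime) if files else None

# Tiny demo in a temporary "shared folder":
shared = tempfile.mkdtemp()
first = push_image(shared, b"fake-jpeg-bytes", "party.jpg")
second = push_image(shared, b"fake-jpeg-bytes", "beach.jpg")
os.utime(first, (1000, 1000))   # force distinct timestamps for the demo
os.utime(second, (2000, 2000))
```

Polling a shared folder is crude, but for a Wizard of Oz study it keeps the illusion intact: the subject only ever sees pictures appearing on the "frame".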
What we learned from this experiment:
Speed is crucial. Even a 2-second lag can be annoying and can slow down or derail the conversation.
Pictures must be personal. Photos from Google Images are generic while storytelling is tied to events, people, and places as we see them through our pictures.
In the next iteration we focused on content at least one speaker was familiar with. We thus carried out a new “Wizard of Oz” experiment where Marco would pick images from a curated library of 100 pictures taken at a party the subjects had attended. Chris steered the conversation towards talking about that same party and, once again, we observed the reactions of our guests as they helped retell the events of the night.
This prototype was dramatically more successful and showed great potential for creating user delight, but we still had more to learn. This time we noticed that backlit displays can be prominent and distracting. Whenever the picture frame switched pictures, the speakers’ attention was drawn to the device, even when it sat in the periphery of their vision.
For the last prototype we switched from the laptop’s LCD to an E Ink screen, in our case the Kindle’s. Through the Kindle’s “Experimental” browser mode, and with the help of BERG’s James Darling, we managed to have the Kindle load images that could be changed remotely. For the nerds: the Kindle was loading a web page that refreshed every time a new image was loaded onto the web server.
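The refreshing page itself needs almost nothing. The sketch below is a simplified reconstruction, not James's actual code: it builds an HTML page whose `meta refresh` tag makes the Kindle's browser reload at a fixed interval, so whatever image currently sits on the server appears without any interaction. The function name and the refresh interval are our own assumptions.

```python
def frame_page(image_name, refresh_seconds=2):
    """Build the page the Kindle's browser polls.

    The meta tag makes the page reload itself every few seconds, so a
    newly uploaded image shows up on the E Ink screen automatically."""
    return (
        "<html><head>"
        f'<meta http-equiv="refresh" content="{refresh_seconds}">'
        "</head><body>"
        f'<img src="{image_name}" width="100%">'
        "</body></html>"
    )

page = frame_page("party.jpg")
```

Serving this from any static web server, with the wizard overwriting the image file, is enough to make the frame feel remotely controlled.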
The non-backlit E Ink display fades nicely into the environment in which it exists. This picture-frame prototype did not distract the speakers, yet when useful it was within easy reach to help the discussion.
Living Frame Concept
The prototyping sessions brought us close to experiencing the potential of a reactive picture frame. We learned that the frame should be quick, personal, and ambient. In other words, it should feel like a living entity dedicated to gently offering pictures. With frames like the ones illustrated in our concept video placed all over the house, it would start to feel as if the home were reacting to your current experiences with past memories. Homes would become incredibly personal. So much so that we even had several conversations about how frames might hide certain photos while you had guests over.
Technology Behind It
Since the beginning of this project we have tried to avoid predicting major technological breakthroughs. We have focused instead on trying to understand existing behaviors and how they might evolve in the future. Similarly, we imagine Living Frames as being built with existing technologies, improved and made more reliable by ten years of evolution.
We envision a picture frame that is aware of its physical surroundings. Kinect-inspired sensors could give all sorts of information about the people in the room: who they are, whether they are facing the frame, and so on. Natural language recognition could enable the frame to understand the context and meaning of conversations. However, even with this knowledge, the system must know enough about an image’s content to be able to retrieve the relevant one from libraries containing thousands.
Most of our photos now reside online, often, for instance, in Facebook albums that are increasingly communal. For this reason we envision a frame that is acutely aware of our digital life. Yet we also imagine an evolution in the kind of metadata available with each picture. At present we know when and where a picture was taken. The increasing digitization of our lives (for instance, in invitations to events) and technologies such as face recognition can provide new awareness of the social context in which a picture was taken. Information such as who is in the photo and who was nearby becomes available.
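To make the retrieval idea concrete, here is a deliberately naive sketch of how richer metadata could be matched against conversation keywords. The metadata schema (`people`, `tags`, `place`) and both function names are our own illustrative assumptions; a real system would need the natural language recognition discussed above rather than a bag of keywords.

```python
def score(photo, keywords):
    """Count how many conversation keywords match the photo's metadata."""
    tags = {t.lower() for t in photo["people"] + photo["tags"] + [photo["place"]]}
    return len(tags & {k.lower() for k in keywords})

def best_photo(library, keywords):
    """Pick the photo whose social metadata best matches the conversation.

    Returns None when nothing matches, so the frame can simply stay quiet
    instead of showing an irrelevant picture."""
    ranked = max(library, key=lambda p: score(p, keywords))
    return ranked if score(ranked, keywords) > 0 else None
```

Even this toy version captures the key behavior from our prototypes: the frame should offer a picture only when it genuinely relates to the story being told.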
Value for Intel
Digital photography can be split into three phases: capturing, editing, and consuming. Intel has an interest in photography because all three phases involve computing power. At present comparatively little computation goes into capturing and consuming pictures, while editing often requires a great deal.
Living Frames re-invents the way we currently consume pictures by introducing a listening entity that reacts to the environment around it. This has an impact on people, who gain a completely new way of interacting with their memories. Yet Living Frames also creates the need and desire for devices powerful enough to handle the technology necessary to “listen” and react appropriately.
Intel does some very interesting work. Here is a great Economist article on the team we worked with.