This project is funded by Precarn-IRIS as part of the Institute for Robotics and Intelligent Systems (IRIS) Phase 4 mission. The research is being conducted jointly by Prof. Jim Clark (McGill University), Prof. Sidney Fels (UBC), and Prof. Roel Vertegaal (Queen's University).
Intelligent Environments are spaces in which machine perception and reasoning capabilities are used to enhance human activity through background computation. Perhaps the most challenging technical problem facing researchers in this area is the need for reliable, high-resolution, and unobtrusive vision systems that can track objects and people as they move about such a space. This capability is required in order to recognize user actions and gestures and to perform synthetic reconstruction of a (visual) scene in the environment. Current applications typically rely on a small number of (often expensive) video cameras or laser rangefinders for this purpose, which tend to restrict both the visual coverage and the total resolution available. This usually imposes undesirable restrictions; in the case of eye-tracking, for example, the user must remain stationary in front of a camera. Such restrictions limit that technology's potential for application in Collaborative Virtual Environments (Vertegaal, 1999a).
When using computer vision for object tracking, two cameras will provide absolute depth information through stereopsis, which can be used to aid in recognition of the object. The use of more than two cameras does not significantly improve the accuracy of depth information, but can provide greater coverage of the viewed object. This increased coverage can be expected to yield improved object recognition performance or view reconstruction capabilities. However, techniques for integration of arbitrary multiple views are computationally expensive (Vedula, 1999) and are unlikely to scale well.
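For a rectified camera pair, the relationship between disparity and depth is straightforward, and the brief sketch below illustrates it; the focal length, baseline, and disparity values are illustrative assumptions rather than parameters of our hardware.

```c
/* Depth from stereo disparity for a rectified camera pair:
 * Z = f * B / d, where f is the focal length (pixels), B is the
 * baseline between the cameras (metres), and d is the disparity
 * (pixels).  Values below are illustrative only. */
#include <stdio.h>

static double depth_from_disparity(double focal_px, double baseline_m,
                                   double disparity_px)
{
    if (disparity_px <= 0.0)
        return -1.0;            /* point at infinity or a bad match */
    return focal_px * baseline_m / disparity_px;
}

int main(void)
{
    double f = 800.0;           /* assumed focal length in pixels   */
    double B = 0.12;            /* assumed 12 cm camera baseline    */
    for (double d = 2.0; d <= 32.0; d *= 2.0)
        printf("disparity %5.1f px -> depth %.2f m\n",
               d, depth_from_disparity(f, B, d));
    return 0;
}
```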
This project will consider the use of a large, parallel, distributed array of low-cost cameras with embedded image processing capabilities for object tracking and scene reconstruction in Intelligent Environments. The use of large camera arrays is rapidly becoming feasible as the cost of image sensors has fallen greatly in the last few years (e.g., a CMOS camera currently costs less than $25). However, integrating image data from a multitude of cameras is a major challenge that currently inhibits the creation of sufficiently large camera arrays.
Meeting this challenge involves solving a number of smaller technical problems. For example, to simplify integration of views of moving objects, we must be able to synchronize image acquisition across a large number of cameras. To cope with large amounts of image data, we must (1) develop more efficient image compression techniques (e.g., knowledge-based); (2) design detection algorithms capable of dealing with low-resolution image data; (3) develop parallel, distributed software and hardware for the processing of large images; and (4) develop communication infrastructures between cameras that can handle potentially high data rates with low latency.
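As a simple illustration of the synchronization problem, the sketch below aligns frames from several free-running cameras to a common capture epoch by selecting, for each camera, the frame whose timestamp lies nearest the epoch. The timing constants are hypothetical; in practice a hardware trigger line or network time distribution would replace this purely software scheme.

```c
/* Software frame-alignment sketch: given free-running cameras whose
 * frames carry local timestamps, pick for each camera the frame
 * nearest a common capture epoch.  Timing constants are hypothetical. */
#include <stdio.h>
#include <math.h>

#define NUM_CAMERAS    4
#define FRAMES_PER_CAM 8

/* Return the index of the frame whose timestamp is closest to 'epoch'. */
static int nearest_frame(const double ts[], int n, double epoch)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (fabs(ts[i] - epoch) < fabs(ts[best] - epoch))
            best = i;
    return best;
}

int main(void)
{
    /* Simulated timestamps (seconds): 30 fps cameras with small,
     * camera-specific phase offsets. */
    double ts[NUM_CAMERAS][FRAMES_PER_CAM];
    for (int c = 0; c < NUM_CAMERAS; c++)
        for (int k = 0; k < FRAMES_PER_CAM; k++)
            ts[c][k] = k / 30.0 + 0.004 * c;

    double epoch = 0.10;               /* desired common capture time */
    double tolerance = 0.5 / 30.0;     /* half a frame period         */

    for (int c = 0; c < NUM_CAMERAS; c++) {
        int k = nearest_frame(ts[c], FRAMES_PER_CAM, epoch);
        double err = ts[c][k] - epoch;
        printf("camera %d: frame %d, offset %+0.4f s%s\n",
               c, k, err, fabs(err) > tolerance ? "  (out of sync)" : "");
    }
    return 0;
}
```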
To provide more robust and efficient object tracking for Intelligent Environments, we aim to create a set of networked low-cost camera arrays that collectively provide high-resolution, large field-of-view image processing capabilities. Our approach involves the development of a number of novel technologies, including scalable smart camera modules with on-board reconfigurable processing, inter-camera synchronization and networking, and distributed strategies for sharing image processing tasks across the array.
At the same time, we plan to apply these technologies to current problems in the area, including long-range eye tracking, laser pointer tracking on large projection displays, and real-time view synthesis.
Current work on the development of multi-camera arrays tends to be based either on physically clustered cameras, as in the USC panoramic vision system (Neumann et al., 2000), or on small arrays of expensive, high-resolution cameras connected to workstations where video processing is performed (Kanade, 1997). Neither of these approaches satisfies our objectives, for reasons of cost, viewing area coverage, and parallel processing capabilities.
Our first milestone requires the development of a scalable, integrated camera with on-board image processing hardware. We will draw upon our experience in developing the Local Positioning System (LPS) (Fels, 2001), which uses a CMOS camera connected to a field programmable gate array (FPGA) and digital signal processing (DSP) chip to track active infrared tags in large rooms in real time. This initial platform will be scaled up to the production of multiple camera modules that provide inter-camera synchronization, efficient inter-camera networking, and high-speed data communication to a host machine. We believe that reconfigurable FPGA hardware provides a suitable alternative to standard programmable microprocessor or DSP chip implementations for this application, with the advantages of greater speed and more efficient use of circuitry, leading to reduced power consumption. These benefits are especially important for embedded systems applications.
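To give a flavour of the kind of pixel-level operation that maps naturally onto streaming FPGA logic, the sketch below thresholds a greyscale frame and computes the centroid of the bright pixels, as would be needed to locate an active infrared tag. The frame contents and threshold are invented for illustration and are not drawn from the existing LPS implementation.

```c
/* Illustrative tag-detection kernel: threshold a greyscale frame and
 * report the centroid of the bright pixels.  This single-pass loop is
 * the sort of computation that maps naturally onto FPGA logic.
 * Frame data and threshold are invented for illustration. */
#include <stdio.h>
#include <stdint.h>

#define W 16
#define H 12

static int bright_centroid(const uint8_t *img, int w, int h,
                           uint8_t thresh, double *cx, double *cy)
{
    long sx = 0, sy = 0, n = 0;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            if (img[y * w + x] >= thresh) {
                sx += x; sy += y; n++;
            }
    if (n == 0)
        return 0;                 /* no tag visible in this frame */
    *cx = (double)sx / n;
    *cy = (double)sy / n;
    return 1;
}

int main(void)
{
    uint8_t frame[W * H] = {0};
    /* Paint a small bright blob where an IR tag might appear. */
    for (int y = 4; y < 7; y++)
        for (int x = 9; x < 12; x++)
            frame[y * W + x] = 250;

    double cx, cy;
    if (bright_centroid(frame, W, H, 200, &cx, &cy))
        printf("tag centroid: (%.2f, %.2f)\n", cx, cy);
    else
        printf("no tag detected\n");
    return 0;
}
```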
One of the difficulties in working with reconfigurable FPGA architectures has been the lack of high-level design tools. This is changing, however, and we plan to work with the Handel-C design tools from Celoxica Corp., which permit hardware descriptions (in VHDL or EDIF netlists) to be generated from C language programs. Celoxica, in cooperation with Xilinx will make available their RC1000-PP development board, which contains a Xilinx Spartan-II 200K gate device, flash RAM, and assorted peripherals, including an NTSC video digitizer. We propose to use these boards to develop our embedded system smart camera prototypes, the design of which can then be transferred to compact special-purpose camera units for the full network demonstration phase of the project.
There are a number of advantages to tightly interconnecting the processing circuitry of multiple smart cameras. A mechanism akin to the attentional process in the human brain can be employed to actively direct various image processing activities to different cameras as task requirements and environmental conditions warrant (Vertegaal, 2002). For example, in a person-tracking task, some cameras may be able to see the person's eyes, and could therefore engage in gaze tracking activities, while other cameras could track the person's body orientation, or be engaged in peripheral alerting tasks.
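A minimal sketch of how such attention-driven assignment might look is given below: each camera reports a score for how well it currently sees the subject's eyes, cameras above a threshold take on gaze tracking, and the rest fall back to body-orientation tracking. The scores and threshold are hypothetical.

```c
/* Attention-style task assignment sketch: cameras that currently see
 * the subject's eyes well take on gaze tracking; the rest fall back
 * to body-orientation tracking.  Scores and threshold are hypothetical. */
#include <stdio.h>

#define NUM_CAMERAS 6

enum task { TASK_GAZE, TASK_BODY };

int main(void)
{
    /* Hypothetical per-camera "eye visibility" scores in [0, 1]. */
    double eye_score[NUM_CAMERAS] = { 0.82, 0.10, 0.05, 0.67, 0.30, 0.91 };
    double gaze_threshold = 0.6;

    for (int c = 0; c < NUM_CAMERAS; c++) {
        enum task t = (eye_score[c] >= gaze_threshold) ? TASK_GAZE : TASK_BODY;
        printf("camera %d (eye score %.2f): %s\n", c, eye_score[c],
               t == TASK_GAZE ? "gaze tracking" : "body-orientation tracking");
    }
    return 0;
}
```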
In addition, each smart camera unit can be thought of as a single node in a coarse-grained MIMD parallel computer. Such a distributed computer can effectively approach a computationally intensive problem by breaking it into a number of subtasks that are worked on cooperatively by each processor. The programming of such MIMD systems is challenging, but biological metaphors can be used to some benefit (Clark and Hewes, 1994; Clark, submitted). In such approaches, separate processors collaborate and share the work efficiently through the application of heuristic load balancing strategies, mimicking some of the ways in which biological cells or colonies of insects distribute their tasks (Bonabeau, Dorigo and Theraulaz, 1999). Our proposal is to consider similar techniques for parallel distribution of various image processing tasks in our network of smart cameras. The interconnection of both image data and control signals between the cameras in the network will permit the implementation of such decentralized load balancing strategies.
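One such heuristic is the response-threshold model described by Bonabeau, Dorigo and Theraulaz (1999), sketched below: each node has its own threshold for a task and engages with probability s²/(s² + θ²), where s is the task's current stimulus, so that demand left unserved recruits more nodes. All parameter values are illustrative and not drawn from our system.

```c
/* Response-threshold load-balancing sketch (after Bonabeau, Dorigo and
 * Theraulaz, 1999): node i engages with probability s^2 / (s^2 + theta_i^2),
 * where s is the task's current stimulus.  Unserved demand builds up
 * stimulus; work relieves it.  All parameter values are illustrative. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_NODES 5
#define NUM_STEPS 10

static double engage_probability(double stimulus, double threshold)
{
    double s2 = stimulus * stimulus;
    return s2 / (s2 + threshold * threshold);
}

int main(void)
{
    srand(42);
    /* Each node's threshold: a low value means "eager" for this task. */
    double theta[NUM_NODES] = { 0.5, 1.0, 1.5, 2.0, 3.0 };
    double stimulus = 1.0;              /* demand for the shared task */

    for (int t = 0; t < NUM_STEPS; t++) {
        int workers = 0;
        for (int i = 0; i < NUM_NODES; i++)
            if ((double)rand() / RAND_MAX <
                engage_probability(stimulus, theta[i]))
                workers++;

        /* Demand grows each step and is relieved by whoever worked. */
        stimulus += 1.0 - 0.4 * workers;
        if (stimulus < 0.0) stimulus = 0.0;
        printf("step %2d: %d node(s) engaged, stimulus now %.2f\n",
               t, workers, stimulus);
    }
    return 0;
}
```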
As part of our research, we plan to investigate the architectural issues associated with implementing the distributed parallel processing approaches described above. Among the considerations will be the development of adaptive processing architectures that can easily and quickly change their processing in accordance with task requirements, as well as communication structures that facilitate collaboration between separate smart camera units. Reconfigurable FPGA architectures are well suited to this sort of on-line rearrangement of processing activity, and the development of adaptive mechanisms for these devices will form a portion of the proposed research.
The development of our large camera array will serve as an important testbed for the Local Positioning System (LPS) developed by Prof. Fels. Building on the work of IBM Research Labs (Morimoto et al., 1998), which uses intermittent on-axis and off-axis infrared illumination of the eyes to obtain an isolated image of the user's pupils (the bright-dark pupil technique), we will experiment with the potential of a smart camera array for low-cost, non-intrusive, long-range eye-tracking.
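To make the bright-dark pupil idea concrete, the sketch below differences an on-axis (bright pupil) frame against an off-axis (dark pupil) frame and thresholds the result to isolate pupil candidates. The synthetic frames and threshold are invented for illustration only.

```c
/* Bright-dark pupil differencing sketch: subtract the off-axis (dark
 * pupil) frame from the on-axis (bright pupil) frame and threshold the
 * difference; the pupils are the regions that change most between the
 * two illumination conditions.  Frames below are synthetic. */
#include <stdio.h>
#include <stdint.h>

#define W 16
#define H 12
#define DIFF_THRESHOLD 60

int main(void)
{
    uint8_t bright[W * H], dark[W * H];

    /* Synthetic scene: a uniform face region, with one pupil that
     * glows only under on-axis illumination. */
    for (int i = 0; i < W * H; i++) { bright[i] = 90; dark[i] = 90; }
    bright[6 * W + 5] = 220;
    bright[6 * W + 6] = 215;

    int found = 0;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            int d = (int)bright[y * W + x] - (int)dark[y * W + x];
            if (d > DIFF_THRESHOLD) {
                printf("pupil candidate at (%d, %d), difference %d\n",
                       x, y, d);
                found = 1;
            }
        }
    if (!found)
        printf("no pupil candidates in this frame pair\n");
    return 0;
}
```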
The capability to track the user's eye gaze behavior is particularly important in Intelligent Environments that are shared between users at different locations (see Vertegaal et al., 2001). Communication of user eye gaze has been demonstrated to be crucial in the efficient management of group conversation (Vertegaal et al., 2000). Systems that use eye-tracking for this purpose tend to have a limited range of operation and are prohibitively expensive (Vertegaal, 1999a). Furthermore, they often restrict the user's head movement and position (e.g., the user must sit at a fixed distance in front of a PC monitor). Systems that do allow free head movement typically require the user to wear head-mounted optics and provide no easy way of reporting point of gaze in absolute scene coordinates. Tracking the eyes of multiple users at great distance is a challenging task given the small size of the image processing target, and as such it will provide a crucial benchmark for the performance of the proposed multi-camera tracking technology. One of the great values of ubiquitous eye-tracking technology is that it may greatly improve the functionality of speech recognition in Intelligent Environments: the proposed sensing technology will be applied to disambiguate the target of voice commands between multiple appliances without the need for verbal labeling (Vertegaal, 2002).
The large camera array will also be used to further our user studies on the tracking of laser pointers (both infrared and visible red) within large projection displays (Fels et al., in press). We have initial data suggesting that the resolution of laser tracking significantly impacts task performance in these large projection systems. The large camera array will allow sub-pixel resolution for tracking laser pointers on such displays, which is useful for collaborative work environments, virtual reality environments, and teleconferencing applications (Vertegaal, 1999a).
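One common route to sub-pixel accuracy is an intensity-weighted centroid over the pixels of the detected laser spot, as sketched below; the spot data and threshold are synthetic and purely illustrative.

```c
/* Sub-pixel laser-spot localization sketch: compute the intensity-
 * weighted centroid of pixels above a threshold, giving a position
 * estimate finer than the pixel grid.  Spot data are synthetic. */
#include <stdio.h>
#include <stdint.h>

#define W 12
#define H 10
#define SPOT_THRESHOLD 100

int main(void)
{
    uint8_t img[W * H] = {0};

    /* Synthetic laser spot straddling a pixel boundary. */
    img[4 * W + 6] = 240;
    img[4 * W + 7] = 130;
    img[5 * W + 6] = 200;
    img[5 * W + 7] = 110;

    double sx = 0.0, sy = 0.0, sw = 0.0;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            int v = img[y * W + x];
            if (v >= SPOT_THRESHOLD) {
                sx += (double)x * v;
                sy += (double)y * v;
                sw += v;
            }
        }

    if (sw > 0.0)
        printf("laser spot at (%.3f, %.3f)\n", sx / sw, sy / sw);
    else
        printf("no laser spot detected\n");
    return 0;
}
```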
One of the major application benefits of the camera array is the potential for high-quality view synthesis, i.e., the generation of a reconstructed view of an individual in the "scene" from an arbitrary viewing angle, with the background removed, as is required for the Shared Reality Environment (Cote et al., 2001). As the computational demands of this task are very high, current state-of-the-art approaches (Kanade, 1997) cannot operate in real time, even on powerful machines. By distributing the processing task over a collection of smart camera nodes, which dynamically configure communication paths to appropriate neighbouring nodes, we intend to study the feasibility of real-time view synthesis using such an array. However, unlike other tasks in which the communication bandwidth is fairly low, view synthesis requires high-bandwidth, low-latency transfer of regions of computed pixels from the processing elements to a host PC. This will require careful architectural consideration, in particular regarding the structure of "taps" into the array through which reconstructed high-resolution video can be extracted from any particular region.
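To give a sense of the data rates involved, the arithmetic below estimates the raw bandwidth a single tap must sustain for a given output region size and frame rate; the region dimensions, frame rate, and pixel depth are illustrative assumptions rather than design commitments.

```c
/* Back-of-envelope bandwidth estimate for a single "tap" into the
 * array: raw rate = width * height * bytes_per_pixel * frames_per_sec.
 * Region size, frame rate, and pixel depth below are illustrative. */
#include <stdio.h>

int main(void)
{
    const double width = 640.0, height = 480.0;   /* reconstructed region */
    const double bytes_per_pixel = 3.0;           /* 24-bit colour        */
    const double fps = 30.0;

    double bytes_per_sec = width * height * bytes_per_pixel * fps;
    double megabits_per_sec = bytes_per_sec * 8.0 / 1.0e6;

    printf("raw tap bandwidth: %.1f MB/s (%.0f Mbit/s)\n",
           bytes_per_sec / 1.0e6, megabits_per_sec);
    return 0;
}
```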