Computer vision has matured from a research topic of the early 1960s into an established field of research and application. Today, computer vision, image processing, and pattern recognition address many societal and technological problems. The recent desire to monitor people and their activities has heightened interest in human activity recognition. Security and surveillance applications range from monitoring persons in a subway, an airport, a bus station, or a parking lot to observing persons over a wide area from a camera mounted on an unmanned aerial vehicle (UAV). Monitoring the elderly in a ‘smart’ home equipped with multiple cameras and other sensors is a different flavor of application. Analysis and understanding of sports video is yet another. Content-based video summarization and retrieval, especially useful to video sharing websites, is again an area with ties to human activity recognition. The movie industry is interested in synthesizing a given person’s actions and gait based on a model video of that person. The added constraint in many applications is the need for real-time delivery of the end product of processing, analysis, and understanding.

Advances in several technologies have stimulated this growth in interest and the speed with which applications have been adopted. Cameras come in all sizes, shapes, and prices, from very small cameras to ones that produce very high-quality images of distant subjects. PTZ cameras with remotely controllable pan, tilt, and zoom are readily available, and infrared cameras (a bit expensive) serve night applications. Memory has become relatively inexpensive, and computers have become relatively fast and inexpensive as well. One can buy an ‘off the shelf’ system that connects to a home computer and monitor one’s home from a laptop over the Internet. Technology is thus providing an additional boost to computer vision applications.
Contemporary researchers are addressing new problems, for example, the recognition of human activities from a UAV-mounted camera, where the moving platform poses some difficult problems. At the same time, others continue to work on older problems, such as recognizing pedestrians in a crosswalk from a moving vehicle to assist in stopping the vehicle in time. Images with limited resolution and low contrast pose serious low-level image processing difficulties. In addition to monitoring individual people, one is naturally interested in the interactions of persons in a scene, the behavior of a crowd, and the possible interaction of a person with a movable object, like a piece of luggage, or an immovable object, like a fence or a wall. Recognition of human activities thus offers some very challenging and interesting ongoing research. Before one reaches the stage of high-level processing, such as understanding an activity, many low-level image processing steps in a long chain must be performed, and these steps pose severe challenges in themselves. It is not the intention of the author to soft-pedal the difficulties associated with low-level processing. A cursory look at the images obtained, for example, in a subway station without the benefit of bright lights will convince anyone that low-level segmentation is a serious problem. In addition, surveillance is a 24/7 problem, including night-time, rain, and fog if one is outdoors. In general, the duration of an activity varies with its type, and an activity is normally a continuous chain of events rather than a singleton event. Given that our recognition methodologies are bottom-up, in the sense that we recognize “micro-activities” or “actions” and then build a concatenation of such recognitions to recognize an activity, several researchers have chosen to segment an activity at different levels.
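The bottom-up idea above can be made concrete with a minimal sketch: a per-frame classifier (assumed to exist) emits micro-action labels, consecutive identical labels are collapsed into action tokens, and an activity is declared when its defining chain of actions appears in order. The labels, the sample stream, and the `"meet"`-style activity pattern are all hypothetical illustrations, not the article's actual system.

```python
# Hedged sketch of bottom-up activity recognition: collapse per-frame
# "micro-action" labels into action tokens, then look for an activity
# defined as an ordered chain of actions. All labels are hypothetical.
from itertools import groupby

def collapse_runs(frame_labels):
    """Collapse consecutive identical frame labels into single action tokens."""
    return [label for label, _ in groupby(frame_labels)]

def contains_activity(actions, pattern):
    """True if `pattern` occurs as an in-order subsequence of `actions`."""
    it = iter(actions)
    return all(step in it for step in pattern)

# Hypothetical classifier output for two people meeting:
frames = ["walk"] * 8 + ["stop"] * 3 + ["shake_hands"] * 5 + ["walk"] * 6
actions = collapse_runs(frames)   # ['walk', 'stop', 'shake_hands', 'walk']
print(contains_activity(actions, ["walk", "stop", "shake_hands"]))  # True
```

Real systems must of course cope with noisy, overlapping, and mis-segmented labels; this only illustrates the concatenation step.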
Earlier, the paradigms of ‘change, event, verb, episode, history’; ‘movement, activity, and action’; and ‘agent, motion, target’ were used to segment different activities. Our group has developed a more flexible context-free-grammar-based description of activities, which has the advantage of describing an activity at a level of detail suited to the problem under consideration. At a gross level, a person is represented as a blob; the level of understanding attainable at this granularity is limited to gross-level descriptions of motion and activity. One may describe actions like depart, follow, and meet; construct a system that distinguishes among the motion of a person, a bicycle, and a vehicle based on the blob; and recognize certain football actions between players. For certain applications this is adequate. In fact, if one is trying to avoid actual recognition of a person to address privacy concerns, these techniques are particularly useful. At the next level, a person is represented by body parts, namely head, torso, arms, legs, hands, and feet. A number of methods have been proposed to segment the body into its various parts. At times, one is interested in determining the major body joints or the extremities of the body, since they carry a wealth of information about the activity being performed by the entire body. Semantic recognition of activities based on various body parts has produced some very good results, ranging from the recognition of simple activities to the recognition of fighting and of recursive activities such as continued fighting. The context of the activity plays an important role in recognition of human activity. A variety of methodologies are being pursued, and one may impose a number of taxonomies to gain insight into methods and results. One such taxonomy divides the methods into two classes: single-layer and hierarchical approaches.
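To suggest how a context-free-grammar description of activities might look, here is a toy sketch. It is not the author's actual grammar: the non-terminals (`GREETING`, `FIGHTING`), the action tokens, and the naive exhaustive parser are all hypothetical. Note the recursive production for `FIGHTING`, echoing the recursive activities (continued fighting) mentioned above.

```python
# Hedged sketch: a toy context-free grammar over action tokens.
# Non-terminals are keys of GRAMMAR; terminals are lowercase strings.
# Rule names and tokens are illustrative assumptions only.
GRAMMAR = {
    "GREETING": [["approach", "shake_hands", "depart"]],
    "FIGHTING": [["punch", "FIGHTING"], ["punch"]],  # recursion: continued fighting
}

def parse(symbol, tokens):
    """Return True if `tokens` can be fully derived from `symbol`."""
    if symbol not in GRAMMAR:                       # terminal symbol
        return len(tokens) == 1 and tokens[0] == symbol
    return any(match_seq(prod, tokens) for prod in GRAMMAR[symbol])

def match_seq(symbols, tokens):
    """Try every split of `tokens` across the symbols of one production."""
    if not symbols:
        return not tokens
    head, rest = symbols[0], symbols[1:]
    return any(
        parse(head, tokens[:i]) and match_seq(rest, tokens[i:])
        for i in range(len(tokens) + 1)
    )

print(parse("GREETING", ["approach", "shake_hands", "depart"]))  # True
print(parse("FIGHTING", ["punch", "punch", "punch"]))            # True
```

The flexibility the article points to comes from choosing the granularity of the terminals: the same grammar machinery works whether tokens describe blob motions or limb-level actions.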
A single-layer approach may in turn be divided into two cases: space-time or sequential. In hierarchical approaches, recognition of higher-level activities is based on simpler sub-events related to the activity, and the common parts are reused in constructing the description of the overall activity. A general-purpose system that can provide a semantic description of diverse human activities is far in the future. Most researchers have focused on special-purpose systems addressing particular problems, considering single-person activities, two-person interactions, or crowd activities. The moving light display experiment of Johansson [1] certainly inspired neuroscience- and computer-vision-based studies of human motion, and it motivated Webb [2] to study human motion. An earlier review by the author [3] and more recent reviews by Gavrila [4], Cedras and Shah [5], and Turaga et al. [6] provide an overview of the state of the art. A paper outlining a different direction is presented by Ryoo [7]. This short review is presented with the idea of enticing the reader toward further reading and possibly embarking on research in this area.

Several problems of computer vision and human activity recognition have proved difficult to solve. We have certainly made headway, but a solution à la R2-D2 is still far in the future. Designing and building a system to detect and possibly prevent the drowning of a person in a neighborhood or backyard pool would be a great contribution. Detecting a person having a heart attack while alone in a hotel room would be a great boon. The problem of estimating the intentions of a person from his appearance and outward behavior tickles our imagination at this time. These are difficult problems, and I consider them to be the grand challenges of human activity recognition. It is fair to assume that some of these problems will be solved in the near future.
Getting to Know…
J.K. Aggarwal, IAPR Fellow
by J.K. Aggarwal (USA)
J. K. Aggarwal is on the faculty of The University of Texas at Austin College of Engineering, where he is currently a Cullen Professor of Electrical and Computer Engineering and Director of the Computer and Vision Research Center. His research interests include computer vision, pattern recognition, and image processing, focusing on human motion. A Fellow of the IEEE (1976), IAPR (1998), and AAAS (2005), he received the Senior Research Award of the American Society for Engineering Education in 1992, the 1996 Technical Achievement Award of the IEEE Computer Society, and the Graduate Teaching Award of The University of Texas at Austin in 1992. More recently, he is the recipient of the 2004 K. S. Fu Prize of the International Association for Pattern Recognition, the 2005 Kirchmayer Graduate Teaching Award of the IEEE, and the 2007 Okawa Prize of the Okawa Foundation of Japan. He is a Life Fellow of the IEEE and a Golden Core member of the IEEE Computer Society. He has authored and edited a number of books, chapters, conference proceedings, and papers.
Recognition of Human Activities: A Grand Challenge
References:
[1] G. Johansson, “Visual perception of biological motion and a model for its analysis,” Perception & Psychophysics, vol. 14, no. 2, pp. 201-211, 1973.
[2] J. A. Webb and J. K. Aggarwal, “Structure from motion of rigid and jointed objects,” Artificial Intelligence, vol. 19, pp. 107-130, 1982.
[3] J. K. Aggarwal and Q. Cai, “Human motion analysis: A review,” Computer Vision and Image Understanding, vol. 73, no. 3, pp. 428-440, 1999.
[4] D. M. Gavrila, “The visual analysis of human movement: A survey,” Computer Vision and Image Understanding, vol. 73, no. 1, pp. 82-98, 1999.
[5] C. Cedras and M. Shah, “Motion-based recognition: A survey,” Image and Vision Computing, vol. 13, no. 2, pp. 129-155, 1995.
[6] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, “Machine recognition of human activities: A survey,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 11, pp. 1473-1488, 2008.
[7] M. S. Ryoo and J. K. Aggarwal, “Semantic representation and recognition of continued and recursive human activities,” International Journal of Computer Vision, vol. 82, pp. 1-24, 2009.
This article is the first in a new series, Getting to Know… IAPR Fellows. The series was begun in response to the question posed by Walter Kropatsch, IAPR Fellow, in the last issue of the IAPR Newsletter: “How many IAPR Fellows do you know?”
I invite all IAPR Fellows to contribute to this exciting new series.