Int. J. Advanced Media and Communication, Vol. x, No. x, xxxx

Navigating a 3D virtual environment of learning objects by hand gestures

Qing Chen*, ASM Mahfujur Rahman, Xiaojun Shen, Abdulmotaleb El Saddik and Nicolas D. Georganas
DiscoverLab, MCRLab, School of Information Technology and Engineering, University of Ottawa, 800 King Edward, Ottawa, Ontario, K1N 6N5, Canada
E-mail: qchen@discover.uottawa.ca
E-mail: shen@discover.uottawa.ca
E-mail: georganas@discover.uottawa.ca
E-mail: kafi@mcrlab.uottawa.ca
E-mail: abed@mcrlab.uottawa.ca
*Corresponding author

Abstract: This paper presents a gesture-based Human-Computer Interface (HCI) for navigating a learning object repository mapped into a 3D virtual environment. With this interface, the user can access the learning objects by controlling an avatar car using hand gestures. Haar-like features and the AdaBoost learning algorithm are used for gesture recognition to achieve real-time performance and high recognition accuracy. The learning objects are represented by different traffic signs, which are grouped along virtual highways. Compared with traditional HCI devices such as keyboards, communicating with the virtual environment through hand gestures is more intuitive and engaging for users.

Keywords: gesture recognition; human-computer interface; virtual environment; learning objects.

Reference to this paper should be made as follows: Chen, Q., Rahman, A.M., Shen, X., El Saddik, A. and Georganas, N.D. (xxxx) 'Navigating a 3D virtual environment of learning objects by hand gestures', Int. J. Advanced Media and Communication, Vol. x, No. x, pp.xxx–xxx.

Biographical notes: Qing Chen is a PhD Candidate at the DiscoverLab, School of Information Technology and Engineering, University of Ottawa. He obtained his MASc in Electrical and Computer Engineering in 2003 from the School of Information Technology and Engineering, University of Ottawa. He received his BE in Electrical Engineering from Jianghan Petroleum Institute, Hubei, China, in 1994 and his ME in Electrical Engineering from China University of Mining and Technology, Beijing, China, in 1999. His research interests include computer vision and image processing; his current research focuses on real-time vision-based hand gesture recognition.

ASM Mahfujur Rahman is a Master's student at the MCRLab, School of Information Technology and Engineering, University of Ottawa. His research focuses on learning object visualisation in a 3D virtual environment, which relates to computer graphics, information visualisation, distributed computing and knowledge management issues.

Xiaojun Shen is a postdoctoral fellow at the DiscoverLab, School of Information Technology and Engineering, University of Ottawa. He obtained his PhD in Electrical and Computer Engineering in 2002 from the School of Information Technology and Engineering, University of Ottawa. His research interests include distributed simulations, collaborative virtual environments, tele-haptics and advanced multimedia objects.

Abdulmotaleb El Saddik is University Research Chair and Associate Professor at SITE, University of Ottawa. He is the recipient of the Friedrich Wilhelm Bessel Research Award from Germany's Alexander von Humboldt Foundation (2007), the Premier's Research Excellence Award (PREA, 2004), the Canada Foundation for Innovation (CFI) Award (2004) and the National Capital Institute of Telecommunications (NCIT) New Professorship Incentive Award (2004). He is the director of the Multimedia Communications Research Laboratory (MCRLab). He has authored and co-authored three books and more than 160 publications. His research was selected for the Best Paper Award at Virtual Concepts 2006 and IEEE COPS 2007.
Nicolas D. Georganas is Distinguished University Professor and Associate Vice-President, Research (External), University of Ottawa, Canada. He received the Dipl.Ing. degree in Electrical Engineering from the National Technical University of Athens, Greece, in 1966 and the PhD in Electrical Engineering (Summa cum Laude) from the University of Ottawa in 1970. He is a Fellow of IEEE, Fellow of the Canadian Academy of Engineering, Fellow of the Academy of Science (Royal Society of Canada) and Fellow of the Engineering Institute of Canada. He is a Laureate of the 2002 Killam Prize for Engineering, Canada's highest award for career achievements in research.

Introduction

Virtual Environments (VEs) provide a new paradigm for human communication, interaction, learning and training. To interact with VEs, besides traditional human-computer interaction devices such as keyboards and mice, different sensing modalities and technologies can be utilised and integrated for a more natural user experience (Turk, 2001). Devices that detect body position and orientation, speech and sound, facial expression, haptic response and other aspects of human behaviour and state can be used for interactions between humans and VEs. These devices and techniques make natural and immersive Human-Computer Interfaces (HCIs) for applications in 3D VEs promising (Pavlovic et al., 1996; Kirishima et al., 2005; Wu and Huang, 2001).

The three-dimensional (3D) information provided by VEs offers several possibilities, such as perceiving more information at a time, displaying meaningful patterns in the data, and understanding the relationships among different data items (Card et al., 1999). These possibilities may be exploited in different contexts, especially in visualising learning objects, which require more novel and intuitive presentation techniques than the traditional 2D approaches provide (Klerkx et al., 2004). To access the learning objects in a 3D VE, the traditional mouse and keyboard are limited: the mouse itself is a 2D device, and the arrow keys on the keyboard are not an intuitive means of control for humans. To overcome these limitations, a multimodal approach can be employed to achieve a more powerful and natural interaction between the user and the virtual environment. Besides the mouse and the keyboard, other modalities can include the human voice, hand gestures, haptic devices, etc. Figure 1 shows this multimodal HCI architecture.

Figure 1. The architecture of multimodal-based manipulation of learning objects in a 3D VE.

Hand gestures are a powerful human-to-human communication modality. For example, sign language has been used extensively among people who are speech and hearing impaired. People who can talk and listen also use many kinds of gestures to support their communication in daily life. However, the expressiveness of hand gestures has not been fully explored for virtual environment applications. Compared with traditional HCI devices, hand gestures are less intrusive and more convenient for exploring 3D VEs (Wu and Huang, 2001).
To use the human hand as a natural human-computer interface, data gloves such as the CyberGlove from Immersion Corporation have been used to capture human hand motions (Chen et al., 2005; Metais and Georganas, 2004; Yang et al., 1994). With attached sensors, the joint angles and spatial positions of the hand can be measured directly from the glove. However, the data glove, with its attached wireless components, is cumbersome and awkward for users to wear; moreover, the cost of a data glove is often too high for regular users. Vision-based hand gesture recognition can be a feasible and efficient alternative for human-computer interaction, especially for applications in 3D VEs (Wu and Huang, 2001). With video cameras as the input device, hand movements and gestures can be captured and analysed using different image features and hand models. Many existing approaches for vision-based hand gesture recognition need the help of markers or coloured gloves to make hand detection and tracking easier (Joslin et al., 2005; Keskin et al., 2003). In this paper, we focus on tracking the bare hand directly and recognising hand gestures without the help of any markers or gloves.

Learning objects and the virtual environment

Learning objects are entities that are generally suitable for learning, education and training in contexts such as mathematics, engineering, technology and health science (IEEE Learning Technology Standards Committee, 2002). Learning object metadata comprises a set of standardised elements for searching, managing and retrieving learning objects. With the advancement of Internet and computing technologies, learning resources are now easy to share and reuse. Learning object repositories can store these learning resources as well as their metadata records (Neven and Duval, 2002).

Recently, much research has focused on information visualisation, defined as the use of computer-supported, interactive, visual representations of abstract data to enrich the user's cognitive experience (Card et al., 1999). For the vast volume of learning objects available nowadays, information visualisation schemes can assist in building an interactive construct that establishes a relation between the user and the learning objects stored in the repository. Research shows that visual metaphors such as graphs and charts make abstract numerical information more effective and easier for people to understand (Bauer and Johnson-Laird, 1993; Larkin and Simon, 1987). These visual metaphors can motivate people, improve memory and focus the attention of the learner. To exploit the advantages brought by visual metaphors, appropriate information visualisation tools need to be selected to represent the abstract data efficiently.

To facilitate this information transformation process, we have adopted a 3D visualisation scheme that uses a 3D VE layout. The layout provides an attractive and large display space as well as the natural and cognitive benefits of visualising more information at a time (Cellary et al., 2004). Furthermore, a visually organised representation of the information allows users to gain insight into the data, interact with it directly, draw conclusions and come up with new hypotheses. Its target is not only to reinforce the traditional presentation concept but also to open up multiple avenues that foster a better understanding of the information presented, based on preferences and contexts. Meanwhile, the learning experience can also be enhanced by presenting a game-like user avatar model to engage the learner. We present a 3D gaming metaphor for visualising search results in a VE; gaming is one of the most effective ways of teaching complex scenarios while keeping users engaged in the searching and learning process.
A peer-to-peer network architecture is used to tie together all the components of our framework (see Figure 2). With this architecture, the learner's experience can be facilitated by sharing, searching and browsing interesting learning objects. The framework adopts an algorithm that groups the retrieved learning object metadata and clusters these metadata along the highways in the 3D VE. The framework offers several perspectives on the extracted information and enables learners to perceive information along many dimensions at a time. A 'divide and conquer' strategy is used so that the overall system can be decomposed into individual components, some of which may be optional to implement for certain users. The goal of mapping the information into a 3D metaphor is to allow the user to perceive the information and find related resources in an intuitive and entertaining manner by navigating the avatar car along the virtual highways in the VE.

Figure 2. The peer-to-peer searching environment: peers can register with the group's address mapping peer and voluntarily serve as a search service peer. The employed framework allows other institutions to use the provided services.

To promote the 'share and reuse' of multimedia learning materials, the peer-to-peer network is logically categorised into three main types of peer: user peer, address mapping peer and search service peer. Any peer requesting services is termed a user peer. A user peer sends its search keywords to the address mapping peer of a particular group Ĝ. The address mapping peer applies some procedures and returns the path information. The user peer then uses this information to send the search keywords to the search service peers of Ĝ. A search service peer allows registered users to search and retrieve learning object metadata. The shared information from this peer can be used to access the content of standard learning object repositories and to browse the 3D VE. The distributed processes combine their computational power to respond to the search queries. The discovery peers are responsible for producing reliable peer addresses that can serve search requests. The search service peers are grouped according to the types of request they can handle. Whenever a peer wants to provide service to a peer group, it registers itself with the address mapping peer of that group by providing information on how its service will be mapped.

As depicted in Figure 3, to access a search service from a group Ĝ, the user peer first sends its search keywords to the address mapping peer of Ĝ. The discovery system then uses heuristics to map the keywords to the address of the relevant search service peer of Ĝ and includes the authentication information. By altering the optional flags, the user peer can request address rediscovery so that the returned search service addresses are validated before being sent. Peer address encoding and XML conversion are two further optional services that the peer can request.
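To make the request flow concrete, the following is a minimal Java sketch of the user peer's side of this exchange. The interfaces, class names and message types are hypothetical illustrations (the paper does not specify its implementation); only the three peer roles and the two-step keyword lookup come from the description above.

```java
import java.util.List;

/** Hypothetical view of the peer roles described above. */
interface AddressMappingPeer {
    /** Maps search keywords to the addresses of suitable search service peers of this group. */
    List<String> resolveSearchPeers(List<String> keywords, boolean requestRediscovery);
}

interface SearchServicePeer {
    /** Returns learning object metadata records (as XML strings) for the given keywords. */
    List<String> search(List<String> keywords);
}

/** A user peer performing the two-step lookup: address mapping first, then the search itself. */
class UserPeer {
    private final AddressMappingPeer mappingPeer;   // address mapping peer of group Ĝ

    UserPeer(AddressMappingPeer mappingPeer) { this.mappingPeer = mappingPeer; }

    List<String> query(List<String> keywords, java.util.function.Function<String, SearchServicePeer> connect) {
        // Step 1: ask the group's address mapping peer which search service peers can handle the keywords.
        List<String> peerAddresses = mappingPeer.resolveSearchPeers(keywords, false);

        // Step 2: forward the keywords to each returned search service peer and collect the XML metadata.
        List<String> xmlResults = new java.util.ArrayList<>();
        for (String address : peerAddresses) {
            xmlResults.addAll(connect.apply(address).search(keywords));
        }
        return xmlResults;
    }
}
```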
The address mapping peers employ lexical keyword sense mapping to find subject matter relevant to the search keywords. The process is inspired by current psycholinguistic theories of human lexical memory (Cognitive Science Laboratory, Princeton University, 2006). English nouns, verbs, adjectives and adverbs are organised into synonym sets, which represent underlying lexical concepts. The hyponym relations are used to find links between the synonym sets, and the primary focus is on nouns and verbs.

Figure 3. Different functional components of a discovery peer.

The 3D VE user interface takes the query and sends it to the search service discovery module and the keyword sense-mapping module. Figure 4 shows the information visualisation architecture. A search is initiated using all possible senses of the search keywords. The search query is sent to the search service peers, which return the XML learning object metadata. The obtained information is then grouped using an algorithm that considers the keyword senses. The information visualisation engine uses these groups and maps them into the 3D VE.

Figure 4. The architecture of information visualisation.

The Java platform and the Java bindings of the OpenGL library are used for the development of the VE model shown in Figure 5. Learning object metadata are grouped along virtual roads, and each metadata record is represented as a 3D traffic sign. The text and icon of the traffic sign describe the content of the learning object metadata. The user is represented by the avatar car, and the world layout shows the current position of the avatar car in the VE.

Figure 5. The VE model: search results are grouped along the virtual highways according to the keywords and are associated with different traffic signs.
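The paper does not give implementation details for this mapping, so the fragment below is only a hypothetical illustration of the grouping-and-placement step (class names, field names and the spacing constant are invented): metadata records matched under the same keyword sense are lined up as signs along one virtual highway, and the actual 3D rendering would be done elsewhere with the OpenGL bindings.

```java
import java.util.*;

/** Hypothetical metadata record; the title would appear as the text of the 3D traffic sign. */
class LearningObjectMetadata {
    String title;
    String keywordSense;   // the keyword sense this record was matched under
    LearningObjectMetadata(String title, String keywordSense) { this.title = title; this.keywordSense = keywordSense; }
}

class TrafficSignLayout {
    static final double SIGN_SPACING = 10.0;   // assumed distance between consecutive signs along a road

    /** Groups metadata by keyword sense (one virtual highway per sense) and assigns each
        record a slot position along its highway, where it would be drawn as a 3D traffic sign. */
    static Map<String, List<double[]>> layout(List<LearningObjectMetadata> records) {
        Map<String, List<double[]>> signsPerHighway = new LinkedHashMap<>();
        Map<String, Integer> nextSlot = new HashMap<>();
        for (LearningObjectMetadata m : records) {
            int slot = nextSlot.merge(m.keywordSense, 1, Integer::sum) - 1;
            // Position = (distance along the highway, lateral offset from the road centre).
            signsPerHighway.computeIfAbsent(m.keywordSense, k -> new ArrayList<>())
                           .add(new double[]{slot * SIGN_SPACING, 0.0});
        }
        return signsPerHighway;
    }
}
```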
Virtual environment navigation by hand gestures

As the user is represented by the avatar car in the VE, we implemented a vision-based hand gesture recognition system to navigate the avatar car using a set of hand gesture commands. To use the human hand as an HCI device for VE applications, the hand gesture recognition system must meet requirements in terms of real-time performance, accuracy and robustness.

Vision-based hand gesture recognition techniques can be grouped into two categories: 3D hand model-based approaches and appearance-based approaches (Zhou and Huang, 2003). 3D hand model-based approaches employ an estimation-by-synthesis strategy: they recover the hand parameters by aligning the appearance projected by the 3D hand model with the observed images and minimising the discrepancy between them (Imai et al., 2004). Generally speaking, 3D hand model-based approaches offer a rich description that potentially allows a wide class of hand gestures. However, as 3D hand models are articulated deformable objects with many degrees of freedom, a very large image database is required to cover all the characteristic shapes under different views, and matching the query image against all images in the database is time-consuming and computationally expensive. The appearance-based approach is based on direct registration of hand gestures with 2D image features such as skin colour, hand shape/contour, or a combination of these features. Compared with 3D hand model-based approaches, appearance-based approaches use a simpler model, and therefore real-time performance is easier to achieve.

Originally, for the task of face detection and tracking, Viola and Jones (2001a, 2001b) employed a statistical approach to handle the large variety of instances of human faces. In their algorithm, the concept of the integral image is used to compute a rich set of image features. Compared with other approaches, which must operate on multiple image scales, the integral image achieves true scale invariance by eliminating the need to compute a multi-scale image pyramid, and it significantly reduces the initial image processing time. Another technique used by this approach is feature selection based on the AdaBoost learning algorithm. Boosting is an aggressive and effective feature selection technique that can improve the accuracy of a given learning algorithm. The AdaBoost learning algorithm is a variation of the regular boosting algorithm that adaptively selects the best feature at each step and combines the selected features into a strong classifier. The Viola and Jones algorithm has primarily been used for face detection, where it is approximately 15 times faster than previous approaches while achieving accuracy equivalent to the best published results.

For hand gestures, generally speaking, reproducibility under practical conditions is very poor, due to the many degrees of freedom of the human hand as well as the difficulty of duplicating the same working environment, such as background and lighting conditions. In these situations, a statistical approach can be employed to attack the reproducibility problem. Statistical model-based training algorithms take a set of 'positive' samples, which contain the object of interest (in our case, the human hand), and a set of 'negative' samples, i.e., images that do not contain the object of interest (Bradski et al., 2005). During the training process, distinctive features are selected to classify the images containing the object. When the trained classifier misses an object or detects a false object, adjustments can easily be made by adding corresponding positive or negative samples to the training set.

Simple Haar-like features (so called because they are computed similarly to the coefficients in the Haar wavelet transform) are used in the Viola and Jones algorithm. There are two motivations for employing Haar-like features rather than raw pixel values. The first is that Haar-like features can encode ad-hoc domain knowledge, which is difficult to describe with a finite quantity of training data. Compared with raw pixels, Haar-like features can efficiently reduce the in-class variability and increase the out-of-class variability, thus making classification easier (Lienhart and Maydt, 2002). A Haar-like feature describes the ratio between the dark and bright areas within a kernel. One typical example is that the eye region of the human face is darker than the cheek region, and a single Haar-like feature can efficiently capture that characteristic. The second motivation is that a Haar-like feature-based system can operate much faster than a pixel-based system thanks to the concept of the 'integral image'. Besides the above advantages, Haar-like features are also relatively robust to noise and lighting changes because they compute the grey-level difference between the white and black rectangles: noise and lighting variations affect the pixel values over the whole feature area, so their influence can be counteracted.

Each Haar-like feature is described by a template (a small set of black and white rectangles), its coordinates relative to the origin of the search window, and the size of the feature. Figure 6 shows the extended Haar-like feature set proposed by Lienhart and Maydt (2002). The value of a Haar-like feature is the difference between the sums of the grey-level values within the black and the white rectangular regions. The concept of the integral image is used to compute the Haar-like features containing upright rectangles (Viola and Jones, 2001a, 2001b). The integral image at pixel (x, y) contains the sum of the pixel values above and to the left of this pixel, inclusive (see Figure 7(a)):

$$P(x, y) = \sum_{x' \le x,\; y' \le y} p(x', y')$$

According to this definition, the sum of the grey-level values within area 'D' in Figure 7(b) can be computed as $P_1 + P_4 - P_2 - P_3$.

Figure 6. The extended set of Haar-like features.

Figure 7. The concept of the 'integral image'.
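As an illustration of how the integral image makes these rectangle sums cheap, here is a small self-contained Java sketch (not the authors' code): it builds the integral image of a grey-level image and then evaluates a simple two-rectangle Haar-like feature as the difference of two box sums, each obtained from four table look-ups.

```java
/** Minimal integral-image sketch: builds P and evaluates box sums in O(1) per rectangle. */
class IntegralImage {
    private final long[][] sum; // sum[y][x] = sum of pixels in rows 0..y-1, cols 0..x-1 (zero-padded border)

    IntegralImage(int[][] gray) {
        int h = gray.length, w = gray[0].length;
        sum = new long[h + 1][w + 1];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                sum[y + 1][x + 1] = gray[y][x] + sum[y][x + 1] + sum[y + 1][x] - sum[y][x];
    }

    /** Sum of the pixels in the upright rectangle with top-left corner (x, y), width w and height h:
        four look-ups, independent of the rectangle size. */
    long boxSum(int x, int y, int w, int h) {
        return sum[y + h][x + w] - sum[y][x + w] - sum[y + h][x] + sum[y][x];
    }

    /** Value of a simple two-rectangle Haar-like feature: a bright (white) rectangle on top of an
        equally sized dark (black) rectangle, as used for horizontal edge-like patterns. */
    long haarEdgeFeature(int x, int y, int w, int h) {
        long white = boxSum(x, y, w, h / 2);
        long black = boxSum(x, y + h / 2, w, h / 2);
        return white - black;
    }
}
```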
For the Haar-like features containing 45° rotated rectangles, the concept of the Rotated Summed Area Table (RSAT) was introduced by Lienhart and Maydt (2002). The RSAT is defined as the sum of the pixels of a rotated rectangle with its bottom-most corner at pixel (x, y), extending upwards to the boundaries of the image, as illustrated in Figure 8(a):

$$R(x, y) = \sum_{y' \le y,\; y' \le y - |x - x'|} p(x', y')$$

Figure 8. The concept of the Rotated Summed Area Table (RSAT).

According to the definition of the RSAT, the sum of the grey-level values within area 'D' in Figure 8(b) can be computed as $R_1 + R_4 - R_2 - R_3$.

To detect an object of interest, the image is scanned by a sub-window containing a specific Haar-like feature (see the face detection example in Figure 9). Based on each Haar-like feature $f_j$, a corresponding weak classifier $h_j(x)$ is defined by

$$h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \theta_j \\ 0 & \text{otherwise} \end{cases}$$

where $x$ is a sub-window, $\theta_j$ is a threshold and $p_j$ indicates the direction of the inequality sign.

Figure 9. Detecting a face with a sub-window containing a Haar-like feature.

In machine learning, it is very difficult to find a single, highly accurate classification rule from a training set. However, it is not hard to find rules of thumb whose classification accuracy is just slightly better than random guessing; we call these rules of thumb 'weak classifiers'. Boosting is a general method for improving the accuracy of a given learning algorithm stage by stage, based on a series of weak classifiers (Freund and Schapire, 1997). At each stage, a weak classifier is trained on the training set. The trained weak classifier is then added to the learned function with a strength parameter proportional to its accuracy. Each training sample is then re-weighted: samples missed by the current weak classifier are 'boosted' in importance, so that the next weak classifier will attempt to fix the errors made by the current one.

The AdaBoost learning algorithm introduced by Freund and Schapire (1999) solved many practical difficulties of the earlier boosting algorithms. In the Viola and Jones algorithm, a variant of AdaBoost is employed both to select the features and to train the classifiers. The AdaBoost learning algorithm initially maintains a uniform distribution of weights over the training samples (in our case, the hand gesture images). In the first iteration, the algorithm trains a weak classifier using the one Haar-like feature that achieves the best recognition performance on the training samples. In the second iteration, the training samples that were misclassified by the first weak classifier receive higher weights, so that the newly selected Haar-like feature must focus more effort on these misclassified samples. The iterations continue, and the final result is a cascade of linear combinations of the selected weak classifiers, i.e., a strong classifier that achieves the required accuracy (see Figure 10).
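The following is a compact, simplified discrete-AdaBoost sketch of this feature-selection loop, assuming a precomputed table of Haar-like feature values per training sample (featureValues[j][i] is the value of feature j on sample i, and labels[i] is 1 for hand samples and 0 for non-hand samples). It illustrates the idea rather than reproducing the exact Viola–Jones training procedure.

```java
import java.util.Arrays;

/** Simplified discrete-AdaBoost feature selection sketch (illustration only, not the paper's code). */
class AdaBoostSketch {
    static class WeakClassifier { int feature; double threshold; int polarity; double alpha; }

    static WeakClassifier[] train(double[][] featureValues, int[] labels, int rounds) {
        int n = labels.length;
        double[] w = new double[n];
        Arrays.fill(w, 1.0 / n);                      // start with a uniform weight distribution
        WeakClassifier[] strong = new WeakClassifier[rounds];

        for (int t = 0; t < rounds; t++) {
            WeakClassifier best = null;
            double bestError = Double.MAX_VALUE;
            // Pick the single Haar-like feature whose thresholded decision has the lowest weighted error.
            for (int j = 0; j < featureValues.length; j++) {
                WeakClassifier c = bestThresholdFor(featureValues[j], labels, w);
                double err = weightedError(c, featureValues[j], labels, w);
                if (err < bestError) { bestError = err; best = c; best.feature = j; }
            }
            best.alpha = 0.5 * Math.log((1 - bestError) / Math.max(bestError, 1e-10)); // classifier strength
            strong[t] = best;

            // Re-weight: samples the chosen weak classifier gets wrong become more important.
            double norm = 0;
            for (int i = 0; i < n; i++) {
                int predicted = predict(best, featureValues[best.feature][i]);
                w[i] *= Math.exp(predicted == labels[i] ? -best.alpha : best.alpha);
                norm += w[i];
            }
            for (int i = 0; i < n; i++) w[i] /= norm;
        }
        return strong;
    }

    /** h_j(x) = 1 if p_j * f_j(x) < p_j * theta_j, else 0. */
    static int predict(WeakClassifier c, double f) {
        return (c.polarity * f < c.polarity * c.threshold) ? 1 : 0;
    }

    static double weightedError(WeakClassifier c, double[] f, int[] labels, double[] w) {
        double err = 0;
        for (int i = 0; i < f.length; i++) if (predict(c, f[i]) != labels[i]) err += w[i];
        return err;
    }

    /** Finds the threshold and polarity minimising the weighted error for one feature. */
    static WeakClassifier bestThresholdFor(double[] f, int[] labels, double[] w) {
        WeakClassifier best = new WeakClassifier();
        double bestErr = Double.MAX_VALUE;
        for (double candidate : f) {                  // candidate thresholds: the observed feature values
            for (int polarity : new int[]{1, -1}) {
                WeakClassifier c = new WeakClassifier();
                c.threshold = candidate; c.polarity = polarity;
                double err = weightedError(c, f, labels, w);
                if (err < bestErr) { bestErr = err; best = c; }
            }
        }
        return best;
    }
}
```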
In practical implementation, the attentional cascade is employed to speed up the Viola and Jones algorithm. In the first stage of the training process, the threshold of the weak classifier is set low enough that 100% of the target objects can be detected (Viola and Jones, 2001a, 2001b). The trade-off of a low threshold is that a higher false-positive detection rate accompanies the 100% true-positive detection rate. A positive result from the first classifier triggers the evaluation of a second classifier, which has also been adjusted to achieve a very high detection rate; a positive result from the second classifier triggers a third classifier, and so on. To be detected by the trained cascade, a positive sub-window must pass every stage of the cascade, and a negative outcome at any point leads to the immediate rejection of the sub-window (see Figure 11).

Figure 10. The description of the AdaBoost learning algorithm.

Figure 11. Detection of positive sub-windows using the trained cascade.

The reason for this strategy is that the majority of sub-windows within a single image are negative, and it is a rare event for a positive instance to pass through all of the stages. With this strategy, the cascade can significantly speed up processing: the initial weak classifiers reject as many negative instances as possible, and more computation is focused on the more difficult sub-windows that pass the scrutiny of the initial stages of the cascade.
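A minimal sketch of this early-rejection idea is shown below, reusing the WeakClassifier type from the AdaBoost sketch above. It illustrates the cascade structure only, not the classifiers trained in the experiments: a sub-window is accepted only if every stage's weighted sum of weak-classifier votes clears that stage's threshold, so most sub-windows are discarded after the first cheap stages.

```java
/** Attentional cascade sketch: each stage is a boosted set of weak classifiers with its own threshold. */
class CascadeSketch {
    static class Stage {
        AdaBoostSketch.WeakClassifier[] weak;  // weak classifiers selected for this stage
        double stageThreshold;                 // set low during training so almost no positives are lost
    }

    /** Returns true only if the sub-window passes every stage; rejects as soon as one stage fails. */
    static boolean accept(Stage[] cascade, double[] featureValuesOfSubWindow) {
        for (Stage stage : cascade) {
            double score = 0;
            for (AdaBoostSketch.WeakClassifier c : stage.weak) {
                score += c.alpha * AdaBoostSketch.predict(c, featureValuesOfSubWindow[c.feature]);
            }
            if (score < stage.stageThreshold) {
                return false;   // early rejection: no further stages are evaluated for this sub-window
            }
        }
        return true;            // survived all stages: report a detection at this sub-window
    }
}
```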
To control the avatar car, we defined three different hand gestures, shown in Figure 12. The 'palm' gesture activates the car to move forward, the 'two-fingers' gesture turns the car to the left, and the 'little finger' gesture turns the car to the right. The camera used for video input in our experiment is a low-cost Logitech QuickCam web camera, which provides video capture at a maximum resolution of 640 × 480 at up to 15 frames per second. For the experiment, we set the camera to 320 × 240 at 15 frames per second.

Figure 12. Three different hand gestures used to move the avatar car: (a) two-fingers; (b) palm; (c) little finger.

We collected the training samples from one user's hand for the preliminary experiment. The experiments and testing were carried out in the laboratory under natural fluorescent lighting. To vary the illumination condition, we installed an extra incandescent light bulb to create a tungsten lighting condition. For the 'two-fingers', 'palm' and 'little finger' gestures, we collected 480, 412 and 420 positive samples respectively, at different scales. To increase the robustness of the classifiers, we purposely included a number of positive samples with certain in-plane and out-of-plane rotations. Figure 13 shows some positive samples of the 'two-fingers' gesture. To simplify the task at the initial stage of the experiment, we kept a white wall as the background for testing. The hand area in each positive sample is cropped out and scaled to a unified resolution of 15 × 30 pixels.

Figure 13. Part of the 'two-fingers' positive samples used in the training.

We collected 500 negative samples for the training process. These samples are random images that do not contain the hand gestures. Figure 14 shows some negative samples used in the training process. All negative samples are listed in a background description file, which is a text file containing the filenames (relative to the directory of the description file) of all negative sample images.

Figure 14. Part of the negative samples used in the training process.

With all of the positive and negative samples ready, we set the required false alarm rate at 1 × 10⁻⁶ to terminate the training, which means the classifier meets the accuracy requirement when no more than one out of a million negative sub-windows is mistakenly detected as positive. A 15-stage cascade is obtained for the 'two-fingers' gesture with the AdaBoost learning algorithm. When the final required false alarm rate of 1 × 10⁻⁶ is reached, the true-positive detection rate of the final classifier is 97.5% (468 out of 480). For the 'palm' gesture and the 'little finger' gesture, 10-stage and 14-stage cascades are obtained with the 412 and 420 positive samples; the true-positive detection rates are 98% and 97.1%, respectively. To evaluate the performance of the obtained classifiers, 100 testing images per gesture were collected with similar backgrounds but different illumination conditions. Table 1 shows the performance of the three trained classifiers, and Figure 15 shows some of the detection results for the 'two-fingers' gesture.

Table 1. The detection results of the trained classifiers.

Gesture name    Hits (%)   Missed (%)   False detections   Time (s)
Two-fingers     100        0            29                 3.049
Palm            90         10           –                  1.869
Little finger   93         7            –                  2.452

Figure 15. Some of the detection results of the trained 'two-fingers' classifier.

By analysing the detection results, we found that some of the missed positives are caused by excessive in-plane rotation. The majority of the false detections occur in very small areas, which have a higher probability of containing colour patterns similar enough to be picked up by the selected Haar-like features; these small false-detection boxes can easily be eliminated by defining a size threshold. The maximum time required to process the 100 testing images, measured for the 'two-fingers' classifier, is 3.049 s; the times required by the other classifiers are shorter (see Table 1). We tested the real-time performance with live input from the Logitech QuickCam web camera at 15 frames per second and a resolution of 320 × 240, and there was no detectable pause or latency in tracking and detecting the hand gestures with any of our trained classifiers. The trained classifiers showed a certain degree of robustness against in-plane rotation as long as the rotation stayed within ±15°. The classifiers also showed a certain degree of tolerance to out-of-plane rotation and very good robustness against lighting variation.

With this detection speed, we implemented a parallel cascade structure to classify the different gestures (see Figure 16). In this structure, multiple cascades are loaded into the system simultaneously, and each cascade is responsible for a single hand gesture. Rectangles of different colours are used by the program to indicate which gesture is detected. Based on the experimental results, we found that the real-time performance of the classification is not impaired when we load all three trained cascade classifiers and the virtual environment at the same time, and no confusion among our gesture commands was detected, as illustrated in Figure 17.

Figure 16. The parallel cascade structure for hand gesture classification.

Figure 17. The recognition result with the parallel cascade structure.

The Java Native Interface (JNI) framework is used to integrate our C-based gesture recognition component with our Java-based virtual environment. With this framework, Java native methods can call applications and libraries written in C/C++.
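A minimal sketch of what such a bridge could look like on the Java side is given below; the library name, method signature and gesture codes are hypothetical (the paper does not list them), but the pattern of declaring a native method, loading the C/C++ library and mapping recognised gestures to avatar-car commands follows the integration described above.

```java
/** Hypothetical Java-side JNI bridge between the C-based recogniser and the Java-based VE. */
class GestureBridge {
    // Gesture codes assumed for illustration; the real component defines its own encoding.
    static final int GESTURE_NONE = 0, GESTURE_PALM = 1, GESTURE_TWO_FINGERS = 2, GESTURE_LITTLE_FINGER = 3;

    static {
        System.loadLibrary("gesturerecognition"); // loads the native C/C++ library (name assumed)
    }

    /** Implemented in C/C++: grabs the current camera frame, runs the three cascades in parallel and
        returns the code of the detected gesture (or GESTURE_NONE). */
    public static native int detectGesture();

    /** Called once per rendering frame by the VE loop to steer the avatar car. */
    static void applyGestureToCar(AvatarCar car) {
        switch (detectGesture()) {
            case GESTURE_PALM:          car.moveForward(); break; // 'palm' drives the car forward
            case GESTURE_TWO_FINGERS:   car.turnLeft();    break; // 'two-fingers' turns left
            case GESTURE_LITTLE_FINGER: car.turnRight();   break; // 'little finger' turns right
            default: /* no gesture detected: keep the current state */ break;
        }
    }
}

/** Stand-in for the VE's car object, included only so the sketch is self-contained. */
class AvatarCar {
    void moveForward() { /* update position along the highway */ }
    void turnLeft()    { /* steer left  */ }
    void turnRight()   { /* steer right */ }
}
```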
We tested the navigation of the VE with the proposed gesture commands (see Figure 18). Compared with controlling the avatar car with the arrow keys on the keyboard, using hand gestures to control the movement of the avatar car is more intuitive and engaging for users.

Figure 18. Navigating the learning object repository with the proposed gesture commands.

Conclusions

In this paper, a novel vision-based hand gesture interface for navigating a learning object repository mapped into a 3D VE is designed and implemented. To take advantage of the benefits that visual metaphors bring to presenting information to learners, a game-motivated 3D highway metaphor has been presented to map the learning object repository. This 3D virtual environment considers the semantic meaning of the query and applies context-based peer address mapping to search the metadata. The 3D VE displays these interactive constructs, visually presents the relationships that exist among the sets of information, and provides an intuitive and entertaining way of searching and browsing information. With the vision-based hand gesture interface, the user is able to navigate the 3D VE and access the learning objects by controlling the movements of the avatar car with a set of hand gesture commands. Three different hand gestures are recognised by cascade classifiers trained with the AdaBoost learning algorithm and Haar-like features under constrained background conditions. The experimental results indicate that the trained cascade classifiers achieve a detection accuracy of no less than 90% for our three hand gesture commands. A parallel cascade structure is implemented for the task of gesture recognition; no confusion among the gestures was detected, and the real-time performance is satisfactory. Compared with traditional HCI devices such as keyboards and mice, this vision-based hand gesture interface provides more natural interaction and more entertainment for the user.

References

Bauer, M. and Johnson-Laird, P. (1993) 'How diagrams can improve reasoning', Psychological Science, Vol. 4, No. 6, pp.372–378.

Bradski, G., Kaehler, A. and Pisarevsky, V. (2005) 'Learning-based computer vision with Intel's open source computer vision library', Intel Technology Journal, Vol. 9, No. 2, pp.119–130.

Card, S.K., Mackinlay, J.D. and Shneiderman, B. (1999) Readings in Information Visualization: Using Vision to Think, Morgan Kaufmann Publishers, New York.

Cellary, W., Wiza, W. and Walczak, K. (2004) 'Visualizing web search results in 3D', IEEE Computer, Vol. 37, No. 5, pp.87–89.

Chen, Q., El-Sawah, A., Joslin, C. and Georganas, N.D. (2005) 'A dynamic gesture interface for VE based on hidden Markov models', Proc. IEEE International Workshop on Haptic, Audio and Visual Environments and their Applications (HAVE 2005), pp.110–115.

Cognitive Science Laboratory, Princeton University (2006) WordNet, a Lexical Database for the English Language, http://wordnet.princeton.edu

Freund, Y. and Schapire, R.E. (1997) 'A decision-theoretic generalization of on-line learning and an application to boosting', Journal of Computer and System Sciences, Vol. 55, No. 1, pp.119–139.

Freund, Y. and Schapire, R.E. (1999) 'A short introduction to boosting', Journal of Japanese Society for Artificial Intelligence, Vol. 14, No. 5, pp.771–780.

IEEE Learning Technology Standards Committee (2002) IEEE Learning Object Metadata, Final Draft Standard, IEEE 1484.12.1, http://ieeeltsc.org/wg12LOM/
Imai, A., Shimada, N. and Shirai, Y. (2004) '3-D hand posture recognition by training contour variation', Proc. 6th IEEE International Conference on Automatic Face and Gesture Recognition, Seoul, Korea, pp.895–900.

Joslin, C., El-Sawah, A., Chen, Q. and Georganas, N.D. (2005) 'Dynamic gesture recognition', Proc. IEEE Instrumentation and Measurement Technology Conference (IMTC 2005), Ottawa, Canada.

Keskin, C., Erkan, A. and Akarun, L. (2003) 'Real time hand tracking and gesture recognition for interactive interfaces using HMM', Joint International Conference ICANN-ICONIP 2003, Istanbul, Turkey.

Kirishima, T., Sato, K. and Chihara, K. (2005) 'Real-time gesture recognition by learning and selective control of visual interest points', IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 27, No. 3, pp.351–364.

Klerkx, J., Duval, E. and Meire, M. (2004) 'Using information visualization for accessing learning object repositories', Proc. 8th Int'l Conf. Information Visualization (IV'04), London, UK, pp.465–470.

Larkin, J. and Simon, H. (1987) 'Why a diagram is (sometimes) worth ten thousand words', Cognitive Science, Vol. 11, No. 1, pp.65–99.

Lienhart, R. and Maydt, J. (2002) 'An extended set of Haar-like features for rapid object detection', Proc. IEEE International Conference on Image Processing (ICIP), Vol. 1, New York, USA, pp.900–903.

Metais, T. and Georganas, N.D. (2004) 'A glove gesture interface', Proc. 23rd Biennial Symposium on Communications, Kingston, Ontario, Canada.

Neven, F. and Duval, E. (2002) 'Reusable learning objects: a survey of LOM-based repositories', Proc. 10th ACM Int'l Conf. Multimedia, Juan les Pins, France, pp.291–294.

Pavlovic, V., Sharma, R. and Huang, T. (1996) 'Gestural interface to a visual computing environment for molecular biologists', Proc. Second International Conference on Automatic Face and Gesture Recognition, Killington, Vermont, USA, pp.30–35.

Turk, M. (2001) 'Gesture recognition', in Handbook of Virtual Environment Technology, Lawrence Erlbaum Associates, Inc., Mahwah, NJ, USA.

Viola, P. and Jones, M. (2001a) 'Rapid object detection using a boosted cascade of simple features', Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA, pp.511–518.

Viola, P. and Jones, M. (2001b) 'Robust real-time object detection', Cambridge Research Laboratory Technical Report Series CRL 2001/01, pp.1–24.

Wu, Y. and Huang, T. (2001) 'Hand modeling, analysis and recognition for vision-based human computer interaction', IEEE Signal Processing Magazine, Special Issue on Immersive Interactive Technology, Vol. 18, No. 3, pp.51–60.

Yang, J., Xu, Y. and Chen, C.S. (1994) 'Gesture interface: modeling and learning', Proc. IEEE International Conference on Robotics and Automation, Vol. 2, San Diego, CA, USA, pp.1747–1752.

Zhou, H. and Huang, T. (2003) 'Tracking articulated hand motion with eigen dynamics analysis', Proc. International Conference on Computer Vision, Beijing, China, pp.1102–1109.