Alistair D N Edwards
Department of Computer Science, University of York,
Heslington, York, England, YO1 5DD,
alistair at cs.york.ac.uk
I. Wachsmuth and M. Fröhlich (eds.) Gesture and Sign Language in Human-Computer Interaction. Bielefeld, Germany: Springer. pp. 13-21.
Abstract. The automatic recognition of sign language is an attractive prospect; the technology exists to make it possible, while the potential applications are exciting and worthwhile. To date the research emphasis has been on the capture and classification of the gestures of sign language and progress in that work is reported. However, it is suggested that there are some greater, broader research questions to be addressed before full sign language recognition is achieved. The main areas to be addressed are sign language representation (grammars) and facial expression recognition.
Keywords. Sign language, gesture, facial expression recognition, feature-based grammars, richly grounding symbols.
The subtitle of the 1997 Gestures Workshop was Gesture- and sign-language based communication in human-computer interaction. The combination of gesture and sign was deliberate; it was agreed in discussions at the first Gesture Workshop, GW '96 [1], that the two forms of interaction have sufficient commonality to belong in the same workshop, but that if the link were not made explicitly, people working in these fields might not appreciate the connection. In particular, sign-language users generally do not consider what they do to be gesturing. For this and other reasons, any discussion of these forms of interaction inevitably attracts people from a variety of disciplines - who may have gaps in their background knowledge. Such heterogeneity is perhaps a symptom of an area of research which has not yet developed into a `field' (Wexelblat, this volume).
The purpose of this paper is to review sign language recognition. For those working directly in that field some of the review may not be new, but I will also state some personal views about the likely progress of the field.
Sign language apart, gestural interaction falls into three categories:
These categories are not always hard and fast. For instance Böhme et al (this volume) have devised synthetic gestures to control a robot, but they have tried to design them such that they are natural, everyday gestures, which can be interpreted in the context of the task. In other words, the gestures have a readily inferable meaning (see below).
The essential difference between sign language and all the above forms of interaction is that it conveys rich and precise meaning - as rich and precise as spoken language. That is to say that a signer can in effect convey any meaning capable of expression in a natural language. A limitation is that the number of people who learn the skill of doing this is quite small. To a large extent it is confined to those people who need to use sign language because they cannot (easily) communicate in a spoken form; that is to say, mostly people who are deaf. Another implication of the communicative power of sign language is that it must have properties of syntax and semantics. This implies that it will lend itself to study and formalization. However, the truth is that, in the current state of knowledge, little is known and understood about the form of the syntax and semantics of sign languages. This is a point I will return to later in this paper.
Fels & Hinton [3] tackled the problem of sign language recognition by inventing a new language, Glove-Talk. In other words, their language was composed of synthetic gestures. Obviously its drawback was that it was not a language that anyone used. (Evidently their motivation was not to create a new language that people would use, but simply to investigate the feasibility of sign language recognition.)
Later, in Section 6, I will discuss the role of richly grounding meaning in sign language [4], but it is appropriate at this point to mention one of the related concepts, that of readily inferable meaning (RIM). Symbols (including gestures) have more- or less-inferable meanings. That is to say that a symbol with a high degree of RIM needs little or no explanation. Somebody seeing the symbol for the first time can infer all or most of its intended meaning without explanation from another party. This contrasts with arbitrary conventional symbols (ACSs) which have no natural mapping to their meaning. A written word is a good example of an ACS; there is no natural mapping (for instance) between the symbol apple and the fruit to which it refers. The alternative symbol grape would be just as valid a symbol if it were not contrary to the convention.[footnote 1]
The assertion is that another difference between sign language and natural gesture is that the latter contains a high degree of RIM. It has been found that coverbal gesture styles are highly idiosyncratic to the gesturer and yet they are nevertheless useful and meaningful to listeners. This is not true of sign languages, which are rather more arbitrary - as spoken languages are. This may come as a surprise to some non-sign-language users, who often assume that sign language is largely mimetic - though it becomes quite evident if one observes a signer and tries to guess what they are saying.
There are two principal motivations for the study of automatic sign language recognition. First there is the practical consideration that there are a number of uses of automatic sign language recognition that would be of great human value. A sign-to-speech system (see Section 4) would enable a sign language user to talk to non-signers. Another valuable application is the creation of sign language documents (see [5] and Grobel & Hienz, this volume). Since spoken language is often a second language of deaf people, they find reading conventional text difficult. Documents composed of sign language pictures (their first language) can be more readable, and a sign recognizer would be the natural input device for creating such documents (as the keyboard is for textual languages).
The second reason to study sign language recognition is that it gives a structure to gestural input. There is a clear goal to the research - that of interpreting the exact meaning of the utterances. Yet that goal is a challenging one; it is not possible to `cheat'. For instance a researcher cannot choose to ignore a class of gestures because they are hard to recognize if that class of gestures is included in the sign language.
Most of the effort so far in sign language recognition research has concentrated on the manual gestures. As described in [6] sign language gestures can be divided into four categories, as in Table 1. The most difficult class to recognize is DPDL - but to recognize any real sign language, this class must be included.
Table 1. The four classes of sign language gestures

Class | Description
SPSL  | Static posture, static hand location
DPSL  | Dynamic posture, static hand location
SPDL  | Static posture, dynamic hand location
DPDL  | Dynamic posture, dynamic hand location
Capture of gestures and postures can be done using datawear, position sensors, video analysis, or combinations of these technologies. Datawear (such as the CyberGlove or the TCAS Datasuit) has the advantage that the data obtained is very exact. The disadvantage is that one has to wear clothing that is cumbersome and restrictive. In the long term, video-based approaches - which do not require such clothing - are likely to be more practical. Their problem is that the processing required to get from the visual image to the data required is much more difficult. (A number of papers in this volume and in [1] cover techniques for video analysis and for the use of datawear.)
Signs are in some senses discrete. In order to categorize one sign it is necessary to separate it from those which precede and follow it. A variety of techniques have been used, some more practical than others. For instance, the use of an additional input device - such as a button - to mark the transitions between signs will give a clear signal, but will interfere with fluent signing. Other systems require the signer to pause between signs (rather as most speech input systems require exaggerated pauses between words). The system of Matsuo et al. (this volume) is one example. It might be argued that talking to a speech recognition system with long inter-word pauses is unnatural but not impractical in many applications. However I would suggest that such pauses are not acceptable in sign language recognition - for reasons that are elaborated in Section 6.
The segmentation techniques developed at the University of York seem to offer the possibility of good and natural segmentation. They are based on a hand-tension model and fingertip acceleration. The hand-tension approach is based on the observation that when making an intentional posture, the hand will be in a state of high tension. Therefore, while the hand is in transition between two such postures it moves from one high-tension state to another and must pass through a minimum of tension. The hand model can identify that point and place a segment boundary there. This works with DPSL and DPDL gestures (Table 1). In SPDL gestures the hand posture does not change, so its tension is constant and gives no cue as to the sign boundary. For such gestures, fingertip acceleration can be used instead. As the hand moves (in a fixed posture) from one position to another it must accelerate and decelerate. The point of maximum velocity can be used as the segment boundary. (A similar technique is used by Hofmann & Hommel, this volume.) More details of these segmentation techniques can be found in [6].
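By way of illustration only, a segmenter along these lines might look as follows; the signal names, sampling assumptions and thresholds are mine, not those of the York implementation described in [6].

```python
import numpy as np

def segment_boundaries(tension, fingertip_pos, dt, posture_static, min_gap=5):
    """Return candidate sign-boundary frame indices.

    tension        -- (T,) scalar hand-tension estimate per frame
    fingertip_pos  -- (T, 3) fingertip position per frame
    dt             -- sampling interval in seconds
    posture_static -- True for SPDL-style gestures, where tension gives no cue
    min_gap        -- minimum number of frames between boundaries
    """
    if posture_static:
        # Posture (and hence tension) is constant: use the peak of fingertip
        # speed between two resting positions instead.
        speed = np.linalg.norm(np.diff(fingertip_pos, axis=0), axis=1) / dt
        signal = -speed           # maxima of speed = minima of -speed
    else:
        signal = tension          # minima of tension mark the transition

    boundaries = []
    for i in range(1, len(signal) - 1):
        if signal[i] < signal[i - 1] and signal[i] <= signal[i + 1]:
            if not boundaries or i - boundaries[-1] >= min_gap:
                boundaries.append(i)
    return boundaries
```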
The final requirement for manual gesture recognition is classification. A wide variety of techniques have been applied to this. The most common ones are Hidden Markov Models (HMMs) and artificial neural networks. There are many examples of both of these techniques being used in both this volume and [1].
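As an indication of how the HMM approach fits in, the following sketch scores a quantized observation sequence against one discrete HMM per sign and returns the most likely sign. The model parameters here are placeholders; in practice they would be trained from example data, and the systems cited use their own model structures and features.

```python
import numpy as np

def log_forward(obs, log_pi, log_A, log_B):
    """Log-likelihood of an observation sequence under a discrete HMM.

    obs    -- sequence of observation symbol indices
    log_pi -- (N,) log initial state probabilities
    log_A  -- (N, N) log transition probabilities
    log_B  -- (N, M) log emission probabilities
    """
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # alpha_t(j) = logsum_i [ alpha_{t-1}(i) + log A[i, j] ] + log B[j, o]
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return np.logaddexp.reduce(alpha)

def classify_sign(obs, models):
    """models maps each sign label to its (log_pi, log_A, log_B) triple."""
    return max(models, key=lambda sign: log_forward(obs, *models[sign]))
```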
Evidently much progress is being made in the recognition of the manual gestures of sign language, and hence one might assume that the prospects for achieving automatic sign language recognition are good. However, I have to assert that this is not the case - because recognizing the manual component of sign language is insufficient. There are other components of which most researchers are either unaware or which they have chosen to ignore. The two major missing components are facial expression recognition and sign language grammars. What has been achieved so far based only on gesture recognition is summarized briefly in the next section, but that is followed by a fuller discussion of the broader requirements of facial expression recognition and grammar representation.
Concentrating on the manual aspects of sign languages, a number of projects have achieved some degree of success. Most notable are two Japanese groups, Matsuo et al. (this volume) and the Hitachi Laboratory.
There are a number of related projects under way at Hitachi. There is an interest in translation both from sign to speech and vice versa. These might be combined in a sign language telephone which would enable a deaf signer and a hearing non-signer to communicate [7]. They are exploring facial expression recognition [8, 9], though to date their only attempt to integrate this with sign gesture recognition appears to be confined to using video tracking of the head to measure the relative positions of the face and hands (Ohki, op. cit.).
Details of the techniques used can be found in the above references, as well as [10-12]. For one of their systems, which uses dynamic programming matching, they state a word detection rate of 97.3% when using word patterns of 17 and 25 sentences. One of their other attempts [11, 12] gives similarly high recognition rates for the recognition of samples from a set of 60 sign language morphemes. Such high recognition rates encourage further efforts along these directions as a step towards automatic sign language translation. Nevertheless we must bear in mind Wexelblat's observation that speech recognition is only becoming acceptable with recognition rates very near to 100%; that last few percentage points can be very important.
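For readers unfamiliar with dynamic programming matching, the following is a generic sketch of the idea (dynamic time warping over feature sequences); it is illustrative only and does not reproduce the formulation or features of the Hitachi systems.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two feature sequences.

    seq_a, seq_b -- arrays of shape (Ta, D) and (Tb, D).
    """
    ta, tb = len(seq_a), len(seq_b)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[ta, tb]

def recognize_word(observed, templates):
    """templates maps each sign word to a reference feature sequence."""
    return min(templates, key=lambda w: dtw_distance(observed, templates[w]))
```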
Most researchers seem either to be unaware of, or to have chosen to ignore, the importance of the non-manual components of sign. This is not the case for sign language linguists. For instance, the very first paragraph of [13] includes `...there are many cases where facial expression, head position and movement, and body position and movement are significant in forming signals which carry linguistic information necessary for an understanding of the structure of ASL utterances. The importance of this nonmanual [sic] activity can be appreciated when it is seen that neither word order, subordination, nor relativization can be discussed in depth without reference to nonmanual signals.' (op. cit., p. 1, original emphasis). Of course, it is a conventional approach in development projects to defer tackling difficult problems until the simpler ones have been solved. Thus, one might tackle the basic problems of manual gesture recognition first. However, it is my fear that if the (more difficult) problems, including facial expression recognition, are not faced up to soon, then automatic sign language recognition is in danger of becoming a `footnote in interface history' - as is the danger with other forms of gesture recognition, according to Wexelblat (this volume).
Not all of the meaning of a sign language utterance is contained within the manual gestures; a major additional contribution is carried in the facial expression. For example, [14] (pp. 19-20) shows that the difference in American Sign Language between the statement `The woman forgot the purse' and the question `Did the woman forget the purse?' is contained in non-manual signals. The facial expression carries part of that signal (raised eyebrows and chin pressed forward), but so does the bodily posture - the head and shoulders leaning forward.
It would be unfair to suggest that all sign language recognition researchers are unaware of the importance of the face and head. Sweeney & Downton [15] included head-tracking in their BSL recognizer as a step towards including a facial component. A number of other research groups are looking at ways of capturing and classifying facial expression. Notably, Sako & Smith [9] are attempting to do this in the context of sign language recognition. Most of these projects follow a similar approach. The facial expression is captured using a video camera. Significant features (such as the mouth, nose and eyes) are picked out and their patterns are matched against stored prototypes. Using this kind of approach, faces can be classified as `normal', `happy', `angry', `surprised' and the like. One point working in favour of facial recognition is that the expressions used by signers tend to be exaggerated caricatures.
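A minimal sketch of this prototype-matching step is given below, assuming some front end has already extracted a small facial feature vector; the feature set and prototype values are invented for illustration and are not those of Sako & Smith [9].

```python
import numpy as np

# Illustrative prototype feature vectors, e.g. normalized mouth width and
# height, eyebrow height, eye aperture. Real prototypes would be learnt
# from labelled examples rather than written by hand.
PROTOTYPES = {
    "normal":    np.array([0.30, 0.10, 0.50, 0.40]),
    "happy":     np.array([0.45, 0.15, 0.50, 0.35]),
    "angry":     np.array([0.30, 0.08, 0.35, 0.30]),
    "surprised": np.array([0.30, 0.25, 0.70, 0.60]),
}

def classify_expression(features):
    """Label a face by the nearest stored prototype (Euclidean distance)."""
    return min(PROTOTYPES,
               key=lambda label: np.linalg.norm(features - PROTOTYPES[label]))
```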
The expression is not the only component of meaning that may be carried in the face. British Sign Language (BSL) uses the signer's eye gaze too as a way of representing pronouns [16]. Just as with gesture capture, in measuring gaze direction there is a choice between using video, which is unobtrusive, and instrumentation - which gives more accurate data but is more restrictive to the user. Eye tracking devices exist, but they are prone to problems such as forcing the user to keep his or her head quite still. The work of Varchmin et al. (this volume) is encouraging in that they measure gaze direction by video analysis.
These efforts are encouraging, but will need to be refined much further to capture the rather more subtle facial changes which convey information in sign. Ohki [7] is developing a sign language telephone which uses data gloves to capture the manual signs but combines these with video tracking of the face. As yet, though, only the proximity of the head and hands is captured in this way.
Signs in a sign language are symbols. It is tempting to believe that if one can recognize and categorize those symbols then one can understand the language. However, it is suggested by Macken and colleagues [4] that the symbols have the property of being richly grounding and that consequently sign language understanding is somewhat more difficult.
Briefly, the property of richly grounding means that the form of a symbol carries extra information over and above that implicit in the symbol. An example is that of a road sign. The signs in Figures 1 and 2 both indicate a curve ahead. However, the latter form has the property of richly grounding, in that additional information can be encoded without changing the essential form of the symbol. In particular, it indicates the direction of the curve. Furthermore, it would also be possible to give a suggestion of the severity of the curve.
Figure 1. A road sign which is purely symbolic. Its meaning is arbitrary but conventional; without access to the convention (in this case, the English language) it might have any meaning.
Figure 2. An alternative road sign. This symbol has a readily inferred meaning but also conveys additional information.
Macken and colleagues suggest that the property of richly grounding is essential to sign language. That is to say that the form of a sign carries essential meaning over and above its simple lexical identity. They suggest that the following are examples of significant modifications that occur in American Sign Language:
The implications of the richly grounding nature of sign are illustrated by the following example from Macken, Perry et al. (op. cit.). A signer gave the following narrative:
`Something awful happened yesterday. My car was stopped at a red light. Suddenly from out of nowhere, another car came from behind it on the right and crashed into its rear right bumper.'
However, they describe the signed narrative as follows. Notice that the classifiers used are the words highlighted in italics.
The signer first identified the nature of the event, the subject matter, and the time: happen awful yesterday me car. Then she used a vehicle classifier with her left hand (this signer's subdominant hand), moving her left forearm out from her body and stopping it with the slight backward motion of a car coming to a stop. With her right hand above and in front of her left, she signed red light. Then she formed a vehicle classifier with her right hand, and moved her right forearm briskly from a position in back of the standard signing space so that her right hand collided with the back of her left palm (corresponding to the right rear of the vehicle), causing her hand to bounce a short distance and showing on her face the shock of the situation.
In other words, if one had captured only the lexical signs used, the above would have been rendered as
`happen awful yesterday me car - vehicle - red light - vehicle'
So, even an ideal gesture capture system would produce only this near-nonsense stream. In order to capture the full meaning, as above, the system would have to measure the form of the signs and have a mapping from those forms to meanings. That further implies that there is a need to be able to represent the full richness of the rules of sign language in a form that is suitable for computer interpretation.
The richly grounded nature of sign language explains why pausing between signs to facilitate segmentation is not practical. The very flow between signs, with its coarticulatory effects, carries meaning. To enforce neutral pauses between signs would make it difficult or even impossible to sign meaningful utterances.
Feature-based grammars [17] are used by linguists to describe spoken natural languages, and it is suggested that they might be applied to sign languages too. One approach might be to have a sign recognizer capturing, segmenting and classifying signs, while in parallel capturing features of the way the signs were made. For instance, in the above example, the sign classifier would capture two instances of the sign for vehicle while the feature detector would detect the motion of one of them - culminating in a collision between the two. Of course, there would still be a big gap to bridge between that and a textual transcription, such as that above!
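One possible representation, sketched below purely as an illustration of this suggestion (the token structure and feature names are my own, not an established sign language formalism), bundles each recognized lexical sign with a feature structure describing how it was articulated. The vehicle-collision narrative above could then be carried by two tokens of the same classifier, distinguished only by their features.

```python
from dataclasses import dataclass, field

@dataclass
class SignToken:
    """A lexical sign plus the articulation features that carry extra meaning."""
    lexeme: str
    hand: str                       # 'dominant' or 'subdominant'
    features: dict = field(default_factory=dict)

# A possible encoding of the narrative described by Macken, Perry et al.:
utterance = [
    SignToken("VEHICLE", hand="subdominant",
              features={"motion": "forward-then-stop", "role": "patient"}),
    SignToken("RED-LIGHT", hand="dominant", features={}),
    SignToken("VEHICLE", hand="dominant",
              features={"motion": "rapid-approach-from-behind",
                        "contact": "rear-right-of-subdominant-hand",
                        "facial": "shock"}),
]
```

A feature-based grammar would then operate over sequences of such tokens, constraining and interpreting the combinations of lexical identities and articulation features.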
To date, much of the research and development of automatic sign language recognition has been driven by the technology; the availability of hardware such as datawear and the development of classification techniques has led us to a position where automatic recognition seems feasible. Good progress is being made on basic sign capture and classification, but it is appropriate at this point to think more carefully about the nature of sign language and what is required to truly recognize it. Two main additional components have been identified in this paper, facial expression recognition and sign language grammar, though there are others which will eventually have to be considered, such as body posture.
We should not be over-optimistic, though. Real-time, 100% recognition of full-vocabulary sign language is at least as challenging as a similar level of recognition of natural speech. Yet even to be able to recognize signs in a more restricted way - based on a limited vocabulary, for instance - could be most valuable, and more attainable in the foreseeable future.
1. Harling, P. A. and Edwards, A. D. N., (eds.) Progress in Gestural Interaction: Proceedings of Gesture Workshop '96. 1996, Springer: London. 250 pp.
2. MacDonald, L. and Vince, J., (eds.) Interacting With Virtual Environments. Wiley Professional Computing, John Wiley & Sons: New York. 291 pp.
3. Fels, S. and Hinton, G., Glove-Talk: A neural network interface between a data-glove and a speech synthesizer. IEEE Transactions on Neural Networks, 1993. 4: p. 2-8.
4. Macken, E., Perry, J. and Haas, C., Richly grounding symbols in ASL. Sign Language Studies, 1993. (December).
5. Cracknell, J., et al., The development of a glove-based input system as part of the SignPS Project, in Progress in Gestural Interaction: Proceedings of Gesture Workshop '96, P.A. Harling and A.D.N. Edwards, Editor. 1996, Springer: London. p. 207-216.
6. Harling, P. A. and Edwards, A. D. N., Hand tension as a gesture segmentation cue, in Progress in Gestural Interaction: Proceedings of Gesture Workshop '96, P.A. Harling and A.D.N. Edwards, Editor. 1996, Springer: London. p. 75-88.
7. Ohki, M., The sign language telephone. Telecommunication Forum, Telecom 95, 1995: p. 391-395.
8. Sako, H., et al. (1994) Real-time facial-feature tracking based on matching techniques and its application. in Proceedings of the 12th IAPR International Conference on Pattern Recognition. Jerusalem: IEEE.
9. Sako, H. and Smith, A. (1996) Real-time facial expression recognition based on features' position and dimension. in Proceedings of the International Conference on Pattern Recognition, ICPR'96.
10. Sagawa, H., et al., Pattern recognition and synthesis for a sign language translation system. Journal of Visual Languages and Computing, 1996. 7: p. 109-127.
11. Sagawa, H., Takeuchi, M. and Ohki, M. (1997) Description and recognition methods for sign language based on gesture components. in Proceedings of IUI 97. Orlando, Florida: ACM.
12. Sagawa, H., Takeuchi, M. and Ohki, M. (1997) Sign language recognition based on components of gestures - integration of symbols and patterns. in RWC '97.
13. Liddell, S. K., American Sign Language Syntax. Approaches to Semiotics. 1980, The Hague: Mouton. 194 pp.
14. Stokoe, W. C., Sign Language Structure: An Outline of the Visual Communication Systems of the American Deaf. 1960, University of Buffalo.
15. Sweeney, G. J. and Downton, A. C., Towards appearance-based multi-channel gesture recognition, in Progress in Gestural Interaction: Proceedings of Gesture Workshop '96, P.A. Harling and A.D.N. Edwards, Editor. 1996, Springer: London. p. 7-16.
16. Kyle, J. G. and Woll, B., Sign Language: The Study of Deaf People and their Language. 1988, Cambridge: Cambridge University Press. pp. 318.
17. Gazdar, G. and Mellish, C., Natural Language Processing in Prolog: An Introduction to Computational Linguistics. 1989, Wokingham, England: Addison-Wesley. pp. 504.
[1] The mapping of words to meanings is not always arbitrary. Onomatopoeia is one example where the mapping is not so arbitrary.
[2] Virtual Technologies, Inc., 2175 Park Boulevard, Palo Alto, California, USA 94306, http://www.virtex.com/.
[3] TCAS, 130 City Road, Cardiff, Wales, CF2 3DR
[4] Eyetracker manufacturers include: Applied Science Laboratories, 175 Middlesex Turnpike, Bedford, Massachusetts, USA 01730, http://world.std.com/~asl/; LC Technologies, 9455 Silver King Court, Fairfax, Virginia, USA 22031, http://lctinc.com/; SensoMotoric Instruments GmbH, Potsdamerstrasse 18a, 14513 Teltow, Germany, http://www.smi.de.