One of the main concerns in developed countries is population ageing. Elderly people are susceptible to conditions that reduce quality of life, such as apraxia of speech, a burden that requires prolonged therapy. Our proposal is intended as a first step towards automated solutions that assist speech therapy by detecting mouth poses. This work proposes a system for vowel pose recognition from an RGB-D camera, which provides both 2D and 3D information. The 2D data is fed into a face recognition approach able to accurately locate and characterize the mouth in image space. The approach also uses 3D real-world measurements obtained by pairing the 2D detection with the 3D information. Both information sources are processed by a set of classifiers to determine the best option for vowel recognition.
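
The pairing of a 2D mouth detection with depth data can be illustrated with a minimal sketch. It assumes a standard pinhole camera model and a depth image registered to the colour image; the intrinsic parameters (`fx`, `fy`, `cx`, `cy`), the function names and the choice of mouth corners as landmarks are hypothetical and are not taken from the system described here.

```python
import math


def back_project(u, v, depth_m, fx, fy, cx, cy):
    """Map pixel (u, v) with depth in metres to a 3D point in camera space.

    Assumes a pinhole model; fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point. All values are illustrative.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def mouth_width_m(left_px, right_px, depth_left, depth_right, intrinsics):
    """Real-world distance in metres between two mouth-corner pixels.

    left_px and right_px are (u, v) pixel coordinates from a 2D detector;
    depth_left and depth_right are the depth readings at those pixels.
    """
    fx, fy, cx, cy = intrinsics
    p_left = back_project(left_px[0], left_px[1], depth_left, fx, fy, cx, cy)
    p_right = back_project(right_px[0], right_px[1], depth_right, fx, fy, cx, cy)
    return math.dist(p_left, p_right)


# Hypothetical intrinsics and detections, for illustration only.
intrinsics = (500.0, 500.0, 320.0, 240.0)
width = mouth_width_m((300, 240), (340, 240), 1.0, 1.0, intrinsics)
```

A measurement of this kind is expressed in metres rather than pixels, so it does not change with the subject's distance from the camera, which is one plausible reason to combine it with the 2D features before classification.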