Dental Imaging Project – Caries Classification
As outlined in the previous post, we are able to remove most of the teeth and gum regions, leaving us with segments that are potentially caries. We are now developing a classifier that, given a segment, gives a Yes/No answer.
Since caries differ distinctly from other regions of the mouth in color, we started exploring color-based features that could be useful for classification.
One such feature is average color. We manually labeled the segments given by our algorithm and tried several machine learning algorithms, such as Logistic Regression and Support Vector Machines, to develop a classifier using this data, but none of them worked well. We started to suspect that the average color did not contain enough information to build a good classifier. Since the data has 3 dimensions (R_avg, B_avg, G_avg), it is easy to visualize. Here is a 3-D scatter plot of the data points we have. The non-caries points are blue and the caries points are red.
Exploring the structure of this scatter plot reveals no boundary (not even a non-linear one) separating the two classes. We therefore conclude that average color alone is not a good feature for detection.
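For readers who want to reproduce the average-color feature, here is a minimal sketch. It assumes a segment is available as a list of 8-bit (r, g, b) pixel tuples; the function name `average_color` and that input format are our illustration here, not the project's actual code.

```python
def average_color(pixels):
    """Return the mean (R, G, B) of a segment.

    `pixels` is assumed to be a list of (r, g, b) tuples; this input
    format is an assumption made for the example.
    """
    if not pixels:
        raise ValueError("segment has no pixels")
    n = len(pixels)
    r_avg = sum(p[0] for p in pixels) / n
    g_avg = sum(p[1] for p in pixels) / n
    b_avg = sum(p[2] for p in pixels) / n
    return (r_avg, g_avg, b_avg)


# Toy segment with three pixels (made-up values).
segment = [(200, 180, 170), (100, 60, 50), (90, 60, 50)]
print(average_color(segment))  # (130.0, 100.0, 90.0)
```

Each segment collapses to a single 3-D point, which is why the feature is easy to plot but also why it discards most of the color information in the segment.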
We then moved to higher-order features, like the color histogram, where we bin the R, G, and B dimensions of the segments, with 16 bins each. This gives a feature vector with a total of 16*16*16 = 4096 dimensions. The problem with this approach is that we have too many degrees of freedom and too little data, which leads to overfitting: the classifiers trained on such features give ~100% accuracy on the training set, but fail to generalize well. It is not easy to visualize such high-dimensional data, so it is not clear whether histogram features are sufficient or not. We also tried dimensionality reduction using Principal Component Analysis, but the problem persisted.
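The histogram feature described above can be sketched as follows. This is an illustration under assumptions, not our exact implementation: it assumes 8-bit pixels given as (r, g, b) tuples, a joint histogram over the three channels, and normalization by pixel count so that segments of different sizes are comparable. The name `color_histogram` is hypothetical.

```python
def color_histogram(pixels, bins_per_channel=16):
    """Joint RGB histogram, flattened to bins_per_channel**3 values.

    Assumes 8-bit channels and that bins_per_channel divides 256.
    Normalizing by the pixel count is an assumption for this sketch.
    """
    bin_width = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        ri, gi, bi = r // bin_width, g // bin_width, b // bin_width
        # Flatten the 3-D bin index (ri, gi, bi) into one position.
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
    if pixels:
        hist = [count / len(pixels) for count in hist]
    return hist


features = color_histogram([(0, 0, 0), (255, 255, 255), (17, 16, 31)])
print(len(features))  # 4096
```

With only a handful of labeled segments, almost all of the 4096 bins are empty for any given segment, which is one way to see why a classifier can memorize the training set here.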
Here is a possible explanation. The following figure is standard in machine learning practice. (image from the internet)
We are on the far left, with a very small training set: we have very low training error, but the test error is high. We need to get to the far right, where the test error is small.
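One way to see where a classifier sits on such a curve is to measure training and test error at increasing training set sizes. The sketch below does this on synthetic data with a nearest-centroid classifier. Both are stand-ins chosen to keep the example dependency-free: the data is randomly generated, not our labeled segments, and the classifier is not one of the models we tried.

```python
import random


def make_synthetic(n, rng):
    """Synthetic 3-D points from two overlapping Gaussian classes.

    Labels alternate 0, 1 so any prefix of length >= 2 has both classes.
    """
    xs, ys = [], []
    for i in range(n):
        label = i % 2
        xs.append([rng.gauss(float(label), 1.0) for _ in range(3)])
        ys.append(label)
    return xs, ys


def fit_centroids(xs, ys):
    """Compute the per-class mean point."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict(centroids, x):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)),
    )


def error_rate(centroids, xs, ys):
    wrong = sum(predict(centroids, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)


def learning_curve(sizes, n_test=500, seed=0):
    """Return (n, train_error, test_error) for each size in `sizes` (each >= 2)."""
    rng = random.Random(seed)
    test_x, test_y = make_synthetic(n_test, rng)
    train_x, train_y = make_synthetic(max(sizes), rng)
    curve = []
    for n in sizes:
        sub_x, sub_y = train_x[:n], train_y[:n]
        centroids = fit_centroids(sub_x, sub_y)
        curve.append(
            (n, error_rate(centroids, sub_x, sub_y), error_rate(centroids, test_x, test_y))
        )
    return curve


for n, train_err, test_err in learning_curve([2, 10, 50, 200]):
    print(f"n={n:4d}  train error={train_err:.2f}  test error={test_err:.2f}")
```

With two training points the training error is exactly zero, since each point is its own class centroid, which mirrors the far-left regime described above. A large gap between the two errors suggests that more data would help.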
After this sequence of failures, we discussed the problem with Dr. Rajiv Gupta. Considering his valuable inputs, we think that collecting more data is essential to building a good classifier. Moreover, only expert dentists can give reliable labels. We are thus trying to develop a GUI that doctors or dental students can easily use to label image regions.
We envision a system where the images collected by the intraoral camera are processed and shown on a GUI. If some abnormal segments are detected, the images are sent to a doctor, who labels them. If the doctor thinks that caries or other problems exist, the patient is referred to the doctor who labeled the images. This way, patients benefit from early detection and treatment, doctors benefit because they get more patients, and we benefit because we get data labeled by experts.
We believe that such a GUI can be used to collect data at a massive scale and the dataset would greatly benefit efforts towards an intelligent oral health monitoring device. Next up, we will present the GUI prototype (with basic functionality) that we built and a lot of other exciting work!
Here is our team having a discussion with Dr. Rajiv Gupta.
The Dental Imaging Team: Hyunsung, Avi, Ujjwal, Raghu, Meghasyam, Ashish, Ashray, Pratima, Samriddh