Wednesday, November 19, 2014

Using Entropy to Distinguish Shape Versus Text in Hand-Drawn Diagrams


Bibliographical Information:
Bhat, Akshay, and Tracy Hammond. "Using Entropy to Distinguish Shape Versus Text in Hand-Drawn Diagrams." In IJCAI, vol. 9, pp. 1395-1400. 2009.

URL:
http://www.aaai.org/ocs/index.php/IJCAI/IJCAI-09/paper/download/592/906

This paper from the Sketch Recognition Lab distinguishes hand-written text from free-hand shapes using only one feature: the zero-order entropy rate. The paper found that relying solely on this feature yields correct classification between text strokes and shape strokes 92.06% of the time. The feature also performed favorably in domains for which no training examples were provided.

The main motivation behind this paper is that many sources of sketches mix hand-written text with shape-based diagrams: engineering diagrams where the author annotates and labels components with text, military course-of-action diagrams, and UML sketches all contain both text and shapes as considerable components. Separating hand-drawn shapes from hand-written text is necessary for accurate sketch recognition, because the recognition procedures and final classifications for the two types differ greatly, yet both appear frequently in traditional sketching. This provides a strong motivation for separating them successfully.

This paper uses a zero-order "entropy rate" to classify strokes as shapes or text. Entropy is essentially a measure of "the degree of randomness from an information source". In sketching, this randomness appears with far higher frequency in text strokes than in shape strokes. The paper develops an entropy model "alphabet" to characterize how strokes tend to be drawn: the range of possible angles along a stroke is divided into discrete bins, each assigned a symbol (the range from 0 to pi/6, for instance, is given the symbol "A", other ranges correspond uniquely to B, C, D, E, and F, and an end-point is given the symbol X).

With this classification, the strokes are grouped and an "alphabet" string of their angles is produced. Text strokes yield more varied symbol strings than shape strokes, and this variation is what serves as the "entropy" measure. Strokes were labeled as "shape", "text", or "unclassified", the last being the case where the entropy measurement falls above the threshold for "shape" but below the threshold for "text". The results are promising, with 92.06% correct classification when the system is required to produce a classification. Military course-of-action diagrams and other sketch types outside the trained domains performed favorably as well.
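The angle-alphabet encoding and the entropy thresholds can be sketched as follows. The bin layout follows the pi/6-wide ranges described above, but the threshold values and the exact normalization are illustrative assumptions, not the paper's published numbers:

```python
import math
from collections import Counter

def angle_symbol(a):
    """Map an angle in [0, pi) to one of six symbols A-F (pi/6-wide bins)."""
    return "ABCDEF"[min(int(a / (math.pi / 6)), 5)]

def stroke_symbols(points):
    """Encode a stroke as a symbol string; 'X' marks the endpoints."""
    syms = ["X"]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        a = math.atan2(y1 - y0, x1 - x0) % math.pi  # direction, mod pi
        syms.append(angle_symbol(a))
    syms.append("X")
    return "".join(syms)

def zero_order_entropy(symbols):
    """Shannon entropy of the symbol frequency distribution."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def classify(points, shape_thresh=1.0, text_thresh=1.5):
    """Illustrative thresholds: below -> shape, above -> text, else unclassified."""
    h = zero_order_entropy(stroke_symbols(points))
    if h < shape_thresh:
        return "shape"
    if h > text_thresh:
        return "text"
    return "unclassified"
```

A straight stroke encodes to something like "XAAAX" (low entropy, classified as a shape), while the scribbly strokes of handwriting hit all six angle bins and score much higher.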

I think that this paper is both well motivated and well implemented, with the idea of "entropy" a particularly clever one that ends up yielding very favorable results. The paper's explanation of how the classification works was easy to understand and well elaborated, with a surprising amount of intuition behind the general back-end of the technique. Using higher-order entropy models to see how accuracy changes would be an interesting continuation of this study.



Sunday, November 16, 2014

SketchREAD: A Multi-Domain Sketch Recognition Engine



Bibliographical Information:
Christine Alvarado and Randall Davis. 2004. SketchREAD: a multi-domain sketch recognition engine. In Proceedings of the 17th annual ACM symposium on User interface software and technology (UIST '04). ACM, New York, NY, USA, 23-32. DOI=10.1145/1029632.1029637 http://doi.acm.org.lib-ezproxy.tamu.edu:2048/10.1145/1029632.1029637

URL:
http://dl.acm.org.lib-ezproxy.tamu.edu:2048/citation.cfm?id=1029637

SketchREAD is a system developed by Christine Alvarado and Randall Davis, which uses many different principles from the field of sketch recognition to present an algorithm that is as unintrusive to the user's sketching experience as possible. The main motivation behind this paper is that most systems regarded as "accurate" are either overly restrictive in the sketches they consider (e.g., one can only draw a primitive shape using a single stroke), require the user to input a significant amount of additional data (e.g., training data or shape labels), or severely sacrifice response time or accuracy. SketchREAD was designed so the recognizer works "in the background", continuously updating its own recognition with 1) no prior training data and 2) no feedback provided to the user, as the latter is considered distracting to the sketching experience.

This algorithm makes extensive use of Bayesian networks, which model predictions under uncertainty using prior data. They consist of two parts: a "directed acyclic graph that encodes" which factors in the world influence each other, and a set of conditional probabilities that specify how strongly they do so. This is used especially in one of the two test cases for the SketchREAD system: the drawing of a family tree. Because the interpretation of a family tree, and therefore the identification of object attributes, depends on how earlier objects are interpreted, the system continues to update its interpretation as more sketch components are added. Bottom-up and top-down passes are combined to improve the running time of the recognition algorithm, an approach distinctly similar to how the ShortStraw algorithm is structured. Additionally, pruning is performed on the probability calculations, something we have also seen in HMMs.
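A toy enumeration over such a network illustrates the idea of a DAG plus conditional probability tables. The hypotheses, evidence names, and all numbers below are made-up illustrations in the spirit of SketchREAD's hypothesis networks, not values from the paper:

```python
# Hidden "shape" node with two hypotheses, each influencing two observed
# stroke-level evidence nodes (a two-level DAG).  All probabilities are
# illustrative placeholders.
prior = {"arrow": 0.3, "line": 0.7}

# CPTs: probability each low-level fact is observed given the hypothesis.
p_evidence_given_shape = {
    "arrow": {"has_head": 0.9, "is_straight": 0.8},
    "line":  {"has_head": 0.1, "is_straight": 0.95},
}

def posterior(observed):
    """P(shape | observed evidence) by enumeration over the tiny DAG."""
    unnorm = {}
    for shape, p in prior.items():
        likelihood = 1.0
        for fact, seen in observed.items():
            p_fact = p_evidence_given_shape[shape][fact]
            likelihood *= p_fact if seen else (1.0 - p_fact)
        unnorm[shape] = p * likelihood
    z = sum(unnorm.values())
    return {s: v / z for s, v in unnorm.items()}

print(posterior({"has_head": True, "is_straight": True}))
```

As more strokes arrive, SketchREAD re-evaluates posteriors like this over its candidate interpretations, which is what lets it revise earlier guesses without user feedback.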

Overall, I think this paper is an amalgam of a large number of techniques employed in several other recognition algorithms. The system is shown to significantly improve the user experience in both the family tree domain and the circuit design domain. Finally, it makes heavy use of hierarchical sketch recognition, which has also been seen in other systems such as the one currently in place for SRL's Mechanix. I think this system can easily be applied to several other domains, and the fact that it draws from many other sketch recognition algorithms to create its own implementation indicates a solid application of the best aspects of those algorithms.

Saturday, November 15, 2014

ShortStraw: A Simple and Effective Corner Finder for Polylines


Bibliographical Information:
A. Wolin, B. Eoff, and T. Hammond. 2008. ShortStraw: a simple and effective corner finder for polylines. In Proceedings of the Fifth Eurographics conference on Sketch-Based Interfaces and Modeling (SBM'08), Christine Alvarado and Marie-Paule Cani (Eds.). Eurographics Association, Aire-la-Ville, Switzerland, Switzerland, 33-40. DOI=10.2312/SBM/SBM08/033-040 http://dx.doi.org/10.2312/SBM/SBM08/033-040

URL:
http://dl.acm.org.lib-ezproxy.tamu.edu:2048/citation.cfm?id=2386308

ShortStraw is an algorithm for corner finding that relies entirely on one "feature". The technique was originally intended for use in conjunction with other features that help identify corners in a polyline, but this particular feature proved so accurate that corners could be identified using ShortStraw alone.

The technique behind ShortStraw is relatively simple. Similar to $1, the algorithm begins by resampling the stroke, except unlike $1, the stroke is always resampled to N=40 points. From there, "straws" are created, which are small Euclidean distances between pairs of points. One point is chosen as the "center", while three points on either side form the "window": the "positive window" is the three points in front of the "center" point, and the "negative window" is the three prior ones. In short, the frontmost point is p[i+w], where i is the current center point and w is the window size, in this case w=3, and the backmost point is p[i-w]. Straws are calculated as follows: straw 1 is the Euclidean distance between p[i+w] and p[i-w], straw 2 is the distance between p[i+w-1] and p[i-w+1], and straw 3 is the distance between p[i+w-2] and p[i-w+2].

ShortStraw is divided into two approaches: top-down and bottom-up. Bottom-up is the process by which the straws are compared to first identify a "first pass" of detected corners, while the top-down approach takes these corners and runs another pass to eliminate false positives and identify missed corners.

The bottom-up process iterates through each point and creates the straws mentioned above. If the straw is nearly identical in length (within a 95% threshold) to the path distance along those points, we know the straw is "lying flat" on top of the stroke and thus cannot describe a corner. If the Euclidean distance spans a corner, however, the straw will be much shorter. The points where these straws are smallest are considered to be corners.

The top-down approach then takes these corners and runs another pass, ensuring that the segment between every pair of consecutive corners yields a "flat" line. Where this is not the case, corners are removed or added depending on whether the fault was a false positive or a false negative. This top-down pass is vital in removing erroneous results from the initial bottom-up pass.
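The bottom-up pass can be sketched minimally as follows, assuming the stroke is already resampled to roughly equidistant points. The window size and the 95%-of-median threshold follow the paper's description; the local-minimum test is a simplification of the full procedure, and the top-down corrections are omitted:

```python
import math
from statistics import median

W = 3  # window size from the paper

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def straws(points, w=W):
    """straw_i = Euclidean distance between p[i-w] and p[i+w]."""
    return {i: dist(points[i - w], points[i + w])
            for i in range(w, len(points) - w)}

def bottom_up_corners(points, w=W):
    """First pass: local straw minima below 95% of the median straw length.
    Assumes 'points' was resampled to roughly equidistant spacing."""
    s = straws(points, w)
    t = 0.95 * median(s.values())
    corners = [0]  # stroke endpoints always count as corners
    for i in sorted(s):
        is_local_min = s[i] == min(s.get(j, float("inf"))
                                   for j in range(i - w, i + w + 1))
        if s[i] < t and is_local_min:
            corners.append(i)
    corners.append(len(points) - 1)
    return corners
```

On an L-shaped polyline resampled at unit spacing, the straws over the straight runs all span the full window length, while the straw centered on the bend is visibly shorter, so only the bend survives the threshold.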

One of the points raised by this paper is that its definition of "accuracy" is strictly "all-or-nothing". Previous corner-finding papers reported very high accuracy rates, but this paper argues those figures are somewhat inflated because there is no "penalty" for false positives; by that logic we could return every single point as a corner and still report 100% accuracy. While this may make a corner finder sound accurate on paper, in practice reporting too many corners is just as detrimental to the user experience as reporting too few, the wrong ones, or none at all: in any case, the user has to re-draw or correct their input. The more false positives reported, the worse the experience, even at a reported accuracy of 100%. Because of this, the algorithm is scrutinized under an all-or-nothing standard: if the corners found do not match the actual corners exactly (no more, no less), the result is counted as a "miss". Under this much more practical definition of accuracy, ShortStraw scores 0.741 out of 1, while Sezgin's algorithm scores 0.278 and Kim & Kim's scores 0.297. ShortStraw is thus far more accurate at finding corners under an all-or-nothing accuracy model than previously established algorithms.

I believe this paper to be a very interesting one, both in concept and in execution. The notion of "straws" and the identification of local minima of straw length may at first sound somewhat unorthodox, but in practice it proves highly effective. Most intriguing of all is that the algorithm uses only the length of straws to determine corners, whereas other corner-finding algorithms rely heavily on multiple features to make their decisions. Finally, the discussion of all-or-nothing accuracy may have been inconvenient for several established algorithms, but since these techniques are ultimately intended for use by laymen, a technical definition of accuracy that ignores false positives does not help the practical usability of these algorithms at all. In some ways this adoption of all-or-nothing accuracy changes the discussion of what it means for a recognition algorithm to be "accurate".

Gestures without libraries, toolkits or training: a $1 recognizer for user interface prototypes


Bibliographical information:
Jacob O. Wobbrock, Andrew D. Wilson, and Yang Li. 2007. Gestures without libraries, toolkits or training: a $1 recognizer for user interface prototypes. In Proceedings of the 20th annual ACM symposium on User interface software and technology (UIST '07). ACM, New York, NY, USA, 159-168. DOI=10.1145/1294211.1294238 http://doi.acm.org.lib-ezproxy.tamu.edu:2048/10.1145/1294211.1294238

URL:
http://dl.acm.org.lib-ezproxy.tamu.edu:2048/citation.cfm?id=1294238

The $1 recognizer was designed to be small, fast, and easily implementable in under 100 lines of code. With only one loaded template, $1 achieved 97% accuracy, and with 3+ loaded templates it achieved as high as 99%. Although this paper is one of the most referenced in the field of sketch recognition, none of the techniques involved were particularly novel at the time of publication. However, because nobody had actually published them together, this paper was the first to fully document this well-known recognition approach; it can be credited with standardizing the procedure as well as providing the base from which other researchers developed recognizers such as $2 and $N.

The algorithm works as follows. Strokes are treated as two sets of points: the "candidate points" C given by the user's sketch, and the stored "template points" T against which the candidates are compared to find the closest match. The candidate points are sampled by a combination of hardware and software. Because a naive point-to-point comparison would rarely produce a match, several preprocessing steps must be taken before such a comparison can succeed:

  1. Resample the point path. The paper notes that the optimal number of resampled points varies with the shape, but as long as the number of samples N is between 32 and 256, the algorithm recognizes well. The optimal number was found to be N=64.
  2. Rotate once based on the indicative angle. The recognizer searches "over the space of possible angles" for the optimal angle at which the candidate and template shapes will be compared.
  3. Scale and translate. The candidate shape is scaled (non-uniformly) to a "reference square", then translated so its centroid lies at a reference point, matching the two shapes as closely as possible.
  4. Find the optimal angle for the best score. The "score" is derived from the average distance between corresponding template and candidate points, refined by a golden section search over small rotations.
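The four steps above can be sketched end-to-end. This is a simplified reading of the pipeline: it rotates once by the indicative angle instead of performing the paper's golden section search over rotations, and the reference-square size is an arbitrary choice:

```python
import math

N = 64          # resample count reported optimal in the paper
SQUARE = 250.0  # reference-square size (an illustrative value)

def path_length(pts):
    return sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))

def resample(pts, n=N):
    """Step 1: resample the stroke to n equidistantly spaced points."""
    interval = path_length(pts) / (n - 1)
    pts, out, acc = list(pts), [pts[0]], 0.0
    i = 1
    while i < len(pts):
        d = math.dist(pts[i - 1], pts[i])
        if acc + d >= interval:
            t = (interval - acc) / d
            q = (pts[i - 1][0] + t * (pts[i][0] - pts[i - 1][0]),
                 pts[i - 1][1] + t * (pts[i][1] - pts[i - 1][1]))
            out.append(q)
            pts.insert(i, q)  # keep measuring from the interpolated point
            acc = 0.0
        else:
            acc += d
        i += 1
    while len(out) < n:       # guard against floating-point shortfall
        out.append(pts[-1])
    return out[:n]

def centroid(pts):
    return (sum(x for x, _ in pts) / len(pts),
            sum(y for _, y in pts) / len(pts))

def rotate_by(pts, theta):
    cx, cy = centroid(pts)
    c, s = math.cos(theta), math.sin(theta)
    return [((x - cx) * c - (y - cy) * s + cx,
             (x - cx) * s + (y - cy) * c + cy) for x, y in pts]

def normalize(pts):
    """Steps 1-3: resample, rotate the indicative angle to zero, scale
    non-uniformly to the reference square, translate centroid to origin.
    (A purely 1-D shape breaks the scale step -- the paper's noted limit.)"""
    pts = resample(pts)
    cx, cy = centroid(pts)
    pts = rotate_by(pts, -math.atan2(pts[0][1] - cy, pts[0][0] - cx))
    xs, ys = [x for x, _ in pts], [y for _, y in pts]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    pts = [(x * SQUARE / w, y * SQUARE / h) for x, y in pts]
    cx, cy = centroid(pts)
    return [(x - cx, y - cy) for x, y in pts]

def path_distance(a, b):
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def recognize(candidate, templates):
    """Step 4 (simplified): compare against every stored template and
    return the best (name, 0..1 score)."""
    c = normalize(candidate)
    best, best_d = None, float("inf")
    for name, tmpl in templates.items():
        d = path_distance(c, normalize(tmpl))
        if d < best_d:
            best, best_d = name, d
    return best, 1.0 - best_d / (0.5 * math.hypot(SQUARE, SQUARE))
```

A noisy redrawing of a stored zigzag template, for instance, normalizes to nearly the same 64 points and scores close to 1, while a structurally different template stays far away in path distance.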
The paper identifies several limitations of this recognizer. First, because the recognizer is deliberately invariant to rotation and scale, shapes whose identifying feature is a specific orientation (such as a square versus a diamond, which differ only by rotation) cannot be told apart. Second, non-uniform scaling distorts purely horizontal or vertical lines, so shapes consisting of a single such line can have trouble being recognized. Third, because the system does not use any time-based information, shapes that rely on drawing time or speed will not be recognized.

The $1 recognizer, as the paper claims, is used today mostly to get a recognizer integrated into a system quickly, and it works well for simple shapes such as letters of the alphabet. However, given its significant limitations, serious sketch recognition research and most modern systems do not use something as simple as $1 by itself; they combine it with several other kinds of algorithms, such as feature-based ones, to produce the most accurate and useful recognition for specific applications.

Monday, November 10, 2014

HMM-Based Efficient Sketch Recognition

Bibliography Information:
Sezgin, Davis. "HMM-Based Efficient Sketch Recognition". 2005 International Conference on Intelligent User Interfaces.

URL:
http://rationale.csail.mit.edu/publications/Sezgin2005HMM-extended.pdf

This paper focuses on using Hidden Markov Models (HMMs) to help classify and group sketched objects. The idea is that, even though a large variety of methods exists for recognizing individual shapes, the order in which shapes are drawn, and how that ordering can be used to predict the shape itself, was not something that had been explored at the time.

The paper begins by describing what it calls a "sketching style", which is the user's preferred, though "not necessarily conscious", order in which the strokes of a shape are drawn. Sketch recognition in the paper is described as three steps: segmentation, classification, and labeling. In this sense, the paper largely resembles most every other paper read in this course; the key, however, lies in the order in which the strokes are completed. By analyzing the order in which sketches were produced, one gains insight into both the style of an individual's sketching and the general process by which all users draw. Though technically the number of permutations of a shape of n strokes is n!, the actual number of orderings in a real-world sketching scenario is far smaller. For instance, when drawing a stick figure, the vast majority of people first draw either the stick representing the body or the circle representing the head; one would rarely see the right leg or arm drawn first.

Sezgin et al. first conducted a user study in which this behavior was observed before proceeding to classify the drawn shapes using HMMs. Users were asked to draw finite state machine diagrams, Unified Modeling Language diagrams, scenes with "stick figures playing certain sports", digital circuit diagrams, and emoticons, producing a total of 180 sketches. The study revealed that users draw these sketches in a "highly stylized" fashion, that these styles persist across different sketches, that users prefer a particular order when drawing, and that enclosed shapes (e.g., circles, squares, triangles) tend to be drawn first.

In order to apply HMMs to object recognition, the system had to support multiple classes and different drawing orders for the same sketch, handle variations in drawing/encoding length, compute a probabilistic matching score, and learn each user's intuitive drawing order. In terms of encoding, a total of 13 observation symbols were used to encode the output sequences: 4 encoded lines, 3 encoded ovals, 4 encoded polylines, and one encoded complex approximations consisting of two or more of the above. Additionally, two kinds of HMMs were used: fixed input length and variable input length. Although the "hard coding" of fixed-length data makes constructing the grouping graphs easy, it requires artificial grouping on a per-case basis, so the fixed-length approach has somewhat limited use in real-world sketch recognition. To avoid this artificial partitioning, the sketching data is grouped by similar "sketching styles", and Sezgin et al. introduce weights in the graph to account for the differing data lengths. The HMMs were implemented via BNT in Matlab, using the Bakis left-to-right topology due to the incremental nature of sketching.

This work was evaluated both on real data and by comparing performance against an existing "baseline method". With real data comprising a total of 88 sketched objects from users, the fixed-input HMM method yielded an accuracy of 96%, while the variable-input method yielded 81% without explicit ending states and 97% with them. "Negative examples", such as superfluous or intentionally unrecognizable shapes, were introduced into the input data to gauge how well the algorithm performed. The results varied with the size of the "neighborhood of misrecognition": 1 in 57% of the trials, 2 in 35%, and 3 or more in 8%. With the simulated "low-level errors", 43.5% were within a neighborhood of 1, 22.5% within 2, and 3% within 3 or more.

As the "baseline recognizer", a feature-based method was used, which assigns spatial context to different shapes and provides additional structural object descriptions. The running time of the baseline method increased exponentially, whereas the HMM-based recognition system scaled linearly.

Overall, I think the work presented in this paper was substantial. It not only presents a very accurate method of classification, but also opens the door to many future experiments in identifying and exploring people's differing "sketching styles" and how they affect sketch recognition. User studies can become far richer when we apply HMMs not only to describe sketching styles but also to predict them. Combined with neural networks, such models might even be used to generate original, computer-made sketches in the "style" of a specific user.

Wednesday, November 5, 2014

A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition

Bibliographical Information:
Rabiner. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Proceedings of the IEEE, Vol. 77, No. 2, February 1989.

URL:
 http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf

This paper is an explanation of and basic tutorial on Markov models, namely the hidden Markov model, in which the sequence of states that generated the observations is hidden and must be inferred from the outputs.

The paper begins by explaining a standard Markov model, which is essentially a state machine with probabilities attached to each transition, where the next state depends on the current state. The probability of a sequence can be evaluated much as any probabilistic problem is, by multiplying the relevant probabilities together according to a set of rules. The paper then extends this explanation to hidden Markov models, in which the state sequence itself is not observed; only its outputs are. Using the probabilities of the outputs, we can use HMMs to numerically estimate the hidden quantities. The total number of unknown parameters in an HMM system depends on the model: for instance, an HMM that estimates the coins used in a coin-toss experiment varies depending on whether we have 1, 2, or 3 coins. With 1 coin there is only one unknown parameter; with 2 coins there are 4 unknowns, and with 3 coins, 9.

The paper describes the basic elements of an HMM:
1. The number of states in the model. Though the states themselves are hidden, they often carry "physical significance": in the coin-toss experiment, the hidden states correspond to which coin is currently being tossed.
2. The number of "observation symbols", that is, the number of distinct outputs. "Heads, tails" means this number is two.
3. The state transition probability distribution: how likely the model is to move from one state to the next, e.g., how likely the experiment is to switch from one coin to another between tosses.
4. The observation symbol probability distribution: the probability of each output in each state, e.g., how likely a given coin is to come up heads or tails.
5. The initial state distribution: the probabilities over which state the model starts in.

The paper continues by describing three main "problems" that can be solved through an HMM:
1. Computing the probability of an observation sequence given a model
2. Finding the "best" hidden state sequence given a model and an observation sequence
3. Adjusting the parameters of the model to maximize the probability of the observation sequence
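Problem 1 is solved with the forward algorithm, which sums over all hidden state paths in time linear in the sequence length. The two-coin toy model below uses made-up probabilities purely for illustration:

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """P(observation sequence | model) via the forward algorithm.
    alpha[t][s] = P(o_1..o_t, state_t = s)."""
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for o in obs[1:]:
        alpha.append({s: emit_p[s][o] * sum(alpha[-1][r] * trans_p[r][s]
                                            for r in states)
                      for s in states})
    return sum(alpha[-1].values())

# Toy model: hidden state = which coin is being tossed (fair vs. biased).
states = ["fair", "biased"]
start_p = {"fair": 0.5, "biased": 0.5}
trans_p = {"fair": {"fair": 0.9, "biased": 0.1},
           "biased": {"fair": 0.1, "biased": 0.9}}
emit_p = {"fair": {"H": 0.5, "T": 0.5},
          "biased": {"H": 0.9, "T": 0.1}}

print(forward("HHT", states, start_p, trans_p, emit_p))
```

Summing this quantity over every possible observation sequence of a fixed length gives exactly 1, a quick sanity check that the recursion is a proper probability.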

Rabiner illustrates all three of these problems with basic examples that highlight their main concepts, and goes over the general approaches for solving each (the forward procedure, the Viterbi algorithm, and Baum-Welch re-estimation, respectively).

The paper also covers some additional "types" of HMMs. For instance, it describes the "left-right", or "Bakis", model, so called because the underlying state index is non-decreasing as time proceeds: no transitions are allowed to states whose indices are lower than the current state's. It also discusses the many instances in which the observations are not discrete but continuous, with densities that change over time. Another type applicable to speech processing is briefly discussed, the "auto-regressive HMM", where the observations are drawn from what the paper calls an "auto-regressive process".

Additional, more advanced information is covered that involves optimization of models, variants on the discussed HMM structures, scaling to include additional sequences, accounting for multiple observation sequences, choosing models, covering what happens when there is too little data, and an extended example of an application of an HMM structure to speech recognition.

Overall, I think this paper was very informative, if a little exhaustive. The basics of Markov models and, more specifically, HMMs are not as daunting as the paper may initially make them appear. Some of the examples covered, such as the ball-and-urn model or the coin-toss model, were used only to establish terminology rather than to walk through a bottom-up description of an entire HMM structure with these easy-to-understand models. Indeed, the only real extended example is speech recognition, which is itself difficult to understand, so framing HMMs through speech recognition is likely too esoteric for a beginner; someone unfamiliar with HMMs is very likely also unfamiliar with speech recognition, making this extended example a poor choice for a "tutorial". Additionally, the different types of HMMs are presented before the simpler types are sufficiently explained, which also hinders the learning experience somewhat. The subject is, however, very relevant to the field of sketch recognition and important for deriving what a user likely meant to write on a digital pen system rather than relying solely on what the user actually wrote.

Saturday, November 1, 2014

LADDER, a sketching language for user interface developers


Bibliography Information:
Hammond, Davis. "LADDER, a sketching language for user interface developers". Computers & Graphics 29.  Pages 518-532. 2005.

URL:
http://www.sciencedirect.com/science/article/pii/S0097849305000865

LADDER was designed around the increasing ubiquity of, and growing use cases for, applied sketch recognition. In areas where the user and creator are both expected to share knowledge of a particular domain (such as characters in a new language, mathematical symbols, or free-hand physics diagrams), creating a sketch recognition system previously required the "instructor" (that is, the individual providing the domain-specific materials) to also have significant experience with the concepts behind sketch recognition. LADDER was built specifically so user-interface developers could create sketch recognition domains without expertise in the field of recognition; instead they are asked to describe how the diagrams in the domain are drawn, displayed, and edited. To this end, the LADDER system handles shape and editing recognizers, shape exhibitors (that is, rules for how shapes are displayed on screen), and an extensive sketch recognition system. A code generator was built so these descriptions can be fed into the base recognition system.

LADDER relies on shapes being drawn with as few strokes as possible (an "artistic rendition" of a cat, complete with thousands of small strokes for added facial and shadow detail, would not be considered), depending instead on the most essential geometric rules of a shape to craft its definition. Shapes are described as a list of components (for instance, an arrow is "built from three lines"), constraints that define the relationships between the components, a set of aliases that simplify elements in the description, editing behaviors that differentiate user input between "drawing" and "editing" actions (cut and paste, translation, rubber-band resizing, etc.), and display methods such as beautification of input strokes (strokes intended by the user to be lines but exhibiting minor hand tremors are "redrawn" as perfectly straight lines).

These recognized shapes are then composed into shape "groups" that make up more complex, usually domain-specific shapes unique to one particular field (such as shapes depicting chemistry diagrams, Kanji characters, etc.). The language that describes LADDER shapes is hierarchical, providing descriptors such as the bounding box, center point, width, and height. LADDER has several predefined constraints built in, such as perpendicular, parallel, sameSide, and oppositeSide. Some constraints are valid only in a particular direction, and some directions can be specified in degrees relative to other shapes. The researchers found that a description system based on orientation is most useful to the content providers who would use a system like LADDER.
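As an illustration of these description components, here is a hedged Python rendering of the arrow example, with a toy check of two constraint kinds against concrete lines. The dictionary keys and predicate names approximate the spirit of the description language, not LADDER's exact syntax:

```python
import math

# A LADDER-style description of the "arrow built from three lines" example.
# Keys mimic the description language's sections; names are illustrative.
arrow = {
    "components": {"shaft": "Line", "head1": "Line", "head2": "Line"},
    "constraints": [
        ("coincident", "shaft.p2", "head1.p1"),
        ("coincident", "shaft.p2", "head2.p1"),
        ("shorter", "head1", "shaft"),
        ("shorter", "head2", "shaft"),
    ],
    "aliases": {"head": "shaft.p2"},
    "display": ["draw shaft", "draw head1", "draw head2"],
}

# Concrete line instances to check against: each line is (p1, p2).
lines = {
    "shaft": ((0.0, 0.0), (10.0, 0.0)),
    "head1": ((10.0, 0.0), (8.0, 1.5)),
    "head2": ((10.0, 0.0), (8.0, -1.5)),
}

def endpoint(name):
    line, pt = name.split(".")
    return lines[line][0 if pt == "p1" else 1]

def satisfied(constraint):
    kind = constraint[0]
    if kind == "coincident":
        return math.dist(endpoint(constraint[1]),
                         endpoint(constraint[2])) < 1e-6
    if kind == "shorter":
        a, b = lines[constraint[1]], lines[constraint[2]]
        return math.dist(*a) < math.dist(*b)
    return True  # other LADDER predicates omitted in this sketch

print(all(satisfied(c) for c in arrow["constraints"]))
```

A recognizer built on such descriptions only has to test each constraint against fitted primitives, which is what lets domain authors stay out of the recognition internals.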

Recognition of domain shapes happens through a series of bottom-up data triggers, where a recognized shape in the drawing represents a particular "fact about the world". This shape recognition uses the Jess rule-based system, which performs "clean up" on these shapes; recognized shapes are then added into the Jess system for later recognition. Similar rules exist for user input intended to edit existing shapes, except those rules are applied to the user's input rather than to the resulting sketch.

I believe the work presented in LADDER is an extremely important advancement in the field of sketch recognition, since it goes a great way toward 1) contextualizing sketch input, which makes this technology useful for real-world applications, and 2) making it easier for content providers to create and support their own domains without requiring expertise in the field of sketch recognition. This system and its underlying principles have been the driving force behind many of the projects in the Sketch Recognition Lab.