Automatic generation of analogy questions for student assessment: an Ontology-based approach

Different computational models for generating analogies of the form "A is to B as C is to D" have been proposed over the past 35 years. However, analogy generation remains a challenging problem that requires further research. In this article, we present a new approach for generating analogies in Multiple Choice Question (MCQ) format that can be used for student assessment. We propose to use existing high-quality ontologies as a source for mining analogies, which avoids the classic problem of hand-coding concepts that affects previous methods. We also describe the characteristics of a good analogy question and report on experiments carried out to evaluate the new approach.


Introduction
Effective assessment of students is an ongoing process that should be carried out in different phases of education: planning as in diagnostic assessment, teaching and learning as in formative and self-assessment, reporting and recording as in summative assessment. At the same time, designing and implementing effective assessments, with increased numbers and higher expectations of students, is time consuming and expensive (i.e. hard). Adding an "e" prefix to assessment is not magical; interested practitioners still face some knotty problems. Typically, e-assessment refers to using technology to manage and deliver assessment. It can also provide automatic grading and instant feedback, especially with objective tests (e.g. multiple-choice). However, a major and as yet unsolved problem of e-assessment is the generation of high-quality assessment items automatically (or at least semi-automatically). We argue that moving from a delivery model to a generation model is the key to the transition from the e-assessment systems of today to those of the next generation. Moreover, technology-aided generation of assessment items is useful only if backed by an accepted pedagogical theory, which is usually missing in current generation models. In fact, this applies to both automatic and manual generation methods. For example, Paxton (2001) carried out some empirical evaluations and reported that multiple-choice tests are often not well constructed.
Part of the problem is the dearth of evaluation metrics. One possible solution is to use Item Response Theory (IRT) (Kehoe 1995; Miller, Linn, and Gronlund 2008; Mitkov, An Ha, and Karamani 2006), which describes the statistical behaviour of good/bad questions by following a procedure to measure three parameters: (1) the possibility of guessing the correct answer, (2) the tuned difficulty of test items and (3) proper discrimination between good and poor students. IRT can guide us in the evaluation of test items. However, we need a theory that guides us in the generation process of test items that conform to the desired characteristics. A possible solution that we conjectured is to use a similarity-based approach to generate questions of different characteristics. For example, it is expected that questions with high similarity between the stem and key parts and less similarity between stem and distractors are easy questions (or perhaps guessable). Note that in MCQ terminology, the question part is called the stem, the correct answer is called the key and wrong answers are called distractors. Similarly, the question would be more difficult if the distractors were more similar to the stem compared to the key answer (e.g. lexical similarity).
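For reference, the three IRT parameters listed above (guessing, difficulty and discrimination) are the ones that appear in the standard three-parameter logistic model. The text does not say which IRT variant is intended, so reading it as this model is an assumption; the symbols below are introduced only for illustration.

```latex
P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}
```

Here P_i(θ) is the probability that a student of ability θ answers item i correctly, c_i is the guessing parameter, b_i the item difficulty and a_i the discrimination.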
To alleviate the burden of manual generation of assessment items, we propose an approach to automatically generate MCQs from Description Logics (DL) (Baader et al. 2007) ontologies. DL ontologies are engineering artefacts that provide formal and machine-processable descriptions of the basic notions of a domain of interest. Many high-quality ontologies already exist, which suggests that mining such rich resources for assessment questions might be fruitful. Recently, a handful of studies explored the generation of MCQs from ontologies (Al-Yahya 2011; Fairon 1999; Holohan 2005, 2006; Papasalouros, Kotis, and Kanaris 2008; Zitko et al. 2008; Zoumpatianos, Papasalouros, and Kotis 2011), but very little research has been done on theoretical, empirical and evaluation aspects. Most of the proposed methodologies generate questions of the form "What is X?" or "Which of the following is an example of X?" based on class-subclass and/or class-individual relationships. These types of questions can be criticised as assessing only the lower levels (e.g. recall) of Bloom's taxonomy of learning objectives (Bloom and Krathwohl 1956). Moreover, it is unlikely that a real test will consist of items that are all of this kind; hence, it is crucial to design approaches capable of generating questions of other kinds.
In this article, we describe the design and report on the evaluation of a new approach for generating questions that require higher cognitive ability, such as retrieving and mapping analogies of the form "A is to B as C is to D".

Analogy questions
Analogical reasoning is based on comparing two different types of objects and identifying points of resemblance. Hence, similarity plays a major role in analogical reasoning. In multiple-choice analogy questions, the student is given a pair of words and is asked to identify the most analogous pair of words among a set of alternative options. The required task is to recognise the relation between the pair of words in the stem and to find the pair of words that has a similar underlying relation. Multiple-choice analogy questions are used in various educational tests (e.g. college entrance tests such as SAT, GRE). As an example, see the question (GREguide 2012) in Table 1, taken from a sample of GRE verbal analogy questions.
Different computational models (Falkenhainer 1988; Gentner 1983; Larkey & Love 2003; Winston 1980) for analogy-making have been proposed over the past 35 years. These models are typically based on comparing two structured representations encoded in predicate logic statements [e.g. Structure Mapping Theory (SMT) (Falkenhainer, Forbus, and Gentner 1989; Gentner 1983)]. SMT is more sensitive to higher-order relations (e.g. cause, imply). These models are founded on the premise that detecting analogies is useful for transferring knowledge between two domains (usually called base and target).
In this article, we take a different approach: first, we define Analogy as a function that takes two representations and returns a numerical value in [0,1] representing their degree of analogy. Examples of such functions will be discussed later. Then we show how to use this function to develop an MCQ generator that is capable of controlling the difficulty of questions. In addition, the Analogy function can be used to generate only plausible (i.e. expected to be functional) distractors. To achieve this, we use thresholds Δ_1, Δ_2 and Δ_3 to parameterise our notion of analogy question (see Definition 1 below). We also define the function Relatedness, which takes two concepts and returns their relatedness value in [0,1]. This function can be used to filter the generated pairs in the stem, key and distractors according to a threshold Δ_R. Again, examples will be discussed later.
Definition 1 Let Q be an analogy question with stem S = (A, B), key K = (X, Y) and a set of distractors D = {D_i = (E_i, F_i) | 1 ≤ i ≤ max}. We assume that Q satisfies the following conditions:
(1) The stem S, the key K and each distractor D_i are all good (i.e. Relatedness(A, B) ≥ Δ_R, and similarly for the key and each distractor).
(2) [...]
(3) The key K is significantly more analogous to S compared to the distractors (i.e. Analogy(S, K) ≥ Analogy(S, D_i) + Δ_1).
(4) The distractors should be analogous to S to an extent (i.e. Analogy(S, D_i) ≥ Δ_3).
(5) Each distractor D_i is unique (i.e. Analogy(S, D_i) ≠ Analogy(S, D_j) for i ≠ j).
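The conditions of Definition 1 can be read as a simple validity check on a candidate question. The following is a minimal sketch under stated assumptions: the names `Thresholds` and `satisfies_definition_1` are hypothetical, the `relatedness` and `analogy` functions are supplied by the caller, and the check tied to the second threshold (`delta_2`, taken here to be a lower bound on the analogy between stem and key) is a guess, because that condition is not spelled out in the text.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Sequence, Tuple

Pair = Tuple[str, str]  # an ordered pair of concept names


@dataclass(frozen=True)
class Thresholds:
    delta_r: float  # minimum relatedness inside every pair
    delta_1: float  # required margin of the key over each distractor
    delta_2: float  # ASSUMED meaning: minimum analogy between stem and key
    delta_3: float  # minimum analogy between stem and each distractor


def satisfies_definition_1(
    stem: Pair,
    key: Pair,
    distractors: Sequence[Pair],
    relatedness: Callable[[str, str], float],
    analogy: Callable[[Pair, Pair], float],
    t: Thresholds,
) -> bool:
    # Condition (1): stem, key and distractors are all "good" pairs.
    if any(relatedness(a, b) < t.delta_r for a, b in [stem, key, *distractors]):
        return False
    key_score = analogy(stem, key)
    # ASSUMPTION standing in for the condition that uses the second threshold.
    if key_score < t.delta_2:
        return False
    scores = [analogy(stem, d) for d in distractors]
    # Condition (3): the key beats every distractor by at least delta_1.
    # Condition (4): every distractor is analogous to the stem to some extent.
    if any(key_score < s + t.delta_1 or s < t.delta_3 for s in scores):
        return False
    # Condition (5): distractors have pairwise different analogy scores.
    return all(x != y for x, y in combinations(scores, 2))
```

Raising `delta_1` or lowering `delta_3` in such a check admits only questions whose distractors sit further below the key, which matches the intuition that those questions are easier.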
As an example of a Relatedness function, one can consider pairs of class names that are referenced together in at least one ontological axiom (perhaps on different sides of the axiom) as closely related classes. For instance, if our ontology contains an axiom that defines X in terms of Y (e.g. X ⊑ ∃r.Y), then Relatedness(X, Y) is greater than zero. For our current purposes, we designed a Relatedness function that captures class-subclass relations between pairs of named classes that correspond to one of the structures in Figure 1. Note that we restricted our attention to structures that have at most one change in direction and at most two steps in each direction. We also ignored some structures caused by multiple inheritance (e.g. 1d1u). These restrictions were introduced to avoid questions that are too difficult (and probably confusing). They also seem to be more aligned with human-generated analogy examples. While in the most general case, one should consider [...]
As discussed above, we would like to be able to control the difficulty of the questions. According to Definition 1 and Propositions 1, 2 and 3, we can control the difficulty of Q by increasing or decreasing Δ_1, Δ_2 and Δ_3.
Proposition 1 Increasing Δ_1 decreases the difficulty of Q.
Proposition 2 Increasing Δ_2 decreases the difficulty of Q.
Proposition 3 Decreasing Δ_3 decreases the difficulty of Q.
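The structural restrictions on the Relatedness function described above (at most one change in direction, at most two steps in each direction) can be illustrated on a toy class hierarchy. This is only a sketch under simplifying assumptions: the hierarchy is an acyclic child-to-parent map with single inheritance, the function names are hypothetical, the score is binary, and the accepted set of structures is derived from the stated restrictions rather than from Figure 1 itself.

```python
from typing import Dict, Optional, Tuple


def ancestors(cls: str, parent: Dict[str, str]) -> Dict[str, int]:
    """Map cls and each of its ancestors to the number of steps up from cls."""
    dist: Dict[str, int] = {}
    steps = 0
    cur: Optional[str] = cls
    while cur is not None:  # assumes the hierarchy has no cycles
        dist[cur] = steps
        cur = parent.get(cur)
        steps += 1
    return dist


def structure(a: str, b: str, parent: Dict[str, str]) -> Optional[Tuple[int, int]]:
    """Return (steps up from a, steps down to b) via the nearest common ancestor."""
    up_from_a = ancestors(a, parent)
    up_from_b = ancestors(b, parent)
    common = [c for c in up_from_a if c in up_from_b]
    if not common:
        return None
    nearest = min(common, key=lambda c: up_from_a[c])
    return up_from_a[nearest], up_from_b[nearest]


def relatedness(a: str, b: str, parent: Dict[str, str], max_steps: int = 2) -> float:
    """Binary relatedness: 1.0 if the pair fits an accepted structure, else 0.0."""
    s = structure(a, b, parent)
    if s is None or s == (0, 0):  # unconnected classes, or a class paired with itself
        return 0.0
    up, down = s
    return 1.0 if up <= max_steps and down <= max_steps else 0.0
```

With single inheritance every path goes up and then down, so there is never more than one change in direction; structures such as 1d1u can only arise with multiple inheritance and are therefore out of scope for this sketch.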
The Analogy function can be defined in different ways. For example, we can compare the number of steps between classes in each pair; pairs with a similar number of steps in their representations would be more analogous. In this paper, we define the function Analogy in terms of similarities in the number of steps and changes in direction (see Definition 2).
Definition 2 Let Analogy(x,y) be a function that takes two pairs of concepts and returns a numerical score in [0,1] for their analogy value. The score is determined according to Table 2 in [...]

Question generation
Our proposed approach to the generation of multiple-choice analogy questions consists of two phases: (1) extraction of interesting pairs of concepts by using the Relatedness function (these pairs can be used as stems, keys or distractors) and (2) generation of multiple-choice questions based on the similarity between pairs, which can be derived from the proposed Analogy function. The general algorithm is presented below (see Algorithm 1). The difficulty of the generated questions can be controlled by setting the parameters Δ_1, Δ_2 and Δ_3. In addition, the number of distractors can be controlled by setting the parameter "max". Note that avoiding non-functional distractors (i.e. distractors not picked by any student) is preferred (Haladyna & Downing 1993; Paxton 2001).
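The two phases described above can be sketched as follows. This is not the authors' Algorithm 1; it is a minimal illustration under stated assumptions: `generate_questions` and its parameter names are hypothetical, the `relatedness` and `analogy` functions are passed in by the caller, and `delta_2` is assumed to be a lower bound on the analogy between stem and key.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def generate_questions(
    classes: Sequence[str],
    relatedness: Callable[[str, str], float],
    analogy: Callable[[Pair, Pair], float],
    delta_r: float,
    delta_1: float,
    delta_2: float,
    delta_3: float,
    max_distractors: int,
) -> List[Dict[str, object]]:
    # Phase 1: keep ordered pairs of distinct classes that are related enough.
    pairs = [
        (a, b)
        for a in classes
        for b in classes
        if a != b and relatedness(a, b) >= delta_r
    ]
    questions: List[Dict[str, object]] = []
    # Phase 2: combine pairs into stem, key and distractors.
    for stem in pairs:
        for key in pairs:
            if key == stem:
                continue
            key_score = analogy(stem, key)
            if key_score < delta_2:  # ASSUMED role of delta_2
                continue
            chosen: List[Pair] = []
            seen_scores = set()
            for cand in pairs:
                if cand in (stem, key):
                    continue
                s = analogy(stem, cand)
                # Plausible but clearly weaker than the key, with a score
                # not already used by another distractor.
                if delta_3 <= s <= key_score - delta_1 and s not in seen_scores:
                    chosen.append(cand)
                    seen_scores.add(s)
                if len(chosen) == max_distractors:
                    break
            if len(chosen) == max_distractors:
                questions.append({"stem": stem, "key": key, "distractors": chosen})
    return questions
```

Only questions with exactly `max_distractors` admissible distractors are returned in this sketch; a real generator might instead accept fewer, or rank candidates before choosing.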
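Algorithm 1 [...] Return AQ;
We used three different ontologies to test the proposed analogy-generation engine. The three ontologies are presented in Table 3 below with some basic ontology statistics. The first ontology is the Gene Ontology, which is a structured vocabulary for the annotation of gene products. It has three main parts: (1) molecular function, (2) cellular component and (3) biological process. The second and third ontologies are the People & Pets Ontology and the Pizza Ontology, which are very simple ontologies that were built to be used in ontology development tutorials. The table shows the number of satisfiable classes in each ontology and the number of sample questions generated by the engine (this is only a representative sample of all the generated questions). The table also shows the percentage of questions that our proposed solver agent can correctly solve. The details of the approach used to simulate question solving are explained in the following section.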

Corpus-based evaluation
In order to evaluate the proposed approach for analogy generation, we follow the method explained by Turney and Littman (2005) for evaluating analogies using a large corpus. In their study, Turney and Littman reported that their method can solve about 47% of multiple-choice analogy questions (compared to an average of 57% correct answers given by high school students). The solver takes a pair of words representing the stem of the question and five other pairs representing the answers presented to students. Their proposed method is inspired by the Vector Space Model (VSM) of information retrieval. For each provided answer, the solver creates two vectors representing the stem (R_1) and the given answer (R_2). The solver returns a numerical value for the degree of analogy between the stem and the given answer. Then, the answers are ranked according to their analogy value and the answer with the highest rank is considered the correct answer. To create the vectors, they proposed a table of 64 joining terms that can be used to join the two words in each pair (stem or answer). The two words are joined by these joining terms in two different ways (e.g. "X is Y" and "Y is X") to create a vector of 128 features. The actual values stored in each vector are calculated by counting the frequencies of those constructed terms in a large corpus (e.g. web resources indexed by a search engine).
To improve the accuracy of their proposed method, they suggested using the logarithm of the frequency instead of the frequency itself.
In this article, we follow a similar procedure. First, we constructed a table of joining terms relevant to the relations considered in our approach (e.g. "is a", "type", "and", "or"). Based on these joining terms, we create vectors of 10 features for the stem, the key and each distractor. Each constructed term is sent as a query to a search engine (Yahoo!) and the logarithm of the hit count is stored in the corresponding element of the vector. The hit count is always incremented by one to avoid undefined values (the logarithm of zero). Following this procedure, our proposed solver agent solved 8% of the questions generated from the Gene Ontology, 67% of the questions generated from the People and Pets Ontology and 88% of the questions generated from the Pizza Ontology. We argue that the low accuracy on the Gene Ontology is caused by the specific terminology it uses and by the lack of web resources covering that terminology, compared to the other ontologies.
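The solver procedure described above can be sketched as follows. This is a minimal illustration, not the actual solver agent: the function names are hypothetical, `hit_count` stands in for a search-engine query and must be supplied by the caller, only the four joining terms named in the text are used (giving 8 features rather than 10), and cosine similarity is assumed as the comparison between vectors, as is usual in the Vector Space Model.

```python
import math
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]

# Only the joining terms quoted in the text; the full table is not reproduced.
JOINING_TERMS = ["is a", "type", "and", "or"]


def feature_vector(
    pair: Pair,
    hit_count: Callable[[str], int],
    joining_terms: Sequence[str] = JOINING_TERMS,
) -> List[float]:
    """Join the two words in both orders with each term; store log(hits + 1)."""
    x, y = pair
    phrases = []
    for term in joining_terms:
        phrases.append(f"{x} {term} {y}")
        phrases.append(f"{y} {term} {x}")
    # Adding one keeps the logarithm defined when a phrase has no hits.
    return [math.log(hit_count(p) + 1) for p in phrases]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve(stem: Pair, options: Sequence[Pair], hit_count: Callable[[str], int]) -> int:
    """Return the index of the option whose vector is closest to the stem's."""
    stem_vec = feature_vector(stem, hit_count)
    scores = [cosine(stem_vec, feature_vector(o, hit_count)) for o in options]
    return max(range(len(options)), key=scores.__getitem__)
```

A question counts as solved when the returned index is that of the key. Terms with few or no hits, as is likely for specialised vocabulary, yield vectors close to zero, and ranking by similarity then carries little information.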

Conclusion and future work
In this article, we presented a new approach for generating multiple-choice analogy questions from existing ontologies. We described the design of an analogy generator and an analogy solver. The solver achieved a maximum accuracy of 88%. However, it achieved a low accuracy of 8% when used to solve analogies generated from the Gene Ontology. We suggest that the difficulty of the domain can be considered as an additional dimension of our difficulty-controlling model.
For future work, we are going to generalise our analogy generator to include user-defined relations. To evaluate analogies generated from arbitrary relations, we suggest using Latent Relational Analysis (LRA) (Turney 2005), which has the advantage of learning relations instead of using predefined joining terms.

Figure 1. Closely related structures of class-subclass relations [labels represent the number of steps and the direction (up or down)].

Table 2. Values returned by the proposed function Analogy(x,y).
*These values were not calculated using equation 1 but were manually coded because they correspond to similar but scaled structures.
Table 3. Basic ontology statistics for the Gene Ontology (a structured vocabulary for the annotation of gene products, with three main parts: molecular function, cellular component and biological process), the People & Pets Ontology and the Pizza Ontology (very simple ontologies built for use in ontology development tutorials): number of satisfiable classes in each ontology and number of sample questions generated by the engine (only a representative sample of all the generated questions).