Learning Words Right with the Sketch Engine and WebBootCat: Automatic Cloze Generation from Corpora and the Web
As I mentioned, one of the things that has been keeping me from posting regularly has been a hectic conference schedule. This includes work I've been doing as part of a research team headed by my Ming Chuan colleague Dr. Simon Smith and Dr. Adam Kilgarriff of the University of Brighton. This spring we presented a series of papers on computer applications in TEFL.
In addition to our presentation at TALC, last week Simon, Adam, and I presented a related paper at the 25th Conference on English Teaching and Learning in the ROC. The conference was held at the extremely beautiful campus of National Chung Cheng University in Chiayi. Our paper dealt with the use of the Sketch Engine and WebBootCat, software designed by Dr. Kilgarriff, to automatically generate fill-in-the-blank questions for classroom teaching.
Pasted below is a copy of the paper we presented. In addition, you can download the PowerPoint slides from the presentation here:
Learning words right with the Sketch Engine and WebBootCat: Automatic cloze generation from corpora and the web (ppt)
Typepad has a lot of trouble handling documents copied from Word, so there may be some errors in the paper below. Because of the conversion process, I had to delete the diagrams we used in the presentation. These can be found in the PowerPoint slides. But if there is any problem reading the text or understanding significant points, let me know and I'll mail you the Word-formatted version of the paper.
This project was generously funded by the ROC National Science Council.
Learning words right with the Sketch Engine and WebBootCat: Automatic cloze generation from corpora and the web
Simon Smith*, Scott Sommers* and Adam Kilgarriff†
*English Language Center, Ming Chuan University
†Lexical Computing Ltd, UK
Abstract
Cloze exercises are widely used in language teaching, both as a learning resource and an assessment tool. It has been shown that they can cultivate and test a wider range of skills than immediately meets the eye. Cloze has a particularly useful role to play in Taiwan, and other Asian countries, where students of English expect and are expected to memorize a lot of vocabulary. Cloze encourages acquisition of vocabulary through context, rather than the memorization of synonyms or translations. Unfortunately, it is time-consuming and difficult for teachers and materials designers to make up large numbers of cloze exercises.
The present paper briefly reviews the literature on cloze in language learning. It then describes how the authors used corpus resources to generate lists of vocabulary items which are salient to a particular topic, and presents an algorithm for automatically generating cloze exercises from corpora.
Introduction
Problems with Manual Cloze Preparation
In Smith, Sommers & Kilgarriff (2008) we reported how to extract corpora, on a specified topic, from the world-wide web, using WebBootCat (WBC; Baroni et al 2006). The corpora were then used to generate wordlists containing vocabulary which is salient to the topic. We showed that these wordlists are a better tool for language acquisition than many existing, manually derived lists; the latter tend to include items which are not truly relevant to the specified topic, and which may be too rare or obscure to be useful. Moreover, it is extremely difficult for teachers to create topic-specific vocabulary lists through introspection or brainstorming alone.
If that is true of wordlist generation, it must be doubly difficult for teachers to think up cloze exercises from scratch. After the correct answer (the key) has been selected, the teacher must compose a convincing and authentic carrier sentence, and generate distractors which, while incorrect, are somehow viable alternatives for completion of the carrier sentence. Quite often, the fruit of what can be a time-consuming and tedious process is an inauthentic and implausible carrier sentence; teachers often have difficulty, too, thinking of appropriate distractors, and are sometimes tempted to use distractors which could not possibly be correct.
It has been suggested that the distractors should appear in the language with approximately the same frequency as the key (Coniam, 1998), as frequency is a reasonable correlate of difficulty level; alternatively, that distractors should represent the types of errors typically occurring in a non-native English corpus, such as the Japanese Learners of English corpus used by Lee & Seneff (2007); or that distractors should have a similar semantic coverage to the key, and should be drawn from a thesaurus (Sumita et al, 2005) or similar resource. We take a comparable approach to this last, but instead of consulting a published thesaurus which cites synonyms and near-synonyms of the key, we search a large corpus for distractors which have a similar lexical distribution; that is to say, words which typically form the same collocational partnerships as the key. Thus, the words read and write could not be said to be synonymous in any way, but they do share a lexical distribution, because they both often collocate with complements like letter and book.
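The notion of a shared lexical distribution can be illustrated with a toy overlap measure. The sketch below is a simplification under stated assumptions: the collocate sets are invented for illustration, the function name is hypothetical, and plain Jaccard overlap stands in for whatever statistic the SkE Thesaurus actually computes.

```python
# Toy collocate sets, invented for illustration (not corpus output).
COLLOCATES = {
    "write": {"letter", "book", "symphony", "note"},
    "read": {"letter", "book", "newspaper", "note"},
    "eat": {"apple", "dinner"},
}


def shared_collocate_score(word_a, word_b, collocates=COLLOCATES):
    """Jaccard overlap of two words' collocate sets, from 0.0 to 1.0."""
    a, b = collocates[word_a], collocates[word_b]
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# read and write share letter, book and note, so they score well above
# an unrelated pair such as write and eat, which share nothing.
print(shared_collocate_score("write", "read"))
print(shared_collocate_score("write", "eat"))
```

On this view read and write are good distractor candidates for one another even though they are not synonyms, because their collocate sets overlap heavily.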
Here is an example of a cloze item generated by our system.
(1) Reality manages the home delivery operations of a range of GUS organisations, along with an enviable ____ of blue-chip clients.
Ans: investment infrastructure asset portfolio
The learner is asked to complete the underscored gap with one of the four answers given. The reader will agree that only the (key) answer portfolio is possible, and that if any of the three distractors were inserted, the sentence would become meaningless.
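For readers who want to store and print items like (1), here is one possible representation. It is only a sketch: the class name, field names and the render method are our own assumptions for illustration, and the key is listed last simply because that is how (1) is laid out.

```python
import re
from dataclasses import dataclass


@dataclass
class ClozeItem:
    sentence: str      # full carrier sentence, with the key still in place
    key: str           # the correct answer
    distractors: list  # incorrect options, in display order

    def render(self, blank="____"):
        """Return the gapped sentence followed by an 'Ans:' line."""
        pattern = r"\b" + re.escape(self.key) + r"\b"
        gapped = re.sub(pattern, blank, self.sentence, count=1)
        options = " ".join(self.distractors + [self.key])
        return gapped + "\nAns: " + options


item = ClozeItem(
    sentence="Reality manages the home delivery operations of a range of GUS "
             "organisations, along with an enviable portfolio of blue-chip clients.",
    key="portfolio",
    distractors=["investment", "infrastructure", "asset"],
)
print(item.render())
```

In classroom use one would probably shuffle the options before printing, so that the key does not always appear in the same position.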
System Architecture
In this work, we make use of the Sketch Engine (SkE) suite of corpus query tools described by Kilgarriff et al (2004). SkE has been in use by lexicographers for dictionary production and related applications, and because of its ability to highlight the most salient collocational patterns, is also well adapted to language learning. The suite allows inspection of linguistic corpora through four distinct modules: concordancing (line by line detailed view of the corpus contents), Word Sketch (short summary of collocational behaviour of the search term), Thesaurus and Sketch Differences (both are explained in greater detail presently). Our algorithm makes use of three of those modules.
SkE interfaces to a number of very large corpora, in several languages. We experimented with two of the English corpora offered: the 100 million word British National Corpus (BNC), as well as a much larger corpus harvested from the world-wide web, ukWaC, which runs to over 2 billion words. The BNC has served as a gold standard corpus for many years now: it has been used for countless linguistic, lexicographical and literary research endeavours. Disadvantages are that its contents are somewhat dated (the news stories, for example, concern the Great Britain of the 1980s), and that it is probably too small for this purpose. ukWaC is large enough to provide a sample of English from which many, many collocational patterns emerge (although one would always get added value from an even larger corpus, were one available). However, web corpora have an inherent disadvantage when compared to compiled corpora like the BNC: they contain a lot of non-textual data, including forms, long price lists and inventories. Some of the text will not be in formal English, and a proportion will not have been written by native speakers of English. The makers of ukWaC were at great pains to keep non-textual data out of ukWaC, but did not succeed in every case.
The item at (1) was generated from the ukWaC corpus. Some other experiments were performed, using both corpora, and these will be described presently.
It needs to be made clear at this point that our system is not computationally implemented. The procedure for deriving the carrier sentences and distractors currently involves the manual implementation of rules which will be automated when we have the necessary time and resources available; we have taken care to set the system up in such a way that it can be readily programmed. Ultimately, the teacher will be able to enter at a computer the key (correct answer) of their choice, and be presented with a cloze item like (1) above.
From the teacher’s perspective, the system works like this. The teacher types in the key, or specifies a file containing a list of keys to be processed. Thus, in (1) above, the teacher would have entered portfolio. The carrier sentence and the three incorrect answers (distractors) are returned by the system. Subsequently, in the interactive mode, the teacher would be asked if they were satisfied with the item, or whether they wanted to generate a new item using the same key, or whether they were happy with the sentence but would like to create a new set of distractors.
Internally, we start to search for potential distractors (PDs), with the same kind of lexical distribution as the key, using the Thesaurus module of SkE. Armed with a number of PDs, we then compare each one with the key, using Sketch Differences, looking at the same time for potential carrier sentences (PCSs) in the corpus where the PD and key do not share a collocate: that is, we extract from the corpus sentences in which all three PDs and the key are mutually exclusive, on contextual grounds. Given the key write, therefore, and the PCS John decided to (write) a book we would reject read as a distractor, because John decided to read a book is a perfectly good sentence of English. If, however, the PCS had been John decided to write a symphony, the word read would indeed be an eligible distractor, because reading a symphony is not a plausible activity.
If a PCS can be found in which all three distractors, if inserted, would make nonsense or would be rejected by a native speaker, the task is complete, and it remains only to verify that with the teacher. If no such sentence can be found, new distractors are introduced from the Thesaurus-derived list.
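The fallback just described, where further distractors are drawn from the Thesaurus-derived list, might be sketched as follows. This is an illustration only: the function name is hypothetical, the set of attested verb-object pairs is invented toy data rather than corpus output, and the carrier sentence is reduced to the single collocate that accompanies the key.

```python
def pick_distractors(ranked_pds, context_collocate, attested, wanted=3):
    """Walk the ranked PD list and keep the first `wanted` candidates that
    are never attested with the collocate found in the carrier sentence."""
    chosen = []
    for candidate in ranked_pds:
        if (candidate, context_collocate) not in attested:
            chosen.append(candidate)
        if len(chosen) == wanted:
            return chosen
    return None  # too few eligible distractors: try another sentence


# Toy verb-object pairs, invented for illustration (not corpus counts).
ATTESTED = {
    ("read", "book"), ("author", "book"),
    ("scribble", "note"), ("play", "symphony"),
}
RANKED = ["read", "play", "scribble", "author"]  # PDs for the key "write"

# "John decided to write a symphony": play is skipped, the rest qualify.
print(pick_distractors(RANKED, "symphony", ATTESTED))
# "John decided to write a book": read and author are ruled out, so the
# sentence is abandoned.
print(pick_distractors(RANKED, "book", ATTESTED))
```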
We now describe each step of the algorithm used for generating cloze items in detail.
Thesaurus Module
The reader will have realized that the Thesaurus module of SkE, capable as it is of indicating common distributional patterns such as those of read and write, is not a thesaurus in the traditional (Roget) sense. That does not in any way detract from its utility. It can still be used to search for synonyms, as long as a cross-check is performed (just as a wise user would make with a traditional thesaurus). Its primary function, though, is to output words which typically occur in the same context as the search term. Thus, on searching for write, we might expect to see such output as scribble (one can both write and scribble a note), author (one can write and author a book), as well as read and play, non-synonyms of write which can nonetheless occur in the context of book and symphony respectively.
We now examine the actual SkE Thesaurus output for portfolio (the key for the cloze item presented at (1) above). Figure 1 reveals that most of the words with similar distribution to portfolio are in fact not synonyms or near synonyms: only collection and package really seem to qualify. A number of the words, as one might expect, have to do with business and the world of investment, with investment itself and asset ranking high on the list. The presence of the word curriculum on the list reflects the fact that the term portfolio is now widely used in the education domain.
The three top-ranking list members (investment, infrastructure and asset) are retained as potential distractors (PDs).
Sketch Differences Module
We next consult the Sketch Differences display. Figure 2 shows sketch differences for portfolio and investment, in contexts where either can occur in the ukWaC corpus. Notice how the display divides the output into grammatical relations between keyword and collocate. Figure 2 shows us that portfolio occurs 34 times in a PP_IN relation with excess, while investment occurs in this collocation 25 times. Typical contexts are “… an investment/ a portfolio in excess of n million dollars”.
Figure 1 SkE Thesaurus entry for portfolio
Figure 2 Part of SkE Sketch Differences entry for portfolio and investment
Of course, we are interested in situations where the two words do not share a collocate, and for this we glance down at the “portfolio only” patterns. Alongside each collocating word, in Figure 2, is shown the frequency of the collocation (an underlined integer) and the salience (an index of the number of times portfolio occurs with the collocating word, as opposed to other words, given to one decimal place).
We now search for the collocate appearing only with portfolio (and never with investment) with the highest salience. We apply the condition that the collocate must be a correctly spelled English word, not a proper name. Thus, the non-alpha character with salience of 10.6 is rejected, as is harrah, a proper name (salience 9.6). The third-ranking in salience (8.8), diversified, is selected, and labelled Potential Key Collocate (PKC).
We now consider the second PD, infrastructure. The PKC diversified also does not occur in ukWaC in collocation with this PD, so it remains a candidate. However, when we move on to consider the third PD, asset, we find that diversified assets does indeed occur in the corpus. This means that asset cannot be used as a distractor for the key portfolio in the context diversified portfolio.
We therefore consider the collocate appearing only with portfolio with the fourth highest salience: this turns out to be enviable. This time, we find that the PKC does not occur in collocation with any of the PDs, so it is adopted as key collocate (KC).
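The selection of the key collocate can be sketched in a few lines. The code below is not an implementation of our system (which, as noted, is not yet programmed) but a minimal illustration under assumptions: the function name and the english_words set are hypothetical, "&" stands in for the unnamed non-alpha character, and the salience given for enviable is a placeholder that only preserves its fourth-place rank, since the real figure is not quoted above.

```python
def select_key_collocate(key_only, pd_collocates, english_words):
    """Return the most salient key-only collocate that no PD also takes.

    key_only: list of (collocate, salience) pairs seen with the key only.
    pd_collocates: dict mapping each PD to the set of its collocates.
    english_words: set used to reject misspellings and proper names.
    """
    ranked = sorted(key_only, key=lambda pair: pair[1], reverse=True)
    for collocate, _salience in ranked:
        if not collocate.isalpha() or collocate not in english_words:
            continue  # non-alpha token, misspelling or proper name
        if any(collocate in seen for seen in pd_collocates.values()):
            continue  # at least one PD shares this collocate
        return collocate
    return None


KEY_ONLY = [
    ("&", 10.6),           # stand-in for the non-alpha character
    ("harrah", 9.6),       # proper name, rejected
    ("diversified", 8.8),  # rejected: "diversified assets" is attested
    ("enviable", 8.0),     # placeholder salience, rank four
]
PD_COLLOCATES = {
    "investment": set(),
    "infrastructure": set(),
    "asset": {"diversified"},
}
ENGLISH_WORDS = {"diversified", "enviable"}  # hypothetical wordlist

print(select_key_collocate(KEY_ONLY, PD_COLLOCATES, ENGLISH_WORDS))
```

With these inputs the function passes over the first three candidates for the reasons given in the text and returns enviable.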
So far, we have decided on the key, as well as the three distractors. We also intend for our carrier sentence to include the collocation enviable portfolio. The next step is to inspect potential carrier sentences (PCSs), and we can do this by consulting a concordance.
Concordance Module
A concordance is simply a list of all the sentences (or lines) in a corpus that include a particular pattern. It is not surprising, therefore, that when calling up a concordance, one is often faced with sentences that are long or unwieldy, and include rare vocabulary or obscure proper names. This is particularly true of corpora that are harvested from the web, such as ukWaC.
With a view to generating good dictionary examples, the SkE concordancing software is equipped with a feature called GDEX (Husak et al, forthcoming) that prefers certain types of sentences. Sentences between 10 and 25 words long are preferred, and rare words and anaphora are penalized; a number of other measures are described in detail by Husak et al.
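To give a feel for this kind of ranking, here is a toy scorer built only from the three criteria mentioned above. It is emphatically not GDEX: the function name, the penalty weights, the pronoun list and the common_words set are all assumptions made for illustration.

```python
import re

# A small, illustrative list of anaphoric words (an assumption, not GDEX's).
ANAPHORS = {"he", "she", "it", "they", "him", "her", "them",
            "this", "that", "these", "those"}


def example_score(sentence, common_words, min_len=10, max_len=25):
    """Score a sentence from roughly 0 to 1; higher means a better example."""
    tokens = re.findall(r"[a-z'-]+", sentence.lower())
    if not tokens:
        return 0.0
    score = 1.0
    if not min_len <= len(tokens) <= max_len:
        score -= 0.5                      # outside the preferred length
    rare = sum(1 for t in tokens if t not in common_words)
    score -= 0.3 * rare / len(tokens)     # share of rare words
    if any(t in ANAPHORS for t in tokens):
        score -= 0.2                      # relies on outside context
    return score


good = "The fund seeks long term growth from a broad range of shares"
bad = "It failed"
common = set(good.lower().split())  # pretend these are the frequent words

# Sort candidate sentences so the most example-like comes first.
ranked = sorted([bad, good], key=lambda s: example_score(s, common), reverse=True)
print(ranked[0])
```

A concordance sorted this way would show the teacher self-contained sentences of moderate length first, which is the behaviour we rely on when choosing a carrier sentence.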
From the concordance output of Figure 3, we may now extract the sentence shown at (1) above. If the user was dissatisfied with the first sentence, as a cloze exercise, they could be prompted to select the second or a subsequent sentence.
Figure 3 Part of SkE concordance entry for portfolio and enviable
BNC cloze example
In our experiments, we also generated (2) from the British National Corpus:
(2) Albert E Sharp Fund Managers have launched AES European unit trust, which seeks long-term capital growth from a diversified _____ of European Securities.
Ans: asset portfolio stock holding
Unlike ukWaC, the corpus used to generate (1), the BNC does not contain any examples of the adjective diversified modifying any of the PDs. However, the concept of a “diversified holding of European Securities” does seem quite plausible, so it is unlikely that many teachers would find (2) an acceptable cloze exercise.
The way in which the BNC was compiled means that it consists mostly of clean text, and relatively little noise, while ukWaC contains a fair amount of duplication and non-textual data. This might be taken as a compelling argument for preferring the BNC as a source corpus. However, the GDEX software does a good job of ensuring that the most meaningful sentences from a ukWaC concordance are presented first. What is more, if we posit that certain collocations have a vanishingly small chance of occurring – for that is the claim that one makes when setting the distractors for a cloze exercise – we should be using the very largest corpus available.
Next Steps
We have described an algorithm which is capable of generating a carrier sentence and distractors, given a user-supplied key (correct answer), showing how modules of the SkE corpus query tool can be used to generate these components.
As mentioned above, we will shortly prepare an implementation of the algorithm that will allow a user to supply a key at a computer, and be presented with a suggested cloze item. If the item is not satisfactory, the user will be able to run the program again and generate a new exercise.
Beyond straightforward programming, some work will be necessary to ensure that distractors match the key in terms of inflectional morphology (plural –s and the like). A review of any copyright issues involved will also be necessary.
Once implemented, this work can be put to good use immediately. Teachers who use the program will be able to generate authentic cloze items in very short order. By supplying a list of vocabulary items pertinent to the topic of a unit or lesson, such as the “Business” or “Getting started at university” lists described in Smith et al (2008), it will be possible to produce a set of highly relevant cloze exercises. These exercises can be used for assessment, or simply as part of day to day teaching, making students aware of the collocational patterns in which the topic vocabulary commonly participates.
References
Baroni, M., Kilgarriff, A., Pomikálek, J. & Rychlý, P. (2006). WebBootCaT: instant domain-specific corpora to support human translators. In Proceedings of EAMT 2006, Oslo, 247-252.
Coniam, D. (1998) From Text to Test, Automatically—An Evaluation of a Computer Cloze-Test Generator. Hong Kong Journal of Applied Linguistics 3(1):41-60.
Husak, M., Kilgarriff, A., McAdam, K., Rundell, M., Rychlý, P. (forthcoming) GDEX: Automatically finding good dictionary examples in a corpus. EURALEX, Barcelona. July 2008.
Kilgarriff, A., Rychlý, P., Smrž, P. & Tugwell, D. (2004). The Sketch Engine. Paper presented at EURALEX, Lorient, France, July 2004.
Smith, S., Sommers, S. & Kilgarriff, A. (2008) Learning words right with the Sketch Engine and WebBootCat: Meaningful lexical acquisition from corpora and the web. 2008 CamTESOL conference, Phnom Penh.
Sumita, E., Sugaya, F., and Yamamoto, S. (2005) Measuring Non-native Speakers’ Proficiency of English by Using a Test with Automatically-Generated Fill-in-the-Blank Questions. Proc. 2nd Workshop on Building Educational Applications using NLP, Ann Arbor.