May 14, 2008

Learning Words Right with the Sketch Engine and WebBootCat: Automatic Cloze Generation from Corpora and the Web

As I mentioned, one of the things that has been keeping me from posting regularly has been a hectic conference schedule. This includes work I've been doing as part of a research team headed by my Ming Chuan colleague Dr. Simon Smith and Dr. Adam Kilgarriff of the University of Brighton. This spring we presented a series of papers on computer applications in TEFL.

In addition to our presentation at TALC, last week Simon, Adam, and I presented a related paper at the 25th Conference on English Teaching and Learning in the ROC. The conference was held at the extremely beautiful campus of National Chung Cheng University in Chiayi. Our paper dealt with the use of the Sketch Engine and WebBootCat software designed by Dr. Kilgarriff to automatically generate fill-in-the-blank questions for classroom teaching.

Pasted below is a copy of the paper we presented. In addition, you can download the PowerPoint slides from the presentation here:
Learning words right with the Sketch Engine and WebBootCat: Automatic cloze generation from corpora and the web (ppt)

Typepad has a lot of trouble handling documents copied from Word, so there may be some errors in the paper below. Because of the conversion process, I was not able to enlarge the text in the post, as I have been doing recently. In addition, I had to delete the diagrams we used in the presentation. These can be found in the PowerPoint slides. But if there is any problem reading the text or understanding significant points, let me know and I'll mail you the Word-formatted version of the paper.

This project was generously funded by the ROC National Science Council.

 
Learning words right with the Sketch Engine and WebBootCat: Automatic cloze generation from corpora and the web
Simon Smith*, Scott Sommers* and
Adam Kilgarriff†
*English Language Center, Ming Chuan University
†Lexical Computing Ltd, UK

Abstract
Cloze exercises are widely used in language teaching, both as a learning resource and an assessment tool. It has been shown that they can cultivate and test a wider range of skills than immediately meets the eye. Cloze has a particularly useful role to play in Taiwan, and other Asian countries, where students of English expect and are expected to memorize a lot of vocabulary. Cloze encourages acquisition of vocabulary through context, rather than the memorization of synonyms or translations. Unfortunately, it is time-consuming and difficult for teachers and materials designers to make up large numbers of cloze exercises.

The present paper briefly reviews the literature on cloze in language learning. It then describes how the authors used corpus resources to generate lists of vocabulary items which are salient to a particular topic, and presents an algorithm for automatically generating cloze exercises from corpora.

Introduction

Problems with manual cloze preparation

In Smith, Sommers & Kilgarriff (2008) we reported how to extract corpora, on a specified topic, from the world-wide web, using WebBootCat (WBC; Baroni et al 2006). The corpora were then used to generate wordlists containing vocabulary which is salient to the topic. We showed that these wordlists are a better tool for language acquisition than many existing, manually derived lists; the latter tend to include items which are not truly relevant to the specified topic, and which may be too rare or obscure to be useful. Moreover, it is extremely difficult for teachers to create topic-specific vocabulary lists through introspection or brainstorming alone.

If that is true of wordlist generation, it must be doubly difficult for teachers to think up cloze exercises from scratch. After the correct answer (the key) has been selected, the teacher must compose a convincing and authentic carrier sentence, and generate distractors which, while incorrect, are somehow viable alternatives for completion of the carrier sentence. Quite often, the fruit of what can be a time-consuming and tedious process is an inauthentic and implausible carrier sentence; teachers often have difficulty, too, thinking of appropriate distractors, and are sometimes tempted to use distractors which could not possibly be correct.

It has been suggested that the distractors should appear in the language with approximately the same frequency as the key (Coniam, 1998), as frequency is a reasonable correlate of difficulty level; alternatively, that distractors should represent the types of errors typically occurring in a non-native English corpus, such as the Japanese Learners of English corpus used by Lee & Seneff (2007); or that distractors should have a similar semantic coverage to the key, and should be drawn from a thesaurus (Sumita et al, 2005) or similar resource. We take a comparable approach to this last, but instead of consulting a published thesaurus which cites synonyms and near-synonyms of the key, we search a large corpus for distractors which have a similar lexical distribution; that is to say, words which typically form the same collocational partnerships as the key. Thus, the words read and write could not be said to be synonymous in any way, but they do share a lexical distribution, because they both often collocate with complements like letter and book.
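
To make the idea of a shared lexical distribution concrete, here is a minimal Python sketch. The collocate sets are hand-made toy data rather than corpus output, and the Jaccard overlap is an assumption chosen for simplicity; it is not the similarity statistic the Sketch Engine actually computes.

```python
# Toy collocate sets (typical objects of each verb), invented for illustration.
COLLOCATES = {
    "write": {"letter", "book", "symphony", "note"},
    "read": {"letter", "book", "newspaper", "note"},
    "eat": {"apple", "lunch", "bread"},
}


def distributional_overlap(word_a, word_b, collocates=COLLOCATES):
    """Jaccard overlap between the collocate sets of two words (0.0 to 1.0)."""
    a, b = collocates[word_a], collocates[word_b]
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_similar(word, collocates=COLLOCATES):
    """Other words, ordered from most to least similar distribution."""
    others = [w for w in collocates if w != word]
    return sorted(
        others,
        key=lambda w: distributional_overlap(word, w, collocates),
        reverse=True,
    )


print(rank_similar("write"))
```

On this toy data, read ranks above eat as a neighbour of write because the two verbs share letter, book and note, even though they are not synonyms.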

Here is an example of a cloze item generated by our system.
(1) Reality manages the home delivery operations of a range of GUS organisations, along with an enviable ____ of blue-chip clients.

Ans: investment   infrastructure    asset   portfolio

The learner is asked to complete the underscored gap with one of the four answers given. The reader will agree that only the (key) answer portfolio is possible, and that if any of the three distractors were inserted, the sentence would become meaningless.

System architecture

In this work, we make use of the Sketch Engine (SkE) suite of corpus query tools described by Kilgarriff et al (2004). SkE has been in use by lexicographers for dictionary production and related applications, and because of its ability to highlight the most salient collocational patterns, is also well adapted to language learning. The suite allows inspection of linguistic corpora through four distinct modules: concordancing (line by line detailed view of the corpus contents), Word Sketch (short summary of collocational behaviour of the search term), Thesaurus and Sketch Differences (both are explained in greater detail presently). Our algorithm makes use of three of those modules.

SkE interfaces to a number of very large corpora, in several languages. We experimented with two of the English corpora offered: the 100 million word British National Corpus (BNC), as well as a much larger corpus harvested from the world-wide web, ukWaC, which runs to over 2 billion words. The BNC has served as a gold standard corpus for many years now: it has been used for countless linguistic, lexicographical and literary research endeavours. Disadvantages are that its contents are somewhat dated (the news stories, for example, concern the Great Britain of the 1980s), and that it is probably too small for this purpose. ukWaC is large enough to provide a sample of English from which many, many collocational patterns emerge (although one would always get added value from an even larger corpus, were one available). However, web corpora have an inherent disadvantage when compared to compiled corpora like the BNC: they contain a lot of non-textual data, including forms, long price lists and inventories. Some of the text will not be in formal English, and a proportion will not have been written by native speakers of English. The makers of ukWaC were at great pains to keep non-textual data out of ukWaC, but did not succeed in every case.

The item at (1) was generated from the ukWaC corpus. Some other experiments were performed, using both corpora, and these will be described presently.

It needs to be made clear at this point that our system is not computationally implemented. The procedure for deriving the carrier sentences and distractors currently involves the manual implementation of rules which will be automated when we have the necessary time and resources available; we have taken care to set the system up in such a way that it can be readily programmed. Ultimately, the teacher will be able to enter at a computer the key (correct answer) of their choice, and be presented with a cloze item like (1) above.

From the teacher’s perspective, the system works like this. The teacher types in the key, or specifies a file containing a list of keys to be processed. Thus, in (1) above, the teacher would have entered portfolio. The carrier sentence and the three incorrect answers (distractors) are returned by the system. Subsequently, in the interactive mode, the teacher would be asked if they were satisfied with the item, or whether they wanted to generate a new item using the same key, or whether they were happy with the sentence but would like to create a new set of distractors.
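
Since the system is not yet implemented, the following is only a sketch of what the final assembly step might look like once a carrier sentence and distractors have been chosen. The function name `make_cloze_item`, its parameters and the returned dictionary layout are all hypothetical.

```python
import random
import re


def make_cloze_item(sentence, key, distractors, blank="____", seed=None):
    """Blank out the first whole-word occurrence of `key` and shuffle the options.

    Hypothetical helper: the paper describes the teacher-facing behaviour,
    not this interface.
    """
    pattern = re.compile(r"\b" + re.escape(key) + r"\b")
    if not pattern.search(sentence):
        raise ValueError(f"key {key!r} does not occur in the carrier sentence")
    stem = pattern.sub(blank, sentence, count=1)
    options = [key] + list(distractors)
    # A seeded generator makes the option order reproducible when needed.
    random.Random(seed).shuffle(options)
    return {"stem": stem, "options": options, "answer": key}


item = make_cloze_item(
    "Reality manages the home delivery operations of a range of GUS "
    "organisations, along with an enviable portfolio of blue-chip clients.",
    key="portfolio",
    distractors=["investment", "infrastructure", "asset"],
    seed=0,
)
print(item["stem"])
print("Ans:", "   ".join(item["options"]))
```

In an interactive mode, the teacher's request for "a new set of distractors" would amount to calling the same function again with a different distractor list.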

Internally, we start to search for potential distractors (PDs), with the same kind of lexical distribution as the key, using the Thesaurus module of SkE. Armed with a number of PDs, we then compare each one with the key, using Sketch Differences, looking at the same time for potential carrier sentences (PCSs) in the corpus where the PD and key do not share a collocate: that is, we extract from the corpus sentences in which all three PDs and the key are mutually exclusive, on contextual grounds. Given the key write, therefore, and the PCS John decided to (write) a book we would reject read as a distractor, because John decided to read a book is a perfectly good sentence of English. If, however, the PCS had been John decided to write a symphony, the word read would indeed be an eligible distractor, because reading a symphony is not a plausible activity.

If a PCS can be found in which all three distractors, if inserted, would make nonsense or would be rejected by a native speaker, the task is complete, and it remains only to verify that with the teacher. If no such sentence can be found, new distractors are introduced from the Thesaurus-derived list.
We now describe each step of the algorithm used for generating cloze items in detail.

Thesaurus module

The reader will have realized that the Thesaurus module of SkE, capable as it is of indicating common distributional patterns such as those of read and write, is not a thesaurus in the traditional (Roget) sense. That does not in any way detract from its utility. It can still be used to search for synonyms, as long as a cross-check is performed (just as a wise user would make with a traditional thesaurus). Its primary function, though, is to output words which typically occur in the same context as the search term. Thus, on searching for write, we might expect to see such output as scribble (one can both write and scribble a note), author (one can write and author a book), as well as read and play, non-synonyms of write which can nonetheless occur in the context of book and symphony respectively.

We now examine the actual SkE Thesaurus output for portfolio (the key for the cloze item presented at (1) above). Figure 1 reveals that most of the words with similar distribution to portfolio are in fact not synonyms or near synonyms: only collection and package really seem to qualify. A number of the words, as one might expect, have to do with business and the world of investment, with investment itself and asset ranking high on the list. The presence of the word curriculum on the list reflects the fact that the term portfolio is now widely used in the education domain.

The three top-ranking list members (investment, infrastructure and asset) are retained for use as potential distractors (PDs).

Sketch Differences module

We next consult the Sketch Differences display. Figure 2 shows sketch differences for portfolio and investment, in contexts where either can occur in the ukWaC corpus. Notice how the display divides the output into grammatical relations between keyword and collocate. Figure 2 shows us that portfolio occurs 34 times in a PP_IN relation with excess, while investment occurs in this collocation 25 times. Typical contexts are “… an investment/ a portfolio in excess of n million dollars”.
 
Figure 1 SkE Thesaurus entry for portfolio

Figure 2 Part of SkE Sketch Differences entry for portfolio and investment

Of course, we are interested in situations where the two words do not share a collocate, and for this we glance down at the “portfolio only” patterns. Alongside each collocating word, in Figure 2, is shown the frequency of the collocation (an underlined integer) and the salience (an index of the number of times portfolio occurs with the collocating word, as opposed to other words, given to one decimal place).

We now search for the collocate with the highest salience that appears only with portfolio (and never with investment). We apply the condition that the collocate must be a correctly spelled English word, not a proper name. Thus, the non-alphabetic character with a salience of 10.6 is rejected, as is harrah, a proper name (salience 9.6). The third-ranking collocate by salience (8.8), diversified, is selected and labelled the Potential Key Collocate (PKC).

We now consider the second PD, infrastructure. The PKC diversified also does not occur in ukWaC in collocation with this PD, so it remains a candidate. However, when we move on to consider the third PD, asset, we find that diversified assets does indeed occur in the corpus. This means that asset cannot be used as a distractor for the key portfolio in the context diversified portfolio.

We therefore consider the collocate appearing only with portfolio with the fourth highest salience: this turns out to be enviable. This time, we find that the PKC does not occur in collocation with any of the PDs, so it is adopted as key collocate (KC).
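
The selection of the key collocate walked through above can be sketched in Python as follows. This is an illustration under stated assumptions, not our implementation: the function names are hypothetical, the "#" stands in for the unspecified non-alphabetic character, the salience given for enviable is a placeholder (the text only says it ranks fourth), and the small vocabulary set stands in for a real spelling and proper-name check.

```python
def is_usable(collocate, vocabulary):
    """Accept only alphabetic items found in a list of ordinary English words."""
    return collocate.isalpha() and collocate in vocabulary


def select_key_collocate(key_only, distractor_collocates, vocabulary):
    """Return the most salient usable collocate that no distractor shares.

    key_only: (collocate, salience) pairs attested with the key.
    distractor_collocates: maps each potential distractor to the set of
        collocates it is attested with in the corpus.
    """
    for collocate, _salience in sorted(key_only, key=lambda p: p[1], reverse=True):
        if not is_usable(collocate, vocabulary):
            continue  # non-alphabetic item, misspelling or proper name
        if any(collocate in seen for seen in distractor_collocates.values()):
            continue  # e.g. "diversified assets" occurs, so reject "diversified"
        return collocate
    return None  # caller should fetch new distractors from the thesaurus list


# Saliences 10.6, 9.6 and 8.8 come from the text; 8.0 is a placeholder.
key_only = [("#", 10.6), ("harrah", 9.6), ("diversified", 8.8), ("enviable", 8.0)]
distractor_collocates = {
    "investment": set(),
    "infrastructure": set(),
    "asset": {"diversified"},
}
vocabulary = {"diversified", "enviable"}  # stand-in for a real word list

print(select_key_collocate(key_only, distractor_collocates, vocabulary))
```

With this toy input the function returns enviable, mirroring the manual procedure: the first two candidates fail the word check, and diversified is rejected because it is attested with asset.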

So far, we have decided on the key, as well as the three distractors. We also intend for our carrier sentence to include the collocation enviable portfolio. The next step is to inspect potential carrier sentences (PCSs), and we can do this by consulting a concordance.

Concordance module

A concordance is simply a list of all the sentences (or lines) in a corpus that include a particular pattern. It is not surprising, therefore, that when calling up a concordance, one is often faced with sentences that are long or unwieldy, and include rare vocabulary or obscure proper names. This is particularly true of corpora that are harvested from the web, such as ukWaC.

With a view to generating good dictionary examples, the SkE concordancing software is equipped with a feature called GDEX (Husak et al, forthcoming) that prefers certain types of sentences. Sentences between 10 and 25 words long are preferred, and rare words and anaphora are penalized, along with a number of other measures described in detail by Husak et al.
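
The kind of preference just described can be illustrated with a toy scorer. This is emphatically not the GDEX algorithm: the weights, the pronoun list and the use of a plain word set as a frequency proxy are all assumptions made for the sake of a short, runnable Python example.

```python
import re

# Small illustrative set of anaphoric words; not GDEX's actual list.
ANAPHORA = {"he", "she", "it", "they", "him", "her", "them",
            "this", "that", "these", "those"}


def gdex_like_score(sentence, common_words, min_len=10, max_len=25):
    """Score a sentence from 0.0 (poor example) to 1.0 (good example)."""
    tokens = re.findall(r"[A-Za-z']+", sentence.lower())
    if not tokens:
        return 0.0
    score = 1.0
    if not (min_len <= len(tokens) <= max_len):
        score -= 0.5  # outside the preferred length range
    rare = sum(1 for t in tokens if t not in common_words)
    score -= 0.3 * rare / len(tokens)  # penalty grows with share of rare words
    if any(t in ANAPHORA for t in tokens):
        score -= 0.2  # the sentence may depend on earlier context
    return max(score, 0.0)


def rank_sentences(sentences, common_words):
    """Order concordance lines so the most example-like come first."""
    return sorted(sentences,
                  key=lambda s: gdex_like_score(s, common_words),
                  reverse=True)
```

A concordance for the key and its key collocate could be passed through `rank_sentences`, with the teacher shown the first line and offered the next if dissatisfied.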

From the concordance output of Figure 3, we may now extract the sentence shown at (1) above. If the user was dissatisfied with the first sentence, as a cloze exercise, they could be prompted to select the second or a subsequent sentence.

Figure 3 Part of SkE concordance entry for portfolio and enviable

BNC cloze example

In our experiments, we also generated (2) from the British National Corpus.
(2) Albert E Sharp Fund Managers have launched AES European unit trust, which seeks long-term capital growth from a diversified _____ of European Securities.

Ans: asset     portfolio      stock      holding

Unlike ukWaC, the corpus used to generate (1), the BNC does not contain any examples of the adjective diversified modifying any of the PDs. However, the concept of a “diversified holding of European Securities” does seem quite plausible, so it is unlikely that many teachers would find (2) an acceptable cloze exercise.

The way in which the BNC was compiled means that it consists mostly of clean text, and relatively little noise, while ukWaC contains a fair amount of duplication and non-textual data. This might be taken as a compelling argument for preferring the BNC as a source corpus. However, the GDEX software does a good job of ensuring that the most meaningful sentences from a ukWaC concordance are presented first. What is more, if we posit that certain collocations have a vanishingly small chance of occurring – for that is the claim that one makes when setting the distractors for a cloze exercise – we should be using the very largest corpus available.

Next steps

We have described an algorithm which is capable of generating a carrier sentence and distractors, given a user-supplied key (correct answer), showing how modules of the SkE corpus query tool can be used to generate these components.

As mentioned above, we will shortly prepare an implementation of the algorithm that will allow a user to supply a key at a computer, and be presented with a suggested cloze item. If the item is not satisfactory, the user will be able to run the program again and generate a new exercise.
Beyond straightforward programming, some work will be necessary to ensure that distractors match the key in terms of inflectional morphology (plural –s and the like). A review of any copyright issues involved will also be necessary.
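
As a rough indication of what the morphology step might involve, the Python sketch below handles only the singular/plural contrast for nouns, using naive spelling rules. Both function names are hypothetical, and the rules will mis-handle irregular nouns and some words ending in -o; a real implementation would more likely draw on the lemma and tag information already present in the corpus.

```python
def naive_plural(noun):
    """Very rough English pluralization; irregular nouns are not handled."""
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


def match_number(key_lemma, key_surface, distractor_lemmas):
    """Inflect distractors to agree in number with the key as it appears."""
    if key_surface == key_lemma:
        return list(distractor_lemmas)
    if key_surface == naive_plural(key_lemma):
        return [naive_plural(d) for d in distractor_lemmas]
    raise ValueError(f"cannot relate {key_surface!r} to lemma {key_lemma!r}")


print(match_number("portfolio", "portfolios",
                   ["investment", "infrastructure", "asset"]))
```

If the carrier sentence contained portfolios rather than portfolio, the distractors would be offered as investments, infrastructures and assets, so that number agreement alone does not give the answer away.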

Once implemented, this work can be put to good use immediately. Teachers who use the program will be able to generate authentic cloze items in very short order. By supplying a list of vocabulary items pertinent to the topic of a unit or lesson, such as the “Business” or “Getting started at university” lists described in Smith et al (2008), it will be possible to produce a set of highly relevant cloze exercises. These exercises can be used for assessment, or simply as part of day to day teaching, making students aware of the collocational patterns in which the topic vocabulary commonly participates.

References
Baroni, M., Kilgarriff, A., Pomikálek, J. & Rychlý, P. (2006). WebBootCaT: instant domain-specific corpora to support human translators. In Proceedings of EAMT 2006, Oslo, 247-252.

Kilgarriff, A., Rychlý, P., Smrž, P. & Tugwell, D. (2004). The Sketch Engine. Paper presented at EURALEX, Lorient, France, July 2004.

Smith, S., Sommers, S. & Kilgarriff, A. (2008) Learning words right with the Sketch Engine and WebBootCat: Meaningful lexical acquisition from corpora and the web. 2008 CamTESOL conference, Phnom Penh.

Coniam, D. (1998) From Text to Test, Automatically—An Evaluation of a Computer Cloze-Test Generator. Hong Kong Journal of Applied Linguistics 3(1):41-60.

Sumita, E., Sugaya, F., and Yamamoto, S. (2005) Measuring Non-native Speakers’ Proficiency of English by Using a Test with Automatically-Generated Fill-in-the-Blank Questions. Proc. 2nd Workshop on Building Educational Applications using NLP, Ann Arbor.

Husak, M., Kilgarriff, A., McAdam, K., Rundell, M., Rychlý, P. (forthcoming) GDEX: Automatically finding good dictionary examples in a corpus. EURALEX, Barcelona. July 2008.

May 09, 2008

Topics on Taiwan English Blogging

Topics is the English-language magazine of the American Chamber of Commerce in Taiwan. This month's issue features an article on blogging in Taiwan. The writer, Steven Crook, features interviews with a range of English-language bloggers that will probably be familiar to readers including Michael Turton, Jason Cox, Greg Talovich and me.

May 01, 2008

8th Teaching and Language Corpora Conference

As I have said, I have been extremely busy on a series of presentations for local and international conferences. These presentations come from a project funded by the National Science Council of the ROC headed by my colleague Dr. Simon Smith and involve the use of corpus linguistics to develop language teaching procedures. In particular, we have been trying to develop practical applications for the Sketch Engine and WebBootCat, developed by Dr. Adam Kilgarriff of Lexical Computing and the University of Brighton.

I plan on posting papers and PowerPoint slides from this project, but I am encountering technical problems with incompatibility between Typepad blogging and Word-formatted documents. Anyway, this post is the abstract of our presentation at the 8th Teaching and Language Corpora Conference (TALC) coming up this July.

 

Automatic cloze generation: getting sentences and distractors from corpora

by

Simon Smith (Ming Chuan University), Scott Sommers (Ming Chuan University) and Adam Kilgarriff (Lexical Computing)

Keywords: cloze, vocabulary, language testing, WebBootCat, Sketch Engine

In the ELT programme at Ming Chuan University, Taipei we have found cloze exercises to be a useful learning and assessment tool. We are required to conduct formal English examinations twice per semester, and student numbers are large. Earlier research (Bachman, 1985; Hughes 1981) has indicated that cloze exercises can be used to assess a surprisingly wide range of language skills, including speaking; we lack the resources to examine all our students orally, but cloze provides a practical substitute. 

Currently, cloze exercises are prepared by hand. Not only is this time-consuming, but also the deleted item and distractors are chosen in an arbitrary way. A better solution would be to generate cloze exercises whose distractors are semantically related in some statistically demonstrable way. Ideally, the distractors would have features in common with the correct answer, determined by their similar distribution in a corpus, but would not normally occur in collocation with some other word in the sentence. By way of a simple example, take the cloze exercise “It’s a ___ day”. The correct answer might be sunny, and the distractors tepid, lukewarm and toasty. 

In this paper, after a brief review of the role of cloze, both at our university and in ELT generally, we present an algorithm for the automatic generation of cloze exercises. We use a web corpus builder (WebBootCat, described by Baroni et al, 2006) to download a set of texts on a specified topic, and select from this corpus a word w, determined by WebBootCat to be one of the most salient to the topic. Using the Sketch Engine corpus query tool (SkE, described by Kilgarriff et al, 2004), we identify a sentence S, containing w and a word c which collocates strongly with w. The student is then presented with a cloze version of S from which c has been deleted. Distractors are chosen from a set of words, also returned by a function of the Sketch Engine, which are similar in distribution to c, but do not occur in collocation with w.

We give examples of ways in which the generated cloze exercises could be used in class, in the lab, or at home, and show how they could be incorporated into an interactive CALL interface, making students’ learning experience more enjoyable and fruitful.

References

Bachman, L. 1985. “Performance on Cloze Tests with Fixed-Ratio and Rational Deletions.” TESOL Quarterly, Vol. 19, No. 3, pp. 535-556.

Baroni, M., Kilgarriff, A., Pomikálek, J. and Rychlý, P. 2006. “WebBootCat: instant domain-specific corpora to support human translators.” In Proceedings of EAMT 2006, Oslo, 247-252

Hughes, A. 1981. “Conversational Cloze as a Measure of Oral Ability.” ELT Journal 1981 XXXV(2), pp 161-168

Kilgarriff, A., Rychlý, P., Smrž, P. and Tugwell, D. 2004. “The Sketch Engine.” Paper presented at EURALEX, Lorient, France, July 2004.

April 24, 2008

Aboriginal Drop-out Rates and Mother Tongue Language Education

You may have seen in today's Taipei Times statements from Vice Minister of Education Chou Tsan-der (周燦德) concerning the success of schools that educate aborigines. The vice-minister was involved in a dispute with opposition members concerning the drop-out rates and the numbers of aborigines who attend university and college in Taiwan. In the course of this dispute, he stated 171 Aboriginal students attended colleges and universities in 2002, and that the number nearly doubled to 332 in 2003 and jumped to 714 last year. This is entirely misleading.

Last week, I attended the 5th Annual Conference of the European Association of Taiwan Studies. I was fortunate enough to meet Scott Simon who is one of the leading scholars on Taiwan aborigines (although Scott is currently teaching at the Institut d'Asie Orientale de l'Ecole Normale Supérieure). I had the chance to talk with Scott about the examination for aboriginal language proficiency that I spoke about in this post. The significance of this test is that on top of the points they already receive for being aboriginal, students who pass it receive extra points - a lot of extra points - when applying to university. I have not been able to see any of these exams, but Scott has and described them to me.

His description was quite surprising, given my understanding that these tests are meant to promote the use of aboriginal languages in danger of disappearing. He told me that the actual test questions themselves are incredibly simple. Typically students are asked to translate single words, like 'bird' or 'mother'. He also said that students have learned that if they are asked a more complex question, the best way to get points is to reply with a 'yes' or 'no'.

The issue is not the questions themselves. Rather, it is the study material prepared for these ridiculously simple tests. Apparently, the production of the books that contain translations of various words, and the money that's paid out for making them, is more significant. The ability to get paid for this work is associated with power struggles between different clientelistic groups in the Taiwan aboriginal communities.

Let me return to my original point concerning the drop-out rate and university attendance of aboriginal students. It would certainly surprise me if the rate of university attendance among aborigines had not increased. After all, with virtually no effort on anyone's part, aboriginal students have been given a huge boost in the ability to get accepted by a university. And with this interpretation of the minister's words in mind, I am not sure how to read his final words that, "...the average Aboriginal student is unlikely to flunk because most colleges and universities operate resource-grouping classes to help them adapt academically."

April 19, 2008

Global Higher Education

I'd like to thank Kerim for bringing to my attention a new blog operated by one of my favorite researchers, Kris Olds. In addition to his blog Global Higher Education, Kris Olds has written widely on issues related to education in the Pacific Rim. I highly recommend all of his work.

April 12, 2008

European Association of Taiwan Studies

One of the conferences I've been preparing for is the European Association of Taiwan Studies (EATS) conference that I wrote about back in this post. The conference is being co-organized by Charles University in the Czech Republic and the School of Oriental and African Studies at the University of London. The conference involves presentations of papers and their discussions. These papers, including mine, can be viewed on the conference website, here.

April 03, 2008

The King Car Education Foundation and Rural Education

I haven't been posting very often because I've been preparing for a series of conferences at which I have to present. But the Taipei Times has recently featured several articles that highlight problems I've been talking about in rural education and the misguided efforts of the King Car Education Foundation.

King Car, if you remember, is the private foundation funding the placement of foreign English teachers in rural schools. They have also built an English theme park similar to parks that exist in Korea. The foreign teachers they are placing in these operations have been recruited through an American Christian missionary organization. At least some of these teachers would not qualify for a work permit to teach English. The concern I consistently expressed is that a private foundation is now making policy decisions concerning what aspects of rural education should receive funding.

Several articles in the Taipei Times have highlighted problems with education funding, particularly rural education, and point to serious policy directions in which money should be spent. For example, large numbers of families in Taiwan need assistance with child care so that they can continue to work. Apparently, there are underused government programs available to assist people in this situation. But more significant is the huge shortage of textbooks in rural Taiwan.

The Ministry of Education has announced that rural schools are suffering from a huge shortage of textbooks. In fact, the situation is so severe that they are asking citizens to donate books. The Taipei Times article states that 40,000 books are needed by some 219 junior high schools and 61,022 students in remote regions, including Nantou County, as well as Hualien and Taitung counties. I presume most of these students are aboriginal Taiwanese.

So while King Car is providing a Disneyland English experience, there are children in rural areas of Taiwan who cannot get the basic education guaranteed them by the ROC Constitution. They are deprived of their constitutional rights because of a textbook shortage which has been going on for years. King Car will, however, make sure that some rural students have the experience of a not necessarily qualified or competent white foreign teacher.

Sure, King Car is a private group and they can spend their money any way they want. Should they be allowed to spend it in the ways they are spending it? No, they should not. When the MOE is forced to turn to Third World solutions for rural education - like asking for donated textbooks to service aboriginal children's Constitutional rights - they should not. If the MOE had any courage on this matter, they would tell King Car that while their offer is appreciated, rural Taiwan does not need marginally educated white kids helping with English instruction and instead needs private donors to buy such basic requirements as books for students.