OED Text Visualizer tool and the current state of OED Online
OED Online has recently put up a new tool on its website at www.oed.com.
The case for visualization tools such as these is that they represent different categories of quantitatively assessed data in a visually striking way. They are especially useful when they indicate groupings or relationships between constituent elements of the data that researchers might not previously have noticed or considered.
The OED Text Visualizer certainly has the potential to do this. Users can enter a text of up to 500 words to see the etymological source of each word (Germanic, Romance, etc.) and the date at which it first entered the language.
In its present form, however, the tool is problematic. The major issue is as follows. As its accompanying text explains, the Text Visualizer draws on two important components of OED Online entries: etymological origin of a word, and date of first recorded usage. What is not explained is that just under half of these two sets of OED data are significantly out of date, in some cases by a hundred years and more, since the entries from which they are derived are as yet wholly or partially unrevised.
It follows that the results produced by the Text Visualizer represent an undifferentiated mixture of internally inconsistent lexicography, some of it significantly out of date. The tool needs to be reconfigured so that users can distinguish results derived from modern lexicographical scholarship (i.e. 2000 onwards) from those based on entries first published in earlier stages of the Dictionary (stretching from 1884 to 1989). In its current form the Text Visualizer delivers results which are not yet appropriate for use in academic research.
The Text Visualizer also provides information on the frequency of use of a word, both in the year the user has assigned to the text and in ‘modern English’. This is valuable, but the source of this information is neither explained nor cited. We may guess it to have been Google N-grams, presumably manipulated or adapted in some way. Users of the tool need to know the source of the figures cited so that they can understand the assumptions on which they have been produced. This is a basic requirement for academic research.
One excellent feature of the new tool, nevertheless, is that its results are produced in csv and other formats and hence are far easier to work with for research purposes than the search results currently available on OED Online (see under Search tools below).
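The distinction between revised and unrevised entries called for above could be drawn by users themselves if the csv export recorded the status of each source entry. The following is a minimal sketch of that idea, not a description of the tool's actual output: the column names `word` and `entry_status`, and the placeholder rows, are assumptions made for illustration.

```python
import csv
import io

# Hypothetical export. The column names and the 'entry_status' field are
# assumptions for illustration; the rows are placeholders, not real OED data.
SAMPLE_EXPORT = """word,entry_status
marriage,revised
placeholder_a,unrevised
placeholder_b,
"""


def split_by_status(csv_text):
    """Group result rows by the (assumed) revision status of their source entry."""
    groups = {"revised": [], "unrevised": []}
    for row in csv.DictReader(io.StringIO(csv_text)):
        status = (row["entry_status"] or "").strip().lower()
        # Anything not explicitly marked 'revised' is treated as unrevised,
        # which is the cautious choice for research use.
        key = "revised" if status == "revised" else "unrevised"
        groups[key].append(row)
    return groups


groups = split_by_status(SAMPLE_EXPORT)
for status, rows in groups.items():
    print(status, [row["word"] for row in rows])
```

With a field of this kind, a researcher could run an analysis on the revised subset alone, or compare the two subsets directly.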
A more general comment is as follows. Setting aside the criticisms above, the OED visualization tools so far produced (e.g., geographical origin of vocabulary in English over time) have been captivating but over-determined. That is, they make assumptions about what researchers are interested in. By contrast, it is widely acknowledged that good research comes out of giving researchers free and unfettered access to primary data, so that they can explore and think about it independently. The search tools on OED Online already offer a generous range of possibilities for new types of research, though of course we would all like more tools and more/better data to be available (for example, the currently provided information on frequency of headwords is unsatisfactory). The problem is that these search tools do not work well and their results are delivered in an unanalysable format, as described on EOED at OED Online.
OUP is now planning ‘a new suite of tools based on an OED Text Annotator engine,’ of which the Text Visualizer critiqued above is an example. Exciting as such tools are, there are other features of the OED Online website in its current form which are so unsatisfactory as to require immediate attention. Sorting these out is at least as important as a new set of tools, if not more so, especially if the new tools repeat the flaws of the existing ones. Here is a list.
Urgent issues for OED Online
Transparency on date of entries and changes to entries
- OED Online needs to make it entirely clear to users that its website presents a mix of new, revised, and unrevised entries, some of which have been unchanged or little changed for over a hundred years. Electronic searches should distinguish between revised and unrevised entries; otherwise the results are not usable for research purposes. It is worth pointing out that if users were able to search OED3 independently of OED2, they would be in a position to appreciate the quality and characteristics of OED3’s lexicographical innovation and scholarship. The character and achievements of OED3 are currently under-recognized because they are impossible to identify systematically, i.e. across a range of entries.
- When significant changes are made to revised entries, these should be flagged. An example is the change made to the definition of marriage after new UK legislation in 2013. The entry continues to be dated 2000. Researchers need to be able to make use of and cite dictionary entries with confidence that the dates they bear are accurate.
- Similarly, unrevised entries frequently contain unidentified changes and additions (to definitions, editorial notes, quotations and other components) made since the date of first or subsequent print/web publication. Again, OED Online needs to find a way of recording significant changes so that academic users can use and cite Dictionary entries with an understanding of their provenance and with confidence that the date-stamping provided by OED itself is accurate.
Quotation sources
- A pressing issue for the OED is the unevenness of balance in its most heavily cited quotation sources. These sources are listed on the OED website and accessible via a front-page link (‘Explore the top 1,000 authors and works quoted in the OED’). As of June 2020, only 28 are by identifiably female authors. The reasons for this imbalance are evidently not straightforward but the matter needs to be acknowledged and discussed and the editors should say what they are doing to tackle the issue. For example, it would be extraordinarily helpful if it were possible to search by gender of author, where known. See further EOED pages on Top sources, Fe/male sources.
- The question of the balance of quotations between white and non-white writers of English is also a salient issue, one that OED will certainly be thinking about. Geographical spread in sources quoted is not a reliable proxy, given that many quotations are from (colonial era) white authors.
Search tools
- Electronic searching of OED’s text continues to yield flawed results, even when using search pathways indicated by the website. For example, if you click on the top item on OED’s list of top 1,000 sources, which is The Times, and follow the directions to identify the quotations in question, many of the results turn out to be from unrelated publications (Musical Times, N.Y. Times, Financial Times, etc). With large bodies of evidence it is impracticable for users to weed out false results by hand or by subsequent searches.
- The form in which website results are provided is not usable for research purposes. By contrast, the Text Visualizer’s provision of different formats for search results is exemplary. Similar features should be imported into OED Online.
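The false results described in the first point above are what one would expect if a source search matched a word within a title rather than the title as a whole. This is an assumption about the cause, not a description of how OED Online's search actually works. The sketch below contrasts the two behaviours; the list of titles, and the use of `Times` as the cited form of The Times, are likewise assumptions for illustration.

```python
# Titles taken from the example above. How OED Online stores or matches
# source titles internally is not known; this list is illustrative only.
SOURCES = ["Times", "Musical Times", "N.Y. Times", "Financial Times"]


def loose_match(query, titles):
    """Return every title that contains the query as a substring."""
    needle = query.strip().lower()
    return [title for title in titles if needle in title.lower()]


def exact_match(query, titles):
    """Return only titles equal to the query, ignoring case and outer spaces."""
    wanted = query.strip().lower()
    return [title for title in titles if title.strip().lower() == wanted]


loose = loose_match("Times", SOURCES)
exact = exact_match("Times", SOURCES)
print("loose:", loose)
print("exact:", exact)
```

The loose search returns all four publications, while the exact search returns only the one intended. A search interface that offered the second behaviour would spare users the impracticable task of weeding out false results by hand.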
Editorial principles and practice; other accompanying information
- Description of editorial principles and practice. Over its initial 20 years OED3’s editorial practices – and by inference, editorial policies – have varied considerably, e.g. on the provision of and criteria for usage notes and labels of various kinds. Users need full information and guidance here, preferably in one location on the website which is easy to locate, access and search.
- The ‘About’ section of the website (https://public.oed.com/about/) contains much valuable material (e.g. on the history of the OED) but is hard to navigate. Users are often unaware of its contents. It needs to be completely reorganized, with content properly indexed and pages dated.
Last updated on 12 August 2020