Profiling verb complementation constructions across New Englishes
A two-step random forests analysis of ing vs. to complements
Sandra C. Deshors and Stefan Th. Gries
New Mexico State University / University of California, Santa Barbara
In this paper, we explore verb complementation patterns with to and ing in native
English (British and American English) as compared to three Asian Englishes
(Hong Kong, Indian, and Singaporean English). Based on data from the
International Corpus of English annotated for variables describing the matrix
verb and the complement, we run two random forests analyses to determine where
the Asian Englishes have developed complementation preferences different from
the two native speaker varieties. We find not only a variety of differences
between the Asian and the native Englishes, but also that the Asian Englishes
are more similar to (i.e. ‘better predicted by’) the American English data.
Further, as the first study of its kind to extend the MuPDAR approach from the
now frequent regression analyses to random forests analysis, this study adds a
potentially useful analytical tool to the often messy and skewed observational
data corpus linguists need to deal with.
Keywords: verb complementation, ing vs. to, New (Asian) Englishes, MuPDAR, random forests
International Journal of Corpus Linguistics 21:2 (2016), 192–218.
doi 10.1075/ijcl.21.2.03des issn 1384–6655 / e-issn 1569–9811 © John Benjamins Publishing Company
1. Introduction
The English language is rich in syntactic alternations: the dative alternation,
particle placement, the genitive alternation, that/0-complementation,
to name but a few. Many of these alternations exhibit considerable variation
and much of that variation is co-determined by a large number of linguistic and
cognitive factors that cut across alternations (e.g. constituent lengths,
information statuses, animacy statuses, priming, etc.). What is more, some of
these alternations are lexically-specific
in the sense that, all other things being equal,
particular words in the grammatical context of a given alternation pattern
increase the chance of a particular syntactic choice. For a long time now,
those alternations have been a point of focus in native language (L1) research
(see Green 1974, Ransom 1979, Collins 1995, Gries 2003, Bresnan et al. 2007 for
studies on the dative alternation and Noonan 1985, Duffley 1999, Smith &
Escobedo 2002 for studies on the ing vs.
to alternation). However, as scholars have gained knowledge on the
governing principles behind syntactic choices in L1, they have only recently
started to explore how speakers other than the “traditional L1 speakers” handle
the probabilistic uncertainty of these choices. As a result, there is now a
fast growing body of corpus-based studies on alternations in EFL (i.e. foreign
varieties of English learned in countries where English is not
institutionalized; see Deshors 2014a, Gries & Deshors 2015, Gries &
Wulff 2013) as well as in ESL (i.e. indigenized varieties of English learned as
a second language in countries such as Hong Kong, India or Singapore and where
English is institutionalized and learners have daily contact with the English
language; see Bernaisch et al. 2014, Deshors 2014b, Schilk et al. 2013, Nam et
al. 2013).
With regard to ESL specifically, much of the research on syntactic alternations has
focused on the case of Asian Englishes. For instance, recent studies such as
Gries & Bernaisch (2016) have investigated the dative alternation with GIVE
(John gave Mary a book vs. John gave a book to Mary) across six
South Asian Englishes with a view to (i) identifying factors triggering different
constructional choices across ESL varieties and (ii) capturing the linguistic
epicenter of English in South Asia, using a logistic-regression approach with
random effects (described below as the ‘MuPDAR’ approach). The study clearly
illustrates that as researchers have probed more deeply into syntactic
alternations, their methodology has become more sophisticated. Although early
studies in both learner corpus research (targeting EFL speakers and their
alternation behaviors) and variety corpus research (targeting ESL speakers and
their alternation behaviors) are corpus-based, their “statistical design”
remained relatively simple: many studies did little more than cross-tabulation
of frequencies and percentages of constructional choices across different L1
backgrounds and varieties, maybe followed up by chi-squared or log-likelihood ratio
tests (see Shastri 1996 for an example of an early corpus-based study of ing vs. to complement patterns in Indian English). A first major
improvement was then to apply regression modeling approaches to these questions
of cross-variety variation with a view to model constructional choices as a
function of several predictors. With this regression approach, scholars have
started to do justice to the fact that those constructional choices are
multifactorial in nature (i.e. they investigate syntactic variants on the basis
of a variety of linguistic and cognitive factors; see Bernaisch et al. 2014,
Bernaisch & Gries 2015 and Deshors 2015 for such examples). A second
currently ongoing improvement involves computing mixed-effects regression
models. Like fixed-effects models, mixed-effects models account for the
multifactorial nature of constructional choices, but they also allow analysts to
account for the hierarchical structure of their corpus data and to control for
effects that are speaker-specific. So in sum, the current state-of-the-art in
cross-variety variation research is essentially a regression-based approach
that, ideally, is (i) multifactorial₁ (in the sense of including
multiple independent variables at the same time), (ii) multifactorial₂
(in the sense of also including interactions of these independent variables),
and (iii) involves random effects for variables whose levels in the data do not
exhaust the levels observable for the phenomenon in the general population (cf.
Gries 2015a).
These methodological strategies have already started to provide the ESL research
community with powerful tools to tease apart different populations of ESL
speakers and establish, with a great deal of precision and reliability, how
native and ESL speakers differ in their linguistic choices. Recent work in
learner corpus research and varieties research has taken this methodological
development to the next level with the so-called ‘MuPDAR’ approach (Multifactorial
Prediction and Deviation Analysis with Regressions). MuPDAR involves a two-step
regression approach. In a first step, a regression R₁
is fit on the part of a dataset representing a reference level (e.g. native
speakers or a historical source variety), and R₁ is then used to make predictions
for the other part of the
same dataset, namely a target sample (e.g. EFL or ESL varieties). In other
words, with MuPDAR, one can determine for every utterance in the target sample
what the predicted “canonical” (or native-like) choice would have been, and a
second regression R₂ can then
be run to determine what leads to “non-canonical” choices by speakers. That is
to say, this approach answers one of the central questions for comparing
different speakers: in the multivariately annotated situation the non-native
speaker is in now, what would the (native) speaker of a certain reference
variety say (and why is it that the non-native speaker did not make the
“native-like” choice)? In the best of cases, both of these regressions involve
mixed-effects models (e.g. Gries & Adelman 2014, Gries & Deshors 2015,
Wulff & Gries 2015).[1]
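The two-step logic described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the "reference model" is a plain relative-frequency estimate per annotated context rather than a (mixed-effects) regression, and the function names fit_reference and flag_noncanonical as well as the data layout are hypothetical, not part of any published MuPDAR implementation.

```python
from collections import defaultdict

def fit_reference(rows):
    """Step 1 stand-in: estimate P(ing | context) from reference-variety data.

    Each row is (context, choice), with choice being "ing" or "to".
    A real MuPDAR analysis would fit a regression here instead.
    """
    counts = defaultdict(lambda: [0, 0])  # context -> [n_ing, n_total]
    for context, choice in rows:
        counts[context][0] += choice == "ing"
        counts[context][1] += 1
    return {context: n_ing / n for context, (n_ing, n) in counts.items()}

def flag_noncanonical(model, rows, default=0.5):
    """Apply the reference model to target-variety rows.

    Returns one boolean per row: True when the observed choice differs from
    the choice the reference model predicts for that context.
    """
    flags = []
    for context, choice in rows:
        p_ing = model.get(context, default)  # unseen contexts: assumed 0.5
        predicted = "ing" if p_ing > 0.5 else "to"
        flags.append(choice != predicted)
    return flags

# Toy reference (native) data and toy target (ESL) data, invented for illustration.
native = ([(("finite", "active"), "to")] * 8 + [(("finite", "active"), "ing")] * 2
          + [(("non-finite", "active"), "ing")] * 3 + [(("non-finite", "active"), "to")])
esl = [(("finite", "active"), "ing"), (("non-finite", "active"), "ing")]

reference_model = fit_reference(native)
noncanonical = flag_noncanonical(reference_model, esl)
```

The flagged cases are the ones a second model would then try to explain in terms of the annotated predictors.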
In this paper, we apply a variant
of the MuPDAR approach to an alternation which has not been targeted much
outside of native English, the to vs. ing alternation in Example (1), and
whose study, if not targeting native speakers, has not been corpus-based, has
focused mostly on lexically-specific preferences of the two alternating
patterns and has rarely included other predictors (with the exception of
Deshors 2015 and Khamis 2015).
(1) a. She prefers to eat French locust.
    b. She prefers eating French locust.
That is, in this paper we are the first to bring
several strands and methodological aspects of research together in the study of
infinitive and gerundial verb complementation in ESL, namely:
i. a rigorously corpus-based/empirical analysis (as recommended in Kaleta 2012);
ii. a multifactorial statistical design in which we model speakers’
constructional choices as a function of several predictors at the same time;
iii. a comparison of native speakers and ESL speakers from three Asian Englishes
at different development stages in Schneider’s (2007) evolutionary model,
specifically, Hong Kong English (HKE; phase II–III), Indian English (IndE;
phase III–IV) and Singaporean English (SingE; advanced phase IV); also, for
reasons to be explained in detail below, we extend the notion of native
speakers to not just British English speakers but also American English
speakers;
iv. a MuPDAR-like approach in which we attempt to determine precisely if and how
Asian Englishes speakers’ uses of to and ing complements differ from those
of native speakers.
Given the absence of ESL research on the two
complementation patterns in question and the fact that comparisons between EFL
and ESL are not unproblematic (since the two variants are not necessarily
directly (qualitatively) comparable; see Szmrecsanyi & Kortmann 2011, Gries
& Deshors 2015), the goal of this paper remains largely descriptive. To
date, there are just not many studies from which one can derive working hypotheses
on the principles that govern ESL speakers’ choices of to and ing complement
patterns and that are precise enough to be testable. Further, as will become
clear below, this paper also introduces a methodological alternative to the
currently prevalent state-of-the-art of regression approaches to address cases
where data turns out to be too noisy to be analyzable using regression
modeling. In the next section, we will briefly discuss previous work on verb
complementation patterns and of course in particular to vs. ing; the
discussion will be brief and focus on the factors that have been claimed to be
associated with the complementation choice under investigation and the
relevance of grammatical factors for a better understanding of non-finite verb
complementation patterns in ESL.
2. To and ing complementation patterns and their grammatical environments
In the context of a discussion on the influence of
grammatical environments over speakers’ syntactic choices, it is important to
keep in mind that while certain grammatical features associate more strongly
than others with the choice of a particular complementation pattern, the
extent of this association can also vary depending on the speakers’ English
variety (i.e. native vs. ESL). Section 2.1 below focuses on the grammatical
features associated with the two syntactic variants in focus and Section 2.2
discusses the relevance of exploring verb complementation patterns through the
lens of cross-variety variation.
2.1 The contribution of grammatical factors to to and ing complementation
Traditionally, the relations between predicates and complements, and the choice between
infinitival and gerundial complements specifically, have been investigated from
a semantic perspective, based on Noonan’s (1985: 88) observation that
“[c]omplementation is basically a matter of matching a particular complement
type to a particular complement-taking predicate” (see also Wierzbicka 1988,
Duffley 1999, Langacker 1991, Smith & Escobedo 2002). This line of approach
mainly focuses on the types of verbs that license a particular complement. For
instance, in infinitival cases expressing some kind of potentiality, complement
clauses tend to involve lexical verbs that are compatible with such a notion of
potentiality (De Smet 2013). In gerundial cases, three major verb groups were
observed in De Smet (2013) to attract gerund complements: verbs associated with
negative implications (avoid, defend, omit, etc.), emotive verbs (enjoy,
hate, like, etc.), and aspectual verbs (begin, continue, stop, etc.). However, although semantically-based
research in complementation patterns has proved to be relatively fruitful,
semantic approaches are not sufficient to accurately predict speakers’ choices
of complements (Smith & Escobedo 2002).
Against this background, studies such as Noël (2003), Vosberg (2003), Rohdenburg
(1995), Mindt (2000), and Mair (2002) have shown the need to include additional
non-semantic factors in the analysis of to vs. ing alternations, such as
information structure, the horror aequi principle,
the cognitive complexity principle, social and regional stratification, and
register. One particularly illuminating study is Cuyckens et al. (2014), which
focuses on the finite vs. non-finite complement patterns of remember, regret and deny in Late
Modern English (i.e. complement-taking predicate + to-infinitive, gerundial -ing
and that-clause). Cuyckens et al.’s
(2014) study investigates complement patterns based on a wide range of
language-internal and language-external factors (e.g. the meaning of the matrix
verb, the meaning of the complement clause, the relation between the time
reference of the complement and the meaning of the complement-taking predicate,
type of subject in the complement clause, structural complexity of the
complement clause, the type of subject in the main clause, the animacy of the
subject in the complement clause, the voice in the complement clause, the type
of complementation, and others). Using a binary logistic regression approach with
fixed effects, the authors pinpoint the factors that favor and to some degree
characterize non-finite verb complements in native English. For instance, they
find that main clauses with a first-, second- and third-person pronoun as well
as a noun favor to and ing complements. Similarly, speakers
tend to prefer non-finite clauses with more complex complement clauses (i.e.
clauses including verbs with one argument/modifier, verbs with an argument +
modifier, or two arguments, or two modifiers). Finally, another factor that
favors non-finite complements is passive voice. While Cuyckens et al. (2014)
clearly show the benefits of combining a contextualized approach to verb
complementation with a sophisticated statistical methodology, their results
urge us to take an even closer look at to
and ing complements and assess the
contribution of linguistic factors to each construction variant. In the context
of indigenized English varieties, this is a particularly important goal to
pursue, given recent developments in ESL research showing how specific
grammatical contexts can be the loci of differentiation of individual ESL
varieties from native English.
2.2 The contribution of grammatical factors to alternating verb complements in ESL
Recent literature on post-colonial Englishes
recognizes the value of investigating the interface between lexis and grammar
to better characterize ESL varieties. With specific regard to complementation,
the field is currently witnessing a rapid growth of corpus-based
research aimed at tracking systematic deviations in the contexts of use of
complement constructions (see Olavarría de Ersson & Shaw 2003; Mukherjee
& Hoffmann 2006; Mukherjee & Schilk 2008; Schilk et al. 2012, 2013; and
Deshors 2014a). One particularly interesting study is Schilk et al. (2012) in
which the authors focus on the complementation patterns of the lexical verbs convey, submit and supply (which
are typically used in the transfer-caused motion construction [TCM] such as give something to someone) across
Indian, Sri Lankan and British Englishes. Drawing from the verbs’ contexts of
use, the authors find that across ESL varieties there are “differences [that]
manifest themselves at the level of individual complementation patterns, and in
many cases these differences can be explained if we look at the actual
linguistic realizations of the patterns and the lexical items used in them”
(Schilk et al. 2012: 162). Ultimately, it emerges that a focus on the
lexico-grammatical environments of complement constructions offers new insights
into distinctive and so far largely neglected structures of English varieties
(Schilk et al. 2012).
As hinted at in our introduction, in order to explore cross-variety variation at
the lexis-grammar interface, ESL scholars are increasingly adopting regression-based
statistical approaches that allow them (i) to include in their analyses wide
ranges of linguistic factors and (ii) to analyze alternating syntactic pairs by
cutting across the semantic and morphosyntactic levels. For instance, Nam et
al. (2013) use this approach to investigate three complementation patterns of
give (ditransitive, prepositional dative and monotransitive) and they begin to
unveil the governing principles behind ESL speakers’ choices of one
construction over another. However, despite their usefulness, multifactorial
approaches in ESL have mainly been used to investigate the dative alternation
and their application remains to be extended to other syntactic alternations.
One recent exception is Deshors (2015), which explores to vs. ing across native
and Hong Kong English using binary logistic regression modeling on the basis of
3,119 occurrences of the two complement patterns. Overall, the study reveals
that Hong Kong English speakers overuse to
complements with a non-finite verb form in the predicate, complement verbs
denoting a cognitive process and objects expressed in the form of either a
prepositional phrase, a noun phrase or a pronoun. Although Deshors’ (2015)
study is a first step towards distinguishing the linguistic features that
generally contribute to each complement construction in native and Hong Kong
Englishes, the study investigates only one ESL variety and does not follow the
newly developed MuPDAR protocol.
3. Methodology
For the purpose of this study, we used data from the International
Corpus of English (ICE; see Greenbaum 1996). In Section 3.1, we explain how our
data were extracted as well as our annotation procedure and in Section 3.2 we
present our statistical approach.
3.1 Corpus data and their annotation
The present work contrasts the uses of gerundial and
infinitival complement constructions across different subsections of the ICE
corpus, as recommended in Mukherjee & Gries (2009). Therefore, our study
includes the following sub-parts from the corpus: native English, British (BrE)
and American (AmE), and three ESL varieties, Hong Kong (HKE), Indian (IndE) and
Singaporean (SingE) Englishes. To some readers, our choice to include American
data as representative of native English may seem somewhat surprising, mainly
because, traditionally, BrE has been the default variety considered as the
historical source variety for Asian Englishes. However, even though Asian
Englishes have been historically most influenced by BrE, in an age in which
American culture leaves a mark on cultures everywhere — via music, television,
and cinematic culture — it is fair to assume that Englishes all over the world
are now beginning to not only look back to BrE as a historical source variety,
but also to AmE as a contemporarily perhaps even more influential source
variety (see Mair’s (2013) article on the world system of Englishes for a
justification of why, theoretically, it makes sense to use American English as
a reference variety in addition to BrE). Further, an increasing number of
contrastive studies on ESL varieties are beginning to go beyond BrE as the only
relevant native variety/standard of comparison: for our purposes, more
importantly, Edwards (2014) and Koch (2015) include AmE in their analyses, and
Hoffmann (2014) uses four native varieties, namely BrE, NZE, CanE, and IrE.
Thus, there is clearly the beginning of a trend to look beyond BrE only and we
are following this recent development. Finally, as we will show below in
Section 4.4, in the present data, it is in fact the AmE data that the Asian
Englishes are more similar to, not BrE. Therefore, it emerges that theoretical
considerations as well as empirical data converge in their “recommendation” to
at least exploratorily also consider AmE in the data (until proven otherwise).
The data were extracted from the relevant written sub-sections of ICE. No distinction
across writers was made on the basis of age or educational background and all
material was extracted and statistically analyzed using the software R (R Development Core Team 2012). With
regard to the compilation of the data set, we followed Martínez-García &
Wulff’s (2012) methodology: for all sub-corpora, we first retrieved all
instances of words ending in -ing and
all instances of the preposition to.
In a second step, the hits were checked manually for syntactic relevance so as to
retain only true hits of either complementation construction, yielding a sample of approximately 7,400
instances of gerundial and infinitival complementation constructions from each
of the five corpora. This data sample was then trimmed down so as to only
include the verbs that were attested in both to and ing constructions.
Table 1 provides an overview of the distribution of the two constructions
across all sub-corpora.
Table 1. Summary of the occurrences of ing and to constructions in the sub-corpora
Complement pattern | ICE-GB | ICE-US | ICE-HK | ICE-IND | ICE-SING | Total
ing | 84 | 187 | 126 | 102 | 128 | 627
to | 990 | 781 | 753 | 531 | 753 | 3,808
Total | 1,074 | 968 | 879 | 633 | 881 | 4,435
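Because the sub-corpora differ in size, the raw counts in Table 1 are easier to compare as proportions. The following minimal Python sketch computes the share of ing complements per sub-corpus; the counts are copied from Table 1, while the names TABLE_1 and ing_share are introduced here for illustration only.

```python
# Counts copied from Table 1 as (ing, to) per sub-corpus.
TABLE_1 = {
    "ICE-GB": (84, 990),
    "ICE-US": (187, 781),
    "ICE-HK": (126, 753),
    "ICE-IND": (102, 531),
    "ICE-SING": (128, 753),
}

def ing_share(counts):
    """Proportion of ing complements among all ing and to complements."""
    ing, to = counts
    return ing / (ing + to)

for corpus, counts in TABLE_1.items():
    print(f"{corpus:9s} {ing_share(counts):.1%}")
```

On these counts, the ing share is lowest in ICE-GB (about 7.8%) and highest in ICE-US (about 19.3%), with the three Asian sub-corpora in between (roughly 14% to 16%).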
Each match was annotated against twelve grammatical
factors (see Table 2 for a list of all those factors and their respective
levels). To ensure a thorough treatment of the data, each factor was encoded
according to a taxonomy established to allow for its measurement and its
consistent treatment across the five sub-corpora. The annotation process was
carried out using a careful bottom-up approach and a taxonomy that is
theory-consensual in nature. In other words, our annotation is based on the
linguistic analysis of the context of use of each complementation pattern
extracted from the corpus and it reflects the unique combination of grammatical
components included within their context of use. So, similarly to Divjak &
Arppe (2013), the annotation scheme accounts for both exemplars (i.e. specific
instances of use) and their abstractions. This is an important aspect as no
other study (but Deshors 2015) has so far adopted this type of data annotation
to contrast gerundial and infinitival complementation patterns in ESL. Table 3
presents an abbreviated sample of the annotation table.
Table 2. Overview of the variables used in the annotation of the native English and ESL data and their respective levels
Variable | Variable levels
COMPLPATTERN (complementation pattern; dependent variable) | gerund, infinitive
LEMMACOMP (lemma in the complement clause) | be, sponsor, work, perform, …
LEMMAMATRIX (lemma in the matrix clause) | appear, seem, accept, adopt, …
COUNTRY (English variety) | british, american, hong kong, indian, singaporean
FINITEMATRIX (finite or non-finite use of the verb in the matrix clause) | finite, non-finite
VOICE (voice of the matrix verb) | active, passive
COMPVERBSEM (semantics of the complement’s lexical verb) | abstract, action, communication/informational, copula, cognitive/emotional, perception
MATRIXVERBSEM (semantics of the matrix’s lexical verb) | abstract, action, communication/informational, copula, cognitive/emotional, perception
MATRIXVERBTYPE (type of the matrix verb) | state, accomplishment, achievement, process
COMPVERBTYPE (type of the complement verb) | state, accomplishment, achievement, process
NEG (negation) | neg, affirm
OBJECTFORM (form of the object) | pp (prepositional phrase), np (noun phrase), do (double object), pr (pronoun), no (no object)
Table 3. Abbreviated sample of the annotation table
COMPLPATTERN | COUNTRY | VOICE | COMPVERBSEM | MATRIXVERBTYPE | COMPVERBTYPE | OBJECTFORM
to | sing | active | abstract | state | state | np
ing | hk | passive | abstract | process | process | np
… | … | … | … | … | … | …
It is important to note here that at the semantic level, our taxonomy provides a more
fine-grained approach to lexical verbs in predicate and complement clauses than
is currently offered in existing studies. This is because the taxonomy
distinguishes between types of
lexical verbs and their semantics.
Broadly, the MATRIXVERBTYPE and COMPVERBTYPE variables mark the types of
lexical verbs used in the predicate and in the complement clause. Conceptually,
the two variables follow Vendler (1957: 143) in his recognition that the notion
of time is crucially related to the use of a verb and is “at least important
enough to warrant separate treatment”. Vendler (1957) identifies four types of
verbs, namely state, accomplishment, achievement and process.
This verb classification distinguishes between time periods and time instants
on the one hand and uniqueness/definiteness and non-uniqueness/indefiniteness
of those time periods and time instants on the other hand. As Vendler (1957:
146) notes, “some verbs can be predicated for single moments in time, while
others can be predicated for shorter or longer periods of time”. In that
respect, accomplishment verbs encode verbal statements that imply a unique and
definite time period and achievement verbs encode verbal statements that imply
a unique and definite time instant. Similarly, process verbs identify
statements that reflect non-unique and indefinite time periods and state verbs
identify statements that reflect non-unique and indefinite time instants. Like
the MATRIXVERBTYPE and COMPVERBTYPE variables, the MATRIXVERBSEM and the
COMPVERBSEM variables target lexical verbs used in the predicate and in the
complement clause, and they identify the type of information that lexical verbs
convey in terms of abstraction, action, communication, etc.
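The four verb types described above form a two-by-two grid (time period vs. time instant, crossed with definite vs. indefinite). As a compact restatement, the grid can be written as a lookup table; the Python names VENDLER_TYPES and verb_type, and the labels "period", "instant", "definite" and "indefinite", are shorthand introduced here and are not part of the annotation scheme itself.

```python
# (time unit, definiteness) -> verb type, restating the description in the text.
VENDLER_TYPES = {
    ("period", "definite"): "accomplishment",
    ("instant", "definite"): "achievement",
    ("period", "indefinite"): "process",
    ("instant", "indefinite"): "state",
}

def verb_type(time_unit, definiteness):
    """Look up the verb type for one combination of the two dimensions."""
    return VENDLER_TYPES[(time_unit, definiteness)]
```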
3.2 Statistical evaluation
Our initially preferred choice was what, over the last few years, has become the
standard statistical analysis, i.e. regression modeling of the traditional or the
MuPDAR kind. However, the present data set proved to be extremely
recalcitrant. In a first attempt to explore if and how the complementation
patterns differ between the native and the Asian Englishes, we considered a
multifactorial₂ regression approach with a bidirectional model
selection procedure based on AIC.
That is to say, we fit a very small model in which COMPLPATTERN (ing vs. to) is
predicted only from COUNTRY as well as varying intercepts
for FILE and LEMMAMATRIX (we did not attempt to fit varying intercepts for
LEMMACOMP because of the large number of verb lemmas with extremely low
frequencies; for the same reason, no varying slopes were implemented) and then
tried to improve the model by adding or removing predictors, at each step making
the change that improved the model fit most (with AIC
as an indicator of quality). However, it quickly became obvious that this route
was not feasible because after having added five predictors this way, data
sparsity and collinearity led to models that did not converge and whose confidence
intervals covered the whole probability interval from 0 to 1.
In a second attempt and to overcome such problems, we used a multimodel inferencing
approach (see Burnham & Anderson (2002) for a statistical introduction,
Kuperman & Bresnan (2012) for the first approach in linguistics we are
aware of, and Gries (2015b) for a recent application in cognitive/usage-based
linguistics). However, multiple attempts to use this strategy were not
successful either given the large number of possible combinations of predictors
and all their interactions with COUNTRY, the variable representing the variety.
In fact, a multimodel inferencing script was shut down after running for about
200 hours on 7 cores of an 8-core Intel i7 3.4 GHz processor with multithreaded
BLAS/LAPACK libraries. Given these results, it became clear that the current
default of regression modeling was not going to be successful with the present
data; that means a lot of the conceptual machinery of regression modeling we
would have been interested in was not available to us anymore — interactions of
predictors, random-effects/multilevel structure, and user-defined a priori
contrasts — and an alternative approach to analyzing the data had to be
developed. We decided on an analysis involving random forests of the kind used
by Bernaisch et al. (2014) for the dative alternation in Asian Englishes.
Random forests is an approach that is similar to classification trees, but also
extends them considerably. Classification (and regression) trees are a
partitioning approach that consists of successively splitting the data into two
groups based on some independent variables such that the split maximizes the
classification accuracy regarding the dependent variable within the groups.
This process is recursive, i.e. repeated until no further split would increase
the classification accuracy sufficiently. Random forests in turn add two layers
of randomness to the analysis, which help (i) to recognize the impact of
variables or their combinations that a normal classification tree might not
register and (ii) to protect against overfitting. On the one hand, the
algorithm constructs many different trees (we set that parameter to 2,000),
each of which is fitted to a different bootstrapped sample of the full data. On
the other hand, each split in each tree can choose from only a randomly chosen
subset of predictors (we set that parameter to three predictors). The overall
result is then based on amalgamating all 2,000 trees that have been generated.
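To make the two layers of randomness concrete, here is a deliberately small Python sketch of a random forest for categorical predictors. It is not the randomForest implementation used in this study: the Gini-based split criterion, the fixed depth limit used as a stopping rule, and all function names are simplifying assumptions made for illustration. The defaults ntree=2000 and mtry=3 merely mirror the parameter values mentioned in the text.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most frequent class label."""
    return Counter(labels).most_common(1)[0][0]

def best_split(rows, labels, candidates):
    """Best binary split 'predictor j == level' among the candidate predictors."""
    best = None
    for j in candidates:
        for level in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] == level]
            right = [y for r, y in zip(rows, labels) if r[j] != level]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, level)
    return best

def grow_tree(rows, labels, mtry, rng, depth=0, max_depth=4):
    """Grow one tree; every split sees only mtry randomly chosen predictors."""
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)
    candidates = rng.sample(range(len(rows[0])), mtry)  # randomness layer 2
    split = best_split(rows, labels, candidates)
    if split is None:
        return majority(labels)
    _, j, level = split
    left = [i for i, r in enumerate(rows) if r[j] == level]
    right = [i for i, r in enumerate(rows) if r[j] != level]
    return (j, level,
            grow_tree([rows[i] for i in left], [labels[i] for i in left],
                      mtry, rng, depth + 1, max_depth),
            grow_tree([rows[i] for i in right], [labels[i] for i in right],
                      mtry, rng, depth + 1, max_depth))

def predict_tree(tree, row):
    """Follow the splits until a leaf (a class label string) is reached."""
    while isinstance(tree, tuple):
        j, level, left, right = tree
        tree = left if row[j] == level else right
    return tree

def grow_forest(rows, labels, ntree=2000, mtry=3, seed=1):
    """Fit ntree trees, each to its own bootstrapped sample of the data."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]  # randomness layer 1: bootstrap
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], mtry, rng))
    return forest

def predict_prob(forest, row, target="ing"):
    """Amalgamate the trees: share of trees voting for the target class."""
    return sum(predict_tree(t, row) == target for t in forest) / len(forest)
```

In this sketch, each row is a tuple of categorical predictor values and each label is a string such as "ing" or "to"; mtry must not exceed the number of predictors.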
While this approach does not provide us with all that regression analyses would
have to offer (for those data
sets where they work), it comes with several advantages, too: random forests
are known to generate quite good accuracies, they can be computed fast, they do
not make the kinds of distributional assumptions that regression models do (and
that observational/corpus data often violate), and, as explained above, because
of the sampling procedures they overfit and overlook important predictors less
easily and are better able to handle collinear predictors (see Matsuki et al.
(2016) for similar conclusions and the additional finding that random forests
are well suited for many-predictors-few-datapoints problems); we are using the
implementation in the R package randomForest (Liaw & Wiener 2015, version
4.6–12). The perhaps thorniest issue of random forests is how to
interpret/visualize their results. Since random forests consist of thousands of
very different trees, there is no obvious way to compute p-values for predictors (main effects or interactions) and there
are no obvious ways to determine the effect of a predictor in isolation. The
solution we adopt here is a heuristic, but one that worked very well in
Bernaisch et al. (2014): we represent average predictions for the (combinations
of) variables of interest and plot them in dotcharts. While the potential
downside of this approach is that these predicted probabilities are based on
more predictors than the one they are visualized for (in other words, they are
not based on only the predictor being visualized with every other predictor
held constant as in effects plots for regressions), they nevertheless provide a
good approximation for the effects of predictors that is convenient and more
interpretable than other statistics; in addition, we provide variable
importance measures (as provided by Liaw & Wiener’s (2015) package).
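The averaging heuristic described above amounts to grouping the forest's predicted probabilities by the levels of the predictor of interest and taking the mean within each group. A minimal Python sketch follows; the function name mean_prediction_by_level and the dictionary-based row format are assumptions made for illustration, and the probabilities would come from whatever fitted forest is being interpreted.

```python
from collections import defaultdict

def mean_prediction_by_level(rows, predicted_probs, predictor):
    """Average predicted probabilities within each level of one predictor.

    rows: list of dicts mapping variable names to levels.
    predicted_probs: one predicted probability per row (e.g. of ing).
    Note that the other predictors are not held constant here; they simply
    vary as they do in the data, which is the caveat raised in the text.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row, p in zip(rows, predicted_probs):
        sums[row[predictor]] += p
        counts[row[predictor]] += 1
    return {level: sums[level] / counts[level] for level in sums}
```

The resulting level-wise means are the values one would plot in a dotchart for that predictor.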
To sum up, in this paper we are extending Gries and colleagues’ (Gries & Adelman
2014, Gries & Deshors 2014, Gries & Bernaisch 2016, Wulff & Gries
2015) MuPDAR approach from one based on (two) regressions to one based on two
random forests (MuPDARF, with RF standing for random forests). Specifically,
we:
i. do a random forests analysis
on only the native BrE and AmE speakers and test whether its fit is good enough
to proceed; this analysis uses COMPLPATTERN as the dependent variable and the
following as predictors: FINITEMATRIX, VOICEMATRIX, NEGMATRIX, OBJECTFORM,
VERBSEMCOMP, VERBTYPECOMP,
VERBSEMMATRIX, VERBTYPEMATRIX, and
COUNTRY (for the variety); ii. if the fit is good enough, we apply the results
from the first random forests analysis to the HKE, IndE, and SingE speakers to
obtain predictions of what native speakers would have said in the contexts that
the indigenized variety speakers were in;
iii. compare
the native-speaker predictions against the indigenized-variety choices to see
how much the two coincide; for that we compute a numeric variable called
DEVIATION, which is:
a. set
to zero when the indigenized-variety speaker made the choice a native speakers
is predicted to have made;
b. between
-0.5 and 0 when the indigenized-variety speaker chose to although the native speaker is predicted to have chosen ing;
c. b
etween 0 and 0.5 when the indigenized-variety speaker chose ing although the native speaker is
predicted to have chosen to.
The exact value depends on how strongly the
native speaker was predicted to choose to/ing. Thus, higher absolute values of
DEVIATION indicate that indigenized-variety speaker made choices that are more
at odds with what native speakers were predicted to have said.
iv. do
a second random forests analysis that models the non-nativelike choices of
indigenized-variety speakers, i.e. DEVIATION in all cases where DEVIATION≠0, as
a function of, again, FINITEMATRIX, VOICEMATRIX, NEGMATRIX, OBJECTFORM,
VERBSEMCOMP, VERBTYPECOMP, ERBSEMMATRIX, VERBTYPEMATRIX, and COUNTRY.
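Step iii can be made concrete with a small sketch (in Python, for illustration only; it is not the code used for the analyses). It assumes that the first random forest yields, for each ESL case, a predicted native-speaker probability of ing (here called `p_ing`), that 0.5 is the cut-off between predicting to and ing, and that DEVIATION is the signed distance of that probability from 0.5. This is one formula that is consistent with the ranges given in a. to c.; it is an assumption, not necessarily the exact computation.

```python
def deviation(p_ing, observed):
    """Return DEVIATION for one case.

    p_ing:    predicted native-speaker probability of ing (between 0 and 1)
    observed: the ESL speaker's actual choice, "to" or "ing"
    """
    # Assumption: ties at exactly 0.5 are resolved in favor of ing.
    predicted = "ing" if p_ing >= 0.5 else "to"
    if observed == predicted:
        return 0.0  # nativelike choice
    # Negative when the speaker chose to but ing was predicted (p_ing > 0.5),
    # positive when the speaker chose ing but to was predicted (p_ing < 0.5).
    return 0.5 - p_ing


# Hypothetical cases: (predicted probability of ing, observed choice).
cases = [(0.9, "ing"), (0.9, "to"), (0.2, "ing"), (0.2, "to")]
deviations = [deviation(p, obs) for p, obs in cases]

# Step iv models only the non-nativelike choices, i.e. the non-zero values.
non_zero = [d for d in deviations if d != 0]
print([round(d, 2) for d in deviations], len(non_zero))
```

Under this formula, a choice that contradicts a confident prediction (e.g. a probability of 0.9) yields a larger absolute DEVIATION than one that contradicts a weak prediction, in line with the statement that the exact value depends on the strength of the prediction.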
4. MuPDARF: A step-by-step presentation of our results
As explained in Section 3, our statistical approach involves a series of steps. Therefore, in this section we report on our findings in a way that reflects the sequence of our various methodological steps. Accordingly, Section 4.1 presents the results of the first random forests analysis (on native-speaker data), Section 4.2 reports on predicted syntactic patterns in ESL, Section 4.3 presents the results of the second random forests analysis (on ESL-speaker data), and Section 4.4, which takes a closer look at BrE vs. AmE, reports on how, with a MuPDARF approach, we are able to pinpoint the specific native English variety that best predicts verb complementation patterns in ESL.
4.1 Random forests 1 on native-speaker data
The first analysis yielded a classification accuracy of 88.5%, which is not very high but is significantly higher than the baselines of always choosing the more frequent complementation pattern (i.e. to) or choosing proportionally randomly (binomial test against baseline 1: $p < 0.01$; binomial test against baseline 2: $p < 10^{-40}$). More illuminating is the analysis’s C-value, which, at 0.81, just about exceeds the usually-assumed threshold value of 0.8 for “good” results. We therefore proceeded with the analysis.
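The evaluation measures mentioned above can be sketched as follows (in Python, for illustration; the analyses themselves were run in R). The sketch assumes that observed choices and predicted probabilities of ing are available as plain lists; all function names and the toy data are hypothetical. The binomial tail is computed one-sidedly, which is an assumption, since the sidedness of the tests is not stated above.

```python
from math import comb


def accuracy(observed, predicted):
    """Share of cases in which the predicted pattern equals the observed one."""
    return sum(o == p for o, p in zip(observed, predicted)) / len(observed)


def baselines(observed):
    """Baseline 1: always pick the most frequent pattern.
    Baseline 2: pick patterns randomly in proportion to their frequencies."""
    n = len(observed)
    shares = [observed.count(label) / n for label in sorted(set(observed))]
    return max(shares), sum(s * s for s in shares)


def binomial_upper_tail(k, n, p):
    """One-sided probability of k or more successes in n trials (small n only)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def c_value(observed, p_ing, positive="ing"):
    """Concordance index: share of (ing, to) pairs in which the ing case
    received the higher predicted probability of ing; ties count as 0.5."""
    pos = [p for o, p in zip(observed, p_ing) if o == positive]
    neg = [p for o, p in zip(observed, p_ing) if o != positive]
    concordant = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return concordant / (len(pos) * len(neg))


# Hypothetical toy data.
observed = ["to", "to", "to", "ing", "ing", "to"]
p_ing = [0.1, 0.4, 0.6, 0.8, 0.3, 0.2]
predicted = ["ing" if p >= 0.5 else "to" for p in p_ing]

acc = accuracy(observed, predicted)
base1, base2 = baselines(observed)
hits = sum(o == p for o, p in zip(observed, predicted))
p_one_sided = binomial_upper_tail(hits, len(observed), base1)
c = c_value(observed, p_ing)
print(round(acc, 3), round(base1, 3), round(base2, 3), round(c, 3))
```

Unlike accuracy, the C-value does not depend on the 0.5 cut-off, which is one reason it can be the more informative of the two measures when the two patterns are unequally frequent.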
4.2 Applying the first results to the indigenized variety data
We then used the above results to compute a random forests-based prediction for every case in the ESL variety data. As in previous MuPDAR analyses, the results here were more mixed in the sense that the classification accuracy measure went down a bit (to 85.2%), which is hardly surprising given that one would not expect the ESL speakers’ behavior to be completely predictable from the native-speaker data. Correspondingly, the C-value also decreased to 0.76, indicating that (aspects of) the choices that ESL speakers make are not completely compatible with those predicted for the native speakers, and it is precisely that difference that the second regression or, here, the second random forests analysis explores. As mentioned above, we then computed the DEVIATION variable that captures the degree, if any, to which the ESL speakers’ choices differed from the native-speaker predictions. Descriptively, it was already interesting to notice that there are significant differences between the three Asian Englishes (according to a Kruskal-Wallis rank sum test: $\chi^2 = 7.24$, df = 2, $p = 0.027$) such that the IndE speakers differ more from the native speaker predictions than the HKE and SingE speakers.
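For readers unfamiliar with the test, the Kruskal-Wallis statistic can be sketched in a few lines (in Python, for illustration only). The sketch computes the H statistic from tie-averaged ranks but, as a simplifying assumption, omits the correction for ties that standard statistical software typically applies; since DEVIATION contains many identical values (all the zeros), the uncorrected statistic would differ from a reported one. The function names and the toy groups are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 upwards; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic (without tie correction) for a list of groups of values."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum**2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical values for three varieties (not the actual DEVIATION data).
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(groups)
print(round(h, 2))
```

With three groups, the statistic is evaluated against a chi-squared distribution with two degrees of freedom, which matches the df = 2 reported above.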
4.3 Random forests 2 on deviations from native-speaker predictions
The final analysis consisted of trying to model non-zero cases of DEVIATION on the basis of the same predictors as before. The overall summary results were very encouraging in the sense that the statistical analysis could predict the DEVIATION values very well (adjusted $R^2 = 0.87$), which is why we felt justified in exploring the results further, first, by assessing the importance of individual variables and, second, by looking at how the values of DEVIATION differ for the crossing of every predictor and COUNTRY.
As for the former, Figure 1 shows a
plot that summarizes two indices of variable importance: on the x-axis, we show a normalized measure of
the size of the prediction error, which, with some simplification, results from
how much worse predictions become if the variable in question has its values
randomly permuted; on the y-axis, we
show a measure that represents how clean the splits in all classification trees
are based on the residual sum of squares, a measure also commonly used in
regression analyses; for both axes, high values reflect high variable
importance.
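The logic behind the permutation-based measure on the x-axis can be sketched as follows (in Python, for illustration only). This is a deliberate simplification: it permutes one predictor in the full data set and measures the resulting increase in mean squared error for an arbitrary prediction function, whereas the randomForest package computes the measure tree by tree on out-of-bag cases and normalizes it. The names `mse`, `permutation_importance`, and `toy_model`, as well as the data, are hypothetical.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of the model's predictions."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, variable, repeats=20, seed=1):
    """Mean increase in MSE when the values of `variable` are shuffled."""
    rng = random.Random(seed)
    base = mse(model, rows, targets)
    increases = []
    for _ in range(repeats):
        shuffled = [r[variable] for r in rows]
        rng.shuffle(shuffled)
        # Copy each case, replacing only the permuted predictor.
        permuted = [{**r, variable: v} for r, v in zip(rows, shuffled)]
        increases.append(mse(model, permuted, targets) - base)
    return sum(increases) / repeats


def toy_model(row):
    """Stand-in for a fitted forest: uses VERBSEMMATRIX and ignores COUNTRY."""
    return 0.3 if row["VERBSEMMATRIX"] == "action" else -0.1


rows = [
    {"VERBSEMMATRIX": sem, "COUNTRY": country}
    for sem in ("action", "abstract")
    for country in ("hk", "ind", "sing", "hk")
]
targets = [toy_model(r) for r in rows]

imp_sem = permutation_importance(toy_model, rows, targets, "VERBSEMMATRIX")
imp_country = permutation_importance(toy_model, rows, targets, "COUNTRY")
print(imp_sem > imp_country)
```

In this toy setting, shuffling the predictor the model relies on degrades the predictions, whereas shuffling the ignored predictor changes nothing, which is the intuition behind reading high values as high variable importance.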
As is obvious, both measures of
variable importance largely coincide, with VERBSEMMATRIX, VERBTYPEMATRIX,
VERBSEMCOMP, and OBJECTFORM having the strongest effects. Interestingly,
COUNTRY on its own does not appear to have a strong effect but, on the other
hand, just like it would be in a regression context, it is mostly the
interactions of COUNTRY with other predictors that one would be interested in
anyway. In what follows, we will discuss those roughly in order of variable
mention in Figure 1.
Figure 1. Variable importance measures from the second random forests analysis (x-axis: IncMSE)
Figure 2 shows what is roughly equivalent to the interaction VERBSEMMATRIX : COUNTRY in a regression context. Both panels show the same result with predicted DEVIATION-values on the x-axis, but in the top one, the levels of VERBSEMMATRIX are nested into COUNTRY (to facilitate comparisons between semantic classes per country/variety), whereas in the bottom one, the levels of COUNTRY are nested into VERBSEMMATRIX (to facilitate comparisons of countries per semantic class).
The results show that SingE is most similar to NatE, closely followed by HKE, with IndE deviating from NatE more. Overall, this variable-specific result is compatible with the overall result mentioned above. However, it is also obvious from the result that IndE differs most from NatE in three semantic contexts:
i. with action verbs in the matrix verb slot, IndE speakers are much more likely to use to than native speakers;
ii. with abstract verbs in that slot, IndE speakers are more likely to use ing than native speakers;
iii. with communication verbs, IndE speakers are a bit more likely to use to than native speakers.
Figure 3 is a corresponding representation of the results for VERBTYPEMATRIX : COUNTRY. In this case, although the results for the three ESL varieties are much more similar to each other, the same ordering of similarity to NatE is observed: SingE > HKE > IndE. However, here, the more interesting aspect is that variation across ESL varieties is limited to specific semantic classes: for states and accomplishments, the three varieties do not differ much from NatE, nor do they differ from each other. In the specific case of states, all Asian speakers slightly overuse ing. For accomplishments, most Asian speakers slightly overuse to. For process verbs (e.g. try, resist, contemplate, seek), all Asian varieties use ing more often than native speakers would whereas, for achievements (e.g. decide, stop, die), HKE and IndE speakers use to more than native speakers would.
Figure 2. The effect of matrix verb semantics and country across HKE, IndE and SingE (both panels titled “The effect of matrix verb semantics and country”; x-axis: Predicted Deviation value (mean); levels: abstract, action, cogn, comm, emot, perc by hk, ind, sing)
Figure 3. The effect of matrix verb type and country across HKE, IndE and SingE (both panels titled “The effect of matrix verb type and country”; x-axis: Predicted Deviation value (mean); levels: accomplishment, achievement, process, state by hk, ind, sing)
Figure 4 is concerned with
VERBSEMCOMP : COUNTRY. The results are somewhat similar to those of
VERBSEMMATRIX : COUNTRY. Again, HKE and SingE are more similar to the
native-speaker choices (although this time HKE is slightly more similar to NatE
than SingE) compared to IndE, which exhibits some higher deviations. As before,
however, this deviation of IndE is not found across the board — it is most
pronounced with complement verbs denoting emotions (e.g. suffer, hate, enjoy), actions (e.g. perform, create, build), and
communication (e.g. request, inform, discuss). Interestingly, now that we are looking at the complement
clause, some tendencies are reversed: while IndE speakers again overuse ing with emotion verbs (this time in the
complement clause, not as above in the matrix clause), with action and
communication verbs, they now, in the complement clause, overuse to relative to native speakers.
Figure 5
represents the results for OBJECTFORM : COUNTRY. The upper panel is not
particularly informative, but it does draw attention to the fact that, on the
whole, the seven object forms pattern somewhat similarly across the three
varieties. The lower panel shows more clearly that for each object form, IndE exhibits the largest deviations (by overusing ing), and that the level OBJECTFORM: no leads to the least nativelike choices. However, as might be expected for a variable with much less importance than the previously discussed ones, there seems to be much less systematic patterning here.
For reasons of space, we will
discuss the remaining results, which are associated with lower variable
importance scores anyway, just summarily (and we will not discuss COUNTRY as a
main effect, given that it participates in the above interactions):
i. for FINITEMATRIX : COUNTRY, we find that (i) SingE is closer to NatE than HKE, which in turn is closer to NatE than IndE, and (ii) in HKE and IndE, finite matrix verbs lead to slight overuses of ing whereas non-finite ones lead to much more pronounced overuses of to.
ii. for VERBTYPECOMP : COUNTRY, the results are very similar to those of VERBTYPEMATRIX : COUNTRY: IndE differs more from NatE than the other two Asian Englishes, and particularly so in its overuse of ing with process verbs.
iii. for VOICEMATRIX : COUNTRY, the Asian English speakers overuse ing more with active matrix verbs; with passive ones, they are very close to the predicted NatE choices.
iv. for NEGMATRIX : COUNTRY, no particularly strong pattern emerges; the only potential effect is that HKE and IndE speakers are more nativelike with negated matrix verbs.
Figure 4. The effect of complement verb semantics and country across HKE, IndE and SingE
Figure 5. The effect of object form and country across HKE, IndE and SingE (both panels titled “The effect of the object form and country”; x-axis: Predicted Deviation value (mean); object-form levels: adjv, infsubcl, missing, no, np, pp, pr by hk, ind, sing)
4.4 Excursus: BrE vs. AmE
As we discussed briefly above, including AmE among the
native-speaker data is not totally uncontroversial. While we have already
mentioned several reasons why we remain convinced that this is a good idea, we
are now also in a position to provide empirical data to that effect. In
particular, we apply Gries & Bernaisch’s (2016) bottom-up strategy to
identify epicenters to our present data. We:
i. split up the native-speaker data into the BrE and the AmE data;
ii. ran separate random forests analyses on each of them;
iii. used both the BrE random forests results and the AmE random forests results separately to make predictions for the Asian Englishes data;
iv. then computed how well the BrE predictions and the AmE predictions predicted what the ESL speakers would say and (i) explored that correlation in C-values as well as (ii) tested statistically whether the Asian Englishes deviation values from the BrE predictions were significantly different from those of the AmE predictions.
The results were very clear: using BrE to predict the Asian Englishes data led to an extremely poor C-value (0.51) whereas using AmE to predict the Asian Englishes data led to a much better one (0.76). In addition, the mean of the absolute deviations from the BrE examples (0.082) is nearly twice as high as that from the AmE examples (0.044), a difference that is statistically significant both in a U-test ($W = 3215000$, $p < 10^{-10}$) and a Kolmogorov-Smirnov test ($D = 0.124$, $p < 10^{-10}$). In other words, it is the American English data that are more similar to the Asian Englishes data. While that does not, of course, provide evidence with regard to the exact causal patterning, it does provide prima facie evidence against discarding any AmE data in an a priori fashion.
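The comparison of the two sets of deviations can be sketched as follows (in Python, for illustration only). The sketch assumes two lists of DEVIATION values for the same ESL cases, one derived from BrE-based and one from AmE-based predictions, and computes the mean absolute deviation for each as well as a U statistic via its pair-counting definition; p-values are not computed. The function names and the toy values are hypothetical.

```python
def mean_abs(values):
    """Mean of the absolute values."""
    return sum(abs(v) for v in values) / len(values)


def mann_whitney_u(xs, ys):
    """U statistic for xs: number of (x, y) pairs with x > y; ties count as 0.5."""
    return sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)


# Hypothetical deviations of the same four ESL cases from the two predictions.
dev_from_bre = [0.0, 0.2, -0.3, 0.1]
dev_from_ame = [0.0, 0.1, -0.1, 0.0]

abs_bre = [abs(d) for d in dev_from_bre]
abs_ame = [abs(d) for d in dev_from_ame]

u = mann_whitney_u(abs_bre, abs_ame)
max_u = len(abs_bre) * len(abs_ame)
print(round(mean_abs(dev_from_bre), 3), round(mean_abs(dev_from_ame), 3), u, max_u)
```

A U value well above half of the maximum (the product of the two sample sizes) indicates that the first sample tends to contain the larger values; in the toy data, the absolute deviations from the hypothetical BrE-based predictions are the larger ones.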
5. Conclusions
As mentioned above, given the absence of testable
hypotheses regarding the to vs. ing alternation with indigenized variety
speakers, this study has a decidedly descriptive slant, which is also
compatible with the current random forests analysis (rather than the use of the
hypothesis-testing approach of regression modeling). After having provided
detailed representations of the results above, we do not reiterate those here,
but believe it is justified to summarily state that (i) the set of formal and
semantic characteristics we annotated does distinguish reliably between the two
complementation constructions and that (ii) the overall method is applicable
and yields results with an overall good degree of prediction accuracy.
It is therefore interesting to look at the results
from the perspective of how they relate to widely-used theories such as
Schneider’s (2007) dynamic-evolutionary model. In a nutshell, according to that
model, New English (i.e. ESL) varieties follow a uniform pattern of evolution world-wide
as “there is a shared underlying process which drives [the formation of
postcolonial Englishes], accounts for many similarities between them, and
appears to operate whenever a language is transplanted” (Schneider 2007: 29).
Crucially, indigenized Englishes evolve following a sequence of five
characteristic stages that are associated with linguistic changes and the
gradual emergence of locally characteristic linguistic patterns. According to
the model, HKE, IndE and SingE all represent different stages in the evolution
of postcolonial Englishes: HKE is transitioning between phases II and III (i.e.
while the input language still determines language standards and norms in the
variety, it is nonetheless becoming an integral part of the local linguistic repertoire),
IndE is transitioning between phases III and IV (i.e. as the input language is
becoming part of the local linguistic repertoire and the number of competent
bilingual L2 speakers increases, the new English variety is beginning to
develop accepted local standards), and SingE is at an advanced stage of phase
IV (i.e. as the input language may have been retained as a (co-)official
language and is used in intra-national contexts, the variety is undergoing a
stabilization process).
Bearing this context in mind, much previous alternation research has yielded results such that, the more advanced a stage a variety is at, the more it becomes different from the historical source variety (see Mukherjee & Gries 2009). Interestingly, this is not the case here: on the whole, the stagewise more advanced variety of SingE is more, not less, similar to the native speaker data; conversely, the stagewise less advanced varieties of HKE and IndE are less, not more, similar to the native speaker data (but see also below). While our results in no way disprove Schneider’s (2007) model, they raise an interesting possibility. As varieties become increasingly emancipated as a whole from some source variety (in previous work, typically BrE), they have, in many earlier studies, become more different from it; it seems, however, and it would be premature to state this any more strongly, that this does not always have to be the case. More abstractly, emancipation can, but need not always, result in unidirectional pathways away from the historical source variety. HKE and IndE are different from NatE, but whatever crystallization of
patterns emerged for SingE on its way to stage IV did not make SingE even more
different from NatE, and it will be interesting to see whether similar pathways
can be identified for other phenomena.
That being observed and
hypothesized, our result may also have to do with one other relative innovation
in this study, the potentially controversial inclusion of AmE into the NatE
data. While we believe we have provided three good reasons for why we have done
this, this may affect the results such that there would still just be a
unidirectional development away from BrE (even for SingE), but the data do not
show that because SingE is also (more?) evolving in the direction of AmE, and
indeed our excursus exploration of which of the two native varieties the Asian
Englishes are more similar to returned AmE. In other words, this seems to us to
be an encouragement for future research to not just look at what ESL varieties
are developing away from (BrE), but also where “they are headed”, which on the
one hand will be shaped by local L1s, but on the other potentially also by
contemporary influential cultures (influential on a world-wide scale, that is).
While, based on our results, it is clearly too soon to claim that ESL varieties
have started to undergo a process of Americanization, our results certainly
stress the importance of rigorously accounting for the globalization of English
and its effect on ESL (Bolton 2008) in our corpus studies. In turn, our results
raise the important question of what constitutes an appropriate native-speaker
yardstick in the late 20th and early 21st century. Given the currently
available data, a more precise analysis contrasting BrE and AmE separately with
all the Asian Englishes data awaits much larger data sets, ideally
manually-annotated data sets that then also allow for regression modeling
again, given the interesting methodological tools that provides for the analyst
(in particular, controlling for lexically-specific effects).
Finally,
there is a trivial sounding but nonetheless important methodological lesson to
be learnt here, namely that sometimes data sets do not permit use of the
method(s) that has/have emerged as the state-of-the-art. This has two
consequences. First and again seemingly trivially, we need to be familiar
(enough) with a range of tools that allow us to squeeze information out of our
data sets that are often limited in size and annotation (for obvious reasons
having to do with the availability of representative (!) corpora and annotation
manpower); in this case, while random forests do not provide all the
“machinery” that regression modeling provides, they are more compatible with the
descriptive/exploratory approach we adopted, and they also offer us some
interesting results and in fact even some advantages (e.g. minimizing the risks of
overfitting and collinearity). Second, we need more authors testing and
reporting in their papers whether the methods they used were in fact
applicable/appropriate given the data. Observational corpus data are often
extremely skewed and often highly collinear, but there are very few papers out
there that mention these facts and, more importantly even, how the authors
dealt with those threats. For instance, it is temptingly easy to proceed
without testing assumptions or to just ignore convergence warnings and high
variance inflation factors, but that comes at the cost of the validity of the
results, and being able to test for these assumptions and, if so needed, switch
to an alternative method, is therefore important for the discipline that is
slowly beginning to be more statistical as a whole. We hope that the questions
we raise above, and the recommendations we just made, will lead to an
increasingly sophisticated exploration of what it means for (English) varieties
to emancipate themselves.
References
Bernaisch,
T., Gries, St. Th., & Mukherjee, J. (2014). The dative alternation in South
Asian English(es): Modelling predictors and predicting prototypes. English World-Wide, 35(1), 7–31. doi: 10.1075/eww.35.1.02ber
Bolton, K.
(2008). Varieties of World Englishes. In B. B. Kachru, Y. Kachru & C. L.
Nelson (Eds.), The Handbook of World
Englishes (pp. 289–312). Singapore: Wiley-Blackwell.
Bresnan, J., Cueni, A., Nikitina, T., & Baayen, R. H. (2007). Predicting the dative alternation. In G. Bouma, I. Kraemer & J. Zwarts (Eds.), Cognitive Foundations of Interpretation (pp. 69–94). Amsterdam: Royal Netherlands Academy of Science.
Burnham, K. P., & Anderson, D. R. (2002). Model Selection and Multimodel Inference: A
Practical Information-theoretic Approach (2nd ed.). New York, NY: Springer.
Collins, P. (1995). The
indirect object construction in English: An informational approach.
Linguistics, 33(1), 35–49. doi: 10.1515/ling.1995.33.1.35
Cuyckens, H., D’hoedt, F., & Szmrecsanyi, B. (2014). Variability in verb complementation in late modern English: Finite vs. non-finite patterns. In M. Hundt (Ed.), Late Modern English Syntax (pp. 182–204). Cambridge: Cambridge University Press. doi: 10.1017/CBO9781139507226.014
De Smet, H. (2013). Spreading
Patterns: Diffusional Change in the English System of Complementation.
Oxford: Oxford University Press.
Deshors, S.
C. (2014a). Towards an identification of prototypical non-native modal
constructions in EFL: A corpus-based approach. Corpus Linguistics and Linguistic Theory, 11(1), 19–50.
Deshors, S.
C. (2014b). A case for a unified treatment of EFL and ESL: A multifactorial
approach. English World Wide, 35(3), 279–307. doi: 10.1075/eww.35.3.02des
Deshors, S.
C. (2015). A constructionist approach to gerundial and infinitival
verb-complementation patterns in native and Hong Kong English varieties. English Text Construction, 8(2), 207–235. doi: 10.1075/etc.8.2.04des
Divjak, D. S., & Arppe, A. (2013). Extracting prototypes
from exemplars: What can corpus data tell us about concept representation? Cognitive Linguistics, 24(2), 221–274. doi: 10.1515/cog-2013-0008
Duffley, P. (1999). The use of the infinitive and the -ing after verbs denoting the beginning, middle and end of an event. Folia Linguistica, 23(3), 295–331.
Edwards, A. (2014). The
progressive aspect in the Netherlands and the ESL/EFL continuum.
World Englishes, 33(2), 173–94. doi: 10.1111/weng.12080
Green, G. M.
(1974). Semantic and Syntactic
Irregularity. Bloomington, IN: Indiana University Press.
Greenbaum, S. (Ed.) (1996). Comparing English Worldwide: The International Corpus of English.
Oxford: Clarendon Press.
Gries, St. Th. (2003). Multifactorial
Analysis in Corpus Linguistics: A Study of Particle Placement. London:
Continuum Press.
Gries, St.
Th. (2015a). Quantitative linguistics. In J. D. Wright (Ed.), International Encyclopedia of the Social and
Behavioral Sciences (2nd ed.) (pp. 725–732). Oxford: Elsevier. doi: 10.1016/B978-0-08-097086-8.53037-2
Gries, St.
Th. (2015b). The role of quantitative methods in Cognitive Linguistics: Corpus
and experimental data on (relative) frequency and contingency of words and
constructions. In J. Daems, E. Zenner, K. Heylen, D. Speelman, & H.
Cuyckens (Eds.), Change of Paradigms -
New Paradoxes: Recontextualizing Language and Linguistics (pp. 311–325).
Berlin: De Gruyter Mouton.
Gries, St.
Th., & Adelman, A. S. (2014). Subject realization in Japanese conversation
by native and non-native speakers: Exemplifying a new paradigm for learner
corpus research. In J. Romero-Trillo (Ed.), Yearbook
of Corpus Linguistics and Pragmatics 2014: New Empirical and Theoretical Paradigms
(pp. 35–54). Cham: Springer.
Gries, St. Th., &
Bernaisch, T. (2016). Exploring epicenters empirically: Focus on South Asian
Englishes. English World-Wide, 37(1), 1–25. doi: 10.1075/eww.37.1.01gri
Gries, St.
Th., & Deshors, S. C. (2014). Using regressions to explore deviations
between corpus data and a standard/target: Two suggestions. Corpora, 9(1), 109–136. doi: 10.3366/cor.2014.0053
Gries, St.
Th., & Deshors, S. C. (2015). EFL and/vs. ESL? A multi-level regression
modeling perspective on bridging the paradigm gap. International Journal of Learner Corpus Research, 1(1), 130–159. doi: 10.1075/ijlcr.1.1.05gri
Gries, St.
Th., & Wulff, S. (2013). The genitive alternation in Chinese and German ESL
learners: Towards a multifactorial notion of context in learner corpus
research. International Journal of Corpus
Linguistics, 18(3), 327–356. doi: 10.1075/ijcl.18.3.04gri
Hoffmann, T.
(2014). The cognitive evolution of Englishes: The role of constructions in the
Dynamic Model. In S. Buschfeld, T. Hoffmann, M. Huber & A. Kautzsch (Eds.),
The Evolution of Englishes: The Dynamic
Model and Beyond (pp. 160–180). Amsterdam: John Benjamins. doi: 10.1075/veaw.g49.10hof
Kaleta, A. (2012). The English gerund vs. the to-infinitive: The case of aspectual constructions. Selected papers from UK-CLA Meetings. Retrieved from http://www.uk-cla.org.uk/files/proceedings/Kaleta.pdf (last accessed June 2014).
Khamis, A.
(2015, July). Cross-varietal variation in English verb complementation: A
multivariate corpus analysis. Paper presented at the International Cognitive Linguistics Conference 2015, Newcastle upon
Tyne.
Koch, C.
(2015). Routines in lexis and grammar: A ‘gravity’ approach within the
International Corpus of English. Paper presented at the ICAME 36 conference, Universität Trier, 27-29 May 2015.
Kuperman,
V., & Bresnan, J. (2012). The effects of construction probability on word
durations during spontaneous incremental sentence production. Journal of Memory and Language,
66(4), 588–611. doi: 10.1016/j.jml.2012.04.003
Langacker, R. (1991). Cognitive Grammar 2. Stanford, CA:
Stanford University Press.
Liaw, A.,
& Wiener, M. (2015). randomForest. Version 4.6-12. A package for R.
Retrieved from https://cran.r-project.org/web/packages/randomForest/index.html (last
accessed February 2016).
Mair, C.
(2002). Three changing patterns of verb complementation in Late Modern English:
A real-time study based on matching text corpora. English Language and Linguistics, 6(1),
105–131. doi: 10.1017/S1360674302001065
Mair, C. (2013). The world
system of English. English World-Wide,
34(3), 253–278.
Martínez-García, M. T., & Wulff, S. (2012). Not wrong, yet not quite right: Spanish ESL students’ use of gerundial and infinitival complementation. International Journal of Applied Linguistics, 22(2), 225–244. doi: 10.1111/j.1473-4192.2012.00310.x
Matsuki, K., Kuperman, V.,
& Van Dyke, J. A. (2016). The Random Forests statistical technique:
An examination of its value
for the study of reading. Scientific
Studies of Reading, 20(1),
20–33. doi: 10.1080/10888438.2015.1107073
Mindt, D. (2000). An
Empirical Grammar of the English Verb. Berlin: Cornelsen.
Mukherjee, J., & Gries, St. Th. (2009). Collostructional nativisation in New Englishes: Verb-construction associations in the International Corpus of English. English World-Wide, 30(1), 27–51. doi: 10.1075/eww.30.1.03muk
Mukherjee,
J., & Hoffmann, S. (2006). Describing verb-complementational profiles of
New Englishes: A pilot study of Indian English. English World-Wide, 27(2),
147–173. doi: 10.1075/eww.27.2.03muk
Mukherjee, J., & Schilk, M. (2008). Verb-complementational profiles across varieties of English: Comparing verb classes in Indian English and British English. In T. Nevalainen, I. Taavitsainen, P. Pahta & M. Korhonen (Eds.), The Dynamics of Linguistic Variation: Corpus Evidence on English Past and Present (pp. 163–181). Amsterdam: John Benjamins. doi: 10.1075/silv.2.14muk
Nam, C.,
Mukherjee, S., Schilk, M., & Mukherjee, J. (2013). Statistical analysis of
varieties of English. Journal of the
Royal Statistical Society, 176(3),
777–793. doi: 10.1111/j.1467-985X.2012.01062.x
Noël, D.
(2003). Is there semantics in all syntax? The case of accusative and infinitive
constructions vs. that-clauses. In G. Rohdenburg & B. Mondorf (Eds.), Language Typology and Syntactic Description:
Vol.2, Complex Constructions (pp. 52–150). Cambridge: Cambridge University
Press.
Noonan, M.
(1985). Complementation. In T. Shopen (Ed.), Language Typology and Syntactic Description. Vol. 2. Complex
Constructions (pp. 42–110). Cambridge: Cambridge University Press.
Olavarría de Ersson, E., & Shaw, P. (2003). Verb complementation patterns in Indian standard English. English World-Wide, 24(2), 137–161. doi: 10.1075/eww.24.2.02ers
R Development Core Team. (2012). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing. Retrieved from http://R-project.org (last accessed July 2012).
Ransom, E.
(1979). Definiteness and animacy constraints on passives and double object
constructions in English. Glossa, 13(2), 215–240.
Rohdenburg, G. (1995). On the
replacement of finite complement clauses by infinitives in
English. English Studies, 76(4), 367–388. doi: 10.1080/00138389508598980
Schilk, M.,
Bernaisch, T., & Mukherjee, J. (2012). Mapping unity and diversity in South
Asian English lexicogrammar. In M. Hundt & U. Gut (Eds.), Mapping Unity and Diversity World-wide:
Corpus-based Studies of New Englishes (pp. 137–166). Amsterdam: John Benjamins.
doi: 10.1075/veaw.g43.06sch
Schilk, M.,
Mukherjee, J., Nam, C., & Mukherjee, S. (2013). Complementation of
ditransitive verbs in south Asian Englishes: A multifactorial analysis. Corpus Linguistics and Linguistic
Theory, 9(2), 187–225. doi: 10.1515/cllt-2013-0001
Shastri, S.
V. (1996). Using computer corpora in the description of language with special
reference to complementation in Indian English. In R. J. Baumgardner (Ed.), South Asian English: Structure, Use, and
Users (pp. 70–81). Urbana & Chicago, IL: University of Illinois Press.
Schneider,
E. (2007). Postcolonial English:
Varieties around the World. Cambridge: Cambridge University Press. doi: 10.1017/CBO9780511618901
Smith, M. B., & Escobedo, J. (2002). The semantics of to-infinitival vs. -ing verb complement constructions in English. In M. Andronis, C. Ball, H. Elston & S. Neuvel (Eds.), Proceedings from the Main Session in the Chicago Linguistics Society’s Thirty-Seventh Meeting (pp. 549–564). Chicago, IL: Chicago Linguistic Society.
Szmrecsanyi, B., & Kortmann, B. (2011). Typological profiling: Learner Englishes versus L2 varieties of English. In J. Mukherjee & M. Hundt (Eds.), Exploring Second-language Varieties of English and Learner Englishes: Bridging the Paradigm Gap (pp. 167–207). Amsterdam: John Benjamins. doi: 10.1075/scl.44.09kor
Vendler, Z. (1957). Verbs and times. Linguistics in Philosophy, 66(2), 143–160.
Vosberg, U.
(2003). The role of extractions and horror aequi in the evolution of -ing complements in modern English. In G.
Rohdenburg & B. Mondorf (Eds.), Determinants
of Grammatical Variation in English (pp. 329–345). Berlin: Mouton de
Gruyter.
Wierzbicka,
A. (1988). The Semantics of Grammar.
Amsterdam: John Benjamins.
doi: 10.1075/slcs.18
Wulff, S.,
& Gries, St. Th. (2015). Prenominal adjective order preferences in Chinese
and German L2 English: A multifactorial corpus study. Linguistic Approaches to Bilingualism,
5(1), 122–150. doi: 10.1075/lab.5.1.05wul
Authors’ addresses
Sandra C. Deshors
Department of Languages and Linguistics
New Mexico State University
MSC 3L
Las Cruces, NM 88003
United States of America
deshors@nmsu.edu

Stefan Th. Gries
Department of Linguistics
University of California, Santa Barbara
Santa Barbara, CA 93106–3100
United States of America
stgries@linguistics.ucsb.edu
[1] Although the term ‘non-native’ is often used in the literature to refer to erroneous use, in this paper, the term is simply used to refer to language use that significantly differs from native use; we use NatE as an abbreviation here for British English (BrE) and American English (AmE).