## Abstract

We report our work on building linguistic resources and data-driven parsers for grammatical relation (GR) analysis of Mandarin Chinese. Chinese, as an analytic language, encodes grammatical information in a highly configurational rather than morphological way. Accordingly, it is possible and reasonable to represent almost all grammatical relations as bilexical dependencies. In this work, we propose to represent grammatical information using general directed dependency graphs. Both local and rich long-distance dependencies are explicitly represented. To create high-quality annotations, we take advantage of an existing treebank, namely, the Chinese TreeBank (CTB), which is grounded in the Government and Binding theory. We define a set of linguistic rules to exploit CTB’s implicit phrase-structural information and build deep dependency graphs. The reliability of this linguistically motivated GR extraction procedure is highlighted by manual evaluation. Based on the converted corpus, data-driven models, including graph- and transition-based ones, are explored for Chinese GR parsing. For graph-based parsing, a new perspective, graph merging, is proposed for building flexible dependency graphs: constructing complex graphs by constructing simple subgraphs. Two key problems are discussed under this perspective: (1) how to decompose a complex graph into simple subgraphs, and (2) how to combine subgraphs into a coherent complex graph. For transition-based parsing, we introduce a neural parser based on a list-based transition system. We also discuss several other key problems, including dynamic oracles and beam search for neural transition-based parsing. Evaluation gauges how successful GR parsing for Chinese can be with data-driven models. The empirical analysis suggests several directions for future study.

## 1. Introduction

Dependency grammar is a longstanding tradition that determines syntactic structures on the basis of word-to-word connections. It names a family of approaches to linguistic analysis that share a commitment to typed relations between ordered pairs of words. Usually, dependencies represent various grammatical relations (GRs), which are exemplified in traditional grammars by the notions of subject, direct/indirect object, etc., and therefore encode rich syntactic information of natural language sentences in an explicit way. In recent years, parsing a sentence to a dependency representation has been well studied and widely applied to many Natural Language Processing (NLP) tasks, for example, Information Extraction and Machine Translation. In particular, data-driven approaches have made great progress during the past two decades (Kübler, McDonald, and Nivre 2009; Koo et al. 2010; Pitler, Kannan, and Marcus 2013; Chen and Manning 2014; Weiss et al. 2015; Dozat and Manning 2016). Various practical parsing systems have been built, not only for English but also for a large number of typologically different languages, for example, Arabic, Basque, Catalan, Chinese, Czech, Greek, Hungarian, Italian, and Turkish (Nivre et al. 2007).

Previous work on dependency parsing mainly focused on structures that can be represented in terms of directed bilexical dependency trees. Although tree-shaped graphs have several desirable properties from the computational perspective, they do not work well for coordination, long-range dependencies involved in raising, control, and extraction, and many other complicated linguistic phenomena that go beyond the surface syntax. Some well-established and leading linguistic theories, such as Word Grammar (Hudson 1990) and Meaning-Text Theory (Mel’čuk 2001), argue that more general dependency graphs are necessary to represent a variety of syntactic or semantic phenomena.

In this article, we are concerned with parsing Chinese sentences to an enriched deep dependency representation, which marks up a rich set of GRs to specify a linguistically-rich syntactic analysis. Different from the popular *surface*^{1} tree-based dependency representation, our GR annotations are represented as general directed graphs that express not only local but also various long-distance dependencies (see Figure 1 for example). To enhance the tree-shaped dependency representation, we borrow the key ideas underlying Lexical Function Grammar (LFG) (Bresnan and Kaplan [1982], Dalrymple [2001]), a well-defined and widely-applied linguistic grammar formalism. In particular, GRs can be viewed as the lexicalized dependency backbone of an LFG analysis that provides general linguistic insights.

To acquire a high-quality GR corpus, we propose a linguistically-motivated algorithm to translate a Government and Binding theory (GB) (Chomsky [1981], Carnie [2007]) grounded phrase structure treebank, namely, Chinese TreeBank (CTB) (Xue et al. [2005]) to a deep dependency bank where the GRs are explicitly represented. A manual evaluation highlights the reliability of our linguistically-motivated GR extraction algorithm: The overall dependency-based precision and recall are 99.17% and 98.87%. These numbers are calculated based on 209 sentences that are randomly selected and manually corrected. The automatically-converted corpus would be of use for a wide variety of NLP tasks, for example, parser training and cross-formalism parser evaluation.

By means of the newly created corpus, we study data-driven models for deep dependency parsing. The key challenge in building GR graphs is the range of relational flexibility. Different from the surface syntax, GR graphs are not constrained to trees. However, the tree structure is a fundamental consideration of the design for the majority of existing parsing algorithms. To deal with this problem, we propose **graph merging**, an innovative perspective for building flexible representations. The basic idea is to decompose a GR graph into several subgraphs, each of which captures most but not necessarily the complete information. On the one hand, each subgraph is *simple* enough to allow efficient construction. On the other hand, the combination of all subgraphs enables the whole target GR structure to be produced.

Based on this design, a practical parsing system is developed with a two-step architecture. In the first step, different graph-based models are applied to assign scores to individual arcs and various tuples of arcs. Each individual model is trained to target a unique type of subgraph. In the second step, a Lagrangian Relaxation-based joint decoder is applied to efficiently produce globally optimal GR graphs according to all graph-based models.

There are two main problems in the graph merging perspective. First, how do we decompose a complex graph into simple subgraphs in a principled way? To deal with this problem, we propose modeling the graph decomposition problem as a constrained optimization problem and leverage appropriate objective functions as well as constraints to reflect essential structure-specific properties of the syntactically-motivated GR graphs. In particular, we consider the reachability property: In a given GR graph, every node is reachable from the same unique root. This property ensures that a GR graph can be successfully decomposed into a limited number of forests, which in turn can be accurately and efficiently built via tree parsing. Secondly, how can we merge subgraphs into one coherent structure in a principled way? The problem of finding an optimal graph that consistently combines the subgraphs obtained through individual models is non-trivial. We also treat this problem as an optimization problem and employ Lagrangian Relaxation to solve the problem.
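The reachability property mentioned above can be checked mechanically. Below is a minimal sketch, assuming a hypothetical encoding of a GR graph as a set of `(head, dependent)` arcs over nodes `0..n-1` with node `0` as a virtual root; it illustrates the property itself, not the paper's decomposition algorithm.

```python
from collections import deque

def is_reachable_graph(n, arcs, root=0):
    """Check the reachability property: every one of the n nodes of a GR
    graph must be reachable from the unique root via directed arcs.
    `arcs` is a set of (head, dependent) pairs; `root` is a virtual root."""
    adj = {}
    for h, d in arcs:
        adj.setdefault(h, []).append(d)
    seen = {root}
    queue = deque([root])
    while queue:  # breadth-first traversal from the root
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n
```

Note that the check succeeds even when a node has multiple incoming arcs (a re-entrancy), which is precisely what distinguishes GR graphs from trees.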

Motivated by the recent successes of transition-based approaches to tree parsing, we also explore transition-based models for building deep dependency structures. We utilize a list-based transition system that uses open lists for organizing partially processed tokens. This transition system is sound and complete with respect to directed graphs without self-loops. With reference to this system, we implement a data-driven parser with a neural classifier based on Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber [1997]). To improve the parsing quality, we look further into the impact of a dynamic oracle and beam search on training and decoding.
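To make the flavor of such a system concrete, here is a much-simplified, hypothetical sketch of a list-based transition system. The class and transition names are illustrative only (the actual system in the paper may differ); the essential point is that arc transitions do not consume tokens, so a token can receive multiple heads, which is what lets the system derive graphs rather than trees, and since every arc connects two distinct tokens, no self-loops can arise.

```python
class ListBasedParser:
    """Illustrative sketch of a list-based transition system for dependency
    graphs. Configuration: an open list lam of partially processed tokens,
    a buffer beta of remaining input, a temporary list delta, and an arc set."""

    def __init__(self, n):
        self.lam = [0]                      # 0 is a virtual root
        self.beta = list(range(1, n + 1))   # input tokens 1..n
        self.delta = []                     # tokens set aside during scanning
        self.arcs = set()                   # (head, dependent, label) triples

    def left_arc(self, label):              # lam[-1] <- beta[0]
        self.arcs.add((self.beta[0], self.lam[-1], label))
        self.delta.append(self.lam.pop())   # token is kept, not discarded

    def right_arc(self, label):             # lam[-1] -> beta[0]
        self.arcs.add((self.lam[-1], self.beta[0], label))
        self.delta.append(self.lam.pop())

    def no_arc(self):                       # no relation; keep scanning lam
        self.delta.append(self.lam.pop())

    def shift(self):                        # restore lam and consume beta[0]
        self.lam.extend(reversed(self.delta))
        self.delta = []
        self.lam.append(self.beta.pop(0))
```

Because `left_arc` and `right_arc` move the token to `delta` rather than deleting it, a later transition sequence can attach further heads to the same token.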

Experiments on CTB 6.0 are conducted to profile the two types of parsers. Evaluation gauges how successful GR parsing for Chinese can be by applying data-driven models. Detailed analysis reveals some important factors that may further boost performance. In particular, the following non-obvious facts should be highlighted:

- 1.
Both the graph merging and transition-based parsers produce high-quality GR analysis with respect to dependency matching, although they do not use any phrase structure information.

- 2.
When gold-standard POS tags are available, the graph merging model obtains a labeled f-score of 84.57 on the test set when coupled with a global linear scorer, and 86.17 when coupled with a neural model. Empirical analysis indicates the effectiveness of the proposed graph decomposition and merging methods.

- 3.
The transition-based parser can be trained with either the dynamic oracle or the beam search method. According to our implementations, the dynamic oracle method performs better. Coupled with dynamic oracle, the transition-based parser reaches a labeled f-score of 85.51 when gold POS tags are utilized.

- 4.
Gold-standard POS tags play a very important role in achieving the above accuracies. However, the automatic tagging quality of state-of-the-art Chinese POS taggers is still far from satisfactory. To deal with this problem, we apply ELMo (Peters et al. 2018), a contextualized word embedding producer, to obtain adequate lexical information. Experiments show that ELMo is extremely effective in harvesting lexical knowledge from raw texts. With the help of the ELMo vectors but without any POS tagging results, the graph merging parser achieves a labeled f-score of 84.90. This is a realistic set-up to evaluate how accurate the GR parsers could be for real-world Chinese Language Processing applications.

The current study expands on the earlier and preliminary results, which have been published in Sun et al. (2014) and Sun, Du, and Wan (2017). The new findings and contributions in this article include: (1) a detailed comparison between our GR analysis and other syntactic/semantic analyses, including Universal Dependency, Semantic Role Labeling, and LFG’s f-structure, (2) the design of a new neural graph merging parser, (3) the design of a new neural transition-based parser, and (4) more comprehensive evaluations and analyses of the neural parsing models.

The implementation of the corpus conversion algorithm, a corpus visualization tool, and a set of manually labeled sentences are available at http://www.aclweb.org/anthology/attachments/P/P14/P14-1042.Software.zip. The implementation of the two neural parsers is available at https://github.com/pkucoli/omg.

## 2. Background

### 2.1 Deep Linguistic Processing

Deep linguistic processing is concerned with NLP approaches that aim at modeling the complexity of natural languages in rich linguistic representations. Such approaches are typically related to a particular computational linguistic theory, including Combinatory Categorial Grammar (CCG) (Steedman [1996, 2000]), Lexical Functional Grammar (LFG) (Bresnan and Kaplan [1982], Dalrymple [2001]), Head-driven Phrase Structure Grammar (HPSG) (Pollard and Sag [1994]), and Tree-adjoining Grammar (TAG) (Joshi and Schabes [1997]). These theories provide not only a description of syntactic structures, but also the ways in which meanings are composed, and thus parsing in these formalisms provides an elegant way to simultaneously obtain both syntactic and semantic analyses, generating valuable and richer linguistic information. Deep linguistic annotators and processors are in strong demand for NLP applications, such as machine translation (Oepen et al. 2007; Wu, Matsuzaki, and Tsujii 2010) and text mining (Miyao et al. 2008).

In general, derivations based on deep grammar formalisms may provide a wider variety of syntactic and semantic dependencies in more explicit details. Deep parses arguably contain the most linguistically satisfactory account of the deep dependencies inherent in many complicated linguistic phenomena, for example, coordination and extraction. Among the grammatical models, LFG and HPSG encode grammatical functions directly, and they are adequate for generating predicate–argument analyses (King et al. 2003; Flickinger, Zhang, and Kordoni 2012). In terms of CCG, Hockenmaier and Steedman (2007) introduced a co-indexing method to extract (*predicate*, *argument*) pairs from CCG derivations; Clark and Curran (2007a) also utilized a similar method to derive LFG-style grammatical relations. These bilexical dependencies can be used to approximate the corresponding semantic structures, such as logic forms in Minimal Recursion Semantics (Copestake et al. 2005).

In recent years, considerable progress has been made in deep linguistic processing for realistic texts. Traditionally, deep linguistic processing has been concerned with grammar development for parsing and generation. During the last decade, techniques for treebank-oriented grammar development (Cahill et al. 2002; Miyao, Ninomiya, and Tsujii 2005; Hockenmaier and Steedman 2007) and statistical deep parsing (Clark and Curran 2007b; Miyao and Tsujii 2008) have been well studied for English. Statistical models that are estimated on large-scale treebanks play an essential role in boosting the parsing performance.

It is generally believed that the representation format for parser outputs may greatly affect its impact on applications. Predicate–argument structures extracted from deep parses have been shown to be very helpful for NLP tasks such as information extraction (Miyao et al. 2008; Yakushiji et al. 2005). Partly due to their importance in information processing, deep dependencies are the de facto standard for evaluating deep parsers (Kaplan et al. 2004; Briscoe and Carroll 2006; Clark and Curran 2007a; Bender et al. 2011), including C&C^{2} and Enju.^{3} If the interpretable predicate–argument structures are so valuable, then a question arises: Why not build deep dependency graphs directly?

### 2.2 Data-Driven Dependency Parsing

Data-driven, grammar-free dependency parsing has received an increasing amount of attention in the past decade. Among such approaches, transition-based (Yamada and Matsumoto 2003; Nivre 2008) and graph-based (McDonald et al. 2005; McDonald, Crammer, and Pereira 2005) models have attracted the most attention in recent work. Transition-based parsers utilize transition systems to derive dependency trees together with treebank-induced statistical models for predicting transitions. This approach was pioneered by Yamada and Matsumoto (2003) and Nivre, Hall, and Nilsson (2004). More recently, deep learning techniques have been successfully applied to enhance parsing accuracy (Chen and Manning 2014; Weiss et al. 2015; Andor et al. 2016; Kiperwasser and Goldberg 2016; Dozat and Manning 2016).

Chen and Manning (2014) made the first successful attempt at incorporating deep learning into a transition-based dependency parser. Later work addressed some limitations of their parser by augmenting it with a more principled decoding algorithm, namely beam search, as well as a conditional random field loss objective (Weiss et al. 2015; Watanabe and Sumita 2015; Zhou et al. 2015; Andor et al. 2016).

A graph-based system explicitly parameterizes models over substructures of a dependency tree, and formulates parsing as a Maximum Spanning Tree problem (McDonald et al. 2005). Similar to the transition-based approach, it also utilizes treebank annotations to train strong statistical models to assign scores to word pairs. Many researchers focus on developing principled decoding algorithms to resolve the Maximum Spanning Tree problem involved, for both projective (McDonald and Pereira 2006; Koo and Collins 2010) and non-projective structures (Koo et al. 2010; Kuhlmann 2010; Pitler, Kannan, and Marcus 2013). Deep learning has also been proved to be powerful for disambiguation and remained a hot topic in recent research (Kiperwasser and Goldberg 2016; Dozat and Manning 2016).

Kiperwasser and Goldberg (2016) proposed a simple yet effective architecture to implement neural dependency parsers. In particular, a bidirectional-LSTM (Bi-LSTM) is utilized as a powerful *feature extractor* to assist a dependency parser. Mainstream data-driven dependency parsers, including both transition- and graph-based ones, can apply useful word vectors provided by a Bi-LSTM to calculate scores. Following Kiperwasser and Goldberg (2016)’s experience, we implemented such a parser to evaluate the impact of empty categories on surface parsing.

Most research has concentrated on surface syntactic structures, and the majority of existing approaches are limited to producing only trees. We notice several exceptions. Sagae and Tsujii (2008) and Henderson et al. (2013) individually introduced two transition systems that can generate certain graphs rather than trees. Unfortunately, the two transition systems only cover 51.0% and 76.5%, respectively, of the graphs in our data, highlighting the high complexity of our graph representation and the inadequacy of existing transition systems. McDonald and Pereira (2006) presented a graph-based parser that can generate dependency graphs in which a word may depend on multiple heads. Encouraged by their work, we study new data-driven models to build deep dependency structures for Mandarin Chinese.

### 2.3 Initial Research on Chinese Deep Linguistic Processing

In the last few years, research on deep linguistic processing for Chinese has been initiated. Treebank annotation for individual formalisms is prohibitively expensive. To quickly construct deep annotations, corpus-driven grammar engineering has been developed. Phrase structure trees in CTB have been semi-automatically converted to deep derivations in the CCG (Tse and Curran 2010), LFG (Guo, van Genabith, and Wang 2007), and HPSG (Yu et al. 2010) formalisms. To semi-automatically extract a large-scale HPSG grammar, Yu et al. (2010) defined a skeleton, including the structure of signs, grammatical principles, and schemata, based on which the CTB trees are converted into HPSG-style trees. The treebank conversion under the CCG formalism is relatively easier. Tse and Curran (2010) followed the method proposed for the English Penn Treebank (Hockenmaier and Steedman 2007) to process CTB. The main steps include (1) distinguishing head, argument, and adjunct, (2) binarizing CTB trees, and (3) re-defining syntactic categories. To more robustly construct f-structures for CTB trees, Guo, van Genabith, and Wang (2007) proposed a dependency-based model, which extracts functional information from dependency trees.

With the use of converted fine-grained linguistic annotations, successful English deep parsers, such as C&C (Clark and Curran 2007b) and Enju (Miyao and Tsujii 2008), have been evaluated on the Chinese annotations (Yu et al. 2011; Tse and Curran 2012). Although the discriminative learning architecture of both the C&C and Enju parsers makes them relatively easy to adapt to multilingual parsing, their performance on Chinese sentences is far from satisfactory. Yu et al. (2011) and Tse and Curran (2012) analyzed the challenges and difficulties in Chinese deep parsing. In particular, some language-specific properties account for a large number of errors.

## 3. Representing Deep Linguistic Information Using Dependency Graphs

In this section, we discuss the construction of the GR annotations. Basically, the annotations are automatically converted from a GB-grounded phrase structure treebank, namely CTB. Conceptually, this conversion is similar to the conversions from CTB structures to representations in deep grammar formalisms (Xia 2001; Guo, van Genabith, and Wang 2007; Tse and Curran 2010; Yu et al. 2010). However, our work is grounded in GB, which is the linguistic basis of the construction of CTB. We argue that this theoretical choice makes the conversion process more compatible with the original annotations and therefore more accurate. We use directed graphs to explicitly encode bilexical dependencies involved in coordination, raising/control constructions, extraction, topicalization, and many other complicated phenomena. Figure 2 shows the original CTB annotation of the sentence in Figure 1.

### 3.1 Linguistic Basis

GRs are encoded in different ways in different languages and languages may employ mixed or multiple strategies for grammatical function encoding. In some languages, for example, Turkish, grammatical function is encoded by means of morphological marking, and there may be no uniform position where a particular grammatical function must appear. In highly *configurational* languages, for example, Chinese, however, the grammatical function of a phrase is heavily determined by its constituent structure position. Dominant Chomskyan theories, including GB, have defined GRs as configurations at phrase structures. Following this principle, CTB groups words into constituents through the use of a limited set of fundamental grammatical functions. For example, one bracket represents only one grammatical relation in CTB, providing phrase structure annotations with specifications of particular grammatical functions. Transformational grammar utilizes empty categories to represent long-distance dependencies. In CTB, traces are provided by relating displaced linguistic material to where it should be interpreted semantically. By exploiting configurational information, traces, and functional tag annotations, GR information can hopefully be derived from CTB trees with high accuracy.

### 3.2 The Extraction Algorithm

Our treebank conversion algorithm borrows key insights from LFG. LFG minimally posits two levels of representation: c(onstituent)-structure and f(unctional)-structure. C-structure is represented by phrase structure trees, and captures surface syntactic configurations such as word order, whereas f-structure encodes grammatical functions. It is easy to extract a dependency backbone that approximates basic predicate–argument–adjunct structures from f-structures. The construction of the widely used PARC DepBank (King et al. 2003) is a good example.

LFG relates c-structure and f-structure through f-structure annotations, which compositionally map every constituent to a corresponding f-structure. Borrowing this key idea, we translate CTB trees to dependency graphs by first augmenting each constituent with f-structure annotations, then propagating the head words of the head or conjunct daughter(s) upwards to their parents, and finally creating a dependency graph. The details are presented step-by-step as follows:

#### 3.2.1 *Tapping Implicit Information*.

A systematic study to tap the implicit functional information of CTB has been introduced by Xue (2007). This gives us a very good starting point for extracting GRs. We slightly modify this method to enrich a CTB tree with f-structure annotations: Each node in a resulting tree is annotated with one and only one corresponding equation (see Figure 3 for an example). Comparing the original and enriched annotations, we can see that the functionality of this step is to explicitly represent and regulate grammatical functions^{4} for every constituent. We enrich the CTB trees with function information by using the conversion tool^{5} that generated the Chinese data sets for the CoNLL 2009 Shared Task on multilingual dependency parsing and semantic role labeling.

#### 3.2.2 *Beyond CTB Annotations: Tracing More*.

Natural languages do not always interpret linguistic materials locally. In order to obtain accurate and complete GR, predicate–argument, or logical-form representations, a hallmark of deep grammars is that they usually involve a non-local dependency resolution mechanism. CTB trees utilize empty categories and co-indexed materials to represent long-distance dependencies. An empty category is a nominal element that does not have any phonological content and is therefore unpronounced. There are two main types of empty categories: null pronouns and traces, and each type can be further classified into different subcategories. Two kinds of anaphoric empty categories, that is, big PRO and trace, are annotated in CTB. Theoretically speaking, only traces are generated as the result of *movement* and are therefore annotated with antecedents in CTB. We carefully checked the annotation and found that a considerable number of antecedents are not labeled; hence, a large amount of important non-local information is missing. In addition, because big PRO is also anaphoric, it is sometimes possible to find co-indexed components. Such non-local information is also valuable in marking dependency relations.

Beyond CTB annotations, we introduce a number of phrase structure patterns to extract more non-local dependencies. The method heavily leverages linguistic rules to exploit structural information. We take into account both theoretical assumptions and analyzing practices to enrich co-indexation information according to phrase structure patterns. In particular, we try to link an anaphoric empty category *e* with its c-commanders if no non-empty antecedent has already been co-indexed with *e*. Because CTB is influenced deeply by X-bar syntax, which highly regulates constituent analysis, the number of linguistic rules we have is quite modest. For the development of conversion rules, we used the first 9 files of CTB, which contain about 100 sentences. Readers can refer to the well-documented Perl script for details (see Figure 3 for an example). The noun phrase “regulatory documents” is related to the trace “*T*.” This co-indexation is not labeled in the original annotation.
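The notion of c-command used by such linking rules can be made concrete with a small sketch. We assume a hypothetical encoding of the phrase structure tree as a child-to-parent map and use a common simplified definition (X c-commands Y iff neither dominates the other and X's parent dominates Y); the paper's actual rules are richer than this check.

```python
def c_commands(parent, x, y):
    """Return True iff x c-commands y, under the simplified definition:
    neither node dominates the other, and x's parent dominates y.
    `parent` maps each node to its parent; the root maps to None."""
    def ancestors(n):
        out = []
        while parent[n] is not None:
            n = parent[n]
            out.append(n)
        return out
    if x == y or x in ancestors(y) or y in ancestors(x):
        return False                    # dominance excludes c-command
    return parent[x] in ancestors(y)    # x's parent must dominate y
```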

#### 3.2.3 *Passing Head Words and Linking Empty Categories*.

Based on an enriched tree, our algorithm propagates the head word of the head daughter upwards to its parent, links co-indexed units, and finally creates a GR graph. The partial result after head word passing for the running example is shown in Figure 4. There are two differences between our GR extraction and a “normal” dependency tree extraction with respect to head word passing. First, the GR extraction procedure may pass multiple head words to a parent, especially in a coordination construction. Second, long-distance dependencies are created by linking empty categories and their co-indexed phrases.
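The head-word passing step can be sketched as a bottom-up traversal. The toy tree encoding below (dict nodes carrying a `head_child` index or a `coord` flag) is our own illustrative assumption, not the paper's data format; the point it shows is that a coordination node passes up the heads of all conjunct daughters, which is exactly how a single dependent can later end up with multiple heads.

```python
def propagate_heads(node):
    """Bottom-up head-word passing (illustrative sketch). A terminal node
    has a 'word'; an internal node has 'children' plus either a 'coord'
    flag (coordination) or a 'head_child' index (normal headed phrase)."""
    if 'word' in node:                       # terminal: its own head
        node['heads'] = [node['word']]
        return node['heads']
    for child in node['children']:
        propagate_heads(child)
    if node.get('coord'):                    # coordination: all conjunct heads
        node['heads'] = [h for c in node['children']
                         if c.get('conjunct') for h in c['heads']]
    else:                                    # normal: head daughter only
        node['heads'] = node['children'][node['head_child']]['heads']
    return node['heads']
```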

The coding of long-distance dependency relations is particularly important for processing grammatical relations in Chinese. Compared to English, Chinese allows more flexibility in word order and a wider range of “gap-and-filler” relations. Among the common long-distance dependencies such as co-referential indexing, cleft-focus, prepositional phrase, and coordination, relativization in Chinese is of particular interest. Chinese relativization is structurally head-final, with the relative clause, marked by *de* (the grammatical marker of nominal modifiers), occurring before the head noun. The head noun may hold any kind of semantic relation with the preceding relative clause. In other words, Chinese relative structures have the “gap” occurring before the “filler,” and there is little restriction on the semantic roles of the relativized head noun. Chinese-specific structural peculiarities may give rise to unexpected difficulties in sentence processing. Applying an augmented version of dependency notations, our system is able to handle such complicated issues in processing Chinese sentences.

### 3.3 Manual Evaluation

To have a precise understanding of whether our extraction algorithm works well, we have selected 20 files that contain 209 sentences in total for manual evaluation. Linguistic experts carefully examine the corresponding GR graphs derived by our extraction algorithm and correct all errors. In other words, a *gold-standard* GR annotation set is created. The measure for comparing two dependency graphs is precision/recall of GR tokens, which are defined as 〈*w*_{h}, *w*_{d}, *l*〉 tuples, where *w*_{h} is the head, *w*_{d} is the dependent, and *l* is the relation. Labeled precision/recall (LP/LR) is the ratio of tuples correctly identified by the automatic generator, while unlabeled precision/recall (UP/UR) is the ratio regardless of *l*. F-score is a harmonic mean of precision and recall. These measures correspond to attachment scores (LAS/UAS) in dependency tree parsing. To evaluate our GR parsing models that will be introduced later, we also report these metrics.
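The metrics above are straightforward to compute from sets of GR tokens. A minimal sketch (the function name is ours): labeled scores use (head, dependent, label) triples, and dropping the label gives the unlabeled variants.

```python
def pr_f(gold, pred):
    """Precision, recall, and F-score over GR tokens, represented as sets
    of (head, dependent, label) tuples. Dropping the label from both sets
    before calling yields the unlabeled scores (UP/UR)."""
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```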

The overall performance is summarized in Table 1. We can see that the automatic GR extraction achieves relatively high performance. There are two sources of errors in treebank conversion: (1) inadequate conversion rules and (2) wrong or inconsistent original annotations. During the creation of the gold-standard corpus, we found that rule-based errors are mainly caused by complicated unbounded dependencies and the lack of internal structure for some phrases. Such problems are very hard to solve through rules alone, if they can be solved at all, since the original annotations do not provide sufficient information. The latter type of error is more scattered and unpredictable and requires manual correction.

### 3.4 Comparing to the De Facto LFG Analysis

Our extraction algorithm integrates the key idea underlying LFG and the practice of corpus annotation of CTB. Therefore, the structures obtained are highly comparable to Dalrymple (2001)’s de facto LFG analysis. Take the running sentence in Figure 1, for example. LFG separates syntactic analysis into two parallel structures: c-structure and f-structure. The c-structure provides basic analysis of the constituent structure. A *standard* c-structure for the running example is shown in Figure 5 and this analysis can be compared to the result—as shown in Figure 3—of the second step of our extraction algorithm. A number of functional schemata are introduced to augment the phrase structure analysis. By instantiating these schemata, we are able to generate the corresponding functional description. Finally, we solve the simultaneous equations of these functional descriptions and then construct the minimal f-structure that satisfies them. Figure 6 shows the f-structure of the running sentence. According to the linguistic assumptions under LFG, this f-structure is more linguistically universal and has a tighter relationship with semantics.

Comparing the two trees in Figures 3 and 5, we can see that most functional schemata are similar. There are two minor differences between the LFG and our GR analysis.

- 1.
LFG utilizes functional schemata that are assigned to phrase structures rather than empty categories to analyze a number of complex linguistic phenomena. Therefore, in the standard LFG c-structure tree, there are no “-NONE- *T*”-style terminals. Instead, specific constructions or function words take the responsibility. Our corpus is based on the original annotations of CTB, which is based on the GB theory. As a result, our grammatical function-augmented trees include unpronounced nodes, and there are not many functional schemata associated with constructions or function words.

- 2.
The labels that indicate GRs, e.g., subject, object, and adjunct, are slightly different. This is also due to the annotation practice of CTB. The labels in Figure 3 are designed to maximally reflect the implicit functional information of the CTB annotations. Nevertheless, the two labeling systems are highly comparable. For example, the “TMP” label in Figure 3 corresponds to the “ADJ” label in Figure 5 because in most cases a temporal expression is taken as an adjunct.

Some annotations may differ due to theoretical considerations. Take the relative clause for example. The analysis in Figures 5 and 6 is based on the solution provided by Dalrymple (2001) and Bresnan (2001). The predicate–argument relation between “/involve” and “/document” is not included because LFG treats this as an anaphoric problem. However, we argue that this relation is triggered by the function word *de* as a marker of modification, and it is therefore put into our GR graph. Another motivation for this design is to let our GR graph represent more semantic information. We will continue to discuss this topic in the next subsection.

### 3.5 Comparing to Other Dependency Representations

In this section, we consider the differences between our GR representation and other popular dependency-based syntactic and semantic analyses. Figure 7 visualizes four types of cross-representation annotations assigned to the sentence in Figure 1:

- 1.
our GR graph,

- 2.
the CTB-dependency tree,

- 3.
the Universal Dependency, and

- 4.
PropBank-style Semantic Role Labeling.

The Universal Dependency^{6} corpus for Chinese is also based on CTB but with different constituency-to-dependency transformation rules; both the language- and treebank-specific properties and the comparability to resources of other languages are considered. The last representation is motivated by semantic processing, and the dependencies reflect basic predicate–argument structures of verbs. The annotations are from Chinese PropBank (Xue and Palmer 2009), a sister resource to CTB. The head words are verbs or their nominalizations, while the dependent words take semantic roles pertaining to the related verbs.

It can be clearly seen that, compared to tree-shaped dependency analysis, our GR representation encodes much more syntactic information through more general graphs. In addition, the syntactic information of our analysis is more transparent with respect to (compositional) semantics. This property is highlighted by the structural parallelism between our GR graphs and the semantic role labeling results.

- •
Take, for example, the serial verb construction, a special type of coordination (“issued-practiced”). According to the properties of distributive features, the shared subject, object, and aspect marker hold the same grammatical relations with respect to both *conjuncts*. Limited by the single-head constraint, a dependency tree cannot represent all of these dependencies in an explicit way.

- •
Take the relative clause construction for another example (“the regulatory document that involves economic field”). There is a long-distance predicate–argument dependency between “/involve” and “/document.” Adding this dependency to the CTB-dependency analysis would introduce a cycle. The Universal Dependency analysis includes this dependency, because its annotations are driven not only by syntax but also by semantics. Nevertheless, the related label lacks the information that a subject is displaced.

### 3.6 A Quantitative Analysis

In the previous two subsections, we presented a phenomenon-by-phenomenon comparison to show the similarity and dissimilarity between our GR analysis and other syntactic/semantic analyses, for example, Universal Dependency, Semantic Role Labeling, and LFG’s f-structure. In this subsection, we present a quantitative analysis based on overall statistics of the derived dependency corpus and a quantitative comparison with the CTB-dependency corpus. Table 2 shows how many dependencies are shared by the two representations. The majority of grammatical relations involve local dependencies, and therefore the intersection of the two representations is quite large. Nevertheless, a considerable number of dependencies in the GR representation do not appear in the tree representation. In principle, the GR representation removes semantically irrelevant dependencies, and thus it contains fewer arcs. Figure 8 summarizes the distribution of governable and non-governable GRs with respect to the tree and graph corpora.

## 4. Graph-Based Parsing via Graph Merging

### 4.1 The Proposal

The key proposal of this work is to construct a complex structure by constructing simple partial structures. Each partial structure is *simple* in the sense that it allows efficient construction, so mature parsing techniques can be employed to build it. To obtain the final output, we require that the partial structures, taken together, carry enough information to produce the whole target structure. This article aims to exemplify the above idea by designing a new parser for obtaining GR graphs. Take the GR graph in Figure 1 for example. It can be decomposed into two tree-like subgraphs, as shown in Figure 9. If we can parse the sentence into such subgraphs and combine them in a principled way, we are able to recover the original GR graph.

Under this premise, we need to develop a principled method to decompose a complex structure into simple structures, which allows us to generate data to train simple solvers. We also need to develop a principled method to integrate partial structures, which allows us to produce coherent structures as outputs. The techniques we developed to solve the two problems are demonstrated in the following sections.

### 4.2 Decomposing GR Graphs

#### 4.2.1 *Graph Decomposition as Optimization*.

Given a sentence *s* = *w*_{0}*w*_{1}*w*_{2} ⋯ *w*_{n} of length *n* (where *w*_{0} denotes the virtual root), a vector **y** of length *n*(*n* + 1) is used to denote a graph on the sentence. Indices *i* and *j* are then assigned to index the elements in the vector: **y**(*i*, *j*) ∈ {0, 1} denotes whether there is an arc from *w*_{i} to *w*_{j} (0 ≤ *i* ≤ *n*, 1 ≤ *j* ≤ *n*).

For a given graph **y**, there may be *m* subgraphs *y*_{1}, …, *y*_{m}, each of which belongs to a specific class of graphs 𝒢_{k} (*k* = 1, 2, ⋯, *m*). Each class should allow efficient construction. For example, we may require a subgraph to be a tree or a non-crossing dependency graph. The combination of all *y*_{k} gives enough information to construct **y**. Furthermore, the graph decomposition procedure is utilized to generate training data for building submodels. Therefore, we hope each subgraph *y*_{k} is informative enough to train a good scoring model. To do so, for each *y*_{k}, we define a score function *s*_{k} that indicates the “goodness” of *y*_{k}. Integrating all ideas, we can formalize graph decomposition as an optimization problem:

maximize ∑_{k} *s*_{k}(*y*_{k})
subject to *y*_{k} ∈ 𝒢_{k} (*k* = 1, ⋯, *m*) and every edge of **y** appearing in at least one subgraph.

For a specific graph decomposition task, we should define good score functions *s*_{k} and graph classes 𝒢_{k}, according to key properties of the target structure **y**.

#### 4.2.2 *Decomposing GR Graphs into Tree-Like Subgraphs*.

One key property of GR graphs is their reachability: Every node is either reachable from a unique root or is, by itself, an independent connected component. This property allows a GR graph to be decomposable into a limited number of *tree-like* subgraphs. By tree-like, we mean that if we treat a graph on a sentence as undirected, it is either a tree, or a subgraph of some tree on the sentence. The advantage of tree-like subgraphs is that they can be effectively built by adapting data-driven tree parsing techniques. Take the sentence in Figure 1, for example. For every word, there is at least one path linking the virtual root and the target word. Furthermore, we can decompose the graph into two tree-like subgraphs, as shown in Figure 9. In such decomposition, one subgraph is exactly a tree, and the other is very close to a tree.
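Tree-likeness in this sense is mechanical to check: treating arcs as undirected, a graph is tree-like exactly when it contains no cycle (i.e., it is a forest, hence a subgraph of some tree on the sentence). A minimal sketch using union-find (the function name and interface are ours):

```python
def is_tree_like(n, edges):
    """Check that a graph on nodes 0..n, with arcs taken as undirected,
    contains no cycle, i.e., it is a forest (a subgraph of some tree)."""
    parent = list(range(n + 1))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:        # i and j already connected: (i, j) closes a cycle
            return False
        parent[ri] = rj     # union the two components
    return True
```

Both exact trees and proper forests pass this check, matching the definition of tree-like used here.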

In other words, we decompose each given graph **y** into three tree-like subgraphs *g*_{1}, *g*_{2}, and *g*_{3},^{7} requiring each subgraph to carry important information of the graph and the three subgraphs together to cover all edges in **y**. The optimization problem can thus be written as the problem above with *m* = 3, where each 𝒢_{k} is the class of tree-like graphs on the sentence.

*Scoring a Subgraph*.

We score a subgraph in a first-order *arc-factored* way, which first scores the edges separately and then adds up the scores. Formally, the score function is *s*_{k}(**g**) = ∑ ω_{k}(*i*, *j*) *g*_{k}(*i*, *j*) (*k* = 1, 2, 3), where ω_{k}(*i*, *j*) is the score of the edge from *i* to *j*. Under this score function, we can use the Maximum Spanning Tree (MST) algorithm (Chu and Liu 1965; Edmonds 1967; Eisner 1996) to decode the tree-like subgraph with the highest score. Specifically, we assign weights ω_{k}(*i*, *j*) (1 ≤ *i*, *j* ≤ *n*) to the potential edges between all pairs of words, and then compute a best projective tree *g*_{k} = arg max_{gk} *s*_{k}(*g*_{k}) using Eisner’s Algorithm (Eisner 1996).
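As a concrete illustration, here is a minimal pure-Python sketch of first-order projective decoding with Eisner’s algorithm. This is our own illustrative implementation (not the article’s code): it takes an arc-weight matrix and returns the head of each word in the best projective tree.

```python
def eisner(score):
    """First-order projective decoding (Eisner 1996).
    score[h][m] is the weight of the arc h -> m; index 0 is the virtual root.
    Returns a list `head` with head[m] for m = 1..n (head[0] = -1)."""
    n = len(score) - 1
    NEG = float("-inf")
    # C[s][t][d][c]: best score of span s..t; d=1 headed at s, d=0 headed at t;
    # c=1 complete span, c=0 incomplete span. bp stores the split points.
    C = [[[[NEG, NEG], [NEG, NEG]] for _ in range(n + 1)] for _ in range(n + 1)]
    bp = [[[[0, 0], [0, 0]] for _ in range(n + 1)] for _ in range(n + 1)]
    for s in range(n + 1):
        for d in (0, 1):
            for c in (0, 1):
                C[s][s][d][c] = 0.0
    for k in range(1, n + 1):
        for s in range(n + 1 - k):
            t = s + k
            # incomplete spans: build the arc t -> s (d=0) or s -> t (d=1)
            best, r = max((C[s][r][1][1] + C[r + 1][t][0][1], r)
                          for r in range(s, t))
            C[s][t][0][0] = best + score[t][s]; bp[s][t][0][0] = r
            C[s][t][1][0] = best + score[s][t]; bp[s][t][1][0] = r
            # complete spans: combine an incomplete span with a complete one
            best, r = max((C[s][r][0][1] + C[r][t][0][0], r)
                          for r in range(s, t))
            C[s][t][0][1] = best; bp[s][t][0][1] = r
            best, r = max((C[s][r][1][0] + C[r][t][1][1], r)
                          for r in range(s + 1, t + 1))
            C[s][t][1][1] = best; bp[s][t][1][1] = r
    head = [-1] * (n + 1)

    def backtrack(s, t, d, c):
        if s == t:
            return
        r = bp[s][t][d][c]
        if c == 0:                              # an arc is built here
            head[s if d == 0 else t] = t if d == 0 else s
            backtrack(s, r, 1, 1); backtrack(r + 1, t, 0, 1)
        elif d == 0:
            backtrack(s, r, 0, 1); backtrack(r, t, 0, 0)
        else:
            backtrack(s, r, 1, 0); backtrack(r, t, 1, 1)

    backtrack(0, n, 1, 1)
    return head
```

For instance, with arc weights favoring 0 → 2 and 2 → 1, the decoder returns exactly that projective tree.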

The extracted tree *g*_{k} is not exactly a subgraph of **y**, because there may be edges in the tree that are not in the graph. To guarantee a meaningful subgraph of the original graph, we add labels to the edges in the trees to encode the necessary information. We label *g*_{k}(*i*, *j*) with the original label if **y**(*i*, *j*) = 1; with the original label appended by “∼R” if **y**(*j*, *i*) = 1; and with “None” otherwise. With this labeling, we can define a function *t*2*g* that transforms the extracted trees into tree-like graphs. *t*2*g*(*g*_{k}) is not necessarily the same as the original graph **y**, but it must be a subgraph of it.
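The labeling and the *t*2*g* transform can be sketched as follows (a minimal sketch with our own naming; “~R” stands for the article’s “∼R” marker):

```python
def t2g(tree_edges, labels):
    """Transform an extracted tree into a tree-like subgraph of the gold graph.
    tree_edges: arcs (i, j) of the extracted tree (head i -> dependent j).
    labels: dict mapping gold arcs (i, j) to their GR labels.
    Arcs labeled 'None' are dropped; '~R'-labeled arcs are reversed back."""
    relabeled = {}
    for (i, j) in tree_edges:
        if (i, j) in labels:
            relabeled[(i, j)] = labels[(i, j)]          # same direction as gold
        elif (j, i) in labels:
            relabeled[(i, j)] = labels[(j, i)] + "~R"   # reversed in the tree
        else:
            relabeled[(i, j)] = "None"                  # not a gold arc at all
    # recover the subgraph: drop 'None' arcs, restore reversed directions
    graph = {}
    for (i, j), lab in relabeled.items():
        if lab == "None":
            continue
        if lab.endswith("~R"):
            graph[(j, i)] = lab[:-2]
        else:
            graph[(i, j)] = lab
    return relabeled, graph
```

The recovered `graph` is always a (labeled) subgraph of the gold graph, as the text requires.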

*Three Variations of Scoring*.

With different weight assignments, different trees can be extracted from a graph, obtaining different subgraphs. We devise three variations of weight assignment: ω_{1}, ω_{2}, and ω_{3}. Each of the ω’s consists of two parts. One is shared by all, denoted by *S*, and the other is different from each other, denoted by *V*. Formally, ω_{k}(*i*, *j*) = *S*(*i*, *j*) + *V*_{k}(*i*, *j*) (*k* = 1, 2, 3 and 1 ≤ *i*, *j* ≤ *n*).

For a given graph **y**, *S* is defined as *S*(*i*, *j*) = *S*_{1}(*i*, *j*) + *S*_{2}(*i*, *j*) + *S*_{3}(*i*, *j*) + *S*_{4}(*i*, *j*).

In the definitions above, *w*_{1}, *w*_{2}, *w*_{3}, and *w*_{4} are coefficients satisfying *w*_{1} ≫ *w*_{2} ≫ *w*_{3}, and *l*_{p} is a function of *i* and *j*: *l*_{p}(*i*, *j*) is the length of the shortest path from *i* to *j* such that either *i* is a child of an ancestor of *j* or *j* is a child of an ancestor of *i*. That is to say, the paths are of the form *i* ← *n*_{1} ← ⋯ ← *n*_{k} → *j* or *i* ← *n*_{1} → ⋯ → *n*_{k} → *j*. If no such path exists, then *l*_{p}(*i*, *j*) = *n*. The reasoning behind this design is illustrated below.

- •
*S*_{1} indicates whether there is an edge between *i* and *j*; it is given the dominant weight so that covering the original edges matters most;

- •
*S*_{2} indicates whether the edge is from *i* to *j*, making the edge with the correct direction more likely to be selected;

- •
*S*_{3} reflects the distance between *i* and *j*; we prefer edges with short distances because they are easier to predict;

- •
*S*_{4} reflects the length of certain types of paths between *i* and *j* that encode c-commanding relationships, and its coefficient remains to be tuned.

The score *V* is meant to capture additional information from the GR graph. In GR graphs, we have an extra piece of information (denoted by “*ldd” in Figure 1) for long-distance dependency edges. Moreover, we notice that conjunction is another important structure that can be derived from the GR graph. Assume that we tag the edges relating to conjunctions with “*cjt.” The three variant scores, that is, *V*_{1}, *V*_{2}, and *V*_{3}, reflect the long-distance and conjunction information in different ways.

*V*_{1}. First, for edges **y**(*i*, *j*) whose label is tagged with *ldd, we assign *V*_{1}(*i*, *j*) = *d*.^{8} Whenever we come across a parent *p* with a set of conjunction children *cjt*_{1}, *cjt*_{2}, ⋯, *cjt*_{n}, we look for the rightmost child *gc*_{1r} of the leftmost conjunct *cjt*_{1}, and add *d* to each of *V*_{1}(*p*, *cjt*_{1}) and *V*_{1}(*cjt*_{1}, *gc*_{1r}). The edges in conjunction to which additional *d*’s are added are shown in blue in Figure 10.

*V*_{2}. Different from *V*_{1}, for edges **y**(*i*, *j*) whose label is tagged with *ldd, we assign *V*_{2}(*j*, *i*) = *d*. Then, for each conjunction structure with a parent *p* and conjunction children *cjt*_{1}, *cjt*_{2}, ⋯, *cjt*_{n}, we find the leftmost child *gc*_{nl} of the rightmost conjunct *cjt*_{n}, and add *d* to each of *V*_{2}(*p*, *cjt*_{n}) and *V*_{2}(*cjt*_{n}, *gc*_{nl}). The concerned edges in conjunction are shown in green in Figure 10.

*V*_{3}.

We do not assign *d*’s to the edges with tag *ldd. For each conjunction with parent *p* and conjunction children *cjt*_{1}, *cjt*_{2}, ⋯, *cjt*_{n}, we add *d* to *V*_{3}(*p*, *cjt*_{1}), *V*_{3}(*p*, *cjt*_{2}), ⋯, and *V*_{3}(*p*, *cjt*_{n}).
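The simplest of the three variants, *V*_{3}, can be sketched directly; the function name and the `(parent, conjuncts)` interface are ours, chosen only for illustration:

```python
def v3_bonus(conjunctions, d=1.0):
    """V3 weighting sketch: add a bonus d to every parent -> conjunct edge.
    conjunctions: list of (parent, [cjt_1, ..., cjt_n]) structures derived
    from the '*cjt'-tagged edges of a GR graph."""
    V3 = {}
    for parent, conjuncts in conjunctions:
        for cjt in conjuncts:
            V3[(parent, cjt)] = V3.get((parent, cjt), 0.0) + d
    return V3
```

*V*_{1} and *V*_{2} differ only in also rewarding *ldd edges (in the original or reversed direction) and in which conjunct's grandchild receives the bonus.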

#### 4.2.3 *Lagrangian Relaxation with Approximation*.

Corresponding to the three extracted trees *g*_{1}, *g*_{2}, and *g*_{3}, there are three subgraphs *t*2*g*(*g*_{1}), *t*2*g*(*g*_{2}), and *t*2*g*(*g*_{3}). As stated above, each edge in a graph **y** needs to be covered by at least one subgraph, and the goal is to maximize the sum of the edge weights of all trees, with *s*_{k}(*g*_{k}) = ∑ ω_{k}(*i*, *j*) *g*_{k}(*i*, *j*). Note that the inequality in the constrained optimization problem above can be replaced by a maximization: let *g*_{m} = max{*t*2*g*(*g*_{1}), *t*2*g*(*g*_{2}), *t*2*g*(*g*_{3})}, where by max{*g*_{1}, *g*_{2}, *g*_{3}} we mean taking the maximum of the three vectors pointwise; the coverage constraint then becomes *g*_{m} = **y**. The Lagrangian of the problem is formed accordingly, where *u* is the Lagrangian multiplier.

According to the duality principle, min_{u} max_{g1,g2,g3} ℒ(*g*_{1}, *g*_{2}, *g*_{3}; *u*) = min_{u} ℒ(*u*), so we can find the optimal solution for the problem if we can find min_{u} ℒ(*u*). However, it is very hard to compute ℒ(*u*), not to mention min_{u} ℒ(*u*). The challenge is that the *g*_{m} appearing in the three maximizations must be consistent.

In practice, *t*2*g*(*g*_{1}), *t*2*g*(*g*_{2}), and *t*2*g*(*g*_{3}) are very close to *g*_{m}, so we can approximate ℒ(*u*) by replacing *g*_{m} with the individual subgraphs in the corresponding maximizations, which decouples the three subproblems. We denote this approximation by ℒ′(*u*) and minimize it over *u* using the subgradient method.

#### 4.2.4 *The Algorithm*.

Figure 11 gives the tree decomposition algorithm, in which a subgradient method is used to find min_{u} ℒ′(*u*) iteratively, and *K* is the maximum number of iterations. In each iteration, we first compute *g*_{1}, *g*_{2}, and *g*_{3} to obtain ℒ′(*u*), and then update *u* until the graph is covered by the subgraphs. The coefficient 1/3’s can be merged into the step sizes *α*^{(k)}, so we omit them. The three separate problems *g*_{k} ← arg max_{gk} *s*_{k}(*g*_{k}) + *u*^{⊤}*g*_{k} (*k* = 1, 2, 3) can be solved using Eisner’s Algorithm (Eisner 1996), in the same way as solving arg max_{gk} *s*_{k}(*g*_{k}). Intuitively, the Lagrangian multiplier *u* in our algorithm can be regarded as additional weights for the score function. The update of *u* increases the weights of the edges that are not covered by any tree-like subgraph, so they are more likely to be selected in the next iteration.
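The iterative scheme can be sketched as follows. This is our own toy rendering, not the article’s code: `decode_tree` stands in for the Eisner-based tree decoder, `t2g` for the tree-to-subgraph transform, and the edge sets are represented as Python sets of `(i, j)` pairs.

```python
def decompose(y_edges, omegas, decode_tree, t2g, K=20, alpha=0.5):
    """Subgradient graph decomposition sketch (cf. the algorithm of Figure 11).
    y_edges: set of arcs (i, j) of the gold graph y.
    omegas: three dicts; omegas[k][(i, j)] is the base weight for tree k.
    decode_tree: function(weights) -> set of tree arcs (stand-in for Eisner).
    t2g: function(tree_arcs) -> arcs of the induced subgraph of y.
    alpha: a fixed step size for the subgradient update."""
    u = {e: 0.0 for e in y_edges}            # Lagrangian multipliers
    for _ in range(K):
        subgraphs = []
        for omega in omegas:
            w = dict(omega)
            for e, ue in u.items():          # u acts as extra edge weights
                w[e] = w.get(e, 0.0) + ue
            subgraphs.append(t2g(decode_tree(w)))
        covered = set().union(*subgraphs)
        uncovered = set(y_edges) - covered
        if not uncovered:                    # every edge of y is covered: done
            return subgraphs
        for e in uncovered:                  # raise weights of uncovered edges
            u[e] += alpha
    return subgraphs
```

With a toy one-arc decoder, an edge missed in the first iteration gets its weight raised and is picked up in the next one, which is exactly the intuition stated above.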

### 4.3 Graph Merging

As explained above, the extraction algorithm gives three classes of trees for each graph. The algorithm is applied to the graph training set to deliver three training tree sets. After that, three parsing models can be trained with the three tree sets. The parsers used in this study to train models and parse trees include Mate (Bohnet 2010), a second-order graph-based dependency parser, and our implementation of the first-order factorization model proposed in Kiperwasser and Goldberg (2016).

If the scores used by the three models are *f*_{1}, *f*_{2}, and *f*_{3}, respectively, then the parsers can find the trees with the highest scores for a sentence by solving the following optimization problems: arg max_{g1} *f*_{1}(*g*_{1}), arg max_{g2} *f*_{2}(*g*_{2}), and arg max_{g3} *f*_{3}(*g*_{3}). We can parse a given sentence with the three models, obtain three trees, and then transform them into subgraphs. We combine these subgraphs to obtain the graph parse of the sentence by putting all the edges in the three subgraphs together; that is, graph **y** = max{*t*2*g*(*g*_{1}), *t*2*g*(*g*_{2}), *t*2*g*(*g*_{3})}. This process is called **simple merging**.

#### 4.3.1 *Capturing the Hidden Consistency*.

However, the simple merging process lacks the consistency of extracting the three trees from the same graph, thus losing some important information. More specifically, when we decompose a graph into three subgraphs, some edges tend to appear in certain classes of subgraphs at the same time, and this information is lost in the simple merging process. It is more desirable to retain the co-occurrence relationship of the edges when doing parsing and merging. To retain the hidden consistency, instead of decoding the three models separately, we must do *joint* decoding.

In order to capture the hidden consistency, we add consistency tags to the labels of the extracted trees to represent the co-occurrence. The basic idea is to use additional tags to encode the relationship of the edges in the three trees. The tag set is 𝒯 = {0, 1, 2, 3, 4, 5, 6}. Given a tag *t* ∈ 𝒯, *t*&1, *t*&2, and *t*&4 denote whether the edge is contained in *g*_{1}, *g*_{2}, and *g*_{3}, respectively, where “&” is the bitwise-and operator. Since we do not need to consider the first bit of the tags of edges in *g*_{1}, the second bit in *g*_{2}, or the third bit in *g*_{3}, we always assign 0 to these bits. For example, if **y**(*i*, *j*) = 1, *g*_{1}(*i*, *j*) = 1, *g*_{2}(*j*, *i*) = 1, *g*_{3}(*i*, *j*) = 0, and *g*_{3}(*j*, *i*) = 0, we tag *g*_{1}(*i*, *j*) as 2 and *g*_{2}(*j*, *i*) as 1.
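The bit encoding can be made concrete with a short sketch (our own helper, with subgraphs as sets of directed arcs; an edge counts as shared when either direction occurs, as in the worked example above):

```python
IN_G1, IN_G2, IN_G3 = 1, 2, 4   # one bit per subgraph class

def consistency_tag(k, arc, g1, g2, g3):
    """Tag an arc of subgraph g_k with bits recording which subgraphs contain
    the same (undirected) edge; the bit for g_k itself is always kept 0."""
    i, j = arc

    def has_edge(g):
        return (i, j) in g or (j, i) in g

    tag = (IN_G1 if has_edge(g1) else 0) \
        | (IN_G2 if has_edge(g2) else 0) \
        | (IN_G3 if has_edge(g3) else 0)
    return tag & ~(1 << (k - 1))   # zero out the k-th bit
```

On the example from the text (the edge appears in *g*_{1} as (*i*, *j*) and in *g*_{2} reversed), this yields tag 2 for the *g*_{1} arc and tag 1 for the *g*_{2} arc.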

When it comes to parsing, it is important to obtain labels with consistency information. Our goal is to guarantee that the tags on the edges of the parse trees for the same sentence are consistent throughout graph merging. Because of the consistency tags, we index the graph and tree vector representations with three indices for convenience: **g**(*i*, *j*, *t*) denotes whether there is an edge from word *w*_{i} to word *w*_{j} with tag *t* in graph **g**. The corresponding subgraphs are *g*_{k}′ = *t*2*g*(*g*_{k}) (*k* = 1, 2, 3).

The inequality constraints in the problem are the consistency constraints. Each of them gives the constraint between two classes of trees. For example, the first inequality says that an edge in *g*_{1} with tag *t*&2 ≠ 0 exists only when the same edge in *g*_{2} exists. If all of these constraints are satisfied, the subgraphs achieve the desired consistency.

#### 4.3.2 *Lagrangian Relaxation with Approximation*.

To solve the constrained optimization problem mentioned above, we perform some transformations and then apply the Lagrangian Relaxation to it with approximation.

Let *a*_{12}(*i*, *j*) = *g*_{1}(*i*, *j*, 2) + *g*_{1}(*i*, *j*, 6); then the first constraint can be rewritten as an equality constraint. If we regard *a*_{12}, *a*_{13}, ⋯, *a*_{32} as constants, then all the constraints are linear and can be written in matrix form, where *A*_{1}, *A*_{2}, and *A*_{3} are matrices constructed from *a*_{12}, *a*_{13}, ⋯, *a*_{32}. We then form the Lagrangian of the problem, with *u* as the Lagrangian multiplier, and consider its dual ℒ(*u*).

Again, we use the subgradient method to minimize ℒ(*u*). During the derivation, *a*_{12}, *a*_{13}, ⋯, *a*_{32} are taken as constants, but unfortunately they are not. We propose an approximation for the **a**’s in each iteration: using the **a**’s obtained in the previous iteration instead. This is a reasonable approximation, given that the *u*’s in two consecutive iterations are similar, and so are the **a**’s.

#### 4.3.3 *The Algorithm*.

The score functions *f*_{1}, *f*_{2}, and *f*_{3} each consist of scores that are first-order and higher, and the first-order part can be written as ∑ ω_{k}(*i*, *j*) **g**(*i*, *j*) (*k* = 1, 2, 3). With this property, each individual problem *g*_{k} ← arg max_{gk} *f*_{k}(*g*_{k}) + *u*^{⊤}*A*_{k}*g*_{k} can be decoded easily, with modifications to the first-order weights of the edges in the three models. Specifically, let **w**_{k} = *u*^{⊤}*A*_{k}; then we can modify the ω_{k} in *s*_{k} to ω_{k}′, such that ω_{k}′(*i*, *j*, *t*) = ω_{k}(*i*, *j*, *t*) + **w**_{k}(*i*, *j*, *t*) + **w**_{k}(*j*, *i*, *t*).

The update of **w**_{1}, **w**_{2}, and **w**_{3} can be understood intuitively. Consider, without loss of generality, the situation in which one of the constraints, say the first one for edge **y**(*i*, *j*), is not satisfied: *g*_{1}(*i*, *j*) is tagged to indicate that *g*_{2}(*i*, *j*) = 1, but this is not the case. We therefore increase the weight of that edge with all kinds of tags in *g*_{2}, and decrease the weight of the edge with the tag representing *g*_{2}(*i*, *j*) = 1 in *g*_{1}. After this update of the weights, consistency is more likely to be achieved.

#### 4.3.4 *Labeled Parsing*.

For the sake of formal concision, we illustrate our algorithms omitting the labels. It is straightforward to extend the algorithms to labeled parsing. In the joint decoding algorithm, we just need to extend the weights **w**_{1}, **w**_{2}, **w**_{3} for every label that appears in the three tree sets, and the algorithm can be deduced similarly.

### 4.4 Global Linear Model Based Scorer

Given a sentence *s*, its parse **g***(*s*) is computed by searching for the highest-scored dependency graph in the set of compatible trees 𝒯(*s*). Scores, namely Score(*x*, **g**), are assigned using a linear model: Score(*x*, **g**) = θ · Φ(*s*, **g**), where Φ(*s*, **g**) is a feature-vector representation of the event that tree **g** is the analysis of sentence *s*, and θ is a parameter vector containing the associated weights. In general, performing a direct maximization over the set 𝒯(*s*) is infeasible, and a common solution used in many parsing approaches is to introduce a part-wise factorization: **g** is factored into a set of parts *p* through a function Part, each of which represents a small substructure of **g**. For example, **g** might be factored into the set of its component bilexical dependencies. Each part can then be evaluated using a part-wise feature-vector mapping ϕ(*x*, *p*). The factorization establishes implicit independence relationships among parts, while keeping the search for the best result efficient.

### 4.5 Bi-LSTM Based Scorer

We also implement the score function Score(*x*, **g**) in (5) using neural network models. A simple yet effective design is selected from a rich set of choices. Following Kiperwasser and Goldberg’s (2016) experience, we employ a bidirectional-LSTM (Bi-LSTM) based neural model to perform data-driven parsing. A vector is associated with each word and each POS-tag to transform them into continuous and dense representations. The concatenation of the word embedding and POS-tag embedding of each word in a given sentence is used as the input of the Bi-LSTMs to extract context-related feature vectors *r*_{i}. The two feature vectors of each word pair are scored with a nonlinear transformation.

We can see here the *local* score function explicitly utilizes the word positions of the head and the dependent. It is similar to first-order factorization as defined in the linear model. We use the first-order Eisner Algorithm (Eisner 1996) to get coherent projective subtrees.

## 5. Transition-Based Parsing

### 5.1 Background Notions

To precisely illustrate our transition-based parser, we employ traditional graph-theoretic notations to define a transition system. A dependency graph *G* = (*V*, *A*) is a labeled directed graph in the standard graph-theoretic sense and consists of nodes, *V*, and arcs, *A*, such that for sentence *x* = *w*_{0}*w*_{1} … *w*_{n} and label set *R* the following holds:

- •
*V* = {0, 1, …, *n*},

- •
*A* ⊆ *V* × *V* × *R*.

The set *A* represents the labeled dependency relations of the particular analysis *G*. Specifically, an arc (*w*_{i}, *w*_{j}, *r*) ∈ *A* represents a dependency relation from head *w*_{i} to dependent *w*_{j} labeled with relation type *r*. A dependency graph *G* is thus a set of labeled dependency relations between the words of *x*.

Following Nivre (2008), we define a transition system for dependency parsing as a quadruple *S* = (𝒞, *T*, *c*_{s}, 𝒞_{t}), where

- 1.
𝒞 is a set of configurations, each of which contains a buffer β of (remaining) words and a set *A* of arcs,

- 2.
*T* is a set of transitions, each of which is a (partial) function *t*: 𝒞 ↦ 𝒞,

- 3.
*c*_{s} is an initialization function, mapping a sentence *x* to a configuration with β = [0, …, *n*], and

- 4.
𝒞_{t} ⊆ 𝒞 is a set of terminal configurations.

Given a sentence *x* = *w*_{1}, …, *w*_{n} and a graph *G* = (*V*, *A*) on it, if there is a sequence of transitions *t*_{1}, …, *t*_{m} and a sequence of configurations *c*_{0}, …, *c*_{m} such that *c*_{0} = *c*_{s}(*x*), *t*_{i}(*c*_{i−1}) = *c*_{i}(*i* = 1, …, *m*), *c*_{m} ∈ 𝒞_{t}, and *A*_{cm} = *A*, we say the sequence of transitions is an *oracle* sequence, and we define *Ā*_{ci} = *A* − *A*_{ci} for the arcs to be built in *c*_{i}. In a typical transition-based parsing process, the input words are put into a queue and partially built structures are organized by one or more memory module(s). A set of transition actions are performed sequentially to consume words from the queue and update the partial parsing results, organized by the stack.

### 5.2 A List-Based Transition System

Nivre (2008) proposed a list-based algorithm to produce non-projective dependency trees. This algorithm essentially implements a very simple idea that is conceptually introduced by Covington (2001): making use of a list to store partially processed tokens. It is straightforward to use this strategy to handle any kind of directed graphs. In this work, we use such a system *S*_{L}. In fact, it is simpler to produce a graph as compared to a tree. The main difference between Nivre’s (2008) tree-parsing system and *S*_{L} is that at each transition step, *S*_{L} does not need to check the multiheaded condition. This appears to be simpler and more efficient.

#### 5.2.1 *The System*.

In *S*_{L} = (𝒞, *T*, *c*_{s}, 𝒞_{t}), we take 𝒞 to be the set of all quadruples *c* = (λ, λ′, β, *A*), where λ and λ′ are lists of nodes. The initial configuration *c*_{s}(*x*) is ([ ], [ ], [1, …, *n*], { }), and a terminal configuration *c*_{t} is of the form (λ, λ′, [ ], *A*) (for any lists and any arc set). *T* contains the transitions shown in Figure 14, illustrated as follows:

- •
Left-Arc_{r} updates a configuration by adding (*j*, *r*, *i*) to *A*, where *i* is the top element of the list λ, *j* is the front of the buffer, and *r* is the dependency relation.

- •
Right-Arc_{r} updates a configuration similarly by adding (*i*, *r*, *j*) to *A*.

- •
Shift concatenates λ′ to λ, clears λ′, and then moves the front element of β to the end of the resulting left list.

- •
No-Arc removes the rightmost node from λ and adds it to the front of λ′.

#### 5.2.2 *Theoretical Analysis*.

*S*_{L} is sound and complete^{9} with respect to the class of directed graphs without self-loop.

The soundness of *S*_{L} is relatively trivial. The completeness of *S*_{L} is obvious from the construction of an **oracle** sequence as follows: in each configuration, starting from the initial one, we first construct all the arcs in *Ā*_{ci} that link the nodes in λ_{ci} to the front node of β_{ci}, by applying Left-Arc_{r}, Right-Arc_{r}, and No-Arc. If no other transition is allowed, Shift is applied.
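This oracle construction can be simulated compactly. The sketch below is our own simplified rendering (labels omitted, and the oracle always No-Arcs through the whole list λ, which is a valid if not minimal transition sequence); it shows that *S*_{L} can rebuild graphs with crossing arcs and multi-headed nodes.

```python
def parse_with_oracle(n, gold_arcs):
    """Simulate the list-based system S_L with the static oracle sketched
    above. Nodes are 0..n (0 is the root); gold_arcs are (head, dependent)
    pairs (labels omitted). Returns the arcs built, which equal gold_arcs."""
    lam, lam2, beta, arcs = [], [], list(range(n + 1)), set()
    while beta:
        j = beta[0]
        while lam:
            i = lam[-1]
            if (j, i) in gold_arcs:
                arcs.add((j, i))          # Left-Arc: j heads i
            if (i, j) in gold_arcs:
                arcs.add((i, j))          # Right-Arc: i heads j
            lam.pop()
            lam2.insert(0, i)             # No-Arc: move i to the front of lam2
        lam, lam2 = lam + lam2, []        # Shift: concatenate lam2 back ...
        lam.append(beta.pop(0))           # ... and move j out of the buffer
    return arcs
```

Note that crossing arcs such as (0, 2) and (1, 3) pose no problem, because blocking nodes are temporarily parked in λ′ exactly as described for Figure 15.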

#### 5.2.3 *Extension*.

It is easy to extend our system to generate arbitrary directed graphs by adding a new transition Self-Arc:

- •
Self-Arc adds an arc from the top element of λ to itself, but does not update any list nor the buffer.

Linguistic dependencies usually exclude self-loop, and therefore the basic list-based system is satisfactory in most cases. We use the basic list-based system, namely *S*_{L}, as the core engine of our parser.

#### 5.2.4 *An Example*.

Figure 15 shows the first transitions needed to parse the running example of Figure 1. It can be seen, from this example, that the key step to produce crossing arcs is to temporarily move nodes that block non-adjacent nodes to the secondary memory module, namely λ′. Another key property of the oracle is building arcs as soon as possible to avoid further complication.

### 5.3 Bi-LSTM Based Scorer

Two additional techniques, namely, the dynamic oracle (Goldberg and Nivre 2012) and beam search, can be used with this transition system.

## 6. Empirical Evaluation

### 6.1 Experimental Setup

CTB is a segmented, part-of-speech (POS) tagged, and fully bracketed corpus in the constituency formalism, and it is widely used to evaluate fundamental NLP tasks, including word segmentation (Sun and Xu 2011), POS tagging (Sun and Uszkoreit 2012), constituent parsing (Wang, Sagae, and Mitamura 2006; Zhang and Clark 2009), and dependency parsing (Zhang and Clark 2008; Huang and Sagae 2010; Li et al. 2011). This corpus was collected from different sources during different time periods and covers a diverse range of topics. We used CTB 6.0 and defined the training, development, and test sets according to the CoNLL 2009 shared task. Table 3 gives a summary of the data sets used for the experiments.

| | Training | Development | Test |
|---|---|---|---|
| #Sentence | 22,277 | 1,762 | 2,556 |
| #Word | 609,060 | 49,620 | 73,153 |
| #Dependency | 556,551 | 45,724 | 67,635 |


Evaluation on this benchmark data allows us to directly compare our parsers with other parsers in the literature in terms of numeric performance. The measure for comparing two dependency graphs is the precision/recall of bilexical dependencies, which are defined as 〈*w*_{h}, *w*_{d}, *l*〉 tuples, where *w*_{h} is the head, *w*_{d} is the dependent, and *l* is the relation. Labeled precision/recall (LP/LR) is the ratio of tuples correctly identified, while the unlabeled metrics (UP/UR) are the corresponding ratios ignoring *l*. F-score (UF/LF) is the harmonic mean of precision and recall. These measures correspond to attachment scores (UAS/LAS) in dependency tree parsing. To evaluate the ability to recover non-local dependencies, the recall (UR_{NL}/LR_{NL}) of such dependencies is reported. We also consider the correctness of the whole graph and report unlabeled and labeled complete match (UCM/LCM).
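The metrics reduce to set operations over dependency tuples; a minimal sketch (our own helper): pass `(h, d, label)` tuples for the labeled scores or `(h, d)` pairs for the unlabeled ones.

```python
def prf(gold, pred):
    """Precision, recall, and F-score over dependency tuples <w_h, w_d, l>.
    gold and pred are sets of tuples; dropping the label from the tuples
    yields the unlabeled variants (UP/UR/UF)."""
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Complete match (UCM/LCM) is then simply the fraction of sentences for which `gold == pred`.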

### 6.2 Main Results of the Graph-Based Parsing

#### 6.2.1 *Results of Graph Decomposition*.

Table 4 shows the results of graph decomposition on the training set. If we use simple decomposition, that is, directly extracting three trees from a graph, we obtain three subgraphs, each of which covers around 90% of the edges and 30% of the sentences on the training set. When merged together, they cover nearly 97% of the edges and more than 70% of the sentences. This indicates that the capacity of a single tree is limited and that three trees can cover most of the edges. To achieve the best coverage, we need to tune the weights defined in Equations (1–4), which can be done on the development data. When we apply Lagrangian Relaxation to the decomposition process, both the edge coverage and the sentence coverage see a large error reduction, indicating that Lagrangian Relaxation is very effective for the decomposition task.

| | Coverage | Edge | Sentence |
|---|---|---|---|
| SD | Subgraph1 | 85.52 | 28.73 |
| | Subgraph2 | 88.42 | 28.36 |
| | Subgraph3 | 90.40 | 34.37 |
| | Merged | 96.93 | 71.66 |
| LR | Subgraph1 | 85.66 | 29.01 |
| | Subgraph2 | 88.48 | 28.63 |
| | Subgraph3 | 90.67 | 34.72 |
| | Merged | 99.55 | 96.90 |

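The coverage statistics above can be computed as in the following illustrative sketch, which assumes (as a representation choice, not the authors' implementation) that each GR graph and each extracted subgraph is a set of arcs:

```python
def coverage(graphs, subgraph_sets):
    """Edge and sentence coverage of extracted subgraphs.

    graphs: list of gold arc sets, one per sentence.
    subgraph_sets: list of lists of arc sets (the extracted trees per sentence).
    """
    covered_edges = total_edges = covered_sents = 0
    for gold_arcs, subs in zip(graphs, subgraph_sets):
        merged = set().union(*subs)           # union of the extracted trees
        kept = gold_arcs & merged             # gold arcs retained by the union
        covered_edges += len(kept)
        total_edges += len(gold_arcs)
        covered_sents += kept == gold_arcs    # sentence fully covered?
    return covered_edges / total_edges, covered_sents / len(graphs)

# Toy data: sentence 1 is fully covered, sentence 2 loses one arc.
graphs = [{"a", "b", "c"}, {"d", "e"}]
subgraph_sets = [[{"a", "b"}, {"c"}], [{"d"}]]
edge_cov, sent_cov = coverage(graphs, subgraph_sets)
```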

#### 6.2.2 *Main Results of Graph Merging*.

Table 5 shows the results of graph merging on the development set when tree parsers based on global linear models are applied. The three training sets of trees come from the decomposition with Lagrangian Relaxation, and the models are trained on them. In both tables, simple merging (SM) refers to decoding the three trees for a sentence first, then combining them by putting all the edges together. As shown, the merged graph achieves a higher F-score than any single model. With Lagrangian Relaxation, the performance of both the merged graph and the three subgraphs improves, owing to the capturing of consistency information.

**Table 5.** Graph merging on the development set with tree parsers based on global linear models. SM: simple merging; LR: merging with Lagrangian Relaxation.

| | | UP | UR | UF | UCM | LP | LR | LF | LCM |
|---|---|---|---|---|---|---|---|---|---|
| SM | Subgraph1 | 88.63 | 76.19 | 81.94 | 18.09 | 85.94 | 73.88 | 79.46 | 16.11 |
| | Subgraph2 | 88.04 | 78.20 | 82.83 | 17.47 | 85.31 | 75.77 | 80.26 | 15.43 |
| | Subgraph3 | 88.91 | 81.12 | 84.84 | 20.36 | 86.57 | 78.99 | 82.61 | 17.30 |
| | Merged | 83.23 | 88.45 | 85.76 | 22.97 | 80.59 | 85.64 | 83.04 | 19.29 |
| LR | Subgraph1 | 89.76 | 77.48 | 83.17 | 18.60 | 87.17 | 75.25 | 80.77 | 16.39 |
| | Subgraph2 | 89.30 | 79.18 | 83.93 | 18.66 | 86.68 | 76.85 | 81.47 | 16.56 |
| | Subgraph3 | 89.42 | 81.55 | 85.31 | 20.53 | 87.09 | 79.43 | 83.08 | 17.81 |
| | Merged | 88.07 | 85.14 | 86.58 | 26.32 | 85.55 | 82.70 | 84.10 | 21.61 |


Under simple merging, although the recall of each subgraph is much lower than its precision, the merged graph exhibits the opposite pattern. This is because consistency between the three models is not enforced, so the models tend to give diverse subgraph predictions. When high consistency between the three models is required, precision and recall become comparable, and higher F-scores are achieved.
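A toy example (with made-up arcs, not data from the experiments) shows why the simple union behaves this way: the merged graph keeps every model's correct arcs, which lifts recall, but also keeps every model's errors, which hurts precision:

```python
gold = {(0, 2), (2, 1), (2, 3), (3, 4)}
pred1 = {(0, 2), (2, 1)}             # precise but partial
pred2 = {(2, 3), (3, 4), (1, 4)}     # covers the rest, plus one wrong arc

merged = pred1 | pred2               # simple merging: union of all edges

def precision(g, p): return len(g & p) / len(p)
def recall(g, p): return len(g & p) / len(g)

# Recall of the union rises to 1.0, but the wrong arc (1, 4)
# drags its precision down to 0.8.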

**Table 6.** Graph merging on the development set with neural tree parsers (same layout as Table 5).

| | | UP | UR | UF | UCM | LP | LR | LF | LCM |
|---|---|---|---|---|---|---|---|---|---|
| SM | Subgraph1 | 89.66 | 78.81 | 83.88 | 19.13 | 87.40 | 76.82 | 81.77 | 17.25 |
| | Subgraph2 | 89.04 | 81.31 | 85.00 | 19.52 | 86.82 | 79.29 | 82.88 | 16.97 |
| | Subgraph3 | 90.50 | 81.24 | 85.62 | 20.20 | 88.38 | 79.33 | 83.61 | 17.88 |
| | Merged | 84.01 | 90.70 | 87.23 | 23.72 | 81.70 | 88.20 | 84.83 | 20.60 |
| LR | Subgraph1 | 91.18 | 79.90 | 85.17 | 20.72 | 88.80 | 77.82 | 82.95 | 18.39 |
| | Subgraph2 | 90.45 | 82.29 | 86.17 | 20.72 | 88.16 | 80.20 | 83.99 | 18.16 |
| | Subgraph3 | 91.29 | 81.91 | 86.35 | 20.66 | 89.11 | 79.95 | 84.28 | 17.93 |
| | Merged | 88.75 | 87.96 | 88.36 | 29.34 | 86.43 | 85.66 | 86.05 | 25.09 |


### 6.3 Main Results of Transition-Based Parsing

Table 7 summarizes the parsing results obtained by the transition-based parser. We compare two strategies, namely, dynamic oracle and beam search, for improving training and decoding. The dynamic oracle is a very useful training strategy for improving a transition-based parser, especially for neural models (Goldberg and Nivre 2012; Kiperwasser and Goldberg 2016). Beam search is an effective search method that achieves excellent results on various NLP tasks, for example, Machine Translation. When coupled with linear models, beam search has been shown to be a useful technique for improving both training and decoding (Zhang and Clark 2011b). In the particular neural parsing architecture employed here, however, beam search performs significantly worse than the dynamic oracle strategy. Note that beam search and structured learning may still be very helpful for neural parsing models of other architectures (Weiss et al. 2015; Andor et al. 2016). Here, the beam size is set to 16.

**Table 7.** Transition-based parsing results on the development set.

| | UP | UR | UF | UCM | LP | LR | LF | LCM |
|---|---|---|---|---|---|---|---|---|
| Dynamic oracle | 89.59 | 85.50 | 87.49 | 24.74 | 87.49 | 83.49 | 85.44 | 21.45 |
| Beam search | 84.58 | 87.38 | 85.96 | 20.89 | 82.17 | 84.89 | 83.51 | 17.99 |

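For reference, a generic beam-search decoder over a transition system can be sketched as below. This is an illustrative skeleton, not the parser's actual interface: the state representation, action set, and scoring function are hypothetical placeholders supplied by the caller.

```python
import heapq

def beam_search(init_state, apply_action, score_actions, is_final, beam_size=16):
    """Keep the `beam_size` highest-scoring partial transition sequences."""
    beam = [(0.0, init_state)]
    while not all(is_final(s) for _, s in beam):
        candidates = []
        for score, state in beam:
            if is_final(state):
                candidates.append((score, state))   # carry finished items along
                continue
            for action, logp in score_actions(state):
                candidates.append((score + logp, apply_action(state, action)))
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]

# Toy transition system: states are strings, actions append "a" or "b",
# and "b" always scores higher, so the best final state is "bb".
best = beam_search(
    "",
    apply_action=lambda s, a: s + a,
    score_actions=lambda s: [("a", -1.0), ("b", 0.0)],
    is_final=lambda s: len(s) == 2,
    beam_size=2,
)
```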

### 6.4 Analysis

#### 6.4.1 *Precision vs. Recall*.

A noteworthy fact about the overall performance of the neural transition-based system is that the precision is promising but the recall is low. This pattern is consistent with the results obtained by transition-based parsers with linear scoring models (Zhang et al. 2016) and by a shift-reduce CCG parser (Zhang and Clark 2011a): the functor-argument dependencies generated by the CCG parser also have relatively high precision but considerably lower recall. When building NLP applications, for example, information extraction systems, upon GR parsing, this property merits attention, as a good trade-off between precision and recall may have a great impact on final results.

The graph merging system coupled with neural tree parsers performs quite differently: its precision and recall are well balanced. Figure 17 and Table 8 present detailed analyses. With respect to dependency length, the transition-based parser clearly obtains higher precision, while the graph merging parser obtains higher recall. With respect to dependency relations, for most types the precision of the transition-based parser is higher than its recall; the graph merging parser again shows the opposite pattern, with precision lower than recall for most types. Note that the overall F-scores of the two systems are roughly equivalent.

**Table 8.** Per-relation precision (P), recall (R), and F-score (F) of the transition-based (TR) and graph merging (GM) parsers; # is the frequency of the relation.

| Relation | # | P (TR) | R (TR) | F (TR) | P (GM) | R (GM) | F (GM) |
|---|---|---|---|---|---|---|---|
| ADV | 4,795 | 89.35 | 80.15 | 84.50 | 87.24 | 79.96 | 83.44 |
| AMOD | 1,309 | 88.45 | 88.31 | 88.38 | 89.86 | 89.38 | 89.62 |
| AUX | 926 | 88.63 | 86.72 | 87.66 | 88.26 | 86.07 | 87.15 |
| COMP | 7,930 | 89.73 | 86.96 | 88.33 | 87.27 | 90.68 | 88.94 |
| DET | 583 | 95.59 | 92.97 | 94.26 | 93.48 | 95.88 | 94.67 |
| LOC | 572 | 80.29 | 67.66 | 73.43 | 81.04 | 68.01 | 73.95 |
| NMOD | 8,563 | 89.45 | 87.43 | 88.43 | 88.09 | 91.00 | 89.52 |
| OBJ | 5,406 | 89.66 | 86.29 | 87.94 | 89.94 | 89.68 | 89.81 |
| QUANT | 897 | 92.02 | 92.53 | 92.27 | 92.38 | 94.65 | 93.50 |
| RELATIVE | 1,201 | 92.06 | 82.01 | 86.75 | 92.12 | 86.68 | 89.32 |
| SBJ | 6,313 | 82.15 | 80.41 | 81.27 | 81.50 | 79.41 | 80.44 |
| TMP | 1,273 | 78.76 | 71.64 | 75.03 | 79.49 | 73.68 | 76.48 |


#### 6.4.2 *Deep vs. Deep*.

CCG and HPSG parsers also favor dependency-based metrics for evaluation (Clark and Curran 2007b; Miyao and Tsujii 2008). Previous work on Chinese CCG and HPSG parsing unanimously agrees that obtaining deep analyses of Chinese is more challenging (Yu et al. 2011; Tse and Curran 2012). The otherwise successful C&C and Enju parsers produce rather inaccurate results on Chinese texts. Although the numbers profiling the quality of deep dependency structures under different formalisms are not directly comparable, all empirical evaluations indicate that the state of the art in deep linguistic processing for Chinese lags far behind.

#### 6.4.3 *Impact of POS Tagging*.

According to our previous study (Sun and Wan 2016), the choice of POS tagger has a great impact on syntactic analysis. This is closely related to a prominent language-specific property of Mandarin Chinese: as an analytic language, it lacks morphosyntactic features that explicitly indicate lexical categories. To evaluate parsing performance in a more realistic setup, we report parsing results based on two different POS taggers introduced in Sun and Wan (2016). Table 9 presents the results. We can see that automatic POS tagging has a great impact on deep dependency parsing: even when assisted by a highly engineered state-of-the-art tagger, the parser still performs rather poorly.

**Table 9.** Impact of POS tagging. GM: graph merging; TR: transition-based; Gold: gold-standard POS tags; LGLM, SC: the automatic taggers of Sun and Wan (2016).

| | | UP | UR | UF | UCM | LP | LR | LF | LCM |
|---|---|---|---|---|---|---|---|---|---|
| GM | Gold | 88.75 | 87.96 | 88.36 | 29.34 | 86.43 | 85.66 | 86.05 | 25.09 |
| | LGLM | 81.99 | 81.62 | 81.80 | 23.50 | 76.94 | 76.59 | 76.77 | 19.58 |
| | SC | 83.65 | 83.03 | 83.34 | 23.78 | 79.03 | 78.43 | 78.73 | 20.09 |
| TR | Gold | 89.59 | 85.50 | 87.49 | 24.74 | 87.49 | 83.49 | 85.44 | 21.45 |
| | LGLM | 83.69 | 79.22 | 81.39 | 19.92 | 78.92 | 74.71 | 76.76 | 17.03 |
| | SC | 85.15 | 80.73 | 82.88 | 20.89 | 80.72 | 76.54 | 78.57 | 17.76 |


### 6.5 Improved Results with ELMo

Table 9 indicates the importance of high-quality POS tagging. We take this as a reflection of the importance of lexical information. Chinese lacks explicit morphosyntactic features, which makes it hard to infer grammatical functions from word content alone; contextual clues are therefore essential for predicting a dependency relation.

#### 6.5.1 *ELMo: Contextualized Word Embedding*.

Quite recently, Peters et al. (2018) proposed ELMo, an unsupervised approach to modeling words, especially their syntactic and semantic properties in different linguistic contexts. The key design of ELMo is to train an LSTM-based language model and use its hidden vectors as contextualized word embeddings. In particular, every word is assigned a vector according to information from the whole sentence, rather than from a few neighboring words in a limited context. ELMo has proven very effective at inducing both syntagmatic and paradigmatic knowledge and is thus extremely useful for improving various English NLP tasks. Incorporating the word embeddings obtained by ELMo into a parser is very simple: we take such embeddings as additional input word vectors and keep everything else unchanged.
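This incorporation step can be sketched as follows. The function name and the 100-dimensional word vector are illustrative assumptions; the 1024 dimensions match the ELMo vectors used in this work.

```python
def concat_inputs(word_vec, elmo_vec, pos_vec=None):
    """Per-token parser input: word embedding + ELMo vector (+ POS embedding)."""
    out = list(word_vec) + list(elmo_vec)
    if pos_vec is not None:
        out += list(pos_vec)
    return out

# A hypothetical 100-dimensional word vector concatenated with a
# 1024-dimensional ELMo vector yields a 1124-dimensional input vector;
# the rest of the parser is unchanged.
token_input = concat_inputs([0.1] * 100, [0.0] * 1024)
```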

#### 6.5.2 *Training an ELMo Model for Chinese*.

In this work, we trained an ELMo model for Chinese and applied it to enhance our GR parsers. The model is trained on the Mandarin news portion (Xinhua newswire) of Chinese Gigaword, a comprehensive archive of newswire text acquired over several years by the Linguistic Data Consortium (LDC). This data covers all news published by the Xinhua News Agency (the largest news agency in China) from 1991 to 2004 and contains about 12 million sentences. Unlike English and other Western languages, Chinese is written without explicit word delimiters such as space characters; in our experiments, we employ the supervised segmenter introduced in Sun and Xu (2011) to process the raw texts.

#### 6.5.3 *Main Results*.

Table 10 presents the results. We concatenate the 1024-dimensional ELMo vectors with the word (and, where applicable, POS tag) embeddings of our neural parsers. The ELMo vectors significantly improve parser performance, and the gain is more pronounced when gold-standard POS tags are removed.

**Table 10.** Parsing with ELMo vectors. GM: graph merging; TR: transition-based; NoPOS: no POS tag embeddings.

| | | UP | UR | UF | UCM | LP | LR | LF | LCM |
|---|---|---|---|---|---|---|---|---|---|
| GM | Gold + ELMo | 91.01 | 91.05 | 91.03 | 31.93 | 88.76 | 88.81 | 88.79 | 26.21 |
| | LGLM + ELMo | 86.16 | 86.69 | 86.42 | 25.77 | 80.77 | 81.27 | 81.02 | 20.09 |
| | NoPOS + ELMo | 88.86 | 88.97 | 88.92 | 28.93 | 84.98 | 85.08 | 85.03 | 22.12 |
| TR | Gold + ELMo | 86.89 | 92.92 | 89.80 | 26.38 | 84.65 | 90.53 | 87.49 | 22.35 |
| | LGLM + ELMo | 82.01 | 88.92 | 85.32 | 21.40 | 76.67 | 83.13 | 79.77 | 17.88 |
| | NoPOS + ELMo | 84.40 | 91.24 | 87.69 | 22.86 | 80.59 | 87.12 | 83.73 | 18.15 |


Table 11 reports the graph merging results. When ELMo vectors are added, both the individual and the joint decoders improve. Furthermore, Table 12 shows the performance for representative grammatical functions. Note that only the word content and the corresponding ELMo embeddings are utilized in this experiment; this is a realistic setup for evaluating how accurate the GR parsers can be when processing real-world Chinese texts.

**Table 11.** Graph merging with ELMo vectors (words only, no POS tags). SM: simple merging; LR: merging with Lagrangian Relaxation.

| | | UP | UR | UF | UCM | LP | LR | LF | LCM |
|---|---|---|---|---|---|---|---|---|---|
| SM | Subgraph1 | 89.74 | 79.79 | 84.47 | 19.80 | 85.69 | 76.19 | 80.66 | 17.07 |
| | Subgraph2 | 89.47 | 81.63 | 85.37 | 19.06 | 85.43 | 77.94 | 81.51 | 15.94 |
| | Subgraph3 | 90.62 | 81.50 | 85.82 | 20.76 | 86.60 | 77.88 | 82.01 | 16.28 |
| | Merged | 84.40 | 91.24 | 87.69 | 22.86 | 80.59 | 87.12 | 83.73 | 18.15 |
| LR | Subgraph1 | 91.13 | 80.82 | 85.67 | 21.10 | 86.95 | 77.11 | 81.73 | 17.30 |
| | Subgraph2 | 90.86 | 82.78 | 86.63 | 20.31 | 86.67 | 78.97 | 82.64 | 16.73 |
| | Subgraph3 | 91.56 | 82.38 | 86.73 | 21.21 | 87.49 | 78.72 | 82.88 | 16.79 |
| | Merged | 88.86 | 88.97 | 88.92 | 28.93 | 84.98 | 85.08 | 85.03 | 22.12 |


**Table 12.** Performance for representative grammatical functions in the realistic setup (words and ELMo embeddings only).

| Relation | P | R | F |
|---|---|---|---|
| ADV | 83.08 | 79.81 | 81.41 |
| AMOD | 82.92 | 80.76 | 81.83 |
| COMP | 87.17 | 90.13 | 88.63 |
| LOC | 82.62 | 70.63 | 76.15 |
| NMOD | 86.41 | 89.19 | 87.78 |
| OBJ | 88.08 | 91.00 | 89.52 |
| SBJ | 82.29 | 80.59 | 81.43 |
| TMP | 78.39 | 76.36 | 77.36 |


#### 6.5.4 *Comparing the POS Tags and the ELMo Vectors*.

𝒟_{POS}/𝒟_{ELMo} denotes the set of dependencies over the held-out sentences returned by the POS-/ELMo-enhanced model. When the two output sets are evaluated against each other on the development data, the unlabeled and labeled metrics are 84.46 and 81.05, respectively. These relatively low values highlight the complementary strengths of the POS tags and the ELMo vectors. This complementarity is also confirmed by the experiment that uses POS tags and ELMo together: comparing the results in Tables 9 and 10, we can clearly see a significant improvement.
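The comparison between the two output sets can be sketched as below, assuming (as an interpretation of the text, not a quotation of the original procedure) that one set is scored against the other with the usual precision/recall machinery; the example tuples are invented:

```python
def agreement_f(deps_a, deps_b):
    """F-score of one dependency set evaluated against the other."""
    if not deps_a or not deps_b:
        return 0.0
    inter = len(deps_a & deps_b)
    p, r = inter / len(deps_b), inter / len(deps_a)
    return 2 * p * r / (p + r) if p + r else 0.0

# The two models agree on the arcs but disagree on one label,
# so unlabeled agreement is perfect while labeled agreement is 2/3.
d_pos = {(2, 1, "SBJ"), (0, 2, "ROOT"), (2, 4, "OBJ")}
d_elmo = {(2, 1, "SBJ"), (0, 2, "ROOT"), (2, 4, "COMP")}
labeled = agreement_f(d_pos, d_elmo)
unlabeled = agreement_f({(h, d) for h, d, _ in d_pos},
                        {(h, d) for h, d, _ in d_elmo})
```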

#### 6.5.5 *Local vs. Non-Local*.

Although the micro accuracy over all dependencies is considerably good, the ability of current state-of-the-art statistical parsers to recover difficult non-local material is far from satisfactory, even for English (Rimell, Clark, and Steedman 2009; Bender et al. 2011). To show the difficulty of recovering non-local dependencies, we report accuracy for local and non-local dependencies separately. Table 13 presents the labeled/unlabeled recall of local (UR_{L}/LR_{L}) and non-local (UR_{NL}/LR_{NL}) dependencies. We can clearly see that non-local dependency recovery is extremely difficult for Chinese parsing. Another noteworthy observation is that the ELMo word embeddings improve the prediction of non-local dependencies by a very significant margin.
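The split metrics can be computed as in this sketch, where the partition of the gold arcs into local and non-local sets is assumed to be given (it is determined during GR-graph construction); the toy arcs are invented:

```python
def split_recall(gold_local, gold_nonlocal, predicted):
    """Recall computed separately over local and non-local gold arcs."""
    rec = lambda g: len(g & predicted) / len(g) if g else 0.0
    return rec(gold_local), rec(gold_nonlocal)

# Toy graph with arcs as (head, dependent) pairs: the parser recovers all
# local arcs but misses the single long-distance (non-local) arc.
local = {(0, 2), (2, 1), (2, 3)}
nonlocal_ = {(4, 1)}
pred = {(0, 2), (2, 1), (2, 3)}
ur_l, ur_nl = split_recall(local, nonlocal_, pred)
```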

### 6.6 Results on the Test Data

Table 14 gives the parsing accuracy on the test set. We evaluate different architectures and different scoring models against the results reported in Zhang et al. (2016). The beam size of our transition-based parser is set to 16, the same as in Zhang et al. (2016). Comparing the global linear and LSTM-based models, we can see the advantage of applying neural models to syntactic parsing, as well as the effectiveness of our graph merging model: it is significantly better than the transition-based parser. Table 15 shows the results obtained using a realistic setup for real-world applications: the POS tags are removed while the ELMo vectors are incorporated.

**Table 14.** Parsing accuracy on the test set. GM: graph merging; TR: transition-based; DO: dynamic oracle; BS: beam search.

| | | UP | UR | UF | UCM | LP | LR | LF | LCM |
|---|---|---|---|---|---|---|---|---|---|
| GM | Neural | 88.84 | 88.23 | 88.53 | 29.14 | 86.47 | 85.87 | 86.17 | 24.68 |
| GM | Linear | 88.06 | 85.11 | 86.56 | 26.24 | 86.03 | 83.16 | 84.57 | 22.84 |
| TR (DO) | Neural | 89.56 | 85.75 | 87.61 | 25.11 | 87.41 | 83.69 | 85.51 | 21.86 |
| TR (BS) | Neural | 84.77 | 87.70 | 86.21 | 20.61 | 82.21 | 85.05 | 83.61 | 18.07 |
| TR (BS) | Linear | – | – | – | – | 82.28 | 83.11 | 82.69 | – |


## 7. Conclusion

The availability of large-scale treebanks has contributed to the blossoming of statistical approaches for building accurate *surface* constituency and dependency parsers. Recent years have witnessed rapid progress in deep linguistic processing of English texts, and initial attempts have been made for Chinese. Our work attempts to strike a balance between traditional CFG parsing, dependency tree parsing, and deep linguistic processing. By integrating the linguistically deep grammatical function information with the key bilexical relations underlying dependency analysis, we propose a new type of deep dependency analysis for parsing Mandarin Chinese sentences. Based on LFG-like dependency annotations, we developed statistical models for deep dependency parsing in both constituency and dependency formalisms. To construct complex linguistic graphs beyond trees, we propose a new approach, namely graph merging, which builds GR graphs from subgraphs. Two key issues are involved in this approach, namely, graph decomposition and merging. To solve both problems in a principled way, we cast them as optimization problems and employ combinatorial optimization techniques. Experiments demonstrate the effectiveness of the graph merging framework, which can also be adapted to other types of flexible representations, for example, semantic dependency graphs (Oepen et al. 2014, 2015) and abstract meaning representations (Banarescu et al. 2013). The study is significant in demonstrating a linguistically motivated and computationally effective way of parsing Chinese texts.

## Appendix

In annotations of dependency structures, typical grammatical relations are subject, object, etc., which indicate the role the dependent plays with respect to its head. In addition to phrasal categories, CTB also has functional labels to represent relations, and it uses syntactic distribution as the main criterion for distinguishing lexical categories. By exploring CTB's original annotations, we define a rich set of dependency relations. Table 16 lists the major ones; for more details, please refer to the source code.

**Table 16.** Major grammatical relations.

| Relation | Description |
|---|---|
| ADV | adverbial adjunct |
| AMOD | adjectival modifier |
| AUX | auxiliary |
| COMP | complement |
| DET | determiner |
| LOC | location |
| NMOD | nominal modifier |
| OBJ | object |
| PRT | particle |
| QUANT | quantification |
| RELATIVE | relative clause |
| SUBJ | subject |
| TMP | temporal modifier |


## Notes

In this work, we arguably take the dependency tree representation as a surface-oriented syntactic analysis.

See the Appendix for the list of major grammatical functions defined by our algorithm.

In this article, we employ projective parsers. The minimal number of subgraphs is related to the pagenumber of GR graphs: the pagenumber of 90.96% of GR graphs is at most 2, while the pagenumber of 98.18% is at most 3. This means that three projective trees are probably sufficient to handle Chinese sentences, whereas two are not. According to the empirical results in Table 3, using three projective trees covers 99.55% of GR arcs; we therefore consider three a suitable number for our problem.
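The pagenumber statistics rest on arc crossings: a set of arcs fits on a single page exactly when no two arcs cross. The following is a minimal sketch of the crossing test (a direct O(n²) check for illustration, not the procedure used to compute the statistics above):

```python
def crossing(a, b):
    """Two arcs, given as (i, j) word-position pairs, cross iff they interleave."""
    (i, j), (k, l) = sorted(a), sorted(b)
    return i < k < j < l or k < i < l < j

def one_page(arcs):
    """True iff no two arcs cross, i.e., the arc set fits on one page."""
    arcs = list(arcs)
    return not any(crossing(arcs[p], arcs[q])
                   for p in range(len(arcs)) for q in range(p + 1, len(arcs)))

nested = one_page([(0, 4), (1, 3), (1, 2)])   # nested arcs: one page suffices
crossed = one_page([(0, 2), (1, 3)])          # crossing arcs: need two pages
```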

*d* is a coefficient to be tuned on validation data.

The notions of soundness and completeness are adopted from Nivre (2008). Let *S* be a transition system for dependency parsing.

- *S* is sound for a class 𝒢 of dependency graphs if and only if, for every sentence *x* and every transition sequence *c*_{0}, …, *c*_{m} for *x* in *S*, the parse *G*_{cm} ∈ 𝒢.
- *S* is complete for a class 𝒢 of dependency graphs if and only if, for every sentence *x* and every dependency graph *G*_{x} for *x* in 𝒢, there is a transition sequence *c*_{0}, …, *c*_{m} for *x* in *S* such that *G*_{cm} = *G*_{x}.