Thesis proposal
Learning Representations
for Text-level Discourse Parsing
Copyright © 2015 gw0 [http://gw.tnode.com/] <gw.2015@tnode.com>
Overview
motivation
discourse parsing
- PDTB-style
deep learning architectures
- sequence processing
- word embeddings
our approach
- key ideas
- guided layer-wise multi-task learning
progress
Motivation
natural language processing (NLP)
- large pipelines of independently-constructed components or subtasks
- traditionally hand-engineered sparse features based on language-, domain-, and task-specific knowledge
- still room for improvement on challenging NLP tasks
deep learning architectures
- backpropagation could be the one learning algorithm to unify learning of all components
- latent features/representations are automatically learned as distributed dense vectors
- surprising results for a number of NLP tasks
Discourse parsing
- discourse: a piece of text meant to communicate specific information (clauses, sentences, or even paragraphs)
- understood only in relation to other discourse units; their joint meaning is larger than any individual unit's meaning alone
[Index arbitrage doesn't work]arg1,
and [it scares natural buyers of stock]arg2.
— PDTB-style, id: 14883, type: explicit, sense: Expansion.Conjunction
[But]arg2
if [this prompts others to consider the same thing]arg1,
then [it may become much more important]arg2.
— PDTB-style, id: 14905, type: explicit, sense: Contingency.Condition
PDTB-style examples
He added [that "having just one firm do this isn't going to mean a hill of beans]arg1.
But [if this prompts others to consider the same thing, then it may become much more important]arg2."
— PDTB-style, id: 14904, type: explicit, sense: Comparison.Concession
In addition, Black & Decker had said it would sell two other undisclosed Emhart operations if it received the right price. [Bostic is one of the previously unnamed units, and the first of the five to be sold.]arg1
[The company is still negotiating the sales of the other four units and expects to announce agreements by the end of the year]arg1. [The five units generated sales of about $1.3 billion in 1988, almost half of Emhart's $2.3 billion revenue]arg2. Bostic posted 1988 sales of $255 million.
— PDTB-style, id: 12886, type: entrel, sense: EntRel
PDTB-style discourse parsing
Penn Discourse Treebank (PDTB) adopts a predicate-argument view of discourse relations and treats each relation as independent of the others
- 2159 articles from the Wall Street Journal
- 4 discourse sense classes, 16 types, 23 subtypes
also called shallow discourse parsing
- discourse relations are not connected to one another to form a connected structure (tree or graph)
- adjacent/non-adjacent units in same/different sentences
primary goals
- locate the explicit or implicit discourse connective
- locate the text spans of arguments 1 and 2
- predict the sense that characterizes the nature of the relation
Deep learning architectures
- multiple layers of learning blocks stacked on each other
- starting from raw data, each layer transforms the representation into an increasingly abstract, higher-level form, until the final low-dimensional features for the given task are obtained
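For concreteness, a minimal sketch of such a stack in Keras (the library adopted later in this proposal); all layer sizes, the input dimension, and the 4-class output are illustrative, and constructor details vary across Keras versions:

    from keras.models import Sequential
    from keras.layers import Dense

    # Each layer maps its input to a more abstract representation,
    # ending in low-dimensional features for the task (all sizes illustrative).
    model = Sequential()
    model.add(Dense(256, activation='relu', input_dim=1000))  # raw input -> first hidden representation
    model.add(Dense(64, activation='relu'))                   # higher, more abstract representation
    model.add(Dense(4, activation='softmax'))                 # final low-dimensional output (e.g. 4 sense classes)
    model.compile(optimizer='sgd', loss='categorical_crossentropy')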

Sequence processing
Text documents of different lengths are usually treated as sequences of words:
- transition-based processing mechanisms
- recurrent neural networks (RNNs)
- applying the same set of weights over the sequence (temporal dimension) or structure (tree-based)
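As a sketch (with made-up dimensions), a recurrent layer in Keras that applies the same weights at every position of a word sequence and emits one prediction per token:

    from keras.models import Sequential
    from keras.layers import Embedding, SimpleRNN, TimeDistributed, Dense

    vocab_size, embedding_dim, max_len = 10000, 100, 80   # illustrative values

    model = Sequential()
    model.add(Embedding(vocab_size, embedding_dim, input_length=max_len))  # token ids -> dense vectors
    model.add(SimpleRNN(128, return_sequences=True))             # the same recurrent weights are reused at every time step
    model.add(TimeDistributed(Dense(23, activation='softmax')))  # e.g. one label per token (23 is illustrative)
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')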

Word embeddings
Represent text as numeric vectors of fixed size:
- word embeddings: SGNS (word2vec), GloVe, ...
- feature/phrase/document embeddings
- character-level convolutional networks
Unsupervised pre-training helps develop natural abstractions.
Sharing word embeddings across tasks in multi-task learning improves performance in the absence of hand-engineered features.
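For illustration, a hedged sketch of loading a pre-trained word2vec lookup table with gensim and mapping words to fixed-size vectors; the file path is a placeholder for the Google News vectors used later:

    import numpy as np
    from gensim.models import KeyedVectors

    # Placeholder path for the pre-trained Google News word2vec vectors.
    w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

    def embed(word, dim=300):
        """Return the word's embedding, or a zero vector for out-of-vocabulary words."""
        return w2v[word] if word in w2v else np.zeros(dim)

    sentence = ['index', 'arbitrage', 'does', 'not', 'work']
    vectors = np.array([embed(w) for w in sentence])   # shape: (5, 300)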

Our approach
- PDTB-style end-to-end discourse parser
- one deep learning architecture instead of multiple independently-constructed components
- with almost no hand-engineered NLP knowledge
Input:
- tokenized text documents (from CoNLL 2015 shared task)
Output:
- extracted PDTB-style discourse relations
- connectives
- arguments 1 and 2
- discourse senses
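For orientation, one extracted relation written as a Python dict in roughly the shape of the CoNLL 2015 shared task output format; the document id, token indices, and spans below are illustrative placeholders, not values from the corpus:

    relation = {
        'DocID': 'wsj_1000',                      # illustrative document id
        'Type': 'Explicit',
        'Sense': ['Contingency.Condition'],
        'Connective': {'TokenList': [12]},        # document-level token offsets of the connective
        'Arg1': {'TokenList': [13, 14, 15, 16]},  # token offsets of argument 1 (illustrative)
        'Arg2': {'TokenList': [18, 19, 20, 21]},  # token offsets of argument 2 (illustrative)
    }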
Key ideas
- unified end-to-end architecture
- backpropagation as the one learning algorithm for all discourse parsing subtasks and related NLP tasks
- automatic learning of representations
- in hidden layers of deep learning architectures (bidirectional deep RNN/LSTM)
- shared intermediate representations
- partially stacked on top of each other to benefit from each other's representations
- guided layer-wise multi-task learning
- jointly learning all discourse parsing subtasks and related NLP tasks including unsupervised pre-training
Guided layer-wise multi-task learning
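A hedged sketch of the idea with Keras' functional API: the word embeddings and the lower recurrent layer are shared, an auxiliary task (here a hypothetical POS tagging output) is attached low to guide those layers, and discourse sense classification sits on top; all dimensions, task choices, and loss weights are illustrative assumptions, not the final architecture:

    from keras.models import Model
    from keras.layers import Input, Embedding, Dense, TimeDistributed, Bidirectional, LSTM

    vocab_size, embedding_dim, max_len = 10000, 100, 80   # illustrative values
    n_pos_tags, n_senses = 45, 23                          # illustrative label set sizes

    words = Input(shape=(max_len,), dtype='int32')
    x = Embedding(vocab_size, embedding_dim)(words)                 # shared word embeddings

    lower = Bidirectional(LSTM(128, return_sequences=True))(x)      # lower shared layer
    pos_out = TimeDistributed(Dense(n_pos_tags, activation='softmax'),
                              name='pos')(lower)                    # auxiliary task guides the lower layer

    upper = Bidirectional(LSTM(128, return_sequences=True))(lower)  # higher layer builds on lower representations
    sense_out = TimeDistributed(Dense(n_senses, activation='softmax'),
                                name='sense')(upper)                # main discourse task on top

    model = Model(inputs=words, outputs=[pos_out, sense_out])
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                  loss_weights={'pos': 0.3, 'sense': 1.0})          # weight auxiliary vs. main task

Training then proceeds jointly on both outputs, so gradients from the auxiliary task shape the shared lower layers while the main discourse task drives the top.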

Progress
technology
- Python
- Theano: fast tensor manipulation library
- Keras: modular neural network library
resources and inputs
- pre-trained word2vec lookup table (on Google News)
- tokenized text documents as input
- POS tags of input tokens
evaluation (from CoNLL 2015 shared task)
- performance in terms of precision/recall/F1-score
- explicit connective identification; argument 1, argument 2, and combined argument extraction; sense classification; overall
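For reference, a minimal sketch of how precision, recall, and F1 are computed from relation counts; the official shared task scorer is more involved (e.g. partial argument matching), so this is only the basic arithmetic:

    def precision_recall_f1(n_correct, n_predicted, n_gold):
        """Precision/recall/F1 from counts of correctly predicted, predicted, and gold relations."""
        precision = n_correct / n_predicted if n_predicted else 0.0
        recall = n_correct / n_gold if n_gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Illustrative counts: 70 correct out of 90 predicted, with 100 gold relations.
    print(precision_recall_f1(70, 90, 100))   # approx. (0.778, 0.700, 0.737)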
Complication or useful?
Experiments with single-task learning using a bidirectional deep RNN for discourse sense tagging:
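The model behind these experiments is, in rough terms, a stack of bidirectional recurrent layers over word embeddings with a per-token sense output; the sketch below uses illustrative sizes and standard Keras layers rather than the exact configuration that was run:

    from keras.models import Sequential
    from keras.layers import Embedding, Bidirectional, SimpleRNN, TimeDistributed, Dense

    model = Sequential()
    model.add(Embedding(10000, 100, input_length=80))                 # illustrative vocabulary size and dimensions
    model.add(Bidirectional(SimpleRNN(128, return_sequences=True)))   # first bidirectional recurrent layer
    model.add(Bidirectional(SimpleRNN(128, return_sequences=True)))   # second layer makes the RNN deep
    model.add(TimeDistributed(Dense(23, activation='softmax')))       # one sense tag per token (tag set size illustrative)
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])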

Single-task results
- long training time for randomly initialized weights
- lower-level tasks improve initialization
- overfitting of the training data
- more tasks improve generalization
Future experiments
- various discourse parsing subtasks
- various related NLP tasks (chunking, POS, NER, SRL, ...)
- different representation structures
- different activation, optimization, architectures
- long short-term memory (LSTM)
- neural Turing machines (NTM)
Does it make sense?
I would like to hear your feedback and ideas
for my thesis proposal.
Thank you
http://gw.tnode.com/deep-learning/acl2015-presentation/