Show simple item record

dc.rights.license Restricted to current Rensselaer faculty, staff and students. Access inquiries may be directed to the Rensselaer Libraries.
dc.contributor Ji, Heng
dc.contributor Cho, Kyunghyun
dc.contributor Hendler, James A.
dc.contributor McGuinness, Deborah L.
dc.contributor Zettlemoyer, Luke S., 1978-
dc.contributor.author Zhang, Boliang
dc.date.accessioned 2021-11-03T09:13:41Z
dc.date.available 2021-11-03T09:13:41Z
dc.date.created 2020-06-12T12:30:45Z
dc.date.issued 2019-08
dc.identifier.uri https://hdl.handle.net/20.500.13015/2454
dc.description August 2019
dc.description School of Science
dc.description.abstract Extracting information from natural language text is one of the most challenging and long-standing problems in Natural Language Processing (NLP). Information Extraction (IE) turns unstructured data into a structured knowledge base according to a predefined schema or ontology. One of the core tasks in IE is name tagging, also known as Named Entity Recognition (NER), which seeks to identify and classify names in text into predefined categories such as persons, locations, and organizations. Name tagging produces informative results that benefit many downstream NLP tasks, such as relation extraction and event extraction, and it also plays an important role in industrial applications such as question answering and dialogue systems.
dc.description.abstract State-of-the-art name tagging approaches rely on supervised machine learning models that require massive amounts of clean annotated data. These supervised methods are sophisticated and highly effective for high-resource languages (HL) such as English, German, and French. However, in scenarios where annotations are insufficient and noisy, their performance declines greatly. Meanwhile, acquiring human-annotated data is expensive and time-consuming, which makes traditional supervised machine learning approaches very difficult to deploy, especially in an emergent setting.
dc.description.abstract In this thesis, we focus on tackling the challenges of name tagging for low-resource languages (LL) in emergent situations. The methodology presented in this thesis consists of three parts. In the first part, we populate name tagging annotations by generating "silver-standard" noisy training data via (1) a "Chinese Room" platform, which we designed to let a native English speaker extract names from low-resource language documents; (2) parallel name projection, where we project extracted English names onto LL sentences through English-LL parallel data; and (3) Wikipedia Knowledge Base (KB) mining, where we transfer annotations from English to other languages through cross-lingual links and KB properties in Wikipedia.
dc.description.abstract In the second part, because traditional supervised machine learning models suffer a large performance drop when trained on the noisy "silver-standard" annotations, we propose a new solution that incorporates non-traditional, language-universal resources that are readily available but rarely explored in the NLP community. These universal resources contain valuable dictionaries, grammars, language patterns, etc., all of which are presented in multiple languages. We encode various types of these non-traditional linguistic resources as features in a supervised Deep Neural Network (DNN) name tagger.
dc.description.abstract In the third part, because current DNN models rely only on local contextual information, they may perform poorly when the local context is ambiguous or limited. We propose a new framework that improves the DNN name tagger by utilizing both local and global (document-level and corpus-level) contextual information: document-level context is retrieved from other sentences within the same document, and corpus-level context from sentences in other documents. The proposed model learns to incorporate document-level and corpus-level context alongside local context via global attention, which dynamically weights each source of contextual information, and gating mechanisms, which determine its influence.
dc.description.abstract Finally, we investigate training LL name taggers without using any LL annotation. We transfer a name tagger trained on HL annotations to an LL name tagger via two unsupervised approaches: (1) cross-lingual word embeddings, where we align the monolingual word embeddings of HL and LL into a shared space, and (2) cross-lingual language models, where, instead of aligning static word embeddings, we project the contextualized word embeddings (language models) of HL and LL into a shared space.
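The abstract's cross-lingual word-embedding transfer relies on aligning two monolingual embedding spaces into one shared space. A common way to do this is orthogonal Procrustes alignment over a small bilingual seed dictionary; the sketch below is illustrative only (the function name, matrix shapes, and synthetic data are assumptions, not the thesis's actual implementation).

```python
import numpy as np

def procrustes_align(X, Y):
    """Return orthogonal W minimizing ||X W - Y||_F.

    Rows of X are HL embeddings, rows of Y are the LL embeddings
    of their dictionary translations (hypothetical seed pairs).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Synthetic demo: "HL" vectors are a hidden rotation of "LL" vectors,
# so a perfect alignment exists and Procrustes should recover it.
rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 8))                   # toy LL embeddings
R = np.linalg.qr(rng.normal(size=(8, 8)))[0]   # hidden orthogonal map
X = Y @ R.T                                     # toy HL embeddings
W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y, atol=1e-8))         # True: spaces aligned
```

Once HL and LL embeddings share a space, a tagger trained on HL vectors can be applied to mapped LL vectors without LL annotation, which is the premise of the unsupervised transfer described above.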
dc.language.iso ENG
dc.publisher Rensselaer Polytechnic Institute, Troy, NY
dc.relation.ispartof Rensselaer Theses and Dissertations Online Collection
dc.subject Computer science
dc.title Neural name tagging for low-resource languages
dc.type Electronic thesis
dc.type Thesis
dc.digitool.pid 179831
dc.digitool.pid 179832
dc.digitool.pid 179833
dc.rights.holder This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
dc.description.degree PhD
dc.relation.department Dept. of Computer Science

