NER with small strongly labeled and large weakly labeled data

Having a small amount of strongly labeled data and a large amount of weakly labeled data is a very common situation in NLP or ASR modeling. The Amazon Search team used the three-stage NEEDLE framework to take advantage of large weakly labeled data to improve NER. Their noise-aware loss function is interesting and worth a deep dive. Paper link:
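To make the general idea concrete, here is a minimal sketch of a confidence-weighted (noise-aware) token loss: each weakly labeled token's negative log-likelihood is weighted by an estimated probability that its weak label is correct. This is only an illustration of the principle, not NEEDLE's exact formulation; the function name and the way confidence is supplied are my own assumptions.

```python
import numpy as np

def noise_aware_nll(probs, weak_labels, confidence):
    """Confidence-weighted NLL over weakly labeled tokens (illustrative sketch).

    probs:       (n_tokens, n_classes) predicted class probabilities
    weak_labels: (n_tokens,) class indices from the weak labeler
    confidence:  (n_tokens,) estimated P(weak label is correct)
    """
    # probability the model assigns to each token's weak label
    p = probs[np.arange(len(weak_labels)), weak_labels]
    # trust the weak label in proportion to its confidence; when the
    # label is likely wrong, reward probability mass placed elsewhere
    return -np.mean(confidence * np.log(p) + (1 - confidence) * np.log(1 - p))
```

With confidence fixed at 1 this reduces to the ordinary cross-entropy, so strongly labeled tokens can be handled by the same loss.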

AutoML for NLP

  • Jin, Haifeng, Qingquan Song, and Xia Hu. “Efficient neural architecture search with network morphism.” arXiv preprint arXiv:1806.10282 (2018). [code]
  • Pham, Hieu, et al. “Efficient Neural Architecture Search via Parameter Sharing.” arXiv preprint arXiv:1802.03268 (2018). [code]
  • Liu, Hanxiao, Karen Simonyan, and Yiming Yang. “Darts: Differentiable architecture search.” arXiv preprint arXiv:1806.09055 (2018). [code]

  • Google AI:
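Of the papers above, DARTS has the simplest core trick: replace the discrete choice between candidate operations with a softmax-weighted sum, so the architecture weights can be learned by gradient descent. A minimal NumPy sketch of that mixed operation (my own toy illustration, not the authors' code):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_op(x, alphas, ops):
    """DARTS-style continuous relaxation: instead of picking one op,
    output a softmax-weighted sum of all candidate ops. The alphas are
    the architecture parameters, trained jointly with the model weights."""
    w = softmax(alphas)
    return sum(wi * op(x) for wi, op in zip(w, ops))
```

After training, the final discrete architecture is recovered by keeping the operation with the largest alpha on each edge.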


Semantics vs. Syntax

Syntax is the grammar: it describes how to construct a well-formed sentence. For example, "this water is triangular" is syntactically correct.
Semantics relates to meaning: "this water is triangular" does not mean anything, even though the grammar is fine.

BERT for text classification

BERT can achieve high accuracy with a small sample size (e.g., 1,000 examples):

To simply extract features (embeddings) from BERT, this Keras package is easy to start with:

For fine-tuning with GPUs, this PyTorch version is handy (gradient accumulation is implemented):
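Gradient accumulation is worth understanding on its own: it simulates a large effective batch on a small GPU by summing gradients over several micro-batches before each optimizer step. A framework-free sketch of the pattern (plain SGD on a scalar parameter, purely illustrative):

```python
def train_with_accumulation(param, micro_batch_grads, lr=0.1, accum_steps=2):
    """Update `param` only every `accum_steps` micro-batches, scaling each
    gradient by 1/accum_steps so the accumulated sum is an average --
    equivalent to one step on a batch accum_steps times larger."""
    accum = 0.0
    for step, g in enumerate(micro_batch_grads, start=1):
        accum += g / accum_steps          # scale so the sum is an average
        if step % accum_steps == 0:
            param -= lr * accum           # one optimizer step per window
            accum = 0.0                   # reset, like optimizer.zero_grad()
    return param
```

In PyTorch the same idea amounts to dividing the loss by `accum_steps`, calling `backward()` every micro-batch, and calling `optimizer.step()` / `optimizer.zero_grad()` only every `accum_steps` iterations.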

For people who have access to TPUs:

Combine different word embeddings

How can we take advantage of different word embeddings in a text classification task? Please check my Kaggle post:

A subset of this field is called meta-embedding. Here is a list of papers:

I found that simply averaging different embeddings is already quite powerful.
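The averaging baseline is a one-liner once the embedding matrices are aligned to the same vocabulary and dimension. A minimal sketch (function name is my own; embeddings of different sizes would first need projection or padding to a common dimension):

```python
import numpy as np

def average_meta_embedding(embeddings):
    """Element-wise mean of several aligned embedding matrices.

    embeddings: list of (vocab_size, dim) arrays, same shape, rows
    aligned to the same vocabulary. Returns a (vocab_size, dim) array.
    """
    stacked = np.stack(embeddings)        # (n_sources, vocab_size, dim)
    return stacked.mean(axis=0)
```

Concatenation is the other common baseline; averaging keeps the dimension fixed, which matters when the downstream model's input size is constrained.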

One thing to try is BERT: