NER with small strongly labeled and large weakly labeled data

A small amount of strongly labeled data combined with a large amount of weakly labeled data is a very common situation in NLP and ASR modeling. The Amazon search team used a three-stage framework called NEEDLE to take advantage of large weakly labeled data to improve NER. Their noise-aware loss function is interesting and worth a deep dive. Paper link: https://www.amazon.science/publications/named-entity-recognition-with-small-strongly-labeled-and-large-weakly-labeled-data
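
Below is a minimal sketch of the general idea behind a noise-aware loss, not the paper's exact formulation: weakly labeled tokens are down-weighted by an estimated probability that their weak labels are correct. The `confidence` tensor and the weighting scheme are assumptions made here for illustration.

```python
import torch
import torch.nn.functional as F

def noise_aware_loss(logits, labels, confidence):
    """Confidence-weighted token-level cross-entropy (illustrative sketch).

    logits:     (batch, seq_len, num_tags) raw scores from the NER model
    labels:     (batch, seq_len) weak (possibly noisy) tag ids
    confidence: (batch, seq_len) estimated probability that each weak label
                is correct, e.g. from a model trained on the strong labels
    """
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        reduction="none",
    ).view_as(labels).float()
    # Trust each token in proportion to how likely its weak label is correct.
    return (confidence * per_token).mean()

# Toy example with random tensors standing in for real model output.
logits = torch.randn(2, 8, 5)              # 5 NER tags
labels = torch.randint(0, 5, (2, 8))
confidence = torch.rand(2, 8)
loss = noise_aware_loss(logits, labels, confidence)
```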

AutoML for NLP

  • Jin, Haifeng, Qingquan Song, and Xia Hu. “Efficient Neural Architecture Search with Network Morphism.” arXiv preprint arXiv:1806.10282 (2018). [code]
  • Pham, Hieu, et al. “Efficient Neural Architecture Search via Parameter Sharing.” arXiv preprint arXiv:1802.03268 (2018). [code]
  • Liu, Hanxiao, Karen Simonyan, and Yiming Yang. “DARTS: Differentiable Architecture Search.” arXiv preprint arXiv:1806.09055 (2018). [code]

  • H2O.ai: http://docs.h2o.ai/h2o-tutorials/latest-stable/h2o-world-2017/nlp/index.html
  • Google AI: https://cloud.google.com/natural-language/automl/docs/beginners-guide

More: https://github.com/markdtw/awesome-architecture-search

Semantics vs. Syntax

Syntax is the grammar: it describes how to construct a well-formed sentence. For example, “this water is triangular” is syntactically correct.
Semantics relates to meaning: “this water is triangular” does not mean anything, even though the grammar is fine.


https://stackoverflow.com/questions/209979/are-semantics-and-syntax-the-same

BERT for text classification

BERT can achieve high accuracy with a small sample size (e.g., 1,000 labeled examples): https://github.com/Socialbird-AILab/BERT-Classification-Tutorial/blob/master/pictures/Results.png

To simply extract features (embeddings) from BERT, this Keras package is easy to start with: https://pypi.org/project/keras-bert/
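
For instance, keras-bert provides an `extract_embeddings` helper (shown in its README); the checkpoint directory path below is just a placeholder for a downloaded pre-trained BERT model.

```python
from keras_bert import extract_embeddings

# Path to a downloaded pre-trained BERT checkpoint directory (placeholder).
model_path = 'uncased_L-12_H-768_A-12'
texts = ['this water is triangular', 'bert for feature extraction']

# Returns one (seq_len, hidden_size) embedding array per input text.
embeddings = extract_embeddings(model_path, texts)
```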

For fine-tuning with GPUs, this PyTorch version is handy (gradient accumulation is implemented): https://github.com/huggingface/pytorch-pretrained-BERT
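
Gradient accumulation simulates a larger effective batch size on limited GPU memory by summing gradients over several small batches before each optimizer step. The snippet below is a generic PyTorch sketch with toy stand-ins for the model, optimizer, and data loader, not the library's actual training loop.

```python
import torch
from torch import nn

# Toy stand-ins; in practice these would be the BERT model, its optimizer,
# and a real DataLoader over tokenized text.
model = nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
train_loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]

accumulation_steps = 4  # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(train_loader):
    loss = criterion(model(inputs), labels)
    # Scale so the accumulated gradient matches a single large-batch update.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```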

For people who have access to TPUs: https://github.com/google-research/bert

Combine different word embeddings

How can you take advantage of different word embeddings in a text classification task? Please check my Kaggle post: https://www.kaggle.com/c/quora-insincere-questions-classification/discussion/71778

A subset of this field is called meta-embedding. Here is a list of papers: https://github.com/Shujian2015/meta-embedding-paper-list

I found that simply averaging different embeddings is already quite powerful.
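
Here is a minimal sketch of that idea, assuming two pre-trained lookups with the same vector dimension have already been loaded as Python dicts; the names `glove` and `fasttext` below are just placeholders.

```python
import numpy as np

def average_embeddings(vocab, emb_a, emb_b, dim=300):
    """Build an embedding matrix whose rows average two pre-trained lookups."""
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for word, idx in vocab.items():
        vectors = [emb[word] for emb in (emb_a, emb_b) if word in emb]
        if vectors:  # words missing from both lookups stay as zero vectors
            matrix[idx] = np.mean(vectors, axis=0)
    return matrix

# Toy example: two tiny "pre-trained" lookups with matching dimensions.
glove = {'water': np.ones(300, dtype=np.float32)}
fasttext = {'water': np.zeros(300, dtype=np.float32)}
vocab = {'water': 0, 'triangular': 1}
embedding_matrix = average_embeddings(vocab, glove, fasttext)
```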

One more thing to try is BERT (see the GLUE leaderboard): https://gluebenchmark.com/leaderboard