Having a small strongly labeled dataset alongside a large weakly labeled one is a very common situation in NLP and ASR modeling. The Amazon Search team used the three-stage NEEDLE framework to take advantage of large weakly labeled data to improve NER. Their noise-aware loss function is interesting and worth a deep dive. Paper link: https://www.amazon.science/publications/named-entity-recognition-with-small-strongly-labeled-and-large-weakly-labeled-data
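To make the noise-aware idea concrete, here is a minimal sketch of a confidence-weighted loss for a single weak label. It is my simplified reading, not the paper's implementation: the function name `noise_aware_loss`, the scalar inputs, and the clipping constant are all assumptions for illustration, and the paper works with whole label sequences and its own confidence estimate. When the weak label is trusted, the usual negative log-likelihood dominates; when it is not, the loss instead pushes probability away from that label.

```python
import math


def noise_aware_loss(p_label: float, confidence: float) -> float:
    """Confidence-weighted loss for one weak label (illustrative sketch).

    p_label:    model probability assigned to the weak label.
    confidence: estimated probability that the weak label is correct.
    """
    eps = 1e-12  # assumption: clip to avoid log(0)
    p = min(max(p_label, eps), 1.0 - eps)
    nll = -math.log(p)                 # reward agreeing with the weak label
    neg_nll = -math.log(1.0 - p)       # reward disagreeing with the weak label
    return confidence * nll + (1.0 - confidence) * neg_nll


# A trusted label: higher model probability gives a lower loss.
trusted_low = noise_aware_loss(p_label=0.9, confidence=1.0)
trusted_high = noise_aware_loss(p_label=0.1, confidence=1.0)

# A distrusted label: higher model probability gives a higher loss.
distrusted_low = noise_aware_loss(p_label=0.1, confidence=0.0)
distrusted_high = noise_aware_loss(p_label=0.9, confidence=0.0)
```

With a confidence between 0 and 1 the two terms are blended, so noisy labels contribute a weaker training signal than clean ones.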
AutoML for NLP
- Jin, Haifeng, Qingquan Song, and Xia Hu. “Efficient neural architecture search with network morphism.” arXiv preprint arXiv:1806.10282 (2018). [code]
- Pham, Hieu, et al. “Efficient Neural Architecture Search via Parameter Sharing.” arXiv preprint arXiv:1802.03268 (2018). [code]
- Liu, Hanxiao, Karen Simonyan, and Yiming Yang. “Darts: Differentiable architecture search.” arXiv preprint arXiv:1806.09055 (2018). [code]
- H2O.ai: http://docs.h2o.ai/h2o-tutorials/latest-stable/h2o-world-2017/nlp/index.html
- Google AI: https://cloud.google.com/natural-language/automl/docs/beginners-guide
Semantics vs. Syntax
Syntax is the grammar. It describes how to construct a correct sentence. For example, "this water is triangular" is syntactically correct.
Semantics relates to the meaning. "This water is triangular" does not mean anything, even though the grammar is fine.
BERT for text classification
BERT can achieve high accuracy with a small sample size (e.g. 1000): https://github.com/Socialbird-AILab/BERT-Classification-Tutorial/blob/master/pictures/Results.png
To simply get features (embeddings) from BERT, this Keras package is an easy place to start
For fine-tuning with GPUs, this PyTorch version is handy (gradient accumulation is implemented): https://github.com/huggingface/pytorch-pretrained-BERT
For people who have access to TPUs: https://github.com/google-research/bert
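Gradient accumulation, mentioned above for GPU fine-tuning, lets you simulate a large batch on limited memory by summing scaled gradients over several micro-batches before each optimizer step. The sketch below is a toy illustration in plain Python, not code from any of the linked repositories: the one-parameter model `y = w * x`, the helper names, and the data are all made up for the example. It shows that the accumulated gradient matches the full-batch gradient when micro-batches have equal size.

```python
from typing import List


def mse_grad(w: float, xs: List[float], ys: List[float]) -> float:
    """Gradient of mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: List[float], ys: List[float],
                     micro_batch_size: int) -> float:
    """Accumulate micro-batch gradients, each scaled by 1 / number of steps."""
    n = len(xs)
    if n % micro_batch_size != 0:
        raise ValueError("batch size must be a multiple of micro_batch_size")
    steps = n // micro_batch_size
    total = 0.0
    for start in range(0, n, micro_batch_size):
        end = start + micro_batch_size
        # Scale each micro-batch gradient so the sum equals the batch mean.
        total += mse_grad(w, xs[start:end], ys[start:end]) / steps
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full = mse_grad(0.0, xs, ys)                      # one pass over all 4 examples
accumulated = accumulated_grad(0.0, xs, ys, 2)    # two micro-batches of 2

# A single SGD step using the accumulated gradient (learning rate is arbitrary).
w_next = 0.0 - 0.01 * accumulated
```

In a deep learning framework the same pattern usually means dividing the loss by the number of accumulation steps, calling backward on every micro-batch, and stepping the optimizer only once per accumulated batch.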
Combine different word embeddings
How can you take advantage of different word embeddings in a text classification task? See my Kaggle post: https://www.kaggle.com/c/quora-insincere-questions-classification/discussion/71778
A subset of this field is called meta-embedding. Here is a list of papers: https://github.com/Shujian2015/meta-embedding-paper-list
I found that just taking the average of different embeddings is already powerful enough.
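The averaging approach can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from my Kaggle post: it assumes every embedding source uses the same dimensionality, averages each word only over the sources that contain it, and uses tiny made-up vectors (the names `source_a` and `source_b` are hypothetical stand-ins for real pretrained embeddings).

```python
from typing import Dict, List, Sequence

Vector = List[float]


def average_embeddings(sources: Sequence[Dict[str, Vector]]) -> Dict[str, Vector]:
    """Average each word's vectors over the sources that contain the word."""
    vocab = set().union(*(src.keys() for src in sources))
    combined: Dict[str, Vector] = {}
    for word in vocab:
        vectors = [src[word] for src in sources if word in src]
        dim = len(vectors[0])
        if any(len(v) != dim for v in vectors):
            raise ValueError(f"dimension mismatch for word {word!r}")
        # Element-wise mean across the available vectors.
        combined[word] = [sum(vals) / len(vectors) for vals in zip(*vectors)]
    return combined


# Toy 2-dimensional embeddings; real ones would have hundreds of dimensions.
source_a = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
source_b = {"cat": [3.0, 2.0]}

meta = average_embeddings([source_a, source_b])
```

Words missing from one source simply fall back to the vectors that do exist, which keeps vocabulary coverage as wide as the union of all sources.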
One thing to try is BERT: https://gluebenchmark.com/leaderboard