BERT for text classification

BERT can achieve high accuracy with a small sample size (e.g. 1,000 labeled examples):

To simply extract features (embeddings) from BERT, this Keras package is easy to start with:
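Once you have BERT's last hidden state, a common way to turn per-token vectors into one fixed-size sentence feature is mean pooling over the non-padding tokens. A minimal sketch (a random tensor stands in for the actual BERT output; `mean_pool` is an illustrative helper, not part of any package above):

```python
import torch

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden) last-layer BERT output
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding
    Returns one fixed-size feature vector per sentence.
    """
    mask = attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)       # number of real tokens
    return summed / counts

# stand-in for BERT's last hidden state (hidden size 768 in bert-base)
torch.manual_seed(0)
hidden = torch.randn(2, 5, 768)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
features = mean_pool(hidden, mask)
print(features.shape)  # torch.Size([2, 768])
```

The resulting vectors can be fed directly into a downstream classifier such as logistic regression.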

For fine-tuning on GPUs, this PyTorch implementation is handy (gradient accumulation is supported):
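Gradient accumulation lets you simulate a large batch on limited GPU memory: you backprop several micro-batches, scaling each loss by the number of accumulation steps so the summed gradients match a single large batch, and only then call the optimizer. A minimal sketch with a toy linear model (the `accumulate_gradients` helper is illustrative, not from the package above):

```python
import torch
from torch import nn

def accumulate_gradients(model, loss_fn, micro_batches, accum_steps):
    """Accumulate gradients over several micro-batches before one update.

    Each micro-batch loss is divided by accum_steps so the summed
    gradients match what one large batch would produce.
    """
    model.zero_grad()
    for x, y in micro_batches:
        loss = loss_fn(model(x), y) / accum_steps
        loss.backward()  # gradients add up across backward() calls
    # the caller would now run optimizer.step()

torch.manual_seed(0)
model = nn.Linear(4, 2)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

# reference: one full-batch backward pass
model.zero_grad()
loss_fn(model(x), y).backward()
ref_grad = model.weight.grad.clone()

# same data as two micro-batches of 4
accumulate_gradients(model, loss_fn, [(x[:4], y[:4]), (x[4:], y[4:])], accum_steps=2)
acc_grad = model.weight.grad.clone()
print(torch.allclose(ref_grad, acc_grad, atol=1e-6))  # gradients match
```

The division by `accum_steps` is what makes this equivalent to a mean-reduced loss over the full batch; forgetting it effectively multiplies the learning rate.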

For people who have access to TPUs:

Combine different word embeddings

How can you take advantage of different word embeddings in a text classification task? Please check my Kaggle post:

A subset of this field is called meta-embedding. Here is a list of papers:

I found that just taking the average of different embeddings is already powerful enough.
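Averaging is a simple meta-embedding: for each word, take the mean of its vectors across the available embedding tables (they must share a dimensionality, or be projected to one first). A toy sketch with made-up two-dimensional tables standing in for, say, GloVe and fastText:

```python
import numpy as np

# toy embedding tables from two hypothetical sources,
# already at a shared dimensionality
glove = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
fasttext = {"cat": np.array([0.0, 2.0]),
            "dog": np.array([2.0, 0.0]),
            "fox": np.array([1.0, 1.0])}

def avg_embedding(word, tables):
    """Average a word's vectors across whichever tables contain it."""
    vecs = [t[word] for t in tables if word in t]
    if not vecs:
        return None  # out-of-vocabulary in every table
    return np.mean(vecs, axis=0)

print(avg_embedding("cat", [glove, fasttext]))  # [0.5 1. ]
print(avg_embedding("fox", [glove, fasttext]))  # [1. 1.] (fastText only)
```

Averaging only over the tables that contain the word also gives you free out-of-vocabulary coverage from whichever embedding has the largest vocabulary.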

One thing to try is BERT: