Embedding-based Search Retrieval Papers and Blogs

Papers to Start with

Some papers I really love:

  • Zhang, Zhi, et al. “Bag of Freebies for Training Object Detection Neural Networks.” arXiv preprint arXiv:1902.04103 (2019).
  • He, Tong, et al. “Bag of Tricks for Image Classification with Convolutional Neural Networks.” arXiv preprint arXiv:1812.01187 (2018).
  • Howard, Jeremy, and Sebastian Ruder. “Universal language model fine-tuning for text classification.” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2018.
  • Smith, Leslie N. “A disciplined approach to neural network hyper-parameters: Part 1–learning rate, batch size, momentum, and weight decay.” arXiv preprint arXiv:1803.09820 (2018).
  • Chahal, Karanbir, Manraj Singh Grover, and Kuntal Dey. “A Hitchhiker’s Guide On Distributed Training of Deep Neural Networks.” arXiv preprint arXiv:1810.11787 (2018).
  • Neishi, Masato, et al. “A bag of useful tricks for practical neural machine translation: Embedding layer initialization and large batch size.” Proceedings of the 4th Workshop on Asian Translation (WAT2017). 2017.
  • Joulin, Armand, et al. “Bag of tricks for efficient text classification.” arXiv preprint arXiv:1607.01759 (2016).
  • Covington, Paul, Jay Adams, and Emre Sargin. “Deep neural networks for YouTube recommendations.” Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016.
  • He, Xinran, et al. “Practical lessons from predicting clicks on ads at Facebook.” Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. ACM, 2014.
  • McMahan, H. Brendan, et al. “Ad click prediction: a view from the trenches.” Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013.



Bloomberg trained an LLM, BloombergGPT, from scratch on AWS (64 × 8 A100 40GB GPUs for 53 days). They constructed a 363-billion-token dataset from Bloomberg’s extensive data sources, perhaps the largest domain-specific dataset yet, and augmented it with 345 billion tokens from general-purpose datasets.

BloombergGPT outperforms similarly sized open models on financial NLP tasks by significant margins, without sacrificing performance on general LLM benchmarks.
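To put the quoted numbers in perspective, here is a quick back-of-the-envelope calculation. It uses only the figures stated above; the GPU-days value is an upper bound that assumes every GPU was busy for the full 53 days, which is my simplification and not something the source states.

```python
# Figures quoted in the text above.
NODES = 64
GPUS_PER_NODE = 8
TRAINING_DAYS = 53
DOMAIN_TOKENS_B = 363   # billions of tokens from Bloomberg's own data
GENERAL_TOKENS_B = 345  # billions of tokens from general-purpose datasets

total_gpus = NODES * GPUS_PER_NODE
# Assumption: all GPUs are in use for the whole run, so this is an upper bound.
gpu_days = total_gpus * TRAINING_DAYS
total_tokens_b = DOMAIN_TOKENS_B + GENERAL_TOKENS_B
domain_share = DOMAIN_TOKENS_B / total_tokens_b

print(f"{total_gpus} GPUs, at most {gpu_days:,} GPU-days")
print(f"{total_tokens_b}B tokens in total, {domain_share:.1%} domain-specific")
```

So the training mix is roughly half financial text and half general text.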

Few-Shot Learning in NLP

Two recent papers on few-shot learning in NLP caught my eye: the first, on retrieval, is from Google Research; the second, on classification, is from Intel and Hugging Face.

Dai, Zhuyun, et al. “Promptagator: Few-shot Dense Retrieval From 8 Examples.” arXiv preprint arXiv:2209.11755 (2022).

we suggest to work on Few-shot Dense Retrieval, a setting where each task comes with a short description and a few examples. To amplify the power of a few examples, we propose Prompt-base Query Generation for Retriever (Promptagator), which leverages large language models (LLM) as a few-shot query generator, and creates task-specific retrievers based on the generated data.
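The core idea in the abstract, using a handful of examples to prompt an LLM into writing queries for unlabeled documents, can be sketched roughly as below. This is a minimal illustration under my own assumptions, not the paper's implementation: the prompt template, the function names, and the `llm` callable (any function that maps a prompt string to generated text) are all hypothetical, and the sketch leaves out any filtering of generated queries as well as the retriever training itself.

```python
from typing import Callable, List, Tuple


def build_fewshot_prompt(
    task_description: str,
    examples: List[Tuple[str, str]],  # (document, query) pairs
    new_document: str,
) -> str:
    """Assemble a few-shot prompt that ends where the LLM should write a query."""
    parts = [task_description]
    for doc, query in examples:
        parts.append(f"Document: {doc}\nQuery: {query}")
    parts.append(f"Document: {new_document}\nQuery:")
    return "\n\n".join(parts)


def generate_training_pairs(
    task_description: str,
    examples: List[Tuple[str, str]],
    corpus: List[str],
    llm: Callable[[str], str],  # hypothetical: prompt in, generated text out
) -> List[Tuple[str, str]]:
    """Create synthetic (query, document) pairs for training a retriever."""
    pairs = []
    for doc in corpus:
        prompt = build_fewshot_prompt(task_description, examples, doc)
        query = llm(prompt).strip()
        if query:  # skip empty generations
            pairs.append((query, doc))
    return pairs


# Stand-in for a real LLM so the sketch runs on its own.
def fake_llm(prompt: str) -> str:
    return " what does this document say? "


demo_examples = [("Paris is the capital of France.", "capital of France")]
demo_pairs = generate_training_pairs(
    "Write a search query that the document answers.",
    demo_examples,
    ["The Nile flows through Egypt."],
    fake_llm,
)
print(demo_pairs)
```

The resulting pairs would then serve as task-specific training data for a dense retriever.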

Tunstall, Lewis, et al. “Efficient Few-Shot Learning Without Prompts.” arXiv preprint arXiv:2209.11055 (2022).

we propose SetFit (Sentence Transformer Fine-tuning), an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers (ST). SetFit works by first fine-tuning a pretrained ST on a small number of text pairs, in a contrastive Siamese manner. 
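To make "a small number of text pairs, in a contrastive Siamese manner" more concrete, here is a rough sketch of how such pairs could be built from a few labeled examples. It is an illustration under my own assumptions, not the SetFit library's API: the function name is hypothetical, and it simply enumerates every pair, whereas a real setup would more likely sample them.

```python
import random
from itertools import combinations
from typing import List, Tuple


def make_contrastive_pairs(
    texts: List[str], labels: List[str], seed: int = 0
) -> List[Tuple[str, str, float]]:
    """Turn a few labeled texts into (text_a, text_b, target) pairs.

    Target is 1.0 when both texts share a label (should be close in
    embedding space) and 0.0 otherwise (should be far apart).
    """
    pairs = []
    for (t1, l1), (t2, l2) in combinations(list(zip(texts, labels)), 2):
        pairs.append((t1, t2, 1.0 if l1 == l2 else 0.0))
    random.Random(seed).shuffle(pairs)  # deterministic shuffle for the demo
    return pairs


demo_texts = ["great movie", "loved it", "terrible plot", "waste of time"]
demo_labels = ["pos", "pos", "neg", "neg"]
demo_pairs = make_contrastive_pairs(demo_texts, demo_labels)
print(len(demo_pairs), "pairs from", len(demo_texts), "labeled examples")
```

Four labeled examples already yield six training pairs, which is why this kind of pairing stretches a tiny labeled set a long way.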

Airbnb Search Papers

Grbovic, Mihajlo, and Haibin Cheng. “Real-time personalization using embeddings for search ranking at Airbnb.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018. [blog 1, blog 2]

Haldar, Malay, et al. “Applying deep learning to Airbnb search.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.

Haldar, Malay, et al. “Improving deep learning for Airbnb search.” Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.

Abdool, Mustafa, et al. “Managing diversity in Airbnb search.” Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.

Haldar, Malay, et al. “Learning To Rank Diversely At Airbnb.” Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 2023.

Tan, Chun How, et al. “Optimizing Airbnb Search Journey with Multi-task Learning.” Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023.

Dense Retriever for Salient Phrase

Zhang, Kai, et al. “LED: Lexicon-Enlightened Dense Retriever for Large-Scale Retrieval.” arXiv preprint arXiv:2208.13661 (2022).

Sciavolino, Christopher, et al. “Simple entity-centric questions challenge dense retrievers.” arXiv preprint arXiv:2109.08535 (2021).

Chen, Xilun, et al. “Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?.” arXiv preprint arXiv:2110.06918 (2021).

NER with small strongly labeled and large weakly labeled data

Having a small strongly labeled dataset alongside a large weakly labeled one is a very common situation in NLP and ASR modeling. The Amazon search team used the three-stage NEEDLE framework to take advantage of large weakly labeled data to improve NER. Their noise-aware loss function is interesting and worth a deep dive. Paper link: https://www.amazon.science/publications/named-entity-recognition-with-small-strongly-labeled-and-large-weakly-labeled-data
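The noise-aware loss is the part most worth sketching. The snippet below is a simplified illustration of the general idea as I understand it, not the paper's exact formulation: each weak label comes with an estimated confidence that it is correct, and the loss mixes the usual negative log-likelihood with a term that pushes probability away from the label when that confidence is low. The function name, its arguments, and the scalar per-token form are assumptions made for illustration.

```python
import math


def noise_aware_token_loss(
    p_weak_label: float, confidence: float, eps: float = 1e-12
) -> float:
    """Confidence-weighted loss for one token with a weak (possibly wrong) label.

    p_weak_label: model probability assigned to the weak label.
    confidence:   estimated probability that the weak label is correct (assumed given).
    """
    p = min(max(p_weak_label, eps), 1.0 - eps)  # clamp to keep log() finite
    trust_term = -math.log(p)           # standard NLL: fit the label
    distrust_term = -math.log(1.0 - p)  # push probability away from the label
    return confidence * trust_term + (1.0 - confidence) * distrust_term


# The same model prediction is penalized differently depending on label confidence.
for conf in (1.0, 0.5, 0.0):
    print(f"confidence={conf:.1f} loss={noise_aware_token_loss(0.9, conf):.4f}")
```

With full confidence this reduces to ordinary cross-entropy on the weak label; with zero confidence, agreeing with the weak label is actively penalized.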