BloombergGPT

https://www.bloomberg.com/company/press/bloomberggpt-50-billion-parameter-llm-tuned-finance/

Bloomberg trained an LLM from scratch on AWS (64 instances × 8 A100 40GB GPUs, for 53 days). They constructed a 363-billion-token dataset from Bloomberg’s extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general-purpose datasets.


BloombergGPT outperforms similarly sized open models on financial NLP tasks by significant margins, without sacrificing performance on general LLM benchmarks.

Few-Shot Learning in NLP

Two recent papers on few-shot learning in NLP caught my eye: the first on retrieval, from Google Research, and the second on classification, from Intel and Hugging Face.

Dai, Zhuyun, et al. “Promptagator: Few-shot Dense Retrieval From 8 Examples.” arXiv preprint arXiv:2209.11755 (2022).

we suggest to work on Few-shot Dense Retrieval, a setting where each task comes with a short description and a few examples. To amplify the power of a few examples, we propose Prompt-base Query Generation for Retriever (Promptagator), which leverages large language models (LLM) as a few-shot query generator, and creates task-specific retrievers based on the generated data.
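
The recipe is to prompt an LLM with a task description plus a handful of (document, query) exemplars, generate synthetic queries for unlabeled documents, and then train a task-specific dense retriever on the synthetic pairs. Below is a minimal sketch of just the query-generation step; the prompt wording, the exemplars, and the small flan-t5 checkpoint (the paper uses a 137B FLAN model) are all my own stand-ins, and the round-trip consistency filtering and retriever training are omitted.

```python
from transformers import pipeline

# Stand-in generator; the paper prompts a much larger FLAN model.
generator = pipeline("text2text-generation", model="google/flan-t5-base")

# A few (passage, query) exemplars for the target task (hypothetical examples).
few_shot_examples = [
    ("The Eiffel Tower was completed in 1889 for the World's Fair.",
     "when was the eiffel tower built"),
    ("Photosynthesis converts sunlight, water, and CO2 into glucose and oxygen.",
     "what does photosynthesis produce"),
]

def make_prompt(document: str) -> str:
    """Build a few-shot prompt: task description + exemplars + the new passage."""
    lines = ["Generate a search query that the following passage answers.", ""]
    for doc, query in few_shot_examples:
        lines += [f"Passage: {doc}", f"Query: {query}", ""]
    lines += [f"Passage: {document}", "Query:"]
    return "\n".join(lines)

def generate_queries(documents, num_queries=3):
    """Generate synthetic queries per document; the pairs would then train a retriever."""
    pairs = []
    for doc in documents:
        outputs = generator(make_prompt(doc), num_return_sequences=num_queries,
                            do_sample=True, max_new_tokens=32)
        pairs += [(out["generated_text"].strip(), doc) for out in outputs]
    return pairs

print(generate_queries(["Mount Kilimanjaro is the highest mountain in Africa."]))
```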

Tunstall, Lewis, et al. “Efficient Few-Shot Learning Without Prompts.” arXiv preprint arXiv:2209.11055 (2022).

we propose SetFit (Sentence Transformer Fine-tuning), an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers (ST). SetFit works by first fine-tuning a pretrained ST on a small number of text pairs, in a contrastive Siamese manner. 
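
Concretely, SetFit generates positive and negative sentence pairs from the few labeled examples, fine-tunes the Sentence Transformer body with a contrastive loss, and then fits a simple classification head on the resulting embeddings. Here is a minimal sketch in the style of the Hugging Face setfit library's quick-start; the dataset, checkpoint, and hyperparameters are illustrative choices, and the library's API may have evolved since the paper.

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer, sample_dataset

# Simulate the few-shot regime: 8 labeled examples per class.
dataset = load_dataset("sst2")
train_ds = sample_dataset(dataset["train"], label_column="label", num_samples=8)
eval_ds = dataset["validation"]

# Pretrained Sentence Transformer body + simple classification head.
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    loss_class=CosineSimilarityLoss,  # contrastive loss on generated sentence pairs
    num_iterations=20,                # pair-generation passes over the labeled set
    batch_size=16,
    column_mapping={"sentence": "text", "label": "label"},
)
trainer.train()
print(trainer.evaluate())
```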

Airbnb Search Papers

Grbovic, Mihajlo, and Haibin Cheng. “Real-time personalization using embeddings for search ranking at Airbnb.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018. [blog 1, blog 2]

Haldar, Malay, et al. “Applying deep learning to Airbnb search.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.

Haldar, Malay, et al. “Improving deep learning for Airbnb search.” Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.

Abdool, Mustafa, et al. “Managing diversity in Airbnb search.” Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020.

Haldar, Malay, et al. “Learning To Rank Diversely At Airbnb.” Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 2023.

Tan, Chun How, et al. “Optimizing Airbnb Search Journey with Multi-task Learning.” Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023.

Dense Retriever for Salient Phrase

Zhang, Kai, et al. “LED: Lexicon-Enlightened Dense Retriever for Large-Scale Retrieval.” arXiv preprint arXiv:2208.13661 (2022).

Sciavolino, Christopher, et al. “Simple entity-centric questions challenge dense retrievers.” arXiv preprint arXiv:2109.08535 (2021).

Chen, Xilun, et al. “Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?.” arXiv preprint arXiv:2110.06918 (2021).

Embedding-based Search Retrieval Papers and Blogs

NER with small strongly labeled and large weakly labeled data

A small amount of strongly labeled data alongside a large amount of weakly labeled data is a very common situation in NLP and ASR modeling. The Amazon Search team's three-stage NEEDLE framework takes advantage of the large weakly labeled set to improve NER. Their noise-aware loss function is interesting and worth a deep dive. Paper link: https://www.amazon.science/publications/named-entity-recognition-with-small-strongly-labeled-and-large-weakly-labeled-data
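
As I read it, the noise-aware idea is to treat weak labels as unreliable: each weakly labeled token carries an estimated confidence, and its cross-entropy contribution is down-weighted by that confidence while strongly labeled tokens keep full weight. A hedged PyTorch sketch of such a confidence-weighted token loss follows; this is my simplified reading rather than the paper's exact formulation, and the tensor names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def noise_aware_token_loss(logits, labels, is_weak, weak_confidence):
    """Confidence-weighted token-level cross entropy.

    logits:          (batch, seq_len, num_tags) model outputs
    labels:          (batch, seq_len) gold or weak tag ids
    is_weak:         (batch, seq_len) bool, True where the label is weakly supervised
    weak_confidence: (batch, seq_len) estimated probability the weak label is correct
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)

    # Strong labels keep weight 1.0; weak labels are down-weighted by confidence.
    weight = torch.where(is_weak, weak_confidence, torch.ones_like(weak_confidence))
    return (weight * per_token).mean()

# Toy usage with random tensors (shapes only, for illustration).
logits = torch.randn(2, 5, 9)
labels = torch.randint(0, 9, (2, 5))
is_weak = torch.tensor([[True] * 5, [False] * 5])
confidence = torch.full((2, 5), 0.7)
print(noise_aware_token_loss(logits, labels, is_weak, confidence))
```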