FastEval - What it is and how I wrote it

AI, Benchmarks, Evaluation

During my research, I often needed to evaluate different models in order to compare them with one another. Existing tools carried a computational overhead I could not afford, so I created FastEval, a tool whose sole purpose is to run benchmarks efficiently. In this article, I explain what it is, how I built it, and what its limitations are.

Given the purpose of this tool, I chose the fastest LLM inference backend I know of, vLLM, which can significantly speed up inference. In their paper Efficient Memory Management for Large Language Model Serving with PagedAttention [3], the creators of vLLM report a throughput improvement of two to four times over earlier serving systems at the same level of latency.

I wrote FastEval in Python, using vLLM, mentioned above, together with the datasets library from Hugging Face.
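
As a rough illustration of how these two libraries fit together, the sketch below loads GSM8K questions with datasets and generates answers in a single batched vLLM call. It is not FastEval's actual code: the prompt template, the helper names `build_prompt` and `generate_answers`, and the choice of the `openai/gsm8k` dataset id are assumptions made for this example, and the model name is left to the caller.

```python
from typing import List


def build_prompt(question: str) -> str:
    """Wrap a raw question in a minimal prompt (hypothetical template)."""
    return f"Question: {question.strip()}\nAnswer:"


def generate_answers(model_name: str, limit: int = 8) -> List[str]:
    """Generate greedy answers for the first `limit` GSM8K test questions."""
    # Imported lazily so the prompt helper can be used without a GPU stack.
    from datasets import load_dataset
    from vllm import LLM, SamplingParams

    rows = load_dataset("openai/gsm8k", "main", split="test").select(range(limit))
    prompts = [build_prompt(row["question"]) for row in rows]

    llm = LLM(model=model_name)
    # temperature=0.0 requests greedy decoding; max_tokens caps the answer length.
    params = SamplingParams(temperature=0.0, max_tokens=256)
    # All prompts are passed at once so that vLLM can batch them internally.
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

Passing the whole prompt list to one `generate` call, rather than looping over prompts, is what lets the backend schedule the requests together.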

Currently, FastEval supports two test execution modes:

onepass - the mode used, for example, in the MMLU benchmark [2]. It computes logits for the candidate answers and compares their probabilities.
fullgen - the generation mode used, for example, in the GSM8K benchmark [1]. It generates full responses and extracts the final predicted answer from the generated text.
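
To make the two modes more concrete, here is a minimal sketch of the scoring step in each one. It assumes the model's log-probabilities or generated text are already available, and it is an illustration rather than FastEval's implementation: the function names `pick_choice` and `extract_final_number` are hypothetical, and taking the last number in the text is only one simple extraction heuristic.

```python
import math
import re
from typing import Dict, Optional, Tuple


def pick_choice(choice_logprobs: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """onepass-style scoring: normalise the log-probabilities of the answer
    choices and return the most likely choice with the normalised values."""
    top = max(choice_logprobs.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {c: math.exp(lp - top) for c, lp in choice_logprobs.items()}
    total = sum(weights.values())
    probs = {c: w / total for c, w in weights.items()}
    return max(probs, key=probs.get), probs


# An optional sign, digits with optional thousands separators, optional decimals.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[str]:
    """fullgen-style scoring: return the last number in the generated text
    with commas removed, or None if the text contains no number."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")
```

For example, `pick_choice({"A": -2.0, "B": -0.5})` selects "B", and `extract_final_number("She sells 1,200 eggs, so the answer is 18.")` returns "18". GSM8K reference solutions end with a line of the form `#### 18`, so the same extractor can also be applied to the reference text.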

While FastEval currently focuses on these two benchmarks, its modular architecture is designed to allow for the seamless addition of new datasets and evaluation metrics.
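
One common way to structure this kind of extension point is a small registry that maps a benchmark name to its scoring function. The sketch below shows the pattern only; it does not reproduce FastEval's actual interface, and every name in it (`SCORERS`, `register`, `score_mmlu`, `score_gsm8k`, `accuracy`) is hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

# Maps a benchmark name to a function that scores one (prediction, reference) pair.
SCORERS: Dict[str, Callable[[str, str], bool]] = {}


def register(name: str):
    """Decorator that adds a scoring function to the registry under `name`."""
    def wrap(fn: Callable[[str, str], bool]) -> Callable[[str, str], bool]:
        SCORERS[name] = fn
        return fn
    return wrap


@register("mmlu")
def score_mmlu(prediction: str, reference: str) -> bool:
    # Multiple choice: compare the chosen letter, ignoring case and whitespace.
    return prediction.strip().upper() == reference.strip().upper()


@register("gsm8k")
def score_gsm8k(prediction: str, reference: str) -> bool:
    # Numeric answer: compare as floats so that "18" and "18.0" match.
    try:
        return float(prediction) == float(reference)
    except ValueError:
        return False


def accuracy(name: str, pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prediction, reference) pairs accepted by the named scorer."""
    scorer = SCORERS[name]
    results = [scorer(p, r) for p, r in pairs]
    return sum(results) / len(results) if results else 0.0
```

With this layout, adding a benchmark means writing one decorated function; the evaluation loop that calls `accuracy` does not need to change.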

The entire project is available on my GitHub; a link can be found in the “Code” tab. The repository is licensed under the BSD-3-Clause licence. Contributions that add further benchmarks or optimisations are welcome.

References

[1] Cobbe, Karl, Kosaraju, Vineet, Bavarian, Mohammad, Chen, Mark, Jun, Heewoo, Kaiser, Lukasz, Plappert, Matthias, Tworek, Jerry, Hilton, Jacob, Nakano, Reiichiro, Hesse, Christopher, and Schulman, John. "Training Verifiers to Solve Math Word Problems." 2021.
[2] Hendrycks, Dan, Burns, Collin, Basart, Steven, Zou, Andy, Mazeika, Mantas, Song, Dawn, and Steinhardt, Jacob. "Measuring Massive Multitask Language Understanding." 2021.
[3] Kwon, Woosuk, Li, Zhuohan, Zhuang, Siyuan, Sheng, Ying, Zheng, Lianmin, Yu, Cody Hao, Gonzalez, Joseph E., Zhang, Hao, and Stoica, Ion. "Efficient Memory Management for Large Language Model Serving with PagedAttention." 2023.
