Benchmarking AI Agents for Machine Learning Engineering

Oct 19, 2024

5 mins read

OpenAI recently published a blog post and research paper about MLE-bench, a new benchmark designed to evaluate AI agents on machine learning engineering tasks. This benchmark highlights the growing interest in developing AI agents capable of performing complex data science work. MLE-bench is an offline Kaggle competition environment built from 75 diverse, real-world Kaggle competitions. The best-performing agent evaluated was OpenAI's o1-preview with AIDE scaffolding, which achieved at least the level of a Kaggle bronze medal in 16.9% of competitions (see the paper for the exact definition of a bronze medal).

Real-world Data Science vs. Kaggle Competitions

While Kaggle provides a valuable platform for practicing and showcasing data science skills, there are significant differences between succeeding in Kaggle competitions and real-world data science work.

In a Kaggle competition, the problem is well-defined, the dataset is typically clean and well-documented, and there is a clear metric for optimization. Participants have a set amount of time to achieve the best score on the given metric, and their performance is ranked on a public leaderboard. This competitive environment encourages participants to focus on optimizing their models for the specific evaluation metric, sometimes at the expense of other important considerations, such as model interpretability or deployment feasibility.

Real-world data science projects, however, are rarely so straightforward. The problem definition may be ambiguous, the data can be messy and incomplete, and the goals of the project may evolve over time. Data scientists need to be able to work with stakeholders to define the problem, collect and clean data, explore different modeling approaches, and communicate their findings clearly and concisely. Furthermore, real-world data science projects often involve considerations not relevant in Kaggle competitions, such as data privacy, security, and ethical implications.

Despite these differences, Kaggle competitions can still be a valuable tool for data scientists. They provide a structured environment for practicing core skills, such as data cleaning, feature engineering, and model building. They can also help data scientists learn about new techniques and tools and connect with other members of the data science community.

How MLE-bench Uses Kaggle Competitions to Evaluate AI Agents

MLE-bench consists of 75 machine learning engineering tasks manually sourced from Kaggle to reflect a core set of day-to-day skills that ML engineers use. The competitions span a variety of problem categories, such as image classification, text classification, and time series forecasting, and each is annotated with a complexity level ranging from low to high. An experienced ML engineer could produce a sensible solution to a low-complexity competition in under two hours, excluding time spent training models.

How was MLE-bench built from Kaggle competitions?
  • Scraping descriptions: MLE-bench uses the descriptions from the "Overview" and "Data" tabs of each competition's website.

  • Using or creating new data splits: Whenever possible, MLE-bench uses the original datasets from the competitions. However, since Kaggle often doesn't release test sets, the researchers created new train and test splits for many competitions.

  • Implementing grading logic: Based on the evaluation metrics described in each competition's rules, MLE-bench implements grading logic to score submissions locally (a rough sketch of this flow appears after this list).

  • Comparing to human performance: To contextualize performance, MLE-bench uses snapshots of each competition's private leaderboard to rank agent submissions against human competitors. It also uses Kaggle's medal system, awarding bronze, silver, and gold medals to agents based on their performance.
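To make the grading and ranking steps concrete, here is a minimal sketch of what scoring a submission against a held-out test split and a private-leaderboard snapshot could look like. The file names, the use of accuracy as the metric, and the medal cutoffs are illustrative assumptions, not MLE-bench's actual implementation; in the real benchmark the metric comes from each competition's rules and the thresholds from Kaggle's medal system.

```python
import pandas as pd

def grade_submission(submission_csv, answers_csv, leaderboard_csv):
    """Illustrative sketch: score one submission locally and map it to a medal.

    Assumptions (for illustration only):
      - submission_csv:  columns [id, prediction], produced by the agent
      - answers_csv:     columns [id, label], the held-out test labels
      - leaderboard_csv: column [score], a private-leaderboard snapshot
      - the competition is graded by accuracy (higher is better)
    """
    sub = pd.read_csv(submission_csv)
    ans = pd.read_csv(answers_csv)
    merged = ans.merge(sub, on="id", how="left")  # align predictions to labels by id

    # Competition-specific metric; plain accuracy is a stand-in here.
    score = (merged["label"] == merged["prediction"]).mean()

    # Rank the score against the human leaderboard snapshot.
    human_scores = pd.read_csv(leaderboard_csv)["score"]
    rank = int((human_scores > score).sum()) + 1
    n = len(human_scores)

    # Simplified medal cutoffs; Kaggle's actual rules depend on field size.
    if rank <= max(1, int(0.10 * n)):
        medal = "gold"
    elif rank <= max(1, int(0.20 * n)):
        medal = "silver"
    elif rank <= max(1, int(0.40 * n)):
        medal = "bronze"
    else:
        medal = None
    return {"score": score, "rank": rank, "medal": medal}
```

The essential flow matches what the benchmark describes: compute the competition's metric locally, then place that score among the human competitors on the archived leaderboard to decide whether it would have earned a medal.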

MLE-bench is designed to be agnostic to the specific methods AI agents use to solve these competitions. The only requirement is that the agent produces a CSV file for submission and grading. To evaluate performance, MLE-bench focuses on the percentage of attempts that achieve any medal (bronze or higher).
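As a small illustration of that headline number, aggregating per-attempt grading results into a medal rate could look like the snippet below; the results structure is hypothetical and only stands in for whatever the evaluation harness records per run.

```python
# Hypothetical per-attempt grading results, e.g. from grade_submission() above:
# one entry per (agent, competition) attempt.
results = [
    {"competition": "comp-a", "medal": "bronze"},
    {"competition": "comp-b", "medal": None},
    {"competition": "comp-c", "medal": "silver"},
]

# Headline metric: fraction of attempts that earn any medal (bronze or higher).
any_medal_rate = sum(r["medal"] is not None for r in results) / len(results)
print(f"Attempts with any medal: {any_medal_rate:.1%}")
```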

This approach offers several advantages for evaluating AI agents:
  • Real-world relevance: The tasks in MLE-bench reflect real-world challenges faced by ML engineers, making the benchmark a more meaningful measure of progress in this domain.

  • Human comparison: The use of Kaggle leaderboards provides a direct comparison between agent and human performance, allowing researchers to gauge how well agents stack up against experienced data scientists.

  • Diversity and complexity: With 75 competitions spanning various problem categories and complexity levels, MLE-bench offers a diverse and challenging set of tasks for evaluating different aspects of ML engineering ability.


Benchmarking AI Agents for Machine Learning Engineering at Noga

Benchmarks such as MLE-bench offer a valuable starting point for evaluating the capabilities of AI agents in performing machine learning engineering tasks. However, it's important to remember that real-world machine learning applications extend beyond the structured environment of Kaggle competitions.

At Noga, we leverage our extensive experience as machine learning engineers and tech leaders at Google and renowned unicorns to build AI Data Scientists that can tackle the diverse use cases encountered in real-world scenarios. Some applications require developing and training complex models, while others involve analyzing data to extract meaningful insights. Additionally, many applications necessitate robust data preprocessing and ingestion pipelines.

By benchmarking our AI Data Scientist with MLE-bench and similar frameworks, we can assess their performance on well-defined tasks and use these insights to refine their ability to handle the complexities of real-world ML applications.

Stay Tuned!
